reading utf16 files?
-
- Posts: 2038
- Joined: Tue Nov 29, 2005 8:28 pm
- Location: latiitude 50N longitude 3W
- Contact:
Is it possible to read the contents of UTF-16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...
Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand those to 4-byte Unicode integers using 'unpack' with the "u" format, and then convert the result to UTF-8 using the newLISP 'utf8' function.
Lutz
ps: too busy at the moment on the GUI stuff, to give you a solution, remind me next week.
If converting the file would help you (to UTF-8 or to some national encoding), you can use the iconv() libc call.
Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.
Under *nixes other than Linux, the path to libc may need correcting.
Under Win* I have seen an iconv.dll somewhere, but it had slightly different function names.
WBR, Dmi
Yeah - I found a few MacOS X libraries, but they didn't seem to work:
libc.dylib
libiconv.dylib
libiconv.2.2.0.dylib
libiconv.2.dylib
libiconv.dylib
It seemed a bit easier to try this:
(exec "iconv -f -t " etc....)
But in the end, I used something else altogether, just to get it done. :-(
Thanks though...
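For anyone who wants to try the exec route, here is a minimal sketch. It assumes an iconv command-line tool is on the PATH, that you already know the source encoding, and that the file name needs no shell quoting. The helper name, the file name and the encoding in the usage line are made up for illustration.

```newlisp
; Hypothetical helper, not part of newLISP: convert a file to UTF-8
; by shelling out to the iconv command-line tool.
; Assumes iconv is installed and 'path' contains no spaces or quotes.
(define (iconv-file->utf8 path from-enc)
  ; exec returns the command's standard output as a list of lines,
  ; so the lines are joined again with newlines
  (join (exec (string "iconv -f " from-enc " -t UTF-8 " path)) "\n"))

; usage (placeholder file name and encoding):
; (iconv-file->utf8 "input.txt" "UTF-16LE")
```

Because exec splits the output into lines, a trailing newline in the file is not preserved by this sketch.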
This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.
And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?
Code: Select all
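(define (utf16->utf8 s)
  (join
    (map
      ; swap the two bytes of each pair, zero-pad to one 4-byte
      ; character plus a 4-byte terminator, then convert with utf8
      (fn (c)
        (utf8 (append (reverse c) "\000\000\000\000\000\000"))
      )
      (find-all ".." s)
    )
  )
)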
Edit: Much faster version (2x), and leaves it as unicode (for you to call utf8 if desired).
Code: Select all
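(define (utf16->utf32 s)
  (append
    (join
      (map
        ; (curry pack "u") ;identical speed
        (fn (c)
          (pack "u" c)
        )
        ; one ">u" per 2-byte unit: unpack s as big-endian 16-bit integers
        (unpack (dup ">u" (>> (length s) 1)) s)
      )
      ; two zero bytes between units widen each one to 4 bytes
      "\000\000"
    )
    ; pad the last unit to 4 bytes and add a 4-byte terminator
    "\000\000\000\000\000\000"
  )
)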
Note the ">u" may also be "<u", depending on whether it is BE or LE encoded.
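If the file starts with a byte order mark (BOM), the choice between ">u" and "<u" can be made automatically. The following is only a sketch under several assumptions: a UTF-8 enabled newLISP (so that 'char' turns a code point into its UTF-8 string), a file that exists and is not empty, and text without surrogate pairs. The function name is hypothetical, and files without a BOM are simply treated as little-endian here.

```newlisp
; Hypothetical helper: read a UTF-16 file and return UTF-8 text,
; picking the unpack format from the BOM.
; Assumptions: UTF-8 enabled newLISP, non-empty file, no surrogate pairs.
(define (utf16-file->utf8 path)
  (letn (buf   (read-file path)
         bom   (unpack "b b" buf)                   ; first two bytes as integers
         fmt   (if (= bom '(254 255)) ">u" "<u")    ; FE FF means big-endian
         codes (unpack (dup fmt (>> (length buf) 1)) buf))
    ; drop the BOM code point itself if the file had one
    (if (= (first codes) 0xFEFF) (pop codes))
    ; char maps each code point to its UTF-8 string
    (join (map char codes))))
```

Unlike the versions above, this goes straight from 16-bit code units to UTF-8 with 'char' instead of building a 4-byte buffer for 'utf8'. I have not timed it against them.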
Last edited by m35 on Wed Jun 06, 2007 2:33 pm, edited 3 times in total.
Thanks Lutz, I gave that a try but still didn't have any luck.
I have a file with the path
"F:\test\梶浦由記\file.txt"
I ran the following (in the "test" directory) with the following result (note: newlispw = UTF-8 enabled newLISP).
Code: Select all
F:\test>newlispw -e "(directory)"
("." ".." "????")
Hoping it's just a console limitation, I also ran this
Code: Select all
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4
Opening the "dir.txt" file, again all I see is ????
Finally, trying to read the file
Code: Select all
F:\test>newlispw -e "(read-file {????\file.txt})"
nil
F:\test>newlispw -e "(open {????\file.txt} {r})"
nil
It works on MacOS X:
I don't know what's different on Win32.
Lutz
Code: Select all
newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
... the point is not to 'print' the string, but to get the unprinted string to work with. In its return values newLISP shows the raw string, with the UTF-8 bytes written as numbers. I guess if you do exactly the same thing I did on MacOS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.
Lutz
Wow, things look interesting with that same code on Windows.
I am left with a file named 梶浦由記 in the directory.
Code: Select all
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記"梶浦由記"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "梶浦由記")
> !dir
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\test2
06/04/2007 05:45 PM <DIR> .
06/04/2007 05:45 PM <DIR> ..
06/04/2007 05:45 PM 13 梶æµ▌ç"±è"~
1 File(s) 13 bytes
2 Dir(s) 12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
Code: Select all
梶浦由記
Windows and UTF-8
Lutz wrote:
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.
What you would need on Windows is a cmd.exe which does UTF-8
Lutz
Actually, for Win2k and above, to set the command line to UTF-8 you simply have to set the code page with the following command, chcp 65001, prior to the execution of your command. The only caveat is: make sure the command prompt's properties are not set to Raster Fonts.
jp wrote:
set the code page with the following command, chcp 65001
Thanks jp! I wasn't aware of that one.
Now here is that same process after changing the code page.
Code: Select all
F:\temp>chcp 65001
Active code page: 65001
...
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\temp
[.] [..] 梶浦由記
1 File(s) 13 bytes
2 Dir(s) 12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
The behavior of that (directory) entry is interesting...
Code: Select all
> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's "")\r\n\r\n"
> (print s)
梶浦由記""
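The length of 12 in the session above is a byte count: the name has four characters, and each one takes three bytes in UTF-8. A small sketch of the difference, assuming a UTF-8 enabled newLISP (utf8len is only available in those builds):

```newlisp
; The same file name as in the sessions above, written as raw UTF-8 bytes.
(set 's "\230\162\182\230\181\166\231\148\177\232\168\152")

(length s)    ; counts bytes: 12
(utf8len s)   ; counts UTF-8 characters: 4
```

So the (directory) entry does hold all 12 bytes, even though the console shows the return value as an empty string.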
Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.
Windows and UTF-8
m35 wrote:
Unfortunately I'm still left with the 梶浦由記 file, and not the proper Unicode one.
Perhaps it is worth mentioning that for Win2k and above the internal representation is Unicode UTF-16LE, and while one can arbitrarily change the DOS code page, in Windows proper the internal character representation remains fixed.
Also, the name 梶浦由記 strikes me as a Japanese name (Kajiura Yuki) rather than a Chinese one. Nonetheless, Windows will need to have its Chinese/Japanese fonts enabled in order to render those characters properly.
"And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary."
Speaking about good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled California... ;-) No offence btw... it just caught my eye too and I did not know there was perhaps a reason for it..
-- (define? (Cornflakes))
"Speaking about good eyes?? That must be a secret hint.."
Well, there is nothing too esoteric about it!
Japanese has no phonetic equivalent to the L and R consonants, but has a consonant that sits somewhere between those two sounds. Hence, even when they have known all the common place names perfectly well since childhood, the lack of that phonetic distinction often leaves Japanese speakers at a loss when writing down in English names containing L and R that they know with certainty in Japanese.