reading utf16 files?
Posted: Mon May 28, 2007 9:56 pm
by cormullion
is it possible to read the contents of UTF16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...
Posted: Mon May 28, 2007 11:28 pm
by Lutz
Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand those to 4-byte Unicode integers using 'unpack' with the "u" format, and then convert them to UTF-8 using the newLISP 'utf8' function.
Lutz
ps: too busy at the moment on the GUI stuff, to give you a solution, remind me next week.
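The recipe Lutz describes (2-byte pieces, then Unicode integers, then UTF-8) can be sketched outside newLISP. This is a minimal Python sketch rather than newLISP code; the function name utf16_to_utf8 and the little_endian flag are made up for illustration, and it assumes the data has no byte order mark at the start.

```python
import struct


def utf16_to_utf8(data: bytes, little_endian: bool = True) -> bytes:
    """Convert raw UTF-16 bytes (ASSUMPTION: no BOM) to UTF-8 by hand."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Step 1: read the data as 2-byte pieces, giving one integer per piece.
    fmt = ("<" if little_endian else ">") + "H" * (len(data) // 2)
    units = struct.unpack(fmt, data)
    chars = []
    i = 0
    while i < len(units):
        code = units[i]
        # Step 2: a high surrogate followed by a low surrogate encodes
        # a single code point above U+FFFF.
        if (0xD800 <= code <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            code = 0x10000 + ((code - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        chars.append(chr(code))
        i += 1
    # Step 3: encode the code points as UTF-8.
    # An unpaired surrogate in the input raises UnicodeEncodeError here.
    return "".join(chars).encode("utf-8")
```

A call such as utf16_to_utf8(raw, little_endian=False) would handle big-endian input. In practice Python's built-in bytes.decode("utf-16") does the same job; the loop is spelled out only to show the steps.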
Posted: Tue May 29, 2007 5:52 am
by Dmi
If converting (to UTF-8 or to some national encoding) would help you, you can use the iconv() libc call.
Look at
http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.
On Unix systems other than Linux, the path to libc may need correcting.
On Windows I have seen an iconv.dll somewhere, but its function names were slightly different.
Posted: Tue May 29, 2007 6:54 am
by cormullion
thanks guys, i'll check these ideas out...!
Posted: Wed May 30, 2007 8:34 pm
by cormullion
Yeah - I found a few MacOS X libraries, but they didn't seem to work:
libc.dylib
libiconv.dylib
libiconv.2.2.0.dylib
libiconv.2.dylib
libiconv.dylib
It seemed a bit easier to try this:
(exec "iconv -f -t " etc....)
But in the end, I used something else altogether, just to get it done. :-(
Thanks though...
Posted: Wed May 30, 2007 9:07 pm
by Dmi
What does "man 3 iconv" show about linking and about the function specifications?
Using the "iconv" shell command is not a good idea because it doesn't handle invalid characters: it just stops processing immediately.
On Linux I use "recode -f" for that.
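The difference Dmi points at, stopping at the first bad sequence versus carrying on, can be illustrated with Python's decoder. This is only a sketch of the two behaviours, not a description of how iconv or recode work internally; the function name decode_utf16 and its lenient flag are made up, and little-endian input without a BOM is assumed.

```python
def decode_utf16(data: bytes, lenient: bool = True) -> str:
    """Decode UTF-16 bytes (ASSUMPTION: little-endian, no BOM)."""
    # "strict" raises UnicodeDecodeError at the first invalid sequence;
    # "replace" substitutes U+FFFD for it and keeps going.
    return data.decode("utf-16-le", errors="replace" if lenient else "strict")


# A lone low surrogate (0xDC00) between "a" and "b" is invalid UTF-16.
damaged = "a".encode("utf-16-le") + b"\x00\xdc" + "b".encode("utf-16-le")
text = decode_utf16(damaged)
```

With lenient=False the same input raises UnicodeDecodeError instead of returning a string.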
Posted: Mon Jun 04, 2007 11:09 pm
by m35
This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.
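Code:
; Convert a UTF-16 string to UTF-8, one 2-byte unit at a time.
(define (utf16->utf8 s)
  (join
    (map
      (fn (c)
        ; swap the two bytes, then pad with nulls so utf8 sees a terminated 4-byte value
        (utf8 (append (reverse c) "\000\000\000\000\000\000"))
      )
      ; split the input into 2-byte pieces
      (find-all ".." s)
    )
  )
)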
And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?
Edit: Much faster version (2x), and leaves it as unicode (for you to call utf8 if desired).
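Code:
; Expand UTF-16 units to 4-byte values; call utf8 on the result if desired.
(define (utf16->utf32 s)
  (append
    (join
      (map
        ; (curry pack "u") ;identical speed
        (fn (c)
          (pack "u" c)
        )
        ; one 16-bit integer per 2-byte unit of the input
        (unpack (dup ">u" (>> (length s) 1)) s)
      )
      ; two null bytes between units widen each one to 4 bytes
      "\000\000"
    )
    ; pad the last unit and add a 4-byte terminator
    "\000\000\000\000\000\000"
  )
)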
Edit: Note that the ">u" may also need to be "<u", depending on whether the data is big-endian (BE) or little-endian (LE) encoded.
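Whether the data is BE or LE can often be read from a byte order mark at the start of the file, if one is present. A minimal Python sketch of that check follows; the function name detect_utf16_byte_order is made up for illustration, and a file without a BOM still has to be guessed or known in advance.

```python
import codecs


def detect_utf16_byte_order(data: bytes):
    """Return (encoding name, data without the BOM), or (None, data) if no BOM."""
    # Caveat: the UTF-32 LE BOM also begins with FF FE, so this check
    # assumes the input is already known to be UTF-16.
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[2:]
    return None, data
```

The returned encoding name can be passed straight to bytes.decode on the remaining data.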
Posted: Mon Jun 04, 2007 11:32 pm
by Lutz
If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.
Lutz
Posted: Tue Jun 05, 2007 12:10 am
by m35
Thanks Lutz, I gave that a try but still didn't have any luck.
I have a file with the path
"F:\test\梶浦由記\file.txt"
I run the following (in the "test" directory) with the following result.
Code:
F:\test>newlispw -e "(directory)"
("." ".." "????")
(note: newlispw = UTF8 enabled newlisp)
Hoping it's just a console limitation, I also run this
Code:
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4
Opening the "dir.txt" file, again all I see is ????
Finally, trying to read the file
Code:
F:\test>newlispw -e "(read-file {????\file.txt})"
nil
F:\test>newlispw -e "(open {????\file.txt} {r})"
nil
Posted: Tue Jun 05, 2007 12:33 am
by Lutz
It works on MacOS X:
Code:
newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
I don't know what's different on Win32.
Lutz
Posted: Tue Jun 05, 2007 12:35 am
by Lutz
... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they got ???? (in the first 'print' statement)
Lutz
Posted: Tue Jun 05, 2007 12:43 am
by Lutz
... the thing is not to 'print', but to get the unprinted string to work with. In newLISP you see the raw string, where UTF-8 is shown with numbers in the return values. I guess if you do exactly the same thing I did in MacOS X, it will work for you too on Win32. What it proves is that both OSs seem to encode filenames in UTF-8.
Lutz
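The numbers in those return values are decimal byte values, and for the name in this thread they are the UTF-8 encoding of the four characters. A small Python check, with variable names chosen only for illustration:

```python
# The decimal escapes from the transcript, one integer per byte.
escaped = [230, 162, 182, 230, 181, 166, 231, 148, 177, 232, 168, 152]

raw = bytes(escaped)
name = raw.decode("utf-8")   # interpret the 12 bytes as UTF-8

# Each of the four characters takes three bytes in UTF-8.
bytes_per_char = len(raw) // len(name)
```

Decoding gives back 梶浦由記, so the string handed to write-file is 12 bytes long even though it names four characters.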
Posted: Tue Jun 05, 2007 1:05 am
by m35
Wow, things look interesting with that same code on windows
Code:
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記"梶浦由記"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "梶浦由記")
> !dir
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\test2
06/04/2007 05:45 PM <DIR> .
06/04/2007 05:45 PM <DIR> ..
06/04/2007 05:45 PM 13 梶æµ▌ç"±è"~
1 File(s) 13 bytes
2 Dir(s) 12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
I am left with a file named
in the directory.
Posted: Tue Jun 05, 2007 2:11 am
by Lutz
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.
What you would need on Windows is a cmd.exe which does UTF-8
Lutz
Windows and UTF-8
Posted: Wed Jun 06, 2007 4:05 am
by jp
Lutz wrote: I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.
What you would need on Windows is a cmd.exe which does UTF-8
Lutz
Actually, on Win2k and above, to set the command line to UTF-8 you simply have to set the code page with the command chcp 65001 before running your command. The only caveat: make sure the command prompt's font is not set to Raster Fonts.
Posted: Wed Jun 06, 2007 6:15 pm
by m35
jp wrote: set the code page with the following command, chcp 65001
Thanks jp! I wasn't aware of that one.
Now here is that same process after changing the code page.
Code:
F:\temp>chcp 65001
Active code page: 65001
...
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\temp
[.] [..] 梶浦由記
1 File(s) 13 bytes
2 Dir(s) 12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
Note that the 梶浦由記 appear as rectangles in the console (but I assume that's just because the Lucida Console font doesn't have those characters).
The behavior of that (directory) entry is interesting...
Code:
> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's "")\r\n\r\n"
> (print s)
梶浦由記""
Unfortunately I'm still left with the 梶浦由記 file, and not the proper Unicode one.
Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.
Windows and UTF-8
Posted: Thu Jun 07, 2007 12:49 am
by jp
m35 wrote: Unfortunately I'm still left with the 梶浦由記 file, and not the proper Unicode one.
Perhaps it is worth mentioning that on Win2k and above the internal representation is Unicode UTF-16LE, and although one can change the DOS code page arbitrarily, in Windows proper the internal character representation remains fixed.
Also, the name 梶浦由記 strikes me more as a Japanese name (Kajiura Yuki) than a Chinese one. Nonetheless, Windows will need to have its Chinese/Japanese fonts enabled in order to render those characters properly.
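One plausible reading of the stray file name, offered as an assumption rather than something established in this thread: the UTF-8 bytes were handed to an API that interpreted them in the system's ANSI code page and then stored the result as UTF-16. The Python sketch below assumes code page 1252, which may not match m35's machine.

```python
name = "梶浦由記"

# What newLISP passes to the OS: 12 bytes of UTF-8.
utf8_bytes = name.encode("utf-8")

# What Windows would store for the real name: 4 UTF-16LE code units, 8 bytes.
utf16_bytes = name.encode("utf-16-le")

# ASSUMPTION: the 12 bytes are read as code page 1252, one character per byte.
misread = utf8_bytes.decode("cp1252")

# The misread name is itself stored as UTF-16LE, giving 12 code units.
stored = misread.encode("utf-16-le")
```

Under that assumption the file gets a 12-character name of accented Latin letters and symbols instead of the 4-character one, which would fit the 12 reported by (length s) and the garbled entry in the dir listing.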
Posted: Thu Jun 07, 2007 12:06 pm
by m35
jp wrote: Also the name 梶浦由記 strikes me more as being a Japanese name (Kajiura Yuki) rather than a Chinese.
Good eye jp. Read Japanese? Other languages?
ps I'm a big fan of Yuki Kajiura's work :)
Posted: Thu Jun 07, 2007 11:54 pm
by jp
m35 wrote: Good eye jp. Read Japanese? Other languages?
Pleased to oblige!
Yes indeed, I read Japanese. And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.
Posted: Fri Jun 08, 2007 3:23 am
by m35
ご免なさい I know only a little Japanese because I work with Japanese people (and like あにめ ^_^). The カリフォニア typo is part 日本語 accent, and part Arnold Schwarzenegger accent (´∀`)
Posted: Fri Jun 08, 2007 7:12 pm
by newdep
And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.
Speaking about good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled california... ;-) No offence btw... it just caught my eye too and I did not know there was perhaps a reason for it..
Posted: Sat Jun 09, 2007 3:26 am
by jp
Speaking about good eyes?? That must be a secret hint..
Well, there is nothing too esoteric about it!
Japanese has no phonetic equivalent to the L and R consonants but has a consonant that sits somewhere between those two sounds. Hence, even though they have known all the common place names perfectly well since childhood, the lack of that phonetic distinction often leaves Japanese speakers at a loss when writing down, in English, names containing L and R that they know with certainty in Japanese.