reading utf16 files?

cormullion · Post by **cormullion** » Mon May 28, 2007 9:56 pm

is it possible to read the contents of UTF16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...

Lutz · Post by **Lutz** » Mon May 28, 2007 11:28 pm

Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand each one to a 4-byte Unicode integer using 'unpack' with the "u" format, and then convert it to UTF-8 using the newLISP 'utf8' function.

Lutz

ps: too busy at the moment on the GUI stuff, to give you a solution, remind me next week.
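For readers who want to see the recipe Lutz describes spelled out, here is a minimal sketch in Python rather than newLISP. The function name `utf16_to_utf8` and its `big_endian` parameter are made up for illustration. It assumes the input holds only Basic Multilingual Plane characters (no surrogate pairs) and no byte order mark; a surrogate would make the final encode step fail.

```python
import struct

def utf16_to_utf8(data: bytes, big_endian: bool = False) -> bytes:
    """Convert BMP-only UTF-16 bytes to UTF-8, one 2-byte unit at a time."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Read the data in 2-byte pieces as unsigned 16-bit integers.
    fmt = (">" if big_endian else "<") + "H" * (len(data) // 2)
    units = struct.unpack(fmt, data)
    # Treat each integer as a Unicode code point, then encode as UTF-8.
    return "".join(chr(u) for u in units).encode("utf-8")

sample = "h\u00e9llo".encode("utf-16-le")
print(utf16_to_utf8(sample))  # b'h\xc3\xa9llo'
```

In practice Python's built-in `bytes.decode("utf-16")` does all of this in one call; the long form is only meant to mirror the unpack-then-convert steps.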

Dmi · Post by **Dmi** » Tue May 29, 2007 5:52 am

If a conversion (to UTF-8 or to some national encoding) would help you, you can use the iconv() libc call.
Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.
On *nixes other than Linux, the path to libc may need correcting.
On Win* I have seen an iconv.dll somewhere, but its function names were slightly different.

cormullion · Post by **cormullion** » Tue May 29, 2007 6:54 am

thanks guys, i'll check these ideas out...!

cormullion · Post by **cormullion** » Wed May 30, 2007 8:34 pm

Yeah - I found a few MacOS X libraries, but they didn't seem to work:

libc.dylib
libiconv.dylib
libiconv.2.2.0.dylib
libiconv.2.dylib
libiconv.dylib

It seemed a bit easier to try this:

(exec "iconv -f -t " etc....)

But in the end, I used something else altogether, just to get it done. :-(

Thanks though...

Dmi · Post by **Dmi** » Wed May 30, 2007 9:07 pm

What does "man 3 iconv" show about linking and about the function specifications?

Using the "iconv" shell command is not a good idea because it doesn't handle invalid characters: it just stops processing immediately.

On Linux I use "recode -f" for that.
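The difference between a converter that stops at the first invalid character and one that pushes through can be illustrated outside of iconv and recode. The sketch below uses Python's own codec error handlers, not either of those tools, and the helper names `decode_strict` and `decode_lenient` are invented for the example. The sample data contains a lone low surrogate, which is not valid UTF-16.

```python
def decode_strict(data: bytes, encoding: str = "utf-16-le") -> str:
    """Stop at the first invalid sequence by raising UnicodeDecodeError."""
    return data.decode(encoding)

def decode_lenient(data: bytes, encoding: str = "utf-16-le") -> str:
    """Substitute U+FFFD for invalid sequences and keep going."""
    return data.decode(encoding, errors="replace")

# "o", then a lone low surrogate (0xDC00), then "k"
bad = "o".encode("utf-16-le") + bytes([0x00, 0xDC]) + "k".encode("utf-16-le")

try:
    decode_strict(bad)
except UnicodeDecodeError as err:
    print("strict decoding stopped:", err.reason)

print(ascii(decode_lenient(bad)))
```

Whether silently substituting characters is acceptable depends on the data; for a one-off conversion it is often better than losing the rest of the file.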

m35 · Post by **m35** » Mon Jun 04, 2007 11:09 pm

This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.

(define (utf16->utf8 s)
	(join
		(map
			(fn (c)
				(utf8 (append (reverse c) "\000\000\000\000\000\000"))
			)
			(find-all ".." s)
		)
	)
)

And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?

Edit: Much faster version (2x), and it leaves the result as 4-byte Unicode (UTF-32) for you to pass to utf8 if desired.

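(define (utf16->utf32 s)
	; assumes BMP-only input (no surrogate pairs) and a little-endian host
	(append 
		(join 
			(map 
				; (curry pack "u") ;identical speed 
				(fn (c) 
					(pack "u" c)
				)
				; one 16-bit big-endian integer per 2 input bytes
				(unpack (dup ">u" (>> (length s) 1)) s)
			)
			; two zero bytes widen each 16-bit unit to 32 bits
			"\000\000"
		)
		; widen the last unit and add a 4-byte terminator
		"\000\000\000\000\000\000"
	)
)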

Edit:
Note the ">u" may also be "<u", depending on whether it is BE or LE encoded.
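Choosing between ">u" and "<u" can often be automated, because many UTF-16 files start with a byte order mark (BOM). The following Python sketch shows one way to check for it; the function name `detect_utf16_order` and the little-endian fallback are assumptions made for this example, and a file without a BOM still has to be guessed or known in advance.

```python
import codecs
from typing import Tuple

def detect_utf16_order(data: bytes, default: str = "utf-16-le") -> Tuple[str, bytes]:
    """Return (codec name, data without BOM) based on a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[2:]
    # No BOM: fall back to the caller's guess.
    return default, data

# Big-endian BOM followed by "A" and "B" as big-endian 16-bit units
raw = bytes([0xFE, 0xFF, 0x00, 0x41, 0x00, 0x42])
encoding, payload = detect_utf16_order(raw)
print(encoding, payload.decode(encoding))  # utf-16-be AB
```

Unlike the 2-byte-at-a-time functions above, Python's UTF-16 codecs also combine surrogate pairs, so characters outside the Basic Multilingual Plane survive the conversion.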

Lutz · Post by **Lutz** » Mon Jun 04, 2007 11:32 pm

If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.

Lutz

m35 · Post by **m35** » Tue Jun 05, 2007 12:10 am

Thanks Lutz, I gave that a try but still didn't have any luck.

I have a file with the path
"F:\test\梶浦由記\file.txt"

I run the following (in the "test" directory) with the following result.

F:\test>newlispw -e "(directory)"
("." ".." "????")

(note: newlispw = UTF8 enabled newlisp)

Hoping it's just a console limitation, I also run this

F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4

Opening the "dir.txt" file, again all I see is ????

Finally, trying to read the file

F:\test>newlispw -e "(read-file {????\file.txt})"
nil

F:\test>newlispw -e "(open {????\file.txt} {r})"
nil

Lutz · Post by **Lutz** » Tue Jun 05, 2007 12:33 am

It works on MacOS X:

newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

I don't know what's different on Win32.

Lutz

Lutz · Post by **Lutz** » Tue Jun 05, 2007 12:35 am

... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they got ???? (in the first 'print' statement)

Lutz

Lutz · Post by **Lutz** » Tue Jun 05, 2007 12:43 am

... the thing is not to 'print', but to get the unprinted string to work with. In newLISP you see the raw string, where UTF-8 is shown with numbers in the return values. I guess if you do exactly the same thing I did on MacOS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.

Lutz

m35 · Post by **m35** » Tue Jun 05, 2007 1:05 am

Wow, things look interesting with that same code on windows

newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
µó╢µ╡ªτö▒Φ¿ÿ"µó╢µ╡ªτö▒Φ¿ÿ"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "µó╢µ╡ªτö▒Φ¿ÿ")
> !dir
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\test2

06/04/2007  05:45 PM    <DIR>          .
06/04/2007  05:45 PM    <DIR>          ..
06/04/2007  05:45 PM                13 æ¢¶æµ▌ç"±è"~
               1 File(s)             13 bytes
               2 Dir(s)      12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

I am left with a file named

æ¢¶æµ¦ç”±è¨˜

in the directory.
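The odd names in this session are consistent with the same twelve UTF-8 bytes being read through different single-byte code pages. The Python sketch below reproduces that; `misread` is a made-up helper, and the choice of cp437 for the console and cp1252 for the file name is an inference from the output shown, not something stated in the thread.

```python
def misread(text: str, codepage: str) -> str:
    """Encode text as UTF-8, then decode those bytes with a legacy code page."""
    return text.encode("utf-8").decode(codepage)

# The directory name from the thread, written as code points
name = "\u68b6\u6d66\u7531\u8a18"

# Four characters at three UTF-8 bytes each
print(len(name.encode("utf-8")))                    # 12

# Read as Windows-1252: the file name left in the directory
print(misread(name, "cp1252") == "æ¢¶æµ¦ç”±è¨˜")     # True

# Read as the OEM code page 437: what the console printed
print(misread(name, "cp437") == "µó╢µ╡ªτö▒Φ¿ÿ")      # True
```

If that reading is right, the file system never received the four intended characters; it stored twelve separate characters, one per byte.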

Lutz · Post by **Lutz** » Tue Jun 05, 2007 2:11 am

I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.

What you would need on Windows is a cmd.exe which does UTF-8

Lutz

jp · Post by jp » Wed Jun 06, 2007 4:05 am

Lutz wrote:I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.

What you would need on Windows is a cmd.exe which does UTF-8

Lutz

Actually, on Win2k and above, to set the command line to UTF-8 you simply have to set the code page with the command chcp 65001 before running your command. The only caveat is: make sure the command prompt's properties are not set to Raster Fonts.

m35 · Post by **m35** » Wed Jun 06, 2007 6:15 pm

jp wrote:set the code page with the following command, chcp 65001

Thanks jp! I wasn't aware of that one.

Now here is that same process after changing the code page.

F:\temp>chcp 65001
Active code page: 65001

...

newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\temp

[.]            [..]           æ¢¶æµ¦ç”±è¨˜
               1 File(s)             13 bytes
               2 Dir(s)      12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

Note that the 梶浦由記 appear as rectangles in the console (but I assume that's just because the Lucida Console font doesn't have those characters).

The behavior of that (directory) entry is interesting...

> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's "")\r\n\r\n"
> (print s)
梶浦由記""

Unfortunately I'm still left with the æ¢¶æµ¦ç”±è¨˜ file, and not the proper Unicode one.

Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.

jp · Post by jp » Thu Jun 07, 2007 12:49 am

m35 wrote:Unfortunately I'm still left with the æ¢¶æµ¦ç”±è¨˜ file, and not the proper Unicode one.

Perhaps it is worth mentioning that on Win2k and above the internal representation is Unicode UTF-16LE. While the DOS code page can be changed arbitrarily, the internal character representation in Windows proper remains fixed.
Also, the name 梶浦由記 strikes me more as a Japanese name (Kajiura Yuki) than a Chinese one. Nonetheless, Windows will need to have its Chinese/Japanese fonts enabled in order to render those characters properly.
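Given the UTF-16LE point above, a path held as UTF-8 bytes has to be re-encoded before it can be handed to a wide-character Win32 function such as CreateFileW. The Python sketch below shows only that byte-level conversion; `utf8_path_to_wide` is a hypothetical helper name, and the thread does not say which API calls m35 actually wrapped.

```python
def utf8_path_to_wide(path_utf8: bytes) -> bytes:
    """Re-encode a UTF-8 path as NUL-terminated UTF-16LE bytes."""
    return path_utf8.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

# The twelve UTF-8 bytes used as the file name in the thread
raw = b"\xe6\xa2\xb6\xe6\xb5\xa6\xe7\x94\xb1\xe8\xa8\x98"
wide = utf8_path_to_wide(raw)
print(wide.hex())  # b668666d3175188a0000
```

The result is four 16-bit units plus a 16-bit terminator, which is the form in which the name 梶浦由記 would need to reach the file system.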

m35 · Post by **m35** » Thu Jun 07, 2007 12:06 pm

jp wrote:Also the name 梶浦由記 strikes me more as being a Japanese name (Kajiura Yuki) rather than a Chinese.

Good eye jp. Read Japanese? Other languages?

ps I'm a big fan of Yuki Kajiura's work :)

jp · Post by jp » Thu Jun 07, 2007 11:54 pm

m35 wrote:Good eye jp. Read Japanese? Other languages?

Pleased to oblige!
Yes indeed, I read Japanese. And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.

m35 · Post by **m35** » Fri Jun 08, 2007 3:23 am

ご免なさい I know only a little Japanese because I work with Japanese people (and like あにめ ^_^). The カリフォニア typo is part 日本語 accent, and part Arnold Schwarzenegger accent （´∀｀）

newdep · Post by **newdep** » Fri Jun 08, 2007 7:12 pm

And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.

Speaking about good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled California... ;-) No offence btw... it just caught my eye too and I did not know there was perhaps a reason for it..

jp · Post by jp » Sat Jun 09, 2007 3:26 am

Speaking about good eyes?? That must be a secret hint..

Well, there is nothing too esoteric about it!
Japanese has no phonetic equivalent to the L and R consonants, but it has a consonant that sits somewhere between those two sounds. Hence, even when they have known a common place name perfectly well since childhood, Japanese speakers are often at a loss to write down in English the names containing L and R that they know with certainty in Japanese.

newlispfanclub.alh.net
