text-convert to UTF8 with standard newLISP?

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

Maybe someone ran across this problem before:

Can I use standard newLISP to convert a text-file to a UTF8-file?

http://en.wikipedia.org/wiki/UTF8
Hans-Peter

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia

Post by Dmi »

On Linux (and probably other *nixes with an iconv() function in the standard library) you can use the ICONV context that I posted here just yesterday (topic iconv/recode):

Code:

(load "iconv.lsp")
(write-file "text.utf-8"
  (ICONV:recode-once "YOUR-ENCODING" "UTF-8"
                     (read-file "text.plain")))
WBR, Dmi

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

Thanks for the hint, but I am looking for a Windows solution.
Will do some further research.
Hans-Peter

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia

Post by Dmi »

I'm interested in a way for Windows too :-)
WBR, Dmi

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

As a start, I have written two command-line utilities in Delphi to get it working:

TxtToUtf8.exe
Utf8ToTxt.exe

I am considering adding such code to my hpwNLUtility lib.
Hans-Peter

pjot
Posts: 733
Joined: Thu Feb 26, 2004 10:19 pm
Location: The Hague, The Netherlands

Post by pjot »

When using GTK you'll have the same problem. I created a UTF-8 converter in newLisp, which converts high ASCII to UTF-8 format:

Code:

(context 'UTF)

# Only replace extended ASCII characters by 2-byte UTF-8 sequence
(define (UTF:UTF str, t x b1 b2)
(set 't 0)
(while (< t (length str))
	(begin
		(set 'x (nth t str))
		(if (> (char x) 127)
			(begin
				(set 'b1 (+ (/ (& (char x) 192) 64) 192))
				(set 'b2 (+ (& (char x) 63) 128))
				(set-nth t str (append (char b1)(char b2)))
				(inc 't)
			)
		)
		(inc 't)
	)
)
str)

(context 'MAIN)

(println (UTF "é ö ù ñ ô"))
(exit)
Result:
é ö ù ñ ô
So this works with ASCII 0-255. Higher codes in the UCS-2 table are not supported by this context but can be converted in a similar way.

Peter

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

This works fine, but it should only be executed on a non-UTF8 version of newLISP. If you do this on a UTF8 version of newLISP, the following would hang:

Code:

(UTF "\148") ; hangs on utf8 version of newLISP

; fine on non-utf8 version
"\194\148"
The reason may be that 'nth' and 'set-nth' do not work on bytes but (multi-byte) characters in the UTF8 version.
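
A byte-oriented variant may sidestep this. The following is only a sketch, not code from this thread: the name `latin1-to-utf8` is made up for illustration, it assumes the input is ISO-8859-1, and it assumes that `slice`, `pack` and `unpack` work on single bytes in both the UTF8 and the non-UTF8 version.

```newlisp
; Sketch: ISO-8859-1 -> UTF-8, working on raw bytes instead of characters.
; 'latin1-to-utf8' is a hypothetical helper name, not part of newLISP.
(define (latin1-to-utf8 str)
  (let ((parts '()) (b 0))
    (dotimes (i (length str))                       ; length counts bytes
      ; read one byte as an unsigned integer
      (set 'b (first (unpack "b" (slice str i 1))))
      (push (if (< b 128)
                (pack "b" b)                        ; plain ASCII: copy as is
                (pack "b b"                         ; 128..255: two-byte sequence
                      (+ 192 (>> b 6))              ; lead byte  110xxxxx
                      (+ 128 (& b 63))))            ; trail byte 10xxxxxx
            parts -1))
    (join parts)))
```

By the same arithmetic as in Peter's context, byte 148 becomes the pair 194 and 148, so `(latin1-to-utf8 "\148")` should give the `"\194\148"` shown above. Collecting the pieces in a list and joining them once also avoids rewriting the string on every step.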

But here is a solution when using the UTF8 version of newLISP:

Code:

#!/usr/bin/newlisp

(unless (primitive? utf8)
        (begin
                (println "need UTF8 version of newLISP")
                (exit)))

(unless (set 'ascii-file (open (main-args 2)  "read"))
        (begin
                (println "cannot open " (main-args 2))
                (exit)))

(set 'utf8-file (open (main-args 3)  "write"))

; convert ascii file to utf8 file
(while (set 'chr (read-char ascii-file))
        (write-buffer utf8-file (char chr)))

(close ascii-file)
(close utf8-file)
(exit)
The routine relies on the fact that the function (char value) will convert any integer value to a UTF8 character string. Try this in a UTF8 version:

Code:

> (char 148)
"\194\148"
> 
But I believe HPW's problem may still not be solved. On a German PC with codepage 859, for example, 148 will be ö (o umlaut, an o with two dots on top). But in the official upper ASCII, 148 is not defined as a printable character.

Lutz

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

... I just realized there is a simple ASCII -> UTF8 converter on Windows: notepad.exe has a file type option when you save.

Lutz

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

Peter,

There seem to be problems with German umlauts.
And there may also be a performance problem.

Lutz,

Of course I was looking for a batch/code solution.
;-)
Hans-Peter

pjot
Posts: 733
Joined: Thu Feb 26, 2004 10:19 pm
Location: The Hague, The Netherlands

Post by pjot »

What would be the problem with German umlauts? In Dutch, we have the same 'umlauts' (called "trema") on some letters. But I can convert them all right. My context seems to work with the GTK widgets as well.

Perhaps you are using some non-standard codepage? I use ISO-8859-1 or ISO-8859-15 (which includes the euro symbol) myself. The first 256 values are the same as in UCS-2. If you are using a codepage that does not have the same values as UCS-2 for the first 256 characters, then indeed my context does not work.

Peter

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

Yes, ISO-8859-1 or ISO-8859-15 are fine for conversion. What I meant was PC codepage 850 (not 859), which ships with Windows and has printable characters in the range 128 to 160, where ISO-8859-1 has none. See the following links for clarification:

http://www.columbia.edu/kermit/cp850.html

http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html
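
For codepage 850 input, the bytes would first have to be mapped to their Unicode values before the two-byte encoding is applied. The following is only a sketch under stated assumptions: the names `cp850-map`, `utf8-2byte` and `cp850-to-utf8` are made up for illustration, the table is deliberately partial (German umlauts and sharp s only, values as read from the cp850 chart), and any other byte above 127 is replaced by "?".

```newlisp
; Partial CP850 -> Unicode table (assumption: only German umlauts and sharp s).
; Each entry is (cp850-byte unicode-value).
(set 'cp850-map '((129 252)    ; ü
                  (132 228)    ; ä
                  (142 196)    ; Ä
                  (148 246)    ; ö
                  (153 214)    ; Ö
                  (154 220)    ; Ü
                  (225 223)))  ; ß

; Encode a Unicode value in the range 128..2047 as two UTF-8 bytes.
(define (utf8-2byte cp)
  (pack "b b" (+ 192 (>> cp 6)) (+ 128 (& cp 63))))

; Hypothetical converter: walks the input byte by byte.
(define (cp850-to-utf8 str)
  (let ((parts '()) (b 0) (cp nil))
    (dotimes (i (length str))
      (set 'b (first (unpack "b" (slice str i 1))))
      (set 'cp (lookup b cp850-map))         ; nil when the byte is not in the table
      (push (cond ((< b 128) (pack "b" b))   ; ASCII: copy unchanged
                  (cp (utf8-2byte cp))       ; mapped character
                  (true "?"))                ; unmapped: placeholder
            parts -1))
    (join parts)))
```

With this table, byte 148 is looked up as 246 (ö) and encoded as the bytes 195 and 182. A complete converter would need all 128 upper entries of the codepage.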

Lutz

pjot
Posts: 733
Joined: Thu Feb 26, 2004 10:19 pm
Location: The Hague, The Netherlands

Post by pjot »

Thanks.

The Dutch versions of Windows also ship with codepage 850, but my newLisp e-text reader for example displays German texts without any problems. I guess that's because it is displayed in a widget, and not in a DOS box. That's how my confusion probably started.

There is a DOS command 'chcp' to change the codepage, though.
c:\> chcp 28591
This will change the codepage in a DOS box to ISO-8859-1. Also the font must be changed to Lucida Console in order to display the correct characters.

Peter

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

Hans-Peter

pjot
Posts: 733
Joined: Thu Feb 26, 2004 10:19 pm
Location: The Hague, The Netherlands

Post by pjot »

Don't tempt me to create a newLISP/GTK clone of this! ;-)

Peter
