Encoding surrealism in WIN10: UTF-8-NEWLISP in CMD.EXE

Post by IVShilov » Wed Apr 04, 2018 9:13 am

I spent 8 hours figuring out how this works in Windows cmd.exe and found a paradox.
Two paradoxes, in fact.
Try it yourself: all code in this post is copied and pasted from a cmd.exe window.

Start CMD.EXE, then start newlisp.exe without any init.lsp (the -n switch) and pass it a valid Cyrillic file path as the first parameter:

Code:
D:\tmp>r:\bin\newlisp\newlisp.exe -n "D:\tmp\Ё.doc"
newLISP v.10.7.1 64-bit on Windows IPv4/6 UTF-8 libffi, options: newlisp -h

> (last (main-args))
"D:\\tmp\\╨╕.doc"             # two characters, not one: it's Unicode (UTF-8)
> (load {R:\bin\newlisp\modules\iconv.lsp})
> (file? (last (main-args)))  # maybe newLISP accepts it as a valid path?
> (Iconv:convert (last (main-args)) {UTF-8} {CP866}) # OK, convert it from UTF-8 to CP866
> # one character now, but it should be "Ё"!

After hours of encoding and decoding between UTF-8, CP866 and CP1251, a lucky shot in the dark gave me paradox one: the UTF-8 path, once converted to CP866, must then be converted again, this time from CP1251 to CP866:
Code:
> (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866})
> #  no logic, but now we have a readable file path!
> (file? (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866}) ) # but what does newLISP itself think of it?

newLISP thinks there is no such file, but I think there is: I can see "D:\\tmp\\Ё.doc".
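
The double conversion can be replayed at the byte level. The sketch below is Python, not part of the original session, and is used only because its codecs are built in. The function name `replay_paradox_one` is made up for illustration. It assumes the two characters printed by (last (main-args)) are the bytes 0xD0 0xB8 as a CP866 console renders them. Under that assumption the chain ends in the CP866 byte for "Ё". The same bytes are also what you would get if the CP1251 byte for "Ё" (0xA8) had been misread as CP866 somewhere before reaching newLISP. That is one reading consistent with the output, not a confirmed diagnosis.

```python
def replay_paradox_one(shown: str) -> list[bytes]:
    """Replay the two Iconv:convert calls on the bytes behind the console text."""
    raw = shown.encode("cp866")                      # bytes newLISP seems to hold (assumption)
    step1 = raw.decode("utf-8").encode("cp866")      # (Iconv:convert s {UTF-8} {CP866})
    step2 = step1.decode("cp1251").encode("cp866")   # (Iconv:convert s {CP1251} {CP866})
    return [raw, step1, step2]


# "\u2568\u2555" is the pair of box-drawing characters from the transcript.
raw, step1, step2 = replay_paradox_one("\u2568\u2555")

print(raw.hex(), step1.hex(), step2.hex())
print(step2.decode("cp866"))   # how a CP866 console renders the final bytes

# Assumed origin: the CP1251 byte for "\u0401" (Ё), misread as CP866, stored as UTF-8.
print("\u0401".encode("cp1251").decode("cp866").encode("utf-8") == raw)
```

If this reading is right, the final bytes look correct on a CP866 console but are still neither UTF-8 nor CP1251, which would fit file? rejecting the path.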
Paradox two:
Code:
> (write-file {D:\tmp\1.txt} {1}) # OK, newLISP, is the file you created yourself...
> (file? {D:\tmp\1.txt}) # ... really a file?
> (write-file {D:\tmp\Ё.txt} {Ё}) # OK, now the special case
> (file? {D:\tmp\Ё.txt})
> (file? {D:\tmp\Ё.doc})

Ok, explorer.exe, what do you think about that?
[Attachment: Ё.doc.jpg]
[Attachment: Ё.txt.jpg]

People, I think only some kind of data flow diagram can clearly show what's going on under the hood of the GUI and where the silent charset translations take place.
[Attachment: DFD CMD-newlisp-OS.jpg]

As far as I know, CMD.EXE works in CP866, the file system API hands file paths to non-Unicode programs in CP1251 (NTFS itself stores names as UTF-16), and newlisp.exe works internally in UTF-8. Let's discuss.
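
To make these code pages concrete, here is a small Python sketch that prints the bytes for "Ё" in each encoding mentioned above, plus UTF-16LE, which Windows uses for its Unicode API. Python is used only for its built-in codecs, and `encodings_of` is a hypothetical helper, not something from newLISP or the original post.

```python
def encodings_of(ch: str) -> dict[str, str]:
    """Return the hex bytes of one character in the encodings discussed here."""
    names = ["cp866", "cp1251", "utf-8", "utf-16-le"]
    return {name: ch.encode(name).hex() for name in names}


for name, hexbytes in encodings_of("\u0401").items():   # "\u0401" is "Ё"
    print(f"{name:10} {hexbytes}")
```

One character, four different byte sequences: any layer that guesses the wrong one silently turns it into something else.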

Re: Encoding surrealism in WIN10: UTF-8-NEWLISP in CMD.EXE

Post by TedWalther » Wed Apr 04, 2018 5:55 pm

Wow, good detective work. Thank you for that nice diagram. Do you have a program to auto-generate it from a script, or was it a hand drawn work of art? I ask because often I'd like to make similar diagrams to illustrate things.

Re: Encoding surrealism in WIN10: UTF-8-NEWLISP in CMD.EXE

Post by IVShilov » Thu Apr 05, 2018 11:25 am

The diagram is purely handmade, not generated by a script.

In cmd.exe, the INPUT and OUTPUT encoding for a started process can be changed with the "chcp" command (CHange Code Page, https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chcp):
Code:
> (exec "chcp")         # get current code page
("Active code page: 866")
> (exec "chcp 1251") # change code page to CP1251
("Active code page: 1251")
> (prompt-event
 (fn (ctx)
    (string
     (last (parse (first (exec "chcp")) { }))
     { > })))

1251 > # now the prompt shows the active code page, which helps debugging

To understand clearly what's going on, I use command-event and reader-event handlers:
Code:
1251 > (reader-event (lambda (ex) (println "reader-event IN: " ex)))
 => (reader-event (lambda (ex) (println "reader-event IN: " ex)))
1251 > (char "Ё")
reader-event IN: (char "Ё")
1251 > (command-event (fn (s)(println {command-event IN:} s) s))
reader-event IN: (command-event (lambda (s) (println "command-event IN:" s) s))
1251 >

Now try the "Greek omega test" from the newLISP manual with chcp set to UTF-8 (65001) and to CP1251:
Code:
65001 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))

The output is good, but:
Code:
65001 > (println "Ω")
command-event IN:(println

ERR: missing parenthesis : "...(println"
65001 >

It looks like cmd.exe never passed "Ω" to the newlisp subprocess: see command-event IN:(println, where the string is cut off.

Now try CP1251:
Code:
1251 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))

The output fails, and
Code:
1251 > (print "Ω")
command-event IN:(print "?")
reader-event IN: (print "?")
1251 >

This is unsuccessful too, because a UTF-8 -> CP1251 conversion is needed, and CP1251 has no "Ω" letter.
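
The missing letter is easy to confirm outside the console. A minimal Python sketch, where `to_code_page` is a hypothetical helper and `errors="replace"` stands in for whatever substitution the console performs (an assumption, though it yields the same "?" as the transcript):

```python
def to_code_page(text: str, code_page: str = "cp1251") -> bytes:
    """Encode text for a code page, substituting "?" for unmappable characters."""
    return text.encode(code_page, errors="replace")


omega_call = '(print "\u03a9")'            # "\u03a9" is the Greek capital omega
print(to_code_page(omega_call))            # CP1251 has no slot for omega
print(to_code_page(omega_call, "utf-8"))   # UTF-8 keeps it as two bytes
```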

Two days out of luck.
Possible solutions:
A) set cmd.exe to "chcp 1251":
- translate INPUT to newLISP from CP1251 to UTF-8 in a command-event handler;
- translate OUTPUT from newLISP from UTF-8 to CP1251 in another event handler; I don't know of one.
B) set cmd.exe to "chcp 65001" and figure out the input translation by reading the Microsoft docs.
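
The two translations in option A can be sketched in a few lines. This is Python, not newLISP, and only illustrates the direction of each conversion. The names `console_to_utf8`, `utf8_to_console` and `CONSOLE_CP` are hypothetical, and the code page is assumed to be the one set by "chcp 1251".

```python
CONSOLE_CP = "cp1251"   # assumed active code page, as set by "chcp 1251"


def console_to_utf8(data: bytes, code_page: str = CONSOLE_CP) -> bytes:
    """INPUT direction: console bytes -> UTF-8 bytes for the interpreter."""
    return data.decode(code_page).encode("utf-8")


def utf8_to_console(data: bytes, code_page: str = CONSOLE_CP) -> bytes:
    """OUTPUT direction: UTF-8 bytes -> console bytes, "?" where unmappable."""
    return data.decode("utf-8").encode(code_page, errors="replace")


typed = b'(char "\xa8")'                   # bytes a CP1251 console sends for (char "Ё")
as_utf8 = console_to_utf8(typed)
print(as_utf8)
print(utf8_to_console(as_utf8) == typed)   # the round trip is lossless for Cyrillic
```

Characters outside CP1251, such as "Ω", still come out as "?" on the output side, so option A cannot fix the omega test.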

Python has this problem too: https://github.com/Drekin/win-unicode-console/tree/development#win-unicode-console.

A post for those interested in this problem.

Post by IVShilov » Tue Mar 26, 2019 8:44 pm

IVShilov wrote: Two days out of luck.

Many more days out of luck, but some light now illuminates the darkness.

Many (all?) UTF-8 apps started from (in?) CMD.EXE have problems in the Windows environment (Python too).

The full view of what's under the hood, from (the depths of) MS: https://devblogs.microsoft.com/commandl ... kgrounder/

This guy knows how to make CMD.EXE fully support UTF-8:
1. The best answer in the thread "How to use unicode characters in Windows command line?", which briefly explains the problem: https://stackoverflow.com/questions/388 ... mmand-line
2. His site with solutions: https://math.berkeley.edu/~serganov/ilyaz.org/keyboard/

In any case, we need functions for encoding/decoding (I still cannot import libiconv and am forced to use iconv.exe) and other batteries in the newLISP distro, like Python has.
UPD: IMHO as a minimum we need a predicate (utf? str) like
Code:
(define (utf? str) (= (length str) (utf8len str)))

to figure out whether to expect problems or not.
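
The predicate compares the byte length with the character count. As the newLISP manual describes length and utf8len for UTF-8 builds, the two are equal only when the string has no multi-byte characters, so as written (utf? str) is true for plain ASCII. Below is a Python sketch of the same comparison under the hypothetical name `has_multibyte`. Note that it uses the opposite polarity: it is true when non-ASCII characters are present.

```python
def has_multibyte(text: str) -> bool:
    """True when the UTF-8 byte length differs from the character count."""
    return len(text.encode("utf-8")) != len(text)


print(has_multibyte("D:\\tmp\\1.txt"))        # ASCII-only path
print(has_multibyte("D:\\tmp\\\u0401.txt"))   # "\u0401" is "Ё"
```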