Encoding surrealism in WIN10: UTF-8-NEWLISP in CMD.EXE


Post by IVShilov »

I spent 8 hours figuring out HOW encoding works in Windows cmd.exe and found a paradox.
Two paradoxes, actually.
Try it yourself: all code in this post is copied and pasted from a cmd.exe window.

Start CMD.EXE, run newlisp.exe without any init.lsp (the -n switch), and pass it a valid Cyrillic file path as the first parameter:

Code: Select all

D:\tmp>r:\bin\newlisp\newlisp.exe -n "D:\tmp\Ё.doc"
newLISP v.10.7.1 64-bit on Windows IPv4/6 UTF-8 libffi, options: newlisp -h

> (last (main-args))
"D:\\tmp\\╨╕.doc"             # two characters instead of one: the name is UTF-8 encoded
> (load {R:\bin\newlisp\modules\iconv.lsp})
MAIN
> (file? (last (main-args)))  # maybe it is still treated as a valid path?
nil
> (Iconv:convert (last (main-args)) {UTF-8} {CP866}) # OK, de-UNICODE it
"D:\\tmp\\и.doc"
> # one character, but it should be "Ё"!
After hours of encoding and decoding between UTF-8, CP866 and CP1251, a lucky shot in the dark gave me paradox one: the UTF-8 path, once converted to CP866, must be converted again, this time from CP1251 to CP866:

Code: Select all

> (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866}) 
"D:\\tmp\\Ё.doc" 
> #  no logic, but now we have a readable file path!
> (file? (Iconv:convert (Iconv:convert (last (main-args)) {UTF-8} {CP866}) {CP1251} {CP866}) ) # but what does newlisp itself think about it?
nil
newLISP thinks there is no such file, but I think there is: I can see "D:\\tmp\\Ё.doc".
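One byte-level reading that reproduces every string in the transcripts above, sketched in Python purely as a neutral byte calculator (it is not newLISP code). The assumed chain is a guess, not something confirmed in this thread: the argument reaches newlisp.exe as the CP1251 byte for Ё, gets read as CP866, and is then stored as UTF-8. All variable names are made up for illustration.

```python
# Sketch: replay the byte values seen in the transcript with Python's codecs.
# The chain below is an ASSUMPTION that happens to match the output;
# it is not taken from newLISP or Windows source code.

name = "Ё"                                       # U+0401, the real file name

ansi = name.encode("cp1251")                     # b'\xa8'
misread = ansi.decode("cp866")                   # 'и': the same byte under CP866
arg = misread.encode("utf-8")                    # b'\xd0\xb8': what (main-args) seems to hold
shown = arg.decode("cp866")                      # '╨╕': how a CP866 console draws those bytes

step1 = arg.decode("utf-8").encode("cp866")      # like Iconv UTF-8 -> CP866: b'\xa8'
step2 = step1.decode("cp1251").encode("cp866")   # like Iconv CP1251 -> CP866: b'\xf0'
looks_right = step2.decode("cp866")              # 'Ё' when drawn by a CP866 console

wanted = name.encode("utf-8")                    # b'\xd0\x81': UTF-8 for Ё

# Byte reprs are plain ASCII, so this prints safely in any console.
print(arg, step1, step2, wanted, step2 == wanted)
```

If this reading is right, the double conversion only repairs how the name is drawn in a CP866 window: it ends in the single byte F0, while UTF-8 for Ё is the byte pair D0 81. Whether file? in the UTF-8 build wants exactly those UTF-8 bytes is itself an assumption.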
Paradox two:

Code: Select all

> (write-file {D:\tmp\1.txt} {1}) # OK, newlisp, will a file you create yourself...
1
> (file? {D:\tmp\1.txt}) # ... count as a real file?
true
> (write-file {D:\tmp\Ё.txt} {Ё}) # OK, now the special case
1
> (file? {D:\tmp\Ё.txt})
true
> (file? {D:\tmp\Ё.doc})
nil
> 
Ok, explorer.exe, what do you think about that?
Ё.doc: [screenshot: Ё.doc.jpg]
Ё.txt: [screenshot: Ё.txt.jpg]
People, I think only some kind of Data Flow Diagram can clearly show what is going on under the hood of the GUI and where the silent charset translations take place.
[diagram: DFD CMD-newlisp-OS.jpg]
As far as I know, CMD.EXE works in CP866, the file system stores file paths in CP1251, and newlisp.exe works internally in UTF-8. Let's discuss.

Post by TedWalther »

Wow, good detective work. Thank you for that nice diagram. Do you have a program to auto-generate it from a script, or was it a hand drawn work of art? I ask because often I'd like to make similar diagrams to illustrate things.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence. Nine months later, they left with a baby named newLISP. The women of the ivory towers wept and wailed. "Abomination!" they cried.

Post by IVShilov »

The diagram is purely handmade, not generated by a script.

In cmd.exe, the encoding of INPUT and OUTPUT for a started process can be changed with the "chcp" command (CHange Code Page, https://docs.microsoft.com/en-us/window ... mands/chcp):

Code: Select all

> (exec "chcp")         # get current code page
("Active code page: 866")
> (exec "chcp 1251") # change code page to CP1251
("Active code page: 1251")
>
(prompt-event
 (fn (ctx)
   (string
    (last (parse (first (exec "chcp")) { }))
    { > })))

$prompt-event
1251 > # for debugging, the prompt now shows the active code page
To see clearly what is going on, I use command-event and reader-event handlers:

Code: Select all

1251 > (reader-event (lambda (ex) (println "reader-event IN: " ex)))
 => (reader-event (lambda (ex) (println "reader-event IN: " ex)))
$reader-event
1251 > (char "Ё")
reader-event IN: (char "Ё")
168
1251 > (command-event (fn (s)(println {command-event IN:} s) s))
reader-event IN: (command-event (lambda (s) (println "command-event IN:" s) s))
$command-event
1251 > 
Now try the "Greek omega test" from the newLISP manual under the CP1251 and UTF-8 (65001) chcp settings:

Code: Select all

65001 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))
Ω
"Ω"
The output is good, but the input is not:

Code: Select all

65001 > (println "Ω")
command-event IN:(println

ERR: missing parenthesis : "...(println"
65001 > 
It looks like cmd.exe never passed "Ω" to the newlisp subprocess: see command-event IN:(println, where the string is cut off.

Try CP1251:

Code: Select all

1251 > (println (char 937))
command-event IN:(println (char 937))
reader-event IN: (println (char 937))
О©
"О©"
The output fails, and so does the input:

Code: Select all

1251 > (print "Ω")
command-event IN:(print "?")
reader-event IN: (print "?")
?"?"
1251 >
This is unsuccessful too, because a UTF-8 -> CP1251 conversion is needed, and CP1251 has no "Ω" letter.
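Both CP1251 results can be reproduced byte by byte. Here is a small Python sketch, used only as a codec calculator with illustrative variable names: the garbled output is what you get when the two UTF-8 bytes of Ω are each read as a CP1251 character, and the "?" on input is what is left when Ω is forced into CP1251.

```python
# Sketch: the two CP1251 failures from the transcript, byte by byte.
omega = chr(937)                      # "Ω", the same code point as (char 937)
utf8 = omega.encode("utf-8")          # b'\xce\xa9', what a UTF-8 program writes

# Output side: a console set to CP1251 maps each byte to one character.
garbled = utf8.decode("cp1251")       # Cyrillic О followed by the copyright sign

# Input side: Ω has no slot in CP1251, so it cannot survive the trip.
try:
    omega.encode("cp1251")
    representable = True
except UnicodeEncodeError:
    representable = False
fallback = omega.encode("cp1251", errors="replace")   # b'?'

# Only ASCII-safe values are printed.
print(utf8, representable, fallback)
```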

Two days out of luck.
Possible solutions:
A) set cmd.exe to "chcp 1251":
- translate INPUT to newlisp from CP1251 to UTF-8 in a command-event handler;
- translate OUTPUT from newlisp from UTF-8 to CP1251 in another event handler (I don't know of one).
B) set cmd.exe to "chcp 65001" and figure out the input translation by reading the Microsoft docs.
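The two translations in option A can be sketched as a pair of pure functions. This is Python rather than newLISP, the function names are hypothetical, and the code page is assumed to match the active chcp value. The sample input uses byte 168 (0xA8), which is what (char "Ё") returned under chcp 1251 in the transcript above.

```python
# Sketch of option A as two pure functions; all names are hypothetical.
CONSOLE_CP = "cp1251"   # assumed to match the active "chcp" value

def console_to_utf8(raw: bytes, codepage: str = CONSOLE_CP) -> bytes:
    """INPUT direction: bytes typed in the console -> UTF-8 for the interpreter."""
    return raw.decode(codepage).encode("utf-8")

def utf8_to_console(data: bytes, codepage: str = CONSOLE_CP) -> bytes:
    """OUTPUT direction: UTF-8 from the interpreter -> console bytes.
    Characters missing from the code page are replaced with '?'."""
    return data.decode("utf-8").encode(codepage, errors="replace")

typed = b'(char "\xa8")'                          # (char "Ё") as CP1251 bytes
as_utf8 = console_to_utf8(typed)                  # Ё becomes the pair D0 81
back = utf8_to_console(as_utf8)                   # round-trips to the typed bytes
lossy = utf8_to_console(chr(937).encode("utf-8")) # Ω cannot be represented

print(as_utf8, back == typed, lossy)
```

The last line shows the limit of option A: anything outside CP1251, such as Ω, is lost on output no matter where the translation hook is installed.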

Python has this problem too: https://github.com/Drekin/win-unicode-c ... de-console.

A post for anyone interested in this problem.

Post by IVShilov »

IVShilov wrote: Two days out of luck.
Many more days out of luck, but some light now illuminates the darkness.

Many (all?) UTF-8 apps started from (in?) CMD.EXE have problems in the Windows environment (Python too).

A full view of what is under the hood, from (the depths of) MS: https://devblogs.microsoft.com/commandl ... kgrounder/

This guy knows how to force CMD.EXE to fully support UTF-8:
1. The best answer in the thread "How to use unicode characters in Windows command line?" briefly explains the problem: https://stackoverflow.com/questions/388 ... mmand-line
2. His site with solutions: https://math.berkeley.edu/~serganov/ilyaz.org/keyboard/

In any case, we need functions for encoding/decoding (I still cannot import libiconv and am forced to use iconv.exe) and other batteries in the newLISP distribution, like Python has.
UPD: IMHO, as a minimum we need a predicate (utf? str) like

Code: Select all

(define (utf? str) (= (length str) (utf8len str)))
to figure out whether to expect problems or not.
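In newLISP, length counts bytes and utf8len counts UTF-8 characters, so the predicate above is true when every character takes exactly one byte. Here is a Python sketch of that comparison, with a separate validity check added for contrast. Both function names are hypothetical, and neither is a proposal for the actual newLISP implementation.

```python
# Sketch: what a "byte length == character length" test can and cannot tell.
def same_length(raw: bytes) -> bool:
    """Mirror of the proposed predicate: byte count vs. UTF-8 character count.
    Assumes raw is valid UTF-8; True means every character is a single byte."""
    return len(raw) == len(raw.decode("utf-8"))

def is_valid_utf8(raw: bytes) -> bool:
    """A different question: can the bytes be decoded as UTF-8 at all?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_path = b"D:\\tmp\\1.txt"
utf8_path = "D:\\tmp\\Ё.doc".encode("utf-8")
oem_path = "D:\\tmp\\Ё.doc".encode("cp866")   # contains the lone byte 0xF0

print(same_length(ascii_path), same_length(utf8_path))
print(is_valid_utf8(utf8_path), is_valid_utf8(oem_path))
```

Read this way, a true result means "no multibyte characters, so no trouble expected". A path that arrives in CP866 or CP1251 is a separate case that only a validity check would flag.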
