Unicode path handling on Windows

Post by m35 »

Warning, huge post

In response to my troubles found here.

The following newLISP functions don't seem to work for Unicode paths on Windows:
directory?
file?
change-dir
delete-file
directory
file-info
make-dir
open
real-path
remove-dir
rename-file
read-file
write-file
append-file
save
Edit: forgot load!

As a result, I implemented equivalent functions that can work with Unicode paths, using direct calls to the Win32 API (and now with my luck, someone will post an easy, 30 second fix to this problem that would have saved me the trouble ;).

I'm throwing this code out here in the rare chance anyone needs it, and to increase awareness about encoding issues.


Some technical details:
The Microsoft C run-time library (msvcrt.dll) provides wide-character (Unicode) versions of all the functions used in the newLISP source file "nl-filesys.c" (the only exceptions being the opendir functions). These wide variants only accept UTF-16 strings as arguments. So I modeled this code after the newLISP source, with three main differences:
* First converts UTF-8 strings to UTF-16
* Uses the Win32 API Unicode functions
* Written in the awesome language of newLISP :)

Please forgive me if the form is terrible.

Code:

;;;
;;; Windows32 Wide Input/Output  v0.2
;;;
;;; Unicode substitutions for functions that read or write path names.
;;;
;;;  # Predicates #
;;;     directory?  -> wdirectory?
;;;     file?       -> wfile?
;;;
;;;  # Input/output and file operations #
;;;     change-dir  -> wchange-dir
;;;     delete-file -> wdelete-file
;;;     directory   -> wwin-dir *
;;;     file-info   -> wfile-info
;;;     load        -> ***
;;;     make-dir    -> wmake-dir
;;;     open        -> wopen
;;;     real-path   -> wreal-path
;;;     remove-dir  -> wremove-dir
;;;     rename-file -> wrename-file
;;;    
;;;  # File and directory management #
;;;     read-file   -> wread-file
;;;     write-file  -> wwrite-file
;;;     append-file -> wappend-file
;;;     save        -> **
;;;
;;; All functions work the same, accepting UTF-8 strings, and should be 
;;; pretty fast, as they are simple wrappers. Exceptions are directory and 
;;; save:
;;; *  wwin-dir works like the windows dir command, and doesn't
;;;    provide regex filtering. It is also pretty slow. 
;;; ** I didn't feel like making an alternative save function.
;;; *** Forgot about load
;;;
;;;
;;; Note:
;;;  All comments regarding header files refer to the 
;;;  Microsoft Visual Studio files, gleaned from
;;;  MSDN documentation.
;;;

(unless utf8 (throw-error "Module W32-WIO requires UTF-8 enabled newLISP."))

(context 'W32-WIO)

## Globals #####################################################################

(constant 'SIZEOF_WCHAR 2) ; assumption

## Conversion: UTF-8 <-> UTF-16 ################################################

# // Declared in <winnls.h>
(constant 'CP_UTF8 65001) ; code page 65001 = UTF-8

;...............................................................................

# // Declared in <winnls.h>; include <windows.h> 
# int MultiByteToWideChar(
#   UINT CodePage, 
#   DWORD dwFlags,         
#   LPCSTR lpMultiByteStr, 
#   int cbMultiByte,       
#   LPWSTR lpWideCharStr,  
#   int cchWideChar        
# );
(import "kernel32.dll" "MultiByteToWideChar")

;...............................................................................

# This function takes the place of WideCharToMultiByte.
# Note: each 16-bit unit is widened on its own, so surrogate pairs
# (characters outside the BMP) are not combined.
(define (utf16->utf32 s , len)
    (setq s
        (map
            (fn (c) (pack "u" c) )
            ; Windows returns little-endian ("<u") encoding
            (unpack (dup "<u" (>> (length s) 1)) s)
        )
    )
    ; Find the end of the string (double NULL); nil if there is none
    (setq len (find "\000\000" s))
    (if len
        ; Trim off the excess, keeping the terminator
        (append (join (slice s 0 (+ len 1)) "\000\000") "\000\000")
        ; If no end found, pad the last character and add our own (quad NULL)
        (append (join s "\000\000") "\000\000\000\000\000\000")
    )
)

;...............................................................................

(define (utf8->16 lpMultiByteStr , cchWideChar lpWideCharStr ret)

    ; calculate the size of buffer (in WCHAR's)
    (setq cchWideChar (MultiByteToWideChar 
        CP_UTF8 ; from UTF-8
        0       ; no flags necessary
        lpMultiByteStr 
        -1      ; convert until NULL is encountered
        0
        0
    ))
    
    ; allocate the buffer
    (setq lpWideCharStr (dup " " (* cchWideChar SIZEOF_WCHAR)))
    
    ; convert
    (setq ret (MultiByteToWideChar 
        CP_UTF8 ; from UTF-8
        0       ; no flags necessary
        lpMultiByteStr
        -1      ; convert until NULL is encountered
        lpWideCharStr
        cchWideChar
    ))
    (if (> ret 0) lpWideCharStr nil)
)

## wdirectory? #################################################################

(constant 'S_IFDIR 0040000)

;...............................................................................

(define (wdirectory? str-path , info)
    ; wfile-info returns nil when the path does not exist
    (setq info (wfile-info str-path))
    (and info (= (& (info 1) S_IFDIR) S_IFDIR))
)

## wfile? ######################################################################

(define (wfile? str-name)
    (true? (wfile-info str-name))
)

;...............................................................................


## wchange-dir #################################################################

# // Declared in <direct.h> or <wchar.h>
# int _wchdir( 
#    const wchar_t *dirname 
# );
(import "msvcrt.dll" "_wchdir")

;...............................................................................

(define (wchange-dir str-path)
    (case (_wchdir (utf8->16 str-path))
        (0 true)
        (-1 nil)
        (true (throw-error "unexpected return value from _wchdir"))
    )
)

## wdelete-file ################################################################

# // Declared in <io.h> or <wchar.h>
# int _wunlink(
#    const wchar_t *filename 
# );
(import "msvcrt.dll" "_wunlink")

;...............................................................................

(define (wdelete-file str-file-name)
    (case (_wunlink (utf8->16 str-file-name))
        (0 true)
        (-1 nil)
        (true (throw-error "unexpected return value from _wunlink"))
    )
)

## wwin-dir ####################################################################

# // Declared in <io.h> or <wchar.h>
# intptr_t _wfindfirst(
#    const wchar_t *filespec,
#    struct _wfinddata_t *fileinfo 
# );
; Note: MinGW library has the function _wopendir(),
;       which I assume calls _wfindfirst
(import "msvcrt.dll" "_wfindfirst")

# // Declared in <io.h> or <wchar.h>
# int _wfindnext(
#    intptr_t handle,
#    struct _wfinddata_t *fileinfo 
# );
(import "msvcrt.dll" "_wfindnext")

# // Declared in <io.h> or <wchar.h>
# int _findclose( 
#    intptr_t handle 
# );
(import "msvcrt.dll" "_findclose")

;...............................................................................

# // Declared in <sys/stat.h> 
# typedef long time_t;     
# typedef unsigned long _fsize_t;

# // Declared in <io.h> or <wchar.h>
# struct _wfinddata_t {
#     unsigned    attrib;       // 4
#     time_t      time_create;  // 4
#     time_t      time_access;  // 4  
#     time_t      time_write;   // 4
#     _fsize_t    size;         // 4
#     wchar_t     name[260];    // 260 * SIZEOF_WCHAR = 520
# };

(constant 'SIZEOF_wfinddata_t (+ 4 4 4 4 4 520))

(define (unpack_wfinddata_t str-data )
    (unpack "lu ld ld ld ld s520" str-data)
)

;...............................................................................


;; wwin-dir provides information like directory, but has a different interface.
;; This was due to how much trouble it was trying to replicate the directory
;; interface, and how slow the function was becoming as a result.
;; It accepts only one optional argument: str-path
;; str-path works like the one argument to the dir command in the console:
;;   (wwin-dir "*")       =  dir *
;;   (wwin-dir "*.txt")   =  dir *.txt 
;;   (wwin-dir "c:\\*.*") =  dir c:\*.*
;;
;; If you want regex filtering, you'll have to manually do it on the 
;; returned list.
(define (wwin-dir (str-path "*") , info handle dirlist)
    (setq str-path (utf8->16 str-path))
    ; allocate space for info
    (setq info (dup " " SIZEOF_wfinddata_t))
    ; get the first directory entry
    (setq handle (_wfindfirst str-path info))
    (if (!= handle -1) 
        (begin
            (setq dirlist '())
            (do-while (zero? (_wfindnext handle info))
                (push 
                    (utf8 (utf16->utf32 (last (unpack_wfinddata_t info))))
                    dirlist -1
                )
                (setq info (dup " " SIZEOF_wfinddata_t))
            )
            (_findclose handle)
            dirlist
        )
        nil
    )
)

## wfile-info ##################################################################


# // Declared in <sys\types.h>
# typedef unsigned int _dev_t;
# typedef unsigned short _ino_t;
# typedef long _off_t;

# // Declared in <sys\stat.h>
# typedef long time_t;     
# typedef unsigned long _fsize_t;
#
# struct _stati64 {                    ofs  size
#         _dev_t st_dev;            // (0   lu     = 4)
#         _ino_t st_ino;            // (4   u      = 2)
#         unsigned short st_mode;   //  6   u      = 2
#         short st_nlink;           // (8   d->u   = 2)
#         short st_uid;             //  10  d->u   = 2
#         short st_gid;             //  12  d->u   = 2
#                                   // (14  n2     = 2)
#         _dev_t st_rdev;           //  16  lu     = 4
#                                   // (20  n4     = 4)
#         __int64 st_size;          //  24  L->Lu  = 8
#         time_t st_atime;          //  32  ld->lu = 4
#         time_t st_mtime;          //  36  ld->lu = 4
#         time_t st_ctime;          //  40  ld->lu = 4
#         };

(constant 'SIZEOF_stat 44)

(define (unpack_stat data)
    (unpack "lu u u u u u n2 lu n4 Lu lu lu lu" data)
)

;...............................................................................

# // Declared in <sys\stat.h>
# int _wstati64(
#    const wchar_t *path,
#    struct _stati64 *buffer 
# );
(import "msvcrt.dll" "_wstati64")

;...............................................................................

(define (wfile-info str_name , fileInfo)
    ; allocate space for file info
    (setq fileInfo (dup "\000" SIZEOF_stat))
    (case (_wstati64 (utf8->16 str_name) fileInfo)
        (0 (select (unpack_stat fileInfo)  '(7 2 6 4 5 8 9 10)))
        (-1 nil)
        (true (throw-error "unexpected return value from _wstati64"))
    )
)

## wmake-dir ###################################################################

# // Declared in <direct.h> or <wchar.h>
# int _wmkdir(
#    const wchar_t *dirname 
# );
(import "msvcrt.dll" "_wmkdir")

;...............................................................................

(define (wmake-dir str-dir-name)
    (case (_wmkdir (utf8->16 str-dir-name))
        (0 true)
        (-1 nil)
        (true (throw-error "unexpected return value from _wmkdir"))
    )
)

## wopen #######################################################################

# // Declared in <io.h> or <wchar.h>
# int _wopen(
#    const wchar_t *filename,
#    int oflag [,
#    int pmode] 
# );
(import "msvcrt.dll" "_wopen")

;...............................................................................

# // Declared in <fcntl.h>
(constant 'O_RDONLY 0x0000)
(constant 'O_WRONLY 0x0001)
(constant 'O_RDWR   0x0002)
(constant 'O_APPEND 0x0008)
(constant 'O_CREAT  0x0100)
(constant 'O_TRUNC  0x0200)
(constant 'O_EXCL   0x0400)
(constant 'O_TEXT   0x4000)
(constant 'O_BINARY 0x8000)

# // Declared in <sys/stat.h>
(constant 'S_IFMT   0170000) 
(constant 'S_IFDIR  0040000) 
(constant 'S_IFCHR  0020000) 
(constant 'S_IFIFO  0010000) 
(constant 'S_IFREG  0100000) 
(constant 'S_IREAD  0000400) 
(constant 'S_IWRITE 0000200) 
(constant 'S_IEXEC  0000100) 

;...............................................................................

(define (wopen str-path-file str-access-mode , handle)
    (setq str-path-file (utf8->16 str-path-file))
    (setq handle
        (if 
            (starts-with str-access-mode "r")
            (_wopen str-path-file (| O_RDONLY O_BINARY ) 0)
            
            (starts-with str-access-mode "w")
            (_wopen str-path-file 
                (| O_WRONLY O_CREAT O_TRUNC O_BINARY )
                (| S_IREAD S_IWRITE)
            )
            
            (starts-with str-access-mode "u")
            (_wopen str-path-file (| O_RDWR O_BINARY) 0)
            
            (starts-with str-access-mode "a")
            (_wopen str-path-file 
                (| O_RDWR O_APPEND O_BINARY O_CREAT) 
                (| S_IREAD S_IWRITE)
            )
            
            -1
        )
    )
    (if (= handle -1) nil handle)
)

## wreal-path ##################################################################

# // Declared in <windef.h>
(constant 'MAX_PATH 260)

# // Declared in <winbase.h>; include <windows.h>.
# DWORD GetFullPathName(
#   LPCTSTR lpFileName,
#   DWORD nBufferLength,
#   LPTSTR lpBuffer,
#   LPTSTR* lpFilePart
# );
(import "kernel32.dll" "GetFullPathNameW")

;...............................................................................

(define (wreal-path (str-path ".") , realpath len)
    ; allocate space for real path
    (setq realpath (dup "\000" (* MAX_PATH SIZEOF_WCHAR)))
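    ; returns the length of the string (without the terminating NULL),
    ; 0 on failure, or the required size if the buffer was too small
    (setq len (GetFullPathNameW (utf8->16 str-path) MAX_PATH realpath 0 ))
    (if (or (zero? len) (>= len MAX_PATH))
        nil
        ; trim the result
        (utf8 (utf16->utf32 (slice realpath 0 (* (+ len 1) SIZEOF_WCHAR))))
    )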
)

## wremove-dir #################################################################

# // Declared in <direct.h> or <wchar.h>
# int _wrmdir(
#    const wchar_t *dirname 
# );
(import "msvcrt.dll" "_wrmdir")

;...............................................................................

(define (wremove-dir str-path)
    (case (_wrmdir (utf8->16 str-path))
        (0 true)
        (-1 nil)
        (true (throw-error "unexpected return value from _wrmdir"))
    )
)

## wrename-file ################################################################

# // Declared in <stdio.h> or <wchar.h>
# int _wrename(
#    const wchar_t *oldname,
#    const wchar_t *newname 
# );
(import "msvcrt.dll" "_wrename")

;...............................................................................

(define (wrename-file str-path-old str-path-new)
    (case (_wrename (utf8->16 str-path-old) (utf8->16 str-path-new))
        (0 true)
        (-1 nil)
        (true (throw-error "unexpected return value from _wrename"))
    )
)

## wread-file ##################################################################

(define (wread-file str-file-name , buff tmp-buff handle)
    (if (setq handle (wopen str-file-name "r")) ; open file
        (begin
            (setq buff ""  tmp-buff "")
            (if (read-buffer handle 'buff 0xFFFF) ; read the first chunk (up to 0xFFFF bytes)
                (while (read-buffer handle 'tmp-buff 0xFFFF)
                    (write-buffer buff tmp-buff)
                )
            )
            (close handle)
            buff
        )
        nil ; couldn't open file
    )
)

## wwrite-file #################################################################

(define (wwrite-file str-file-name str-buffer , handle ret)
    (if (setq handle (wopen str-file-name "w")) ; open file
        (begin
            (setq ret (write-buffer handle str-buffer))
            (close handle)
            ret
        )
        nil ; couldn't open file
    )
)

## wappend-file ################################################################

(define (wappend-file str-filename str-buffer , handle ret)
    (if (setq handle (wopen str-filename "a")) ; open file
        (begin
            (setq ret (write-buffer handle str-buffer))
            (close handle)
            ret
        )
        nil ; couldn't open file
    )
)

################################################################################

(context 'MAIN)


;; Quick and dirty test of the functions in this module.
;; It is by no means comprehensive.
;; You're welcome to monitor the %TEMP% directory as things change.
(define (test-W32-WIO)
    (setq tmpdir (env "TEMP"))
    (unless tmpdir (throw-error "Couldn't find a temp directory to test in."))
    
    (setq unifile "\230\162\182\230\181\166\231\148\177\232\168\152")
    (setq unifile2 "notunicode")
    
    (setq unidir (append tmpdir "\\" unifile))
    
    (println "Making unicode dir")
    (unless (W32-WIO:wmake-dir unidir) 
        (throw-error "failed! (does the dir already exist?)"))
    (println "ok")(read-line)
        
    (println "Is it a directory?")
    (unless (W32-WIO:wdirectory? unidir) (throw-error "failed!"))
    (println "Yes")(read-line)
    
    (println "Is it a file?")
    (unless (W32-WIO:wfile? unidir) (throw-error "failed!"))
    (println "Yes")(read-line)
    
    (println "Change dir to uni dir")
    (unless (W32-WIO:wchange-dir unidir)  (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Current path:")
    (println (W32-WIO:wreal-path))
    (read-line)
    
    (println "Writing a file")
    (unless (W32-WIO:wwrite-file unifile "Hello unicode") 
        (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Is it a file?")
    (unless (W32-WIO:wfile? unifile) (throw-error "failed!"))
    (println "Yes")(read-line)
    
    (println "Is it a directory?")
    (if (W32-WIO:wdirectory? unifile) (throw-error "failed!"))
    (println "No")(read-line)
    
    (println "Appending to the file")
    (unless (W32-WIO:wappend-file unifile "  Hello again") 
        (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Read the file:")
    (println (W32-WIO:wread-file unifile))
    (read-line)
    
    (println "File info:")
    (println (W32-WIO:wfile-info unifile))
    (read-line)
    
    (println "Directory list:")
    (println (W32-WIO:wwin-dir))
    (read-line)
    
    (println "Rename file")
    (unless (W32-WIO:wrename-file unifile unifile2)  (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Delete file")
    (unless (W32-WIO:wdelete-file unifile2)  (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Backing up from the uni dir")
    (unless (W32-WIO:wchange-dir "..")   (throw-error "failed!"))
    (println "ok")(read-line)
    
    (println "Real path of the uni dir again")
    (println (W32-WIO:wreal-path unifile))
    (read-line)
    
    (println "Removing uni dir")
    (unless (W32-WIO:wremove-dir unidir) (throw-error "failed!"))
    (println "ok")(read-line)
    (println "Done.")
)

Assuming that using the Win32 API (or the equivalent interface functions in MinGW) is the correct approach to accessing Unicode paths on Windows, it would be really nice if this could be added to newLISP by default :D
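
On the regex filtering that wwin-dir leaves out: the comments in the module say to do it by hand on the returned list, so here is a minimal sketch of that. The helper name filter-names is made up for this example and is not part of the module. It works on any list of strings, so a literal list stands in for the result of (W32-WIO:wwin-dir "*"), and the optional third argument is simply passed through to regex (1 should mean case-insensitive).

```newlisp
;; Hypothetical helper (not part of W32-WIO): keep only the names in lst
;; that match the regular expression str-pattern.
;; int-option is handed to regex unchanged (0 = default, 1 = caseless).
(define (filter-names str-pattern lst (int-option 0))
    (if lst
        (filter (fn (name) (regex str-pattern name int-option)) lst)
        ; wwin-dir returns nil when nothing matched, so treat nil as empty
        '()
    )
)

;; A literal list stands in for the result of (W32-WIO:wwin-dir "*").
(setq names '("notes.txt" "module.lsp" "readme.TXT" "data.txt.bak"))

(println (filter-names "\\.txt$" names))    ; case-sensitive match
(println (filter-names "\\.txt$" names 1))  ; caseless match
```

With the real module this would be called as (filter-names "\\.txt$" (W32-WIO:wwin-dir "*")).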

Post by m35 »

Sort of a follow up.

During the last week, I've read a lot about Unicode on Linux and Windows. The information is scattered all over the place, so it is really hard to find a good overview of what's going on. Here is a summary, to the best of my understanding. If anyone sees any errors, please share :)

* Windows sizeof(wchar_t) = 2 (UTF-16)
* Most Unix platforms sizeof(wchar_t) = 4 (UTF-32)
* Note: Java also uses UTF-16 internally.
* Windows cannot use a locale encoding that requires more than two bytes per character (i.e. UTF-7 and UTF-8 cannot be set as the locale); as a result, on Windows the CRT functions wcstombs/mbstowcs can never be used to convert wchar_t to/from UTF-8.
* Windows NTFS stores file names as a sequence of 16-bit characters; therefore UTF-16 is supported (although it does not check if the string is valid UTF-16)
* Linux ext2/ext3 file systems store file names as a sequence of 8-bit characters; therefore UTF-8 is supported (although they do not check if the string is valid UTF-8)
* The Linux internals do not concern themselves with the locale encoding. Unicode (i.e. UTF-8) strings are treated just like non-Unicode strings: as a sequence of bytes.
* The Windows internals DO concern themselves with the current locale. Strings (including file names) are translated to/from the locale.
* To interact with Windows independent of the locale, wide character (UTF-16) functions must be used (those ending with 'W' in the Win32 API, or those beginning with '_w' in the MinGW CRT).
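
To make the size differences in the list above a bit more concrete, here is a rough sketch in plain newLISP (no Win32 calls, so it should not need Windows). It decodes a UTF-8 string by hand into code points and re-encodes them as little-endian UTF-16, including surrogate pairs for code points above U+FFFF. All three function names are made up for this example, and it assumes well-formed UTF-8 input; no validation is attempted.

```newlisp
;; Decode a UTF-8 byte string into a list of code points.
;; Assumes well-formed UTF-8 (sketch only, no error checking).
(define (utf8->codepoints str , bytes cps i b n cp)
    (setq bytes (unpack (dup "b " (length str)) str))
    (setq cps '() i 0)
    (while (< i (length bytes))
        (setq b (bytes i))
        ; the lead byte tells us how many continuation bytes follow
        (cond
            ((< b 0x80) (setq n 0 cp b))
            ((< b 0xE0) (setq n 1 cp (& b 0x1F)))
            ((< b 0xF0) (setq n 2 cp (& b 0x0F)))
            (true       (setq n 3 cp (& b 0x07)))
        )
        ; each continuation byte contributes its low 6 bits
        (dotimes (k n)
            (setq cp (| (<< cp 6) (& (bytes (+ i k 1)) 0x3F)))
        )
        (push cp cps -1)
        (setq i (+ i n 1))
    )
    cps
)

;; Turn code points into 16-bit UTF-16 code units.
;; Code points above U+FFFF become a high/low surrogate pair.
(define (codepoints->utf16-units cps , units)
    (setq units '())
    (dolist (cp cps)
        (if (< cp 0x10000)
            (push cp units -1)
            (let ((v (- cp 0x10000)))
                (push (| 0xD800 (>> v 10)) units -1)
                (push (| 0xDC00 (& v 0x3FF)) units -1)
            )
        )
    )
    units
)

;; Pack the 16-bit units as little-endian bytes (the order Windows uses).
(define (utf16-units->le units)
    (join (map (fn (u) (pack "<u" u)) units))
)

;; The directory name used in the module's test function.
(setq name "\230\162\182\230\181\166\231\148\177\232\168\152")
(setq cps (utf8->codepoints name))

(println "UTF-8 bytes:  " (length name))
(println "code points:  " (length cps))
(println "UTF-16 bytes: "
    (length (utf16-units->le (codepoints->utf16-units cps))))
```

The four-character test name takes 12 bytes as UTF-8 but only 8 bytes as UTF-16, and would take 16 bytes as UTF-32. Unlike the utf16->utf32 function in the module, which widens each 16-bit unit separately, this sketch handles characters outside the BMP.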
