russian text and find-all problem

hivecluster
Posts: 1
Joined: Wed Apr 22, 2009 2:16 pm

Post by hivecluster »

hello, everybody
look here:

Code: Select all

newLISP v.10.0.2 on Linux IPv4 UTF-8, execute 'newlisp -h' for more info.
> (set-locale "ru_RU.utf8")
("ru_RU.utf8" ",")
> (find-all "eye" "EYE eYe eye" $0 1)
("EYE" "eYe" "eye")
> (find-all "глаз" "ГЛАЗ гЛаз глаз" $0 1)
("глаз")
Can you explain why find-all doesn't match all the words?

P.S. Sorry for my pidgin English :(

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Post by Lutz »

Regular expressions are handled by the PCRE library (http://pcre.org), which newLISP uses. When PCRE is compiled, its tables for upper/lower-casing, case flipping and character classification (letters, numbers, hex digits etc.) are built for a specific locale.

The standard newLISP distribution contains a file, pcre-chartables.c, which is generated automatically for a specific locale. In newLISP this is the so-called 'C' locale. It handles casing etc. only for the first page of one-byte characters in the UTF-8 character set, but guarantees internationally consistent behavior of newLISP, at least for the English language. When newLISP starts up, it puts itself into this locale.

As a workaround you could do something like this:

Code: Select all

(find-all (lower-case search-str) (lower-case text-str))
Of course, this depends on newLISP's upper-case and lower-case routines working correctly in your locale's UTF-8 implementation and for the character set used; the C library's towupper() and towlower() functions need working tables to pick the right character and case.
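
A minimal sketch of that workaround, applied to the strings from the first post. The name find-all-nocase is hypothetical (it is not a newLISP built-in), and whether the Russian call finds all three words depends on lower-case handling Cyrillic in your build and locale, which is only an assumption here:

```newlisp
; Sketch only: find-all-nocase is a made-up helper name, not part of newLISP.
; Assumes a UTF-8 build of newLISP and an installed ru_RU.utf8 locale.
(set-locale "ru_RU.utf8")

(define (find-all-nocase pattern text)
  ; lower-case both sides, then search case-sensitively
  (find-all (lower-case pattern) (lower-case text)))

(println (find-all-nocase "eye" "EYE eYe eye"))
(println (find-all-nocase "глаз" "ГЛАЗ гЛаз глаз"))
```

Note that the matches are taken from the lower-cased copy of the text, so the original capitalization of each hit is lost.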

Last but not least, when working with UTF-8 text all regex flags should be or'ed with 2048 (see the docs for regex). It makes the following difference:

Code: Select all

; wrong, because (char 937) should count as only one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 0) ; => 1

; correct, because the first two bytes, (char 937), form one UTF-8 character
(find (append "." (char 937) ".") (append (char 937) (char 937) (char 937)) 2048) ; => 0
The character used here is the Greek Omega. I have coded it as (char 937) so you can copy/paste the code without problems. This is what I really did:

Code: Select all

(find ".Ω." "ΩΩΩ" 2048) ; => 0, the correct offset
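
The offset of 1 in the "wrong" case comes from Omega being two bytes in UTF-8: without 2048 the dot matches a single byte, not a whole character. A small sketch to see this, assuming a UTF-8 build of newLISP (utf8len is only available there). It also shows how 2048 can be or'ed with the case-insensitive flag 1 used in the first post:

```newlisp
; Assumes a UTF-8 enabled newLISP; utf8len does not exist in non-UTF-8 builds.
(set 'omega (char 937))

(println (length omega))        ; byte count      => 2
(println (utf8len omega))       ; character count => 1
(println (unpack "b b" omega))  ; the two bytes   => (206 169)

; combine the case-insensitive flag (1) with the UTF-8 flag (2048)
(set 'flags (| 1 2048))
(println flags)                 ; => 2049
(println (find-all "eye" "EYE eYe eye" $0 flags))
```

Whether flag 1 then also folds case for Cyrillic letters depends on how the PCRE inside newLISP was built; this thread does not settle that.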

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

Sorry for necroposting. (upper-case) and (lower-case) won't work with Russian letters.

Maybe international support is already implemented somewhere deep in newLISP? (I'm working in "newLISP v.10.1.2 on Win32 IPv4" now.)

If it isn't, but you plan to add it, here is the Russian alphabet (33 letters):

Lower:
"\224\225\226\227\228\229\184\230\231\232\233\234\235\236\237\238\239\240\241\242\243\244\245\246\247\248\249\250\251\252\253\254\255"

Upper:
"\192\193\194\195\196\197\168\198\199\200\201\202\203\204\205\206\207\208\209\210\211\212\213\214\215\216\217\218\219\220\221\222\223"

By the way, lower-case and upper-case are only part of the problem. When I have an error in a function with a Russian name, I have to decipher messages like "ERR: missing parenthesis : "...\238\239\224 14)\n ".

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latitude 50N longitude 3W
Contact:

Post by cormullion »

It appears to work ok on my MacOS X UTF-8 newLISP 10.1.5:

Code: Select all

(set-locale "ru_RU.UTF-8")

(println "ангел снега")
ангел снега
(println (upper-case "ангел снега"))
АНГЕЛ СНЕГА

(println (upper-case "Bглядывaяcь в глyбинy вpeмeни"))
BГЛЯДЫВAЯCЬ В ГЛYБИНY ВPEМEНИ

(println (lower-case "Bглядывaяcь в глyбинy вpeмeни"))
bглядывaяcь в глyбинy вpeмeни

(println (title-case "Bглядывaяcь в глyбинy вpeмeни"))
Bглядывaяcь в глyбинy вpeмeни
although it's hard to tell - the letters look similar regardless of case... (sorry, don't speak Russian... :)
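
One way to check this without reading Russian is to compare code points before and after the conversion. This is only a sketch, assuming a UTF-8 build of newLISP (where char and explode work on whole characters) and a working ru_RU.UTF-8 locale:

```newlisp
; Assumes a UTF-8 build of newLISP and a working ru_RU.UTF-8 locale.
(set-locale "ru_RU.UTF-8")

(set 'word "глаз")

; code points before and after upper-casing
(println (map char (explode word)))
(println (map char (explode (upper-case word))))

; true when upper-casing actually changed the string
(println (!= word (upper-case word)))
```

The lower-case letters г л а з are code points 1075 1083 1072 1079, and their upper-case forms are 32 lower (1043 1051 1040 1047). If the second line prints the same numbers as the first, upper-case did nothing. The same check would also show which letters in the longer sample are Latin look-alikes: the "y" there upper-cases to a Latin "Y".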

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

Yep, just tried it: it works with UTF. (I have just installed the UTF version to check.)

Funny, but I got another problem: now the GUI is unable to load my config file from my home directory, because it has a Russian name. I think I'll be able to resolve it by renaming my Windows user.
