regex doesn't match utf-8 character using \w pattern


Postby vetelko » Fri Jan 27, 2017 8:37 pm

Hello newLISP-ers :)

The following code returns nil for a UTF-8 character:
(regex {\w} "č")
and so does the following, with the "u" option:
(regex {\w} "č" "u")
while this works:
(regex {\w} "m")

Is this not implemented, or am I doing something wrong?
VT
newLISP v.10.7.4 64-bit on OpenBSD IPv4/6 UTF-8 libffi

Re: regex doesn't match utf-8 character using \w pattern

Postby ralph.ronnquist » Fri Jan 27, 2017 9:26 pm

I believe this behaviour is explained in http://www.newlisp.org/downloads/pcrepattern.html, which contains these two paragraphs:
------
A "word" character is an underscore or any character less than 256 that is a letter or digit. The definition of letters and digits is controlled by PCRE's low-valued character tables, and may vary if locale-specific matching is taking place (see "Locale support" in the pcreapi page). For example, in the "fr_FR" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available. The use of locales with Unicode is discouraged.
------
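
Going by the two paragraphs above, one possible workaround is to widen the character class by hand instead of relying on \w alone. The following is only a sketch, not a confirmed newLISP recipe: it assumes a UTF-8 enabled newLISP build whose bundled PCRE accepts \x{...} escapes in UTF-8 mode, and the symbol word-char is a made-up name for illustration.

```newlisp
; Sketch only. Assumes a UTF-8 enabled newLISP and the "u" (UTF-8) option
; used earlier in this thread. word-char is a hypothetical name.
; The class accepts ASCII word characters plus every code point from U+0080 up.
(set 'word-char {[\w\x{80}-\x{10FFFF}]})

(println (regex {\w} "m"))            ; the ASCII case that already works
(println (regex {\w} "č" "u"))        ; the case reported as nil above
(println (regex word-char "č" "u"))   ; widened class applied to the same string
```

Note the trade-off: this class treats every non-ASCII code point as a word character, including non-ASCII punctuation and symbols, so it is looser than a true "letter" test. If only certain scripts matter, a narrower range may be a better fit.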

Re: regex doesn't match utf-8 character using \w pattern

Postby vetelko » Sat Jan 28, 2017 11:02 am

So there is no way to match this character as a word character using a pattern?

