replace bug?

Posted: Wed Aug 20, 2014 12:37 pm
by ralph.ronnquist
I ran into what seems like a bug in the regex pattern handling, illustrated in the following example:

> (map char (explode (replace "[‘’]" "‘" "x" 0)))
(120 120 120)
> (map char (explode (replace "‘" "‘" "x" 0)))
(120)
> (map char (explode (replace "[‘’]" "‘" "x" 2048)))
(120)
Thus, when the pattern is within brackets, the replacement of char u8216 (‘) is repeated for each of its source bytes, whereas without the brackets the "proper" single replacement occurs. The replacement is also proper with the flags code 2048 rather than 0.

newLISP v.10.6.0 32-bit on Linux IPv4/6 UTF-8 libffi.

Re: replace bug?

Posted: Wed Aug 20, 2014 9:44 pm
by Lutz
The behavior is correct. When UTF-8 characters are used in a PCRE character class without the UTF-8 option (either 2048 or the letter “u” in version 10.6.1), the class is taken byte-wise, so each byte of a UTF-8 multibyte character that matches the class is replaced separately.

http://www.newlisp.org/downloads/pcrepattern.html#SEC7
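
The byte-wise versus character-wise difference described above can be reproduced outside newLISP. The following is only a rough analogy, using Python's `re` module rather than newLISP or PCRE: matching on encoded `bytes` stands in for PCRE without the UTF-8 option, and matching on `str` stands in for option 2048. The helper names `replace_bytewise` and `replace_charwise` are made up for this illustration.

```python
import re

LEFT_QUOTE = "\u2018"   # ‘  (decimal 8216, UTF-8 bytes e2 80 98)
RIGHT_QUOTE = "\u2019"  # ’  (decimal 8217, UTF-8 bytes e2 80 99)


def replace_bytewise(pattern: str, text: str, repl: str) -> bytes:
    """Match on raw UTF-8 bytes (analogy for PCRE without UTF-8 mode)."""
    return re.sub(pattern.encode("utf-8"),
                  repl.encode("utf-8"),
                  text.encode("utf-8"))


def replace_charwise(pattern: str, text: str, repl: str) -> str:
    """Match on whole code points (analogy for PCRE with UTF-8 mode)."""
    return re.sub(pattern, repl, text)


# Once encoded, the class [‘’] is really a class of the single bytes
# e2, 80, 98 and 99, so every byte of ‘ matches it on its own.
char_class = "[" + LEFT_QUOTE + RIGHT_QUOTE + "]"

print(list(replace_bytewise(char_class, LEFT_QUOTE, "x")))   # [120, 120, 120]

# Without brackets the pattern is the three-byte sequence, matched once.
print(list(replace_bytewise(LEFT_QUOTE, LEFT_QUOTE, "x")))   # [120]

# On code points the class holds two characters, and ‘ matches once.
print([ord(c) for c in replace_charwise(char_class, LEFT_QUOTE, "x")])  # [120]
```

The three printed lists mirror the three results in the original report: brackets with flags 0, no brackets with flags 0, and brackets with flags 2048.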

Re: replace bug?

Posted: Wed Aug 20, 2014 11:45 pm
by ralph.ronnquist
Ah. Yes, of course!

And it's a bit much to expect newlisp mode in emacs to know and show this difference, so the dumbbell at the keyboard can go on thinking about nothing...