UTF8 and regular expressions in newLISP

Lutz · Post by **Lutz** » Sun Aug 09, 2015 4:47 pm

For patterns that don’t address UTF8 characters as single characters, working with or without the PCRE_UTF8 flag (2048 or “u”) makes no difference, but the flag is needed when multibyte sequences should be treated as UTF8 characters:

> (set 'utf8str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

> (regex "(.)(.)(.)" utf8str)
("我" 0 3 "?" 0 1 "?" 1 1 "?" 2 1)

Without the flag, the matched string consists of 3 single bytes, each octet matched by one dot; UTF8-enabled newLISP and a UTF8-enabled terminal then combine them into one displayable UTF8 character. The individual octets are shown as "?" because, taken alone, they are neither valid UTF8 nor ASCII characters.

Now the same using the PCRE_UTF8 option flag:

> (regex "(.)(.)(.)" utf8str 2048)
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)

> (regex "(.)(.)(.)" utf8str "u")
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)
>

… the dot "." now represents a UTF8 character.

The above examples work on versions 10.6.2, 10.6.3 and 10.6.4 of newLISP.
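The byte-versus-character distinction can also be checked directly. What follows is a minimal sketch, assuming a UTF8-enabled newLISP build (`utf8len` is only available there) and the same `utf8str` as in the sessions above; it has not been verified against the specific versions listed.

```newlisp
; Assumes a UTF8-enabled newLISP build and a UTF8 terminal.
(set 'utf8str "我能吞下玻璃而不伤身体。")

(length utf8str)    ; → 36, length counts bytes (octets)
(utf8len utf8str)   ; → 12, utf8len counts UTF8 characters

; Each of these 12 characters is encoded in 3 bytes, so a
; byte-wise "(.)(.)(.)" consumes exactly the first character,
; while the same pattern with "u" consumes the first three.
(regex "(.)(.)(.)" utf8str)       ; dot = one byte
(regex "(.)(.)(.)" utf8str "u")   ; dot = one UTF8 character
```

Comparing `length` with `utf8len` is a quick way to tell whether a string contains multibyte sequences at all; when both agree, the flag makes no difference for the dot.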

The error message "invalid UTF8 string" is generated only by the functions first, rest, last and pop and by implicit indexing of strings. It occurs when a string that is not meant to be a UTF8 string would, seen as a UTF8 string, occupy more bytes than are allocated or terminated by 0.
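When a raw byte string has to pass through one of these functions anyway, the error can be trapped instead of aborting the program. This is a hedged sketch only: `safe-last` is a hypothetical helper, not part of newLISP, and it assumes that this error can be intercepted by `catch` like ordinary newLISP errors. The fallback relies on `slice`, which works on byte boundaries.

```newlisp
; safe-last is a hypothetical helper name, not a newLISP built-in.
; (catch expr 'sym) returns true and stores the result in sym,
; or returns nil and stores the error text in sym on an error.
(define (safe-last str)
  (local (result)
    (if (catch (last str) 'result)
        result                              ; readable as UTF8: last character
        (slice str (- (length str) 1)))))   ; otherwise: raw last byte
```

The return value is a one-character string in the first case and a one-byte string in the second, so the caller still has to know which kind of data it is handling.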

TedWalther · Post by **TedWalther** » Sun Aug 09, 2015 5:24 pm

Thanks for the explanation Lutz. I was trying to duplicate the error with version 10.6.2, but it didn't show up:

> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "(\r|\n)$" foo 0)
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11101111")) "e"))
"abcd�e"
> (regex "(\r|\n)$" foo 0)
nil
> (regex "(\r|\n)$" foo)
nil

Then I thought: "implicit indexing, ah"

> (regex "\r|\n" (foo -1))
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "\r|\n" (foo -1))
nil

So still not sure exactly how my code triggered the exception. I'd like to duplicate the bug so I can fix it in my code.

Update

You mentioned a string containing a 0 byte, so I tested that. It still doesn't trigger the exception.

> (set 'foo (string "abcd" (pack "b" 0) "e"))
"abcde"
> (regex "\r|\n" (foo -1))
nil
> (regex "\r|\n" foo)
nil
> (set 'foo (string "abc\rd" (pack "b" 0) "e\n"))
"abc\rde\n"
> (regex "\r|\n" foo)
("\r" 3 1)
> (regex "\n" foo)
("\n" 6 1)

OpenBSD

OpenBSD recently added support for Lua patterns to their web server; I read the manpage. The patterns are almost like regular expressions, but smaller, simple, very fast to implement, and include some nice things like paren-matching. 700 lines of code.

http://www.openbsd.org/cgi-bin/man.cgi/ ... y=patterns

http://comments.gmane.org/gmane.os.openbsd.tech/42569

there is some great interest in getting support for rewrites and
better matching in httpd. I refused to implement this using regex, as
regex is extremely complicated code, there have been lots of bugs,
they allow, if not specified carefully, dangerous recursions and
ReDOS, and I would add another potential attack surface in httpd.

Thanks to tedu <at> 's hint at BSDCan, I stumbled across Lua's pattern
matching implementation. It is relatively small (less than 700loc),
powerful, portable C code, MIT-licensed, and doesn't suffer from some
of regex' problems (eg., it doesn't allow recursive captures). I
ported it on my flight back from Ottawa, KNF'ed it, and turned it into
a C API without the Lua bindings. No, this diff does not bring the
Lua language to httpd!

cormullion · Post by **cormullion** » Sun Aug 09, 2015 5:58 pm

try PEG!

https://github.com/dahu/nlpeg

TedWalther · Post by **TedWalther** » Sun Aug 09, 2015 6:54 pm

cormullion wrote: try PEG!

https://github.com/dahu/nlpeg

PEG looks neat. Have you tried this implementation? Does it work?

Update

After reading the Wiki page: I hadn't realized that PEG is a formalism for recursive descent parsers. Awesome!

ralph.ronnquist · Post by **ralph.ronnquist** » Sun Aug 09, 2015 10:23 pm

E.g.,

> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex
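A note on why this particular byte fits Lutz's description: 0xc0 + 0x30 is 0xF0, and in UTF8 a byte starting with the bits 11110 announces a 4-byte sequence, yet the string holds only that single byte. The arithmetic can be checked as follows (a small sketch; only the bit pattern is shown, not the internal decoding):

```newlisp
(+ 0xc0 0x30)          ; → 240, i.e. 0xF0
(bits (+ 0xc0 0x30))   ; → "11110000", a 4-byte UTF8 lead byte
```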

TedWalther · Post by **TedWalther** » Sun Aug 09, 2015 10:47 pm

I guess the confusion is that the difference between character streams and byte streams isn't always obvious. Both are useful.

TedWalther · Post by **TedWalther** » Sun Aug 09, 2015 11:04 pm

ralph.ronnquist wrote: E.g.,

> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex

This is why it is hard to chase down: the interactions between UTF8 mode and octet (raw byte) mode.

> (b -1)

ERR: invalid UTF8 string
> b
"�"
> (char b)
2827
> (bits b)

ERR: value expected in function bits : b
> (bits (char b))
"101100001011"
>

Perhaps the get-char or unpack functions would do the trick. They usually aren't the first things I think of. I find my brain having to work to shift between character-oriented and byte-oriented streams, each with its own API. It wants to use the same API for both, with perhaps the occasional boolean flag or two to disambiguate.
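For the byte-oriented view, a small sketch using `unpack`: `bytes-of` is a hypothetical helper name, not a newLISP built-in, and the approach assumes that `length` counts bytes, as it does for strings.

```newlisp
; bytes-of is a hypothetical helper, not a newLISP built-in.
; It returns the octets of str as a list of integers 0..255,
; without interpreting them as UTF8.
(define (bytes-of str)
  (unpack (dup "b" (length str)) str))

(set 'b (pack "b" (+ 0xc0 0x30)))
(bytes-of b)       ; → (240)
(bytes-of "abc")   ; → (97 98 99)
```

Because nothing here decodes the string as UTF8, it sidesteps the functions Lutz lists as raising "invalid UTF8 string".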

In this case, the char function is silently converting a byte value to... to what? As a 16 bit quantity, it is valid UTF8.

abaddon1234 · Post by **abaddon1234** » Sat Apr 30, 2016 9:23 am

Thanks for the info

newlispfanclub.alh.net
