UTF8 and regular expressions in newLISP

Post by Lutz »

For patterns that don't address UTF8 characters as single characters, working with or without the PCRE_UTF8 flag (2048 or "u") makes no difference, but the flag is needed when multibyte sequences should be treated as UTF8 characters:

> (set 'utf8str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

> (regex "(.)(.)(.)" utf8str)
("我" 0 3 "?" 0 1 "?" 1 1 "?" 2 1)
Without the flag, the matched string consists of 3 single octets, each matched by a dot, which UTF8-enabled newLISP and a UTF8-enabled terminal then combine into one displayable UTF8 character. The individual octets are shown as "?" because, taken alone, they are neither valid UTF8 nor ASCII characters.

Now the same using the PCRE_UTF8 option flag:

> (regex "(.)(.)(.)" utf8str 2048)
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)

> (regex "(.)(.)(.)" utf8str "u")
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)
> 
… the dot "." now represents a UTF8 character.

The above examples work on versions 10.6.2, 10.6.3 and 10.6.4 of newLISP.
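
As a small sketch of the same difference, the lines below compare byte and character counts and then run an uncaptured three-dot pattern both ways. It assumes a UTF8-enabled newLISP build; the "u" option string is taken from the examples above, and the variable name is reused from them.

```newlisp
; Assumes a UTF8-enabled newLISP build; utf8len exists only in such builds.
(set 'utf8str "我能吞下玻璃而不伤身体。")

; length counts bytes, utf8len counts UTF8 characters.
(println (length utf8str))
(println (utf8len utf8str))

; Without the flag, three dots consume three octets (one character here).
(println (first (regex "..." utf8str)))

; With the "u" option, three dots consume three UTF8 characters.
(println (first (regex "..." utf8str "u")))
```

The two regex lines should print the same whole-match strings as the transcripts above, since only the capture groups were dropped.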

The error message "invalid UTF8 string" is generated only by the functions first, rest, last and pop, and by implicit indexing of strings, when the string, read as UTF8, would occupy more bytes than are allocated or would run past the terminating 0, as can happen for a string not meant to be a UTF8 string.
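
A rough illustration of which operations read characters and which read bytes, assuming a UTF8-enabled build. The two-character string is only an example; each of its characters takes three bytes in UTF8.

```newlisp
; Assumes a UTF8-enabled newLISP build.
(set 's "我能")

; first and implicit indexing address whole UTF8 characters,
; so they must decode the string and can raise the error on bad input.
(println (first s))
(println (s 1))

; slice and length work on single bytes and do no UTF8 decoding.
(println (length (slice s 0 3)))  ; the three bytes of the first character
(println (length s))              ; total byte count of the string
```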

Post by TedWalther »

Thanks for the explanation, Lutz. I was trying to duplicate the error with version 10.6.2, but it didn't show up:
> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "(\r|\n)$" foo 0)
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11101111")) "e"))
"abcd�e"
> (regex "(\r|\n)$" foo 0)
nil
> (regex "(\r|\n)$" foo)
nil

Then I thought: "implicit indexing, ah"
> (regex "\r|\n" (foo -1))
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "\r|\n" (foo -1))
nil
So I'm still not sure exactly how my code triggered the exception. I'd like to duplicate the bug so I can fix it in my code.

Update

You mentioned a string containing a 0 byte, so I tested that. It still doesn't trigger the exception.
> (set 'foo (string "abcd" (pack "b" 0) "e"))
"abcde"
> (regex "\r|\n" (foo -1))
nil
> (regex "\r|\n" foo)
nil
> (set 'foo (string "abc\rd" (pack "b" 0) "e\n"))
"abc\rde\n"
> (regex "\r|\n" foo)
("\r" 3 1)
> (regex "\n" foo)
("\n" 6 1)
OpenBSD

OpenBSD recently added support for Lua patterns to its web server; I read the manpage. The patterns are almost like regular expressions, but smaller, simpler, and very fast to implement, and they include some nice things like paren-matching. All in under 700 lines of code.

http://www.openbsd.org/cgi-bin/man.cgi/ ... y=patterns

http://comments.gmane.org/gmane.os.openbsd.tech/42569
there is some great interest in getting support for rewrites and
better matching in httpd. I refused to implement this using regex, as
regex is extremely complicated code, there have been lots of bugs,
they allow, if not specified carefully, dangerous recursions and
ReDOS, and I would add another potential attack surface in httpd.

Thanks to tedu@'s hint at BSDCan, I stumbled across Lua's pattern
matching implementation. It is relatively small (less than 700loc),
powerful, portable C code, MIT-licensed, and doesn't suffer from some
of regex' problems (eg., it doesn't allow recursive captures). I
ported it on my flight back from Ottawa, KNF'ed it, and turned it into
a C API without the Lua bindings. No, this diff does not bring the
Lua language to httpd!
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence. Nine months later, they left with a baby named newLISP. The women of the ivory towers wept and wailed. "Abomination!" they cried.

Post by cormullion »

try PEG!

https://github.com/dahu/nlpeg


Post by TedWalther »

cormullion wrote: try PEG!

https://github.com/dahu/nlpeg

PEG looks neat. Have you tried this implementation? Does it work?

Update

Until I read the Wiki page, I hadn't realized that PEG is a formalism for recursive descent parsers. Awesome!

Post by ralph.ronnquist »

E.g.,

> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex

Post by TedWalther »

I guess the confusion is that the difference between character streams and byte streams isn't always obvious. Both are useful.

Post by TedWalther »

ralph.ronnquist wrote: E.g.,

> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex

This is why it is hard to chase down: the interactions between UTF8 mode and octet (raw byte) mode.
> (b -1)

ERR: invalid UTF8 string
> b
"�"
> (char b)
2827
> (bits b)

ERR: value expected in function bits : b
> (bits (char b))
"101100001011"
>
Perhaps the get-char or unpack functions would do the trick. They usually aren't the first things I think of. I find my brain having to work to shift between character-oriented and byte-oriented streams, each with its own API. It wants to use the same API for both, with perhaps the occasional boolean flag or two to disambiguate.
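
A minimal sketch of the unpack route, reusing the lone byte from the quoted example. This is only one way to look at the raw value; it relies on unpack and slice working on bytes rather than UTF8 characters.

```newlisp
; Rebuild the single byte 0xF0 (0xc0 + 0x30) from the example above.
(set 'b (pack "b" (+ 0xc0 0x30)))

; unpack with "b" reads one unsigned byte and does no UTF8 decoding.
(println (unpack "b" b))                  ; a one-element list holding 240

; The byte value can then be shown in binary without going through char.
(println (bits (first (unpack "b" b))))

; slice addresses bytes, unlike implicit indexing such as (b -1).
(println (length (slice b -1)))
```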

In this case, the char function is silently converting a byte value to... to what? As a 16 bit quantity, it is valid UTF8.

Post by abaddon1234 »

Thanks for the info
