UTF8 and regular expressions in newLISP


Postby Lutz » Sun Aug 09, 2015 4:47 pm

For patterns that do not address UTF8 characters as single characters, working with or without the PCRE_UTF8 flag (2048 or “u”) makes no difference. The flag is needed when multibyte sequences should be treated as UTF8 characters:

> (set 'utf8str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

> (regex "(.)(.)(.)" utf8str)
("我" 0 3 "?" 0 1 "?" 1 1 "?" 2 1)


Without the flag, the matched string consists of 3 single octets, each matched by a dot. UTF8-enabled newLISP and a UTF8-enabled terminal then combine them into one displayable UTF8 character. The individual octets are shown as "?" because, taken on their own, they are neither valid UTF8 nor ASCII characters.

Now the same using the PCRE_UTF8 option flag:

> (regex "(.)(.)(.)" utf8str 2048)
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)

> (regex "(.)(.)(.)" utf8str "u")
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)
>


… the dot "." now represents a UTF8 character.

The above examples work on versions 10.6.2, 10.6.3 and 10.6.4 of newLISP.
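
The byte versus character distinction above can be made visible with a small sketch. It assumes a UTF8-enabled newLISP build and a source file saved as UTF8. The calls to length and utf8len are not part of the original post and are added here only for illustration:

```newlisp
; Assumes a UTF8-enabled newLISP and a UTF8 terminal.
(set 'utf8str "我能吞下玻璃而不伤身体。")

; length counts bytes, utf8len counts UTF8 characters.
(println "bytes: " (length utf8str))
(println "chars: " (utf8len utf8str))

; Without the flag each dot consumes one octet;
; with "u" (PCRE_UTF8, 2048) each dot consumes one character.
(println (regex "(.)(.)(.)" utf8str))
(println (regex "(.)(.)(.)" utf8str "u"))
```

Every character in this particular string is encoded in 3 bytes, so the byte count is three times the character count.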

The error message "invalid UTF8 string" is generated only by the functions first, rest, last and pop and by implicit indexing of strings. It occurs when the string, read as a UTF8 string, would occupy more bytes than are allocated or terminated by 0, which happens for a string not meant to be a UTF8 string.

Re: UTF8 and regular expressions in newLISP

Postby TedWalther » Sun Aug 09, 2015 5:24 pm

Thanks for the explanation, Lutz. I tried to reproduce the error with version 10.6.2, but it didn't show up:

> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "(\r|\n)$" foo 0)
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11101111")) "e"))
"abcd�e"
> (regex "(\r|\n)$" foo 0)
nil
> (regex "(\r|\n)$" foo)
nil



Then I thought: "implicit indexing, ah"

> (regex "\r|\n" (foo -1))
nil
> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))
"abcd�"
> (regex "\r|\n" (foo -1))
nil


So I am still not sure exactly how my code triggered the exception. I'd like to reproduce the bug so I can fix it in my code.

Update

You mentioned a string containing a 0 byte, so I tested that. It still does not trigger the exception.

> (set 'foo (string "abcd" (pack "b" 0) "e"))
"abcde"
> (regex "\r|\n" (foo -1))
nil
> (regex "\r|\n" foo)
nil
> (set 'foo (string "abc\rd" (pack "b" 0) "e\n"))
"abc\rde\n"
> (regex "\r|\n" foo)
("\r" 3 1)
> (regex "\n" foo)
("\n" 6 1)


OpenBSD

OpenBSD recently added support for Lua patterns to their web server; I read the manpage. The patterns are almost like regular expressions, but smaller, simpler, and very fast to implement, and they include some nice things like paren-matching. The implementation is less than 700 lines of code.

http://www.openbsd.org/cgi-bin/man.cgi/ ... y=patterns

http://comments.gmane.org/gmane.os.openbsd.tech/42569

there is some great interest in getting support for rewrites and
better matching in httpd. I refused to implement this using regex, as
regex is extremely complicated code, there have been lots of bugs,
they allow, if not specified carefully, dangerous recursions and
ReDOS, and I would add another potential attack surface in httpd.

Thanks to tedu <at> 's hint at BSDCan, I stumbled across Lua's pattern
matching implementation. It is relatively small (less than 700loc),
powerful, portable C code, MIT-licensed, and doesn't suffer from some
of regex' problems (eg., it doesn't allow recursive captures). I
ported it on my flight back from Ottawa, KNF'ed it, and turned it into
a C API without the Lua bindings. No, this diff does not bring the
Lua language to httpd!
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence. Nine months later, they left with a baby named newLISP. The women of the ivory towers wept and wailed. "Abomination!" they cried.

Re: UTF8 and regular expressions in newLISP

Postby cormullion » Sun Aug 09, 2015 5:58 pm

try PEG!

https://github.com/dahu/nlpeg


Re: UTF8 and regular expressions in newLISP

Postby TedWalther » Sun Aug 09, 2015 6:54 pm

cormullion wrote: try PEG!

https://github.com/dahu/nlpeg


PEG looks neat. Have you tried this implementation? Does it work?

Update

Until I read the Wiki page, I didn't realize that PEG is a formalism for recursive descent parsers. Awesome!

Re: UTF8 and regular expressions in newLISP

Postby ralph.ronnquist » Sun Aug 09, 2015 10:23 pm

E.g.,
> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex
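
For code that may receive raw bytes, one option is to trap the error instead of letting it abort the program. The following is only a sketch, assuming a UTF8-enabled newLISP. The helper name safe-last is hypothetical; catch with a result symbol and implicit indexing are the only built-ins involved:

```newlisp
; safe-last is a hypothetical helper, not a newLISP built-in.
; catch with a symbol returns true on success and nil on error,
; storing either the value or the error message in that symbol.
(define (safe-last str)
  (let (result nil)
    (if (catch (str -1) 'result)
        result      ; the last character, when indexing succeeded
        nil)))      ; nil when indexing raised an error

(set 'b (pack "b" (+ 0xc0 0x30)))
(println (safe-last "abc"))
(println (safe-last b))
```

With this wrapper, a string that cannot be indexed as UTF8 yields nil, and the caller can fall back to byte-oriented handling.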

Re: UTF8 and regular expressions in newLISP

Postby TedWalther » Sun Aug 09, 2015 10:47 pm

I guess the confusion is that the difference between character streams and byte streams isn't always obvious. Both are useful.

Re: UTF8 and regular expressions in newLISP

Postby TedWalther » Sun Aug 09, 2015 11:04 pm

ralph.ronnquist wrote: E.g.,
> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex


This is why it is hard to chase down: utf8 mode and octet (raw byte) mode interact.

> (b -1)

ERR: invalid UTF8 string
> b
"�"
> (char b)
2827
> (bits b)

ERR: value expected in function bits : b
> (bits (char b))
"101100001011"
>


Perhaps the get-char or unpack functions would do the trick. They usually aren't the first things I think of. My brain has to work to shift between character-oriented and byte-oriented streams, each with its own API. It wants to use the same API for both, with perhaps the occasional boolean flag or two to disambiguate.

In this case, the char function is silently converting a byte value to... to what? As a 16 bit quantity, it is valid UTF8.
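
As a rough illustration of the unpack route, the sketch below lists the octets of a string without going through any character-level function. The helper name octets is hypothetical; it only combines the built-ins unpack, dup and length, and it assumes length reports the byte length of the string:

```newlisp
; octets is a hypothetical helper, not a newLISP built-in.
; It builds a format of one "b" (unsigned byte) per byte in str
; and unpacks the string into a list of integers.
(define (octets str)
  (unpack (dup "b" (length str)) str))

(set 'b (pack "b" (+ 0xc0 0x30)))
(println (octets b))       ; the single byte 0xF0
(println (octets "我"))    ; the octets of one UTF8 character
```

Because unpack works on bytes, it does not need the string to be valid UTF8.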

Re: UTF8 and regular expressions in newLISP

Postby abaddon1234 » Sat Apr 30, 2016 9:23 am

Thanks for the info

