In UTF-8 mode, regex returns byte length

Post by ssqq »

Code:

> (length (char 0xff))
2
> (utf8len (char 0xff))
1
> (regex (char 0xff) (char 0xff))
("ÿ" 0 2)
> (regex (char 0xff) (char 0xff) 2048)
("ÿ" 0 2)

I think that in UTF-8 mode, regex should return the match offset and length in UTF-8 characters, not in bytes.

Post by Lutz »

Thanks for reporting this. Fixed for v.10.6.1 here:

http://www.newlisp.org/downloads/develo ... nprogress/

Code:

> (regex "Ω" "Ω" 2048)
("Ω" 0 1)
> (regex "Ω" "Ω")
("Ω" 0 2)
> 

Post by ssqq »

I wrote a function that converts a byte index into a UTF-8 character index:

Code:

(define (get-utf8-index utf8-str byte-index)
  (letn ((str-lst (explode utf8-str))
          (str-len (length str-lst))
          (chop-lst (sequence (- str-len 1) 0))
          (char-len-lst (map length str-lst)))
        (find byte-index
              (map (curry apply +)
                   (map (curry chop char-len-lst) chop-lst)))))
(get-utf8-index (dup (char 0xff) 10) 12) --> 6
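
For comparison, here is a shorter sketch of the same byte-to-character conversion. It is not from the posts above and is only an illustration under some assumptions: it needs a UTF-8 enabled build of newLISP, it relies on slice cutting strings on byte boundaries, and it expects byte-index to fall on a character boundary. The name byte-to-char-index is made up for this example.

```newlisp
; Hypothetical helper, not part of newLISP or of the original post.
; Assumes a UTF-8 enabled build and a byte-index that does not
; point into the middle of a multi-byte character.
(define (byte-to-char-index utf8-str byte-index)
  ; slice takes the first byte-index bytes of the string;
  ; utf8len then counts the UTF-8 characters in that prefix.
  (utf8len (slice utf8-str 0 byte-index)))

; Ten 2-byte characters: byte offset 12 is character offset 6.
(byte-to-char-index (dup (char 0xff) 10) 12)
```

Because it counts the characters in the prefix rather than searching a list of running sums, this version also handles a byte-index of 0, which gives character index 0.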
