find-all and empty strings

Q&A's, tips, howto's
Locked
Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

find-all and empty strings

Post by Fritz »

For some reason find-all operator stops on empty strings:

Code: Select all

> (find-all "(?<=,)[^,]*(?=,)" ",1,10,,100,1000,")
("1" "10" "")
I did expect here:

Code: Select all

("1" "10" "" "100" "1000")
I think, somewhere inside find-all operator is a check: (if (= found-string prev-string) stop-search)

Is it a feature to make programmers use "parse" operator for empty string searching?

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: find-all and empty strings

Post by Lutz »

'find-all' stops searching if a regular expression doesn't advance the search-offset and the instance found is of length zero. Without this condition 'find-all' would run into an infinite loop on certain regular expressions.

When using 'find-all', the safest method is, to use the regular expression to define the contents of the token, not the borders surrounding it, e.g:

Code: Select all

(find-all "[^,]+" ",1,10,,100,1000," ) => ("1" "10" "100" "1000")
When using 'parse' define the borders in-between the tokens:

Code: Select all

(parse ",1,10,,100,1000," ",") => ("" "1" "10" "" "100" "1000" "")
... then you could use 'clean' to get rid empty strings:

Code: Select all

(clean empty? (parse ",1,10,,100,1000," ",")) => ("1" "10" "100" "1000")

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Re: find-all and empty strings

Post by Fritz »

Here is an example from Friddle`s book:

Code: Select all

(find-all {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
 {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K})
It will be very hard to make parse operator to process quotes correctly. And I can not just ignore empty fields, becouse their place is important. (I use some complex regexp when parsing data in csv, rtf etc).

However, it is not too important. I can just replace ",," to the ",_empty_," before applying find-all.

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: find-all and empty strings

Post by Lutz »

Here is a solution using 'find' with offset:

Code: Select all

(set 'offset 0)
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
	{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
	(push $0 lst -1)
        (println "->" (offset $0) "<-")
	(inc offset (+ 1 (length $0)))
)


lst => ("Ten Thousand" "10000" " 2710 " "" "\"10,000\"" "\"It's \"\"10 Grand\"\", baby\"" "10K")

(lst -2) => "\"It's \"\"10 Grand\"\", baby\""
but this solution also shows the dilemma, we are in. After each successful 'find', we calculate the new offset adding the length to the old offset +1. The +1 operation is, what avoids the infinite loop, but it also has the potential to leave out an important character in a different situation.

In the current example, the character skipped is the comma field delimiter. But in another case the character might be part of the following token and should not be skipped. In above example if we don't skip, we end up in an infinite loop.

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Re: find-all and empty strings

Post by Fritz »

I guess additional check for zero-length will help:

Code: Select all

(set 'offset 0 'lst '())
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
  (if (and lst (= (last lst) $0 ""))
    (inc offset)
    (begin
      (set 'offset (+ offset (length $0)))
      (push $0 lst -1))))
As far as I know, there is a metacharacter \G in the Perl:

"\G Match only where previous m//g left off (works only with /g)"

I have no experience with PCRE standarts, but, may be, there are some preset rules — how should search engine process empty strings?

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: find-all and empty strings

Post by Lutz »

Unfortunately this strategy will swallow legitimate empty strings, if they follow each other. This example will produce the same result:

Code: Select all

{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K}

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Re: find-all and empty strings

Post by Fritz »

Lutz wrote:Unfortunately this strategy will swallow legitimate empty strings
Yep, I have made an error: I did not took into account difference between starting search position and actual offset of the found item:

Code: Select all

(set 'offset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (and lst (= (list offset $0) (last lst)))
    (inc offset)
    (push (list offset $0) lst -1)))

(silent 
  (println "--- Found: " (length lst) " ---")
  (map println lst))

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: find-all and empty strings

Post by Lutz »

Checking the previous offset seems to be sufficient to detect the loop condition, as it implies that a length zero string was found at the same offset.

This is how it will be implemented in C in the next development version:

Code: Select all

(set 'offset 0 'lastOffset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (= offset lastOffset)
    (inc offset)
    (begin
        (set 'lastOffset offset)
        (push $0 lst -1))))
... so far it has passed all tests.

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Re: find-all and empty strings

Post by Fritz »

As far as I can see, we will not be able to locate two different strings (empty and non-empty) at the same offset.

Code: Select all

(find-all "((?<=FILE:)[0-9]*)|(sample.gif)" "FILE:sample.gif")
But, probably, there will be no such situations in real life.

Locked