Page 1 of 1
find-all and empty strings
Posted: Thu Apr 08, 2010 5:53 pm
by Fritz
For some reason find-all operator stops on empty strings:
Code: Select all
> (find-all "(?<=,)[^,]*(?=,)" ",1,10,,100,1000,")
("1" "10" "")
I did expect here:
I think, somewhere inside find-all operator is a check: (if (= found-string prev-string) stop-search)
Is it a feature to make programmers use "parse" operator for empty string searching?
Re: find-all and empty strings
Posted: Thu Apr 08, 2010 7:43 pm
by Lutz
'find-all' stops searching if a regular expression doesn't advance the search-offset and the instance found is of length zero. Without this condition 'find-all' would run into an infinite loop on certain regular expressions.
When using 'find-all', the safest method is, to use the regular expression to define the
contents of the token, not the borders surrounding it, e.g:
Code: Select all
(find-all "[^,]+" ",1,10,,100,1000," ) => ("1" "10" "100" "1000")
When using 'parse' define the
borders in-between the tokens:
Code: Select all
(parse ",1,10,,100,1000," ",") => ("" "1" "10" "" "100" "1000" "")
... then you could use 'clean' to get rid empty strings:
Code: Select all
(clean empty? (parse ",1,10,,100,1000," ",")) => ("1" "10" "100" "1000")
Re: find-all and empty strings
Posted: Thu Apr 08, 2010 8:16 pm
by Fritz
Here is an example from Friddle`s book:
Code: Select all
(find-all {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K})
It will be very hard to make parse operator to process quotes correctly. And I can not just ignore empty fields, becouse their place is important. (I use some complex regexp when parsing data in csv, rtf etc).
However, it is not too important. I can just replace ",," to the ",_empty_," before applying find-all.
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 8:02 am
by Lutz
Here is a solution using 'find' with offset:
Code: Select all
(set 'offset 0)
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
(push $0 lst -1)
(println "->" (offset $0) "<-")
(inc offset (+ 1 (length $0)))
)
lst => ("Ten Thousand" "10000" " 2710 " "" "\"10,000\"" "\"It's \"\"10 Grand\"\", baby\"" "10K")
(lst -2) => "\"It's \"\"10 Grand\"\", baby\""
but this solution also shows the dilemma, we are in. After each successful 'find', we calculate the new offset adding the length to the old offset
+1. The +1 operation is, what avoids the infinite loop, but it also has the potential to leave out an important character in a different situation.
In the current example, the character skipped is the comma field delimiter. But in another case the character might be part of the following token and should not be skipped. In above example if we don't skip, we end up in an infinite loop.
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 8:46 am
by Fritz
I guess additional check for zero-length will help:
Code: Select all
(set 'offset 0 'lst '())
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
(if (and lst (= (last lst) $0 ""))
(inc offset)
(begin
(set 'offset (+ offset (length $0)))
(push $0 lst -1))))
As far as I know, there is a metacharacter \G in the Perl:
"\G Match only where previous m//g left off (works only with /g)"
I have no experience with PCRE standarts, but, may be, there are some preset rules — how should search engine process empty strings?
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 9:28 am
by Lutz
Unfortunately this strategy will swallow legitimate empty strings, if they follow each other. This example will produce the same result:
Code: Select all
{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K}
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 9:56 am
by Fritz
Lutz wrote:Unfortunately this strategy will swallow legitimate empty strings
Yep, I have made an error: I did not took into account difference between starting search position and actual offset of the found item:
Code: Select all
(set 'offset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
(set 'offset (+ offset (length $0)))
(if (and lst (= (list offset $0) (last lst)))
(inc offset)
(push (list offset $0) lst -1)))
(silent
(println "--- Found: " (length lst) " ---")
(map println lst))
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 12:02 pm
by Lutz
Checking the previous offset seems to be sufficient to detect the loop condition, as it implies that a length zero string was found at the same offset.
This is how it will be implemented in C in the next development version:
Code: Select all
(set 'offset 0 'lastOffset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
(set 'offset (+ offset (length $0)))
(if (= offset lastOffset)
(inc offset)
(begin
(set 'lastOffset offset)
(push $0 lst -1))))
... so far it has passed all tests.
Re: find-all and empty strings
Posted: Fri Apr 09, 2010 12:26 pm
by Fritz
As far as I can see, we will not be able to locate two different strings (empty and non-empty) at the same offset.
Code: Select all
(find-all "((?<=FILE:)[0-9]*)|(sample.gif)" "FILE:sample.gif")
But, probably, there will be no such situations in real life.