newlispfanclub.alh.net

Posted: **Thu Apr 08, 2010 5:53 pm**

For some reason find-all operator stops on empty strings:

> (find-all "(?<=,)[^,]*(?=,)" ",1,10,,100,1000,")
("1" "10" "")

I did expect here:

Code: Select all

("1" "10" "" "100" "1000")

I think, somewhere inside find-all operator is a check: (if (= found-string prev-string) stop-search)

Is it a feature to make programmers use "parse" operator for empty string searching?

Posted: **Thu Apr 08, 2010 7:43 pm**

'find-all' stops searching if a regular expression doesn't advance the search-offset and the instance found is of length zero. Without this condition 'find-all' would run into an infinite loop on certain regular expressions.

When using 'find-all', the safest method is, to use the regular expression to define the contents of the token, not the borders surrounding it, e.g:

Code: Select all

(find-all "[^,]+" ",1,10,,100,1000," ) => ("1" "10" "100" "1000")

When using 'parse' define the borders in-between the tokens:

Code: Select all

(parse ",1,10,,100,1000," ",") => ("" "1" "10" "" "100" "1000" "")

... then you could use 'clean' to get rid empty strings:

Code: Select all

(clean empty? (parse ",1,10,,100,1000," ",")) => ("1" "10" "100" "1000")

Posted: **Thu Apr 08, 2010 8:16 pm**

Here is an example from Friddle`s book:

Code: Select all

(find-all {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
 {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K})

It will be very hard to make parse operator to process quotes correctly. And I can not just ignore empty fields, becouse their place is important. (I use some complex regexp when parsing data in csv, rtf etc).

However, it is not too important. I can just replace ",," to the ",_empty_," before applying find-all.

Posted: **Fri Apr 09, 2010 8:02 am**

Here is a solution using 'find' with offset:

Code: Select all

(set 'offset 0)
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
	{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
	(push $0 lst -1)
        (println "->" (offset $0) "<-")
	(inc offset (+ 1 (length $0)))
)


lst => ("Ten Thousand" "10000" " 2710 " "" "\"10,000\"" "\"It's \"\"10 Grand\"\", baby\"" "10K")

(lst -2) => "\"It's \"\"10 Grand\"\", baby\""

but this solution also shows the dilemma, we are in. After each successful 'find', we calculate the new offset adding the length to the old offset +1. The +1 operation is, what avoids the infinite loop, but it also has the potential to leave out an important character in a different situation.

In the current example, the character skipped is the comma field delimiter. But in another case the character might be part of the following token and should not be skipped. In above example if we don't skip, we end up in an infinite loop.

Posted: **Fri Apr 09, 2010 8:46 am**

I guess additional check for zero-length will help:

Code: Select all

(set 'offset 0 'lst '())
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
  (if (and lst (= (last lst) $0 ""))
    (inc offset)
    (begin
      (set 'offset (+ offset (length $0)))
      (push $0 lst -1))))

As far as I know, there is a metacharacter \G in the Perl:

"\G Match only where previous m//g left off (works only with /g)"

I have no experience with PCRE standarts, but, may be, there are some preset rules — how should search engine process empty strings?

Posted: **Fri Apr 09, 2010 9:28 am**

Unfortunately this strategy will swallow legitimate empty strings, if they follow each other. This example will produce the same result:

Code: Select all

{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K}

Posted: **Fri Apr 09, 2010 9:56 am**

Lutz wrote:Unfortunately this strategy will swallow legitimate empty strings

Yep, I have made an error: I did not took into account difference between starting search position and actual offset of the found item:

Code: Select all

(set 'offset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (and lst (= (list offset $0) (last lst)))
    (inc offset)
    (push (list offset $0) lst -1)))

(silent 
  (println "--- Found: " (length lst) " ---")
  (map println lst))

Posted: **Fri Apr 09, 2010 12:02 pm**

Checking the previous offset seems to be sufficient to detect the loop condition, as it implies that a length zero string was found at the same offset.

This is how it will be implemented in C in the next development version:

Code: Select all

(set 'offset 0 'lastOffset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (= offset lastOffset)
    (inc offset)
    (begin
        (set 'lastOffset offset)
        (push $0 lst -1))))

... so far it has passed all tests.

Posted: **Fri Apr 09, 2010 12:26 pm**

As far as I can see, we will not be able to locate two different strings (empty and non-empty) at the same offset.

Code: Select all

(find-all "((?<=FILE:)[0-9]*)|(sample.gif)" "FILE:sample.gif")

But, probably, there will be no such situations in real life.

newlispfanclub.alh.net

find-all and empty strings

find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings

Re: find-all and empty strings