Why won't this find work?

Post by methodic »

I have a webpage I am trying to parse. I slurp it with get-url and replace all whitespace (newlines, tabs, multiple spaces, etc.) with a single space. Here is the part I am trying to grab: "<br> <br> Last Login:&nbsp; 7/20/2006<br> </td>"
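
A minimal sketch of that whitespace-normalization step, assuming a regex replace is what was used. The variable name page and the sample string are hypothetical stand-ins for whatever get-url actually returned:

```newlisp
; Hypothetical stand-in for the page text that get-url would return
(set 'page "<br>\n<br>\n\tLast Login:&nbsp;   7/20/2006<br>\n</td>")

; Collapse every run of whitespace into a single space.
; The trailing 0 switches replace into regex mode; {} quoting avoids escaping the backslash.
(replace {\s+} page " " 0)

(println page)
```

replace modifies page in place, so the later find calls can work on the same variable.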

Here is my find: (find "<br> Last Login:&nbsp; (.*)<br>" txt 0)

It matches correctly up to 7/20/2006, but it doesn't stop at the <br>; it keeps going past it to the end of the file. I am using multiple finds on this page, and all of the others return fine. Are the forward slashes in the date screwing up find?

thanks.

Post by methodic »

Never mind, I see what I did wrong, sorry. :)

I guess that's a *feature* of find. ;)

Post by cormullion »

I've found regex useful when working with this stuff:

Code:

(set 'txt "<br> <br> Last Login:&nbsp; 7/20/2006<br> </td>")
(regex "<br> Last Login:&nbsp; (.*)<br>" txt 0) 
;-> ("<br> Last Login:&nbsp; 7/20/2006<br>" 5 36 "7/20/2006" 28 9)

This tells you that you want the $1 value rather than $0:

Code:

(find "<br> Last Login:&nbsp; (.*)<br>" txt 0) 
;-> 5
(println $0)
;-><br> Last Login:&nbsp; 7/20/2006<br> 
(println $1)
;->7/20/2006
I'm still struggling with this myself...! Yellow RegExp-Do belt worn with pride... ;-)

Post by Lutz »

With both regex and find, $0 and $1 hold the same values: $0 is the whole text matched by the pattern and $1 is the parenthesized subpattern:

Code:

> (regex "<br> Last Login:&nbsp; (.*)<br>" txt 0) 
("<br> Last Login:&nbsp; 7/20/2006<br>" 5 36 "7/20/2006" 28 9)

> $0
"<br> Last Login:&nbsp; 7/20/2006<br>"
> $1
"7/20/2006"

> (find "<br> Last Login:&nbsp; (.*)<br>" txt 0) 
5

> $0
"<br> Last Login:&nbsp; 7/20/2006<br>"
> $1
"7/20/2006"
> 
In the list returned by regex, think of the members as grouped in threes: string, offset, and length.

The first group then corresponds to $0, the next to $1, and so on.
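
One way to make that grouping explicit is to split the flat list into triples. This is only a sketch: it uses explode with a chunk size of 3, and the names result and groups are made up for illustration:

```newlisp
(set 'txt "<br> <br> Last Login:&nbsp; 7/20/2006<br> </td>")
(set 'result (regex "<br> Last Login:&nbsp; (.*)<br>" txt 0))

; Split the flat result into (string offset length) triples
(set 'groups (explode result 3))

(println (groups 0))  ; triple for $0: whole match, offset 5, length 36
(println (groups 1))  ; triple for $1: "7/20/2006", offset 28, length 9
```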

Lutz

Post by Lutz »

to Methodic:

The .* operator will always grab as much as it can while still satisfying the pattern. You can use option 512 to invert greediness, or put a ? after the star, as in .*?

Code:

> (find "a.*c" "abbbbcbbbcd" 0)
0
> $0
"abbbbcbbbc"

> (find "a.*c" "abbbbcbbbcd" 512)
0
> $0
"abbbbc"
> 

> (find "a.*?c" "abbbbcbbbcd" 0)
0
> $0
"abbbbc"

And here is the same with parenthesized subexpressions to isolate the subpattern:

Code:

> (find "a(.*)c" "abbbbcbbbcd" 0)
0
> $1
"bbbbcbbb"

> (find "a(.*?)c" "abbbbcbbbcd" 0)
0
> $1
"bbbb"
Lutz
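
Applied to the original question, the same idea might look like the sketch below. The sample text is hypothetical: it is the quoted fragment plus one extra <br> later on, since the real page is not shown in the thread:

```newlisp
; Hypothetical sample: the quoted fragment followed by a later <br>
(set 'txt "<br> <br> Last Login:&nbsp; 7/20/2006<br> </td> <td>more<br> </td>")

; Greedy: (.*) runs on to the LAST <br> in the text
(find "<br> Last Login:&nbsp; (.*)<br>" txt 0)
(set 'greedy $1)
(println greedy)     ; 7/20/2006<br> </td> <td>more

; Non-greedy: (.*?) stops at the FIRST <br> after the date
(find "<br> Last Login:&nbsp; (.*?)<br>" txt 0)
(set 'lazy $1)
(println lazy)       ; 7/20/2006

; Original pattern with option 512, which inverts greediness
(find "<br> Last Login:&nbsp; (.*)<br>" txt 512)
(set 'ungreedy $1)
(println ungreedy)   ; 7/20/2006
```

The forward slashes in the date play no part; only the greediness of .* decides where the match ends.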

Post by methodic »

Ah, that's what I was looking for, the (.*?)

thanks so much!
