scrape url and replace to make new list

joejoe · Post by **joejoe** » Tue Nov 08, 2011 4:27 am

Hi - I am using Cormullion's beautiful Intro to nL example:

http://en.wikibooks.org/wiki/Introducti ... _web_pages

He shows:

Code:

(set 'the-source (get-url "http://www.apple.com"))
(replace {src="(http\S*?jpg)"} the-source (push $1 images-list -1) 0)
(println images-list)

and so I am trying to modify it to pull different info from a different web page:

Code:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>\S*?</h2>} the-source (push $1 images-list -1) 0)
(println images-list)

(Essentially I am after the headline article titles between the <h2>...</h2> tags.)

I am getting

Code:

nil

as an answer, and I'm suspecting I have an incorrect regex.

I've tried various versions of the above using the \ in front of various characters without success.

Would one be so kind as to point out my shortcoming? Much appreciated and thank you a lot! :0)

joejoe · Post by **joejoe** » Tue Nov 08, 2011 4:31 am

Tried this too:

Code:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {(<h2>\S*?</h2>)} the-source (push $1 images-list -1) 0)
(println images-list)

cormullion · Post by **cormullion** » Tue Nov 08, 2011 10:28 am

Ah yes, the Joy of Regex...

I think your problem is that there are no headlines without whitespace on that page. The original code you copied is looking for URLs, which can't have whitespace. So \S* will find URLs. However, the text between the h2 tags always contains spaces (unless there was a one word headline, such as "Boom"). Hence no match, because \S is looking for non-white-space only.
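
To illustrate the whitespace point, here is a minimal sketch run against a made-up string rather than the real page. The `sample` text and the variable names are assumptions for illustration only, and `find-all` is used in place of the `replace`/`push` idiom from the thread because it collects `$1` for every match directly:

```newlisp
; Hypothetical stand-in for the page source: one single-word headline
; and one headline containing a space.
(set 'sample "<h2>Boom</h2><h2>Two words</h2>")

; \S*? matches non-whitespace only, so it cannot cross the space
; in "Two words" and only the single-word headline is captured.
(set 'no-spaces (find-all {<h2>(\S*?)</h2>} sample $1))

; .*? matches any character except a newline, as few as possible,
; so both headlines are captured.
(set 'any-chars (find-all {<h2>(.*?)</h2>} sample $1))

(println no-spaces)
(println any-chars)
```

With this sample, the first list should contain only "Boom" and the second should contain both headlines.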

You've also noticed that the parentheses inside regex patterns correspond to the $1.. tags you use in the action expressions. Without those, $1 doesn't refer to the results of the regex search.
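
A small sketch of how the parentheses relate to `$0` and `$1`, again on a made-up string (the `sample` text and both list names are hypothetical). The replacement expression is used only for its side effect of collecting matches, as in the code earlier in the thread:

```newlisp
; Hypothetical one-headline sample; the list names are illustrative only.
(set 'sample "<h2>First headline</h2>")
(set 'whole-matches '())
(set 'group-matches '())

; $0 holds the whole match, $1 the text captured by the first ( ... ) group.
(replace {<h2>(.*?)</h2>} sample
  (begin (push $0 whole-matches -1) (push $1 group-matches -1))
  0)

(println whole-matches)
(println group-matches)
```

Here `whole-matches` should hold the headline with its tags, and `group-matches` only the text between them.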

And you need to watch out for backslashes in regex patterns, too. They have to be doubled if you use double quotes...
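
As a rough sketch of the backslash rule, the two patterns below are meant to be equivalent: braces pass the backslash through untouched, while a double-quoted string needs it doubled. The `text` value and variable names are assumptions for illustration:

```newlisp
; Hypothetical input text.
(set 'text "one two")

; Braces: the regex engine receives \S+ exactly as written.
(set 'with-braces (find-all {\S+} text))

; Double quotes: newLISP processes escapes first, so \\ becomes a single \.
(set 'with-quotes (find-all "\\S+" text))

(println with-braces)
(println with-quotes)
```

Both calls should return the same two words.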

If you want to experiment with regex in newLISP, take a look at grepper.lsp, an interactive regex tester tuned for newLISP... It's somewhere on http://github.com/cormullion/newlisp-projects.

joejoe · Post by **joejoe** » Tue Nov 08, 2011 8:53 pm

cormullion, thanks!

With your notes, I managed to get at what I was after with this:

Code:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>(.+)<\/h2>} the-source (push $1 images-list -1) 0)
(println images-list)

I'm getting strange characters, which I guess are complex text characters that don't render on the shell:

Code:

 "11/8/2011 Tokyo Starts Burning Radioactive Waste from Other Areas â¦ Tokyo Governor Tells Residents to âShut Upâ and Stop Complaining About It"
 "Japan, France consider nuclear power costs" "Japan Times: People âfed up with the shroud of secrecyâ in Fukushima â Starting to smuggle in journalists â Must rely on media for help"

Also I see a lot of double box characters next to the â characters, as well as ' a lot.

Am I correct that I would need another regex to process these correctly, or at least transform them into normal quotation marks? I am going to be using these titles to post to a web form.

Thanks again for the regex pointers, 'mullion! ;0)
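
One caveat about the `.+` pattern above: it is greedy, so if a page ever puts two headlines on the same line, a single match could swallow both. This is a sketch on a made-up string (the `sample` text and variable names are hypothetical) comparing it with the non-greedy `.+?`:

```newlisp
; Hypothetical sample with two headlines on one line.
(set 'sample "<h2>First</h2><h2>Second</h2>")

; Greedy: .+ runs to the last </h2> on the line, giving one long capture.
(set 'greedy (find-all {<h2>(.+)</h2>} sample $1))

; Non-greedy: .+? stops at the first </h2>, giving one capture per headline.
(set 'lazy (find-all {<h2>(.+?)</h2>} sample $1))

(println greedy)
(println lazy)
```

Whether this matters for the real page depends on how its HTML is laid out, which is not shown in the thread.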

cormullion · Post by **cormullion** » Tue Nov 08, 2011 10:11 pm

Unicode characters processed by newLISP vary in appearance depending on how you output them - the newLISP manual has more, under Unicode. To over-simplify, console output shows them escaped, but printed output shows them converted. For example, compare:

Code:

Geiger-Muller \206\178+\206\179 G-M SI-3 BG TUBE COUNTER 10pcs new

Geiger-Muller β+γ G-M SI-3 BG TUBE COUNTER 10pcs new

Of course, your system should be UTF-8 and you should use fonts that have Unicode characters. The web page you're looking at is UTF-8 encoded.
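
A short sketch of the escaped form from the example above, assuming a UTF-8 terminal (the `label` name is made up for illustration). The decimal escapes are just the UTF-8 bytes of the two Greek letters:

```newlisp
; \206\178 and \206\179 are the UTF-8 bytes of β and γ in decimal escapes.
(set 'label "\206\178+\206\179")

; length counts bytes, not characters: 2 + 1 + 2.
(println (length label))

; println writes the raw bytes; a UTF-8 terminal should display them as β+γ.
(println label)
```

Evaluating `label` on its own at the interactive prompt shows the escaped form instead, which is the console behaviour described above.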

As for the ampersands, these are HTML encodings for characters outside the restricted ASCII set, and would be converted using any standard HTML-encode/decode function. I think there's a few knocking about...
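
For the ampersand encodings, a minimal hand-rolled sketch might look like the following. `decode-entities` is a hypothetical helper, not a newLISP built-in. It covers only decimal numeric entities and four named ones, and code points above 255 assume a UTF-8 enabled build of newLISP:

```newlisp
; Hypothetical helper: decodes a small subset of HTML entities.
(define (decode-entities str)
  ; Decimal numeric entities such as &#39; are converted with char.
  ; Base 10 is forced so a leading zero (as in &#039;) is not read as octal.
  (replace {&#(\d+);} str (char (int $1 0 10)) 0)
  (replace "&quot;" str "\"")
  (replace "&lt;" str "<")
  (replace "&gt;" str ">")
  ; &amp; goes last so that a decoded ampersand is not decoded a second time.
  (replace "&amp;" str "&")
  str)

(println (decode-entities "Tom&#39;s &amp; Ann&#039;s &quot;list&quot;"))
```

A fuller decoder would also need hexadecimal entities and the rest of the named set, so an existing library function is probably the better choice for real use.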

joejoe · Post by **joejoe** » Tue Nov 08, 2011 11:37 pm

Thanks again Cormullion!

I appreciate the guidance, always.

newlispfanclub.alh.net
