scrape url and replace to make new list

joejoe
Posts: 173
Joined: Thu Jun 25, 2009 5:09 pm
Location: Denver, USA

Post by joejoe »

Hi - I am using Cormullion's beautiful Intro to nL example:

http://en.wikibooks.org/wiki/Introducti ... _web_pages

He shows:

(set 'the-source (get-url "http://www.apple.com"))
(replace {src="(http\S*?jpg)"} the-source (push $1 images-list -1) 0)
(println images-list)

and so I am trying to modify it to pull different info from a different web page:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>\S*?</h2>} the-source (push $1 images-list -1) 0)
(println images-list)
(Essentially, I am after the headline article titles between the <h2>...</h2> tags.)

I am getting

nil

as an answer, and I suspect I have an incorrect regex.

I've tried various versions of the above, putting a \ in front of various characters, without success.

Would one be so kind as to point out my shortcoming? Much appreciated and thank you a lot! :0)


Post by joejoe »

Tried this too:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {(<h2>\S*?</h2>)} the-source (push $1 images-list -1) 0)
(println images-list)

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latiitude 50N longitude 3W

Post by cormullion »

Ah yes, the Joy of Regex...

I think your problem is that every headline on that page contains whitespace. The original code you copied looks for URLs, which can't contain whitespace, so \S* finds them. The text between the h2 tags, however, always contains spaces (unless there is a one-word headline, such as "Boom"). Hence no match, because \S matches non-whitespace characters only.
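To make the whitespace point concrete, here is a minimal sketch. The `sample` string is made-up markup standing in for the real page, and `find-all` is used instead of `replace` only to keep the example short.

```newlisp
; hypothetical sample markup, not the real page
(set 'sample "<h2>Boom</h2><h2>Two words</h2>")

; \S*? accepts only non-whitespace, so the two-word headline is skipped
(println (find-all {<h2>(\S*?)</h2>} sample $1))

; .*? also accepts spaces, and being non-greedy it stops at the first </h2>
(println (find-all {<h2>(.*?)</h2>} sample $1))
```

The first call should list only the one-word headline; the second should list both.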

You've also noticed that the parentheses inside regex patterns correspond to the $1.. tags you use in the action expressions. Without those, $1 doesn't refer to the results of the regex search.
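A small sketch of how the parentheses relate to the system variables, assuming a made-up `sample` string and two illustrative list names (`whole` and `inner`): `$0` holds the entire match, while `$1` holds only what the first pair of parentheses captured.

```newlisp
(set 'sample "<h2>Boom</h2>")   ; hypothetical sample markup
(set 'whole '() 'inner '())     ; illustrative list names

; the expression is evaluated once per match; returning $0
; puts the matched text back, so sample itself is left unchanged
(replace {<h2>(.*?)</h2>} sample
  (begin (push $0 whole -1) (push $1 inner -1) $0) 0)

(println whole)   ; the full match, tags included
(println inner)   ; only the text captured by the parentheses
```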

And you need to watch out for backslashes in regex patterns, too. They have to be doubled if you use double quotes...
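As a rough illustration of the backslash rule (the `sample` string is again made up): text in braces reaches the regex engine untouched, while a double-quoted string has its escapes processed first, so the backslash has to be written twice.

```newlisp
(set 'sample "<h2>Boom</h2>")   ; hypothetical sample markup

; braces: the backslash is passed through as written
(set 'with-braces (find-all {<h2>(\S+)</h2>} sample $1))

; double quotes: "\\S" becomes \S before the regex engine sees it
(set 'with-quotes (find-all "<h2>(\\S+)</h2>" sample $1))

(println with-braces)
(println with-quotes)
```

Both calls should return the same one-element list.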

If you want to experiment with regex in newLISP, take a look at grepper.lsp, an interactive regex tester tuned for newLISP... It's somewhere on http://github.com/cormullion/newlisp-projects.


Post by joejoe »

cormullion, thanks!

With your notes, I managed to get at what I was after with this:

(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>(.+)<\/h2>} the-source (push $1 images-list -1) 0)
(println images-list)

I'm getting strange characters, which I guess are complex text characters that don't render in the shell:

 "11/8/2011 Tokyo Starts Burning Radioactive Waste from Other Areas … Tokyo Governor Tells Residents to “Shut Up” and Stop Complaining About It"
 "Japan, France consider nuclear power costs" "Japan Times: People “fed up with the shroud of secrecy” in Fukushima — Starting to smuggle in journalists â€” Must rely on media for help"

Also, I see a lot of double box characters next to the â characters, as well as ' a lot.

Am I correct that I would need another regex to process these correctly, or at least transform them into normal quotation marks? I am going to be using these titles to post to a web form.

Thanks again for the regex pointers, 'mullion! ;0)


Post by cormullion »

Unicode characters processed by newLISP vary in appearance depending on how you output them - the newLISP manual has more, under Unicode. To over-simplify, console output shows them escaped, but printed output shows them converted. For example, compare:

Geiger-Muller \206\178+\206\179 G-M SI-3 BG TUBE COUNTER 10pcs new

Geiger-Muller β+γ G-M SI-3 BG TUBE COUNTER 10pcs new

Of course, your system should be set to UTF-8 and you should use fonts that have Unicode characters. The web page you're looking at is UTF-8 encoded.
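A small sketch of where the \206\178 form comes from, assuming the source file is saved as UTF-8: each Greek letter is stored as two bytes, and the escaped form simply shows those bytes in decimal.

```newlisp
(set 's "β+γ")                            ; assumes this file is saved as UTF-8

(set 'byte-count (length s))              ; length counts bytes, not characters
(set 'byte-values (unpack "bbbbb" s))     ; each byte as an unsigned number

(println s)            ; raw bytes; a UTF-8 terminal renders them as β+γ
(println byte-count)
(println byte-values)  ; 206 178 is β, 43 is +, 206 179 is γ
```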

As for the ampersands, these are HTML encodings for characters outside the restricted ASCII set, and would be converted using any standard HTML-encode/decode function. I think there's a few knocking about...
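For illustration only, here is a minimal sketch of such a decoder. `decode-entities` is a hypothetical helper, not a built-in, and it handles just a handful of entities; a real decoder would need the full entity table plus numeric references.

```newlisp
; hypothetical helper: decodes a few common HTML entities
(define (decode-entities str)
  (dolist (pair '(("&quot;" "\"") ("&#39;" "'") ("&lt;" "<") ("&gt;" ">")))
    (replace (first pair) str (last pair)))
  ; decode &amp; last, so "&amp;lt;" ends up as "&lt;" rather than "<"
  (replace "&amp;" str "&")
  str)

(println (decode-entities "Tom &amp; Jerry&#39;s &quot;show&quot;"))
```

Because newLISP passes arguments by value, the caller's original string is left untouched.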


Post by joejoe »

Thanks again Cormullion!

I appreciate the guidance, always.
