scrape url and replace to make new list
Posted: Tue Nov 08, 2011 4:27 am
by joejoe
Hi - I am using Cormullion's beautiful Intro to nL example:
http://en.wikibooks.org/wiki/Introducti ... _web_pages
He shows:
Code: Select all
(set 'the-source (get-url "http://www.apple.com"))
(replace {src="(http\S*?jpg)"} the-source (push $1 images-list -1) 0)
(println images-list)
and so I am trying to modify it to pull different info from a different web page:
Code: Select all
(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>\S*?</h2>} the-source (push $1 images-list -1) 0)
(println images-list)
(Essentially, I am after the headline article titles between the <h2>...</h2> tags.)
I am getting
as an answer, and I suspect my regex is incorrect.
I've tried various versions of the above, putting a \ in front of various characters, without success.
Would someone be so kind as to point out my shortcoming? Much appreciated and thank you a lot! :0)
Re: scrape url and replace to make new list
Posted: Tue Nov 08, 2011 4:31 am
by joejoe
Tried this too:
Code: Select all
(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {(<h2>\S*?</h2>)} the-source (push $1 images-list -1) 0)
(println images-list)
Re: scrape url and replace to make new list
Posted: Tue Nov 08, 2011 10:28 am
by cormullion
Ah yes, the Joy of Regex...
I think your problem is that there are no headlines without whitespace on that page. The original code you copied is looking for URLs, which can't contain whitespace, so \S* will find them. However, the text between the h2 tags always contains spaces (unless there is a one-word headline, such as "Boom"). Hence no match, because \S matches non-whitespace characters only.
You've also noticed that the parentheses inside regex patterns correspond to the $1.. tags you use in the action expressions. Without those, $1 doesn't refer to the results of the regex search.
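The two points above can be tried out on a small string instead of a live page. This is only a sketch: the `sample` text and the list names `no-space` and `with-space` are made up for illustration and do not come from the site being scraped.

```newlisp
; Hypothetical sample standing in for the fetched page source.
(set 'sample "<h2>Boom</h2> <h2>Two words</h2>")

; \S*? only matches runs without whitespace,
; so only the one-word headline is found.
(set 'no-space '())
(replace {<h2>(\S*?)</h2>} (copy sample) (push $1 no-space -1) 0)
(println no-space)    ; ("Boom")

; .*? also matches spaces. The parentheses are what make
; the text between the tags available as $1.
(set 'with-space '())
(replace {<h2>(.*?)</h2>} (copy sample) (push $1 with-space -1) 0)
(println with-space)  ; ("Boom" "Two words")
```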
And you need to watch out for backslashes in regex patterns, too. They have to be doubled if you use double quotes...
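As a small illustration of the backslash point, the same pattern can be written with braces or with double quotes. The `text` value here is an invented example.

```newlisp
; Braces take the pattern literally; double quotes need each backslash doubled.
(println (= {\S+} "\\S+"))      ; true, both are the three characters \S+

(set 'text "one two")
(println (find {\s} text 0))    ; 3, offset of the first whitespace character
(println (find "\\s" text 0))   ; 3, same pattern in double quotes
```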
If you want to experiment with regex in newLISP, take a look at grepper.lsp, an interactive regex tester tuned for newLISP... It's somewhere on
http://github.com/cormullion/newlisp-projects.
Re: scrape url and replace to make new list
Posted: Tue Nov 08, 2011 8:53 pm
by joejoe
cormullion, thanks!
With your notes, I managed to get at what I was after with this:
Code: Select all
(set 'the-source (get-url "http://nukene.ws/headlines"))
(replace {<h2>(.+)<\/h2>} the-source (push $1 images-list -1) 0)
(println images-list)
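One thing to watch with this pattern: .+ is greedy, so if two headlines ever share a line it would capture everything from the first <h2> to the last </h2> on that line. A possible variation, sketched on an invented `sample` string rather than the real page, uses the non-greedy .+? instead. The `find-all` form at the end is an alternative way to collect the captures, not what the posts above use.

```newlisp
; Hypothetical sample with two headlines on one line.
(set 'sample "<h2>First headline</h2><h2>Second headline</h2>")

; Greedy: one match that runs across both headlines.
(set 'greedy '())
(replace {<h2>(.+)</h2>} (copy sample) (push $1 greedy -1) 0)
(println greedy)  ; ("First headline</h2><h2>Second headline")

; Non-greedy: each headline is matched separately.
(set 'lazy '())
(replace {<h2>(.+?)</h2>} (copy sample) (push $1 lazy -1) 0)
(println lazy)    ; ("First headline" "Second headline")

; find-all returns the list of $1 captures directly.
(println (find-all {<h2>(.+?)</h2>} sample $1))
```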
I'm getting strange characters, which I guess are complex text characters that don't render in the shell:
Code: Select all
"11/8/2011 Tokyo Starts Burning Radioactive Waste from Other Areas ⦠Tokyo Go vernor Tells Residents to âShut Upâ and Stop Complaining About It"
"Japan, France consider nuclear power costs" "Japan Times: People âfed up wit h the shroud of secrecyâ in Fukushima â Starting to smuggle in journalists â Must rely on media for help"
Also I see a lot of double box characters next to the â characters, as well as ' a lot.
Am I correct that I would need another regex to process these correctly, or at least transform them into normal quotation marks? I am going to be using these titles to post to a web form.
Thanks again for the regex pointers, 'mullion! ;0)
Re: scrape url and replace to make new list
Posted: Tue Nov 08, 2011 10:11 pm
by cormullion
Unicode characters processed by newLISP vary in appearance depending on how you output them - the newLISP manual has more, under Unicode. To over-simplify, console output shows them escaped, but printed output shows them converted. For example, compare:
Code: Select all
Geiger-Muller \206\178+\206\179 G-M SI-3 BG TUBE COUNTER 10pcs new
Geiger-Muller β+γ G-M SI-3 BG TUBE COUNTER 10pcs new
Of course, your system should be set to UTF-8 and you should use fonts that have Unicode characters. The web page you're looking at is UTF-8 encoded.
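The escaped form in the first line of the example is just the UTF-8 bytes written as three-digit decimal codes. A minimal sketch, assuming a UTF-8 terminal (the variable name `s` is arbitrary):

```newlisp
; "\206\178" and "\206\179" are the two-byte UTF-8 encodings of β and γ.
(set 's "\206\178+\206\179")

(println (length s))  ; 5, length counts bytes: 2 + 1 + 2
(println s)           ; writes the raw bytes; a UTF-8 terminal renders β+γ
```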
As for the ampersands, these are HTML encodings for characters outside the restricted ASCII set, and would be converted using any standard HTML-encode/decode function. I think there's a few knocking about...
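For a handful of common entities, a small decoder can be written with `replace` alone. This is a rough sketch and not a complete HTML decoder: `decode-entities` is a hypothetical helper name, not a standard newLISP function, and it only handles a few named entities plus decimal numeric ones. For codes above 127, `char` only produces the proper character in a UTF-8 enabled build of newLISP.

```newlisp
; Hypothetical helper, not part of newLISP.
(define (decode-entities str)
  (replace "&quot;" str "\"")
  (replace "&lt;" str "<")
  (replace "&gt;" str ">")
  ; Decimal numeric entities such as &#39;. Base 10 is forced so that
  ; a leading zero (as in &#039;) is not parsed as octal.
  (replace {&#(\d+);} str (char (int $1 0 10)) 0)
  ; &amp; goes last so that "&amp;lt;" ends up as "&lt;" and not "<".
  (replace "&amp;" str "&")
  str)

(println (decode-entities "Tom &amp; Jerry say &quot;hi&quot; &#39;ok&#39;"))
; Tom & Jerry say "hi" 'ok'
```

Because newLISP passes strings by value, the caller's original string is left unchanged.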
Re: scrape url and replace to make new list
Posted: Tue Nov 08, 2011 11:37 pm
by joejoe
Thanks again Cormullion!
I appreciate the guidance, always.