Hello All,
Has anyone tried to build an HTML de-tagger that works on a string/buffer?
Norman.
de-tag
Try this:
On smaller buffers, (replace "<.+?>" buff " " 0) may be just enough code, but on bigger buffers the line-by-line solution will be much faster.
You may also want to try the 'greedy' option 512 (PCRE_UNGREEDY, which inverts the greediness of the quantifiers) instead of 0 for better/faster results.
Lutz
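Code:
(define (strip-html buff)
  ; keep the line list local instead of setting a global symbol
  (let (page '())
    ; split into lines and replace every tag on each line with a space
    (dolist (lne (parse buff "\r\n|\n" 0))
      (push (replace "<.+?>" lne " " 0) page))
    ; push prepends, so reverse before joining
    (join (reverse page) "\n")))

(strip-html (get-url "http://newlisp.org"))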
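One caveat with matching line by line, or with option 0 on the whole buffer: a tag that is split across a line break is not removed, because the dot does not match a newline by default. A minimal sketch of the difference, assuming option 4 (PCRE_DOTALL) and a newLISP version that provides copy; the sample string is made up for illustration:

```newlisp
> (setq str "<p\nalign=center>Hi</p> there")
"<p\nalign=center>Hi</p> there"
> (replace {<.+?>} (copy str) " " 0)
"<p\nalign=center>Hi  there"
> (replace {<.+?>} (copy str) " " 4)
" Hi  there"
```

replace is destructive, so copy is used here to keep str unchanged between the two calls.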
I did some benchmarks of (strip-html buff) versus (replace "<.+?>" buff " " 512), and it turns out that the simpler solution without splitting into lines is also the fastest on 'newlisp_manual.html', a 400 KB file.
The biggest speedup is using the greedy option 512, cutting the time to a quarter:
(strip-html buff) -> 12 seconds
(replace "<.+?>" buff " " 0) -> 13 seconds
(replace "<.+?>" buff " " 512) -> 3 seconds !!!
Lutz
PS: Still, don't forget the push/join method, which is sometimes superior (see the base64 example). Also note: 'greedy' gives different output!
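For anyone who wants to repeat the comparison, here is a rough timing sketch. It assumes 'newlisp_manual.html' sits in the current directory and that the newLISP version at hand provides copy; the numbers will differ from machine to machine.

```newlisp
; Hypothetical timing harness, not the exact code used for the numbers above.
(set 'buff (read-file "newlisp_manual.html"))

; replace is destructive, so each run works on a fresh copy of the buffer.
; time returns the elapsed time in milliseconds.
(println "option 0:   " (time (replace "<.+?>" (copy buff) " " 0)) " ms")
(println "option 512: " (time (replace "<.+?>" (copy buff) " " 512)) " ms")
```

Keep in mind that the two calls do not produce the same text, so this compares speed only.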
Code:
> (regex {<.*>} "Be <b>very</b> careful.")
("<b>very</b>" 3 11)
> (regex {<.*?>} "Be <b>very</b> careful.")
("<b>" 3 3)
> (regex {<.*>} "Be <b>very</b> careful." 512)
("<b>" 3 3)
> (regex {<.*?>} "Be <b>very</b> careful." 512)
("<b>very</b>" 3 11)
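The same inversion shows up in replace, which is why option 512 changes the output of the de-tagger: with the lazy pattern "<.+?>", option 512 makes the match greedy and the text between the first and last tag is swallowed as well. A small sketch, assuming a newLISP version that provides copy:

```newlisp
> (setq str "Be <b>very</b> careful.")
"Be <b>very</b> careful."
> (replace {<.+?>} (copy str) " " 0)
"Be  very  careful."
> (replace {<.+?>} (copy str) " " 512)
"Be   careful."
```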
Code:
> (setq str "<foo\nbar>")
"<foo\nbar>"
> (replace {<.*>} str "" 0)
"<foo\nbar>"
> str
"<foo\nbar>"
> (replace {<.*>} str "" 4)
""
> str
""