Replacing HTML codes

Post by **kanen** » Sun Apr 17, 2011 7:25 pm

I have a series of lists, which contain something like:

"Telef&#243;nica to Sell Spanish Assets"

I want to be able to quickly change all these codes to their proper ascii characters.

Code:

(set 'x "Telef&#243;nica to Sell Spanish Assets")
(find {&#(.*);} x 0)  ; reference only
(regex {&#(.*);} x)
; > ("&#243;" 5 6 "243" 7 3)
$1
; > "243"
(char (int $1))
; > "ó"

My question:

How can I do the above in one quick pass, where I get all the {&#(.*);} and replace them with (char (int $1)) ... without doing a loop through everything over and over until the (regex) finds nothing?

I feel like I am missing some (map) or (replace) iterative operator.

Post by **xytroxon** » Sun Apr 17, 2011 8:50 pm

Code:

(setq text "Telef&#243;nica to Sell Spanish Assets")

; regex flag = 0
(replace "&#.*?;" text (char (int (2 -1 $it))) 0)
(println text)
;-> Telefónica to Sell Spanish Assets
(exit)

-- xytroxon
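A small sketch of what the pieces of that call do, in case it helps later readers. The strings `s` and `two` are made up for illustration and are not from the thread. `(2 -1 $it)` is an implicit slice that drops the leading "&#" and the trailing ";", and the "?" in ".*?" makes the match non-greedy, which matters as soon as a string holds more than one code.

```newlisp
; Sample input (hypothetical, for illustration only)
(set 's "&#243;")

; Implicit slice: start at offset 2, stop one character before the end
(println (2 -1 s))    ;-> 243

; Second sample with two codes, to compare greedy and non-greedy matching
(set 'two "a&#243;b&#241;c")

(regex {&#(.*);} two)
(println $1)          ;-> 243;b&#241   (greedy: runs to the last ";")

(regex {&#(.*?);} two)
(println $1)          ;-> 243          (non-greedy: stops at the first ";")
```

So the greedy pattern from the first post only works as long as each string contains a single code.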

Post by **kanen** » Sun Apr 17, 2011 10:27 pm

kanen wrote: I feel like I am missing some (map) or (replace) iterative operator.

In other words, $it is the iterative operator I was missing. :)

d'oh!

Post by **Lutz** » Mon Apr 18, 2011 12:44 am

kanen wrote: In other words, $it is the iterative operator I was missing. :)

No, this one would work too:

Code:

> (set 'x "Telef&#243;nica to Sell Spanish Assets")
"Telef&#243;nica to Sell Spanish Assets"
> (replace {&#(\d+);} x (char (int $1)) 0)
"Telefónica to Sell Spanish Assets"
>

'$it' is just a replacement for '$0'; 'replace' always iterates through all occurrences found:

http://www.newlisp.org/downloads/newlis ... ml#replace

and here about the anaphoric '$it':

http://www.newlisp.org/downloads/newlis ... em_symbols
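To illustrate the point about iteration, here is a minimal sketch with several codes in one string. The sample string `s` is made up for illustration. The extra arguments to `int` (default 0, base 10) are an addition of this sketch, not part of the code above; they keep a code written with a leading zero, such as "&#0243;", from being read as octal.

```newlisp
; Sample string with three decimal codes (hypothetical, for illustration)
(set 's "Telef&#243;nica, Espa&#241;a, M&#225;laga")

; A single call replaces every match; $1 holds the captured digits
; for whichever match is currently being replaced.
(replace {&#(\d+);} s (char (int $1 0 10)) 0)

(println s)
```

On a UTF-8 enabled newLISP this should print "Telefónica, España, Málaga" without any explicit loop.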

Post by **xytroxon** » Mon Apr 18, 2011 7:50 am

It's a little more complicated than what I posted from memory... Handling all the HTML special entity codes in HTML or RSS docs is a pain!

Besides decimal codes, there are...

hexadecimal codes -> ... &#xFF; ... &#x150; etc...

common codes -> &amp; &lt; &gt; &nbsp; etc...

foreign language -> &iexcl; &iquest; &euro; etc...

http://tlt.its.psu.edu/suggestions/inte ... ehtml.html

http://webdesign.about.com/library/bl_htmlcodes.htm

A more general solution:

Code:

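(define (HTML-special-chars str)
; code here: map the text between "&" and ";" to its character
)

; the capture group makes $1 the text between "&" and ";"
(replace "&(.*?);" text (HTML-special-chars $1) 0)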

And I've seen &amp; in RSS docs requiring a second pass...

-- xytroxon
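One possible way to fill in that stub, as a rough sketch rather than a complete solution. The `named-entities` table, its four entries, the pattern `{&(#?\w+);}` and the sample `text` are all assumptions made for this example; a real table would need many more names. Codes the function does not recognize are put back unchanged.

```newlisp
; Hypothetical lookup table with a few named entities (far from complete)
(set 'named-entities '(("amp" "&") ("lt" "<") ("gt" ">") ("quot" "\"")))

(define (HTML-special-chars str)
  ; str is the text between "&" and ";", e.g. "#243", "#xF3" or "amp"
  (cond
    ; hexadecimal code: skip "#x" and parse the rest in base 16
    ((or (starts-with str "#x") (starts-with str "#X"))
      (char (int (2 str) 0 16)))
    ; decimal code: skip "#" and parse the rest in base 10
    ((starts-with str "#")
      (char (int (1 str) 0 10)))
    ; named entity found in the table: cond returns the looked-up value
    ((lookup str named-entities))
    ; anything else is restored as it was
    (true (string "&" str ";"))))

; Sample input (hypothetical)
(set 'text "Telef&#243;nica &amp; Telef&#xF3;nica &unknown;")
(replace {&(#?\w+);} text (HTML-special-chars $1) 0)
(println text)
```

On a UTF-8 enabled newLISP this should print "Telefónica & Telefónica &unknown;". Text that was encoded twice would still need the same `replace` call run a second time.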

newlispfanclub.alh.net
