html parser

Post by **Dmi** » Sat Apr 29, 2006 9:34 pm

I wrote one a few days ago: http://en.feautec.pp.ru/store/libs/tags.lsp

It can parse structured tagged text such as HTML, copes with unclosed tags, and uses regular expressions to define the tags.

Example, for this HTML:

Code:

<html><body>
test
<table align=center><tr><td>test1</td><tr><td>test2<td>test3</table>
</body></html>

Both closed and unclosed tags are present here.

With these syntax rules:

Code:

; tag format: (tag-sym tag-pattern tag-open|tag-close|tag-self (closes-tag closes-tag ...))
; tag-open  - opens a sublist and becomes its head
; tag-close - closes sublists and is not kept in the result
; tag-self  - closes sublists and is kept in the result
(set 'html-tags '(
  (table  "<table(| [^>]*)>" tag-open  ())
  (table/ "</table>"         tag-close (table th tr td))
  (tr     "<tr(| [^>]*)>"    tag-open  (tr th td))
  (tr/    "</tr>"            tag-close (tr th td))
  (th     "<th(| [^>]*)>"    tag-open  (th td))
  (th/    "</th>"            tag-close (th td))
  (td     "<td(| [^>]*)>"    tag-open  (th td))
  (td/    "</td>"            tag-close (th td))
  (br     "<br>"             tag-self  ())
  (hr     "<hr(| [^>]*)>"    tag-self  ())
  (p      "<p>"              tag-self  ())))

You get the following:

Code:

> (set 'htm (TAGS:parse-tags TAGS:html-tags (read-file "example.html")))

("<html><body>\ntest\n" TAGS:table TAGS:tr TAGS:td "test1" TAGS:td/ TAGS:tr
  TAGS:td "test2" TAGS:td "test3" TAGS:table/ "\n</body></html>\n")

The text is parsed and the defined tags are replaced with symbols. The result is a flat (one-dimensional) list.
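For anyone who wants the idea without reading tags.lsp, here is a rough Python sketch of this flattening step. It is not the code from the library: `Tag`, `HTML_TAGS` and this `parse_tags` are Python stand-ins made up for illustration, only some of the rules are copied over, and it assumes that the regexp match starting earliest in the remaining text wins.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a newLISP symbol such as TAGS:td.
Tag = namedtuple("Tag", "name")

# Python copy of part of the html-tags rules: (name, pattern, kind, closes).
HTML_TAGS = [
    ("table",  r"<table(| [^>]*)>", "open",  ()),
    ("table/", r"</table>",         "close", ("table", "th", "tr", "td")),
    ("tr",     r"<tr(| [^>]*)>",    "open",  ("tr", "th", "td")),
    ("tr/",    r"</tr>",            "close", ("tr", "th", "td")),
    ("td",     r"<td(| [^>]*)>",    "open",  ("th", "td")),
    ("td/",    r"</td>",            "close", ("th", "td")),
]

def parse_tags(rules, text):
    """Split text into a flat list of plain strings and Tag markers."""
    compiled = [(name, re.compile(pattern)) for name, pattern, _kind, _closes in rules]
    flat, pos = [], 0
    while pos < len(text):
        # Assumption: the rule whose match starts earliest wins.
        best_name, best = None, None
        for name, rx in compiled:
            m = rx.search(text, pos)
            if m and (best is None or m.start() < best.start()):
                best_name, best = name, m
        if best is None:
            flat.append(text[pos:])              # no more tags: keep the tail as text
            break
        if best.start() > pos:
            flat.append(text[pos:best.start()])  # text in front of the tag
        flat.append(Tag(best_name))              # the tag itself becomes a marker
        pos = best.end()
    return flat

html = ("<html><body>\ntest\n"
        "<table align=center><tr><td>test1</td><tr><td>test2<td>test3</table>\n"
        "</body></html>\n")
flat = parse_tags(HTML_TAGS, html)
```

Tags without a rule (`<html>`, `<body>`) simply stay inside the surrounding strings, as in the output above.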

Code:

> (TAGS:structure-tags TAGS:html-tags htm)

(TAGS:data "<html><body>\ntest\n"
  (TAGS:table (TAGS:tr (TAGS:td "test1"))
    (TAGS:tr (TAGS:td "test2") (TAGS:td "test3")))
  "\n</body></html>\n")

The pre-parsed list is converted to a nested list according to the defined tagging rules.
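The nesting step can be sketched with a stack of open sublists. Again this is only a Python illustration, not the code from tags.lsp: `Tag`, `RULES` and this `structure_tags` are hypothetical names, and the rule "keep closing while the innermost open sublist is in the closes list" is my reading of how the closes list is meant to work.

```python
from collections import namedtuple

# Hypothetical stand-in for a newLISP symbol such as TAGS:td.
Tag = namedtuple("Tag", "name")

# name -> (kind, tags it closes); a Python copy of part of html-tags.
RULES = {
    "table":  ("open",  ()),
    "table/": ("close", ("table", "th", "tr", "td")),
    "tr":     ("open",  ("tr", "th", "td")),
    "tr/":    ("close", ("tr", "th", "td")),
    "td":     ("open",  ("th", "td")),
    "td/":    ("close", ("th", "td")),
    "br":     ("self",  ()),
}

def structure_tags(rules, flat):
    """Nest a flat token list; every node is [name, child, child, ...]."""
    root = ["data"]
    stack = [root]                      # stack[-1] is the innermost open sublist
    for item in flat:
        if not isinstance(item, Tag):
            stack[-1].append(item)      # plain text goes into the current sublist
            continue
        kind, closes = rules[item.name]
        # Assumed rule: close sublists while the innermost one is listed.
        # The root is never closed.
        while len(stack) > 1 and stack[-1][0] in closes:
            stack.pop()
        if kind == "open":
            node = [item.name]          # the tag heads its new sublist
            stack[-1].append(node)
            stack.append(node)
        elif kind == "self":
            stack[-1].append(item)      # the tag is kept as a leaf
        # kind == "close": the closing tag itself is dropped
    return root

flat = ["<html><body>\ntest\n",
        Tag("table"), Tag("tr"), Tag("td"), "test1", Tag("td/"),
        Tag("tr"), Tag("td"), "test2", Tag("td"), "test3",
        Tag("table/"), "\n</body></html>\n"]
nested = structure_tags(RULES, flat)
```

On the example this gives the same shape as the newLISP output above. I have not checked this simple rule against harder input such as nested tables, so treat it as a starting point only.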
With such a nested list, parsing HTML tables becomes relatively easy...
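To show what "relatively easy" can look like, here is a small Python sketch that pulls the rows out of such a nested list. The nested list is written by hand as Python lists of the form `[name, child, ...]`, and `cell_text` and `table_rows` are hypothetical helper names, not functions from tags.lsp.

```python
def cell_text(node):
    """Join all text inside a node, skipping anything that is not a string or sublist."""
    parts = []
    for child in node[1:]:
        if isinstance(child, list):
            parts.append(cell_text(child))
        elif isinstance(child, str):
            parts.append(child)
    return "".join(parts)

def table_rows(node):
    """Return one list of cell texts for every tr found under node."""
    rows = []
    for child in node[1:]:
        if not isinstance(child, list):
            continue                     # plain text between tags
        if child[0] == "tr":
            rows.append([cell_text(cell).strip() for cell in child[1:]
                         if isinstance(cell, list) and cell[0] in ("td", "th")])
        else:
            rows.extend(table_rows(child))  # look deeper, e.g. inside table
    return rows

# Hand-written Python version of the nested list shown above.
nested = ["data", "<html><body>\ntest\n",
          ["table",
           ["tr", ["td", "test1"]],
           ["tr", ["td", "test2"], ["td", "test3"]]],
          "\n</body></html>\n"]

rows = table_rows(nested)
print(rows)  # [['test1'], ['test2', 'test3']]
```

Because unclosed `<tr>` and `<td>` tags were already resolved by the nesting step, the walker does not need to care about them.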