Parsing Markup Tags. Code Optimization

Tim Johnson · Post by **Tim Johnson** » Sat Dec 26, 2009 8:07 pm

I've written a newlisp function that processes a string and returns a list of elements in which markup
tags are separated from plain text.
There is an implementation featured here:
http://newlispfanclub.alh.net/forum/vie ... =16&t=3386
But it doesn't handle javascript code.
The function works for me. I expect that once I put it to work, it
will need some tweaking. However, I have thus far used newlisp only intermittently and I would deeply
appreciate it if some of you newlisp veterans would review this code and suggest optimizations. I have
based this on a function that I wrote for python (rebol has this feature built in), but I would like suggestions as
to how to make the code more "newlispish". Such suggestions would certainly contribute to my overall
grasp of newlisp.
And also, I want to wish you all the best for this holiday season and for the New Year.
Code follows:

Code: Select all

;; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
;;  @syntax [-parse-markup <str>-] 
;;  @Description Parse 'str' into alternating plain text and markup elements
;;  @Returns a list. Adjacent tags are separate elements.
;; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
(define (parse-markup str)
  (let ((res '())(buf "")(inTag)(inScript)(chr)(nxt)(ndx -1)(endp (- (length str) 1))
      ;; "Private" functions
      (data (fn()(not(empty? buf))))             ;; test for data in temporary buffer
      (add2buf (fn()(set 'buf(append buf chr)))) ;; append char to buffer
      (sflag (fn()(set 'inScript (if (find "javascript" (lower-case buf))))))
      (add2res 
        (fn(c)  ;; Add buffer and re-initialize. Set 'inScript flag
          (sflag)              ;; set/unset 'inScript
          (push buf res)       ;; Add buffer to results
          (set 'buf "")        ;; Reinitialize the buffer
          (if c (add2buf))))) ;; End 'let initialization form 
    (dostring (c str)     ;; scan string char-by-char
      (inc ndx)           ;; position of char
      (set 'chr (char c)) ;; one-char string
      (if (< ndx endp)    ;; keep track of next char to process
        (set 'nxt (str (+ ndx 1))))
      (cond
        ((= chr "<")      ;; Begin a tag insertion if not javascript
          (cond 
            (inScript       ;; Still processing javascript code
              (cond
                ((= nxt "/")  ;; Finishing javascript code block.
                  (set 'inTag true 
                       'inScript nil)    ;; set boolean flags
                  (cond 
                    ((data)              ;; if buffer has data, push and clear
                      (add2res chr))     ;; Add buffer to results and re-initialize with char
                     (true (add2buf))))  ;; add char to empty buffer
                 (true (add2buf))))      ;; Keep filling 'buf
            (true            ;; Not in javascript code block. Starting new tag
              (set 'inTag true)
              (cond 
                ((data)               ;; If 'buf has data. 
                  (add2res chr))      ;;    push and re-initialize with char
                (true (add2buf))))))  ;; Buffer is empty, keep filling 'buf
        ((and (= chr ">") (not inScript)) ;; finishing a tag.
          (set 'inTag nil) 
          (sflag)       ;; set flags
          (add2buf)     ;; add char to buffer
          (add2res))    ;; Push buffer and reinitialize
        (true           ;; still in script block
          (add2buf))))  ;; just add to 'buf, end dostring/outermost cond
    (if (data)        ;; If data in 'buf, add to result
      (add2res))
    (reverse res)))

unixtechie · Post by **unixtechie** » Sun Dec 27, 2009 10:23 am

I do not understand at all what you are trying to do here.

1. If you need to separate tags from text, then:
(a) "canonize" the text by adding "\n" (newline) after each ">"
(b) consider each line in the substituted text your needed outcome.

That's it.

This is implemented with exactly 2 operators, "read-file" and
"replace", then read the modified buffer string line-by-line with something like "regex" with offsets.

2. If you do not wish to slurp the file into memory,
(a) use "search" to get to needed positions and "seek" to keep a list of offsets,
(b) next read your strings jumping between the known offsets.

That's all.
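
Option 1 above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not unixtechie's actual code: the name split-tags is hypothetical, the extra newline before each "<" is an added assumption so that text preceding a tag lands on its own line, and it works on a string already in memory rather than on a file read with read-file. It also assumes the input has no newlines of its own, and it makes no attempt to handle javascript.

```newlisp
;; Hypothetical helper: put every tag on its own line, then split on newlines.
(define (split-tags text)
  (replace "<" text "\n<")            ; assumption: also break before each tag
  (replace ">" text ">\n")            ; break after each ">", as suggested above
  (clean empty? (parse text "\n")))   ; drop empty strings left between adjacent tags

(println (split-tags "abc<b>def</b>ghi"))
;; prints ("abc" "<b>" "def" "</b>" "ghi")
```

Any "<" or ">" inside a script body is treated as a tag boundary here, which is exactly the case the original function is meant to cover.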

You write in abstractions - can you give primitive examples (input - expected output) of what you are trying to achieve?

cormullion · Post by **cormullion** » Sun Dec 27, 2009 2:18 pm

Tim - nice code, and a pleasant xmas gift! :)

I've yet to look at it closely, but it looks good. I'll see if I can run it over some sample pages sometime.

When I see (reverse..) I look to see if there's been pushing to the front or end of lists or strings - if you push to the end you can sometimes omit the reverse.

unixtechie - I think the problem is that simple parsing of HTML by angle brackets usually breaks when the page contains Javascript code. (Not sure what the standards say, but for practical reasons it doesn't matter...)

Tim Johnson · Post by **Tim Johnson** » Sun Dec 27, 2009 4:31 pm

Hopefully by this time unixtechie groks the issue with javascript....
the code here http://newlispfanclub.alh.net/forum/vie ... =16&t=3386 is both
shorter and much faster, but doesn't handle the javascript.
And I'm sure that is the solution that unixtechie refers to.
A more complete (language agnostic) solution might be something like this pseudo-code:

Code: Select all

(define (load-markup s)
  (if (find "<script" s 1)           ;; option 1: case-insensitive regex search
      (parse-markup-the-hard-way s)  ;; script blocks present
      (parse-markup-with-regexes s)))

And cormullion, I note your comment about using 'reverse on the result set:
That was deliberate. I pondered whether it would be faster to use reverse once
than to use (push item list -1) every time.

thanks folks.

xytroxon · Post by **xytroxon** » Sun Dec 27, 2009 5:56 pm

Tim Johnson wrote: And cormullion, I note your comment about using 'reverse on the result set:
That was deliberate. I pondered whether it would be faster to use reverse once
than to use (push item list -1) every time.

Lutz has optimized newLISP for the (push item list -1) form... (So you don't need to use coding tricks ;)

-- xytroxon
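
For comparison, the two variants under discussion can be put side by side. This is a minimal sketch with made-up sample data (the names items, res1 and res2 are only for illustration); it shows that both produce the same list and makes no claim about which is faster.

```newlisp
(set 'items '("a" "<b>" "c"))

;; Variant 1: push to the front, then reverse once at the end.
(set 'res1 '())
(dolist (x items) (push x res1))
(reverse res1)                      ; reverse changes res1 in place

;; Variant 2: push to the end, no reverse needed.
(set 'res2 '())
(dolist (x items) (push x res2 -1))

(println (= res1 res2))             ; prints true
```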

TedWalther · Post by **TedWalther** » Sun Dec 27, 2009 6:42 pm

There is an O'Reilly book called "Javascript: The Good Parts". It includes a nice parser in the back, only takes 3 pages of very clean code. Perhaps it would be best to include a javascript parser inside your parser?

How about the built-in xml-parse function? Would it be easier for you to do that, and then manipulate the s-ml expression tree directly?

Ted

Tim Johnson · Post by **Tim Johnson** » Sun Dec 27, 2009 6:57 pm

xytroxon wrote:
Lutz has optimized newLISP for the (push item list -1) form... (So you don't need to use coding tricks ;)
-- xytroxon

Understood.
Thanks!

Tim Johnson · Post by **Tim Johnson** » Sun Dec 27, 2009 7:12 pm

TedWalther wrote:There is an O'Reilly book called "Javascript: The Good Parts". It includes a nice parser in the back, only takes 3 pages of very clean code. Perhaps it would be best to include a javascript parser inside your parser?
Ted

I do a lot with javascript <sigh!>
I have javascript functions that load data into forms, but there are gotchas when the HTML source is rendered
dynamically. For example, when a page is rendered and an interior form is then rendered via AJAX,
I have had problems finding an event to attach a handler to. Also, javascript
makes me cautious. After all, when one is writing client-side code, one has to consider any number
(potentially millions) of interpreters running inside of any number (and my customers do hope millions) of browsers.

Whereas, when I write server-side code, I only have to consider one interpreter, given that the
interpreters should behave identically across a small number of different operating systems.

TedWalther wrote: How about the built-in xml-parse function? Would it be easier for you to do that, and then manipulate the s-ml expression tree directly?
Ted

Consider the following code: (And look out for wrapped strings)

Code: Select all

 (set 'res(xml-parse "abcdefghijk<Script type=\"Javascript\">var a=1; if(a > 1)alert(\"Yes\");else alert(\"No\");</script><div>lmnopq</font>rstuvwxyz"))
(println "Using xml-parse: " res)
(set 'res(parse-markup "abcdefghijk<Script type=\"Javascript\">var a=1; if(a > 1)alert(\"Yes\");else alert(\"No\");</script><div>lmnopq</font>rstuvwxyz"))
(println "Using parse-markup: " res)

Result:

Code: Select all

Using xml-parse: nil
Using parse-markup: ("abcdefghijk" "<Script type=\"Javascript\">" "var a=1; if(a > 1)alert(\"Yes\");else alert(\"No\");"
 "</script>" "<div>" "lmnopq" "</font>" "rstuvwxyz")

But then, maybe this is a function of my lack of familiarity with xml-parse, because it has a very complex
interface. Preliminary tests that I made indicated that I could use 'xml-parse to process the separated tags, which is
another piece in my objective.
thanks
tim
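
One way to see why xml-parse returned nil above is to ask xml-error after the failed call. This is a minimal sketch using only the mismatched part of the test string; the variable names fragment and tree are illustrative, and the exact wording of the error message is not shown because it depends on the newLISP version.

```newlisp
;; The closing tag does not match the opening tag, so this is not well-formed XML.
(set 'fragment "<div>lmnopq</font>")
(set 'tree (xml-parse fragment))
(if tree
    (println "parsed: " tree)
    (println "xml-parse failed: " (xml-error)))  ; error message and position
```

Since xml-parse expects well-formed XML, it looks better suited to processing individual tags after they have been separated than to splitting arbitrary HTML.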

unixtechie · Post by **unixtechie** » Mon Dec 28, 2009 7:26 am

There is still much talk "about" the issue, but no specifications.
Show, using very short one-line examples, what the input is and what the expected output is - otherwise all talk is useless.

Supposing you got this as input:

Code: Select all

<fieldset><legend><a href="javascript:;" onmousedown="toggleCombined('18');">
<font class='lnum'><i>(18)</i></font>&nbsp; Markup of code and documentation sections </a>&nbsp;<font class='lnum' size=-1><sub><i>(line 962)</i></sub></font> <font size=-2><i><a href='#tocancor'>toc</a></i></font><a name='18'></a></legend></fieldset>
<p>
<div id="18" style="display:none">   
<p>
    <b> <i> Markup </i> </b><br>
</div>
</fieldset>

What do you expect as "correct" output for your task?
Please explain what you are expecting.

Tim Johnson · Post by **Tim Johnson** » Mon Dec 28, 2009 4:48 pm

unixtechie wrote:still there is much talk "about" the issue, but no specifications.
Tell using very short one-line examples what is input and what is the expected output - otherwise all talk is useless.

Supposing you got this as input:
Code: Select all
<fieldset><legend><a href="javascript:;" onmousedown="toggleCombined('18');">
(18)&nbsp; Markup of code and documentation sections </a>&nbsp;(line 962) <a href='#tocancor'>toc</a><a name='18'></a></legend></fieldset>

<div id="18" style="display:none"> 

 Markup 
</div>
</fieldset>
What do you expect as "correct" output for your task?
Please explain what you are expecting.

The output is as follows:

Code: Select all

res ==> ("<fieldset>" "<legend>" "<a href=\"javascript:;\" onmousedown=\"toggleCombined('18');\">" " <font class='lnum'><i>(18)" "</i>" "</font>" "&nbsp; Markup of code and documentation sections " "</a>" "&nbsp;" "<font class='lnum' size=-1>" "<sub>" "<i>" "(line 962)" "</i>" "</sub>" "</font>" " " "<font size=-2>" "<i>" "<a href='#tocancor'>" "toc" "</a>" "</i>" "</font>" "<a name='18'>" "</a>" "</legend>" "</fieldset>" "<p>" "<div id=\"18\" style=\"display:none\">" "<p>" " " "<b>" " " "<i>" " Markup " "</i>" " " "</b>" "<br>" " " "</div>" " " "</fieldset>" "'")

And it is correct to my specs.
Here is a shorter input example:

Code: Select all

(set 'res(parse-markup "<form method =\"POST\" action=\"http://localhost/cgi-bin/render.lsp\">Password:&nbsp;<input type=\"password\" name=\"pwd\"></form>"))

And here is the result, which is what I want:

Code: Select all

res ==> ("<form method =\"POST\" action=\"http://localhost/cgi-bin/render.lsp\">" "Password:&nbsp;" "<input type=\"password\" name=\"pwd\">" "</form>")

My original intent was to solicit comments on the correctness, efficiency and appropriate style of my code.

cormullion · Post by **cormullion** » Mon Dec 28, 2009 5:09 pm

Presumably you can use (push chr buf -1) rather than (append ... Haven't checked but it might be OK.

Also, I think dostring has a built-in indexing - $idx - this might be usable and save you running your own counter.

That cond structure is deep - but why not!? :)
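
The first two suggestions can be tried in isolation. This is a minimal sketch, assuming that dostring maintains $idx in the newLISP version at hand (not verified here, as noted above); the sample string is made up, and chr and buf simply mirror the names used in the original function.

```newlisp
(set 'str "a<b>")
(set 'buf "")
(dostring (c str)
  (set 'chr (char c))       ; one-char string, as in the original function
  (push chr buf -1)         ; append in place instead of (set 'buf (append buf chr))
  (println $idx ": " chr))  ; $idx stands in for the hand-rolled ndx counter
(println buf)               ; buf now holds a copy of str
```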

Tim Johnson · Post by **Tim Johnson** » Mon Dec 28, 2009 6:41 pm

cormullion wrote:Presumably you can use (push chr buf -1) rather than (append ... Haven't checked but it might be OK.

Cool. Would save some 'set forms

cormullion wrote: Also, I think dostring has a built-in indexing - $idx - this might be usable and save you running your own counter.

Of course!

cormullion wrote: That cond structure is deep - but why not!? :)

Would there be another approach that you would recommend? (other than going so deep into 'cond)
Thanks.
-----------
I will implement your suggestions and look forward to your further comments.
Cheers
tim

Tim Johnson · Post by **Tim Johnson** » Mon Dec 28, 2009 7:53 pm

Maybe I should elaborate further on my originating intent for this posting. I see that cormullion "gets it", but
I fear that others may not:

First some history: As a web programmer, I have written a lot of modules in both python and rebol.
I need that same functionality from newlisp if I am going to step out with some serious web programming
assets in newlisp.

One of the modules that I have written in both rebol and python inputs an html document or a portion of an html document
and outputs a writable data structure. This module has many applications for me, my company and my clientele.

The starting point for such a module is to reduce this input to a list in which plain text is
separated from tags and individual tags are separated from each other. Anyone reading this should be
able to see the examples that I illustrated for unixtechie.
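
For input without script blocks, that starting point can be approximated with one regular expression. This is a minimal sketch, not the code from the linked thread: the name parse-markup-simple is hypothetical, and the pattern assumes that every "<" opens a tag and every ">" closes one, which is precisely what breaks on javascript.

```newlisp
;; Hypothetical helper: match either a whole tag or a run of non-tag text.
(define (parse-markup-simple str)
  (find-all {<[^>]*>|[^<]+} str))

(println (parse-markup-simple "<p>Password:&nbsp;<input type=\"password\"></p>"))
```

For this input the result should have four elements: the opening tag, the text, the input tag and the closing tag.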

Rebol provides a native (part of the binary) function, called load/markup that does exactly what is described
in the previous paragraph. Python does not. Therefore I had to write my own function for that purpose.

In writing my first draft of this functionality in newlisp, I used my python code as a "prototype". In fact,
those of you who have been around this business for some years might remember when python was introduced
as a "prototyping" tool. However, the result is "pythonish" rather than "newlispish", i.e. it is not idiomatic to
newlisp.

When cormullion introduces the suggestion of using $idx instead of a counter, he is pointing me in the
"newlispish" direction. Then, in turn, I can apply his suggestions to other newlisp code that I might write
and this becomes a valuable tutorial for me and hopefully is helpful to others.
How am I doing so far? Do you all understand what I am after? :)
thanks
tim
