About parse and scanning

Q&A's, tips, howto's
Locked
Maurizio
Posts: 52
Joined: Mon Jul 28, 2003 3:06 pm
Location: Italy

About parse and scanning

Post by Maurizio »

I am trying to parse a text file, separating every "word"
according to a regular expression.
The code is reported below.
What I see is that even with a moderately sized text file (190 KB)
it is very slow.
I suppose this is due to the continuous string splitting after each
successful match.
Any suggestions?
By the way, it would be nice to have a "scan" function that works like
"parse", but in which you specify the token syntax instead of the separator syntax.
Regards,
Maurizio.

Code:

(define (scan expr str)
  (let ((lst '())
        (res '()))
    ;; 'regex' returns (match offset length ...) or nil when nothing matches
    (while (set 'res (regex expr str))
      (push (first res) lst -1)
      ;; drop everything up to the end of the match; this copies the
      ;; remaining string on every iteration, which is what makes it slow
      (set 'str (slice str (+ (nth 1 res) (nth 2 res)))))
    lst))
     
      
(define (go)
   (let  ((expr  {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} ))
     (scan expr (read-file "myfile.txt"))))
      

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

Here is a trick from a program (http://www.newlisp.org/code/get-normans.txt) on the contributions page:

Code:

;; get the text for scanning
(set 'txt (read-file "myfile.txt"))

;; set the regular expression pattern
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} )

;; push all tokens onto 'tokens'
(replace expr txt (push $1 tokens -1) 0)

;; the list in 'tokens' contains all tokens found
This should be many times faster, because 'replace' walks internally through all the text. The work is done in the replacement expression, which pushes the token found to the end of the list and returns the token itself, so 'txt' will be unchanged afterwards.

Basically 'replace' is the scanner function in this solution.
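For example, here is a minimal sketch of the same trick on an inline string instead of a file (the '(begin ... $0)' wrapper makes the replacement explicitly return the matched token, so 'txt' stays unchanged regardless of what 'push' returns):

Code:

;; a small inline sample instead of "myfile.txt"
(set 'txt {alpha beta 42 gamma7})

;; same alternation idea, trimmed to three branches
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|[[:graph:]]})

;; collect every match; the replacement returns the token itself
(set 'tokens '())
(replace expr txt (begin (push $0 tokens -1) $0) 0)

;; tokens now holds ("alpha" "beta" "42" "gamma7")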

Lutz

Maurizio
Posts: 52
Joined: Mon Jul 28, 2003 3:06 pm
Location: Italy

Post by Maurizio »

Thank you very much.

btw, it seems to me
(push $0 tokens -1)

and not
(push $1 tokens -1)

Regards
Maurizio

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

Yes, of course, $0; in the original program I extracted a subexpression. I wonder how big the speed improvement will be for you?

Lutz

Maurizio
Posts: 52
Joined: Mon Jul 28, 2003 3:06 pm
Location: Italy

Post by Maurizio »

The speed improvement is about 4 to 5 times.
I mean, before your suggestion the parse time was about 5 seconds;
now it is only 1 second.
(I'm just experimenting; I don't really need to parse very big files.)

Thanks,

Maurizio

eddier
Posts: 289
Joined: Mon Oct 07, 2002 2:48 pm
Location: Blue Mountain College, MS US

Post by eddier »

That's interesting. I use replace on regular expressions very often and I find that this operation is very fast. I would have thought that the time would go down drastically, like maybe to 1/100 of a second.

Eddie

Maurizio
Posts: 52
Joined: Mon Jul 28, 2003 3:06 pm
Location: Italy

Post by Maurizio »

I've seen some improvement from reorganizing the alternatives in the
regex so that the patterns that match the most frequent tokens come
first in the expression.
Anyway, I'm testing with a 130 KB file containing 15,600 tokens.
The scan time is now about 1 second.
(Pentium 4, 2.80 GHz, with Hyper-Threading)
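For instance, the reordering might look like this (hypothetical orderings; since these alternatives match disjoint token classes, swapping them changes only how many branches the regex engine tries per match, not what matches):

Code:

;; if identifiers dominate the input, try the identifier branch first
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|[[:graph:]]})

;; if numbers dominate, putting the digit branch first saves failed attempts
(set 'expr {[[:digit:]]+|[[:alpha:]]+[[:alnum:]]*|[[:graph:]]})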


Maurizio

eddier
Posts: 289
Joined: Mon Oct 07, 2002 2:48 pm
Location: Blue Mountain College, MS US

Post by eddier »

Yes, different ways of writing regular expressions can drastically change performance. Complex expressions involving many "|" alternatives will obviously take longer.

Eddie
