parsing large files (> 5GB)

Q&A's, tips, howto's
jopython
Posts: 123
Joined: Tue Sep 14, 2010 3:08 pm

parsing large files (> 5GB)

Post by jopython »

Parsing large log files line by line takes a looong time (several times slower than Perl).

Code: Select all

(while (read-line file)
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))

Is there anything I could use within newLISP to shorten the time for reading files? Say, is it possible to read files in big chunks (say 10 MB) and then parse that portion in memory for faster access?

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: parsing large files (> 5GB)

Post by Lutz »

In the next development or stable release version (January 2012), at least (read-line) from STDIN will be 3 times faster than in the old version. If you can process your log files like this: process < thelogfile.txt, this will help you.

Until then, yes, reading big chunks into memory and doing a:

Code: Select all

(dolist (line (parse chunk "\n")) 
    ...
)
will be much faster.
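
A rough sketch of that chunked approach, assuming the pre-compiled pattern p1 from the snippet above (the 10 MB chunk size, the file name and the handling of lines that straddle a chunk boundary are only illustrative):

Code: Select all

; sketch: read the file in ~10 MB chunks and parse each chunk in memory
(set 'file (open "thelogfile.txt" "read"))
(set 'leftover "")
(set 'foo 0)
(while (read file chunk 10485760)            ; read (read-buffer) returns nil at end of file
    (set 'chunk (append leftover chunk))
    (set 'lines (parse chunk "\n"))
    (set 'leftover (pop lines -1))           ; keep a possibly incomplete last line
    (dolist (line lines)
        (if (regex p1 line 0x10000) (inc foo))))
(unless (empty? leftover)                    ; check any trailing partial line
    (if (regex p1 leftover 0x10000) (inc foo)))
(close file)
(println foo)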

PS: even in the old version, (read-line) via STDIN is already about 3 times faster than reading from a file channel.

jopython
Posts: 123
Joined: Tue Sep 14, 2010 3:08 pm

Re: parsing large files (> 5GB)

Post by jopython »

The difference is really huge: 8 minutes in newLISP vs. 2 seconds in Perl for a 100 MB log file.
I am not a fan of Perl, but I am forced to use it because of its text-processing performance.

$ time ./apache.lsp xaa
3885

real 8m2.346s
user 1m28.495s
sys 6m30.581s

$ cat apache.lsp

Code: Select all

;(set 'yesterd (date (date-value) -480 "%d/%b/%Y.+"))
(set 'yesterd [text]10/Dec/2011.+[/text])
(set 'reg [text]GET /index.html  HTTP/1.1[/text])
(set 'pattern_str (append yesterd reg))
(set 'p1 (regex-comp pattern_str))
(set 'file (open ((main-args) 2) "read")) ; the argument to the script
(while (read-line file)
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))
(if foo (println foo) (println 0))
(exit)

Code: Select all

$ time perl -lne 'BEGIN{$h=0;}if (m/10\/Dec\/2011\S+\s+\S+\]\s+\x22GET\s+\/index.html\s+HTTP\/1.1/ox){$h++;}END{print $h}' xaa
3885

real 0m2.704s
user 0m2.285s
sys 0m0.387s

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: parsing large files (> 5GB)

Post by Lutz »

That is really a big difference, but I don't believe that newLISP is generally much slower than Perl at text processing. That is not the general experience, and if you look at the benchmarks here: http://www.newlisp.org/benchmarks/, you will see that the differences are minor and, wherever they do appear, line-by-line file reading is involved. But that doesn't explain the huge difference you are seeing.

I calculate about 120 ms of processing time per line! Even with the slower read-line I/O, that just doesn't make sense.

I generated the following test file test.txt:

Code: Select all

(set 'file (open "test.txt" "w"))
(dotimes (i 100000) (write-line file (join (map string (rand 100 100)))))
and read it back with this program:

Code: Select all

#!/usr/bin/newlisp

(set 'chan (open (main-args -1) "r"))
(println "time:" (time (while (read-line chan) 
    (inc lcnt) 
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)
and ran it:

Code: Select all

~> ./readchannel test.txt
time:11787.971
read 100000 lines and 18999043 characters
~> 
which is about 120 microseconds per line (11788 ms / 100,000 lines). Adding a simple regular expression made less than 1% of a difference.

Now using STDIN to feed the file:

Code: Select all

#!/usr/bin/newlisp

(println "time:" (time (while (read-line) 
    (inc lcnt) 
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)
running it:

Code: Select all

~> ./readstdin < test.txt
time:264.203
read 100000 lines and 18999043 characters
which is about 2.6 microseconds per line (264 ms / 100,000 lines).

The difference between the two methods is that the fast one uses stream reading with fgetc(), while the slower one uses a file-handle-based read(handle, &chr, 1), one character per call.

What is the experience of others doing text processing with newLISP?

PS: all measurements with 10.3.3 on Mac OS X 10.7.2; on 10.3.10, the faster method takes 1 microsecond or less per line.

jopython
Posts: 123
Joined: Tue Sep 14, 2010 3:08 pm

Re: parsing large files (> 5GB)

Post by jopython »

Hmm..

$ ./readchannel test.txt
time:87445.189
read 100000 lines and 19000792 characters

$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters

This is on an UltraSPARC 25.

jopython
Posts: 123
Joined: Tue Sep 14, 2010 3:08 pm

Re: parsing large files (> 5GB)

Post by jopython »

Using the STDIN method, the original script went down from 8 minutes to 13 seconds. Phew.

Code: Select all

time ./apache.lsp < xaa
3885

real    0m13.088s
user    0m12.662s
sys     0m0.243s
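
For reference, the change is just dropping the open call and calling (read-line) with no argument, so the script reads from STDIN; roughly (pattern setup copied from the original post):

Code: Select all

#!/usr/bin/newlisp
; same filter as before, but fed via STDIN:  ./apache.lsp < xaa
(set 'yesterd [text]10/Dec/2011.+[/text])
(set 'reg [text]GET /index.html  HTTP/1.1[/text])
(set 'p1 (regex-comp (append yesterd reg)))
(while (read-line)                      ; no file handle: (read-line) reads STDIN
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))
(if foo (println foo) (println 0))
(exit)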

Looks like the per-character, file-handle read() method is a bad idea.

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Re: parsing large files (> 5GB)

Post by Lutz »

Yes, and on 10.3.10 and later you will get down to about 4 seconds versus about 2.7 seconds for Perl, which is more in line with the benchmarks done earlier.

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latitude 50N longitude 3W
Contact:

Re: parsing large files (> 5GB)

Post by cormullion »

Code: Select all

$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters
This is on an UltraSPARC 25.
That still seems very slow. Cf:

Code: Select all

$ ./readstdin < test.txt
time:186.978
read 100000 lines and 18999056 characters
on an iMac. Perhaps your newLISP installation went wrong somewhere...

Generally, newLISP is not quite as quick as Perl if you write Perl-y newLISP, but better if you write newLISP-y newLISP. Even when it's not quite as quick, it seems more fun to write.

jopython
Posts: 123
Joined: Tue Sep 14, 2010 3:08 pm

Re: parsing large files (> 5GB)

Post by jopython »

That still seems very slow.

Yes, they (UltraSPARC III) are slow; they belong to the 2001 era. Sun (now Oracle) SPARCs are generally not optimized for single-threaded performance. In fact, SPARC CPUs did not even feature out-of-order execution until recently.
