parsing large files (> 5GB)
Parsing large log files line-by-line takes a very long time (several times slower than Perl).
------------------
(while (read-line file)
(if (regex p1 (current-line) 0x10000)
(inc foo)))
------------------
Is there anything I could use within newLISP to shorten the time spent reading files? For example, is it possible to read a file in big chunks (say 10MB) and then parse each chunk in memory for faster access?
Re: parsing large files (> 5GB)
In the next development or stable release version (January 2012), at least 'read-line' from STDIN (read-line) will be 3 times faster than in the old version when reading from STDIN. If you can process your log files like this: process < thelogfile.txt, this will help you.
Until then, yes, reading big chunks of the file into memory and iterating like this will be much faster:

(dolist (line (parse chunk "\n"))
    ...
)

ps: even in the old version, (read-line) via STDIN is already about 3 times faster than reading from a file channel.
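A minimal sketch of the chunked approach. It assumes newLISP's read-buffer (which reads up to a byte count into a symbol and returns nil at EOF); since a chunk boundary may split a line, the trailing partial line is carried over into the next chunk. The file name and pattern here are placeholders, not from the original script:

```
; sketch: count lines matching a compiled pattern, reading 10 MB at a time
(set 'p1 (regex-comp "GET /index.html"))    ; placeholder pattern
(set 'file (open "thelogfile.txt" "read"))  ; placeholder file name
(set 'carry "")                             ; partial line from the previous chunk
(set 'foo 0)
(while (read-buffer file 'chunk 10485760)   ; 10 MB per read
  (set 'lines (parse (append carry chunk) "\n"))
  (set 'carry (pop lines -1))               ; keep the incomplete last line
  (dolist (line lines)
    (if (regex p1 line 0x10000) (inc foo))))
(if (regex p1 carry 0x10000) (inc foo))     ; don't forget the final line
(close file)
(println foo)
```

The carry/pop step is what makes chunking safe: parse always leaves the possibly-truncated last element in the list, and prepending it to the next chunk reassembles the split line.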
Re: parsing large files (> 5GB)
The difference is really huge: 8 minutes in newLISP vs. 2 seconds in Perl for a 100MB log file.
I am not a fan of Perl, but I am forced to use it because of its text-processing performance.
$ time ./apache.lsp xaa
3885
real 8m2.346s
user 1m28.495s
sys 6m30.581s
$ cat apache.lsp
;(set 'yesterd (date (date-value) -480 "%d/%b/%Y.+"))
(set 'yesterd [text]10/Dec/2011.+[/text])
(set 'reg [text]GET /index.html HTTP/1.1[/text])
(set 'pattern_str (append yesterd reg))
(set 'p1 (regex-comp pattern_str))
(set 'file (open ((main-args) 2) "read")) ; the argument to the script
(while (read-line file)
(if (regex p1 (current-line) 0x10000)
(inc foo)))
(if foo (println foo) (println 0))
(exit)
$ time perl -lne 'BEGIN{$h=0;}if (m/10\/Dec\/2011\S+\s+\S+\]\s+\x22GET\s+\/index.html\s+HTTP\/1.1/ox){$h++;}END{print $h}' xaa
3885
real 0m2.704s
user 0m2.285s
sys 0m0.387s
Re: parsing large files (> 5GB)
That is really a big difference, but I don't believe that newLISP is generally much slower than Perl in text processing. If you look at the benchmarks here: http://www.newlisp.org/benchmarks/, you will see that the differences are minor, and wherever they do appear, line-by-line file reading is involved. But even that doesn't explain the huge difference you are seeing.
I calculate about 120 ms of processing time per line! Even with slower read-line I/O that just doesn't make sense.

I generated the following test file test.txt:

(set 'file (open "test.txt" "w"))
(dotimes (i 100000) (write-line file (join (map string (rand 100 100)))))

using this program:

#!/usr/bin/newlisp
(set 'chan (open (main-args -1) "r"))
(println "time:" (time (while (read-line chan)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))
(println "read " lcnt " lines and " cnt " characters")
(exit)

and ran it:

~> ./readchannel test.txt
time:11787.971
read 100000 lines and 18999043 characters
~>

which is about 120 microseconds per line. Adding simple regular expressions made less than a 1% difference.

Now using STDIN to feed the file:

#!/usr/bin/newlisp
(println "time:" (time (while (read-line)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))
(println "read " lcnt " lines and " cnt " characters")
(exit)

running it:

~> ./readstdin < test.txt
time:264.203
read 100000 lines and 18999043 characters

which is about 2.6 microseconds per line.

The difference between the two methods is that the fast one uses buffered stream reading via fgetc(), while the slower one uses file-handle based read(handle, &chr, 1).

What is the experience of others doing text processing with newLISP?

Ps: all measurements were made with 10.3.3 on Mac OS X 10.7.2; on 10.3.10, the faster method takes 1 microsecond or less per line.
Re: parsing large files (> 5GB)
Hmm..
$ ./readchannel test.txt
time:87445.189
read 100000 lines and 19000792 characters
$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters
This is an Ultrasparc 25.
Re: parsing large files (> 5GB)
Now, using the STDIN method, the original script went down from 8 minutes to 13 seconds. Phew.
Looks like the byte-by-byte read() method is a bad idea.
time ./apache.lsp < xaa
3885
real 0m13.088s
user 0m12.662s
sys 0m0.243s
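For reference, a sketch of what the STDIN version of the script might look like: the original matching loop unchanged, but with plain (read-line) reading STDIN instead of an open file channel (the pattern literal is taken from the script earlier in the thread):

```
#!/usr/bin/newlisp
; sketch: same matching loop, reading STDIN via plain (read-line)
; invoked as: ./apache.lsp < thelogfile.txt
(set 'p1 (regex-comp "10/Dec/2011.+GET /index.html HTTP/1.1"))
(set 'foo 0)
(while (read-line)                       ; no channel argument: reads STDIN
  (if (regex p1 (current-line) 0x10000)  ; 0x10000: p1 is precompiled
      (inc foo)))
(println foo)
(exit)
```

The only functional change from the original is dropping the (open ...) call and the channel argument to read-line.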
Re: parsing large files (> 5GB)
Yes, and on 10.3.10 and later you will get down to about 4 seconds versus about 2.7 for Perl, which is more in line with the benchmarks done earlier.
Re: parsing large files (> 5GB)
That still seems very slow. Cf.:

This is an Ultrasparc 25.
$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters
$ ./readstdin < test.txt
time:186.978
read 100000 lines and 18999056 characters
Generally, newLISP is not quite as quick as Perl if you write Perl-y newLISP, but better if you write newLISP-y newLISP. And even where it's not quite as quick, it seems more fun to write.
Re: parsing large files (> 5GB)
That still seems very slow.
Yes, they (UltraSPARC III) are slow; they belong to the 2001 era. Sun (now Oracle) SPARCs are generally not optimized for single-threaded performance. In fact, the SPARC CPUs did not even feature out-of-order execution until recently.