Parse very big XML file (Openstreetmap)

hilti
Posts: 140
Joined: Sun Apr 19, 2009 10:09 pm
Location: Hannover, Germany

Post by hilti

Hi!

Does anyone have experience parsing large OSM (OpenStreetMap) files? I'm trying to parse one with (xml-parse), but newLISP reports that there is not enough memory for (read-file).

The file is 32 GB (gigabytes!).

Here's the error message:


newlisp -m 4096 -s 10000 parse.lsp 
newlisp(18433) malloc: *** mmap(size=4258476032) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

ERR: not enough memory in function read-file

Thanks for any suggestions.
Marc
--()o Dragonfly web framework for newLISP
http://dragonfly.apptruck.de

conan
Posts: 52
Joined: Sat Oct 22, 2011 12:14 pm

Re: Parse very big XML file (Openstreetmap)

Post by conan

DISCLAIMER: I haven't worked with big XML files.

It seems you have to split your XML file, but I don't know how that will affect xml-parse.

However, from the manual:

"Using a call back function

Normally, xml-parse will not return until all parsing has finished. Using the func-callback option, xml-parse will call back after each tag closing with the generated S-expression and a start position and length in the source XML:"

Maybe that could help.
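
For illustration only, here is the same "call back after each closing tag" idea sketched in Python rather than newLISP, using the standard library's xml.etree.ElementTree.iterparse. Keep in mind that xml-parse takes the whole document as a string, so the callback by itself may not get around the read-file error; the sketch below streams from the file instead. The name for_each_element and the sample data are made up for this example, and it assumes the elements of interest sit directly under the root, as OSM nodes do.

```python
import io
import xml.etree.ElementTree as ET


def for_each_element(source, tag, callback):
    """Stream `source` and call `callback(elem)` each time a `tag` element closes."""
    # Listen for "start" too, so the very first event hands us the root element.
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)
    seen = 0
    for event, elem in events:
        if event == "end" and elem.tag == tag:
            callback(elem)
            seen += 1
            # Drop finished children from the root so memory use stays flat.
            root.clear()
    return seen


# Tiny stand-in for an OSM file (hypothetical data); a real run would pass
# an open file or a path instead of a BytesIO.
sample = io.BytesIO(
    b'<osm version="0.6">'
    b'<node id="1" lat="52.37" lon="9.73"/>'
    b'<node id="2" lat="52.38" lon="9.74"><tag k="name" v="Hannover"/></node>'
    b'<way id="3"><nd ref="1"/><nd ref="2"/></way>'
    b'</osm>'
)
ids = []
count = for_each_element(sample, "node", lambda e: ids.append(e.get("id")))
print(count, ids)  # prints: 2 ['1', '2']
```

The callback sees each node once and nothing is kept afterwards, so memory use should depend on the largest single element rather than on the file size.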

rickyboy
Posts: 607
Joined: Fri Apr 08, 2005 7:13 pm
Location: Front Royal, Virginia

Re: Parse very big XML file (Openstreetmap)

Post by rickyboy

If you know something about the makeup of the file, using search with regexes might help:

http://www.newlisp.org/downloads/newlis ... tml#search

Maybe this way you can grab certain "chunks" from the file, pass each chunk to xml-parse, and then (optionally) write the chunk out to another file (with save?). If you're still memory constrained, you may not want to accumulate chunks on the heap. I guess you could bind each chunk to the same symbol inside the loop, though I'm not sure how quickly the old chunks would be reclaimed.
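
A rough sketch of that chunking idea, written in Python rather than newLISP (so none of this is newLISP's search or xml-parse): read the file in fixed-size blocks, use a regex to pull out complete node elements, parse each one on its own, and discard everything already processed. The names iter_nodes and NODE_RE, the block size, and the sample data are hypothetical, and the regex assumes attribute values never contain a raw ">" character.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches one complete <node .../> or <node ...>...</node> element.
# Assumption: no raw ">" inside attribute values.
NODE_RE = re.compile(rb"<node\b[^>]*?(?:/>|>.*?</node>)", re.DOTALL)


def iter_nodes(stream, block_size=1 << 20):
    """Yield each <node> as a parsed element, keeping roughly one block buffered."""
    buf = b""
    while True:
        block = stream.read(block_size)
        buf += block
        end = 0
        for m in NODE_RE.finditer(buf):
            yield ET.fromstring(m.group(0))
            end = m.end()
        # Keep an unfinished "<node ..." at the tail; otherwise keep at most
        # 4 bytes, enough to catch a "<node" split across two blocks.
        start = buf.find(b"<node", end)
        buf = buf[start:] if start != -1 else buf[max(end, len(buf) - 4):]
        if not block:  # end of file
            break


# Tiny stand-in for an OSM file (hypothetical data); the small block size
# forces elements to straddle block boundaries.
sample = (
    b'<osm version="0.6">'
    b'<node id="1" lat="52.37" lon="9.73"/>'
    b'<node id="2" lat="52.38" lon="9.74"><tag k="name" v="Hannover"/></node>'
    b'<way id="3"><nd ref="1"/><nd ref="2"/></way>'
    b'</osm>'
)
ids = [n.get("id") for n in iter_nodes(io.BytesIO(sample), block_size=16)]
print(ids)  # prints: ['1', '2']
```

Because each chunk is handed to the caller and then dropped, nothing accumulates between iterations; the buffer only grows if a single element is larger than one block.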

Good luck. I'll bet someone has a better idea though. :)
(λx. x x) (λx. x x)
