..some time ago I happened to read a message from some sysadmins at the main BBC website. They were discussing the development of yet another "perl framework" to take care of multiple problems, one of which seemed to be too many files in their web directories: the files had become unmanageable. Shouldn't we talk not only about abstractions and principles, but also see if and how NewLisp could be used for scripting in the spirit of perl?
There is a very simple solution, but for some incomprehensible reason few people seem to be aware of it: use tar files, and index them. You'll be able to access contents in a tiny fraction of a second, so in effect your tar becomes the equivalent of a compressed read-only filesystem with random file access.
When "tar" itself reads the archive, it linearly scans the full length, and therefore it is way too slow.
I myself use it as a backend for storing archives of, for example, last year's postings on a blogging site, or to keep other collections of documents.
One obvious advantage is that you stop relying on unbundled software (many such solutions assume a relational db for storage), and with the ability of NewLisp to create tiny standalone executables one can build a really portable cgi application with a web interface.
One might add indexed tar storage to the NewLisp wikis to manage large file and postings collections.
1. The idea
Tar files are a concatenation of their member files, with headers added, written in 512-byte blocks; the end of each content section is padded with \000s.
Usually you'll want to keep gzip-compressed files inside (alternatively, make a tar of uncompressed files and compress the whole bundle; that option is discussed later).
To index a tar, therefore, one needs to read the archive in 512-byte pieces and check for the "magic string": the word "ustar" at offset 257 signals the header of the next member file. The indexer should then read bytes 0-100 of the header, which contain the name of the included file, padded with \000s.
There is more information in the header; Wikipedia is the quickest reference, if one wishes to avoid the dense official descriptions.
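To illustrate (a sketch with made-up names, separate from the indexer below), a single header block can be checked and picked apart like this:
Code:
#!/usr/bin/newlisp
; sketch: inspect the first 512-byte block of a tar
(set 'fh (open "archive.tar" "read"))
(read-buffer fh 'block 512)
(if (= (257 5 block) "ustar")                 ; magic string at offset 257
    (begin
      ; bytes 0-99 hold the member name, padded with \000s
      (println "name: " (string (0 100 block)))
      ; bytes 124-134 hold the member size, in octal digits
      (println "size: " (int (124 11 block) 0 8) " bytes")))
(close fh)
(exit)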
Next the indexer counts the 512-byte blocks until the next header with a filename, and the result, written to a file in a form convenient for the reader program, will look like:
archive/dir/some/file/name 1235 14
archive/dir2/some/other/file/name 3214 3
...
For the reader program this means: wind/seek 1235*512 bytes from the beginning, then get 14 blocks (until the next member file starts), and process them to get rid of the header and padding garbage.
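For the first record above, for instance, the reader seeks 1235 * 512 = 632320 bytes into the archive and reads 14 * 512 = 7168 bytes.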
A skeleton indexer
I offer a simple "skeleton script" below. My primary uses of these scripts are not as command-line utilities, so the argument reading is primitive, etc.
Code:
#!/usr/bin/newlisp
;;@module index-tar
; expects exactly one argument: the tar file to index
(if (!= (length (main-args)) 3)
    (and (println "\n\tUSAGE: " (main-args 1) " Tar_file_name\n") (exit)))
(set 'l_a (2 (main-args)))     ; the argument list
(if (file? (set 'tarball_file (pop l_a)))
    (set 'fh_tarfile (open tarball_file "read"))
    (and (println "\n\tcould not open tar file\n") (exit)))
(set 'cnt 0)                   ; running count of 512-byte blocks
(set 'prev_offset 0)           ; block offset of the previous member
(set 'prev_name "dummy")       ; name of the previous member
(set 'str_accumulator "")      ; the index is accumulated in memory
After some experimentation, I found that the fastest output NewLisp can provide is writing into a memory buffer (a string).
One "trick" I use: after reading the filename from the first 100 bytes of the header (which is already a string), I cast it as a string again:
Code:
(set 'prev_name (string (0 100 str_processed_block)))
The extracted filename comes padded with \000s:
"archive/my/file/name\000\000\000\000\000\000....\000"
0---------------------------------------------------100
Casting this string as a string again seemed an inexpensive way to truncate the padding.
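In case one prefers not to rely on that behaviour of the cast, the padding can also be cut explicitly (a sketch; raw_name stands for the extracted 100-byte field):
Code:
; truncate at the first \000 by hand instead of casting
(set 'prev_name
  (slice raw_name 0 (or (find "\000" raw_name) (length raw_name))))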
P.S. Correction and speedup: the fastest way to create the string is through the "format" command, without explicit casts. The script below is updated.
Code:
(while (read-buffer fh_tarfile 'tar_chunk 4096)
  (dotimes (block_cnt 8)
    (set 'str_processed_block ((* block_cnt 512) 512 tar_chunk))
    (and
      (= (257 5 str_processed_block) "ustar") ; --> this is a header block
      (if (= cnt 0)
          ; the very first header: nothing to flush yet
          (set 'prev_name (string (0 100 str_processed_block)))
          ; a later header: emit the record of the previous member
          (write-buffer str_accumulator
            (format "%s %d %d\n" prev_name prev_offset (- cnt prev_offset))))
      (set 'prev_name (string (0 100 str_processed_block)))
      (set 'prev_offset cnt))
    (inc 'cnt))   ; end of dotimes - inspection of 512-byte blocks
)                 ; end of while - gobbling 4k chunks
; the record of the last member in the archive
(write-buffer str_accumulator
  (format "%s %d %d\n" prev_name prev_offset (- (- cnt 1) prev_offset)))
(write-buffer 1 str_accumulator) ; result to STDOUT
(close fh_tarfile)
(exit)
The length of the last file in the archive will be wrong (too large, since the block count runs to the end of the archive, trailing zero blocks included), but it won't interfere with correct file extraction.
USAGE: scriptname tar_file_name > index_file_name
The indexer is sort of slow. For my tests I use a 90MB tar of tiny (5-50kb) text/html files, gzipped before they were put into the archive. The whole archive uncompressed is 250MB and consists of roughly 31k files.
The indexer goes through the 90MB and 37000 items (dirnames and some trash add to the 31000 useful files) in roughly 3.9 seconds on an old machine (a 500MHz Pentium).
Indexers written in C do that in roughly 1.0-1.3 seconds.
Because indexing is a one-time and infrequent job, I decided I could live with it, as long as the main scripts for the project (file listing/search and extraction from this "read-only tar filesystem") were fast enough.
2. List and get files quickly from large tar archives
I found two ways to scan files for their contents fast in NL:
(a) use the "search" operator. To speed it up roughly 3 times one might want to reset a buffer size in the newlisp code (see my previous posting); it then becomes comparable to perl for up to thousands of lines, and lags behind after that.
(b) slurp the contents of the whole file into memory, and use string operators to search for information in it.
Below are the two short functions, "f_search_in_file" and "f_slurp_and_search" that do that.
The remaining code snippets can be concatenated into one working skeleton script, which is fed the index file in the format shown above and the tar file.
Depending on the options, the script will either list filenames (matching them by substrings etc., using regexps; the range of matches to output is given as options too) or cut files out. So first adjust your regexp to "list" what you need, then change the option to "get" the file.
In this skeleton script I simply cut the file out. With the header and trailing padding it is a small one-file tar archive itself (or a tar of as many files as you decided to extract at once).
USAGE:
<prg> list|get search|slurp from_match_num to_match_num 'regexp_pattern' index_file tar_file
for example:
(a) ./script.lsp list slurp 100 120 '2006-.*' index.txt archive.toz.tar
(find the 100th through 120th matches for the string 2006-)
(b) ./script.lsp get slurp 10000 10001 '/2006-.*' index.txt archive.toz.tar |tar xOf - |zcat
(cut out the 10000th match: the index file is read into memory, the faster find in this case, to get the record; then the offset and length are used to cut the file out)
I could untar in NL, but because tar itself is a tiny 140k utility, I simply redirect into "tar xOf -" to get the stream on STDOUT; and because the files inside the archive are gzipped, one can decompress them with zcat (also tiny), or serve them as "gz" files: most web browsers support gzip compression.
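As an aside, serving such a cutout over CGI could look roughly like this (a sketch; the variable tar_item and the size arithmetic are my assumptions, the headers are standard HTTP):
Code:
; hypothetical CGI fragment: send the gzipped member as-is and let
; the browser decompress it. Assumes tar_item holds a cutout as
; produced below: one 512-byte header plus \000-padded contents.
(set 'member_size (int (124 11 tar_item) 0 8))  ; size field, octal
(print "Content-Type: text/html\r\n")
(print "Content-Encoding: gzip\r\n")
(print "Content-Length: " member_size "\r\n\r\n")
(write-buffer 1 (slice tar_item 512 member_size))
The skeleton script itself follows: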
Code:
#!/usr/bin/newlisp
;;@module .......
#------FUNCS---------
(define (f_search_in_file)
  (set 'v_cnt 0)
  ; grep-like scan of the index file with the "search" operator
  (while (search fh_index (string str_pattern) 0)
    (inc 'v_cnt)
    (if (> v_cnt (int from_rec_num))
        ; inside the requested range: output or extract the record
        (apply f_output (list (read-line fh_index)))
        ; before the range: just step over the match
        (read-char fh_index))
    (and
      (>= v_cnt (int to_rec_num))
      (close fh_index)
      (exit)))    ; end of while = grep for the given file
  (close fh_index)
)                 ; end of f_search_in_file
Code:
(define (f_slurp_and_search)
  (set 'cnt 0)
  ; a regexp "replace" over the whole slurped index; the replacement
  ; expression is used for its side effects only
  (replace (string str_pattern)   ; -- what to match
    (read-file index_file)        ; -- where: the index file in memory
    (and                          ; -- pseudo-replacement actions
      (inc 'cnt)
      (if (> (int cnt) (int from_rec_num))
          (apply f_output (list $0)))   ; $0 is the matched record
      ; (println $0) or (f_get $0) could be hard-wired here instead
      (and (>= (int cnt) (int to_rec_num))
           (exit)))   ; stop after the last requested match
    0)                ; regexp option
)                     ; end of f_slurp_and_search
Next comes the primitive function to cut the contents out of the tar:
Code:
(define (f_get str_tar_entry)
  ; a record looks like: name offset length (in 512-byte blocks)
  (set 'str_parsed (parse str_tar_entry))
  (seek fh_tarball (* (int (nth -2 str_parsed)) 512))
  (read-buffer fh_tarball 'tar_item (* (int (nth -1 str_parsed)) 512))
  ; writing the cutout (which is a small tar itself) to STDOUT
  (write-buffer 1 tar_item)
  ; writing into a tar file
  ;....
  ; writing and untarring into the filesystem
  ;....
)   ; end of f_get
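One possible fill-in for the "writing into a tar file" branch above (a sketch using the same globals; the helper name is mine):
Code:
; write the cutout to a standalone tar file; two 512-byte \000
; blocks are appended as the end-of-archive marker, so the result
; is a complete archive that standard tar reads without complaint
(define (f_get_to_file str_tar_entry str_out_name)
  (set 'str_parsed (parse str_tar_entry))
  (seek fh_tarball (* (int (nth -2 str_parsed)) 512))
  (read-buffer fh_tarball 'tar_item (* (int (nth -1 str_parsed)) 512))
  (set 'fh_out (open str_out_name "write"))
  (write-buffer fh_out tar_item)
  (write-buffer fh_out (dup "\000" 1024))     ; end-of-archive marker
  (close fh_out))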
Main sets command-line options.
Code:
#------MAIN----------
(if (!= 9 (length (main-args)))
    (and
      (println "\n\tUSAGE: " (main-args 1) " 'list|get' 'slurp|search' 'From_rec_num' 'To_rec_num' 'RegexpPattern' 'Indexname' 'Tarballname'\n")
      (exit)))
; -- these are globals: no args passing to the funcs
(set 'l_a (2 (main-args)))
(set 'grep_untar (pop l_a))
(set 'slurp_search (pop l_a))
(set 'from_rec_num (pop l_a))
(set 'to_rec_num (pop l_a))
(set 'str_pattern (pop l_a))
(if
  (= grep_untar "list") (constant 'f_output 'println)
  (= grep_untar "get")  (constant 'f_output 'f_get)
  ; else: grep-like listing of files -- or getting a cutout tar of file(s)
  (and (println "\n\t action not defined: list|get\n") (exit)))
(if (file? (set 'index_file (pop l_a)))
    (set 'fh_index (open index_file "read"))
    (and (println "\n\t could not find the index file\n") (exit)))
(if (file? (set 'tarball (pop l_a)))
    (set 'fh_tarball (open tarball "read"))
    (and (println "\n\t could not find the tar file\n") (exit)))
(if
  (= slurp_search "slurp")  (f_slurp_and_search)
  (= slurp_search "search") (f_search_in_file)
  (and (println "\n\t type of processing not defined: slurp|search\n") (exit)))
(exit)
3. The results
The results of this experiment are good; the script is quite usable.
I know of two other utilities for indexing tars.
One is called "tarix" and its home is on sourceforge.net. It's written in C, but the author made an unfortunate choice: claiming a need for compatibility, he reads the index file not line-by-line, but character-by-character. As a result extraction is slow: on my machine, with the 90MB tar (and a 2MB index of 37k entries, 31k of them useful), it takes approx. 0.300 seconds to get the needed line from the index (the subsequent seeking and cutting out is fast enough).
The "tarix" author, however, also implemented an idea put forward by the author of the (well-known) "zlib" library. It's also possible to index in the similar manner compressed files!
Basically, if instead of tar of many compressed file1.gz ... fileX.gz you use tar.gz, it's possible to create a double index. So when extracting a file, (a) find it in the index, (b) stream uncompression from the point before the file begins in the tar, to a point after it ends, and cut out the file tar blocks the roughly way we have done here.
Zlib author has prototype utility to show how it can be done, and "tarix" is the only program I know that actually picked up the idea.
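A double-index record could, hypothetically, look like the plain one above with one extra field:
Code:
archive/dir/some/file/name 1235 14 52167234
where the last number would be a position in the compressed stream from which inflation can be restarted (the access points themselves have to be collected while gunzipping the archive once).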
Another note: tarix's invocation is quirky and not easy to use manually; it may be OK in scripts, however.
My tests of the scripts above show much better results. Extraction takes
0.035 - 0.070 - 0.090 - 0.110 seconds for the majority of the files.
Getting files from the far end of the tar archive takes 0.200 seconds for "slurp" and 0.600 seconds for "search".
The third set of utilities for such applications I found in some scientific software for working with biology data. I had to tear the sources out of the project tree; they use some interesting tricks with memory management and are therefore not usable without their accompanying library of around 200k (the utilities themselves are tiny, 50-60k).
Those can extract files in hundredths of a second (e.g. 0.025 - 0.070, up to 0.100), even for the farthest files.
So, this humble and rather quick NewLisp hack comes in a strong second, and is especially good for wiki and blog uses, where new files are accessed frequently and the old ones at the far end are read much less.
I believe using indexed tars is a really good solution for managing file collections, for websites, or, if someone adds the necessary scripts to a file manager (such as the "Midnight Commander" many use on Linux), as a means to browse and read from tars without the need to linearly scan them (slow) and/or extract their contents into temporary directories (overfilling your filesystems).
Some results on an old 500MHz machine:
P.S. Comparison with sqlite (an excellent indexer) shows that unless you know precisely the filename you're looking for, you (or the web site user) are likely to search for a substring.
1. Grepping for files:
list/grep one file (end of index) -- "search": 0m0.675s
list/grep one file (end of index) -- "slurp": 0m0.243s
Listing a single file, though, is not what the utility is really meant for. In real web usage the listing is done by pages (say, 20 at a time):
grep/find/list 12300 filenames - "search" 0m3.072s
grep/find/list 12300 filenames - "slurp" 0m1.246s
2. Extracting files
get/extract one file from the end (13000th match) -- "search" 0m0.675s
get/extract one file from the end (13000th match) -- "slurp" 0m0.243s
Extraction from the beginning or the middle of the archive:
get/extract the 122nd matching file (search) 0m0.043s
get/extract the 122nd matching file (slurp) 0m0.065s
get/extract the 633rd matching file (search) 0m0.088s
get/extract the 633rd matching file (slurp) 0m0.089s
get/extract the 1220th matching file (search) 0m0.100s
get/extract the 1220th matching file (slurp) 0m0.108s
While indexed sqlite access is very quick (0.015s for my test file), any search with "LIKE username%" will simply scan the index linearly, and in my tests sqlite gave worse results than the utility described here.
P.P.S. One can index files not only by their filenames, but by other keys (categories, dates, keywords, etc.). The speed of reading the index obviously depends on the size of the index, so all unnecessary information should be cut out.
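For instance, a record carrying a date key (a format I just made up) could look like
Code:
2006-03-14 archive/dir/some/file/name 1235 14
and the same list/get machinery would match on the date key just as it matches on filename substrings.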