Vocabulary / Wordlist / Index

didi
Posts: 166
Joined: Fri May 04, 2007 8:24 pm
Location: Germany

Vocabulary / Wordlist / Index

Post by didi »

This is my first try at generating a vocabulary:

Code: Select all

; mparse_words_asc_lower_case.lsp  dmemos 11.jan.2009
( silent    ; no console-echo
 ( println ( date ) ) 
 ( change-dir "C:\\Documents and Settings\\didi\\My Documents\\newLISP" )
 ( set 'src_txt ( read-file  "wrnpc12.txt" ))  ;  war_and_peace
 ( set 'word_char [text]abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_[/text])
 ( set 'sep_char [text] \n\r\t!\"#$%'()*+,-./:;<=>?@[]~"&/\\[/text])
 ( set 'lineout "" )
 ( set 'out_lst '() )

 ( replace "\r\n" src_txt " " ) ; replace all CR-LF with " " 
 ( set 'src_txt ( lower-case src_txt ))

 ( while ( < 0 (length src_txt) )
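   ; NOTE (added comment): this scans the text one character at a time in
   ; interpreted code, which is presumably where most of the time goes on
   ; multi-megabyte inputs (compare the timings discussed below)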
   ( set 'x ( pop src_txt ))
   ( if ( find x word_char ) 
       ( push x lineout -1)            ; word_char found
       ( if ( find x sep_char )        ; else no word_char
         ( if ( < 0 (length lineout))                                  
           ( begin
            ( push lineout out_lst -1)
            ( set 'lineout "")))))
 )

 ( set 'word_list (sort (unique out_lst)))
 ( write-file "word_list_wrnpc12.txt" (string word_list) ) 
)
( println ( date ))
( println  "bye " ) 
For me this is really short and fast enough, but I'm sure the expert newLISPers can do it better.
I tried some extremely big texts: 600 kB of text needs 7 minutes. The book "War and Peace" is over 3 MB as ASCII text. Would it be faster to divide it into parts and then join the lists?
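
Side note (not part of the original post): instead of printing (date) before and after the run, newLISP's built-in time returns the evaluation time in milliseconds, so the whole script can be measured directly. A minimal sketch, assuming the script is saved under the name from its header comment and run from the same directory:

Code: Select all

; measure the whole run in milliseconds (sketch, not from the thread)
(println (time (load "mparse_words_asc_lower_case.lsp")) " ms")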

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Post by Lutz »

You could do a 600 kilobyte file in a third of a second:

Code: Select all

(sort (unique (find-all "\\w+" (read-file "Sherlock.txt"))))
or in 2 seconds, fetching the text from the net:

Code: Select all

(sort (unique (find-all "\\w+" 
    (read-file "http://www.gutenberg.org/dirs/etext99/advsh12.txt"))))
;-)

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latitude 50N longitude 3W
Contact:

Post by cormullion »

Lutz wins... :) Or:

Code: Select all

(bayes-train (parse (read-file "big.txt") "\\W" 0) 'Vocab)
(map first (Vocab))
I think it's smarter to hand the hard work over to newLISP's regex engine, which is fast, rather than trying to split the words up in newLISP code yourself...

didi
Posts: 166
Joined: Fri May 04, 2007 8:24 pm
Location: Germany

Post by didi »

OK - generating the word list for the whole 'War and Peace' now takes under 7 seconds.
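
Presumably the revised script boils down to something like the following sketch (not didi's actual code; it just combines Lutz's find-all one-liner with the lower-case call and the file names from the first post):

Code: Select all

; sketch: the regex does the word splitting, sort/unique build the vocabulary
(set 'word_list
  (sort (unique (find-all "\\w+" (lower-case (read-file "wrnpc12.txt"))))))
(write-file "word_list_wrnpc12.txt" (string word_list))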

At least my sort/unique line was not bad ;-)
