Here is a variation with some explanations:
Code:
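; read stdin line by line and count every token of each line in context Counts
(while (read-line)
    (bayes-train (parse (current-line) "\\s+" 0) 'Counts))

; print each word followed by its count
(dolist (each (Counts))
    (println (each 0) " " (each 1 0)))

; try it out
newlisp counter.lsp < sometext.txt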
The program uses the same hash-tree data structure as yours, but relies on newLISP's built-in 'bayes-train' function to do the counting. Just like the hash function, it prepends an underscore to each symbol, but hides it again when you get the word list using (Counts).
The example programs in the link mostly split on white-space, which would be "\\s+", but I prefer "\\W+", which does a better job of rejecting funny characters, e.g. when parsing HTML text.
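To see the difference between the two patterns, here is a small sketch. The sample string and the variable names are made up for illustration, and filtering out empty strings with 'clean' is my own addition, not something the counter program does.

```newlisp
; hypothetical sample line, as it might come out of an HTML page
(set 'sample "<p>Hello, world!</p> Hello again")

; split on runs of white-space: punctuation and tags stay glued to the words
(set 'by-space (parse sample "\\s+" 0))

; split on runs of non-word characters: only letters, digits and _ survive
; an empty string can show up where the text starts or ends with a separator,
; so empty strings are filtered out here
(set 'by-nonword (clean empty? (parse sample "\\W+" 0)))

(println by-space)    ; ("<p>Hello," "world!</p>" "Hello" "again")
(println by-nonword)  ; ("p" "Hello" "world" "p" "Hello" "again")
```

Note that "\\W+" still lets tag names like "p" through as words, so real HTML would probably need the tags stripped first.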
The print loop is almost the same as yours. Sorting is really not necessary, because the list produced by (Counts) is already sorted by word. I need an extra index 0, because 'bayes-train' wraps the counts in a list (there could be more than one count per word). As a bonus you get the total of all words in Counts:total.
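If you want the most frequent words first instead of alphabetical order, something like the following should work. It is only a sketch: the training list is made-up data standing in for a real file, and 'by-count' is a name I picked for the example.

```newlisp
; hypothetical training data standing in for the lines read from a file
(bayes-train '("the" "cat" "saw" "the" "dog" "the") 'Counts)

; (Counts) yields pairs like ("cat" (1)); compare on the first count, descending
(set 'by-count (sort (Counts) (fn (a b) (> (a 1 0) (b 1 0)))))

(dolist (each by-count)
    (println (each 0) " " (each 1 0)))

; Counts:total is a list as well, so index 0 picks out the single total
(println "total " (Counts:total 0))
```

With this data "the" should come out on top with a count of 3, and the total should be 6. The order among words with equal counts is whatever 'sort' leaves it as.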