## approximate string matching using trigrams

Q&A's, tips, howto's

### approximate string matching using trigrams

My friend Ric Szopa wrote a "similarity" function in clojure to compare how similar two strings are using the trigrams approach (find all the sequential 3-letter combos in both words and measure the overlap):

Code: Select all
`# trigrams in clojure (author: Ric Szopa)(defn- trigrams [str] (let [n 3       padding (repeat (dec n) \\$)       cleaned (remove #(Character/isWhitespace %) (concat padding(clojure.string/lower-case str)))       freqs (frequencies (partition n 1 cleaned))       norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]   (into {} (for [[k v] freqs] [k (/ v norm)]))))(defn similarity [left right] (if (= left right)   1.0   (let [left (trigrams left)         right (trigrams right)]     (apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))`

I decided to try writing it in newlisp...

Code: Select all
`(define (ngrams d str, (norm 0) (str-clean (append "  " (lower-case (replace " " str "")))))  (dotimes (i (- (length str-clean) 2))   (bayes-train (unpack "s3" (+ i (address str-clean))) d))  (dotree (x d true)    (inc norm (pow ((context d x) 0))))  (setq norm (sqrt norm))  (dotree (x d true)    (context d x (div ((context d x) 0) norm)))   d)(define (similarity left right, (accum 0))  (if (= left right)     1.0     (let ((left (ngrams 'Left left))         (right (ngrams 'Right right)))        (dotree (s left true)         (inc accum (mul (context Left s) (or (context Right s) 0))))      accum)))#  (similarity "banana" "bananana") => 0.9722718241`

...which turned out to be only a few lines longer.

In order to speed it up I tried the following strategies:
* store data set in a context and iterate over it using dotree (instead of list + dolist)
* i used low-level "unpack" for faster string iteration
* use built-in methods when possible (eg bayes-train)

One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').

My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?

Take care

Nick
nallen05

Posts: 21
Joined: Sun Apr 19, 2009 10:12 pm