## approximate string matching using trigrams

Q&A's, tips, howto's
nallen05
Posts: 21
Joined: Sun Apr 19, 2009 10:12 pm

### approximate string matching using trigrams

My friend Ric Szopa wrote a "similarity" function in clojure to compare how similar two strings are using the trigrams approach (find all the sequential 3-letter combos in both words and measure the overlap):

Code: Select all

``````# trigrams in clojure (author: Ric Szopa)
(defn- trigrams [str]
(let [n 3
cleaned (remove #(Character/isWhitespace %) (concat padding
(clojure.string/lower-case str)))
freqs (frequencies (partition n 1 cleaned))
norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]
(into {} (for [[k v] freqs] [k (/ v norm)]))))

(defn similarity [left right]
(if (= left right)
1.0
(let [left (trigrams left)
right (trigrams right)]
(apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))
``````
I decided to try writing it in newlisp...

Code: Select all

``````(define (ngrams d str, (norm 0) (str-clean (append "  " (lower-case (replace " " str "")))))
(dotimes (i (- (length str-clean) 2))
(bayes-train (unpack "s3" (+ i (address str-clean))) d))
(dotree (x d true)
(inc norm (pow ((context d x) 0))))
(setq norm (sqrt norm))
(dotree (x d true)
(context d x (div ((context d x) 0) norm)))
d)

(define (similarity left right, (accum 0))
(if (= left right)
1.0
(let ((left (ngrams 'Left left))
(right (ngrams 'Right right)))
(dotree (s left true)
(inc accum (mul (context Left s) (or (context Right s) 0))))
accum)))

#  (similarity "banana" "bananana") => 0.9722718241
``````
...which turned out to be only a few lines longer.

In order to speed it up I tried the following strategies:
* store data set in a context and iterate over it using dotree (instead of list + dolist)
* i used low-level "unpack" for faster string iteration
* use built-in methods when possible (eg bayes-train)

One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').

My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?

Take care

Nick