approximate string matching using trigrams

nallen05 · Post by **nallen05** » Mon Apr 30, 2012 8:21 pm

My friend Ric Szopa wrote a "similarity" function in clojure to compare how similar two strings are using the trigrams approach (find all the sequential 3-letter combos in both words and measure the overlap):

Code: Select all

# trigrams in clojure (author: Ric Szopa)
(defn- trigrams [str]
 (let [n 3
       padding (repeat (dec n) \$)
       cleaned (remove #(Character/isWhitespace %) (concat padding
(clojure.string/lower-case str)))
       freqs (frequencies (partition n 1 cleaned))
       norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]
   (into {} (for [[k v] freqs] [k (/ v norm)]))))

(defn similarity [left right]
 (if (= left right)
   1.0
   (let [left (trigrams left)
         right (trigrams right)]
     (apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))

I decided to try writing it in newlisp...

Code: Select all

(define (ngrams d str, (norm 0) (str-clean (append "  " (lower-case (replace " " str "")))))
  (dotimes (i (- (length str-clean) 2))
	(bayes-train (unpack "s3" (+ i (address str-clean))) d))
  (dotree (x d true)
    (inc norm (pow ((context d x) 0))))
  (setq norm (sqrt norm))
  (dotree (x d true)
    (context d x (div ((context d x) 0) norm)))
	d)

(define (similarity left right, (accum 0))
  (if (= left right)
	  1.0
	  (let ((left (ngrams 'Left left))
			(right (ngrams 'Right right)))
	  	(dotree (s left true)
	 	  (inc accum (mul (context Left s) (or (context Right s) 0))))
		accum)))

#  (similarity "banana" "bananana") => 0.9722718241

...which turned out to be only a few lines longer.

In order to speed it up I tried the following strategies:
* store data set in a context and iterate over it using dotree (instead of list + dolist)
* i used low-level "unpack" for faster string iteration
* use built-in methods when possible (eg bayes-train)

One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').

My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?

Take care

Nick