newlispfanclub.alh.net

Posted: **Mon Apr 30, 2012 8:21 pm**

My friend Ric Szopa wrote a "similarity" function in clojure to compare how similar two strings are using the trigrams approach (find all the sequential 3-letter combos in both words and measure the overlap):

Code: Select all

# trigrams in clojure (author: Ric Szopa)
(defn- trigrams [str]
 (let [n 3
       padding (repeat (dec n) \$)
       cleaned (remove #(Character/isWhitespace %) (concat padding
(clojure.string/lower-case str)))
       freqs (frequencies (partition n 1 cleaned))
       norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]
   (into {} (for [[k v] freqs] [k (/ v norm)]))))

(defn similarity [left right]
 (if (= left right)
   1.0
   (let [left (trigrams left)
         right (trigrams right)]
     (apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))

I decided to try writing it in newlisp...

Code: Select all

(define (ngrams d str, (norm 0) (str-clean (append "  " (lower-case (replace " " str "")))))
  (dotimes (i (- (length str-clean) 2))
	(bayes-train (unpack "s3" (+ i (address str-clean))) d))
  (dotree (x d true)
    (inc norm (pow ((context d x) 0))))
  (setq norm (sqrt norm))
  (dotree (x d true)
    (context d x (div ((context d x) 0) norm)))
	d)

(define (similarity left right, (accum 0))
  (if (= left right)
	  1.0
	  (let ((left (ngrams 'Left left))
			(right (ngrams 'Right right)))
	  	(dotree (s left true)
	 	  (inc accum (mul (context Left s) (or (context Right s) 0))))
		accum)))

#  (similarity "banana" "bananana") => 0.9722718241

...which turned out to be only a few lines longer.

In order to speed it up I tried the following strategies:
* store data set in a context and iterate over it using dotree (instead of list + dolist)
* i used low-level "unpack" for faster string iteration
* use built-in methods when possible (eg bayes-train)

One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').

My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?

Take care

Nick

newlispfanclub.alh.net

approximate string matching using trigrams

approximate string matching using trigrams