read-file speed
Posted: Sat Feb 04, 2006 8:03 am
I'm trying to create a newLISP version of a little Ruby script that I quite liked, with the same structure and at least equivalent speed.
The hash table/symbol stuff isn't the problem, but I've found that the way I'm using parse to build the word hash is much slower, because I'm using regular expressions.
My version sort of works, but it's so much slower than the Ruby original that I'd like to know how to make it faster. Any clues or suggestions welcome! Here's the Ruby script:
Code:
class Histogram
  def initialize
    @tally = Hash.new(0)
  end

  def analyze(s)
    s.split.each do |word|
      myword = word.downcase.gsub(/[^a-z0-9]/, "")
      @tally[myword] = @tally[myword] + 1 if !myword.empty?
    end
    @tally.sort
  end
end

analysis = Histogram.new.analyze(File.new(ARGV[0]).read)
print analysis.length
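The Ruby version leans on Hash.new(0), so unseen words default to a count of 0 and can be incremented straight away. A minimal standalone illustration of that tallying step, on a made-up sample string rather than a file:

```ruby
# Tally words the same way Histogram#analyze does:
tally = Hash.new(0)                       # missing keys default to 0
"The cat, the hat!".split.each do |word|
  w = word.downcase.gsub(/[^a-z0-9]/, "") # normalise: lowercase, strip punctuation
  tally[w] += 1 unless w.empty?
end
p tally.sort  # => [["cat", 1], ["hat", 1], ["the", 2]]
```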
This is what I have so far:
Code:
(context 'HASH)
(context 'MAIN)

(dolist (file-name (2 (main-args)))
  (set 'buff (parse (read-file file-name) "\s" 8))
  (dolist (the-word buff)
    ; lower-case the word and strip anything but a-z and 0-9
    (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
    (if (!= (context HASH (string "_" the-word)) nil)
        ; already seen: increment its count
        (begin
          (set 'occurrences (context HASH (string "_" the-word)))
          (set (context HASH (string "_" the-word)) (inc 'occurrences)))
        ; first sighting: start the count at 1
        (set (context HASH (string "_" the-word)) 1))))

(sort (symbols HASH))
(dolist (s (symbols HASH))
  (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))
(println (length word-list))
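Since the complaint is about speed, it helps to put a number on the Ruby side for comparison. A rough timing sketch (a synthetic in-memory string stands in for the input file, and Benchmark.realtime gives wall-clock time; this is a quick check, not a rigorous benchmark):

```ruby
require "benchmark"

# Hypothetical sample input: repeat a short phrase to get a sizeable text.
text = "the quick brown fox " * 50_000

time = Benchmark.realtime do
  tally = Hash.new(0)
  text.split.each do |word|
    w = word.downcase.gsub(/[^a-z0-9]/, "")
    tally[w] += 1 unless w.empty?
  end
end
puts format("tallied in %.3f s", time)
```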