Counting chars in a string

Jeff · Post by **Jeff** » Tue Aug 14, 2007 8:55 pm

I've got a program I wrote in Python/C that aggregates counts of chars in a string, but I thought I'd take a crack at it in newLisp as well. I want to end up with a list (or array, hash, whatever) of (character (count of character)) for a string. The string I'm testing with is War and Peace (here's a free copy - http://artfulcode.nfshost.com/charcount ... _peace.txt). The function I'm using is:

Code: Select all

(define (count-chars-in-string str)
  (let ((seq (sequence 0 127)))
       (map (fn (i) (cons (char i) (length (find-all (format "\\x%x" i) str))))
            seq)))

I've tried it by iterating through the string (which is the tack I took in C, since a string is really an int array), incrementing an 128-member array with the char's int cast as the index. That takes minutes in newLisp. In C, I'm doing this in under a tenth of a second. I should at least be able to get it under half a second. The current implementation, above, gives me about 3.5 seconds, which is still ridiculously slow.

I also tried using read-char and read-buffer, but neither gave me the results I wanted. Anyone have any ideas?

By the way, here is a url of the type of results I am planning on ending up with: http://artfulcode.nfshost.com/charcounter. This uses Python with a C module that does the counting. The python only pushes the string to the module.

Lutz · Post by **Lutz** » Tue Aug 14, 2007 10:07 pm

This solution is slightly faster, and may scale better too. The counts are in a hash Lex -> Lex:_a Lex:_b etc.

Code: Select all

(define (count-chars-in-string str)
	(bayes-train (explode str) 'Lex)
	(map list (map name (symbols 'Lex)) (map eval (symbols 'Lex)))
)

normally 'bayes-train is used on several lists at the same time, putting all counts found in a list, but it also can be used counting in just one list as shown here.

Lutz

Jeff · Post by **Jeff** » Wed Aug 15, 2007 12:07 pm

Lutz,

That's a helpful use of the function to know about, but that still doesn't give me a real performance boost. Exploding the string is too expensive.

Is there no way to iterate quickly over the chars in a string? What is newLisp's internal representation of a string like? If there isn't a way to do so, could a dostring function be added?

Also, I tried to import my C library I used with Python to do the work for me and had some real difficulty. Here is the C lib:

Code: Select all

#include <stdio>
#include <stdlib>
#include <string>
#define CHARS 128

typedef struct tally
{
	char *text;
	int counts[CHARS];
} TALLY;

TALLY init_counts(char *str);
void count_chars(TALLY *t);
int count_of_char(TALLY *t, char c);

TALLY init_count(char *str)
{
	int i;
	TALLY t;
	t.text = str;
	for (i = 0; i < CHARS; i++)
	{
		t.counts[i] = 0;
	}
	count_chars(&t);
	return t;
}

void count_chars(TALLY *t)
{
	int i;
	for (i = 0; i <strlen>text); i++)
	{
		t->counts[(int) t->text[i]]++;
	}
}

int count_of_char(TALLY *t, char c)
{
	return t->counts[(int) c];
}

After importing init_counts, I run (set 'foo (init_counts very-large-string)). This does not seem to work at all. I get nothing back and the action hangs. I also tried a C function that only takes a string and prints the first 100 chars, just to test the newLisp code, and it seems like passing a string to a library function takes an inordinate amount of time (which is why I was wondering about newLisp's internal string representation).

Any ideas? I would love to get rid of python's overhead in my app by using newLisp and additionally have the ability to run the software in an easy-to-create gui, but 3.5 seconds is way too expensive here.

Lutz · Post by **Lutz** » Wed Aug 15, 2007 4:15 pm

On this page you find a small C program for the character counter and how to compile and import it into newLISP:

http://newlisp.org/index.cgi?CountCharacters

I tried to post it here but the formatting got all mixed up.

Lutz

jrh · Post by **jrh** » Wed Aug 15, 2007 4:22 pm

Jeff wrote:Is there no way to iterate quickly over the chars in a string? What is newLisp's internal representation of a string like?

Strings and text appear as contiguous areas of memory in newLISP. Fastest iteration I've found while staying in newLISP is to index into them from the base address:

Code: Select all

(time (setq buf (dup "\000" 3000000)))

(time (for (x 0 (- (length buf) 1)) (get-char (+ (address buf) x))))

Ryon · Post by **Ryon** » Wed Aug 15, 2007 4:40 pm

I tried to post it here but the formatting got all mixed up.

Does this look right? Some wrapping, but readable?

Code: Select all

#include <stdio.h>
#include <memory.h>

int cnts[128];

int * count_characters(char * str, int length)
   {
   int i;

   memset((void *)cnts, 0, 128 * sizeof(int));

   for(i = 0; i < length; i++)
       cnts[str[i]] += 1;

   return(cnts);
   }


compile it to a shared library


on Mac OS X:
   gcc count-characters.c -bundle -o count-characters.so

on other UNIX
   gcc count-characters.c -shared -o count-characters.so


now use it:

newLISP v.9.2.0 on OSX UTF-8, execute 'newlisp -h' for more info.

> (define count-chars (import "count-characters.so" "count_characters"))
count_characters <8CEF0>

> (unpack (dup "lu" 128) (count-chars (read-file "war-and-piece.txt") 3217389))
(0 0 0 0 0 0 0 0 0 0 67418 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
 0 0 511976 3938 17974 2 1 3 0 7526 640 640 339 1 39921 5850 30686 
 26 230 348 163 55 25 49 51 38 169 54 1003 1148 1 2 1 3138 2 6204 
 3638 1783 2021 1867 1908 1247 4016 7404 320 1191 687 3288 3627 1638 
 6117 35 2704 2974 6464 278 939 2896 348 1265 108 47 0 47 0 1 0 199012 
 30984 59225 116122 312451 52818 49877 162871 166004 2214 19194 95740 
 58261 180228 190868 38854 2300 144967 159746 219083 65006 25940 
 56197 3719 44945 2282 0 0 0 1 0)
> 

does it on my MacMini/PPC in 0.129 seconds.

Lutz

Jeff · Post by **Jeff** » Wed Aug 15, 2007 4:49 pm

Thank you all for your help.

Jrh: I tried that, too. I just couldn't get the speed I needed. Especially when on each iteration I needed to increment the value at an array index. With a quad-core g5 (2 * 2.5 GHz), it still takes around 3.5 seconds :(

Lutz: Thanks! That is what I was looking for. I was still using the struct as a way of housing the array so that it could be returned from the function (as you can see, I'm not hugely experienced with C; additionally, the lib was originally built to work with the Python API so that it would have a more object-like return value that could be extended as needed).

I was only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct. When I tried to return a locally created struct itself, the interpreter died with no messages. Any idea what I did wrong there?

Lutz · Post by **Lutz** » Wed Aug 15, 2007 4:55 pm

was only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct.

Yes, that is the only way to do it. All return values from C library imports are 32 bit, so for bigger data objects a pointer has to be returned.

Thanks Ryon for re-posting the code here in readable form.

Lutz

Jeff · Post by **Jeff** » Wed Aug 15, 2007 8:31 pm

Thanks again for all the help. I just added a version of the character counter form on my website using newLisp as a cgi.

The host only supports cgi (since the virtual host doesn't actually exist until it gets a request). That means that with each request I have the overhead of launching the interpreter.

The newlisp version, while it takes about the same amount of time to process the text itself, takes more than a half of a second less to actually return the data.

Here is the newLisp version: http://artfulcode.nfshost.com/lex/

(don't bookmark that url, it will disappear)

Jeff · Post by **Jeff** » Wed Aug 15, 2007 8:33 pm

PPS: 'sort is excellent. This was the first time I ever implemented a sort using s-expressions rather than imperative code, and it is much, much cleaner.

Jeff · Post by **Jeff** » Mon Aug 20, 2007 2:29 am

Yay! I figured out a way to iterate quickly over a string. It's not nearly as fast as C, but it seems to out-perform the other ways so far. First, I get the address of the string as a pointer. I can get the char at that address using get-char. Then I can get the next char using (get-char (+ 1 (address string))). Here is a function to count chars moderately quickly (beats bayes-train because it doesn't explode the string):

Code: Select all

(define (count-chars str , ptr chars c)
  (set 'chars (array 128 (dup 0 128)))
  (set 'ptr (address str))
  (for (i 0 (length str))
       (set 'c (get-char (+ ptr i)))
       (nth-set (chars c) (+ 1 (chars c))))
  chars)

(set 'str (read-file "/path/to/war_and_peace.txt"))
(set 'chars (count-chars str))
(for (i 0 127)
     (println (char i) ": " (chars i)))

One question, though... since inc expects a symbol, is there a way to use inc on an array index? I.e., (inc (arr x y))?

Lutz · Post by **Lutz** » Mon Aug 20, 2007 2:42 pm

The use of an arrray for counting is an important step, even more so when using a few hundred or thousand elements in the counting array.

There is another 2 modifications to get it another 20% faster using 'unpack' to create a vector of ASCII numbers from the string, and using a feature of 'nth-set', which gives the old contents in $0

Code: Select all

(define (count-chars str , nums chars c)
  (set 'chars (array 128 (dup 0 128)))
  (set 'nums (unpack (dup "c" (length str)) str))
  (dolist (c nums)
       (nth-set (chars c) (+ $0 1)))
  chars)

Lutz

Jeff · Post by **Jeff** » Mon Aug 20, 2007 3:04 pm

I get another 18% beyond that using only unsigned integers ("b" instead of "c"). Thanks! This is getting pretty fun :)

Lutz · Post by **Lutz** » Wed Aug 29, 2007 8:59 pm

The rewritten 'count in 9.2.1 is about 3-4 times faster than before when counting 100 in 1,000,000 elements. On the 'war_and_peace.txt' character count about 30% faster on a Core Duo 2 1.8Ghz when testing against counting with 'unpack and char-indexed 'array, and of course a lot easier to code.

Code: Select all

(set 'str (read-file "war_and_peace.txt"))
; old way on 9.2.0
(set 'cnts (array 128 (dup 0 128)))
(dolist (i (unpack (dup "b" (length str)) str)))
	(nth-set (cnts i) (+ $0 1)))

-> 1270 ms

; using count on 9.2.1
(count (sequence 0 127) (unpack (dup "b" (length str)) str))

-> 812 ms

Lutz

Jeff · Post by **Jeff** » Wed Aug 29, 2007 11:36 pm

Wow! Where can I get some of that?! Thanks a bunch, Lutz. It's nice to work in a language where the author takes a personal interest in how the language is used.