Passing multiple lists to bayes-train?

methodic · Post by **methodic** » Thu Jan 15, 2009 7:52 am

Does anyone have any tips on passing bayes-train multiple lists of data instead of passing everything in one fell swoop with an eval-string?

Instead of this:
(bayes-train (t 0) (t 1) (t 2) (t 3) (t 4) 'L)

Something like this would be preferred:
(apply bayes-train t)

Thanks in advance!

cormullion · Post by **cormullion** » Thu Jan 15, 2009 5:04 pm

Hi methodic. I'm not totally clear what you mean. You can do this:

Code:

(set 'j '("this" "that" "other"))
(set 'k '("this" "that" "other"))
(bayes-train j k 'L)

to do multiple lists at the moment. But each list represents a different category.

Also, remember that bayes-train is cumulative, so you can keep doing batches of words into the same context, if you keep the number of categories the same...
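A minimal sketch of the cumulative behaviour described above. The context name Demo and the word lists are made up for illustration; the only assumption about newLISP itself is that bayes-train returns the per-category totals, as the manual describes.

```newlisp
; Demo is a hypothetical context name; the word lists are invented.
; First batch: two categories (list 1 and list 2).
(set 'first-totals
  (bayes-train '("cheap" "pills") '("meeting" "agenda") 'Demo))

; Second batch into the same context, same number of categories.
(set 'second-totals
  (bayes-train '("cheap" "offer") '("agenda" "notes") 'Demo))

; The totals after the second call should include the first batch too.
(println first-totals " -> " second-totals)
```

If the totals grow from one call to the next, the counts are being added to the existing context rather than replacing it.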

methodic · Post by **methodic** » Thu Jan 15, 2009 9:16 pm

Basically I'm trying to compare multiple blocks of text horizontally. Say I have 10 documents, and each document contains 5 paragraphs. I'd like to be able to iterate through each document and compare the paragraphs to each other.

Here is a crude example:

Code:

(set 'terms "how are you doing")
(dolist (d docs)
  (set 'p (sort (unique (find-all "\\w+" d))))
  ;; now p contains a list of each paragraph
  (bayes-train (p 0) (p 1) (p 2) (p 3) (p 4) 'L)
  (bayes-query (parse terms " ") 'L)
  (delete 'L)
)

Basically I don't want to compare documents, I want to compare the paragraphs INSIDE the documents to each other. Hope that clears it up?

methodic · Post by **methodic** » Thu Jan 15, 2009 10:56 pm

Ideally I'd like to be able to pass an index into bayes-train to specify the location of the data:

Code:

(set 'i 0)
(dolist (p parsed_data)
  (bayes-train p i 'L)
  (inc i)
)

cormullion · Post by **cormullion** » Thu Jan 15, 2009 11:33 pm

Perhaps this is cheating...

Code:

(set 'paras '(
  ("a" "b" "c" "both")
  ("x" "y" "z" "both")))

(apply bayes-train (push 'K paras -1))
(bayes-query '("c" "a") K)

;-> (1 0)

:)

methodic · Post by **methodic** » Thu Jan 15, 2009 11:48 pm

cormullion wrote: Perhaps this is cheating...

Code:

(set 'paras '(
  ("a" "b" "c" "both")
  ("x" "y" "z" "both")))

(apply bayes-train (push 'K paras -1))
(bayes-query '("c" "a") K)

;-> (1 0)

:)

You just became my new favorite person, this is absolutely what I needed.... THANK YOU THANK YOU THANK YOU! :)

cormullion · Post by **cormullion** » Fri Jan 16, 2009 8:12 pm

;)

Lutz - isn't this another example like we had recently, of apply not being quite versatile enough... What's the ideal solution? Another function? Or is a stronger curry the answer?

Lutz · Post by **Lutz** » Fri Jan 16, 2009 8:45 pm

The case demonstrated is very isolated and doesn't really show what else is going on for the task at hand. In the end it may be smarter to create a function for that application, taking the lists for the categories, the context in which to put the 'bayes-train' results, and probably some other parameters.
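One possible shape for such a function, as a rough sketch only. The name train-paragraphs and its two parameters are hypothetical, and it reuses the apply trick from earlier in the thread; any "other parameters" are left out because the thread does not say what they would be.

```newlisp
; train-paragraphs is a hypothetical helper, not part of newLISP.
; lists: a list of token lists, one per category
; ctx:   the quoted symbol naming the context that receives the counts
(define (train-paragraphs lists ctx)
  ; append builds a fresh argument list, so the caller's data
  ; is left untouched (push would modify it in place)
  (apply bayes-train (append lists (list ctx))))

(set 'paras '(("a" "b" "c" "both")
              ("x" "y" "z" "both")))

(train-paragraphs paras 'K)
(println (bayes-query '("c" "a") K))
```

Compared with pushing the context symbol onto paras, this keeps paras as a plain list of categories, so it can be reused or trained into a second context.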

methodic · Post by **methodic** » Sat Jan 17, 2009 2:12 am

Hey Lutz, could you please clarify the statement in the manual for bayes-query:

When using the default R.A. Fisher Chi² mode, nonexistent tokens will skew results toward equal probability in all categories.

Thanks in advance.

Lutz · Post by **Lutz** » Sat Jan 17, 2009 1:07 pm

When using Fisher's Chi-2, differences in the probability numbers for each category are less pronounced and tend to be closer to each other, moving closer to an average.

methodic · Post by **methodic** » Tue Jan 20, 2009 2:27 am

I'm not entirely sure how the bayes stuff works. Suppose you parse a document into paragraphs, pass each paragraph to bayes-train as a list of strings, and then query with tokens like ("amd" "athlon" "processor"). I am assuming the paragraphs that contain a higher frequency of these words will be rated higher, and that a paragraph containing none of these tokens will be rated as zero, correct?

That being said (if I understand correctly), and given what you replied earlier, would it be wise to strip out any lists that don't contain at least one of the tokens? Or is it better to keep all the data there so that the matches are rated higher?

Thanks in advance.

Lutz · Post by **Lutz** » Tue Jan 20, 2009 3:31 am

If a query contains words which do not occur in a certain category, then the chain Bayes method drives the probability of that category all the way down to zero, even if other words in the query list occur frequently in that category. This is the reason you should use Fisher's Chi-2 method; then there is no need to eliminate anything from the data or the query list of words. A paragraph with a high frequency of other words in the query will still score high using Fisher's Chi-2 method.

There are situations where one would eliminate certain tokens from the data set to tune performance. But there is also the danger of wrongly eliminating important data without knowing it.

If you are working on some kind of automatic classification problem, it is important to define a validation procedure against which to compare your Bayesian classification. E.g. when categorizing emails into spam and not-spam, you would classify emails by human judgement, then compare the results to your Bayesian classification method processing the same data. This will give you an indication of how well your method is performing.
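A small sketch for comparing the two modes side by side. The context name P and the token lists are invented for illustration; the third argument true selects the chain Bayes mode, as in the manual's bool-chain parameter.

```newlisp
; P is a hypothetical context; each list stands for one parsed paragraph.
(bayes-train '("amd" "athlon" "processor" "amd")
             '("amd" "fabrication" "plant")
             'P)

(set 'tokens '("amd" "athlon" "processor"))

; Default mode: Fisher's Chi-2
(println "Fisher Chi-2: " (bayes-query tokens P))

; Third argument true: chain Bayes
(println "chain Bayes:  " (bayes-query tokens P true))
```

The second paragraph never contains "athlon" or "processor", so this is the kind of query where the two modes can be expected to treat the second category differently.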

It is hard to give any more advice without knowing more about the nature of the problem you are investigating and the data you are using. Perhaps consider involving somebody knowledgeable in statistical methods and experimental design.

methodic · Post by **methodic** » Tue Jan 20, 2009 4:16 am

Thanks for the clarification. What I am trying to do is essentially figure out a way to parse a document and, given some tokens, figure out which paragraph is most relevant. I would like the paragraph that matches the most tokens to be rated higher than a paragraph that matches only one or two.

Am I wrong in trying to use the bayes functions for this? It seemed like you could also use this approach for term frequencies.

Example: take the following lines:
AMD makes processors that are geared towards the consumer.
AMD has many fabricating plants all around the world.

I want to push those two documents into the same bayes-train context (i.e. not compare side-by-side, but as a whole), and then if my tokens are ("AMD" "processors"), obviously I would want the first line because it contains both AMD and processors, whereas the second contains only AMD.

I hope that cleared some stuff up. I'm not looking for help on statistics, just wondering if these bayes functions can be used, or if I should start looking for a different method. :)
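For comparison, the ranking described above (more matching tokens means a higher score) can be sketched without the bayes functions at all. This is plain token overlap, not a Bayesian method; match-count is a hypothetical helper, and lower-casing both sides is an assumption about how matching should work.

```newlisp
(set 'lines '("AMD makes processors that are geared towards the consumer."
              "AMD has many fabricating plants all around the world."))

; Query tokens, already lower-cased
(set 'tokens '("amd" "processors"))

; match-count is a hypothetical helper: the number of distinct
; query tokens that occur in the text, ignoring case
(define (match-count text)
  (length (intersect tokens (map lower-case (find-all {\w+} text)))))

(println (map match-count lines))   ; expected: (2 1)
```

The first line scores 2 because it contains both tokens; the second scores 1. Unlike bayes-query, this ignores how often a word occurs.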

xytroxon · Post by **xytroxon** » Tue Jan 20, 2009 7:02 am

Some background material on the Bayesian technique you are wondering about...

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes called "ham" or "bacn").

A form of this technique should work for finding desired material as well as it does for excluding the undesired... And newLISP gives us a hand up on the required math :)

-- xytroxon

cormullion · Post by **cormullion** » Tue Jan 20, 2009 6:38 pm

Slightly tangential and for what it's worth: When I was working on my comment-spam tester for a web site, I whipped up a quick interactive tester, so that I could type or paste in different phrases and see their results... You can see a read-out at the top of the bayes scores....

I probably learnt something, but I've forgotten what it was now... :)

Code:

#!/usr/bin/env newlisp

(if (= ostype "Win32")
   (load (string (env "PROGRAMFILES") "/newlisp/guiserver.lsp"))
   (load "/usr/share/newlisp/guiserver.lsp"))

(gs:init)
(gs:frame 'bayes-interactive 100 100 600 400 "bayes-interactive")
(gs:set-border-layout 'bayes-interactive)
(gs:text-pane 'string-input 'textfield-handler "text/plain")
(gs:set-size 'string-input 400 400)
(gs:text-area 'text-output 'gs:no-action)
(gs:panel 'controls)
(gs:set-grid-layout 'controls 1 1) 
(gs:button 'aButton 'abutton-action "Change corpus")
(gs:label 'bayes-score-a "")
(gs:label 'bayes-score-b "")

(gs:add-to 'controls 'aButton 'bayes-score-a  'bayes-score-b)
(gs:add-to 'bayes-interactive 'controls "north" 'string-input "center")

(define (abutton-action id)
  (gs:open-file-dialog 'bayes-interactive 'openfile-action))

(define (openfile-action id op file)
  (when (= op "open")
     (when file 
        (set 'f (base64-dec file))
        (gs:set-text 'text-output (base64-dec file))
        (load f)
        (gs:set-text 'bayes-interactive f)
        ; see below...
        (set 'default-corpus 'corpus-context))))

(define (textfield-handler id key dot mark)
  (and
    (!= key 65535) ; not cursor
    (gs:get-text id 'gettextcallback-handler)))

(define (run-bayes-query s)
  (let ((source-text (map lower-case (parse s "\\W" 0))))
    (catch 
      (begin        
        (gs:set-text 'bayes-score-a (format { RA Fisher method: %4.2f %4.2f} 
          (bayes-query source-text 'MAIN:spam)))
        (gs:set-text 'bayes-score-b (format { Second method: %4.2f %4.2f} 
          (bayes-query source-text 'MAIN:spam true))))
    'error-sym)))

(define (gettextcallback-handler id text)
  (and
     text
     (= id "MAIN:string-input")
     default-corpus
     (set 'strng (base64-dec text))
     (set 'result (run-bayes-query strng))
     (gs:set-text 'text-output (string result))))

; defaults
(set 'default-corpus-file "/Users/me/corpus.lsp")
(load default-corpus-file)

; problem: how do we know what the lexicon we've just loaded is called?
(set 'corpus-context 'MAIN:spam)
(set 'default-corpus corpus-context)

(gs:set-text 'bayes-interactive default-corpus-file)
(gs:set-text 'string-input [text][/text])

(gs:set-visible 'MAIN:bayes-interactive true)
(gs:listen)
; eof