Bayes query results just 0.5 or 1

Posted: Sat Feb 01, 2014 10:55 pm
by hilti
I don't get it ... I've been trying for two hours to train on some website data, but I'm not getting the expected results.
(bayes-query) only ever returns 0.5 or 1.

I'm using the complex pattern in (parse) so that I can split up German texts, too.

Here's my code:

Code:

(setq textdata "Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(bayes-train text 'DICT)
(bayes-query (parse (lower-case "dsd skjsd ksdjkds sdkj") "[^a-z0-9äöüß]+" 0) 'DICT)
And the result

Code:

"Now think about your brain. It\226\128\153s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep."
("now" "think" "about" "your" "brain" "it" "s" "a" "long" "running" "program" "running" 
 "on" "very" "complex" "and" "error" "prone" "hardware" "how" "does" "your" "brain" 
 "keep" "itself" "sane" "over" "time" "the" "answer" "may" "be" "found" "in" "something" 
 "we" "spend" "a" "third" "of" "our" "lives" "doing" "sleep" "")
(45)
(0.5)
I'm expecting 0, because this phrase doesn't exist in my training data.

Am I missing some switch?

Re: Bayes query results just 0.5 or 1

Posted: Sun Feb 02, 2014 1:39 am
by Lutz
You should have at least two groups in your dictionary to get meaningful data. Both groups should have approximately the same number of tokens, and many more tokens than we have in this example. The numbers are not exact probabilities; interpret them by comparing the values for the different groups to each other. Also experiment with the ‘true’ flag, which selects the chain Bayesian method instead of the Fisher's Chi2 method of calculating probabilities.

Code:

(setq textdata "Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq textdata2 "Now think about your next vacation. It's a long time that you didn't have one. How does your brain stay healthy without vacation. There are so many things, you can do in a summer vacation. It's better then to sleep our lives away.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(setq text2 (parse (lower-case textdata2) "[^a-z0-9äöüß]+" 0))
(bayes-train text text2 'DICT)

; probs from Fisher's Chi2
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT)
(bayes-query (parse (lower-case "How does your brain")) DICT)
; now the same with chain Bayesian
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT true)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT true)
(bayes-query (parse (lower-case "How does your brain")) DICT true)
gives you this:

Code:

(45 47) <- number of tokens

(0.994129159386854 0.00587084061314597) <- phrase comes from 1st group
(0.0569603080654478 0.943039691934552) <- phrase comes from 2nd group
(0.595589294104956 0.404410705895044) <- phrase occurs in both groups

(1 0) <- chain Bayesian reacts strongly to non-existing tokens
(0 1) <- chain Bayesian reacts strongly to non-existing tokens
(0.695000518791985 0.304999481208016) <- sharper distinction in chain Bayesian method
When doing the query "How does your brain", the numbers for the first group are slightly higher, although the same phrase occurs in both groups. This happens because certain words, e.g. "brain", occur more frequently in group one.
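
The chain Bayesian numbers above can be reproduced by hand, which also shows why a token that is missing from one group forces that group to 0. The sketch below is not newLISP's internal code. It assumes priors proportional to each group's token total and a per-token likelihood of count/total, an assumption that happens to match the (0.695 0.305) result above. `chain-bayes` is a hypothetical helper, and the counts are tallied by hand from the two training texts.

```newlisp
; Hypothetical helper (not part of newLISP): two-group chain Bayes from raw counts.
; ASSUMPTION: prior = group total / all tokens, likelihood = token count / group total.
(define (chain-bayes counts1 counts2 total1 total2)
  (letn (all  (add total1 total2)
         ; unnormalized score = prior * product of per-token likelihoods
         s1   (mul (div total1 all)
                   (apply mul (map (fn (c) (div c total1)) counts1)))
         s2   (mul (div total2 all)
                   (apply mul (map (fn (c) (div c total2)) counts2)))
         norm (add s1 s2))
    (list (div s1 norm) (div s2 norm))))

; "how does your brain": counts of how, does, your, brain in each group
(println (chain-bayes '(1 1 2 2) '(1 1 2 1) 45 47))

; adding "keep itself sane": these tokens never occur in group 2,
; so the group 2 product contains a zero factor and collapses to 0
(println (chain-bayes '(1 1 2 2 1 1 1) '(1 1 2 1 0 0 0) 45 47))
```

The first call evaluates to approximately (0.695 0.305). The only token whose count differs between the groups is "brain" (2 versus 1), which is what tilts the result toward group one. In the second call a single zero count is enough to push the result to (1 0), whatever the other tokens say.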