Bayes query results just 0.5 or 1


Postby hilti » Sat Feb 01, 2014 10:55 pm

I don't get it ... I've been trying for two hours to train on some website data, but I'm not getting the expected results.
(bayes-query) returns 0.5 or 1

I'm using the complex pattern in (parse) so that German texts are split up correctly, too.

Here's my code:

(setq textdata "Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(bayes-train text 'DICT)
(bayes-query (parse (lower-case "dsd skjsd ksdjkds sdkj") "[^a-z0-9äöüß]+" 0) 'DICT)


And the result

"Now think about your brain. It\226\128\153s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep."
("now" "think" "about" "your" "brain" "it" "s" "a" "long" "running" "program" "running"
 "on" "very" "complex" "and" "error" "prone" "hardware" "how" "does" "your" "brain"
 "keep" "itself" "sane" "over" "time" "the" "answer" "may" "be" "found" "in" "something"
 "we" "spend" "a" "third" "of" "our" "lives" "doing" "sleep" "")
(45)
(0.5)


I'm expecting 0, because this phrase doesn't exist in my training data.

Am I missing some switch?
--()o Dragonfly web framework for newLISP
http://dragonfly.apptruck.de

Re: Bayes query results just 0.5 or 1

Postby Lutz » Sun Feb 02, 2014 1:39 am

You should have at least two groups in your dictionary to get meaningful data. Both groups should have approximately the same number of tokens, and many more tokens than we have in this example. The numbers are not exact probabilities; they should be read by comparing the numbers for the different groups to each other. Also experiment with the 'true' flag, which selects the chain-based Bayesian method instead of Fisher's Chi2 method of calculating probabilities.

(setq textdata "Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.")

(setq textdata2 "Now think about your next vacation. It's a long time that you didn't have one. How does your brain stay healthy without vacation. There are so many things, you can do in a summer vacation. It's better then to sleep our lives away.")

(setq text (parse (lower-case textdata) "[^a-z0-9äöüß]+" 0))
(setq text2 (parse (lower-case textdata2) "[^a-z0-9äöüß]+" 0))
(bayes-train text text2 'DICT)

; probs from Fisher's Chi2
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT)
(bayes-query (parse (lower-case "How does your brain")) DICT)
; now the same with chain Bayesian
(bayes-query (parse (lower-case "How does your brain keep itself sane")) DICT true)
(bayes-query (parse (lower-case "How does your brain stay healthy")) DICT true)
(bayes-query (parse (lower-case "How does your brain")) DICT true)

gives you this:
(45 47) <- number of tokens

(0.994129159386854 0.00587084061314597) <- phrase comes from 1st group
(0.0569603080654478 0.943039691934552) <- phrase comes from 2nd group
(0.595589294104956 0.404410705895044) <- phrase occurs in both groups

(1 0) <- chain Bayesian reacts strongly to non-existing tokens
(0 1) <- chain Bayesian reacts strongly to non-existing tokens
(0.695000518791985 0.304999481208016) <- sharper distinction in chain Bayesian method
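
Since the numbers are meant to be compared across groups, a small helper can pick the group with the highest score. This is only a sketch: best-group is a hypothetical name, not part of newLISP, and it assumes the list format that bayes-query returned above.

```newlisp
; Hypothetical helper: index of the group with the highest score.
; If two groups tie, the first one wins.
(define (best-group scores)
  (find (apply max scores) scores))

; Scores copied from the Fisher Chi2 results above.
(println (best-group '(0.994129159386854 0.00587084061314597)))
(println (best-group '(0.0569603080654478 0.943039691934552)))
```

Index 0 stands for the first training list passed to bayes-train, index 1 for the second. With scores as close as in the "How does your brain" query, the winning index alone says little, so it may be worth looking at the margin as well.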


When doing the query "How does your brain", the numbers for the first group are slightly higher, although the same phrase occurs in both groups. This happens because certain words, e.g. "brain", occur more frequently in group one.
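
To see where such a difference comes from, it can help to look at the per-group counts that bayes-train stores for each token. The following is a minimal sketch with two made-up token lists; DEMO, group1 and group2 are illustrative names, and the underscore-prefixed symbol access assumes the storage scheme the newLISP manual describes for string tokens.

```newlisp
; Two tiny, made-up token lists (not the texts from the post).
(setq group1 '("your" "brain" "keeps" "your" "brain" "sane"))
(setq group2 '("your" "brain" "needs" "a" "vacation"))

; bayes-train returns the total number of tokens per group.
(setq totals (bayes-train group1 group2 'DEMO))

; Assumption: string tokens are stored as symbols with a leading
; underscore, each holding a list of counts, one entry per group.
(println "brain:    " DEMO:_brain)
(println "vacation: " DEMO:_vacation)
(println "totals:   " totals)
```

A token that is counted more often in one group pulls the score towards that group, which is the effect described above for "brain". The same kind of lookup should work on the DICT context from the post.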