Bayes classifier (Naive)

There are a lot of insights one can derive from Bayes' theorem. And yet my interest is purely Machine Learning, and I was looking for an entry point to start learning about it.

I first heard of Bayes' theorem when I came across a CLI tool for converting CSV files into Ledger files (i.e. https://github.com/cantino/reckon), and more specifically it was a Naive Bayes Classifier. The tool would try to classify the transactions (from the CSV) into categories that you had already added in your target file (the Ledger). Using the past choices you made to help guide new ones is really cool! But how did it work?

I didn't understand it, but the equation used looked delightfully simple and that was inspiring...

P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}

...well maybe a little mysterious, but at least there weren't too many fancy symbols!
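To make it even less mysterious, the whole thing fits in one line of code. This is just a sketch with my own naming, not anything from a library:

```python
def bayes(p_a: float, p_b_given_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b
```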

The example

I want to know the chance that my dog barking is because there's someone at the gate. Another way to phrase that is, what is the probability that someone is at the gate, given that my dog is barking?

If we only consider the fact that my dog will bark 80.00% of the time when someone is at the gate, then we leave ourselves vulnerable to the fact that people are rarely at the gate, or that Bowser (my dog) also barks at random other stuff. To account for the "Prior" evidence, such as how many people visit each day on average - regardless of dogs - we need Bayes.

We will plug in the numbers and see what we get after asking ourselves a few questions. But first, we need to figure out the notation to better follow along.

The notation

First we have to make sense of the notation. The way to read P(A|B) is: the probability of A occurring, given that B occurred. This is known as a conditional probability. The other thing we can see is P(A), which reads as the probability of A occurring in general, regardless of the world around it. Note that it can be larger or smaller than P(A|B): learning that B occurred can make A more likely or less likely.

The other tool we need is P(not B), which reads: the probability of B NOT occurring. There is other notation for that, but this one reads clearest to me and is still considered okay. And you can combine them, so P(A|not B) reads: the probability of A, given that B did not occur.
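In code, the notation just maps to plain variables. The names and numbers below are my own, made up purely to show the mapping:

```python
p_b = 0.30            # P(B): the probability of B occurring
p_not_b = 1 - p_b     # P(not B): the probability of B NOT occurring
p_a_given_b = 0.80    # P(A|B): the probability of A, given that B occurred
```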

Okay, let's run through an example.

Plugging in the numbers

I want to know how often "person at gate" occurs, given "dog barking" -> P(Gate|Bark)

P(Gate|Bark) = \frac{P(Gate) \cdot P(Bark|Gate)}{P(Bark)}
  • How often does "person at gate" happen, in general? -> P(Gate) = 7.00%
  • How often does "dog barking" happen when a person is at the gate? -> P(Bark|Gate) = 80.00%
  • How often does "dog barking" happen when no person is at the gate? -> P(Bark|not Gate) = 15.00%

The chance of the dog barking in general is:

  • P(Bark) = P(Bark|Gate) * P(Gate) + P(Bark|not Gate) * P(not Gate)
  • = 80.00% * 7.00% + 15.00% * (100% - 7.00%)
  • = 19.55%
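That first line is the law of total probability: the chance of a bark with someone at the gate, plus the chance of a bark with no one there. A quick Python check, using nothing but the numbers above:

```python
p_gate = 0.07                 # P(Gate)
p_bark_given_gate = 0.80      # P(Bark|Gate)
p_bark_given_not_gate = 0.15  # P(Bark|not Gate)

# Law of total probability: barking with someone at the gate,
# plus barking with no one at the gate.
p_bark = p_bark_given_gate * p_gate + p_bark_given_not_gate * (1 - p_gate)
print(f"{p_bark:.2%}")  # 19.55%
```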

The probability of a person at the gate, given the dog barking, is:

  • P(Gate|Bark) = P(Gate) * P(Bark|Gate) / P(Bark)
  • = 7.00% * 80.00% / 19.55%
  • = 28.64%
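Or, continuing the snippet above and plugging the numbers into the bayes function from earlier:

```python
p_gate_given_bark = bayes(p_gate, p_bark_given_gate, p_bark)  # 0.07 * 0.80 / 0.1955
print(f"{p_gate_given_bark:.2%}")  # 28.64%
```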

So 28.64% is kind of lower than you might initially expect, given how likely it is for Bowser to bark at a person at the gate. The reason is that most of the barking (13.95% out of the 19.55% total) happens when no one is at the gate at all.


Jargon

\text{Posterior} = \frac{\text{Prior of A} \cdot \text{Likelihood of B given A}}{\text{Marginal probability of B}}

For even more reference, here is a glossary.

  • Posterior - The updated probability, after considering the new evidence.
  • Prior - The probability before considering the evidence at all.
  • Likelihood - The probability of the evidence, given that the hypothesis is true. Written as P(B|A).
  • Marginal probability - The probability of the evidence being present, regardless of the hypothesis.
  • Conditional probability - The probability of event A happening, given event B has happened/is happening. Written as P(A|B).
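Finally, to come back to where this started: below is a toy sketch of how a reckon-style Naive Bayes classifier might pick a category for a transaction description. To be clear, this is not reckon's actual code; the training data, smoothing, and scoring are all my own simplifications. The "naive" part is multiplying the per-word likelihoods together as if the words were independent:

```python
from collections import Counter, defaultdict

# Made-up training data: past transaction descriptions and their categories.
history = [
    ("starbucks coffee", "Food"),
    ("coffee beans market", "Food"),
    ("monthly rent payment", "Housing"),
    ("electricity payment", "Housing"),
]

# Count categories, and the words seen under each category.
category_counts = Counter(cat for _, cat in history)
word_counts = defaultdict(Counter)
vocab = set()
for text, cat in history:
    for word in text.split():
        word_counts[cat][word] += 1
        vocab.add(word)

def score(text: str, cat: str) -> float:
    """Unnormalised posterior: P(cat) times the product of P(word|cat)."""
    p = category_counts[cat] / len(history)  # the Prior, P(cat)
    total = sum(word_counts[cat].values())
    for word in text.split():
        # Likelihood P(word|cat), with add-one smoothing so a word
        # we have never seen before doesn't zero the whole product out.
        p *= (word_counts[cat][word] + 1) / (total + len(vocab))
    return p

def classify(text: str) -> str:
    # We can skip dividing by the marginal P(text): it is the same for
    # every category, so it doesn't change which one wins.
    return max(category_counts, key=lambda cat: score(text, cat))

print(classify("starbucks latte"))  # Food
```

Note the shortcut in classify: the marginal probability (the denominator in Bayes' theorem) is identical across categories, so a classifier only comparing categories never needs to compute it.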