Bayesian Midterm
Normalizing Constant Question
What's the overall chance of observing data X?
Heavier Prior means we have
A small sample size n and/or an informative prior
Normalizing constant pdf
f(x)
Chain structure 'iter'
# of iterations per chain *The first half are always thrown out as "warm-up" samples (it takes time before the chain starts to produce values that mimic a random sample...that's why we always double the number we want)
Chain structure 'chains'
# parallel chains to run
'stan()' arguments
1) model structure: must specify the model structure, tuning parameters, and observed data 2) chain structure (see the sketch below)
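A minimal sketch of both argument groups in rstan, assuming a simple Beta-Binomial model (the Beta(2, 2) prior, the n = 10, x = 8 data, and the seed are illustrative assumptions, not from the course):

  library(rstan)

  # 1) Model structure: structure, tuning (prior) parameters, observed data
  bb_model <- "
    data {
      int<lower = 0> n;               // number of trials
      int<lower = 0, upper = n> x;    // observed number of successes
    }
    parameters {
      real<lower = 0, upper = 1> pi;  // success probability
    }
    model {
      pi ~ beta(2, 2);                // prior with tuning parameters 2, 2
      x ~ binomial(n, pi);            // data model
    }
  "

  # 2) Chain structure: 4 parallel chains of 10000 iterations each;
  # the first half of each chain is thrown out as warm-up
  bb_sim <- stan(model_code = bb_model, data = list(n = 10, x = 8),
                 chains = 4, iter = 10000, seed = 84735)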
Monte Carlo Simulation:
1) Simulate π from the prior [data.frame(pi = ...)] 2) Simulate x from π [mutate(x = rbinom(...))] 3) Filter on the observed value [filter(x == 8)] (see the sketch below)
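A minimal sketch of the three steps in R, assuming dplyr and a Beta(2, 2) prior with a Binomial(10, π) data model (all parameter values are illustrative):

  library(dplyr)
  set.seed(84735)

  # 1) Simulate pi from the prior
  sim <- data.frame(pi = rbeta(10000, 2, 2)) %>%
    # 2) Simulate x from each pi
    mutate(x = rbinom(10000, size = 10, prob = pi)) %>%
    # 3) Keep only the pairs matching the observed data
    filter(x == 8)

  # The remaining pi values approximate the posterior f(pi | x = 8)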
Heavier Data means we have
A large sample size n and/or a vague prior
Bayesian Analysis
Combines our prior information with the observed data to construct updated, or posterior, information about a r.v. θ
MCMC limitation
Complicated to implement
Sequential Bayesian Analysis
Continue updating our posterior understanding: each time, our new posterior becomes our new prior, and so on
Frequentist interpretation of Data
Data alone should drive our outgoing information
Bayesian interpretation of Data
Data should be weighed against our incoming information
Posterior mean
E(π|X = x) = ∫ π f(π|x) dπ
Mode
Highest point, so the most plausible π
Likelihood Function question
How compatible is r.v. θ with the data x?
Shape of curve means
How spread out our values are and how confident we are: peak = most plausible value of π; spread = how certain/confident we are *Skinny = pretty certain
Likelihood Interpretation
If (x = 0, θ = 0)...probability that X = 0 if θ = 0 is ...
Frequentist interpretation of Questions Asked
If the hypothesis isn't correct, what is the chance that I would have observed these data?
THE CONTINUOUS UNIFORM DISTRIBUTION
If continuous RV X is uniformly distributed across the interval [a,b]
Bayesian interpretation of Questions Asked
In light of the data, what are the chances that the hypothesis is correct?
Likelihood Function PDF
L(θ|x) := f(x|θ) where x | θ ~ ...() and where the data, x, are known and θ is unknown
Posterior f(θ|x) will be large if
L(θ|x) is large or f(θ) is large or both *θ either needs a lot of prior plausibility or needs to be strongly supported by the data
Posterior f(θ|x) will be small if
L(θ|x) is small or f(θ) is small or both *Small prior chance of happening or inconsistent data
THE NORMAL DISTRIBUTION
Let X be an RV with a bell-shaped distribution centered at μ and with variance σ²
THE EXPONENTIAL DISTRIBUTION
Let X be the waiting time until a given event occurs
THE BINOMIAL (& BERNOULLI) DISTRIBUTION
Let discrete RV X be the number of successes in n trials where: the trials are independent; each trial has an equal probability p of success
THE GEOMETRIC DISTRIBUTION
Let discrete RV X be the number of trials until the 1st success where: the trials are independent; each trial has an equal probability p of success
THE POISSON DISTRIBUTION
Let discrete RV X∈{0,1,2,...} be the number of events in a given time period. The outcome of X depends upon parameter λ>0, the rate at which the events occur
x | π ~ ...()
Likelihood
π | x ~ ...()
Posterior *Upon observing X
Informative Prior
Posterior will be more influenced by prior *Typically tuned using expert information
Vague Prior
Posterior will resemble Data X/likelihood *Large Variance
π ~ ...()
Prior
Never accept proposal
The trace plot is a straight line because the chain never moves
THE BETA DISTRIBUTION
The Beta distribution can be used to model a continuous RV X that's restricted to values in the interval [0,1]
THE GAMMA DISTRIBUTION
The Gamma distribution is often appropriate for modeling positive RVs X (X>0) with right skew. For example, let X be the waiting time until a given event occurs s times. The outcome of X depends upon both the shape parameter s>0 (the number of events we're waiting for) and the rate parameter r>0 (the rate at which the events occur)
Interpreting 95% CI
There's a 95% posterior probability that λ is in my interval [] given my data (see the sketch below)
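A minimal sketch of computing such an interval from posterior draws in R (the Gamma posterior for λ is an illustrative assumption):

  set.seed(84735)
  # Hypothetical posterior draws for lambda
  lambda_draws <- rgamma(10000, shape = 12, rate = 4)

  # Middle 95% of the posterior: the 2.5th and 97.5th percentiles
  quantile(lambda_draws, probs = c(0.025, 0.975))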
Step 1 in Bayesian Analysis
Tune Prior Model
Frequentist...making inferences about pi?
Using only data x
Bayes...making inferences about pi?
Using posterior which combines our prior and data x *Entire pdf is the estimate for π
Prior question
What do we understand about θ before observing data x?
Posterior Question
What do we understand about θ now that we've observed data x?
THE DISCRETE UNIFORM DISTRIBUTION
Discrete RV X is equally likely to be any value in the discrete set S
Density + Histogram
distribution of chain values
Posterior predictive model
f(x'|x) = ∫ f(x'|π) f(π|x) dπ (continuous) or f(x'|x) = Σ f(x'|θ) f(θ|x) over all θ (discrete) *Weighs the chance of observing x' under each θ by the posterior plausibility of that θ given the original data (see the sketch below)
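A minimal sketch of approximating the posterior predictive model by simulation in R, assuming the conjugate Beta-Binomial setting from earlier (Beta(2, 2) prior, n = 10, x = 8, so π | x ~ Beta(10, 4) by conjugacy):

  set.seed(84735)

  # Draw pi from the posterior, then draw a new x' from f(x' | pi);
  # averaging over the pi draws approximates the integral above
  pi_draws <- rbeta(10000, 2 + 8, 2 + 10 - 8)   # pi | x ~ Beta(10, 4)
  x_new    <- rbinom(10000, size = 10, prob = pi_draws)

  # Relative frequencies approximate f(x' | x)
  table(x_new) / length(x_new)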
LTP
f(x) = Σ f(x,θ) = Σ f(x|θ)f(θ) = Σ L(θ|x)f(θ) *Σ over all θ; for continuous θ, replace the sum with an integral: f(x) = ∫ f(x|θ)f(θ) dθ
Independent Variables
f(x|y) = f(x) or f(x,y) = f(x)f(y)
Conditional Models
f(x|y) = f(x,y) / f(y) *For a fixed y, f(x|y) is a valid pdf in x, so it sums/integrates to 1
Prior pdf
f(θ)
Posterior PDF
f(θ|x) = f(θ)L(θ|x) / f(x) or f(θ|x) = f(θ)f(x|θ) / f(x)
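A minimal sketch of this formula via grid approximation in R (the Beta(2, 2) prior and Binomial(10, π) likelihood with x = 8 are illustrative assumptions); the f(x) step is the LTP card above, discretized:

  # Discretize pi, then apply f(pi | x) = f(pi) L(pi | x) / f(x)
  pi_grid    <- seq(0, 1, length.out = 1001)
  prior      <- dbeta(pi_grid, 2, 2)                  # f(pi)
  likelihood <- dbinom(8, size = 10, prob = pi_grid)  # L(pi | x = 8)

  # Normalizing constant f(x) by the (discretized) Law of Total Probability
  fx <- sum(prior * likelihood) * (pi_grid[2] - pi_grid[1])

  posterior <- prior * likelihood / fx                # f(pi | x)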
Frequentist...what is pi?
fixed, unknown quantity
Prior definition
incoming/prior information
Likelihood Definition
A function of θ that measures the relative likelihood of the model parameter being θ given that we observed data X = x
Trace plot
longitudinal behavior of chains
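Both this card and the density/histogram card above can be checked with rstan's built-in plotting functions (assuming a stanfit object like the bb_sim sketched earlier):

  stan_trace(bb_sim, pars = "pi")  # longitudinal behavior of the chains
  stan_dens(bb_sim, pars = "pi")   # density of the chain values
  stan_hist(bb_sim, pars = "pi")   # histogram of the chain values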
Normalizing Constant Definition
measures the overall chance of observing X = x across all possible values of θ, taking into account the prior plausibility of each possible θ. Specifically by the Law of Total Probability
MAP
The posterior mode, argmax_π f(π|x), i.e., where the posterior is maximized (see the sketch below)
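Continuing the grid-approximation sketch from the Posterior PDF card, the MAP estimate is simply the grid value where the approximate posterior is largest:

  # MAP: the pi value maximizing the approximate posterior
  pi_grid[which.max(posterior)]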
Bigger sample size means
More confident *More data means more mathematical influence • If n is sufficiently large, we will see convergence to a general consensus
Making Inferences...hypothesis testing
posterior assessment about a claim regarding π
Making Inferences...prediction
predicting new observations from the model
Conjugate Prior
Produces a posterior model in the same family as the prior
Bayes...what is pi?
r.v we can model using a pdf
Making Inferences...interval estimation
range of posterior plausible values of π
Frequentist interpretation of probability
relative frequency of repeatable event
Bayesian interpretation of probability
relative plausibility of an event
Monte Carlo: what it does
Simulating (x, π) pairs from the prior/likelihood
Making Inferences...point estimation
single posterior estimate of π
Why simulation?
Some models become too difficult to derive by hand, so simulation is needed
Variance
spread *Small sample size = wide plausible range
Mean
The average value of π, i.e., the center of mass of the distribution
Monte Carlo
{θ^(1), ..., θ^(N)} is a random sample of size N from the posterior f(θ|x), where the θ^(i) are i.i.d. • Independent • Identically Distributed (drawn directly from the posterior f(θ|x))
Markov Chain Monte Carlo (MCMC)
{θ^(1), ..., θ^(N)} is not a random sample from the posterior f(θ|x) but can be designed to mimic one 1) Chain values are dependent: θ^(i+1) is drawn from a model that depends on the current state θ^(i), g(θ^(i+1) | θ^(i)) 2) Chain values are not drawn from the posterior but converge to it, which provides a good approximation (see the sketch below)
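A minimal sketch of the dependence g(θ^(i+1) | θ^(i)) via a Metropolis algorithm in R, targeting the Beta-Binomial posterior from earlier (the Uniform proposal half-width of 0.1 is an illustrative tuning choice):

  set.seed(84735)

  # Unnormalized log posterior: log f(pi) + log L(pi | x = 8)
  log_target <- function(pi) {
    dbeta(pi, 2, 2, log = TRUE) + dbinom(8, 10, pi, log = TRUE)
  }

  n_iter   <- 5000
  chain    <- numeric(n_iter)
  chain[1] <- 0.5  # starting state

  for (i in 2:n_iter) {
    current  <- chain[i - 1]
    # 1) Proposal depends on the current state: Uniform(current +/- 0.1)
    proposal <- runif(1, min = current - 0.1, max = current + 0.1)
    if (proposal <= 0 || proposal >= 1) {
      chain[i] <- current  # reject proposals outside (0, 1)
    } else {
      # 2) Accept with probability min(1, target(proposal) / target(current))
      alpha    <- exp(log_target(proposal) - log_target(current))
      chain[i] <- ifelse(runif(1) < alpha, proposal, current)
    }
  }

  # plot(chain, type = "l")  # trace plot of the chain's tour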
Monte Carlo Limitations
• Doesn't work if the observed data value is uncommon • Computationally inefficient • Can break down when the sample space of x is continuous or large • Can't handle more than one parameter θ
Monte Carlo Filtering Drawbacks
• Doesn't work if we have a small sample size • Can drastically cut down our effective sample size • Certain observed values produce more accurate results than others
A good MCMC
• Evenly distributed values without too much variance between chains • Will tour the state space of possible θ values in such a way that its visits to different regions provide a good approximation of the true but unknown posterior