# A Gentle Tutorial in Bayesian Statistics

#### Short Description

Exposure to Bayesian Stats...

#### Description

A Gentle Tutorial in Bayesian Statistics
Theo Kypraios, http://www.maths.nott.ac.uk/~tk
School of Mathematical Sciences, Division of Statistics

Division of Radiological and Imaging Sciences Away Day

Warning

This talk includes about 5 equations (hopefully not too hard!) and about 10 figures. The tutorial should be accessible even if the equations look hard.

Outline of the Talk

- The need for (statistical) modelling: two examples (a linear model / tractography);
- introduction to statistical inference (frequentist);
- introduction to the Bayesian approach to parameter estimation;
- more examples and Bayesian inference in practice;
- conclusions.

Use of Statistics in Clinical Sciences (1)

Examples include:
- sample size determination;
- comparison between two (or more) groups: t-tests, Z-tests, analysis of variance (ANOVA), tests for proportions, etc.;
- Receiver Operating Characteristic (ROC) curves;
- clinical trials;
- ...


Use of Statistics in Clinical Sciences (2)

One of the best ways to describe some data is by fitting a (statistical) model. Examples include:
- (linear/logistic/log-linear) regression models;
- survival analysis;
- longitudinal data analysis;
- infectious disease modelling;
- image/shape analysis;
- ...


Aims of Statistical Modelling: A Simple Example

Perhaps we can fit a straight line?

y = α + βx + error

[Figure: scatter plot of the response (y) against the explanatory variable (x), with x ranging from −2 to 2 and y from roughly 0.2 to 1.0; the points roughly follow a straight line.]
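As a sketch of what "fitting a straight line" means in practice, the closed-form least-squares estimates of (α, β) can be computed directly. The data below are synthetic, and the true values α = 0.5, β = 0.15 are assumptions chosen purely for illustration:

```python
import random

random.seed(1)

# Synthetic data from the model y = alpha + beta*x + error
alpha_true, beta_true = 0.5, 0.15
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [alpha_true + beta_true * x + random.gauss(0, 0.05) for x in xs]

# Least-squares estimates:
#   beta_hat  = cov(x, y) / var(x)
#   alpha_hat = mean(y) - beta_hat * mean(x)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta_hat = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
alpha_hat = my - beta_hat * mx

print(f"alpha_hat={alpha_hat:.3f}, beta_hat={beta_hat:.3f}")
```

With enough data and modest noise, the estimates land close to the true values, which is exactly the "best values for α and β" question posed later in the talk.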

An Example in DW-MRI

Suppose that we are interested in tractography. We use the diffusion tensor to model local diffusion within a voxel. The (model) assumption made is that local diffusion can be modelled with a 3D Gaussian distribution whose variance-covariance matrix is proportional to the diffusion tensor, D.

An Example in DW-MRI

The resulting diffusion-weighted signal μ_i along a gradient direction g_i with b-value b_i is modelled as:

μ_i = S0 exp{ −b_i g_i^T D g_i }    (1)

where

D = [ D11 D12 D13 ]
    [ D21 D22 D23 ]
    [ D31 D32 D33 ]

and S0 is the signal with no diffusion-weighting gradients applied (i.e. b = 0). The eigenvectors of D give an orthogonal coordinate system and define the orientation of the ellipsoid axes; the eigenvalues of D give the lengths of these axes. If we sort the eigenvalues by magnitude we can derive the orientation of the major axis of the ellipsoid and the orientations of the minor axes.
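A minimal numerical sketch of equation (1). The tensor D, b-value, and S0 below are made-up illustrative numbers, not values from the talk:

```python
import numpy as np

S0 = 1000.0                      # signal with no diffusion weighting (b = 0)
b = 1000.0                       # illustrative b-value
g = np.array([1.0, 0.0, 0.0])    # unit gradient direction

# An illustrative symmetric, positive-definite diffusion tensor whose
# leading eigenvector points along x
D = np.array([[1.7e-3, 0.0,    0.0   ],
              [0.0,    0.3e-3, 0.0   ],
              [0.0,    0.0,    0.3e-3]])

# Predicted signal: mu = S0 * exp(-b * g^T D g)
mu = S0 * np.exp(-b * (g @ D @ g))
print(f"mu = {mu:.1f}")

# Eigen-decomposition: eigenvectors give the ellipsoid axes, eigenvalues
# their lengths; the largest eigenvalue identifies the major axis
vals, vecs = np.linalg.eigh(D)
major_axis = vecs[:, np.argmax(vals)]
```

Signal attenuation is strongest along directions where g picks out large entries of D, which is what makes the model informative about fibre orientation.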

An Example in DW-MRI

Although this may look a bit complicated, it can actually be written in terms of a linear model.

Taken from Sotiropoulos, Jones, Bai + K (2010).

Aims of Statistical Modelling

Models have parameters, some of which (if not all) are unknown, e.g. α and β. In statistical modelling we are interested in inferring (e.g. estimating) the unknown parameters from data → inference. Parameter estimation needs to be done in a formal way. In other words, we ask ourselves: what are the best values for α and β such that the proposed model (the straight line) best describes the observed data?

Should we only look for a single estimate for (α, β)? No! Why? Because there may be many pairs (α, β) (often not very different from each other) which may describe the data equally well → uncertainty.


The Likelihood Function

The likelihood function plays a fundamental role in statistical inference. In non-technical terms, the likelihood function, when evaluated at a particular point, say (α0, β0), gives the probability of observing the (observed) data given that the parameters (α, β) take the values α0 and β0.

Let's consider a very simple example. Suppose we are interested in estimating the probability of success (denoted by θ) for one particular experiment. Data: out of 100 repetitions of the experiment we observed 80 successes. Here the likelihood is L(θ) = (100 choose 80) · θ^80 (1 − θ)^20. What about L(0.1), L(0.7), L(0.99)?

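Evaluating this likelihood at the suggested points is a direct computation of the binomial formula above:

```python
from math import comb

def likelihood(theta, n=100, k=80):
    """Binomial likelihood for k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

for theta in (0.1, 0.7, 0.8, 0.99):
    print(f"L({theta}) = {likelihood(theta):.3g}")

# The likelihood is largest at theta = k/n = 0.8
```

Values of θ far from 0.8 make the observed 80/100 successes very improbable, which is why the likelihood all but vanishes there.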

Classical (Frequentist) Inference

Frequentist inference tells us that:
- we should look for the parameter values that maximise the likelihood function → the maximum likelihood estimator (MLE);
- we associate the parameters' uncertainty with the calculation of standard errors . . .
- . . . which in turn enable us to construct confidence intervals for the parameters.

What's wrong with that? Nothing, but . . . it is approximate and counter-intuitive (the data are assumed to be random while the parameter is fixed), and it is often mathematically intractable.

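For the 80/100-successes example, the MLE and a standard-error-based (Wald) confidence interval, a textbook-standard construction not spelled out in the slides, come out in closed form:

```python
from math import sqrt

n, k = 100, 80
p_hat = k / n                       # MLE of theta
se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the MLE

# Approximate (Wald) 95% confidence interval: MLE +/- 1.96 * SE
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"MLE = {p_hat:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Note the "approximate" caveat in the slide: this interval relies on a normal approximation that degrades for small n or extreme θ.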

Classical (Frequentist) Inference: Some Issues

For instance, we cannot ask (or answer!) questions such as:

1. "what is the probability that the (unknown) probability of success in the previous experiment is greater than 0.6?", i.e. compute the quantity P(θ > 0.6) . . .
2. or something like P(0.3 < θ < 0.9).

Sometimes we are interested in functions of the parameters, e.g.

θ1 + θ2,   or   (θ1/(1 − θ1)) / (θ2/(1 − θ2)).

Whilst in some cases the frequentist approach offers a solution which is approximate rather than exact, there are others where it cannot do so, or it is very hard to do so.

Bayesian Inference

When drawing inference within a Bayesian framework, the data are treated as a fixed quantity and the parameters are treated as random variables. That allows us to assign probabilities to parameters (and models), making the inferential framework far more intuitive and more straightforward (at least in principle!).

Bayesian Inference (2)

Denote by θ the parameters and by y the observed data. Bayes' theorem allows us to write:

π(θ|y) = π(y|θ)π(θ) / π(y) = π(y|θ)π(θ) / ∫ π(y|θ)π(θ) dθ

where:
- π(θ|y) denotes the posterior distribution of the parameters given the data;
- π(y|θ) = L(θ) is the likelihood function;
- π(θ) is the prior distribution of θ, which expresses our beliefs about the parameters before we see the data;
- π(y) is often called the marginal likelihood and plays the role of the normalising constant of the posterior density.
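For the success-probability example, a Beta prior is conjugate to the binomial likelihood, so the posterior is available in closed form. This Beta-Binomial update is a standard result, not spelled out in the slides:

```python
# Conjugate Beta-Binomial update: with a Beta(a, b) prior on theta and
# k successes in n trials, the posterior is Beta(a + k, b + n - k).
a, b = 1, 1          # Beta(1, 1) = uniform prior
n, k = 100, 80

a_post, b_post = a + k, b + n - k

# Posterior mean and mode follow in closed form
post_mean = a_post / (a_post + b_post)
post_mode = (a_post - 1) / (a_post + b_post - 2)
print(f"posterior: Beta({a_post}, {b_post}), "
      f"mean={post_mean:.3f}, mode={post_mode:.3f}")
```

With a uniform prior the posterior mode coincides with the MLE (0.8), while the posterior mean is pulled very slightly towards 0.5 by the prior.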

Bayesian vs Frequentist Inference

Everything is assigned a distribution (prior, posterior); we are allowed to incorporate prior information about the parameter . . . which is then updated using the likelihood function . . . leading to the posterior distribution, which tells us everything we need to know about the parameter.


Bayesian Inference: The Prior

One of the biggest criticisms of the Bayesian paradigm is the use of the prior distribution:
- choose a very informative prior and you can come up with favourable results;
- "I know nothing about the parameter; what prior do I choose?"

Arguments against that criticism:
- priors should be chosen before we see the data, and it is very often the case that there is some prior information available (e.g. previous studies);
- if we know nothing about the parameter, we can assign it a so-called uninformative (or vague) prior;
- if there is a lot of data available, the posterior distribution will not be influenced (too much) by the prior, and vice versa.
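A quick numerical illustration of the last point, using the conjugate Beta-Binomial posterior mean; the two priors below are chosen purely for illustration:

```python
def post_mean(a, b, n, k):
    """Posterior mean of theta under a Beta(a, b) prior with k/n successes."""
    return (a + k) / (a + b + n)

# A strong Beta(20, 20) prior (centred at 0.5) vs a vague Beta(1, 1) prior
for n, k in [(10, 8), (1000, 800)]:
    vague = post_mean(1, 1, n, k)
    strong = post_mean(20, 20, n, k)
    print(f"n={n}: vague prior -> {vague:.3f}, strong prior -> {strong:.3f}")
```

With only 10 observations the two priors give noticeably different answers; with 1000 observations the data dominate and the answers nearly coincide.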

Bayesian Inference: The Posterior

Although Bayesian inference has been around for a long time, it is only in the last two decades that it has really revolutionised the way we do statistical modelling. Although Bayesian inference is, in principle, straightforward and intuitive, it can be very hard to implement computationally. Thanks to computational developments such as Markov chain Monte Carlo (MCMC), doing Bayesian inference is now a lot easier.
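The slides do not give an MCMC example, but a minimal random-walk Metropolis sampler for the 80/100-successes posterior (uniform prior; proposal scale and chain length are illustrative choices) might look like:

```python
import math
import random

random.seed(0)

def log_post(theta, n=100, k=80):
    """Unnormalised log-posterior: binomial likelihood x uniform prior."""
    if not 0 < theta < 1:
        return float("-inf")
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Random-walk Metropolis
theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + random.gauss(0, 0.05)          # propose a small move
    log_ratio = log_post(prop) - log_post(theta)  # acceptance log-ratio
    if random.random() < math.exp(min(0.0, log_ratio)):
        theta = prop                              # accept the move
    samples.append(theta)

burned = samples[5000:]                           # discard burn-in
mean_est = sum(burned) / len(burned)
print(f"posterior mean ~ {mean_est:.3f}")
```

The sampled mean should be close to the exact conjugate answer (81/102 ≈ 0.794), which is a useful sanity check before using MCMC on models where no closed form exists.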

Bayesian Inference: Some Examples

[Figures: prior, likelihood, and posterior densities for the probability of success θ, plotted over [0, 1]; a sequence of slides shows the case of 83/100 observed successes, and one slide shows the case of 8/10 observed successes.]
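Curves like those in the figures can be reproduced numerically by evaluating the likelihood on a grid and normalising. The sketch below (uniform prior assumed) also answers the earlier question "what is P(θ > 0.6)?" for the 83/100 case:

```python
from math import comb

def posterior_grid(n, k, m=10001):
    """Evaluate binomial likelihood x uniform prior on a grid, normalised."""
    thetas = [i / (m - 1) for i in range(m)]
    lik = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
    total = sum(lik)
    post = [v / total for v in lik]   # normalised posterior weights
    return thetas, post

thetas, post = posterior_grid(100, 83)

# A probability statement about the parameter itself:
p_gt = sum(w for t, w in zip(thetas, post) if t > 0.6)
print(f"P(theta > 0.6 | data) ~ {p_gt:.4f}")
```

With 83/100 successes the posterior mass sits well above 0.6, so this probability is essentially 1; the same code with 8/10 successes gives a much more spread-out posterior.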

Comparing Different Hypotheses: Bayesian Model Choice

Suppose that we are interested in testing two competing model hypotheses, M1 and M2. Within a Bayesian framework, the model index M can be treated as an extra parameter (alongside the other parameters in M1 and M2). So it is natural to ask: "what is the posterior model probability given the observed data?", i.e. P(M1|y) or P(M2|y).

Bayes' theorem:

P(M1|y) = π(y|M1)π(M1) / π(y)

where π(y|M1) is the marginal likelihood (also called the evidence) and π(M1) is the prior model probability.

Bayesian Model Choice (2)

Given a model selection problem in which we have to choose between two models on the basis of observed data y . . . . . . the plausibility of the two different models M1 and M2, parametrised by model parameter vectors θ1 and θ2, is assessed by the Bayes factor:

P(y|M1) / P(y|M2) = [∫ π(y|θ1, M1)π(θ1) dθ1] / [∫ π(y|θ2, M2)π(θ2) dθ2]

Bayesian model comparison does not depend on any single set of parameter values for each model. Instead, it considers the probability of the model averaged over all possible parameter values. This is similar to a likelihood-ratio test, but instead of maximising the likelihood, we average over all the parameters.
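As an illustration (this pair of models is hypothetical, not from the talk): compare M1, "θ is fixed at 0.5", against M2, "θ ~ Uniform(0, 1)", for the 80/100-successes data. Both marginal likelihoods have closed forms:

```python
from math import comb

n, k = 100, 80

# Model 1: theta fixed at 0.5 -> marginal likelihood is the binomial pmf
m1 = comb(n, k) * 0.5**n

# Model 2: theta ~ Uniform(0, 1) -> integrating the binomial likelihood
# over theta gives exactly 1/(n + 1)
m2 = 1 / (n + 1)

bf = m2 / m1
print(f"Bayes factor (M2 vs M1) = {bf:.3g}")
```

The Bayes factor overwhelmingly favours M2 here: 80/100 successes are wildly improbable under θ = 0.5, even after M2 pays its implicit penalty for spreading prior mass over all of (0, 1).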

Bayesian Model Choice (3)

Why bother? An advantage of using Bayes factors is that they automatically, and quite naturally, include a penalty for including too much model structure, and thus guard against overfitting.

No free lunch! In practical situations, the calculation of the Bayes factor relies on computationally intensive methods, such as Reversible-Jump Markov chain Monte Carlo (RJ-MCMC), which require a certain amount of expertise from the end-user.

An Example in DW-MRI Analysis

We assume that the voxel's intensity can be modelled by Si/S0 ~ N(μ_i, σ²), where we could consider (at least) two different models:

1. the Diffusion Tensor Model (Model 1), which assumes that

   μ_i = exp{ −b_i g_i^T D g_i }

2. the Simple Partial Volume Model (Model 2), which assumes that

   μ_i = f exp{ −b_i d } + (1 − f) exp{ −b_i d g_i^T C g_i }

An Example in DW-MRI Analysis (2)

Suppose that we have some measurements (intensities) for each voxel. We could fit the two different models (to the same dataset). Question: how do we tell which model fits the data best, taking into account the uncertainty associated with the parameters in each model? Answer: calculate the Bayes factor!


Conclusions

Quantification of the uncertainty in both parameter estimation and model choice is essential in any modelling exercise. A Bayesian approach offers a natural framework to deal with parameter and model uncertainty. It offers much more than a single "best fit" or any sort of "sensitivity analysis". There is no free lunch, unfortunately: to do fancy things, one often has to write one's own computer programs. Software available: R, WinBUGS, BayesX, . . .