{{short description|Statistics concept}}
{{primary sources|date=August 2016}}
{{Bayesian statistics}}{{Probability fundamentals}}
'''Bayesian programming''' is a formalism and a methodology for specifying [[Probability distribution|probabilistic models]] and solving problems when less than the necessary information is available.
[[Edwin Thompson Jaynes|Edwin T. Jaynes]] proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book ''Probability Theory: The Logic of Science''<ref name="Jaynes2003">{{cite book|first=E. T. |last=Jaynes|title=Probability Theory: The Logic of Science|url={{google books |plainurl=y |id=UjsgAwAAQBAJ}}|date=10 April 2003|publisher=Cambridge University Press|isbn=978-1-139-43516-1}}</ref> he developed this theory and proposed what he called "the robot," which was not a physical device, but an [[inference engine]] to automate probabilistic reasoning—a kind of [[Prolog]] for probability instead of logic. Bayesian programming<ref name="BessiereMazer2013">{{cite book|first1=Pierre |last1=Bessiere|first2=Emmanuel |last2=Mazer|first3=Juan |last3=Manuel Ahuactzin|first4=Kamel |last4=Mekhnacha|title=Bayesian Programming|url={{google books |plainurl=y |id=4XtcAgAAQBAJ}}|date=20 December 2013|publisher=CRC Press|isbn=978-1-4398-8032-6}}</ref> is a formal and concrete implementation of this "robot".
Bayesian programming may also be seen as an algebraic formalism to specify [[graphical model]]s such as, for instance, [[Bayesian network]]s, [[dynamic Bayesian network]]s, [[Kalman filter]]s or [[hidden Markov model]]s. Indeed, Bayesian programming is more general than [[Bayesian network]]s and has a power of expression equivalent to probabilistic [[factor graph]]s.<ref>{{Cite web|url=http://bcf.usc.edu/~rosenblo/Pubs/agi15_demski.pdf|title=Expression Graphs: Unifying Factor Graphs and Sum-Product Networks|website=bcf.usc.edu}}</ref>
== Formalism ==
A Bayesian program is a means of specifying a family of probability distributions.
The constituent elements of a Bayesian program are presented below:<ref>{{Cite web|url=https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec15.pdf|title=Probabilistic Modeling and Bayesian Analysis|website=ocw.mit.edu}}</ref>
: <math>
\text{Program}
\begin{cases}
\text{Description}
\begin{cases}
\text{Specification}(\pi)
\begin{cases}
\text{Variables}\\
\text{Decomposition}\\
\text{Forms}
\end{cases}\\
\text{Identification (based on }\delta)
\end{cases}\\
\text{Question}
\end{cases}
</math>
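This structure can be mirrored directly in code. The following Python sketch is an illustration only (the type names and fields are assumptions made for this example, not a standard implementation):

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Specification:
    variables: List[str]            # the relevant variables
    decomposition: List[str]        # the factors P(L_k | R_k)
    forms: Dict[str, Callable]      # a parametric form per factor

@dataclass
class Description:
    specification: Specification                                    # preliminary knowledge (pi)
    identification: Dict[str, float] = field(default_factory=dict)  # parameters learned from data (delta)

@dataclass
class BayesianProgram:
    description: Description
    question: str                   # e.g. "P(Spam | w_0 ... w_N-1)"

# Hypothetical usage for the spam filter developed later in this article.
spam_filter = BayesianProgram(
    description=Description(
        specification=Specification(
            variables=["Spam", "W_0", "...", "W_N-1"],
            decomposition=["P(Spam)", "P(W_n | Spam)"],
            forms={},
        )
    ),
    question="P(Spam | w_0 ... w_N-1)",
)
</syntaxhighlight>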
The purpose of a description is to specify an effective method of computing a [[joint probability distribution]]
on a set of [[Random variable|variables]] <math>\left\{ X_{1},X_{2},\cdots,X_{N}\right\}</math> given a set of experimental data <math>\delta</math> and some
specification <math>\pi</math>. This [[Joint probability distribution|joint distribution]] is denoted as: <math>P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)</math>.<ref>{{Cite web|url=http://www.cs.brandeis.edu/~cs134/K_F_Ch3.pdf|title=Bayesian Networks|website=cs.brandeis.edu}}</ref>
To specify preliminary knowledge <math>\pi</math>, the programmer must undertake the following:

Given a partition of <math>\left\{ X_{1},X_{2},\cdots,X_{N}\right\}</math> containing <math>K</math> subsets, <math>K</math> variables are defined <math>L_{1},\cdots,L_{K}</math>, each corresponding to one of these subsets. Each variable <math>L_{k}</math> is obtained as the conjunction of the variables belonging to the <math>k^{th}</math> subset. Recursive application of [[Bayes' theorem]] leads to:

: <math>
\begin{align}
 & P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)\\
={} & P\left(L_{1}\wedge\cdots\wedge L_{K}\mid\delta\wedge\pi\right)\\
={} & P\left(L_{1}\mid\delta\wedge\pi\right)\times P\left(L_{2}\mid L_{1}\wedge\delta\wedge\pi\right)\times\cdots\times P\left(L_{K}\mid L_{K-1}\wedge\cdots\wedge L_{1}\wedge\delta\wedge\pi\right)
\end{align}
</math>
[[Conditional independence]] hypotheses then allow further simplifications. A conditional independence hypothesis for variable <math>L_{k}</math> is defined by choosing some variable <math>X_{n}</math> among the variables appearing in the conjunction <math>L_{k-1}\wedge\cdots\wedge L_{2}\wedge L_{1}</math>, labelling <math>R_{k}</math> as the conjunction of these chosen variables and setting:

: <math>
P\left(L_{k}\mid L_{k-1}\wedge\cdots\wedge L_{1}\wedge\delta\wedge\pi\right)=P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)
</math>

We then obtain:

: <math>
P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)=P\left(L_{1}\mid\delta\wedge\pi\right)\times\prod_{k=2}^{K}\left[P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)\right]
</math>

Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the [[Chain rule (probability)|chain rule]]. This ensures that each variable appears at most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions.{{citation needed}}

Each distribution <math>P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)</math> appearing in the product is then associated with either a parametric form (i.e., a function <math>f_{\mu}\left(L_{k}\right)</math>) or a question to another Bayesian program.

When it is a form <math>f_{\mu}\left(L_{k}\right)</math>, in general, <math>\mu</math> is a vector of parameters that may depend on <math>R_{k}</math> or <math>\delta</math> or both. Learning takes place when some of these parameters are computed using the data set <math>\delta</math>.

An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. <math>P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)</math> is obtained by some inferences done by another Bayesian program defined by the specifications <math>\pi_{1}</math> and the data <math>\delta_{1}</math>. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.

Given a description (i.e., <math>P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)</math>), a question is obtained by partitioning <math>\left\{ X_{1},X_{2},\cdots,X_{N}\right\}</math> into three sets: the searched variables, the known variables and the free variables. The 3 variables <math>Searched</math>, <math>Known</math> and <math>Free</math> are defined as the conjunction of the variables belonging to these sets.

A question is defined as the set of distributions:

: <math>
P\left(Searched\mid Known\wedge\delta\wedge\pi\right)
</math>

made of as many "instantiated questions" as the cardinal of <math>Known</math>, each instantiated question being the distribution:

: <math>
P\left(Searched\mid known\wedge\delta\wedge\pi\right)
</math>

Given the joint distribution <math>P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)</math>, it is always possible to compute any possible question using the following general inference:

: <math>
\begin{align}
 & P\left(Searched\mid known\wedge\delta\wedge\pi\right)\\
={} & \sum_{Free}\left[P\left(Searched\wedge Free\mid known\wedge\delta\wedge\pi\right)\right]\\
={} & \frac{\sum_{Free}\left[P\left(Searched\wedge Free\wedge known\mid\delta\wedge\pi\right)\right]}{P\left(known\mid\delta\wedge\pi\right)}\\
={} & \frac{\sum_{Free}\left[P\left(Searched\wedge Free\wedge known\mid\delta\wedge\pi\right)\right]}{\sum_{Searched\wedge Free}\left[P\left(Searched\wedge Free\wedge known\mid\delta\wedge\pi\right)\right]}\\
={} & \frac{1}{Z}\times\sum_{Free}\left[P\left(Searched\wedge Free\wedge known\mid\delta\wedge\pi\right)\right]
\end{align}
</math>

where the first equality results from the marginalization rule, the second results from [[Bayes' theorem]] and the third corresponds to a second application of marginalization. The denominator appears to be a normalization term and can be replaced by a constant <math>Z</math>.

Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing <math>P\left(Searched\mid known\wedge\delta\wedge\pi\right)</math> exhaustively and exactly is too great in almost all cases.

Replacing the joint distribution by its decomposition we get:

: <math>
P\left(Searched\mid known\wedge\delta\wedge\pi\right)=\frac{1}{Z}\times\sum_{Free}\left[P\left(L_{1}\mid\delta\wedge\pi\right)\times\prod_{k=2}^{K}\left[P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)\right]\right]
</math>

which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower dimension distributions.
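The general inference rule can be made concrete with a short sketch. The following Python fragment is an illustration only (the three-variable decomposition and its probability tables are made-up assumptions, not part of the formalism); it computes <math>P\left(Searched\mid known\right)</math> by marginalizing the free variable and normalizing:

<syntaxhighlight lang="python">
# Toy decomposition P(A and B and C) = P(A) x P(B | A) x P(C | B) over
# three binary variables; the probability tables are made-up numbers.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {(True, True): 0.9, (False, True): 0.1,    # key: (b, a)
               (True, False): 0.2, (False, False): 0.8}
P_C_given_B = {(True, True): 0.5, (False, True): 0.5,    # key: (c, b)
               (True, False): 0.05, (False, False): 0.95}

def joint(a, b, c):
    """Joint probability obtained from the decomposition."""
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

def infer(known_c):
    """P(A | C = known_c): A is Searched, C is Known, B is Free."""
    unnormalized = {a: sum(joint(a, b, known_c) for b in (True, False))
                    for a in (True, False)}          # marginalize the free variable B
    z = sum(unnormalized.values())                   # normalization constant Z
    return {a: p / z for a, p in unnormalized.items()}

print(infer(known_c=True))
</syntaxhighlight>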
=== Bayesian spam detection ===

The purpose of Bayesian spam filtering is to eliminate junk e-mails.

The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words. Using these words without taking the order into account is commonly called a [[Bag-of-words model|bag of words model]].

The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its own decision. It will hence adapt to the user's criteria to differentiate between non-spam and spam. It will improve its results as it encounters increasingly classified e-mails.
==== Variables ====

The variables necessary to write this program are as follows:
# <math>Spam</math>: a binary variable, false if the e-mail is not spam and true otherwise.
# <math>W_0,W_1, \ldots, W_{N-1}</math>: <math>N</math> [[binary data|binary variables]]. <math>W_n</math> is true if the <math>n^{th}</math> word of the dictionary is present in the text.
These <math>N + 1</math> binary variables sum up all the information about an e-mail.

==== Decomposition ====

Starting from the joint distribution and applying recursively [[Bayes' theorem]] we obtain:

: <math>
\begin{align}
 & P\left(Spam\wedge W_{0}\wedge\cdots\wedge W_{N-1}\right)\\
={} & P\left(Spam\right)\times P\left(W_{0}\mid Spam\right)\times P\left(W_{1}\mid Spam\wedge W_{0}\right)\times\cdots\times P\left(W_{N-1}\mid Spam\wedge W_{0}\wedge\cdots\wedge W_{N-2}\right)
\end{align}
</math>

This is an exact mathematical expression.

It can be drastically simplified by assuming that the probability of appearance of a word knowing the nature of the text (spam or not) is independent of the appearance of the other words. This is the naive Bayes assumption and this makes this spam filter a naive Bayes model.

For instance, the programmer can assume that:

: <math>
P\left(W_{n}\mid Spam\wedge W_{0}\wedge\cdots\wedge W_{n-1}\right)=P\left(W_{n}\mid Spam\right)
</math>

to finally obtain:

: <math>
P\left(Spam\wedge W_{0}\wedge\cdots\wedge W_{N-1}\right)=P\left(Spam\right)\times\prod_{n=0}^{N-1}\left[P\left(W_{n}\mid Spam\right)\right]
</math>
This kind of assumption is known as the [[Naive Bayes classifier|naive Bayes' assumption]]. It is "naive" in the sense that the independence between words is clearly not completely true. For instance, it completely neglects that the appearance of pairs of words may be more significant than isolated appearances. However, the programmer may assume this hypothesis and may develop the model and the associated inferences to test how reliable and efficient it is.
==== Parametric forms ====

To be able to compute the joint distribution, the programmer must now specify the <math>N+1</math> distributions appearing in the decomposition. The <math>N</math> forms <math>P\left(W_{n}\mid Spam\right)</math> may, for instance, be specified using [[Rule of succession|Laplace's rule of succession]]:

: <math>
\begin{cases}
P\left(\left[W_{n}=\text{true}\right]\mid\left[Spam=\text{false}\right]\right)=\frac{1+a_{f}^{n}}{2+b_{f}}\\
P\left(\left[W_{n}=\text{true}\right]\mid\left[Spam=\text{true}\right]\right)=\frac{1+a_{t}^{n}}{2+b_{t}}
\end{cases}
</math>

where <math>a_{f}^{n}</math> stands for the number of appearances of the <math>n^{th}</math> word in non-spam e-mails and <math>b_{f}</math> stands for the total number of non-spam e-mails. Similarly, <math>a_{t}^{n}</math> stands for the number of appearances of the <math>n^{th}</math> word in spam e-mails and <math>b_{t}</math> stands for the total number of spam e-mails.

The <math>N</math> forms <math>P\left(W_{n}\mid Spam\right)</math> are not yet completely specified because the parameters <math>a_{f}^{n}</math>, <math>a_{t}^{n}</math>, <math>b_{f}</math> and <math>b_{t}</math> have no values yet.

==== Identification ====
The identification of these parameters could be done either by batch processing a series of classified e-mails or by an incremental updating of the parameters using the user's classifications of the e-mails as they arrive.
Both methods could be combined: the system could start with initial standard values of these parameters issued from a generic database, then some [[incremental learning]] customizes the classifier to each individual user.
==== Question ====

The question asked of the program is: "what is the probability for a given text to be spam knowing which words appear and don't appear in this text?" It can be formalized by:

: <math>
P\left(Spam\mid w_{0}\wedge\cdots\wedge w_{N-1}\right)
</math>

which can be computed as follows:

: <math>
P\left(Spam\mid w_{0}\wedge\cdots\wedge w_{N-1}\right)=\frac{P\left(Spam\right)\times\prod_{n=0}^{N-1}\left[P\left(w_{n}\mid Spam\right)\right]}{\sum_{Spam}\left[P\left(Spam\right)\times\prod_{n=0}^{N-1}\left[P\left(w_{n}\mid Spam\right)\right]\right]}
</math>

The denominator appears to be a normalization constant. It is not necessary to compute it to decide if we are dealing with spam. For instance, an easy trick is to compute the ratio:

: <math>
\frac{P\left(\left[Spam=\text{true}\right]\mid w_{0}\wedge\cdots\wedge w_{N-1}\right)}{P\left(\left[Spam=\text{false}\right]\mid w_{0}\wedge\cdots\wedge w_{N-1}\right)}=\frac{P\left(\left[Spam=\text{true}\right]\right)}{P\left(\left[Spam=\text{false}\right]\right)}\times\prod_{n=0}^{N-1}\left[\frac{P\left(w_{n}\mid\left[Spam=\text{true}\right]\right)}{P\left(w_{n}\mid\left[Spam=\text{false}\right]\right)}\right]
</math>

This computation is faster and easier because it requires only <math>2N</math> products.

The Bayesian spam filter program is completely defined by:

: <math>
\text{Program}
\begin{cases}
\text{Description}
\begin{cases}
\text{Specification}(\pi)
\begin{cases}
\text{Variables: } Spam,W_{0},\cdots,W_{N-1}\\
\text{Decomposition: } P\left(Spam\wedge W_{0}\wedge\cdots\wedge W_{N-1}\right)=P\left(Spam\right)\times\prod_{n=0}^{N-1}\left[P\left(W_{n}\mid Spam\right)\right]\\
\text{Forms: Laplace succession laws}
\end{cases}\\
\text{Identification (based on }\delta)
\end{cases}\\
\text{Question: } P\left(Spam\mid w_{0}\wedge\cdots\wedge w_{N-1}\right)
\end{cases}
</math>

=== Bayesian filters ===

Bayesian filters (often called [[recursive Bayesian estimation]]) are generic probabilistic models for time evolving processes. Numerous models are particular instances of this generic approach, for instance: the [[Kalman filter]] or the [[hidden Markov model]] (HMM).

The decomposition is based:

* on a series of states <math>S^{0},\cdots,S^{T}</math> and a series of observations <math>O^{0},\cdots,O^{T}</math>:

: <math>
P\left(S^{0}\wedge\cdots\wedge S^{T}\wedge O^{0}\wedge\cdots\wedge O^{T}\right)=P\left(S^{0}\right)\times P\left(O^{0}\mid S^{0}\right)\times\prod_{t=1}^{T}\left[P\left(S^{t}\mid S^{t-1}\right)\times P\left(O^{t}\mid S^{t}\right)\right]
</math>

* on a transition model <math>P\left(S^{t}\mid S^{t-1}\right)</math>, formalizing the transition from the state at time <math>t-1</math> to the state at time <math>t</math>;
* on an observation model <math>P\left(O^{t}\mid S^{t}\right)</math>, expressing what can be observed at time <math>t</math> when the system is in state <math>S^{t}</math>.

The parametrical forms are not constrained and different choices lead to different well-known models: see Kalman filters and hidden Markov models just below.

The typical question for such models is <math>P\left(S^{t+k}\mid O^{0}\wedge\cdots\wedge O^{t}\right)</math>: what is the probability distribution for the state at time <math>t+k</math> knowing the observations from instant <math>0</math> to <math>t</math>?
The most common case is Bayesian filtering where <math>k=0</math>, which searches for the present state, knowing past observations.
However, it is also possible <math>(k>0)</math> to extrapolate a future state from past observations, or to do smoothing <math>(k<0)</math> to recover a past state from observations made either before or after that instant.
More complicated questions may also be asked as shown below in the HMM section.
Bayesian filters <math>(k=0)</math> have a very interesting recursive property, which contributes greatly to their attractiveness. <math>P\left(S^{t}|O^{0}\wedge\cdots\wedge O^{t}\right)</math> may be computed simply from <math>P\left(S^{t-1}\mid O^0 \wedge \cdots \wedge O^{t-1}\right)</math> with the following formula:

: <math>
P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t}\right)=\frac{1}{Z}\times P\left(O^{t}\mid S^{t}\right)\times\sum_{S^{t-1}}\left[P\left(S^{t}\mid S^{t-1}\right)\times P\left(S^{t-1}\mid O^{0}\wedge\cdots\wedge O^{t-1}\right)\right]
</math>

Another interesting point of view for this equation is to consider that there are two phases: a prediction phase and an estimation phase.

During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:

: <math>
P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t-1}\right)=\sum_{S^{t-1}}\left[P\left(S^{t}\mid S^{t-1}\right)\times P\left(S^{t-1}\mid O^{0}\wedge\cdots\wedge O^{t-1}\right)\right]
</math>

During the estimation phase, the prediction is either confirmed or invalidated using the last observation:

: <math>
P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t}\right)=\frac{1}{Z}\times P\left(O^{t}\mid S^{t}\right)\times P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t-1}\right)
</math>
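The recursion can be sketched for a discrete state space. The following Python fragment is an illustration only; the two-state transition and observation tables are assumptions made for the example:

<syntaxhighlight lang="python">
# A minimal discrete Bayesian filter for an assumed two-state system.
states = (0, 1)
P_trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}                 # P(S^t | S^t-1)
P_obs = {0: {"low": 0.7, "high": 0.3}, 1: {"low": 0.1, "high": 0.9}}  # P(O^t | S^t)

def filter_step(belief, observation):
    """One prediction + estimation step of P(S^t | O^0 ... O^t)."""
    # Prediction phase: propagate the previous belief through the dynamics.
    predicted = {s: sum(P_trans[prev][s] * belief[prev] for prev in states)
                 for s in states}
    # Estimation phase: weight the prediction by the likelihood of O^t.
    unnormalized = {s: P_obs[s][observation] * predicted[s] for s in states}
    z = sum(unnormalized.values())
    return {s: p / z for s, p in unnormalized.items()}

belief = {0: 0.5, 1: 0.5}
for obs in ["high", "high", "low"]:
    belief = filter_step(belief, obs)
print(belief)
</syntaxhighlight>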
==== Kalman filter ====
The very well-known [[Kalman filter]]s<ref>{{cite journal|last=Kalman|first=R. E.|s2cid=1242324|title=A New Approach to Linear Filtering and Prediction Problems|journal=Journal of Basic Engineering|year=1960|volume=82|pages=33–45|doi=10.1115/1.3662552}}</ref> are a special case of Bayesian filters.

They are defined by a Bayesian program in which the states and the observations are continuous, and the transition and observation models are specified using [[Normal distribution|Gaussian laws]] whose means are linear functions of the conditioning variables:

: <math>
\begin{cases}
P\left(S^{t}\mid S^{t-1}\right)\equiv G\left(S^{t},A\bullet S^{t-1},Q\right)\\
P\left(O^{t}\mid S^{t}\right)\equiv G\left(O^{t},H\bullet S^{t},R\right)
\end{cases}
</math>

where <math>A</math>, <math>Q</math>, <math>H</math> and <math>R</math> are the parameters of the models.

With these hypotheses and by using the recursive formula, it is possible to solve the inference problem analytically to answer the usual <math>P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t}\right)</math> question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their everyday applications.

When there are no obvious linear transition and observation models, it is still often possible, using a first-order [[Taylor series|Taylor's expansion]], to treat these models as locally linear. This generalization is commonly called the [[extended Kalman filter]].
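A scalar sketch of the resulting algorithm makes the prediction and estimation phases explicit. This is an illustration only: the state is one-dimensional and the noise parameters <code>q</code> and <code>r</code> are assumed values, not prescribed by the formalism.

<syntaxhighlight lang="python">
# A one-dimensional Kalman filter sketch (scalar state, linear-Gaussian models).
def kalman_step(mean, var, obs, a=1.0, q=0.01, h=1.0, r=0.1):
    """One step for S^t = a*S^t-1 + noise(q) and O^t = h*S^t + noise(r)."""
    # Prediction phase.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Estimation phase: the Kalman gain weighs prediction against observation.
    gain = pred_var * h / (h * h * pred_var + r)
    new_mean = pred_mean + gain * (obs - h * pred_mean)
    new_var = (1.0 - gain * h) * pred_var
    return new_mean, new_var

mean, var = 0.0, 1.0
for obs in [0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, obs)
print(mean, var)
</syntaxhighlight>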
==== Hidden Markov models ====

[[Hidden Markov model]]s (HMMs) are another very popular specialization of Bayesian filters.

They are defined by a Bayesian program in which the states and the observations are discrete, and the transition model <math>P\left(S^{t}\mid S^{t-1}\right)</math> and the observation model <math>P\left(O^{t}\mid S^{t}\right)</math> are both specified using probability matrices.

The question asked of such models is: what is the most probable series of states that leads to the present state, knowing the past observations? This particular question may be answered with a specific and very efficient algorithm called the [[Viterbi algorithm]].

The [[Baum–Welch algorithm]] has been developed for HMMs.
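The Viterbi question can be sketched as follows. This is an illustration only: the two-state weather/umbrella tables are assumptions of the kind commonly used in textbooks, not part of any model described above.

<syntaxhighlight lang="python">
# A minimal Viterbi sketch: the most probable state series given observations.
def viterbi(observations, states, p_init, p_trans, p_obs):
    """Return the most probable sequence of hidden states."""
    # best[s] = (probability, path) of the best series ending in state s.
    best = {s: (p_init[s] * p_obs[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((prob * p_trans[prev][s] * p_obs[s][obs], path + [s])
                 for prev, (prob, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

states = ("rain", "dry")
p_init = {"rain": 0.5, "dry": 0.5}
p_trans = {"rain": {"rain": 0.7, "dry": 0.3}, "dry": {"rain": 0.3, "dry": 0.7}}
p_obs = {"rain": {"umbrella": 0.9, "none": 0.1},
         "dry": {"umbrella": 0.2, "none": 0.8}}
print(viterbi(["umbrella", "umbrella", "none"], states, p_init, p_trans, p_obs))
</syntaxhighlight>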
=== Applications ===

Since 2000, Bayesian programming has been used to develop both robotics applications and life sciences models.

==== Robotics ====
In robotics, Bayesian programming was applied to [[autonomous robotics]],<ref>{{cite journal|last=Lebeltel|first=O. |author2=Bessière, P. |author3=Diard, J. |author4=Mazer, E.|title=Bayesian Robot Programming|journal=Advanced Robotics|year=2004|volume=16|issue=1|pages=49–79|doi=10.1023/b:auro.0000008671.38949.43|s2cid=18768468 |url=http://cogprints.org/1670/5/Lebeltel2000.pdf }}</ref><ref>{{cite journal|last=Diard|first=J. |author2=Gilet, E. |author3=Simonin, E. |author4=Bessière, P.|title=Incremental learning of Bayesian sensorimotor models: from low-level behaviours to large-scale structure of the environment|journal=Connection Science|year=2010|volume=22|issue=4|pages=291–312|doi=10.1080/09540091003682561|bibcode=2010ConSc..22..291D |s2cid=216035458 |url=https://hal.archives-ouvertes.fr/hal-00537809/file/diard10_author.pdf }}</ref><ref>{{cite journal|last=Pradalier|first=C. |author2=Hermosillo, J. |author3=Koike, C. |author4=Braillon, C. |author5=Bessière, P. |author6=Laugier, C.|title=The CyCab: a car-like robot navigating autonomously and safely among pedestrians|journal=Robotics and Autonomous Systems|year=2005|volume=50|issue=1|pages=51–68|doi=10.1016/j.robot.2004.10.002|citeseerx=10.1.1.219.69 }}</ref><ref>{{cite journal|last=Ferreira|first=J. |author2=Lobo, J. |author3=Bessière, P. |author4=Castelo-Branco, M. |author5=Dias, J.|title=A Bayesian Framework for Active Artificial Perception|journal=IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics |year=2012|volume=99|issue=2 |pages=1–13|doi=10.1109/TSMCB.2012.2214477 |pmid=23014760 |s2cid=1808051 |url=https://hal.archives-ouvertes.fr/hal-00747148/file/A_Bayesian_Framework_for_Active_Artificial_Perception.pdf }}</ref><ref name=Ferreira2014>{{cite book|last=Ferreira|first=J. F.|title=Probabilistic Approaches to Robotic Perception|year=2014|publisher=Springer|author2=Dias, J. M.|isbn= 978-3-319-02005-1}}</ref> robotic [[Computer-aided design|CAD]] systems,<ref>{{cite journal|last=Mekhnacha|first=K. |author2=Mazer, E. |author3=Bessière, P.|title=The design and implementation of a Bayesian CAD modeler for robotic applications|journal=Advanced Robotics|year=2001|volume=15|issue=1|pages=45–69|doi=10.1163/156855301750095578|citeseerx=10.1.1.552.3126 |s2cid=7920387 }}</ref> [[advanced driver-assistance systems]],<ref>{{cite journal|last=Coué|first=C. |author2=Pradalier, C. |author3=Laugier, C. |author4=Fraichard, T. |author5=Bessière, P.|title=Bayesian Occupancy Filtering for Multitarget Tracking: an Automotive Application|journal=International Journal of Robotics Research|year=2006|volume=25|issue=1|pages=19–30|doi=10.1177/0278364906061158|s2cid=13874685 |url=https://hal.inria.fr/inria-00182004/file/coue-etal-ijrr-06.pdf }}</ref> [[robotic arm]] control, [[mobile robot]]ics,<ref>{{cite journal|last=Vasudevan|first=S.|author2=Siegwart, R.|title=Bayesian space conceptualization and place classification for semantic maps in mobile robotics|journal=Robotics and Autonomous Systems|year=2008|volume=56|pages=522–537|doi=10.1016/j.robot.2008.03.005|issue=6|citeseerx=10.1.1.149.4189}}</ref><ref>{{cite journal|last=Perrin|first=X. |author2=Chavarriaga, R. |author3=Colas, F. |author4=Seigwart, R. |author5=Millan, J.|title=Brain-coupled interaction for semi-autonomous navigation of an assistive robot|journal=Robotics and Autonomous Systems|year=2010|volume=58|pages=1246–1255|doi=10.1016/j.robot.2010.05.010|issue=12|url=http://infoscience.epfl.ch/record/149091 }}</ref> human-robot interaction,<ref>{{cite journal|last=Rett|first=J. |author2=Dias, J. |author3=Ahuactzin, J-M. |title=Bayesian reasoning for Laban Movement Analysis used in human-machine interaction|journal=International Journal of Reasoning-Based Intelligent Systems|year=2010|volume=2|issue=1|pages=13–35|doi=10.1504/IJRIS.2010.029812|citeseerx=10.1.1.379.6216 }}</ref> human-vehicle interaction (Bayesian autonomous driver models)<ref>
{{cite conference
| last1 = Möbus | first1 = C.
| last2 = Eilers | first2 = M.
| editor-last = Duffy
| editor-first = Vincent G.
| title = Probabilistic and Empirical Grounded Modeling of Agents in (Partial) Cooperative Traffic Scenarios
| book-title = Digital Human Modeling
| conference = Second International Conference, ICDHM 2009, San Diego, CA, USA
| volume = 5620
| year = 2009
| pages = 423–432
| publisher = Springer
| isbn = 978-3-642-02808-3
| doi = 10.1007/978-3-642-02809-0_45
| doi-access = free
| contribution-url = https://link.springer.com/chapter/10.1007%2F978-3-642-02809-0_45
| url = http://oops.uni-oldenburg.de/1844/1/PartialCooperative20090223_PCM.pdf
}}
</ref><ref>
{{cite conference
| last1 = Möbus | first1 = C.
| last2 = Eilers | first2 = M.
| editor-last = Duffy
| editor-first = Vincent G.
| title = Further Steps Towards Driver Modeling according to the Bayesian Programming Approach
| book-title = Digital Human Modeling
| conference = Second International Conference, ICDHM 2009, San Diego, CA, USA
| volume = 5620
| year = 2009
| pages = 413–422
| publisher = Springer
| isbn = 978-3-642-02808-3
| contribution-url = https://link.springer.com/chapter/10.1007%2F978-3-642-02809-0_44
}}
</ref><ref>
{{cite conference
| last = Eilers
| first = M.
| author2 = Möbus, C.
| title = Lernen eines modularen Bayesian Autonomous Driver Mixture-of-Behaviors (BAD MoB) Modells
| book-title = Fahrermodellierung - Zwischen kinematischen Menschmodellen und dynamisch-kognitiven Verhaltensmodellen
| editor1-last = Kolrep
| editor1-first = H.
| editor2-last = Jürgensohn
| editor2-first = Th.
| pages = 61–74
| publisher = VDI-Verlag
| series = Fortschrittsbericht des VDI in der Reihe 22 (Mensch-Maschine-Systeme)
| url = http://www.lks.uni-oldenburg.de/download/Publikationen/2010/Eilers&PCM2010_BFFM_BAD_MoB_Modells2010.pdf
| isbn = 978-3-18-303222-8
}}
</ref><ref>
{{cite conference
| last = Eilers
| first = M.
| author2 = Möbus, C.
| title = Learning the Relevant Percepts of Modular Hierarchical Bayesian Driver Models Using a Bayesian Information Criterion
| book-title = Digital Human Modeling
| editor-last = Duffy
| pages = 463–472
| isbn = 978-3-642-21798-2
| doi = 10.1007/978-3-642-21799-9_52
| doi-access = free
}}
</ref><ref>
{{cite conference
| last = Eilers
| first = M.
| author2 = Möbus, C.
| title = Learning of a Bayesian Autonomous Driver Mixture-of-Behaviors (BAD-MoB) Model
| book-title = Advances in Applied Digital Human Modeling
| editor-last = Duffy
| pages = 436–445
| contribution-url = http://www.lks.uni-oldenburg.de/46350.html
| isbn = 978-1-4398-3511-1
}}
</ref> [[video game]] avatar programming and training<ref>{{cite journal|last=Le Hy|first=R. |author2=Arrigoni, A. |author3=Bessière, P. |author4=Lebetel, O.|title=Teaching Bayesian Behaviours to Video Game Characters|journal=Robotics and Autonomous Systems|year=2004|volume=47|pages=177–185|doi=10.1016/j.robot.2004.03.012|issue=2–3|s2cid=16415524 |url=http://cogprints.org/3744/1/lehy04.pdf }}</ref> and real-time strategy games (AI).<ref>{{cite book|title=Bayesian Programming and Learning for Multiplayer Video Games|last=Synnaeve|first=G.|year=2012|url=http://tel.archives-ouvertes.fr/docs/00/78/06/35/PDF/29588_SYNNAEVE_2012_archivage.pdf}}</ref>
==== Life sciences ====
In life sciences, Bayesian programming was used in vision to reconstruct shape from motion,<ref>{{cite journal|last=Colas|first=F. |author2=Droulez, J. |author3=Wexler, M. |author4=Bessière, P.|title=A unified probabilistic model of the perception of three-dimensional structure from optic flow|journal=Biological Cybernetics|volume=97 |issue=5–6 |year=2008|pages=461–77|doi=10.1007/s00422-007-0183-z|pmid=17987312 |citeseerx=10.1.1.215.1491 |s2cid=215821150 }}</ref> to model visuo-vestibular interaction<ref>{{cite journal|last=Laurens|first=J.|author2=Droulez, J.|title=Bayesian processing of vestibular information|journal=Biological Cybernetics|year=2007|volume=96|pages=389–404|doi=10.1007/s00422-006-0133-1|pmid=17146661|issue=4|s2cid=18138027}}</ref> and to study [[Saccade|saccadic]] eye movements;<ref>{{cite journal|last=Colas|first=F. |author2=Flacher, F. |author3=Tanner, T. |author4=Bessière, P. |author5=Girard, B.|title=Bayesian models of eye movement selection with retinotopic maps|journal=Biological Cybernetics|year=2009|volume=100|issue=3|pages=203–214|doi=10.1007/s00422-009-0292-y|pmid=19212780 |s2cid=5906668 |url=https://hal.archives-ouvertes.fr/hal-00384515/file/main.pdf|doi-access=free}}</ref> in speech perception and control to study early [[speech acquisition]]<ref>{{cite journal|last=Serkhane|first=J. |author2=Schwartz, J-L. |author3=Bessière, P. |title=Building a talking baby robot A contribution to the study of speech acquisition and evolution|journal=Interaction Studies|year=2005|volume=6|issue=2|pages=253–286|doi=10.1075/is.6.2.06ser|url=https://hal.archives-ouvertes.fr/hal-00186575/file/Serkhane_Interaction_Studies_2005.pdf }}</ref> and the emergence of articulatory-acoustic systems;<ref>{{cite journal|last=Moulin-Frier|first=C. |author2=Laurent, R. |author3=Bessière, P. |author4=Schwartz, J-L. |author5=Diard, J. |title=Adverse conditions improve distinguishability of auditory, motor and percep-tuo-motor theories of speech perception: an exploratory Bayesian modeling study|journal=Language and Cognitive Processes|year=2012|volume=27|issue=7–8|pages=1240–1263|doi=10.1080/01690965.2011.645313|s2cid=55504109 |url=https://hal.archives-ouvertes.fr/hal-01059179/file/moulin-frier12.pdf }}</ref> and to model handwriting perception and control.<ref>{{cite journal|last=Gilet|first=E. |author2=Diard, J. |author3=Bessière, P.|title=Bayesian Action–Perception Computational Model: Interaction of Production and Recognition of Cursive Letters|journal=PLOS ONE|year=2011|volume=6|issue=6|page=e20387|bibcode=2011PLoSO...620387G|doi=10.1371/journal.pone.0020387|editor1-last=Sporns|editor1-first=Olaf|pmid=21674043|pmc=3106017|doi-access=free }}</ref>
=== Pattern recognition ===
Bayesian program learning has potential applications in [[Speech recognition|voice recognition]] and synthesis, [[image recognition]] and natural language processing. It employs the principles of ''compositionality'' (building abstract representations from parts), ''causality'' (building complexity from parts) and ''learning to learn'' (using previously recognized concepts to ease the creation of new concepts).<ref>{{Cite web|title = New algorithm helps machines learn as quickly as humans|url = http://www.gizmag.com/artificial-intelligence-algorithm-learning/41448|website = www.gizmag.com|access-date = 2016-01-23|date = 2016-01-22}}</ref>
== Possibility theories ==
The comparison between probabilistic approaches (not only Bayesian programming) and possibility theories continues to be debated.
Possibility theories like, for instance, [[fuzzy set]]s,<ref>{{cite q| Q25938993 |last1=Zadeh |first1=L.A. | author-link1 = Lotfi A. Zadeh | journal = [[Information and Computation|Information and Control]] | doi-access = free }}</ref> [[fuzzy logic]]<ref>{{cite q| Q57275767 |last1=Zadeh |first1=L.A. | author-link1 = Lotfi A. Zadeh | publisher = [[Springer Science+Business Media|Springer]] }}</ref> and [[possibility theory]]<ref>{{cite journal|last=Dubois|first=D.|author2=Prade, H.|title=Possibility Theory, Probability Theory and Multiple-Valued Logics: A Clarification|journal=Ann. Math. Artif. Intell.|year=2001|volume=32|issue=1–4|pages=35–66|doi=10.1023/A:1016740830286|s2cid=10271476|url=ftp://ftp.irit.fr/IRIT/ADRIA/AMAI-Dub.Pra.revised.pdf}}</ref> are alternatives to probability to model uncertainty. They argue that probability is insufficient or inconvenient to model certain aspects of incomplete/uncertain knowledge.
The defense of probability is mainly based on [[Cox's theorem]], which starts from four postulates concerning rational reasoning in the presence of uncertainty. It demonstrates that the only mathematical framework that satisfies these postulates is probability theory. The argument is that any approach other than probability necessarily infringes one of these postulates, the question then being what is gained by that infringement.
== Probabilistic programming ==

The purpose of [[Probabilistic relational programming language|probabilistic programming]] is to unify the scope of classical programming languages with probabilistic modeling (especially [[bayesian network]]s) to deal with uncertainty while profiting from the programming languages' expressiveness to encode complexity.
Extended classical programming languages include logical languages as proposed in [[Abductive logic programming|Probabilistic Horn Abduction]],<ref>{{cite journal|last=Poole|first=D.|title=Probabilistic Horn abduction and Bayesian networks|journal=Artificial Intelligence|year=1993|volume=64|pages=81–129|doi=10.1016/0004-3702(93)90061-F}}</ref> Independent Choice Logic,<ref>{{cite journal|last=Poole|first=D.|title=The Independent Choice Logic for modelling multiple agents under uncertainty|journal=Artificial Intelligence|year=1997|volume=94|issue=1–2|pages=7–56|doi=10.1016/S0004-3702(97)00027-1|doi-access=free}}</ref> PRISM,<ref>{{cite journal|last=Sato|first=T.|author2=Kameya, Y.|title=Parameter learning of logic programs for symbolic-statistical modeling|journal=Journal of Artificial Intelligence Research|year=2001|volume=15|issue=2001|pages=391–454|url=http://www.jair.org/media/912/live-912-2013-jair.pdf|bibcode=2011arXiv1106.1797S|arxiv=1106.1797|doi=10.1613/jair.912|s2cid=7857569|access-date=2015-10-18|archive-url=https://web.archive.org/web/20140712033447/http://www.jair.org/media/912/live-912-2013-jair.pdf|archive-date=2014-07-12|url-status=dead}}</ref> and ProbLog which proposes an extension of Prolog.
It can also be extensions of [[Functional programming|functional programming languages]] (essentially [[Lisp (programming language)|Lisp]] and [[Scheme (programming language)|Scheme]]) such as IBAL or CHURCH. The underlying programming languages can be object-oriented as in BLOG and FACTORIE or more standard ones as in CES and FIGARO.<ref>{{github|p2t2/figaro}}</ref>
The purpose of Bayesian programming is different. Jaynes' precept of "probability as logic" argues that probability is an extension of and an alternative to logic above which a complete theory of rationality, computation and programming can be rebuilt.<ref name="Jaynes2003"/> Bayesian programming attempts to replace classical languages with a programming approach based on probability that considers incompleteness and uncertainty.

The precise comparison between the semantics and power of expression of Bayesian and probabilistic programming is an open question.
== See also ==
{{Portal|Mathematics}}
{{columns-list|colwidth=20em|
* [[Bayes' rule]]
* [[Bayesian inference]]
}}

== References ==
{{Reflist}}
== External links ==
* [https://archive.today/20131123162733/http://www.probayes.com/Bayesian-Programming-Book A companion site to the ''Bayesian Programming'' book, where ProBT, an inference engine dedicated to Bayesian programming, can be downloaded.]
* The [http://Bayesian-programming.org Bayesian-programming.org site] {{Webarchive|url=https://archive.today/20131123162815/http://bayesian-programming.org/ |date=2013-11-23 }} for the promotion of Bayesian programming with detailed information and numerous publications.
[[Category:Bayesian statistics]] |
[[Category:Bayesian statistics]] |