Conditioning (probability)

Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations, and conditional probability distributions are treated on three levels: discrete probabilities, probability density functions, and measure theory. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.



Conditioning on the discrete level


Example: A fair coin is tossed 10 times; the random variable X is the number of heads in these 10 tosses, and Y is the number of heads in the first 3 tosses. Although Y is determined before X, it may happen that someone knows X but not Y.

Conditional probability

Given that X = 1, the conditional probability of the event Y = 0 is

\mathbb {P} (Y=0|X=1)={\frac {\mathbb {P} (Y=0,X=1)}{\mathbb {P} (X=1)}}=0.7

More generally,

{\begin{aligned}\mathbb {P} (Y=0|X=x)&={\frac {\binom {7}{x}}{\binom {10}{x}}}={\frac {7!(10-x)!}{(7-x)!10!}}&&x=0,1,2,3,4,5,6,7.\\[4pt]\mathbb {P} (Y=0|X=x)&=0&&x=8,9,10.\end{aligned}}
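The closed form above can be checked by brute force. The following is a minimal sketch, not part of the original derivation: it enumerates all 2^10 equally likely toss sequences and compares the resulting ratio with the binomial-coefficient formula. The function name `cond_prob_y0_given_x` and its parameters `n` and `k` are illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import comb


def cond_prob_y0_given_x(x, n=10, k=3):
    """P(Y = 0 | X = x), counting over all 2**n equally likely outcomes."""
    joint = 0     # outcomes with X = x and no heads among the first k tosses
    marginal = 0  # outcomes with X = x
    for tosses in product((0, 1), repeat=n):
        if sum(tosses) == x:
            marginal += 1
            if sum(tosses[:k]) == 0:
                joint += 1
    return Fraction(joint, marginal)


# Compare with the closed form C(7, x) / C(10, x); math.comb returns 0
# when the lower index exceeds the upper one, which covers x = 8, 9, 10.
for x in range(11):
    assert cond_prob_y0_given_x(x) == Fraction(comb(7, x), comb(10, x))
```

Exact rational arithmetic (`Fraction`) is used so that the comparison is an equality rather than a floating-point approximation.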

One may also treat the conditional probability as a random variable, that is, as a function of the random variable X, namely,

\mathbb {P} (Y=0|X)={\begin{cases}{\binom {7}{X}}/{\binom {10}{X}}&X\leqslant 7,\\0&X>7.\end{cases}}

The expectation of this random variable is equal to the (unconditional) probability,

\mathbb {E} (\mathbb {P} (Y=0|X))=\sum _{x}\mathbb {P} (Y=0|X=x)\mathbb {P} (X=x)=\mathbb {P} (Y=0),

namely,

\sum _{x=0}^{7}{\frac {\binom {7}{x}}{\binom {10}{x}}}\cdot {\frac {1}{2^{10}}}{\binom {10}{x}}={\frac {1}{8}},

which is an instance of the law of total probability $\mathbb {E} (\mathbb {P} (A|X))=\mathbb {P} (A).$
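As a numerical check of this identity, the sketch below averages the conditional probabilities P(Y = 0 | X = x) against the binomial distribution of X, using exact fractions. The helper names `p_x` and `p_y0_given_x` are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb


def p_x(x, n=10):
    """P(X = x) for the number of heads in n fair tosses."""
    return Fraction(comb(n, x), 2 ** n)


def p_y0_given_x(x):
    """P(Y = 0 | X = x) = C(7, x) / C(10, x); zero for x > 7."""
    return Fraction(comb(7, x), comb(10, x))


# Law of total probability: average the conditional probability over X.
total = sum(p_y0_given_x(x) * p_x(x) for x in range(11))
print(total)  # 1/8
```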

Thus, $\mathbb {P} (Y=0|X=1)$ may be treated as the value of the random variable $\mathbb {P} (Y=0|X)$ corresponding to X = 1. On the other hand, $\mathbb {P} (Y=0|X=1)$ is well-defined irrespective of other possible values of X.

Conditional expectation

Given that X = 1, the conditional expectation of the random variable Y is $\mathbb {E} (Y|X=1)={\tfrac {3}{10}}.$ More generally,

\mathbb {E} (Y|X=x)={\frac {3}{10}}x,\qquad x=0,\ldots ,10.

(In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, that is, as a function of the random variable X, namely,

\mathbb {E} (Y|X)={\frac {3}{10}}X.

The expectation of this random variable is equal to the (unconditional) expectation of Y,

\mathbb {E} (\mathbb {E} (Y|X))=\sum _{x}\mathbb {E} (Y|X=x)\mathbb {P} (X=x)=\mathbb {E} (Y),

namely,

\sum _{x=0}^{10}{\frac {3}{10}}x\cdot {\frac {1}{2^{10}}}{\binom {10}{x}}={\frac {3}{2}},

or simply

\mathbb {E} \left({\frac {3}{10}}X\right)={\frac {3}{10}}\mathbb {E} (X)={\frac {3}{10}}\cdot 5={\frac {3}{2}},

which is an instance of the law of total expectation $\mathbb {E} (\mathbb {E} (Y|X))=\mathbb {E} (Y).$
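The conditional expectations and the law of total expectation can be reproduced by direct enumeration. This is a small sketch under the same coin-tossing setup; the function name `cond_exp_y_given_x` and the variables `cond_exp` and `expectation` are illustrative, not notation from the text.

```python
from fractions import Fraction
from itertools import product
from math import comb


def cond_exp_y_given_x(x, n=10, k=3):
    """E(Y | X = x): average of Y over all outcomes with exactly x heads."""
    ys = [sum(t[:k]) for t in product((0, 1), repeat=n) if sum(t) == x]
    return Fraction(sum(ys), len(ys))


# Table of E(Y | X = x) for x = 0, ..., 10.
cond_exp = {x: cond_exp_y_given_x(x) for x in range(11)}

# Law of total expectation: weight each value by P(X = x).
expectation = sum(cond_exp[x] * Fraction(comb(10, x), 2 ** 10) for x in range(11))
print(expectation)  # 3/2
```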

The random variable $\mathbb {E} (Y|X)$ is the best predictor of Y given X. That is, it minimizes the mean square error $\mathbb {E} (Y-f(X))^{2}$ on the class of all random variables of the form f(X). This class of random variables remains intact if X is replaced, say, with 2X. Thus, $\mathbb {E} (Y|2X)=\mathbb {E} (Y|X).$ It does not mean that $\mathbb {E} (Y|2X)={\tfrac {3}{10}}\times 2X;$ rather, $\mathbb {E} (Y|2X)={\tfrac {3}{20}}\times 2X={\tfrac {3}{10}}X.$ In particular, $\mathbb {E} (Y|2X=2)={\tfrac {3}{10}}.$ More generally, $\mathbb {E} (Y|g(X))=\mathbb {E} (Y|X)$ for every function g that is one-to-one on the set of all possible values of X. The values of X are irrelevant; what matters is the partition (denote it α_X)

\Omega =\{X=x_{1}\}\uplus \{X=x_{2}\}\uplus \dots

of the sample space Ω into disjoint sets {X = x_n}. (Here $x_{1},x_{2},\ldots$ are all possible values of X.) Given an arbitrary partition α of Ω, one may define the random variable E ( Y | α ). Still, E ( E ( Y | α)) = E ( Y ).
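The dependence on the partition rather than on the values of X can be illustrated as follows. In this sketch a partition is represented by a labelling function on the sample space (two outcomes lie in the same part when they receive the same label); that representation, and the names `cond_exp_given_partition`, `mse` and `e_given_coarse`, are assumptions of the example, not constructions from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

OMEGA = list(product((0, 1), repeat=10))  # all 2**10 equally likely outcomes


def X(w):
    return sum(w)      # number of heads in all 10 tosses


def Y(w):
    return sum(w[:3])  # number of heads in the first 3 tosses


def cond_exp_given_partition(y, label):
    """Return the random variable E(y | alpha) as a function on OMEGA,
    where alpha is the partition of OMEGA into level sets of `label`."""
    totals, counts = defaultdict(int), defaultdict(int)
    for w in OMEGA:
        totals[label(w)] += y(w)
        counts[label(w)] += 1
    return lambda w: Fraction(totals[label(w)], counts[label(w)])


def mse(predictor):
    """Mean square error E (Y - predictor)^2 under the uniform measure."""
    return sum((Y(w) - predictor(w)) ** 2 for w in OMEGA) / len(OMEGA)


e_given_x = cond_exp_given_partition(Y, X)
e_given_2x = cond_exp_given_partition(Y, lambda w: 2 * X(w))      # same partition
e_given_coarse = cond_exp_given_partition(Y, lambda w: X(w) >= 5)  # coarser partition

same = all(e_given_x(w) == e_given_2x(w) for w in OMEGA)
mean = sum(e_given_coarse(w) for w in OMEGA) / len(OMEGA)
print(same, mean)  # True 3/2
```

Relabelling the parts (X versus 2X) leaves the conditional expectation unchanged, while the coarser two-part partition gives a different random variable whose expectation is still E(Y). Comparing `mse(e_given_x)` with `mse(e_given_coarse)` shows the former is smaller, in line with the best-predictor property among functions of X.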

Conditional probability may be treated as a special case of conditional expectation. Namely, P ( A | X ) = E ( Y | X ) if Y is the indicator of A. Therefore the conditional probability also depends on the partition α_X generated by X rather than on X itself; for g one-to-one as above, P ( A | g(X) ) = P (A | X) = P (A | α), where α = α_X = α_{g(X)}.

On the other hand, conditioning on an event B is well-defined, provided that $\mathbb {P} (B)\neq 0,$ irrespective of any partition that may contain B as one of several parts.

Conditional distribution

Given X = x, the conditional distribution of Y is

\mathbb {P} (Y=y|X=x)={\frac {{\binom {3}{y}}{\binom {7}{x-y}}}{\binom {10}{x}}}={\frac {{\binom {x}{y}}{\binom {10-x}{3-y}}}{\binom {10}{3}}}

for 0 ≤ y ≤ min ( 3, x ). It is the hypergeometric distribution H ( x; 3, 7 ), or equivalently, H ( 3; x, 10-x ). The corresponding expectation 0.3 x, obtained from the general formula

n{\frac {R}{R+W}}

for H ( n; R, W ), is nothing but the conditional expectation E (Y | X = x) = 0.3 x.

Treating H ( X; 3, 7 ) as a random distribution (a random vector in the four-dimensional space of all measures on {0,1,2,3}), one may take its expectation, getting the unconditional distribution of Y, which is the binomial distribution Bin ( 3, 0.5 ). This fact amounts to the equality

\sum _{x=0}^{10}\mathbb {P} (Y=y|X=x)\mathbb {P} (X=x)=\mathbb {P} (Y=y)={\frac {1}{2^{3}}}{\binom {3}{y}}

for y = 0,1,2,3, which is an instance of the law of total probability.
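The mixture identity can be verified with exact arithmetic. The sketch below evaluates the hypergeometric conditional probabilities and averages them over the distribution of X; the function names `cond_pmf` and `p_x` and the dictionary `mixed` are illustrative.

```python
from fractions import Fraction
from math import comb


def cond_pmf(y, x):
    """P(Y = y | X = x) = C(3, y) C(7, x - y) / C(10, x)."""
    if y > x:
        return Fraction(0)  # more heads in the first 3 tosses than in all 10
    # math.comb(7, x - y) is 0 when x - y > 7, as required.
    return Fraction(comb(3, y) * comb(7, x - y), comb(10, x))


def p_x(x):
    """P(X = x) for 10 fair tosses."""
    return Fraction(comb(10, x), 2 ** 10)


# Averaging the conditional distribution over X recovers Bin(3, 0.5).
mixed = {y: sum(cond_pmf(y, x) * p_x(x) for x in range(11)) for y in range(4)}
for y in range(4):
    print(y, mixed[y])
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```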

Conditioning on the level of densities


Example. A point of the sphere x² + y² + z² = 1 is chosen at random according to the uniform distribution on the sphere.^[1] The random variables X, Y, Z are the coordinates of the random point. The joint density of X, Y, Z does not exist (since the sphere is of zero volume), but the joint density $f_{X,Y}$ of X, Y exists,

f_{X,Y}(x,y)={\begin{cases}{\frac {1}{2\pi {\sqrt {1-x^{2}-y^{2}}}}}&{\text{if }}x^{2}+y^{2}<1,\\0&{\text{otherwise}}.\end{cases}}

(The density is non-constant because of a non-constant angle between the sphere and the plane.) The density of X may be calculated by integration,

f_{X}(x)=\int _{-\infty }^{+\infty }f_{X,Y}(x,y)\,\mathrm {d} y=\int _{-{\sqrt {1-x^{2}}}}^{+{\sqrt {1-x^{2}}}}{\frac {\mathrm {d} y}{2\pi {\sqrt {1-x^{2}-y^{2}}}}}\,;

surprisingly, the result does not depend on x in (−1,1),

f_{X}(x)={\begin{cases}0.5&{\text{for }}-1<x<1,\\0&{\text{otherwise}},\end{cases}}

which means that X is distributed uniformly on (−1,1). The same holds for Y and Z (and in fact, for aX + bY + cZ whenever a² + b² + c² = 1).
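The uniformity of X can also be observed by simulation. The following Monte Carlo sketch draws uniform points on the sphere by normalizing a standard Gaussian vector (a standard sampling technique, not one described in the text) and tabulates the first coordinate into equal-width bins on (−1,1). The function names, sample size, bin count and seed are arbitrary choices for illustration.

```python
import math
import random


def sample_sphere(rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        r = math.sqrt(sum(c * c for c in v))
        if r > 0.0:  # guard against the (practically impossible) zero vector
            return [c / r for c in v]


def histogram_of_x(n_samples=100_000, n_bins=4, seed=0):
    """Relative frequencies of the X coordinate in n_bins equal bins on [-1, 1]."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_samples):
        x = sample_sphere(rng)[0]
        i = min(int((x + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[i] += 1
    return [c / n_samples for c in counts]


freqs = histogram_of_x()
print(freqs)  # each entry should be close to 1 / n_bins if X is uniform
```

With four bins, every relative frequency should be near 0.25 up to sampling noise, consistent with the constant density 0.5 on (−1,1).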

Example. A different way of calculating a marginal density is shown below,^[2]^[3] for a joint density that is constant on the unit ball x² + y² + z² ≤ 1 (the constant is the reciprocal of the ball's volume, 4π/3):

$f_{X,Y,Z}(x,y,z)={\frac {3}{4\pi }}\quad {\text{for }}x^{2}+y^{2}+z^{2}\leq 1,$

$f_{X}(x)=\int _{-{\sqrt {1-x^{2}}}}^{+{\sqrt {1-x^{2}}}}\int _{-{\sqrt {1-x^{2}-y^{2}}}}^{+{\sqrt {1-x^{2}-y^{2}}}}{\frac {3\,\mathrm {d} z\,\mathrm {d} y}{4\pi }}={\frac {3(1-x^{2})}{4}}\,;$