% User:Jmath666/Conditional expectation/conditional probability.tex

\documentclass{article}%
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}%
\setcounter{MaxMatrixCols}{30}
%TCIDATA{OutputFilter=latex2.dll}
%TCIDATA{Version=4.10.0.2363}
%TCIDATA{CSTFile=40 LaTeX article.cst}
%TCIDATA{Created=Tuesday, March 20, 2007 18:07:29}
%TCIDATA{LastRevised=Thursday, March 29, 2007 10:59:22}
%TCIDATA{<META NAME="GraphicsSave" CONTENT="32">}
%TCIDATA{<META NAME="DocumentShell" CONTENT="Standard LaTeX\Standard LaTeX Article">}
%TCIDATA{Language=American English}
\newtheorem{theorem}{Theorem}
\newtheorem{acknowledgement}[theorem]{Acknowledgement}
\newtheorem{algorithm}[theorem]{Algorithm}
\newtheorem{axiom}[theorem]{Axiom}
\newtheorem{case}[theorem]{Case}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{conclusion}[theorem]{Conclusion}
\newtheorem{condition}[theorem]{Condition}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{criterion}[theorem]{Criterion}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{example}[theorem]{Example}
\newtheorem{exercise}[theorem]{Exercise}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{notation}[theorem]{Notation}
\newtheorem{problem}[theorem]{Problem}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{solution}[theorem]{Solution}
\newtheorem{summary}[theorem]{Summary}
\newenvironment{proof}[1][Proof]{\noindent\textbf{#1.} }{\ \rule{0.5em}{0.5em}}
\begin{document}

\title{Notes on\\Conditional Probability and Filtering\thanks{This document is not copyrighted
and its use is governed by the GFDL.}}
\author{Jan Mandel\\{University of Colorado}}
\maketitle

\section{Introduction}

This document summarizes some facts from probability theory and applications.
It attempts to convert the material from vague and ambiguous presentations
often found in the literature into a form that makes sense to the author.
Hopefully it will be useful to others as well.

The selection of the material is motivated by an effort to understand the theory
of particle filters in \cite{Crisan-2001-PFT,Crisan-2000-CSM} and other
related work, especially \cite{DelMoral-1998-MVP}. The concept of conditional
expectation is central to important topics of probability theory, in
particular Markov chains and martingales.

\section{Elementary description}

If $A,$ $B$ are events such that $P\left(  B\right)  >0$, \emph{the
conditional probability of the event }$A$\emph{ given }$B$ is defined by%
\[
P\left(  A|B\right)  =\frac{P\left(  A\cap B\right)  }{P\left(  B\right)  }.
\]
If $B$ is fixed, the mapping $A\mapsto P\left(  A|B\right)  $ is a
\emph{conditional probability distribution given the event }$B$\emph{.}

If also $P\left(  A\right)  >0$, then%
\[
P\left(  B|A\right)  =\frac{P\left(  A\cap B\right)  }{P\left(  A\right)  }%
\]
and so%
\begin{align*}
P\left(  A|B\right)   &  =\frac{P\left(  A\cap B\right)  }{P\left(  B\right)
}=\frac{P\left(  A\cap B\right)  }{P\left(  A\right)  }\frac{P\left(
A\right)  }{P\left(  B\right)  }\\
&  =\frac{P\left(  B|A\right)  P\left(  A\right)  }{P\left(  B\right)  },
\end{align*}
which is known as the Bayes theorem.
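
For illustration, with numbers that are assumptions made up here rather than
taken from any source, suppose $P\left(  A\right)  =0.01$, $P\left(
B|A\right)  =0.9$, and $P\left(  B|A^{c}\right)  =0.05$, where $A^{c}$ is the
complement of $A$. The law of total probability gives%
\[
P\left(  B\right)  =P\left(  B|A\right)  P\left(  A\right)  +P\left(
B|A^{c}\right)  P\left(  A^{c}\right)  =0.009+0.0495=0.0585,
\]
and the Bayes theorem then yields%
\[
P\left(  A|B\right)  =\frac{P\left(  B|A\right)  P\left(  A\right)  }{P\left(
B\right)  }=\frac{0.009}{0.0585}=\frac{2}{13}\approx0.154.
\]
In this sketch, $B$ is much more likely when $A$ occurs, yet $P\left(
A|B\right)  $ stays small because $P\left(  A\right)  $ is small.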

\subsection{Conditioning of discrete random variables}

If $Y$ is a discrete real random variable (that is, attaining only values
$y_{j}$, $j=1,2,\ldots$, each with $P\left(  Y=y_{j}\right)  >0$), then the \emph{conditional probability of an event
}$A$\emph{ given that }$Y=y_{j}$ is%
\[
P\left(  A|Y=y_{j}\right)  =\frac{P\left(  A\wedge Y=y_{j}\right)  }{P\left(
Y=y_{j}\right)  }.
\]
The mapping $A\mapsto P\left(  A|Y=y_{j}\right)  $ defines a \emph{conditional
probability distribution given that }$Y=y_{j}$.

Note that $P\left(  A|Y=y_{j}\right)  $ is a number, that is, a
\emph{deterministic} quantity. If we allow $y_{j}$ to be a realization of the
random variable $Y$, we obtain the \emph{conditional probability of the event }%
$A$\emph{ given random variable }$Y$, denoted by $P\left(  A|Y\right)  $,
which is a random variable itself. The conditional probability $P\left(
A|Y\right)  $ attains the value of $P\left(  A|Y=y_{j}\right)  $ with
probability $P\left(  Y=y_{j}\right)  $.

Now suppose $X$ and $Y$ are two discrete real random variables with a joint
distribution. Then the \emph{conditional probability distribution of }%
$X$\emph{ given }$Y=y_{j}$\emph{ is}%
\[
P\left(  X=x_{i}|Y=y_{j}\right)  =\frac{P\left(  X=x_{i}\wedge Y=y_{j}\right)
}{P\left(  Y=y_{j}\right)  }.
\]
If we allow $y_{j}$ to be a realization of the random variable $Y$, we obtain
the \emph{conditional distribution }$P\left(  X|Y\right)  $ \emph{of random
variable }$X$\emph{ given random variable }$Y$. Given $x_{i}$, $P\left(
X=x_{i}|Y\right)  $ is the random variable that attains the value $P\left(
X=x_{i}|Y=y_{j}\right)  $ with probability $P\left(  Y=y_{j}\right)  $.

The random variables $X$ and $Y$ are \emph{independent} when the events
$X=x_{i}$ and $Y=y_{j}$ are independent for all $x_{i}$ and $y_{j}$, that is,%
\[
P\left(  X=x_{i}\wedge Y=y_{j}\right)  =P\left(  X=x_{i}\right)  P\left(
Y=y_{j}\right)  .
\]
Clearly, this is equivalent to
\[
P\left(  X=x_{i}|Y=y_{j}\right)  =P\left(  X=x_{i}\right)  .
\]


The \emph{conditional expectation of }$X$\emph{ given the value }$Y=y_{j}$ is%
\begin{align*}
E\left(  X|Y=y_{j}\right)   &  =\sum_{i}x_{i}P\left(  X=x_{i}|Y=y_{j}\right)
\\
&  =\sum_{i}x_{i}\frac{P\left(  X=x_{i}\wedge Y=y_{j}\right)  }{P\left(
Y=y_{j}\right)  }\text{, }%
\end{align*}
which is defined whenever the marginal probability%
\[
P\left(  Y=y_{j}\right)  =\sum_{i}P\left(  X=x_{i}\wedge Y=y_{j}\right)  >0.
\]


This is a description common in statistics \cite[page 209]{Feller-1968-IPT}.
Note that $E\left(  X|Y=y_{j}\right)  $ is a number, that is, a
\emph{deterministic} quantity, and the particular value of $y_{j}$ does not
matter; only the probabilities $P\left(  X=x_{i}\wedge Y=y_{j}\right)  $ do.

If we allow $y_{j}$ to be a realization of the random variable $Y$, we obtain the
\emph{conditional expectation of random variable }$X$\emph{ given random
variable }$Y$, denoted by $E\left(  X|Y\right)  $. This form is closer to the
mathematical form favored by probabilists (described in more detail below),
and it is a random variable itself. The conditional expectation $E\left(
X|Y\right)  $ attains the value $E\left(  X|Y=y_{j}\right)  $ with probability
$P\left(  Y=y_{j}\right)  $.
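
As a small worked example (the joint probabilities below are assumptions
chosen only for illustration), let $X$ and $Y$ attain the values $0$ and $1$
with%
\[
P\left(  X=0\wedge Y=0\right)  =0.1,\quad P\left(  X=1\wedge Y=0\right)
=0.3,\quad P\left(  X=0\wedge Y=1\right)  =0.4,\quad P\left(  X=1\wedge
Y=1\right)  =0.2.
\]
Then $P\left(  Y=0\right)  =0.4$ and $P\left(  Y=1\right)  =0.6$, so%
\[
E\left(  X|Y=0\right)  =\frac{0\cdot0.1+1\cdot0.3}{0.4}=\frac{3}{4},\quad
E\left(  X|Y=1\right)  =\frac{0\cdot0.4+1\cdot0.2}{0.6}=\frac{1}{3}.
\]
Both are numbers. The random variable $E\left(  X|Y\right)  $ attains the
value $3/4$ with probability $0.4$ and the value $1/3$ with probability
$0.6$. Its mean is $0.4\cdot3/4+0.6\cdot1/3=0.5$, which equals $E\left(
X\right)  =P\left(  X=1\right)  =0.5$.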

\subsection{Conditioning of continuous random variables}

For continuous random variables $X$, $Y$ with joint density $p_{X,Y}\left(
x,y\right)  $, the \emph{conditional probability density of }$X$\emph{ given
that }$Y=y$ is%
\[
p_{X|Y}\left(  x,y\right)  =\frac{p_{X,Y}\left(  x,y\right)  }{p_{Y}\left(
y\right)  },
\]
where%
\[
p_{Y}\left(  y\right)  =\int p_{X,Y}\left(  x,y\right)  dx
\]
is the marginal density of $Y$. The conventional notation $p_{X|Y}\left(
x|y\right)  $ is often used to mean the same as $p_{X|Y}\left(  x,y\right)  $,
that is, the function $p_{X|Y}$ of two variables $x$ and $y$. The notation
$p\left(  x|y\right)  $, often used in practice, is ambiguous, because if $x$
and $y$ are replaced by something else (such as specific numbers), the
information about what $p$ means is lost.

When the value of $y$ is constant, the function $x\longmapsto p_{X|Y}\left(
x,y\right)  $ is the probability density function of $X$ for that value of
$y$. When the value of $x$ is constant, the function $y\longmapsto
p_{X|Y}\left(  x,y\right)  $ is called the \emph{likelihood} function.

The continuous random variables $X$ and $Y$ are \emph{independent} if, for all $x$ and
$y$, the events $\left\{  X\leq x\right\}  $ and $\left\{  Y\leq y\right\}  $
are independent, which can be proved to be equivalent to%
\[
p_{X,Y}\left(  x,y\right)  =p_{X}\left(  x\right)  p_{Y}\left(  y\right)  .
\]
Wherever $p_{Y}\left(  y\right)  >0$, this is clearly equivalent to%
\[
p_{X|Y}\left(  x,y\right)  =p_{X}\left(  x\right)
.
\]


The \emph{conditional probability density of }$X$\emph{ given }$Y$ is the
random function $p_{X|Y}\left(  x,Y\right)  $. The \emph{conditional
expectation of }$X$\emph{ given the value }$Y=y$ is%
\[
E\left(  X|Y=y\right)  =\int xp_{X|Y}\left(  x|y\right)  dx
\]
and the \emph{conditional expectation of }$X$\emph{ given }$Y$ is the random
variable%
\[
E\left(  X|Y\right)  =\int xp_{X|Y}\left(  x|Y\right)  dx,
\]
dependent on the values of $Y$.
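
To make the distinction concrete, here is a sketch with a joint density
chosen purely as an assumption for illustration: let $p_{X,Y}\left(
x,y\right)  =x+y$ for $0\leq x,y\leq1$ and zero elsewhere. Then%
\[
p_{Y}\left(  y\right)  =\int_{0}^{1}\left(  x+y\right)  dx=\frac{1}{2}%
+y,\quad p_{X|Y}\left(  x,y\right)  =\frac{x+y}{\frac{1}{2}+y},
\]
and%
\[
E\left(  X|Y=y\right)  =\int_{0}^{1}x\frac{x+y}{\frac{1}{2}+y}dx=\frac
{\frac{1}{3}+\frac{y}{2}}{\frac{1}{2}+y}=\frac{2+3y}{3+6y},\quad0\leq y\leq1.
\]
For each fixed $y$ this is a number, while $E\left(  X|Y\right)  =\left(
2+3Y\right)  /\left(  3+6Y\right)  $ is a random variable.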

\subsection{Warning}

Unfortunately, in the literature, especially in more elementary statistics
texts, the authors do not always distinguish properly between conditioning
\emph{given the value of} a random variable (the result is a number) and
conditioning \emph{given the random variable} (the result is a random
variable), so, confusingly enough, the words \textquotedblleft given the
random variable\textquotedblright\ can mean either.

\section{Mathematical synopsis}

This section follows \cite{Wikipedia-2007-CE}. In probability theory, a
\emph{conditional expectation} (also known as conditional expected value or
conditional mean) is the expected value of a random variable with respect to a
conditional probability distribution, defined as follows.

If $X$ is a real random variable, and $A$ is an event with positive
probability, then the \emph{conditional probability distribution of }$X$\emph{
given }$A$ assigns a probability $P(X\in B|A)$ to the Borel set $B$. The mean
(if it exists) of this conditional probability distribution of $X$ is denoted
by $E(X|A)$ and called \emph{the conditional expectation of }$X$\emph{ given
the event }$A$.

If $Y$ is another random variable, then the \emph{conditional expectation
}$E(X|Y=y)$\emph{ of }$X$\emph{ given that the value }$Y=y$ is a function of
$y$, let us say $g(y)$. An argument using the Radon-Nikodym theorem is needed
to define $g$ properly because the event that $Y=y$ may have probability zero.
Also, $g$ is defined only for almost all $y$, with respect to the distribution
of $Y$. The \emph{conditional expectation of }$X$\emph{ given random variable
}$Y$, denoted by $E(X|Y)$, is the random variable $g(Y)$.

It turns out that the conditional expectation $E(X|Y)$ is a function only of
the $\sigma$-algebra, say $\mathcal{A}$, generated by the events $Y\in B$ for
Borel sets $B$, rather than the particular values of $Y$. For a $\sigma
$-algebra $\mathcal{A}$, the \emph{conditional expectation }$E(X|\mathcal{A})$\emph{ of
}$X$\emph{ given the }$\sigma$\emph{-algebra }$\mathcal{A}$ is a random variable that is
$\mathcal{A}$-measurable and whose integral over any $\mathcal{A}$-measurable
set is the same as the integral of $X$ over the same set. The existence of
this conditional expectation is proved from the Radon-Nikodym theorem. If $X$
happens to be $\mathcal{A}$-measurable, then $E(X|\mathcal{A})=X$.

If $X$ has an expected value, then the conditional expectation $E(X|Y)$ also
has an expected value, which is the same as that of $X$. This is the law of
total expectation.

For simplicity, the presentation here is done for real-valued random
variables, but generalization to probability on more general spaces, such as
$\mathbb{R}^{n}$ or normed metric spaces equipped with a probability measure,
is immediate.

\section{Mathematical prerequisites}

Recall that a probability space is $\left(  \Omega,\Sigma,P\right)  $, where
$\Sigma$ is a $\sigma$-algebra of subsets of $\Omega$, and $P$ is a probability
measure defined on $\Sigma$. A real random variable on the space
$\left(  \Omega,\Sigma,P\right)  $ is a $\Sigma$-measurable function from
$\Omega$ to $\mathbb{R}$. We denote by
$\mathcal{B}\left(  \mathbb{R}\right)  $ the $\sigma$-algebra of all Borel
sets in $\mathbb{R}$. If $A$ is a Borel set and $X$ a random variable, $X\in A$ or
$\left\{  X\in A\right\}  $ are common shorthands for the event $\left\{
\omega:X\left(  \omega\right)  \in A\right\}  =X^{-1}\left(  A\right)
\in\Sigma.$

\section{Probability conditional on the value of a random variable}

Let $\left(  \Omega,\Sigma,P\right)  $ be a probability space, $Y$ a $\Sigma
$-measurable random variable with values in $\mathbb{R}$, $A\in\Sigma$ (i.e.,
an event not necessarily independent of $Y$), and $B\in\mathcal{B}\left(
\mathbb{R}\right)  $. If $P\left(  Y\in B\right)  >0$, the
conditional probability of $A$ given $Y\in B$ is by definition%
\[
P\left(  A|Y\in B\right)  =\frac{P\left(  A\cap\left\{  Y\in B\right\}
\right)  }{P\left(  Y\in B\right)  }.
\]
We wish to attach a meaning to the conditional probability of $A$ given $Y=y$
even when $P\left(  Y=y\right)  =0$. The following argument follows Wilks
\cite[p. 26]{Wilks-1962-MS}, who attributes it to Kolmogorov
\cite{Kolmogorov-1956-FTP}. Fix $A$ and define%
\[
Q\left(  B\right)  =P\left(  A\cap\left\{  Y\in B\right\}  \right)  =P\left(
A\cap Y^{-1}\left(  B\right)  \right)  .
\]
Since $Y$ is $\Sigma$-measurable, the set function $Q$ is a measure on the Borel
sets $\mathcal{B}\left(  \mathbb{R}\right)  $. Define another measure $R$ on
$\mathcal{B}\left(  \mathbb{R}\right)  $ by%
\[
R\left(  B\right)  =P\left(  \left\{  Y\in B\right\}  \right)  \quad\forall
B\in\mathcal{B}\left(  \mathbb{R}\right)  .
\]
Clearly,
\[
0\leq Q\left(  B\right)  \leq R\left(  B\right)  \quad\forall B\in
\mathcal{B}\left(  \mathbb{R}\right)
\]
and hence $R\left(  B\right)  =0$ implies $Q\left(  B\right)  =0$. Thus the
measure $Q$ is absolutely continuous with respect to the measure $R$ and by
the Radon-Nikodym theorem, there exists a real-valued $\mathcal{B}\left(
\mathbb{R}\right)  $-measurable function $f$ such that%
\[
Q\left(  B\right)  =\int_{B}f\left(  y\right)  dR\left(  y\right)
\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  .
\]
We interpret the function $f$ as the conditional probability of $A$ given
$Y=y$,%
\[
f\left(  y\right)  =P\left(  A|Y=y\right)  .
\]
Once the conditional probability is defined, other concepts of probability
follow, such as expectation and density.

One way to justify the interpretation of $f$ as the conditional probability
of $A$ given $Y=y$ is to view it as the limit of probabilities conditioned on
the value of $Y$ being in a small neighborhood of $y$. Set
$B=N_{\varepsilon}\left(  y\right)  $ (a neighborhood of $y$ with radius
$\varepsilon$) to get%
\[
Q\left(  N_{\varepsilon}\left(  y\right)  \right)  =P\left(  A\cap
Y^{-1}\left(  N_{\varepsilon}\left(  y\right)  \right)  \right)
\]
and using the fact that $P\left(  Y\in N_{\varepsilon}\left(  y\right)
\right)  =\int_{N_{\varepsilon}\left(  y\right)  }dR$, we have%
\[
Q\left(  N_{\varepsilon}\left(  y\right)  \right)  =\int_{N_{\varepsilon
}\left(  y\right)  }fdR=\frac{\int_{N_{\varepsilon}\left(  y\right)  }%
fdR}{\int_{N_{\varepsilon}\left(  y\right)  }dR}P\left(  Y\in N_{\varepsilon
}\left(  y\right)  \right)  ,
\]
so%
\[
P\left(  A|Y\in N_{\varepsilon}\left(  y\right)  \right)  =\frac{P\left(
A\cap\left\{  Y\in N_{\varepsilon}\left(  y\right)  \right\}  \right)
}{P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }=\frac
{\int_{N_{\varepsilon}\left(  y\right)  }fdR}{\int_{N_{\varepsilon}\left(
y\right)  }dR}\rightarrow f\left(  y\right)  ,\quad\varepsilon\rightarrow0,
\]
for almost all $y$ in the measure $R$.\footnote{I do not know how to prove
this without additional assumptions on $f$, such as continuity. Wilks \cite[p.
26]{Wilks-1962-MS} claims that the limit a.e. \textquotedblleft
can\textquotedblright\ be proved, though he does not proceed this way and
neglects to mention that a.e. is meant in the measure $R$.}

As another illustration and justification for understanding $f$ as the
conditional probability of $A$ given $Y=y$, we now show what happens when the
random variable $Y$ is discrete. Suppose $Y$ attains only values $y_{j}$,
$j=1,2,\ldots$, with $P\left(  Y=y_{j}\right)  >0$. Then%
\[
R\left(  B\right)  =P\left(  Y\in B\right)  =\sum_{y_{j}\in B}P\left(
Y=y_{j}\right)  ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  .
\]
Choose $y_{j}$ and $B$ as a neighborhood $N_{\varepsilon}\left(  y_{j}\right)
$ of $y_{j}$ with radius $\varepsilon>0$ so small that $N_{\varepsilon}\left(
y_{j}\right)  $ does not contain any other $y_{k}$, $k\neq j$. Then for any
$A\in\Sigma$,%
\[
Q\left(  N_{\varepsilon}\left(  y_{j}\right)  \right)  =P\left(  A\cap\left\{
Y\in N_{\varepsilon}\left(  y_{j}\right)  \right\}  \right)  =P\left(
A\cap\left\{  Y=y_{j}\right\}  \right)
\]
by the definition of $Q$, and from the definition of $f$ as the Radon-Nikodym
derivative,%
\[
Q\left(  N_{\varepsilon}\left(  y_{j}\right)  \right)  =\int_{N_{\varepsilon
}\left(  y_{j}\right)  }f\left(  y\right)  dR\left(  y\right)  =f\left(
y_{j}\right)  P\left(  Y=y_{j}\right)  .
\]
This gives, for $y=y_{j}$,%
\begin{align*}
f\left(  y\right)   &  =\lim_{\varepsilon\rightarrow0}\frac{P\left(
A\cap\left\{  Y\in N_{\varepsilon}\left(  y\right)  \right\}  \right)
}{P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }=\lim
_{\varepsilon\rightarrow0}P\left(  A|Y\in N_{\varepsilon}\left(  y\right)
\right) \\
&  =\frac{P\left(  A\cap\left\{  Y=y\right\}  \right)  }{P\left(  Y=y\right)
}=P\left(  A|Y=y\right)  ,
\end{align*}
by the definition of conditional probability. The function $f$
is determined only on the set $\left\{  y_{1},y_{2},\ldots\right\}  $. Since
the distribution of $Y$ is concentrated on this set, $f$ is defined almost
everywhere in the measure $R$.

\section{Expectation conditional on the value of a random variable}

Suppose that $X$ and $Y$ are random variables, $X$ integrable. Define again
the measure on $\mathcal{B}\left(  \mathbb{R}\right)  $ generated by the
random variable $Y$,%

\[
R\left(  B\right)  =P\left(  Y\in B\right)  =P\left(  Y^{-1}\left(  B\right)
\right)  ,
\]
and a signed finite measure on $\mathcal{B}\left(  \mathbb{R}\right)  $,%
\[
Q\left(  B\right)  =E\left(  X\mathbf{1}_{Y\in B}\right)  =\int_{\omega
:Y\left(  \omega\right)  \in B}X\left(  \omega\right)  P\left(  d\omega
\right)  =\int_{Y^{-1}\left(  B\right)  }X\left(  \omega\right)  P\left(
d\omega\right)  .
\]
Here, $\mathbf{1}_{Y\in B}$ is the indicator function of the event $Y\in B$,
so $\left(  X\mathbf{1}_{Y\in B}\right)  \left(  \omega\right)  =X\left(
\omega\right)  $ if $Y\left(  \omega\right)  \in B$ and zero otherwise. Since
$X$ is integrable,%
\[
\left\vert Q\left(  B\right)  \right\vert \leq\int_{Y^{-1}\left(  B\right)
}\left\vert X\left(  \omega\right)  \right\vert P\left(  d\omega\right)  \leq
E\left(  \left\vert X\right\vert \right)  <+\infty,
\]
so $Q$ is finite. Moreover, if $R\left(  B\right)  =P\left(  Y^{-1}\left(
B\right)  \right)  =0$, then the integral of $X$ over the null set
$Y^{-1}\left(  B\right)  $ vanishes. Hence $R\left(  B\right)
=0\Longrightarrow Q\left(  B\right)  =0$, that is, $Q$ is absolutely continuous with
respect to $R$. Consequently, there exists a Radon-Nikodym derivative $f$ such
that%
\[
Q\left(  B\right)  =\int_{B}f\left(  y\right)  R\left(  dy\right)
,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  .
\]
The value $f\left(  y\right)  $ is the \emph{conditional expectation of }$X$\emph{
given }$Y=y$ and is denoted by $E\left(  X|Y=y\right)  $. Then the result can be
written as%
\[
E\left(  X\mathbf{1}_{Y\in B}\right)  =\int_{B}E\left(  X|Y=y\right)  P\left(
Y\in dy\right)  ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}\right)  ,
\]
where $E\left(  X|Y=y\right)  $ is defined for almost all $y$ in the measure
$P\left(  Y\in dy\right)  $ generated by the random variable $Y$.

This definition is consistent with that of conditional probability: the
conditional probability of $A$ given $Y=y$ is the same as the conditional mean
of the indicator function of $A$ given $Y=y$. The proof is also completely the
same. In fact, we did not have to treat conditional probability separately at
all; we could have simply called it a special case of conditional expectation.

\section{Expectation conditional on a random variable and on a $\sigma
$-algebra}

Let $g\left(  y\right)  =E\left(  X|Y=y\right)  $ be the conditional expectation
of the random variable $X$ given that the random variable $Y=y$. Here $y$ is a
fixed, deterministic value. Now take $y$ random, namely the value of the
random variable $Y$, $y=Y\left(  \omega\right)  $. The result is called the
\emph{conditional expectation of }$X$\emph{ given }$Y$, which is the random
variable%
\[
E\left(  X|Y\right)  \left(  \omega\right)  =E\left(  X|Y=Y\left(
\omega\right)  \right)  =g\left(  Y\left(  \omega\right)  \right)  .
\]


So now we have the conditional expectation given in terms of the sample space
$\Omega$ rather than in terms of $\mathbb{R}$, the range space of the random
variable $Y$. It will turn out that after the change of the independent
variable, the particular values attained by the random variable $Y$ do not
matter that much; rather, it is the granularity of $Y$ that is important. The
granularity of $Y$ can be expressed in terms of the $\sigma$-algebra generated
by the random variable $Y$, which is%
\[
\mathcal{A}=\left\{  Y^{-1}\left(  B\right)  :B\in\mathcal{B}\left(
\mathbb{R}\right)  \right\}  .
\]


By substitution, the conditional expectation $g$ satisfies%
\[
E\left(  X\mathbf{1}_{\omega\in Y^{-1}\left(  B\right)  }\right)
=\int_{Y^{-1}\left(  B\right)  }g\left(  Y\left(  \omega\right)  \right)
P\left(  d\omega\right)  ,\quad\forall B\in\mathcal{B}\left(  \mathbb{R}%
\right)  ,
\]
which, by writing%
\[
C=Y^{-1}\left(  B\right)  ,\quad h\left(  \omega\right)  =g\left(  Y\left(
\omega\right)  \right)  ,
\]
is seen to be the same as%
\[
\int_{C}X\left(  \omega\right)  P\left(  d\omega\right)  =\int_{C}h\left(
\omega\right)  P\left(  d\omega\right)  ,\quad\forall C\in\mathcal{A}.
\]


It can be proved that for any $\sigma$-algebra $\mathcal{A}\subset\Sigma$, the
random variable $h$ exists and is defined by this equation uniquely, up to
equality a.e. in $P$ \cite[page 32-II]{Dellacherie-1978-PP}. The random
variable $h$ is called the \emph{conditional expectation of }$X$\emph{ given
the }$\sigma$\emph{-algebra }$\mathcal{A}$\emph{. }It can be interpreted as a
sort of averaging of the random variable $X$ to the granularity given by the
$\sigma$-algebra $\mathcal{A}$ \cite{Varadhan-2001-PT}.
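
The averaging interpretation can be made explicit in a simple special case,
sketched here under the added assumption that $\mathcal{A}$ is generated by a
countable partition $C_{1},C_{2},\ldots$ of $\Omega$ with $C_{k}\in\Sigma$ and
$P\left(  C_{k}\right)  >0$. Every $C\in\mathcal{A}$ is then a union of some
of the sets $C_{k}$, an $\mathcal{A}$-measurable function is constant on each
$C_{k}$, and the defining equation with $C=C_{k}$ gives%
\[
h\left(  \omega\right)  =\frac{1}{P\left(  C_{k}\right)  }\int_{C_{k}}X\left(
\omega^{\prime}\right)  P\left(  d\omega^{\prime}\right)  \quad\text{for
}\omega\in C_{k}.
\]
That is, $h$ replaces $X$ on each $C_{k}$ by its average over $C_{k}$. In the
two extreme cases, $\mathcal{A}=\left\{  \emptyset,\Omega\right\}  $ gives the
constant $h=E\left(  X\right)  $, and $\mathcal{A}=\Sigma$ gives $h=X$ a.s.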

The \emph{conditional probability }$h=P\left(  A|\mathcal{A}\right)  $\emph{
of an event (that is, a set) }$A\in\Sigma$\emph{ given the }$\sigma
$\emph{-algebra }$\mathcal{A}$ is obtained by substituting $X=\mathbf{1}%
_{\omega\in A}$, which gives%
\[
P\left(  A\cap C\right)  =\int_{C}h\left(  \omega\right)  P\left(
d\omega\right)  ,\quad\forall C\in\mathcal{A}.
\]


An event $A\in\Sigma$ is defined to be \emph{independent of a }$\sigma
$\emph{-algebra} $\mathcal{A}\subset\Sigma$ if $A$ and every $C\in\mathcal{A}$
are independent. It is easy to see that $A\in\Sigma$ is independent of the
$\sigma$-algebra $\mathcal{A}$ if and only if%
\[
P\left(  A\cap C\right)  =P\left(  A\right)  P\left(  C\right)  =\int
_{C}P\left(  A\right)  P\left(  d\omega\right)  ,\quad\forall C\in
\mathcal{A},
\]
that is, if and only if $P\left(  A|\mathcal{A}\right)  =P\left(  A\right)  $
a.s. (which is a particularly obscure way to write independence given how
complicated the definitions are).

Two random variables $X$, $Y$ are said to be independent if%
\[
P\left(  X\in A\wedge Y\in B\right)  =P\left(  X\in A\right)  P\left(  Y\in
B\right)  ,\quad\forall A,B\in\mathcal{B}\left(  \mathbb{R}\right)  ,
\]
which is now seen to be the same as%
\[
P\left(  X\in A|Y\right)  =P\left(  X\in A\right)  ,\quad\forall
A\in\mathcal{B}\left(  \mathbb{R}\right)  .
\]


\section{Properties of conditional expectation}

To be done.

\section{Conditional density}

Now that we have $P\left(  A|Y=y\right)  $ for an arbitrary event $A$, we can
define the conditional probability $P\left(  X\in F|Y=y\right)  $ for a random
variable $X$ and Borel set $F$. Thus we can define the conditional density
$p_{X|Y}\left(  x,y\right)  $ as the Radon-Nikodym derivative,%
\[
P\left(  X\in F|Y=y\right)  =\int_{F}p_{X|Y}\left(  x,y\right)  d\mu\left(
x\right)  ,
\]
where $\mu$ is the Lebesgue measure. In the conditional density $p_{X|Y}%
\left(  x,y\right)  $, $X$ and $Y$ are random variables that identify the
density function, and $x$ and $y$ are the arguments of the density function.

Note that in general $p_{X|Y}\left(  x,y\right)  $ is defined only for almost
all $x$ (in Lebesgue measure) and almost all $y$ (in the measure $R$ generated
by the random variable $Y$). Under reasonable additional conditions
(for example, it is enough to assume that the joint density $p_{X,Y}$ is
continuous at $\left(  x,y\right)  $, and $p_{Y}\left(  y\right)  >0$), the
density of $X$ conditional on $Y=y$ satisfies%
\begin{align*}
p_{X|Y}\left(  x,y\right)   &  =\lim_{\varepsilon\rightarrow0}\frac{P\left(
X\in N_{\varepsilon}\left(  x\right)  |Y\in N_{\varepsilon}\left(  y\right)
\right)  }{\mu\left(  N_{\varepsilon}\left(  x\right)  \right)  }\\
&  =\lim_{\varepsilon\rightarrow0}\frac{P\left(  \left\{  X\in N_{\varepsilon
}\left(  x\right)  \right\}  \cap\left\{  Y\in N_{\varepsilon}\left(
y\right)  \right\}  \right)  }{\mu\left(  N_{\varepsilon}\left(  x\right)
\right)  P\left(  Y\in N_{\varepsilon}\left(  y\right)  \right)  }\\
&  =\lim_{\varepsilon\rightarrow0}\frac{P\left(  \left\{  X\in N_{\varepsilon
}\left(  x\right)  \right\}  \cap\left\{  Y\in N_{\varepsilon}\left(
y\right)  \right\}  \right)  }{\mu\left(  N_{\varepsilon}\left(  x\right)
\right)  \mu\left(  N_{\varepsilon}\left(  y\right)  \right)  }\frac
{\mu\left(  N_{\varepsilon}\left(  y\right)  \right)  }{P\left(  Y\in
N_{\varepsilon}\left(  y\right)  \right)  }\\
&  =\frac{p_{X,Y}\left(  x,y\right)  }{p_{Y}\left(  y\right)  }.
\end{align*}
Note that this density is a deterministic function.

The density of a random variable $X$ conditional on a random variable $Y$ is%
\[
p_{X|Y}\left(  x,Y\right)  =\frac{p_{X,Y}\left(  x,Y\right)  }{p_{Y}\left(
Y\right)  }.
\]
It is a function-valued random variable obtained from the deterministic
function $p_{X|Y}\left(  x,y\right)  $ by taking $y$ to be the value of the
random variable $Y$.

A common shorthand for the conditional density is%
\[
p_{X|Y}\left(  x,y\right)  =p\left(  x|y\right)  .
\]
This abuse of notation identifies a function by the symbols used for its
arguments, which is incorrect. Imagine that we wish to evaluate
the conditional density of $X$ at $2$ given $Y=1$; then $p\left(  x|y\right)
$ becomes $p\left(  2|1\right)  $, which is nonsense.

\section{Application: Markov chains}

To be done.

\section{Application: Martingales}

To be done.

\section{Sequential Bayesian estimation}

This section follows \cite[Section 1.1]{Doucet-2001-ISM}, with some details
added. Consider an unobserved process with state \textquotedblleft probability
distribution\textquotedblright\ $p\left(  x_{t}\right)  $. This is really not a
distribution of any kind; what is meant by this unfortunately common abuse
of notation is%
\[
p\left(  x_{t}\right)  =p_{X_{t}}(x_{t}),
\]
that is, the probability density of the random variable $X_{t}$, the state
associated with the time $t$, evaluated at the point $x_{t}$. The notation $p\left(
x_{0:t}\right)  $ means the joint density of $X_{0}$, $X_{1}$,\ldots\ ,
$X_{t}$, and so on. The process is assumed to be Markov with some initial
density $p\left(  x_{0}\right)  $ and transition density $p\left(
x_{t+1}|x_{t}\right)  $; that is, the state at time $t$ is assumed to satisfy
the Markov property%
\[
p\left(  x_{0:t+1}\right)  =p\left(  x_{t+1}|x_{t}\right)  p\left(
x_{0:t}\right)  ,
\]
thus, by induction%
\[
p\left(  x_{0:t+1}\right)  =p\left(  x_{t+1}|x_{t}\right)  \cdots p\left(
x_{1}|x_{0}\right)  p\left(  x_{0}\right)  .
\]
The observations enter only through the data likelihoods $p\left(  y_{t}%
|x_{t}\right)  $, and it is assumed that $Y_{t}$ depends on the state $X_{t}$
only. Therefore,%
\[
p\left(  y_{1:t+1}|x_{0:t+1}\right)  =p\left(  y_{t+1}|x_{t+1}\right)
p\left(  y_{1:t}|x_{0:t}\right)  ,
\]
and by induction%
\[
p\left(  y_{1:t+1}|x_{0:t+1}\right)  =p\left(  y_{t+1}|x_{t+1}\right)  \cdots
p\left(  y_{1}|x_{1}\right)  .
\]


We wish to compute the conditional joint probability density of the states $x_{0:t}$
given the measurements $y_{1:t}$. This is given by the Bayes theorem,%
\[
p\left(  x_{0:t}|y_{1:t}\right)  \varpropto p\left(  y_{1:t}|x_{0:t}\right)
p\left(  x_{0:t}\right)  ,
\]
where $\varpropto$ means proportional as a function of $x_{0:t}$. This conditional
probability density can be computed recursively,%
\begin{align*}
p\left(  x_{0:t+1}|y_{1:t+1}\right)   &  \varpropto p\left(  y_{1:t+1}%
|x_{0:t+1}\right)  p\left(  x_{0:t+1}\right) \\
&  =p\left(  y_{t+1}|x_{t+1}\right)  p\left(  y_{1:t}|x_{0:t}\right)  p\left(
x_{t+1}|x_{t}\right)  p\left(  x_{0:t}\right) \\
&  =p\left(  y_{t+1}|x_{t+1}\right)  p\left(  x_{t+1}|x_{t}\right)
\underbrace{p\left(  y_{1:t}|x_{0:t}\right)  p\left(  x_{0:t}\right)
}_{\varpropto p\left(  x_{0:t}|y_{1:t}\right)  }\\
&  \varpropto p\left(  y_{t+1}|x_{t+1}\right)  p\left(  x_{t+1}|x_{t}\right)  p\left(
x_{0:t}|y_{1:t}\right)  .
\end{align*}
By induction, and since there is no $y_{0}$,%
\[
p\left(  x_{0:t+1}|y_{1:t+1}\right)  \varpropto p\left(  y_{t+1}%
|x_{t+1}\right)  p\left(  x_{t+1}|x_{t}\right)  \cdots p\left(  y_{1}%
|x_{1}\right)  p\left(  x_{1}|x_{0}\right)  p\left(  x_{0}\right)  .
\]


To obtain a recursion for the marginal distribution $p\left(  x_{t}%
|y_{1:t}\right)  $ (the probability density of the state at time $t$ given all
observations up to the time $t$, called the filtering distribution), we need
to integrate over all earlier states,%
\begin{align*}
&  p\left(  x_{t+1}|y_{1:t+1}\right) \\
&  \qquad\varpropto\int\cdots\int p\left(  y_{t+1}|x_{t+1}\right)  p\left(
x_{t+1}|x_{t}\right)  \cdots p\left(  y_{1}|x_{1}\right)  p\left(  x_{1}%
|x_{0}\right)  p\left(  x_{0}\right)  dx_{t}\cdots dx_{0}\\
&  \qquad\varpropto p\left(  y_{t+1}|x_{t+1}\right)  \int p\left(  x_{t+1}|x_{t}\right)
p\left(  x_{t}|y_{1:t}\right)  dx_{t}.
\end{align*}
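
As a hedged illustration of this recursion, consider a scalar linear Gaussian
model. The model and the symbols $a$, $q$, $r$, $m_{t}$, $s_{t}$, $K$ are
assumptions introduced here and are not part of the cited presentation. Write
$N\left(  x;m,s\right)  $ for the Gaussian density with mean $m$ and variance
$s$ evaluated at $x$, and suppose%
\[
p\left(  x_{t+1}|x_{t}\right)  =N\left(  x_{t+1};ax_{t},q\right)  ,\quad
p\left(  y_{t}|x_{t}\right)  =N\left(  y_{t};x_{t},r\right)  ,\quad p\left(
x_{t}|y_{1:t}\right)  =N\left(  x_{t};m_{t},s_{t}\right)  .
\]
The integral in the recursion is then again Gaussian,%
\[
\int p\left(  x_{t+1}|x_{t}\right)  p\left(  x_{t}|y_{1:t}\right)
dx_{t}=N\left(  x_{t+1};am_{t},a^{2}s_{t}+q\right)  ,
\]
and multiplication by the data likelihood followed by normalization gives
$p\left(  x_{t+1}|y_{1:t+1}\right)  =N\left(  x_{t+1};m_{t+1},s_{t+1}\right)
$ with%
\[
K=\frac{a^{2}s_{t}+q}{a^{2}s_{t}+q+r},\quad m_{t+1}=am_{t}+K\left(
y_{t+1}-am_{t}\right)  ,\quad s_{t+1}=\left(  1-K\right)  \left(  a^{2}%
s_{t}+q\right)  .
\]
In this special case the recursion reduces to the familiar Kalman filter, and
the filtering density is represented exactly by two numbers.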


In practice, one needs to estimate the filtering distribution when

\begin{itemize}
\item the initial distribution $p\left(  x_{0}\right)  $ and the transition
probability $p\left(  x_{t+1}|x_{t}\right)  $ are known only approximately,

\item the filtering density $p\left(  x_{t+1}|y_{1:t+1}\right)  $ is
represented only approximately (e.g., by an ensemble, or a few moments),

\item the Bayesian update (multiplication by the data likelihood)\ is
implemented only approximately.
\end{itemize}

For this purpose, write $p\left(  u_{t}\right)  $ for an approximation of
$p\left(  x_{t}|y_{1:t}\right)  $, let $p\left(  u_{0}\right)  $ be an
approximation of the initial distribution $p\left(  x_{0}\right)  $, and
$p\left(  u_{t+1}|u_{t}\right)  $ an approximation of the transition
probability $p\left(  x_{t+1}|x_{t}\right)  $, and let
\[
p\left(  u_{t+1}\right)  \varpropto p\left(  y_{t+1}|u_{t+1}\right)  \int
p\left(  u_{t+1}|u_{t}\right)  p\left(  u_{t}\right)  du_{t}.
\]
Note that

\begin{itemize}
\item The probability density $p\left(  u_{t}\right)  $ plays the role of the
state of the model.

\item The recursion for $p\left(  u_{t}\right)  $ means advancing the model in
time by the transition probability $p\left(  u_{t+1}|u_{t}\right)  $ (the
integral) followed by an application of the Bayes theorem.

\item The advancement in time from time $t$ to $t+1$ is by a general Markov
chain. In particular, there is no linearity assumption.

\item The measurements $y_{t}$ are incorporated into the model state
sequentially. At time $t$ there is no need to return to states or measurements
at earlier times.

\item If the initial probability distribution and the transition probability
distribution are exact, $p\left(  u_{t}\right)  $ is the exact filtering
distribution. This is better written as%
\[
p_{U_{t}}\left(  x\right)  =p_{X_{t}|Y_{1:t}}\left(  x,y_{1:t}\right)  ,
\]
which should be understood to mean that the probability density of the random
variable $U_{t}$ and the conditional probability density of the state $X_{t}$
given the measurements $Y_{1:t}=y_{1:t}$ are the same on the state space, for
almost all $y_{1:t}$ relative to the measure defined by $Y_{1:t}$.

\item The goal of convergence analysis of filters should be to estimate the
difference between $p\left(  u_{t}\right)  $ and $p\left(  x_{t}%
|y_{1:t}\right)  $ when the initial and the transition probability
distributions are not exact.
\end{itemize}

\section{Tools for identically distributed but not independent variables}

The law of large numbers cannot be used for particle and ensemble filters
because the ensemble members are not independent. Techniques used in the
literature to deal with the asymptotics of identically distributed but not
independent variables include:

\begin{itemize}
\item measure-valued random variables \cite{Crisan-2001-PFT,DelMoral-1998-MVP},

\item martingales, Doob's martingale (method of bounded differences), and
Azuma's inequality \cite{Azuma-1967-WSC,Godbole-1998-BMB},

\item concentration of measure \cite{Dubhashi-1998-CMA,Talagrand-1996-NLI}.
\end{itemize}

\bibliographystyle{plain}
\bibliography{../../bibliography/dddas-jm}

\end{document}