統計推斷(七) Typical Sequence

1. 一些定理

Markov inequality: $r.v. \ \ \mathsf{x}\ge0$
$\mathbb{P}(x\ge\mu)\le \frac{\mathbb{E}[x]}{\mu}$
Proof: omit…

Weak law of large numbers(WLLN): $\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$
$\lim_{N\to\infty}\mathbb{P}(|L_p(\vec{y})+H(p)|>\varepsilon)=0, \ \ \forall \varepsilon>0$
Proof: omit…

2. Typical set

Definition: $\mathcal{T}_\varepsilon(p;N)=\{\vec{y}\in\mathcal{Y}^N:|L_p(\vec{y})+H(p)|<\varepsilon\}$
Properties
- WLLN $\Longrightarrow P\left(\vec{y}\in\mathcal{T}_\varepsilon(p;N)\right)\simeq1$ , $N$ large
- $L_p(\vec{y})\simeq H(p) \Longrightarrow p_y(\vec{y})\simeq 2^{-NH(p)}$
- $\Longrightarrow |\mathcal{T}_\varepsilon(p;N)|\simeq 2^{NH(p)}$
- 當 p 不是均勻分佈的時候， $\frac{|\mathcal{T}_\varepsilon(p;N)|}{|\mathcal{Y}^N|}\to0$ ，也就是說典型集中元素(序列)個數在所有可能的元素(序列)中所佔比例趨於 0，但是典型集中元素概率的和卻趨近於 1
Theorem
Asymptotic Equipartition Property(AEP)
$\lim_{N\to\infty}P(\mathcal{T}_\varepsilon(p;N))=1 \\$

$2^{-N(H(p)+\epsilon)} \leq p_{\mathrm{y}}(\boldsymbol{y}) \leq 2^{-N(H(p)-\epsilon)}, \forall \boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)$
- for a sufficient large $N$
$(1-\epsilon) 2^{N(H(p)-\epsilon)} \leq\left|\mathcal{T}_{\epsilon}(p ; N)\right| \leq 2^{N(H(p)+\epsilon)}$

Proof:
$\begin{aligned}\left|\mathcal{T}_{\epsilon}(p ; N)\right| &=\sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 1 \\ &=2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 2^{-N(H(p)+\epsilon)} \\ & \leq 2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} p_{\mathbf{y}}(\boldsymbol{y}) \\ &=2^{N(H(p)+\epsilon)} P\left\{\mathcal{T}_{\epsilon}(p ; N)\right\} \\ & \leq 2^{N(H(p)+\epsilon)} \end{aligned}$

3. Divergence $\varepsilon$ -typical set

WLLN: $\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$
$$
L_{p | q}(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{\mathbf{y}}(\boldsymbol{y})}{q_{\mathbf{y}}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p\left(y_{n}\right)}{q\left(y_{n}\right)} \

\lim {N \rightarrow \infty} \mathbb{P}\left(\left|L{p | q}(\boldsymbol{y})-D(p | q)\right|>\epsilon\right)=0
$$
Remarks: 前面只考慮的均值，這裏還考慮了另一個分佈
Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$
$\mathcal{T}_{\epsilon}(p | q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p | q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\}$
Properties
- WLLN $\Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y}) 2^{-N D(p \| q)}$
- $Q\left\{\mathcal{T}_{\epsilon}(p | q ; N)\right\} \approx 2^{-N D(p \| q)} \to0$
- Remarks: p 的典型集可能是 q 的非典型集，在 $N$ 很大的時候，不同分佈的 typical set 是正交的
Theorem
$(1-\epsilon) 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)}$

4. Large deviation of sample averages

Theorem (Cram´er’s Theorem): $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ y_i \sim q \ \ \ i.i.d$ with mean $\mu<\infty$ and $\gamma>\mu$
$\lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma)$
where $E_C(\gamma)$ is referred as Chernoﬀ exponent
$E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q),\ \ \ p(\cdot ; x)=q(y) e^{x y-\alpha(x)}$
and with $x>0$ chosen such that
$\mathbb{E}_{p(\cdot;x)}[y]=\gamma$
Proof:

$\begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma} \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ & \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)} \end{aligned}$

$\varphi(x)=x\gamma-\alpha(x)$ 是凸的，最大值取在 $\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gamma$

可以證明 $x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)$

於是有 $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}$

下界的證明，暫時略…

用到的兩個事實： $p(y;x)=q(y)\exp(xy-\alpha(x))$

$D(p(y;x)||q(y))$ 隨着 x 單調增加

$\mathbb{E}_{p(;x)}[y]$ 隨着 x 單調增加

Remarks:

這個定理也相當於表達了 $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}$

相當於是分佈 q 向由 $\mathbb{E}[y]=\sum_{n=1}^{N} y_{n} \geq \gamma$ 所定義的一個凸集中投影，恰好投影到邊界(線性分佈族) $\mathbb{E}[y]=\gamma$ 上，而 $q$ 向線性分佈族的投影恰好就是 (10) 中的指數族表達式

5. Types and type classes

Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T$ (不關心真實服從的是哪個分佈)
- type(實質上就是一個經驗分佈)定義爲
$\hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N}$
- $\mathcal{P}_{N}^{y}$ 表示長度爲 $N$ 的序列所有可能的 types
- type class: $\mathcal{T}_{N}^{y}(p)=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\},\ \ \ p\in\mathcal{P}_{N}^{y}$
Exponential Rate Notation: $f(N) \doteq 2^{N \alpha}$
$\lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha$
Remarks: $\alpha$ 表示了指數上面關於 $N$ 的階數(log、線性、二次 …)
Properties
- $\left|\mathcal{P}_{N}^{y}\right| \leq(N+1)^{|y|}$
- $q^{N}(\mathbf{y})=2^{-N(D(\hat{p}(\cdot \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y})))}$
  $p^{N}(\mathbf{y})=2^{-N H(p)} \quad \text { for } \mathbf{y} \in \mathcal{T}_{N}^{y}(p)$
- $c N^{-|y|} 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{y}(p)\right| \leq 2^{N H(p)}$
Theorem
$c N^{-|y|} 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \leq 2^{-N D(p \| q)} \\ Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \doteq 2^{-N D(p \| q)}$

6. Large Deviation Analysis via Types

Definition: $\mathcal{R}=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{y}\right\}$

Sanov’s Theorem:
$Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \leq(N+1)^{|y|} 2^{-N D\left(p_{*} \| q\right)} \\ Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \dot\leq 2^{-N D\left(p_{*} \| q\right)} \\ p_{*}=\underset{p \in \mathcal{S}}{\arg \min } D(p \| q)$

7. Asymptotics of hypothesis testing

LRT: $L(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \frac{>}{<} \gamma$
$P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}$
$P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}$
$D\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)} \mathrm{d} t=\int p^{*}(t) t \mathrm{d} t=\mathbb{E}_{p^{*}}[\mathrm{t}]=\gamma$

8.Asymptotics of parameter estimation

Strong Law of Large Numbers(SLLN):
$\mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1$
Central Limit Theorem(CLT):
$\lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\left(\frac{w_{n}-\mu}{\sigma}\right) \leq b\right)=\Phi(b)$
以下三個強度依次遞減

依概率 1 收斂(SLLN)： $\mathsf{x}_{N} \stackrel{w . p .1}{\longrightarrow} a$

概率趨於 0(WLLN):

依分佈收斂: $\mathsf{x}_{N} \stackrel{d}{\longrightarrow} p$

Asymptotics of ML Estimation

Theorem:
$\hat{x}_{N}=\arg \max _{x} L_{N}(x ; \mathbf{y})$
在滿足某些條件下(mild conditions)，有
$\begin{array}{c}{\hat{x}_{N} \stackrel{w \cdot p \cdot 1}{\longrightarrow} x_{0}} \\ {\sqrt{N}\left(\hat{x}_{N}-x_{0}\right) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, J_{y}\left(x_{0}\right)^{-1}\right)}\end{array}$

其他內容請看：
統計推斷(一) Hypothesis Test
統計推斷(二) Estimation Problem
統計推斷(三) Exponential Family
統計推斷(四) Information Geometry
統計推斷(五) EM algorithm
統計推斷(六) Modeling
統計推斷(七) Typical Sequence
統計推斷(八) Model Selection
統計推斷(九) Graphical models
統計推斷(十) Elimination algorithm
統計推斷(十一) Sum-product algorithm

Bonennult

發佈了38 篇原創文章 · 獲贊 33 · 訪問量 3萬+

私信關注

統計推斷(七) Typical Sequence

1. 一些定理

2. Typical set

3. Divergence $\varepsilon$ -typical set

4. Large deviation of sample averages

5. Types and type classes

6. Large Deviation Analysis via Types

7. Asymptotics of hypothesis testing

8.Asymptotics of parameter estimation

凸優化學習筆記 15：梯度方法

最優化方法 23：算子分裂法 & ADMM

最優化方法 22：近似點算法 PPA

最優化方法 18：近似點算子 Proximal Mapping

凸優化學習筆記 2：超平面分離定理

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

統計推斷(七) Typical Sequence

1. 一些定理

2. Typical set

3. Divergence ε\varepsilonε-typical set

4. Large deviation of sample averages

5. Types and type classes

6. Large Deviation Analysis via Types

7. Asymptotics of hypothesis testing

8.Asymptotics of parameter estimation

3. Divergence $\varepsilon$ -typical set