Bandit UCB Derivation

A derivation of Upper-Confidence-Bound Action Selection, from the bandit algorithms in Chapter 2 of Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.

Preliminaries

Markov Inequality

For any random variable (r.v.) X and constant a > 0,

P(|X| \geq a) \leq \frac{E[|X|]}{a}

Prf:

Assume X has density f. Since \frac{|x|}{a} \geq 1 on the region |x| \geq a:

P(|X| \geq a)=\int_{|x| \geq a}f(x)\,dx \leq \int_{|x|\geq a}\frac{|x|}{a}f(x)\,dx \leq \int\frac{|x|}{a}f(x)\,dx = \frac{1}{a}\int|x|f(x)\,dx=\frac{1}{a}E[|X|]
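As a sanity check, here is a small Python sketch that estimates both sides of Markov's Inequality by Monte Carlo. The choice of X ~ Exponential(1), the thresholds a, the sample size, and the helper name `markov_check` are illustrative assumptions, not part of the original derivation.

```python
import random


def markov_check(n_samples: int = 100_000, seed: int = 0):
    """Estimate P(|X| >= a) and the Markov bound E[|X|] / a by Monte Carlo.

    X ~ Exponential(rate=1) is an arbitrary illustrative choice (E[|X|] = 1).
    Returns a list of (a, empirical tail, empirical bound) tuples.
    """
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n_samples)]
    mean_abs = sum(abs(x) for x in xs) / n_samples  # empirical E[|X|]
    rows = []
    for a in (0.5, 1.0, 2.0, 4.0):
        tail = sum(1 for x in xs if abs(x) >= a) / n_samples  # empirical P(|X| >= a)
        rows.append((a, tail, mean_abs / a))
    return rows


if __name__ == "__main__":
    for a, tail, bound in markov_check():
        print(f"a={a:3.1f}  P(|X|>=a) ~ {tail:.4f}  <=  E|X|/a ~ {bound:.4f}")
```

Because the empirical distribution of the samples is itself a probability distribution, the inequality holds exactly for the sampled values, not just approximately. For a \leq E[|X|] the bound is at least 1 and says nothing, so Markov's Inequality is only informative for large a.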

Chebyshev's Inequality

Let X have mean \mu and variance \sigma^2. Then for any a > 0,

P(|X-\mu|\geq a)\leq \frac{\sigma^2}{a^2}

Prf:

Since \frac{(x-\mu)^2}{a^2} \geq 1 on the region |x-\mu| \geq a:

P(|X-\mu|\geq a)=\int_{|x-\mu|\geq a}f(x)\,dx\leq \int\frac{(x-\mu)^2}{a^2}f(x)\,dx=\frac{1}{a^2}Var(X)=\frac{\sigma^2}{a^2}
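The same kind of Monte Carlo check works for Chebyshev's Inequality. In this Python sketch, X ~ Normal(0, 1), the thresholds, and the helper name `chebyshev_check` are illustrative assumptions; \mu and \sigma^2 are taken as the empirical mean and population variance of the samples.

```python
import random


def chebyshev_check(n_samples: int = 100_000, seed: int = 1):
    """Estimate P(|X - mu| >= a) and the Chebyshev bound sigma^2 / a^2.

    X ~ Normal(0, 1) is an arbitrary illustrative choice; mu and sigma^2 are
    the empirical mean and (population, divide-by-n) variance of the samples.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    mu = sum(xs) / n_samples
    var = sum((x - mu) ** 2 for x in xs) / n_samples
    rows = []
    for a in (1.0, 2.0, 3.0):
        tail = sum(1 for x in xs if abs(x - mu) >= a) / n_samples  # empirical tail
        rows.append((a, tail, var / a**2))
    return rows


if __name__ == "__main__":
    for a, tail, bound in chebyshev_check():
        print(f"a={a:3.1f}  P(|X-mu|>=a) ~ {tail:.4f}  <=  sigma^2/a^2 ~ {bound:.4f}")
```

For a normal r.v. the true tails (about 0.046 at a = 2 and 0.003 at a = 3) are far below \sigma^2/a^2, because Chebyshev only uses the variance.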

Chernoff's Inequality

For any r.v. X and constants a>0 and t>0,

P(X\geq a) \leq \frac{E[e^{tX}]}{e^{ta}}

Prf:

Since e^{tx} is strictly increasing in x for t > 0, and e^{tX} is a nonnegative r.v., Markov's Inequality gives

P(X\geq a)=P(e^{tX}\geq e^{ta})\leq \frac{E[e^{tX}]}{e^{ta}}.
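Since the bound holds for every t > 0, one usually minimizes it over t. This Python sketch assumes X ~ N(0, 1), whose moment generating function is E[e^{tX}] = e^{t^2/2}; minimizing t^2/2 - ta gives t = a and the bound e^{-a^2/2}. The grid of t values and the helper names are illustrative choices.

```python
import math


def chernoff_bound_normal(a: float, t: float) -> float:
    """Chernoff bound E[e^{tX}] / e^{ta} for X ~ N(0, 1).

    Uses the standard-normal MGF E[e^{tX}] = e^{t^2 / 2}.
    """
    return math.exp(t * t / 2 - t * a)


def best_chernoff_bound(a: float, ts) -> float:
    """Tightest bound over a grid of t > 0 (the inequality holds for every t > 0)."""
    return min(chernoff_bound_normal(a, t) for t in ts)


def normal_tail(a: float) -> float:
    """Exact P(X >= a) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(a / math.sqrt(2))


ts = [k / 100 for k in range(1, 1001)]  # grid t = 0.01, 0.02, ..., 10.0
for a in (1.0, 2.0, 3.0):
    print(f"a={a}: exact={normal_tail(a):.5f}  "
          f"best Chernoff={best_chernoff_bound(a, ts):.5f}  "
          f"e^(-a^2/2)={math.exp(-a * a / 2):.5f}")
```

Unlike Markov and Chebyshev, the optimized Chernoff bound decays exponentially in a^2, which is the behaviour Hoeffding-type bounds build on.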

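Hoeffding Lemma

Let X be a r.v. with E[X] = 0 and a \leq X \leq b. Then for any \lambda \in \mathbb{R},

E[e^{\lambda X}] \leq e^{\frac{1}{8}\lambda^2(b-a)^2}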

Tools:

1. Jensen's Inequality:

For a convex function f and \lambda_1, \lambda_2 \geq 0 with \lambda_1+\lambda_2 = 1,

f(\lambda_1 x_1 + \lambda_2 x_2)\leq \lambda_1 f(x_1) + \lambda_2 f(x_2).

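2. Taylor's Theorem:

If all derivatives of f exist at a point a (and f is analytic there),

f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2 + \dots + \frac{1}{k!}f^{(k)}(a)(x-a)^k + \dots

The form used below is the second-order expansion with the Lagrange remainder: if f is twice differentiable on an interval containing a and x, then for some \xi between a and x,

f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(\xi)(x-a)^2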

Prf:

1. f(x)=e^{\lambda x} is a convex function, so for any \alpha \in [0, 1],

f(\alpha a + (1-\alpha) b)\leq \alpha f(a) + (1-\alpha)f(b)=\alpha e^{\lambda a} + (1-\alpha) e^{\lambda b}.

For x \in [a, b], take \alpha = \frac{b-x}{b-a} \in [0, 1], so that x = \alpha a + (1-\alpha) b. Then

e^{\lambda x}\leq \frac{b-x}{b-a}e^{\lambda a} + \frac{x-a}{b-a}e^{\lambda b}.

Taking expectations and using E[X] = 0:

E[e^{\lambda X}]\leq \frac{b-E[X]}{b-a}e^{\lambda a} + \frac{E[X]-a}{b-a}e^{\lambda b}=\frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b}.
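The bound from step 1 can be checked numerically. In this Python sketch the support [a, b] = [-1, 3] and both zero-mean distributions are made-up illustrative choices. The two-point law on the endpoints attains the bound exactly (the chord of e^{\lambda x} meets the curve at x = a and x = b), while a law with mass in the interior stays strictly below it.

```python
import math


def mgf(values, probs, lam):
    """E[e^{lam X}] for a discrete r.v. with the given support and probabilities."""
    return sum(p * math.exp(lam * v) for v, p in zip(values, probs))


def convexity_bound(a, b, lam):
    """Right-hand side b/(b-a) e^{lam a} - a/(b-a) e^{lam b} from step 1."""
    return b / (b - a) * math.exp(lam * a) - a / (b - a) * math.exp(lam * b)


a, b = -1.0, 3.0
# Zero-mean law on the endpoints {a, b}: P(a) = b/(b-a), P(b) = -a/(b-a).
two_point = ([a, b], [b / (b - a), -a / (b - a)])
# Zero-mean law with an interior point: -1*0.3 + 0*0.6 + 3*0.1 = 0.
three_point = ([-1.0, 0.0, 3.0], [0.3, 0.6, 0.1])

lams = [-2.0, -0.5, 0.5, 1.0, 2.0]
for lam in lams:
    print(f"lam={lam:+.1f}  two-point={mgf(*two_point, lam):.4f}  "
          f"three-point={mgf(*three_point, lam):.4f}  "
          f"bound={convexity_bound(a, b, lam):.4f}")
```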

We now look for a function \varphi such that

E[e^{\lambda X}]\leq \frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b}=e^{\varphi(t)} for a suitable choice of t.

2. Define \varphi(t)=-\theta t + \ln(1-\theta + \theta e^t), where \theta = -\frac{a}{b-a} \in [0, 1] (since E[X] = 0 forces a \leq 0 \leq b). Then

e^{\varphi(t)}|_{t=\lambda(b-a)}=e^{-\theta t}[1-\theta + \theta e^t]|_{t=\lambda(b-a)}=e^{-\theta \lambda (b-a)}[1- \theta + \theta e^{\lambda (b-a)}].

Substituting \theta = -\frac{a}{b-a} (so 1-\theta = \frac{b}{b-a} and e^{-\theta\lambda(b-a)} = e^{\lambda a}):

e^{\lambda a}\left[\frac{b}{b-a}-\frac{a}{b-a}e^{\lambda(b-a)}\right]=\frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b},

So:

E[e^{\lambda X}]\leq e^{\varphi(\lambda(b-a))},

and it remains to show that \varphi(t) \leq \frac{1}{8}t^2 (step 3, via Taylor's Theorem), which gives E[e^{\lambda X}]\leq e^{\frac{1}{8}\lambda^2(b-a)^2}.
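The algebra in step 2 can be spot-checked numerically. This Python sketch evaluates both sides of e^{\varphi(\lambda(b-a))} = \frac{b}{b-a}e^{\lambda a} - \frac{a}{b-a}e^{\lambda b} with \theta = -\frac{a}{b-a}; the (a, b, \lambda) values are arbitrary, chosen with a \leq 0 \leq b as E[X] = 0 requires.

```python
import math


def phi(t: float, theta: float) -> float:
    """phi(t) = -theta t + ln(1 - theta + theta e^t) from step 2."""
    return -theta * t + math.log(1 - theta + theta * math.exp(t))


def identity_sides(a: float, b: float, lam: float):
    """Return (e^{phi(lam (b - a))}, b/(b-a) e^{lam a} - a/(b-a) e^{lam b})."""
    theta = -a / (b - a)
    lhs = math.exp(phi(lam * (b - a), theta))
    rhs = b / (b - a) * math.exp(lam * a) - a / (b - a) * math.exp(lam * b)
    return lhs, rhs


cases = [(-1.0, 3.0), (-2.0, 0.5), (-0.1, 0.9)]
lams = (-1.5, -0.3, 0.0, 0.7, 2.0)
for a, b in cases:
    for lam in lams:
        lhs, rhs = identity_sides(a, b, lam)
        print(f"a={a}, b={b}, lam={lam}: {lhs:.6f} vs {rhs:.6f}")
```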

3. By Taylor's Theorem with the Lagrange remainder, for some \xi between 0 and t,

\varphi(t) = \varphi(0) + \varphi'(0)\,t + \frac{1}{2}\varphi''(\xi)\,t^2.

\varphi(0) = 0,

\varphi'(t) = -\theta + \frac{\theta e^t}{1 -\theta + \theta e^t},

\varphi'(0) = 0,

\varphi''(t) = \frac{(1-\theta)\theta e^t}{(1-\theta +\theta e^t)^2}.

Since \frac{xy}{(x+y)^2} \leq \frac{1}{4} whenever x + y \neq 0 (equivalent to (x-y)^2 \geq 0), setting x = 1-\theta and y = \theta e^t gives

\varphi''(t) \leq \frac{1}{4} for all t.

Therefore

\varphi(t) \leq 0 + 0 + \frac{1}{2}\cdot\frac{1}{4}t^2 = \frac{1}{8}t^2,

and

E[e^{\lambda X}]\leq e^{\varphi(t)}|_{t=\lambda(b-a)} \leq e^{\frac{1}{8}\lambda^2(b-a)^2}.

This proves Hoeffding's Lemma.
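Both inequalities in step 3 and the final statement can be checked numerically. This Python sketch evaluates \varphi'' and \varphi(t) - t^2/8 on a grid of \theta \in [0, 1] and t \in [-10, 10] (grid chosen arbitrarily), then compares the closed-form MGF of X ~ Uniform[-c, c] (an illustrative zero-mean choice, with c = 2 assumed) against e^{\lambda^2(b-a)^2/8}. A finite grid is evidence, not a proof.

```python
import math


def phi(t: float, theta: float) -> float:
    """phi(t) = -theta t + ln(1 - theta + theta e^t), defined for theta in [0, 1]."""
    return -theta * t + math.log(1 - theta + theta * math.exp(t))


def phi_second(t: float, theta: float) -> float:
    """phi''(t) = (1 - theta) theta e^t / (1 - theta + theta e^t)^2."""
    e = theta * math.exp(t)
    return (1 - theta) * e / (1 - theta + e) ** 2


def uniform_mgf(lam: float, c: float) -> float:
    """E[e^{lam X}] for X ~ Uniform[-c, c], i.e. sinh(lam c) / (lam c)."""
    if lam == 0:
        return 1.0
    u = lam * c
    return math.sinh(u) / u


def hoeffding_bound(lam: float, a: float, b: float) -> float:
    """Right-hand side e^{lam^2 (b - a)^2 / 8} of Hoeffding's Lemma."""
    return math.exp(lam**2 * (b - a) ** 2 / 8)


thetas = [k / 10 for k in range(11)]       # theta in {0, 0.1, ..., 1}
ts = [k / 10 for k in range(-100, 101)]    # t in [-10, 10]
max_second = max(phi_second(t, th) for th in thetas for t in ts)
max_gap = max(phi(t, th) - t * t / 8 for th in thetas for t in ts)
print(f"max phi'' = {max_second:.6f}  (should not exceed 0.25)")
print(f"max phi(t) - t^2/8 = {max_gap:.2e}  (should not exceed 0, up to rounding)")

c = 2.0
lams = (-3.0, -1.0, -0.25, 0.0, 0.25, 1.0, 3.0)
for lam in lams:
    print(f"lam={lam:+.2f}  E[e^(lam X)]={uniform_mgf(lam, c):.4f}  "
          f"bound={hoeffding_bound(lam, -c, c):.4f}")
```

The maximum \varphi'' = 1/4 is reached at \theta = 1/2, t = 0, so the constant 1/8 cannot be improved by this argument. For the uniform case the lemma is loose: with u = \lambda c, the MGF is \sinh(u)/u \leq e^{u^2/6} while the bound is e^{u^2/2}.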
