A derivation of Upper-Confidence-Bound Action Selection, from the bandit algorithms in Chapter 2 of Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.
Preliminaries
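Markov's Inequality
For any r.v. (random variable) $X$ and constant $a > 0$,
$$P(|X| \ge a) \le \frac{E|X|}{a}.$$
Prf:
Since $a \cdot \mathbf{1}\{|X| \ge a\} \le |X|$ always holds, taking expectations on both sides gives
$$a \, P(|X| \ge a) \le E|X|,$$
and dividing by $a$ gives the result.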
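As a quick sanity check of Markov's inequality, $P(|X| \ge a) \le E|X|/a$, the sketch below compares the exact tail probability with the bound. The fair six-sided die and the helper names `tail_prob` and `markov_bound` are assumptions chosen for illustration; they are not part of the book's derivation.

```python
from fractions import Fraction

# Hypothetical example distribution: a fair six-sided die, X in {1, ..., 6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def tail_prob(pmf, a):
    """Exact P(|X| >= a) for a finite distribution given as {value: probability}."""
    return sum(p for x, p in pmf.items() if abs(x) >= a)

def markov_bound(pmf, a):
    """Markov bound E|X| / a."""
    mean_abs = sum(abs(x) * p for x, p in pmf.items())
    return mean_abs / a

for a in (2, 4, 6):
    # The exact tail probability never exceeds the bound.
    print(a, tail_prob(pmf, a), markov_bound(pmf, a))
```

The bound is loose here (it can even exceed 1 for small $a$), which is expected: Markov's inequality uses only the mean of $|X|$.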
Chebyshev's Inequality
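Let $X$ have mean $\mu$ and variance $\sigma^2$. Then for any $a > 0$,
$$P(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}.$$
Prf:
$$P(|X - \mu| \ge a) = P\big((X - \mu)^2 \ge a^2\big) \le \frac{E[(X - \mu)^2]}{a^2} = \frac{\sigma^2}{a^2},$$
by Markov's Inequality applied to the r.v. $(X - \mu)^2$ and the constant $a^2$.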
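Chebyshev's inequality, $P(|X - \mu| \ge a) \le \sigma^2 / a^2$, can be checked exactly on a small finite distribution. This is a minimal sketch; the fair die and the function names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical example distribution: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def mean(pmf):
    """E[X] for a finite distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def deviation_prob(pmf, a):
    """Exact P(|X - mu| >= a)."""
    mu = mean(pmf)
    return sum(p for x, p in pmf.items() if abs(x - mu) >= a)

def chebyshev_bound(pmf, a):
    """Chebyshev bound sigma^2 / a^2."""
    return variance(pmf) / a ** 2

for a in (1, 2):
    print(a, deviation_prob(pmf, a), chebyshev_bound(pmf, a))
```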
Chernoff's Inequality
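For any r.v. $X$ and constants $a > 0$ and $t > 0$,
$$P(X \ge a) \le \frac{E[e^{tX}]}{e^{ta}}.$$
Prf:
Since $x \mapsto e^{tx}$ is increasing for $t > 0$,
$$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}},$$
by Markov's Inequality.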
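Because the bound $P(X \ge a) \le E[e^{tX}] / e^{ta}$ holds for every $t > 0$, one can pick the $t$ that makes it tightest. The sketch below does this by a coarse grid search for a binomial r.v.; the choice of $X \sim \text{Binomial}(20, 0.5)$, the threshold $a = 15$, and the grid of $t$ values are all assumptions made for illustration.

```python
import math

def binomial_tail(n, p, a):
    """Exact P(X >= a) for X ~ Binomial(n, p) and integer a."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))

def chernoff_bound(n, p, a, t):
    """E[e^{tX}] / e^{ta}, using the binomial MGF (1 - p + p e^t)^n."""
    return (1 - p + p * math.exp(t)) ** n / math.exp(t * a)

# Hypothetical parameters for the check.
n, p, a = 20, 0.5, 15
ts = [0.1 * k for k in range(1, 31)]  # grid of t in (0, 3]

# Every t gives a valid bound; the smallest one is the most useful.
best_t = min(ts, key=lambda t: chernoff_bound(n, p, a, t))
print("exact tail:", binomial_tail(n, p, a))
print("best t on grid:", best_t, "bound:", chernoff_bound(n, p, a, best_t))
```

Minimizing over $t$ is the step that later turns a moment-generating-function bound into a concentration inequality.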
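Hoeffding's Lemma
Let $X$ be a r.v. with $E[X] = 0$ and $a \le X \le b$, where $a < b$. Then for any real $t$,
$$E[e^{tX}] \le \exp\left(\frac{t^2 (b - a)^2}{8}\right).$$
Tool:
1. Jensen's Inequality:
If $f$ is a convex function and $\lambda \in [0, 1]$, then
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
2. Taylor's Theorem:
If $f$ is twice differentiable on an interval containing $x_0$ and $x$, then for some $\xi$ between $x_0$ and $x$,
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(\xi)}{2}(x - x_0)^2.$$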
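Prf:
1. $e^{tx}$ is a convex function of $x$, so for any $x \in [a, b]$, Jensen's Inequality with $\lambda = \frac{b - x}{b - a}$ gives
$$e^{tx} \le \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$$
Taking expectations and using $E[X] = 0$,
$$E[e^{tX}] \le \frac{b}{b - a} e^{ta} - \frac{a}{b - a} e^{tb}.$$
We need to find a function $\varphi$, s.t. the right-hand side equals $e^{\varphi(u)}$. Let $u = t(b - a)$ and $p = \frac{-a}{b - a}$; note that $0 \le p \le 1$, because $E[X] = 0$ forces $a \le 0 \le b$. Then
$$\varphi(u) = -pu + \ln(1 - p + p e^{u}).$$
2. $\varphi(0) = 0$, and
$$\varphi'(u) = -p + \frac{p e^{u}}{1 - p + p e^{u}}, \qquad \varphi'(0) = 0,$$
$$\varphi''(u) = \frac{p (1 - p) e^{u}}{(1 - p + p e^{u})^2} = q(1 - q) \le \frac{1}{4}, \qquad q = \frac{p e^{u}}{1 - p + p e^{u}} \in [0, 1].$$
So, by Taylor's Theorem around $0$, for some $\xi$ between $0$ and $u$,
$$\varphi(u) = \varphi(0) + u \varphi'(0) + \frac{u^2}{2} \varphi''(\xi) \le \frac{u^2}{8} = \frac{t^2 (b - a)^2}{8}.$$
3. Combining steps 1 and 2,
$$E[e^{tX}] \le e^{\varphi(u)} \le \exp\left(\frac{t^2 (b - a)^2}{8}\right).$$
So Hoeffding's Lemma is proved.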
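Hoeffding's Lemma states that a zero-mean r.v. with $a \le X \le b$ satisfies $E[e^{tX}] \le \exp(t^2 (b - a)^2 / 8)$. A numerical check on the extremal case, a zero-mean r.v. that takes only the two values $a$ and $b$, is sketched below. The endpoints $a = -1$, $b = 3$ and the grid of $t$ values are assumptions for illustration.

```python
import math

def two_point_mgf(a, b, t):
    """E[e^{tX}] for the zero-mean X with P(X=a) = b/(b-a), P(X=b) = -a/(b-a)."""
    return (b * math.exp(t * a) - a * math.exp(t * b)) / (b - a)

def hoeffding_bound(a, b, t):
    """Right-hand side of Hoeffding's Lemma: exp(t^2 (b-a)^2 / 8)."""
    return math.exp(t**2 * (b - a) ** 2 / 8)

# Hypothetical endpoints with a < 0 < b, so a zero-mean two-point r.v. exists.
a, b = -1.0, 3.0
ts = [0.25 * k for k in range(-20, 21)]  # t in [-5, 5], including t = 0

# Gap between the bound and the exact MGF; it should never be negative.
gaps = [hoeffding_bound(a, b, t) - two_point_mgf(a, b, t) for t in ts]
print("smallest gap:", min(gaps))
```

The two-point distribution is a natural test case because its moment generating function coincides with the convexity bound from step 1 of the proof, so only the Taylor step is being tested.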