Chapter 6 Stochastic Approximation


Stochastic Approximation (SA) refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. Compared with many other root-finding methods, such as gradient-based methods, SA is powerful in that it requires neither the expression of the objective function nor its derivative.

In this chapter we introduce some basic SA algorithms.

Mean Estimation in an iterative manner

We can calculate the mean of $X$ by $\mathbb{E}[X] \approx \bar{x} := \frac{1}{N} \sum_{i=1}^N x_i$.

This approach is trivial but requires collecting all the samples before the mean can be calculated.

An iterative way can be written as:

$$
w_{k+1} = w_k - \frac{1}{k}(w_k - x_k), \qquad \text{where } w_{k+1} := \frac{1}{k} \sum_{i=1}^{k} x_i
$$

This approach is incremental: a mean estimate is obtained immediately once a sample is received, and it can then be used for other purposes right away.
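
A minimal sketch of this incremental update in Python (the Gaussian samples and their true mean of 5.0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Incremental mean estimation: w_{k+1} = w_k - (1/k) * (w_k - x_k).
# w is the running estimate of E[X] and is usable after every sample.
w = 0.0
for k, x_k in enumerate(rng.normal(loc=5.0, scale=2.0, size=10_000), start=1):
    w = w - (1.0 / k) * (w - x_k)

print(w)  # close to the true mean 5.0
```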

Robbins-Monro algorithm

The Robbins-Monro (RM) algorithm is designed to solve the root-finding problem:

$$g(w) = 0$$

Here $w \in \mathbb{R}$ is the variable to be solved for and $g: \mathbb{R} \to \mathbb{R}$ is a function.

Many problems can be converted to this root-finding problem, e.g., minimizing a function amounts to finding the root of its derivative.

RM is a model-free approach that does not require the expression of $g$. This is important since $g$ can be very complex, e.g., represented by a neural network.

The RM algorithm is:

$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k), \qquad k = 1, 2, 3, \dots
$$

where $w_k$ is the $k$-th estimate of the root, $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th noisy observation, and $\alpha_k$ is a positive coefficient.

The algorithm treats $g$ as a black box: its input is the sequence of root estimates $\{w_k\}$ and its output is the sequence of noisy observations $\{\tilde{g}(w_k, \eta_k)\}$.

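A minimal numerical sketch of RM, assuming an illustrative $g(w) = 2(w - 3)$ with Gaussian observation noise; the algorithm itself only ever sees the noisy outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w: float) -> float:
    """Noisy black-box observation g~(w, eta) = g(w) + eta.
    Illustrative g(w) = 2*(w - 3): the true root is w* = 3."""
    return 2.0 * (w - 3.0) + rng.normal(scale=1.0)

w = 0.0  # initial root estimate w_1
for k in range(1, 5001):
    alpha_k = 1.0 / k          # step sizes with sum = inf, sum of squares < inf
    w -= alpha_k * g_tilde(w)  # RM update: w_{k+1} = w_k - alpha_k * g~(w_k)

print(w)  # approaches the root w* = 3
```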

The convergence of RM is characterized by the Robbins-Monro theorem: $w_k$ converges with probability 1 to the root $w^*$ of $g(w) = 0$ if

  1. $0 < c_1 \le \nabla_w g(w) \le c_2$ for all $w$;
  2. $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$;
  3. $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$, where $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$.

In condition 2, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$ indicates that $\alpha_k \to 0$ as $k \to \infty$. This is important for ensuring that the update steps vanish, so that $w_k$ can eventually settle at $w^*$:

$$w_{k+1} - w_k = -\alpha_k \tilde{g}(w_k, \eta_k) \to 0 \quad \text{as } k \to \infty$$

$\sum_{k=1}^{\infty} \alpha_k = \infty$ in condition 2 indicates that $\alpha_k$ should not converge to zero too fast. Summing the updates over all iterations gives

$$w_1 - w_{\infty} = \sum_{k=1}^{\infty} \alpha_k \tilde{g}(w_k, \eta_k)$$

If $\sum_{k=1}^{\infty} \alpha_k < \infty$, the total correction would be bounded, and an initial guess $w_1$ far from $w^*$ could never reach it.

Comparison between Mean estimation and RM

Mean estimation is actually a special case of RM.

RM:

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)$$

Mean estimation:

$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$$

If we set $g(w) = w - \mathbb{E}[X]$, the goal of finding the root of $g(w)$ corresponds exactly to the goal of mean estimation.

We set $\alpha_k = \frac{1}{k}$. Next we need to derive $\tilde{g}(w_k, \eta_k)$.

In mean estimation, $\tilde{g} = w - x$. So we need to show that $\tilde{g}(w, \eta) = w - x = g(w) + \eta$:

$$\tilde{g}(w, \eta) = w - x = w - x + \mathbb{E}[X] - \mathbb{E}[X] = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta$$

Here, $x$ is a sample of $X$, just like the $x_k$ obtained in the $k$-th iteration.
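
A quick numerical check of this correspondence (the exponential distribution with mean 3.0 is an illustrative assumption): RM with $\tilde{g}(w_k, \eta_k) = w_k - x_k$ and $\alpha_k = 1/k$ reproduces the running sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(scale=3.0, size=5000)  # illustrative data, E[X] = 3.0

# RM applied to g(w) = w - E[X], observed through g~(w_k, eta_k) = w_k - x_k.
w = 0.0
for k, x_k in enumerate(samples, start=1):
    w -= (1.0 / k) * (w - x_k)

print(w, samples.mean())  # the two agree: this RM instance is the sample mean
```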

Stochastic gradient descent

Recall the gradient descent (GD) method for minimizing $J(w) = \mathbb{E}[f(w, X)]$:

$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

However, directly calculating this gradient may be difficult, since 1) we do not know the distribution of $X$, and 2) even if we did, computing the expected gradient of $f$ may be difficult and time-consuming.

So batch gradient descent (BGD) is introduced:

$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$

It still requires many samples in each iteration.

So we introduce stochastic gradient descent (SGD) by replacing the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ in GD with the stochastic gradient (a one-sample estimate of the gradient) $\nabla_w f(w_k, x_k)$:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

SGD is thus BGD with batch size $n = 1$.
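
An illustrative sketch of SGD fitting a two-parameter linear model; the data, the model, and the step size $\alpha_k = k^{-0.6}$ are assumptions (this schedule still satisfies condition 2, since $\sum k^{-0.6} = \infty$ and $\sum k^{-1.2} < \infty$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: y = 2*x + 1 + noise, f(w, (x, y)) = (w0*x + w1 - y)^2 / 2.
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)

w = np.zeros(2)  # [slope, intercept]
for k in range(1, n + 1):
    i = rng.integers(n)                 # one random sample per iteration
    err = w[0] * x[i] + w[1] - y[i]
    grad = err * np.array([x[i], 1.0])  # single-sample stochastic gradient
    w -= k ** -0.6 * grad               # alpha_k = k^(-0.6)

print(w)  # approaches [2.0, 1.0]
```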

Comparison of SGD with RM

SGD aims to minimize $J(w) = \mathbb{E}[f(w, X)]$, which is equivalent to finding the root of $g(w) = \nabla_w \mathbb{E}[f(w, X)] = \mathbb{E}[\nabla_w f(w, X)] = 0$.

In RM:

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)$$

We identify $\alpha_k$ with the learning rate. Next we need to show that $\tilde{g}(w_k, \eta_k)$ can be the single-sample gradient estimate $\nabla_w f(w_k, x_k)$:

$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \mathbb{E}[\nabla_w f(w, X)] + \big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big) \doteq g(w) + \eta$$

So SGD is a special case of RM and inherits RM's convergence guarantees.

Between BGD and SGD lies mini-batch gradient descent (MBGD), which can be expressed as:

$$
\begin{aligned}
\text{BGD:} \quad & w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i) \\
\text{MBGD:} \quad & w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j) \\
\text{SGD:} \quad & w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)
\end{aligned}
$$

where $\mathcal{I}_k$ is a set of $m$ indices sampled from $\{1, \dots, n\}$.
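
A sketch comparing the three on the mean-estimation objective $f(w, x) = \frac{1}{2}(w - x)^2$; the data distribution and batch sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=4.0, scale=1.0, size=1000)  # illustrative data, E[X] = 4.0

def run(m: int, iters: int = 2000) -> float:
    """(Mini-)batch gradient descent on f(w, x) = (w - x)^2 / 2 with
    batch size m: m = 1 is SGD, m = len(data) is BGD, otherwise MBGD."""
    w = 0.0
    for k in range(1, iters + 1):
        batch = rng.choice(data, size=m, replace=False)
        grad = float(np.mean(w - batch))  # average of single-sample gradients
        w -= (1.0 / k) * grad             # alpha_k = 1/k
    return w

for m in (1, 32, len(data)):  # SGD, MBGD, BGD
    print(m, run(m))          # all approach the true mean 4.0
```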

Deeper understanding of SGD

In practice, SGD can converge towards the minimum quickly, but it fluctuates around the minimum rather than settling exactly.

The single-sample gradient estimate is not accurate and introduces noise. However, this noise can be beneficial:

  1. Escaping local minima: the noise in SGD's gradient estimates helps it jump out of sharp, suboptimal local minima that full-batch gradient descent would firmly converge to, often leading to flat minima that generalize better.

  2. Evading saddle points: in high-dimensional non-convex problems, saddle points are prevalent. While full-batch gradient descent stalls at these points, the stochastic noise in SGD constantly perturbs the parameters, pushing the optimization process forward.
