Chapter 6 Stochastic Approximation


Stochastic Approximation (SA) refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. Compared with many other root-finding methods, such as gradient-based methods, SA is powerful in that it requires neither the expression of the objective function nor its derivative.

In this chapter we introduce some basic SA algorithms.

Mean Estimation in an iterative manner

We can calculate the mean of $X$ by $\mathbb{E}[X] \approx \bar{x} := \frac{1}{N} \sum_{i=1}^N x_i$.

This approach is trivial but requires collecting all the samples before the mean can be calculated.

An iterative way can be written as:

$$
w_{k+1} = w_k - \frac{1}{k}(w_k - x_k), \qquad \text{where } w_{k+1} := \frac{1}{k} \sum_{i=1}^{k} x_i
$$

This approach is incremental: a mean estimate is obtained immediately once a sample is received, and it can then be used for other purposes right away.
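
A minimal sketch of this incremental update in Python (the Gaussian samples and their true mean of 5.0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Incremental mean estimation: w_{k+1} = w_k - (1/k) * (w_k - x_k).
# w is the running estimate of E[X] and is usable after every sample.
w = 0.0
for k, x_k in enumerate(rng.normal(loc=5.0, scale=2.0, size=10_000), start=1):
    w = w - (1.0 / k) * (w - x_k)

print(w)  # close to the true mean 5.0
```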

Robbins-Monro algorithm

The Robbins-Monro (RM) algorithm is designed to solve the root-finding problem:

$$g(w) = 0$$

Here $w \in \mathbb{R}$ is the variable to be solved for and $g: \mathbb{R} \to \mathbb{R}$ is a function.

Many problems can be converted to this root-finding problem, e.g., minimizing a function amounts to finding the root of its derivative.

RM is a model-free approach that does not require the expression of $g$. This is important since $g$ can be very complex, e.g., represented by a neural network.

The RM algorithm is:

$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k), \qquad k = 1, 2, 3, \dots
$$

where $w_k$ is the $k$-th estimate of the root, $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th noisy observation, and $\alpha_k$ is a positive coefficient.

The algorithm treats $g$ as a black box: its input is the sequence of root estimates $\{w_k\}$ and its output is the sequence of noisy observations $\{\tilde{g}(w_k, \eta_k)\}$.

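A minimal numerical sketch of RM, assuming an illustrative $g(w) = 2(w - 3)$ with Gaussian observation noise; the algorithm itself only ever sees the noisy outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_tilde(w: float) -> float:
    """Noisy black-box observation g~(w, eta) = g(w) + eta.
    Illustrative g(w) = 2*(w - 3): the true root is w* = 3."""
    return 2.0 * (w - 3.0) + rng.normal(scale=1.0)

w = 0.0  # initial root estimate w_1
for k in range(1, 5001):
    alpha_k = 1.0 / k          # step sizes with sum = inf, sum of squares < inf
    w -= alpha_k * g_tilde(w)  # RM update: w_{k+1} = w_k - alpha_k * g~(w_k)

print(w)  # approaches the root w* = 3
```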

The convergence of RM is characterized by the Robbins-Monro theorem: $w_k$ converges with probability 1 to the root $w^*$ of $g(w) = 0$ if

  1. $0 < c_1 \le \nabla_w g(w) \le c_2$ for all $w$;
  2. $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$;
  3. $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$, where $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$.

In condition 2, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$ indicates that $\alpha_k \to 0$ as $k \to \infty$. This is important for ensuring that the update steps vanish, so that $w_k$ can eventually settle at $w^*$:

$$w_{k+1} - w_k = -\alpha_k \tilde{g}(w_k, \eta_k) \to 0 \quad \text{as } k \to \infty$$

$\sum_{k=1}^{\infty} \alpha_k = \infty$ in condition 2 indicates that $\alpha_k$ should not converge to zero too fast. Summing the updates over all iterations gives

$$w_1 - w_{\infty} = \sum_{k=1}^{\infty} \alpha_k \tilde{g}(w_k, \eta_k)$$

If $\sum_{k=1}^{\infty} \alpha_k < \infty$, the total correction would be bounded, and an initial guess $w_1$ far from $w^*$ could never reach it.

Comparison between Mean estimation and RM

Mean estimation is actually a special case of RM.

RM:

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)$$

Mean estimation:

$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$$

If we set $g(w) = w - \mathbb{E}[X]$, the goal of finding the root of $g(w)$ corresponds exactly to the goal of mean estimation.

We set $\alpha_k = \frac{1}{k}$. Next we need to derive $\tilde{g}(w_k, \eta_k)$.

In mean estimation, $\tilde{g} = w - x$. So we need to show that $\tilde{g}(w, \eta) = w - x = g(w) + \eta$:

$$\tilde{g}(w, \eta) = w - x = w - x + \mathbb{E}[X] - \mathbb{E}[X] = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta$$

Here, $x$ is a sample of $X$, just like the $x_k$ obtained in the $k$-th iteration.
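
A quick numerical check of this correspondence (the exponential distribution with mean 3.0 is an illustrative assumption): RM with $\tilde{g}(w_k, \eta_k) = w_k - x_k$ and $\alpha_k = 1/k$ reproduces the running sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(scale=3.0, size=5000)  # illustrative data, E[X] = 3.0

# RM applied to g(w) = w - E[X], observed through g~(w_k, eta_k) = w_k - x_k.
w = 0.0
for k, x_k in enumerate(samples, start=1):
    w -= (1.0 / k) * (w - x_k)

print(w, samples.mean())  # the two agree: this RM instance is the sample mean
```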

Stochastic gradient descent

Recall the gradient descent (GD) method for minimizing $J(w) = \mathbb{E}[f(w, X)]$:

$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

However, directly calculating this gradient may be difficult, since 1) we do not know the distribution of $X$, and 2) even if we did, computing the expected gradient of $f$ may be difficult and time-consuming.

So batch gradient descent (BGD) is introduced:

$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i)$$

It still requires many samples in each iteration.

So we introduce stochastic gradient descent (SGD) by replacing the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ in GD with the stochastic gradient (a one-sample estimate of the gradient) $\nabla_w f(w_k, x_k)$:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$

SGD is thus BGD with batch size $n = 1$.
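
An illustrative sketch of SGD fitting a two-parameter linear model; the data, the model, and the step size $\alpha_k = k^{-0.6}$ are assumptions (this schedule still satisfies condition 2, since $\sum k^{-0.6} = \infty$ and $\sum k^{-1.2} < \infty$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: y = 2*x + 1 + noise, f(w, (x, y)) = (w0*x + w1 - y)^2 / 2.
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)

w = np.zeros(2)  # [slope, intercept]
for k in range(1, n + 1):
    i = rng.integers(n)                 # one random sample per iteration
    err = w[0] * x[i] + w[1] - y[i]
    grad = err * np.array([x[i], 1.0])  # single-sample stochastic gradient
    w -= k ** -0.6 * grad               # alpha_k = k^(-0.6)

print(w)  # approaches [2.0, 1.0]
```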

Comparison of SGD with RM

SGD aims to minimize $J(w) = \mathbb{E}[f(w, X)]$, which is equivalent to finding the root of $g(w) = \nabla_w \mathbb{E}[f(w, X)] = \mathbb{E}[\nabla_w f(w, X)] = 0$.

In RM:

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)$$

We identify $\alpha_k$ with the learning rate. Next we need to show that $\tilde{g}(w_k, \eta_k)$ can be the single-sample gradient estimate $\nabla_w f(w_k, x_k)$:

$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \mathbb{E}[\nabla_w f(w, X)] + \big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big) \doteq g(w) + \eta$$

So SGD is a special case of RM and inherits RM's convergence guarantees.

Between BGD and SGD lies mini-batch gradient descent (MBGD), which can be expressed as:

$$
\begin{aligned}
\text{BGD:} \quad & w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i) \\
\text{MBGD:} \quad & w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j) \\
\text{SGD:} \quad & w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)
\end{aligned}
$$

where $\mathcal{I}_k$ is a set of $m$ indices sampled from $\{1, \dots, n\}$.
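
A sketch comparing the three on the mean-estimation objective $f(w, x) = \frac{1}{2}(w - x)^2$; the data distribution and batch sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=4.0, scale=1.0, size=1000)  # illustrative data, E[X] = 4.0

def run(m: int, iters: int = 2000) -> float:
    """(Mini-)batch gradient descent on f(w, x) = (w - x)^2 / 2 with
    batch size m: m = 1 is SGD, m = len(data) is BGD, otherwise MBGD."""
    w = 0.0
    for k in range(1, iters + 1):
        batch = rng.choice(data, size=m, replace=False)
        grad = float(np.mean(w - batch))  # average of single-sample gradients
        w -= (1.0 / k) * grad             # alpha_k = 1/k
    return w

for m in (1, 32, len(data)):  # SGD, MBGD, BGD
    print(m, run(m))          # all approach the true mean 4.0
```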

Deeper understanding of SGD

In practice, SGD can converge towards the minimum quickly, but it fluctuates around the minimum rather than settling exactly.

The single-sample gradient estimate is not accurate and introduces noise. However, this noise can be beneficial:

  1. Escaping local minima: the noise in SGD's gradient estimates helps it jump out of sharp, suboptimal local minima that full-batch gradient descent would firmly converge to, often leading to flat minima that generalize better.

  2. Evading saddle points: in high-dimensional non-convex problems, saddle points are prevalent. While full-batch gradient descent stalls at these points, the stochastic noise in SGD constantly perturbs the parameters, pushing the optimization process forward.
