Chapter 6 Stochastic Approximation
Stochastic Approximation (SA) refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in that it requires neither the expression of the objective function nor that of its derivative.
In this chapter we introduce some fundamental SA algorithms.
Mean estimation in an iterative manner
Given i.i.d. samples $\{x_i\}_{i=1}^{n}$ of a random variable $X$, we can estimate the mean $\mathbb{E}[X]$ by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
This approach is straightforward, but it requires collecting all the samples before the calculation can be done.
An iterative way can be written as
$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k),$$
where $w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i$ is the mean of the first $k$ samples.
This approach is incremental. A mean estimate can be obtained immediately once a sample is received. Then, the mean estimate can be used for other purposes immediately.
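The incremental update above can be sketched in Python as follows (a minimal illustration; the function name and the Gaussian sampling are assumptions for the demo, not part of the chapter):

```python
import random

def iterative_mean(samples):
    """Incremental mean: w_{k+1} = w_k - (1/k) * (w_k - x_k).

    After processing the k-th sample, w equals the average of the
    first k samples, so an estimate is available at every step.
    """
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x)
    return w

random.seed(0)
samples = [random.gauss(5.0, 2.0) for _ in range(10000)]
print(iterative_mean(samples))  # close to the true mean 5.0
```

Because the update only needs the current estimate and the newest sample, nothing has to be stored, which is what makes the method suitable for streaming data.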
Robbins-Monro algorithm
The Robbins-Monro (RM) algorithm is designed to solve the root-finding problem
$$g(w) = 0.$$
Here $w \in \mathbb{R}$ is the variable to be solved for and $g: \mathbb{R} \to \mathbb{R}$ is a function.
Many problems can be converted to this root-finding problem; e.g., minimizing a function amounts to finding the root of its derivative.
RM is a model-free approach that works without knowing the expression of $g$. This is important since $g$ can be much more complex in some cases, e.g., when it is represented by a neural network.
The RM algorithm is
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \qquad k = 1, 2, 3, \dots$$
where $w_k$ is the $k$th estimate of the root, $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$th noisy observation, and $a_k > 0$ is a coefficient.
The algorithm takes the noisy observations $\{\tilde{g}(w_k, \eta_k)\}$ as input and outputs a sequence of root estimates $\{w_k\}$.

The convergence properties of RM are as follows: $w_k$ converges with probability 1 to the root $w^*$ of $g(w) = 0$ if

1) $0 < c_1 \le \nabla_w g(w) \le c_2$ for all $w$;

2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;

3) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$, where $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$.

In condition 2, $\sum_{k=1}^{\infty} a_k^2 < \infty$ indicates that $a_k \to 0$ as $k \to \infty$, so the update term $a_k \tilde{g}(w_k, \eta_k)$ eventually vanishes and the noise is suppressed. This is important for ensuring $w_k$ can settle at $w^*$.

$\sum_{k=1}^{\infty} a_k = \infty$ in condition 2 indicates that $a_k$ should not converge to zero too fast; otherwise the total travel distance of $w_k$ would be bounded and $w_k$ might never reach $w^*$ from a distant initial guess $w_1$.

Comparison between mean estimation and RM
Mean estimation is actually a special case of RM.
RM: $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)$.
Mean estimation: $w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$.
If we set $g(w) = w - \mathbb{E}[X]$, the goal is to find the root of $g(w) = 0$, i.e., $w^* = \mathbb{E}[X]$. This setting corresponds with the goal of mean estimation.
We set $a_k$ as $\frac{1}{k}$. Next we need to derive the observation $\tilde{g}$.
In mean estimation, the quantity available in each iteration is $w - x$. So we need to show
$$\tilde{g}(w, \eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) = g(w) + \eta,$$
where $\eta = \mathbb{E}[X] - x$ is the observation noise.
Here, $x$ is a sample of $X$, just like the $x_k$ obtained in the $k$th iteration.
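The correspondence can be checked numerically: running RM with $\tilde{g}(w, \eta) = w - x$ and $a_k = 1/k$ reproduces the batch mean exactly (a small verification sketch; names are assumptions):

```python
import random

def rm_mean(samples):
    """RM with g(w) = w - E[X], observation g~(w) = w - x, and a_k = 1/k.

    This is exactly the incremental mean-estimation update.
    """
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x)
    return w

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(5)]
# The RM estimate coincides with the batch mean up to floating-point error:
print(abs(rm_mean(xs) - sum(xs) / len(xs)) < 1e-12)  # True
```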
Stochastic gradient descent
We know the gradient descent (GD) method for minimizing $J(w) = \mathbb{E}[f(w, X)]$:
$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)].$$
However, directly calculating this expected gradient may be difficult, since 1) we may not know the distribution of $X$; 2) computing the gradient of $f$ over the whole distribution may be time-consuming.
So batch gradient descent (BGD) is introduced, which approximates the expectation with $n$ samples:
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$
It still requires many samples in each iteration.
So we introduce stochastic gradient descent (SGD) by replacing the true gradient in GD with the stochastic gradient (a single-sample estimate of the gradient) $\nabla_w f(w_k, x_k)$:
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$
That is, SGD is BGD with batch size $n = 1$.
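As a concrete sketch, SGD can recover the mean-estimation example: take $f(w, x) = \frac{1}{2}(w - x)^2$, so that $J(w) = \mathbb{E}[\frac{1}{2}(w - X)^2]$ is minimized at $w^* = \mathbb{E}[X]$ and the stochastic gradient is simply $w - x_k$ (function names and the sampling distribution are assumptions for the demo):

```python
import random

def sgd_mean(samples, w1=0.0):
    """SGD on J(w) = E[(w - X)^2 / 2]; grad_w f(w, x) = w - x."""
    w = w1
    for k, x in enumerate(samples, start=1):
        alpha_k = 1.0 / k          # step size satisfying the RM conditions
        w = w - alpha_k * (w - x)  # single-sample gradient step
    return w

random.seed(0)
samples = [random.gauss(5.0, 2.0) for _ in range(20000)]
print(sgd_mean(samples))  # close to E[X] = 5.0
```

Each iteration touches exactly one sample, in contrast to BGD, which sums the gradient over all $n$ samples before taking a single step.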
Comparison between SGD and RM
SGD is a special case of RM. The goal of SGD is to minimize $J(w) = \mathbb{E}[f(w, X)]$, which can be converted to the root-finding problem
$$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0.$$
In RM: $w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k)$.
We replace $a_k$ with the learning rate $\alpha_k$. Next we need to show that $\tilde{g}(w_k, \eta_k)$ can be the single-sample gradient estimate $\nabla_w f(w_k, x_k)$:
$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \mathbb{E}[\nabla_w f(w, X)] + \big(\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]\big) = g(w) + \eta.$$
So SGD inherits the convergence properties of RM.
Between BGD and SGD is mini-batch gradient descent (MBGD), which can be expressed as
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j),$$
where $\mathcal{I}_k$ is a set of $m$ sample indices drawn in the $k$th iteration.
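A minimal MBGD sketch on the same mean-estimation objective $J(w) = \mathbb{E}[\frac{1}{2}(w - X)^2]$ (batch size, iteration count, and names are illustrative assumptions):

```python
import random

def mbgd_mean(data, m=32, num_iters=500, w1=0.0):
    """MBGD on J(w) = E[(w - X)^2 / 2]: average the per-sample
    gradients w - x over a mini-batch of size m each iteration."""
    w = w1
    for k in range(1, num_iters + 1):
        batch = random.sample(data, m)        # draw m samples for iteration k
        grad = sum(w - x for x in batch) / m  # (1/m) * sum_j grad_w f(w, x_j)
        w = w - (1.0 / k) * grad
    return w

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10000)]
print(mbgd_mean(data))  # close to 5.0
```

With $m = 1$ this reduces to SGD, and with $m = n$ (sampling without replacement) it reduces to BGD; intermediate $m$ trades per-step cost against gradient variance.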
Deeper understanding of SGD
In practice, SGD can converge towards the minimum quickly, but it fluctuates in a neighborhood around the minimum rather than settling exactly.
The single-sample gradient estimate is not accurate and introduces noise into each update. However, this noise can be beneficial:
Escaping Local Minima: The noise in SGD's gradient estimates helps it jump out of sharp, suboptimal local minima that full-batch gradient descent would firmly converge to, often leading to better generalizing flat minima.
Evading Saddle Points: In high-dimensional non-convex problems, saddle points are prevalent. While full-batch gradient descent stalls at these points, the stochastic noise in SGD constantly perturbs the parameters, pushing the optimization process forward.
