In Chapter 8, we introduced value function approximation, that is, replacing the tabular representation of state/action values with a function. Similarly, in Chapter 9 we used a function to represent the policy instead of a table, turning to policy-based methods. In this chapter we combine the two: representing both the value and the policy with functions, and incorporating both value-based and policy-based methods.

In all previous chapters, the methods were value-based. The difference between value-based and policy-based methods lies in their approach. Value-based methods generate policies implicitly and indirectly: the algorithm itself does not maintain a policy function. Instead, it solves for the values (state or action values) by model-free or model-based methods, and determines the action greedily (or $\epsilon$-greedily) by maximizing the value function at each step (e.g., $a^* = \arg\max_a q(s, a)$), thereby deriving the policy from the values. In contrast, policy-based methods directly represent the policy as a parameterized function $\pi(a \mid s, \theta)$, where $\theta$ is a parameter vector (instead of the previous tabular representation). The probability distribution of the policy is obtained by directly optimizing the parameter $\theta$.
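The contrast can be sketched in a few lines of code; the table sizes, the random q-values, and the softmax parameterization below are illustrative assumptions rather than a specific algorithm from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Value-based: no explicit policy is stored. The action comes from
# maximizing an action-value table q(s, a) (filled randomly here).
q = rng.normal(size=(n_states, n_actions))
greedy_action = int(np.argmax(q[0]))  # a* = argmax_a q(s=0, a)

# Policy-based: the policy pi(a | s, theta) is an explicit parameterized
# function of theta; here, a softmax over linear scores theta[s, a].
theta = np.zeros((n_states, n_actions))

def pi(s, theta):
    scores = theta[s]
    shifted = np.exp(scores - scores.max())  # numerically stable softmax
    return shifted / shifted.sum()

probs = pi(0, theta)  # uniform over actions at initialization
```

In the first case the policy only changes when the values change; in the second, gradient steps on $\theta$ change the action distribution directly.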
In this chapter we move from the previous tabular representation of state/action values to a function representation. That is to say, we use a function to fit the true state/action value function. Such a function can be predefined, e.g., a linear function, or a neural network.
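To make "fitting the value function" concrete, here is a minimal sketch that fits a predefined linear function of features to hypothetical target values by stochastic gradient descent; the feature map, the targets $v(s) = 2 + 3s$, and the step size are all assumptions for illustration:

```python
import numpy as np

# Linear value approximation: v_hat(s, w) = phi(s)^T w.
def phi(s):
    # polynomial features of a scalar state s (an illustrative choice)
    return np.array([1.0, s, s**2])

w = np.zeros(3)
alpha = 0.1  # step size

# Minimize (v_target - v_hat(s, w))^2 / 2 by stochastic gradient descent,
# using hypothetical targets v(s) = 2 + 3s for demonstration.
rng = np.random.default_rng(0)
for _ in range(2000):
    s = rng.uniform(-1, 1)
    target = 2 + 3 * s
    w += alpha * (target - phi(s) @ w) * phi(s)
```

Since the target happens to lie in the span of the features, `w` converges to roughly `[2, 3, 0]`; with a neural network, only the form of `phi` and the gradient computation would change.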
The reasons why we move from a tabular representation to a function-based one are:
- Storage efficiency: when the state space is large or continuous, storing one entry per state (or state-action pair) is infeasible, whereas a function only needs to store a small number of parameters.
- Generalization: updating the parameters based on one state also improves the value estimates of similar, possibly unvisited states, which a table cannot do.
This blog post is a brief summary of the book MODERN ROBOTICS: MECHANICS, PLANNING, AND CONTROL by Kevin M. Lynch and Frank C. Park. It covers only the core material; details are omitted.
Chapter 1: Introduction (Preliminary)
A robot is essentially a system composed of rigid bodies.
- Links: the rigid bodies in a robot system.
- Joints: components that connect adjacent links and allow relative motion between them.
Chapter 2: Configuration Space
2.1 Basic Concepts
- Configuration: a set of parameters specifying the position and orientation of every point on the robot.
- Configuration space (C-space): the set of all possible configurations.
- Degrees of freedom (dof): the dimension of the C-space, i.e., the minimum number of real-valued parameters needed to represent the robot's configuration.
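The dof of a mechanism assembled from links and joints can be counted with Grubler's formula, which this chapter of the book develops; the four-bar linkage below is an illustrative example:

```python
def grubler_dof(m, N, joints):
    """Grubler's formula: dof = m * (N - 1 - J) + sum of joint freedoms.

    m      -- dof of a single rigid body (3 for planar, 6 for spatial)
    N      -- number of links, including the fixed ground link
    joints -- list of freedoms f_i, one entry per joint
    """
    J = len(joints)
    return m * (N - 1 - J) + sum(joints)

# Planar four-bar linkage: 4 links, 4 revolute joints (f = 1 each)
dof = grubler_dof(m=3, N=4, joints=[1, 1, 1, 1])  # -> 1
```

The formula holds only when the joint constraints are independent; otherwise it gives a lower bound on the dof.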
Understanding Data
Machine learning is the process of modeling data from an unknown distribution. Whichever school of machine learning one follows, all agree that observed data does not arise out of nothing; it is produced by an underlying, objectively existing data-generating process, which can be described by a probability distribution.
For example, when tossing a coin, the outcome is heads or tails. If we toss it $N$ times, we obtain $N$ data points, and this result can be viewed as generated (sampled) from a Bernoulli distribution.
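The generative view of the coin toss can be sketched directly; the bias p = 0.5 and the sample size N below are illustrative choices:

```python
import numpy as np

# Simulate the data-generating process: N draws from Bernoulli(p).
rng = np.random.default_rng(42)
p, N = 0.5, 10_000
data = rng.binomial(1, p, size=N)  # 1 = heads, 0 = tails

# The maximum-likelihood estimate of p is simply the sample mean,
# which concentrates around the true p as N grows.
p_hat = data.mean()
```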
Latent Variables
Here is an example (from 【隐变量(潜在变量)模型】硬核介绍):
Looking at the figure below, on the surface the observed data is just a collection of points $x$, but in fact we can intuitively see that these points are sampled, with certain probabilities, from four different distributions (assume all are Gaussian). The latent variable $z$ controls which distribution $x$ is sampled from: $x \mid z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$, where $k \in \{1, 2, 3, 4\}$. Suppose the component distributions are known. Thus the latent variable $z$ indicates the index of the class to which the observed variable $x$ belongs.
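The two-stage generative process (first draw $z$, then draw $x$ given $z$) can be sketched as follows; the component means, the uniform mixture weights, and the unit covariances are illustrative assumptions:

```python
import numpy as np

# Mixture of four 2-D Gaussians: the latent class z is drawn first,
# then x is drawn from the z-th component.
rng = np.random.default_rng(0)
means = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
weights = np.array([0.25, 0.25, 0.25, 0.25])

def sample(n):
    z = rng.choice(4, size=n, p=weights)    # latent class index
    x = means[z] + rng.normal(size=(n, 2))  # x | z ~ N(mu_z, I)
    return x, z

x, z = sample(1000)
```

In real data only `x` is observed; `z` is hidden, which is exactly what makes it a latent variable.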
This blog post and its sequels will mainly cover diffusion models: first some background knowledge, then the diffusion model itself, and finally, hopefully, the application of Diffusion Policy to motion planning for robot arms.
In this section we will first introduce TD learning, which refers to a broad family of algorithms that can solve the Bellman equation of a given policy without a model. In this section, TD learning refers specifically to the classic algorithm for estimating state values; in the next section we will introduce other algorithms belonging to this broad family.
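As a taste of the classic algorithm, here is a minimal TD(0) sketch for estimating the state values of a fixed policy from samples alone; the deterministic two-state chain, rewards, and step size are illustrative assumptions:

```python
import numpy as np

# TD(0) update: v(s_t) <- v(s_t) + alpha * (r_{t+1} + gamma * v(s_{t+1}) - v(s_t))
gamma, alpha = 0.9, 0.05
v = np.zeros(2)

# Toy chain: from either state, the policy moves to the other state;
# reward +1 when entering state 1, and 0 otherwise.
for episode in range(5000):
    s = 0
    for _ in range(20):
        s_next = 1 - s
        r = 1.0 if s_next == 1 else 0.0
        v[s] += alpha * (r + gamma * v[s_next] - v[s])
        s = s_next
```

The estimates approach the Bellman fixed point $v(0) = 1 + \gamma v(1)$, $v(1) = \gamma v(0)$, i.e., about 5.26 and 4.74, using only sampled transitions, never the transition model itself.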
Stochastic Approximation (SA) refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. Compared with many other root-finding algorithms, such as gradient-based methods, SA is powerful in that it requires neither the expression of the objective function nor its derivative.
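A minimal Robbins-Monro-style sketch: the algorithm sees only noisy evaluations of a hypothetical function g (an illustrative choice with root at w = 1), never its expression or derivative:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_g(w):
    # Black box from the algorithm's point of view: a noisy
    # measurement of g(w); the true root is at w = 1.
    return np.tanh(w - 1.0) + rng.normal(scale=0.1)

w = 3.0
for k in range(1, 5001):
    a_k = 1.0 / k          # step sizes with sum a_k = inf, sum a_k^2 < inf
    w = w - a_k * noisy_g(w)
```

The diminishing step sizes average out the measurement noise while still letting the iterate travel to the root, which is the core idea behind the Robbins-Monro conditions.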
