Dynamic Planning Method Based on Time Delayed Q-Learning

Xia Zhuang

Civil Aviation Flight University of China, Guanghan, Sichuan

Received: Jul. 8th, 2017; accepted: Jul. 22nd, 2017; published: Jul. 25th, 2017

ABSTRACT

To address the slow convergence of existing robot dynamic planning methods in unknown environments, a robot planning method based on time delayed Q-Learning is proposed. First, the robot planning problem is modeled as a Markov decision process (MDP), which turns it into a problem that can be solved by reinforcement learning. Then, the objective function of dynamic planning is defined, and a planning algorithm based on time delayed Q-Learning is proposed. The Q value of every state-action pair is initialized to Rmax so that all state-action pairs are explored, and the number of Q-value updates is reduced to improve update efficiency. Simulation experiments show that the time delayed Q-Learning algorithm can accomplish path planning for a mobile robot and that, compared with the other methods, it converges better and faster, which makes it an effective robot planning method.

Keywords: Robot, Dynamic Planning, Time Delayed Q-Learning, Optimal Policy

1. Introduction

2. Background

2.1. MDP

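2.2. Q-Learning Algorithm

The Q-learning algorithm [12] [13], first proposed by Watkins et al. in 1989, is an off-policy learning algorithm and belongs to the family of temporal difference (TD) learning algorithms. In Q-learning, the behavior policy that generates samples and the policy being evaluated are not the same policy. The behavior policy is usually ε-greedy: the agent selects the optimal action with probability 1-ε and otherwise explores, choosing each action with probability ε/|m|, where m is the number of actions. The evaluation policy is the greedy policy. In Q-learning, the update of the action-value function can be expressed as: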

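$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$

where α is the learning rate, γ is the discount factor, and r_{t+1} is the reward received after taking action a_t in state s_t.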
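The two policies and the update step described in this section can be sketched as follows. This is a minimal tabular illustration and not the paper's implementation: the function names, the dictionary-based Q table, the action names, and the values `alpha=0.1` and `gamma=0.9` are assumptions chosen for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)      # uniform draw over the m actions
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One off-policy TD update; the target uses the greedy evaluation policy."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

# Hypothetical usage: states are integers, all Q values start at 0.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]
new_value = q_learning_update(Q, state=0, action="up", reward=1.0,
                              next_state=1, actions=actions)
```

The sample is generated by `epsilon_greedy`, while the target inside `q_learning_update` always takes the maximum over the next state's actions, which is what makes the method off-policy.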

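3. Robot Dynamic Planning Based on Time Delayed Q-Learning

3.1. MDP Model for Robot Planning

The MDP model must be built for the specific scenario, for example the robot planning scenario shown in Figure 1.

Figure 1. Robot planning experiment scene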
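To make the idea of a scenario-specific MDP concrete, the sketch below models a small grid world as states, actions, a transition function, and a reward function. Everything in it is a hypothetical stand-in: the 4x4 layout, the four moves, and the reward values (10 at the goal, -1 for hitting an obstacle or wall, -0.1 per step) are assumptions for illustration and are not taken from the scene in Figure 1.

```python
from typing import Dict, Tuple

State = Tuple[int, int]                      # (row, col) cell of the robot

# Hypothetical 4x4 layout: 'S' start, 'G' goal, '#' obstacle, '.' free cell.
GRID = ["S..#",
        "..#.",
        "....",
        "#..G"]
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state: State, action: str) -> Tuple[State, float, bool]:
    """Deterministic transition: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    in_bounds = 0 <= r < len(GRID) and 0 <= c < len(GRID[0])
    if not in_bounds or GRID[r][c] == "#":
        return state, -1.0, False            # blocked: stay put, assumed penalty
    if GRID[r][c] == "G":
        return (r, c), 10.0, True            # assumed goal reward, episode ends
    return (r, c), -0.1, False               # assumed small step cost
```

A learner only needs `step` and the action set to interact with this model, so the same interface could wrap a different map or reward design.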

(2)

(3)

3.2. Objective of Robot Planning

(4)

(5)

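3.3. Optimal Robot Planning Algorithm Based on Time Delayed Q-Learning

1) The Q values of all state-action pairs are initialized to Rmax, so that the agent explores as much as possible from the very beginning: every state-action pair that has not yet been explored holds the maximum Q value and is therefore more likely to be selected.

2) The Q value of a state-action pair is updated only after that pair has been visited h times.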
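The two ideas listed above, optimistic initialization to Rmax and deferring each update until h visits, can be sketched as follows. This is an illustrative reading rather than the paper's algorithm: the class name `DelayedQ`, the choice to set Q to the mean of the h accumulated targets, and the values `r_max=1.0`, `h=3`, `gamma=0.9` are all assumptions.

```python
from collections import defaultdict

class DelayedQ:
    """Tabular learner with optimistic initialization and delayed updates."""

    def __init__(self, actions, r_max=1.0, h=3, gamma=0.9):
        self.actions = list(actions)
        self.h = h                                # visits required per update
        self.gamma = gamma
        self.Q = defaultdict(lambda: r_max)       # every pair starts at Rmax
        self.target_sum = defaultdict(float)      # accumulated targets
        self.visits = defaultdict(int)            # visits since last update

    def act(self, state):
        # Greedy choice; untried pairs still hold the optimistic value Rmax.
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def observe(self, state, action, reward, next_state):
        """Record one transition; change Q only on every h-th visit."""
        key = (state, action)
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        self.target_sum[key] += reward + self.gamma * best_next
        self.visits[key] += 1
        if self.visits[key] < self.h:
            return False                          # update deferred
        # Assumed rule: replace Q by the mean of the h collected targets.
        self.Q[key] = self.target_sum[key] / self.h
        self.target_sum[key] = 0.0
        self.visits[key] = 0
        return True

# Hypothetical usage: three zero-reward visits to the pair ("s0", "left").
agent = DelayedQ(["left", "right"], r_max=1.0, h=3, gamma=0.9)
updated = [agent.observe("s0", "left", 0.0, "s1") for _ in range(3)]
next_choice = agent.act("s0")
```

In this run only the third visit changes the Q value, and the greedy choice then moves to the untried action because it still holds Rmax, which is how optimistic initialization drives exploration without a separate exploration rate.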

(6)

If then
    ;
End If
If then
    If then
    End If
End If
If then
End If
If then
Else if then
End If
If
Else
End If

(7)

4. Simulation Experiments

Figure 2. Optimal path planned by the robot

Figure 3. Comparison of the convergence performance of the three methods

5. Conclusion

