Dynamic Planning Method Based on Time Delayed Q-Learning

Computer Science and Application
Vol.07 No.07(2017), Article ID:21468,7 pages
10.12677/CSA.2017.77078


Xia Zhuang

Civil Aviation Flight University of China, Guanghan, Sichuan

Received: Jul. 8th, 2017; accepted: Jul. 22nd, 2017; published: Jul. 25th, 2017

ABSTRACT

Aiming at the slow convergence of existing robot dynamic planning methods in unknown environments, a robot planning method based on time delayed Q-learning is proposed. First, the robot planning task is modeled as an MDP and thereby transformed into a problem solvable by reinforcement learning. Then, the goal function of dynamic planning is defined, and a planning algorithm based on time delayed Q-learning is proposed. The Q value of every state-action pair is initialized to Rmax so that all state-action pairs are explored, and the number of Q-value updates is reduced to improve updating efficiency. Simulation experiments show that the time delayed Q-learning algorithm achieves path planning for a mobile robot and, compared with other methods, converges reliably and markedly faster; it is therefore an effective robot planning method.

Keywords: Robot, Dynamic Planning, Time Delayed Q-Learning, Optimal Policy

1. Introduction

2. Background

2.1. MDP

2.2. Q-Learning Algorithm

The Q-learning algorithm [12] [13], first proposed by Watkins in 1989, is an off-policy learning algorithm belonging to the family of temporal-difference (TD) learning methods. In Q-learning, the behavior policy that generates samples and the policy being evaluated are not the same policy. The behavior policy is typically ε-greedy: the agent selects the currently optimal action with probability 1−ε and each of the other actions with probability ε/|m|, where |m| is the number of actions. The evaluation policy is the greedy policy. In Q-learning, the update of the action-value function can be written as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad (1)$$
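The ε-greedy behavior policy and the tabular update of Equation (1) can be sketched in a few lines of Python; the function names and the dictionary-of-dictionaries Q-table layout are illustrative choices, not from the paper:

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Behavior policy: with probability epsilon pick a uniformly random
    action, otherwise the greedy one. q_row maps action -> Q value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, Equation (1):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[s_next].values())
    q[s][a] += alpha * (r + gamma * best_next - q[s][a])
```

Note that the evaluation (greedy) policy appears only inside the `max` of the update target, while ε-greedy drives the sampling, which is what makes the method off-policy.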

3. Robot Dynamic Planning Based on Time Delayed Q-Learning

3.1. MDP Model of Robot Planning

The MDP model must be built for the specific scenario at hand; consider, for example, the robot planning scene shown in Figure 1.

Figure 1. Robot planning experiment scene

(2)

(3)
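As one way to make the MDP concrete, the scene of Figure 1 can be modeled as a grid world with deterministic transitions; the grid size, obstacle cells, and reward values below are illustrative assumptions, not values taken from the paper:

```python
# Minimal grid-world MDP sketch for a robot planning scene.
GRID_W, GRID_H = 5, 5
OBSTACLES = {(1, 2), (3, 1)}   # hypothetical blocked cells
GOAL = (4, 4)                  # hypothetical goal cell
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def step(state, action):
    """Deterministic transition function: move one cell unless blocked
    by a wall or an obstacle; returns (next_state, reward, done)."""
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    nxt = (x, y)
    if not (0 <= x < GRID_W and 0 <= y < GRID_H) or nxt in OBSTACLES:
        nxt = state                 # bump: the robot stays in place
    if nxt == GOAL:
        return nxt, 1.0, True       # positive reward at the goal
    return nxt, -0.01, False        # small per-step cost elsewhere
```

The state set, action set, transition function, and reward function above together instantiate the MDP tuple that the planning algorithm operates on.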

3.2. Goal of Robot Planning

(4)

(5)

3.3. Optimal Robot Planning Algorithm Based on Time Delayed Q-Learning

1) The Q value of every state-action pair is initialized to Rmax, so that the agent explores as much as possible from the start: since every unexplored state-action pair carries the maximal initial Q value, unexplored pairs have a higher probability of being selected and explored.

2) The Q value of a state-action pair is updated only after that pair has been visited h times.

(6)


(7)
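A minimal Python sketch of the two ideas above, optimistic Rmax initialization and updates deferred until h visits, is given below; the averaging update rule, the class layout, and the parameter names (r_max, gamma, h) are illustrative assumptions, not the paper's exact pseudocode:

```python
from collections import defaultdict

class DelayedQLearner:
    """Sketch of time delayed Q-learning: every Q value starts at an
    optimistic maximum so unexplored pairs get tried, and a pair's Q
    value is committed only after h visits, cutting update frequency."""

    def __init__(self, actions, r_max=1.0, gamma=0.9, h=5):
        self.actions = actions
        self.gamma = gamma
        self.h = h
        # Optimistic initialization: unexplored pairs look maximally good.
        self.q = defaultdict(lambda: r_max / (1.0 - gamma))
        self.counts = defaultdict(int)        # visits since last update
        self.target_sum = defaultdict(float)  # accumulated TD targets

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def observe(self, state, action, reward, next_state):
        """Accumulate a target; commit an update only every h visits."""
        key = (state, action)
        target = reward + self.gamma * max(self.q[(next_state, a)]
                                           for a in self.actions)
        self.counts[key] += 1
        self.target_sum[key] += target
        if self.counts[key] >= self.h:
            # Replace Q with the average of the last h targets.
            self.q[key] = self.target_sum[key] / self.counts[key]
            self.counts[key] = 0
            self.target_sum[key] = 0.0
```

Deferring each update until h samples have accumulated is what reduces the number of Q-value updates relative to standard Q-learning, at the cost of the small delay that gives the method its name.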

4. Simulation Experiments

Figure 2. The optimal path planned by the robot

Figure 3. Convergence performance comparison of the three methods

5. Conclusion

Zhuang, X. (2017) Dynamic Planning Method Based on Time Delayed Q-Learning. Computer Science and Application, 7, 671-677. http://dx.doi.org/10.12677/CSA.2017.77078

1. Schaal, S. and Atkeson, C. (2010) Learning Control in Robotics. IEEE Robotics & Automation Magazine, 17, 20-29. https://doi.org/10.1109/MRA.2010.936957

2. Song, Y., Li, Y.B. and Li, C.H. (2012) Initialization of Reinforcement Learning for Mobile Robot Path Planning. Control Theory & Applications, 29, 1623-1628. (In Chinese)

3. Bu, Q., Wang, Z. and Tong, X. (2013) An Improved Genetic Algorithm for Searching for Pollution Sources. Water Science and Engineering, 6, 392-401.

4. Deng, Z.Y. and Chen, C.K. (2006) Mobile Robot Path Planning Based on Improved Genetic Algorithm. Journal of Chinese Computer Systems, 27, 1695-1699.

5. Liu, C.M., Li, Z.B., Zhen, H., et al. (2013) A Reactive Navigation Method of Mobile Robots Based on LSPI and Rolling Windows. Journal of Central South University (Science and Technology), 44, 970-977.

6. Er, M.J. and Zhou, Y. (2008) A Novel Framework for Automatic Generation of Fuzzy Neural Networks. Neurocomputing, 71, 584-591. https://doi.org/10.1016/j.neucom.2007.03.015

7. Zeng, M.R., Xu, X.Y., Luo, H. and Xu, Z.M. (2016) Research on Robot Path Planning with a Multi-Step Ant Colony Algorithm. Journal of Chinese Computer Systems, 37, 366-369. (In Chinese)

8. Qu, H., Huang, L.W. and Ke, X. (2015) Research on Robot Path Planning Based on an Improved Ant Colony Algorithm in Dynamic Environments. Journal of University of Electronic Science and Technology of China, 44, 260-265. (In Chinese)

9. Weng, L.G., Ji, Z.Z., Xia, M. and Wang, A. (2014) Robot Path Planning Based on an Improved Multi-Objective Particle Swarm Optimization Algorithm. Journal of System Simulation, 26, 2892-2898. (In Chinese)

10. Pan, G.B., Pan, F. and Liu, G.D. (2014) Mobile Robot Path Planning Based on an Improved Shuffled Frog Leaping Algorithm. Journal of Computer Applications, 34, 2850-2853. (In Chinese)

11. Wen, S.F. and Guo, G.Y. (2015) Mobile Robot Path Planning Based on an Improved Artificial Potential Field Method. Computer Engineering and Design, 36, 2818-2822. (In Chinese)

12. Watkins, C.J.C.H. and Dayan, P. (1992) Q-Learning. Machine Learning, 8, 279-292.

13. Palanisamy, M., Modares, H., Lewis, F.L., et al. (2015) Continuous-Time Q-Learning for Infinite-Horizon Discounted Cost Linear Quadratic Regulator Problems. IEEE Transactions on Cybernetics, 45, 165-176. https://doi.org/10.1109/TCYB.2014.2322116