# Reinforcement Learning: An Introduction
[![Build Status](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction.svg?branch=master)](https://travis-ci.org/ShangtongZhang/reinforcement-learning-an-introduction)
Python code for Sutton & Barto's book [*Reinforcement Learning: An Introduction (2nd Edition)*](http://incompleteideas.net/book/the-book-2nd.html)
> If anything in the code is confusing or you want to report a bug, please open an issue instead of emailing me directly.
# Contents
> Click to view the sample output
### Chapter 1
1. Tic-Tac-Toe
### Chapter 2
1. [Figure 2.1: An exemplary bandit problem from the 10-armed testbed](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_1)
2. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_2)
3. [Figure 2.3: Optimistic initial action-value estimates](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_3)
4. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_4)
5. [Figure 2.5: Average performance of the gradient bandit algorithm](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_5)
6. [Figure 2.6: A parameter study of the various bandit algorithms](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#2_6)
### Chapter 3
1. [Figure 3.5: Grid example with random policy](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#3_5)
2. [Figure 3.8: Optimal solutions to the gridworld example](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#3_8)
### Chapter 4
1. [Figure 4.1: Convergence of iterative policy evaluation on a small gridworld](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#4_1)
2. [Figure 4.2: Jack’s car rental problem](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#4_2)
3. [Figure 4.3: The solution to the gambler’s problem](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#4_3)
### Chapter 5
1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#5_1)
2. [Figure 5.3: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#5_3)
3. [Figure 5.4: Weighted importance sampling](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#5_4)
4. [Figure 5.5: Ordinary importance sampling with surprisingly unstable estimates](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#5_5)
### Chapter 6
1. [Figure 6.2: Random walk](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_2)
2. [Figure 6.3: Batch updating](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_3)
3. [Figure 6.4: Sarsa applied to windy grid world](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_4)
4. [Figure 6.5: The cliff-walking task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_5)
5. [Figure 6.7: Interim and asymptotic performance of TD control methods](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_7)
6. [Figure 6.8: Comparison of Q-learning and Double Q-learning](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#6_8)
### Chapter 7
1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#7_2)
### Chapter 8
1. [Figure 8.3: Average learning curves for Dyna-Q agents varying in their number of planning steps](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#8_3)
2. [Figure 8.5: Average performance of Dyna agents on a blocking task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#8_5)
3. [Figure 8.6: Average performance of Dyna agents on a shortcut task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#8_6)
4. [Figure 8.7: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#8_7)
### Chapter 9
1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#9_1)
2. [Figure 9.2: Semi-gradient n-step TD algorithm on the 1000-state random walk task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#9_2)
3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#9_5)
4. [Figure 9.8: Example of feature width’s effect on initial generalization and asymptotic accuracy](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#9_8)
5. [Figure 9.10: Single tiling and multiple tilings on the 1000-state random walk task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#9_10)
### Chapter 10
1. [Figure 10.1: The cost-to-go function for Mountain Car task in one run](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#10_1)
2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#10_2)
3. [Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#10_3)
4. [Figure 10.4: Effect of alpha and n on early performance of n-step semi-gradient Sarsa](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#10_4)
5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#10_5)
### Chapter 11
1. [Figure 11.2: Baird's Counterexample](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#11_2)
2. [Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#11_6)
3. [Figure 11.7: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#11_7)
### Chapter 12
1. [Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#12_3)
2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#12_6)
3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#12_8)
4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#12_10)
5. [Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https://shangtongzhang.github.io/reinforcement-learning-an-introduction/#12_11)
# Environment
* Python 2 or Python 3
* Numpy
* Matplotlib
* Six
* Seaborn
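
The dependencies can typically be installed with pip. A minimal sketch, assuming the packages are available under their usual PyPI names:

```commandline
pip install numpy matplotlib six seaborn
```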
# Usage
```commandline
git clone https://github.com/ShangtongZhang/reinforcement-learning-an-introduction.git
cd reinforcement-learning-an-introduction/chapterXX
python XXX.py
```
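
For example, to reproduce a figure from Chapter 2, change into the corresponding chapter directory and run its script. The script name below is only illustrative; use the actual file name found in that directory:

```commandline
cd reinforcement-learning-an-introduction/chapter02
python TenArmedTestbed.py   # hypothetical name; run the script that actually exists in the chapter directory
```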
# Contribution
This project contains almost all of the programmable figures in the book. However, when I completed this project, the book was still in draft and some chapters were incomplete. Furthermore, due to the limited computational capacity of my machine, I could only use a limited number of runs and episodes for some experiments, so some sample outputs are noisier than the corresponding figures in the book. Any contribution to improve the code or the sample outputs is welcome.