SERIES FOREWORD

PREFACE
I  THE PROBLEM

1  Introduction
   1.1  Reinforcement Learning
   1.2  Examples
   1.3  Elements of Reinforcement Learning
   1.4  An Extended Example: Tic-Tac-Toe
   1.5  Summary
   1.6  History of Reinforcement Learning
   1.7  Bibliographical Remarks
2  Evaluative Feedback
   2.1  An n-Armed Bandit Problem
   2.2  Action-Value Methods
   2.3  Softmax Action Selection
   2.4  Evaluation Versus Instruction
   2.5  Incremental Implementation
   2.6  Tracking a Nonstationary Problem
   2.7  Optimistic Initial Values
   2.8  Reinforcement Comparison
   2.9  Pursuit Methods
   2.10 Associative Search
   2.11 Conclusions
   2.12 Bibliographical and Historical Remarks
3  The Reinforcement Learning Problem
   3.1  The Agent-Environment Interface
   3.2  Goals and Rewards
   3.3  Returns
   3.4  Unified Notation for Episodic and Continuing Tasks
   3.5  The Markov Property
   3.6  Markov Decision Processes
   3.7  Value Functions
   3.8  Optimal Value Functions
   3.9  Optimality and Approximation
   3.10 Summary
   3.11 Bibliographical and Historical Remarks
II  ELEMENTARY SOLUTION METHODS
4  Dynamic Programming
   4.1  Policy Evaluation
   4.2  Policy Improvement
   4.3  Policy Iteration
   4.4  Value Iteration
   4.5  Asynchronous Dynamic Programming
   4.6  Generalized Policy Iteration
   4.7  Efficiency of Dynamic Programming
   4.8  Summary
   4.9  Bibliographical and Historical Remarks
5  Monte Carlo Methods
   5.1  Monte Carlo Policy Evaluation
   5.2  Monte Carlo Estimation of Action Values
   5.3  Monte Carlo Control
   5.4  On-Policy Monte Carlo Control
   5.5  Evaluating One Policy While Following Another
   5.6  Off-Policy Monte Carlo Control
   5.7  Incremental Implementation
   5.8  Summary
   5.9  Bibliographical and Historical Remarks
6  Temporal-Difference Learning
   6.1  TD Prediction
   6.2  Advantages of TD Prediction Methods
   6.3  Optimality of TD(0)
   6.4  Sarsa: On-Policy TD Control
   6.5  Q-Learning: Off-Policy TD Control
   6.6  Actor-Critic Methods
   6.7  R-Learning for Undiscounted Continuing Tasks
   6.8  Games, Afterstates, and Other Special Cases
   6.9  Summary
   6.10 Bibliographical and Historical Remarks
III  A UNIFIED VIEW
7  Eligibility Traces
   7.1  n-Step TD Prediction
   7.2  The Forward View of TD(λ)
   7.3  The Backward View of TD(λ)
   7.4  Equivalence of Forward and Backward Views
   7.5  Sarsa(λ)
   7.6  Q(λ)
   7.7  Eligibility Traces for Actor-Critic Methods
   7.8  Replacing Traces
   7.9  Implementation Issues
   7.10 Variable λ
   7.11 Conclusions
   7.12 Bibliographical and Historical Remarks
8  Generalization and Function Approximation
   8.1  Value Prediction with Function Approximation
   8.2  Gradient-Descent Methods
   8.3  Linear Methods
   8.4  Control with Function Approximation
   8.5  Off-Policy Bootstrapping
   8.6  Should We Bootstrap?
   8.7  Summary
   8.8  Bibliographical and Historical Remarks
9  Planning and Learning
   9.1  Models and Planning
   9.2  Integrating Planning, Acting, and Learning
   9.3  When the Model Is Wrong
   9.4  Prioritized Sweeping
   9.5  Full vs. Sample Backups
   9.6  Trajectory Sampling
   9.7  Heuristic Search
   9.8  Summary
   9.9  Bibliographical and Historical Remarks
10 Dimensions of Reinforcement Learning
   10.1 The Unified View
   10.2 Other Frontier Dimensions
11 Case Studies
   11.1 TD-Gammon
   11.2 Samuel's Checkers Player
   11.3 The Acrobot
   11.4 Elevator Dispatching
   11.5 Dynamic Channel Allocation
   11.6 Job-Shop Scheduling
REFERENCES

SUMMARY OF NOTATION

INDEX