| CONTENTS | |
|
| SERIES FOREWORD | |
|
| PREFACE | |
|
| I | THE PROBLEM | |
|
| 1 | Introduction | |
|
| | 1.1 | Reinforcement Learning | |
|
| | 1.2 | Examples | |
|
| | 1.3 | Elements of Reinforcement Learning | |
|
| | 1.4 | An Extended Example: Tic-Tac-Toe | |
|
| | 1.5 | Summary | |
|
| | 1.6 | History of Reinforcement Learning | |
|
| | 1.7 | Bibliographical Remarks | |
|
| 2 | Evaluative Feedback | |
|
| | 2.1 | An n-Armed Bandit Problem | |
|
| | 2.2 | Action-Value Methods | |
|
| | 2.3 | Softmax Action Selection | |
|
| | 2.4 | Evaluation Versus Instruction | |
|
| | 2.5 | Incremental Implementation | |
|
| | 2.6 | Tracking a Nonstationary Problem | |
|
| | 2.7 | Optimistic Initial Values | |
|
| | 2.8 | Reinforcement Comparison | |
|
| | 2.9 | Pursuit Methods | |
|
| | 2.10 | Associative Search | |
|
| | 2.11 | Conclusions | |
|
| | 2.12 | Bibliographical and Historical Remarks | |
|
| 3 | The Reinforcement Learning Problem | |
|
| | 3.1 | The Agent-Environment Interface | |
|
| | 3.2 | Goals and Rewards | |
|
| | 3.3 | Returns | |
|
| | 3.4 | Unified Notation for Episodic and Continuing Tasks | |
|
| | 3.5 | The Markov Property | |
|
| | 3.6 | Markov Decision Processes | |
|
| | 3.7 | Value Functions | |
|
| | 3.8 | Optimal Value Functions | |
|
| | 3.9 | Optimality and Approximation | |
|
| | 3.10 | Summary | |
|
| | 3.11 | Bibliographical and Historical Remarks | |
|
| II | ELEMENTARY SOLUTION METHODS | |
|
| 4 | Dynamic Programming | |
|
| | 4.1 | Policy Evaluation | |
|
| | 4.2 | Policy Improvement | |
|
| | 4.3 | Policy Iteration | |
|
| | 4.4 | Value Iteration | |
|
| | 4.5 | Asynchronous Dynamic Programming | |
|
| | 4.6 | Generalized Policy Iteration | |
|
| | 4.7 | Efficiency of Dynamic Programming | |
|
| | 4.8 | Summary | |
|
| | 4.9 | Bibliographical and Historical Remarks | |
|
| 5 | Monte Carlo Methods | |
|
| | 5.1 | Monte Carlo Policy Evaluation | |
|
| | 5.2 | Monte Carlo Estimation of Action Values | |
|
| | 5.3 | Monte Carlo Control | |
|
| | 5.4 | On-Policy Monte Carlo Control | |
|
| | 5.5 | Evaluating One Policy While Following Another | |
|
| | 5.6 | Off-Policy Monte Carlo Control | |
|
| | 5.7 | Incremental Implementation | |
|
| | 5.8 | Summary | |
|
| | 5.9 | Bibliographical and Historical Remarks | |
|
| 6 | Temporal-Difference Learning | |
|
| | 6.1 | TD Prediction | |
|
| | 6.2 | Advantages of TD Prediction Methods | |
|
| | 6.3 | Optimality of TD(O) | |
|
| | 6.4 | Sarsa: On-Policy TD Control | |
|
| | 6.5 | Q-Learning: Off-Policy TD Control | |
|
| | 6.6 | Actor-Critic Methods | |
|
| | 6.7 | R-Learning for Undiscounted Continuing Tasks | |
|
| | 6.8 | Games, Afterstates, and Other Special Cases | |
|
| | 6.9 | Summary | |
|
| | 6.10 | Bibliographical and Historical Remarks | |
|
| III | A UNIFIED VIEW | |
|
| 7 | Eligibility Traces | |
|
| | 7.1 | n-Step TD Prediction | |
|
| | 7.2 | The Forward View of TD(l) | |
|
| | 7.3 | The Backward View of TD(l) | |
|
| | 7.4 | Equivalence of Forward and Backward Views | |
|
| | 7.5 | Sarsa(l) | |
|
| | 7.6 | Q(l) | |
|
| | 7.7 | Eligibility Traces for Actor-Critic Methods | |
|
| | 7.8 | Replacing Traces | |
|
| | 7.9 | Implementation Issues | |
|
| | 7.10 | Variable l | |
|
| | 7.11 | Conclusions | |
|
| | 7.12 | Bibliographical and Historical Remarks | |
|
| 8 | Generalization and Function Approximation | |
|
| | 8.1 | Value Prediction with Function Approximation | |
|
| | 8.2 | Gradient-Descent Methods | |
|
| | 8.3 | Linear Methods | |
|
| | 8.4 | Control with Function Approximation | |
|
| | 8.5 | Off-Policy Bootstrapping | |
|
| | 8.6 | Should We Bootstrap? | |
|
| | 8.7 | Summary | |
|
| | 8.8 | Bibliographical and Historical Remarks | |
|
| 9 | Planning and Learning | |
|
| | 9.1 | Models and Planning | |
|
| | 9.2 | Integrating Planning, Acting, and Learning | |
|
| | 9.3 | When the Model Is Wrong | |
|
| | 9.4 | Prioritized Sweeping | |
|
| | 9.5 | Full vs. Sample Backups | |
|
| | 9.6 | Trajectory Sampling | |
|
| | 9.7 | Heuristic Search | |
|
| | 9.8 | Summary | |
|
| | 9.9 | Bibliographical and Historical Remarks | |
|
| 10 | Dimensions of Reinforcement Learning | |
|
| | 10.1 | The Unified View | |
|
| | 10.2 | Other Frontier Dimensions | |
|
| 11 | Case Studies | |
|
| | 11.1 | TD-Gammon | |
|
| | 11.2 | Samuel's Checkers Player | |
|
| | 11.3 | The Acrobot | |
|
| | 11.4 | Elevator Dispatching | |
|
| | 11.5 | Dynamic Channel Allocation | |
|
| | 11.6 | Job-Shop Scheduling | |
|
| REFERENCES | |
|
| SUMMARY OF NOTATION | |
|
| INDEX | |
|