- Buying Options

**Hardcover**| Out of Print | 584 pp. | 8 x 9 in | 172 figures, 10 tables | December 2009 | ISBN: 9780262012430

**eBook**|

**$60.00 Short**| December 2009 | ISBN: 9780262315548

**Mouseover for Online Attention Data**

## Essential Info

- Table of Contents
- Author Site with Instructional Material

## Table of Contents

Series Foreword |
xvii | ||||

Figures |
xix | ||||

Tables |
xxix | ||||

Preface |
xxxi | ||||

Acknowledgments |
xxxiii | ||||

Notes for the Second Edition |
xxxv | ||||

Notations |
xxxix | ||||

Introduction |
1 | ||||

What Is Machine Learning? | 1 | ||||

Examples of Machine Learning Applications | 4 | ||||

Learning Associations | 4 | ||||

Classification | 5 | ||||

Regression | 9 | ||||

Unsupervised Learning | 11 | ||||

Reinforcement Learning | 13 | ||||

Notes | 14 | ||||

Relevant Resources | 16 | ||||

Exercises | 18 | ||||

References | 19 | ||||

Supervised Learning |
21 | ||||

Learning a Class from Examples | 21 | ||||

Vapnik-Chervonenkis (VC) Dimension | 27 | ||||

Probably Approximately Correct (PAC) Learning | 29 | ||||

Noise | 30 | ||||

Learning Multiple Classes | 32 | ||||

Regression | 34 | ||||

Model Selection and Generalization | 37 | ||||

Dimensions of a Supervised Machine Learning Algorithm | 41 | ||||

Notes | 42 | ||||

Exercises | 43 | ||||

References | 44 | ||||

Bayesian Decision Theory |
47 | ||||

Introduction | 47 | ||||

Classification | 49 | ||||

Losses and Risks | 51 | ||||

Discriminant Functions | 53 | ||||

Utility Theory | 54 | ||||

Association Rules | 55 | ||||

Notes | 58 | ||||

Exercises | 58 | ||||

References | 59 | ||||

Parametric Methods |
61 | ||||

Introduction | 61 | ||||

Maximum Likelihood Estimation | 62 | ||||

Bernoulli Density | 63 | ||||

Multinomial Density | 64 | ||||

Gaussian (Normal) Density | 64 | ||||

Evaluating an Estimator: Bias and Variance | 65 | ||||

The Bayes' Estimator | 66 | ||||

Parametric Classification | 69 | ||||

Regression | 73 | ||||

Tuning Model Complexity: Bias/Variance Dilemma | 76 | ||||

Model Selection Procedures | 80 | ||||

Notes | 84 | ||||

Exercises | 84 | ||||

References | 85 | ||||

Multivariate Methods |
87 | ||||

Multivariate Data | 87 | ||||

Parameter Estimation | 88 | ||||

Estimation of Missing Values | 89 | ||||

Multivariate Normal Distribution | 90 | ||||

Multivariate Classification | 94 | ||||

Tuning Complexity | 99 | ||||

Discrete Features | 102 | ||||

Multivariate Regression | 103 | ||||

Notes | 105 | ||||

Exercises | 106 | ||||

References | 107 | ||||

Dimensionality Reduction |
109 | ||||

Introduction | 109 | ||||

Subset Selection | 110 | ||||

Principal Components Analysis | 113 | ||||

Factor Analysis | 120 | ||||

Multidimensional Scaling | 125 | ||||

Linear Discriminant Analysis | 128 | ||||

Isomap | 133 | ||||

Locally Linear Embedding | 135 | ||||

Notes | 138 | ||||

Exercises | 139 | ||||

References | 140 | ||||

Clustering |
143 | ||||

Introduction | 143 | ||||

Mixture Densities | 144 | ||||

k-Means Clustering |
145 | ||||

Expectation-Maximization Algorithm | 149 | ||||

Mixtures of Latent Variable Models | 154 | ||||

Supervised Learning after Clustering | 155 | ||||

Hierarchical Clustering | 157 | ||||

Choosing the Number of Clusters | 158 | ||||

Notes | 160 | ||||

Exercises | 160 | ||||

References | 161 | ||||

Nonparametric Methods |
163 | ||||

Introduction | 163 | ||||

Nonparametric Density Estimation | 165 | ||||

Histogram Estimator | 165 | ||||

Kernel Estimator | 167 | ||||

k-Nearest Neighbor Estimator |
168 | ||||

Generalization to Multivariate Data | 170 | ||||

Nonparametric Classification | 171 | ||||

Condensed Nearest Neighbor | 172 | ||||

Nonparametric Regression: Smoothing Models | 174 | ||||

Running Mean Smoother | 175 | ||||

Kernel Smoother | 176 | ||||

Running Line Smoother | 177 | ||||

How to Choose the Smoothing Parameter | 178 | ||||

Notes | 180 | ||||

Exercises | 181 | ||||

References | 182 | ||||

Decision Trees |
185 | ||||

Introduction | 185 | ||||

Univariate Trees | 187 | ||||

Classification Trees | 188 | ||||

Regression Trees | 192 | ||||

Pruning | 194 | ||||

Rule Extraction from Trees | 197 | ||||

Learning Rules from Data | 198 | ||||

Multivariate Trees | 202 | ||||

Notes | 204 | ||||

Exercises | 207 | ||||

References | 207 | ||||

Linear Discrimination |
209 | ||||

Introduction | 209 | ||||

Generalizing the Linear Model | 211 | ||||

Geometry of the Linear Discriminant | 212 | ||||

Two Classes | 212 | ||||

Multiple Classes | 214 | ||||

Pairwise Separation | 216 | ||||

Parametric Discrimination Revisited | 217 | ||||

Gradient Descent | 218 | ||||

Logistic Discrimination | 220 | ||||

Two Classes | 220 | ||||

Multiple Classes | 224 | ||||

Discrimination by Regression | 228 | ||||

Notes | 230 | ||||

Exercises | 230 | ||||

References | 231 | ||||

Multilayer Perceptrons |
233 | ||||

Introduction | 233 | ||||

Understanding the Brain | 234 | ||||

Neural Networks as a Paradigm for Parallel Processing | 235 | ||||

The Perceptron | 237 | ||||

Training a Perceptron | 240 | ||||

Learning Boolean Functions | 243 | ||||

Multilayer Perceptrons | 245 | ||||

MLP as a Universal Approximator | 248 | ||||

Backpropagation Algorithm | 249 | ||||

Nonlinear Regression | 250 | ||||

Two-Class Discrimination | 252 | ||||

Multiclass Discrimination | 254 | ||||

Multiple Hidden Layers | 256 | ||||

Training Procedures | 256 | ||||

Improving Convergence | 256 | ||||

Overtraining | 257 | ||||

Structuring the Network | 258 | ||||

Hints | 261 | ||||

Tuning the Network Size | 263 | ||||

Bayesian View of Learning | 266 | ||||

Dimensionality Reduction | 267 | ||||

Learning Time | 270 | ||||

Time Delay Neural Networks | 270 | ||||

Recurrent Networks | 271 | ||||

Notes | 272 | ||||

Exercises | 274 | ||||

References | 275 | ||||

Local Models |
279 | ||||

Introduction | 279 | ||||

Competitive Learning | 280 | ||||

Online k-Means |
280 | ||||

Adaptive Resonance Theory | 285 | ||||

Self-Organizing Maps | 286 | ||||

Radial Basis Functions | 288 | ||||

Incorporating Rule-Based Knowledge | 294 | ||||

Normalized Basis Functions | 295 | ||||

Competitive Basis Functions | 297 | ||||

Learning Vector Quantization | 300 | ||||

Mixture of Experts | 300 | ||||

Cooperative Experts | 303 | ||||

Competitive Experts | 304 | ||||

Hierarchical Mixture of Experts | 304 | ||||

Notes | 305 | ||||

Exercises | 306 | ||||

References | 307 | ||||

Kernel Machines |
309 | ||||

Introduction | 309 | ||||

Optimal Separating Hyperplane | 311 | ||||

The Nonseparable Case: Soft Margin Hyperplane | 315 | ||||

ν-SVM |
318 | ||||

Kernel Trick | 319 | ||||

Vectorial Kernels | 321 | ||||

Defining Kernels | 324 | ||||

Multiple Kernel Learning | 325 | ||||

Multiclass Kernel Machines | 327 | ||||

Kernel Machines for Regression | 328 | ||||

One-Class Kernel Machines | 333 | ||||

Kernel Dimensionality Reduction | 335 | ||||

Notes | 337 | ||||

Exercises | 338 | ||||

References | 339 | ||||

Bayesian Estimation |
341 | ||||

Introduction | 341 | ||||

Estimating the Parameter of a Distribution | 343 | ||||

Discrete Variables | 343 | ||||

Continuous Variables | 345 | ||||

Bayesian Estimation of the Parameters of a Function | 348 | ||||

Regression | 348 | ||||

The Use of Basis/Kernel Functions | 352 | ||||

Bayesian Classification | 353 | ||||

Gaussian Processes | 356 | ||||

Notes | 359 | ||||

Exercises | 360 | ||||

References | 361 | ||||

Hidden Markov Models |
363 | ||||

Introduction | 363 | ||||

Discrete Markov Processes | 364 | ||||

Hidden Markov Models | 367 | ||||

Three Basic Problems of HMMs | 369 | ||||

Evaluation Problem | 369 | ||||

Finding the State Sequence | 373 | ||||

Learning Model Parameters | 375 | ||||

Continuous Observations | 378 | ||||

The HMM with Input | 379 | ||||

Model Selection in HMM | 380 | ||||

Notes | 382 | ||||

Exercises | 383 | ||||

References | 384 | ||||

Graphical Models |
387 | ||||

Introduction | 387 | ||||

Canonical Cases for Conditional Independence | 389 | ||||

Example Graphical Models | 396 | ||||

Naive Bayes' Classifier | 396 | ||||

Hidden Markov Model | 398 | ||||

Linear Regression | 401 | ||||

d-Separation | 402 | ||||

Belief Propagation | 402 | ||||

Chains | 403 | ||||

Trees | 405 | ||||

Polytrees | 407 | ||||

Junction Trees | 409 | ||||

Undirected Graphs: Markov Random Fields | 410 | ||||

Learning the Structure of a Graphical Model | 413 | ||||

Influence Diagrams | 414 | ||||

Notes | 414 | ||||

Exercises | 417 | ||||

References | 417 | ||||

Combining Multiple Learners |
419 | ||||

Rationale | 419 | ||||

Generating Diverse Learners | 420 | ||||

Model Combination Schemes | 423 | ||||

Voting | 424 | ||||

Error-Correcting Output Codes | 427 | ||||

Bagging | 430 | ||||

Boosting | 431 | ||||

Mixture of Experts Revisited | 434 | ||||

Stacked Generalization | 435 | ||||

Fine-Tuning an Ensemble | 437 | ||||

Cascading | 438 | ||||

Notes | 440 | ||||

Exercises | 442 | ||||

References | 443 | ||||

Reinforcement Learning |
447 | ||||

Introduction | 447 | ||||

Single State Case: K-Armed Bandit |
449 | ||||

Elements of Reinforcement Learning | 450 | ||||

Model-Based Learning | 453 | ||||

Value Iteration | 453 | ||||

Policy Iteration | 454 | ||||

Temporal Difference Learning | 454 | ||||

Exploration Strategies | 455 | ||||

Deterministic Rewards and Actions | 456 | ||||

Nondeterministic Rewards and Actions | 457 | ||||

Eligibility Traces | 459 | ||||

Generalization | 461 | ||||

Partially Observable States | 464 | ||||

The Setting | 464 | ||||

Example: The Tiger Problem | 465 | ||||

Notes | 470 | ||||

Exercises | 472 | ||||

References | 473 | ||||

Design and Analysis of Machine Learning Experiments |
475 | ||||

Introduction | 475 | ||||

Factors, Response, and Strategy of Experimentation | 478 | ||||

Response Surface Design | 481 | ||||

Randomization, Replication, and Blocking | 482 | ||||

Guidelines for Machine Learning Experiments | 483 | ||||

Cross-Validation and Resampling Methods | 486 | ||||

K-Fold Cross-Validation |
487 | ||||

5�2 Cross-Validation | 488 | ||||

Bootstrapping | 489 | ||||

Measuring Classifier Performance | 489 | ||||

Interval Estimation | 493 | ||||

Hypothesis Testing | 496 | ||||

Assessing a Classification Algorithm's Performance | 498 | ||||

Binomial Test | 499 | ||||

Approximate Normal Test | 500 | ||||

t Test |
500 | ||||

Comparing Two Classification Algorithms | 501 | ||||

McNemar's Test | 501 | ||||

K-Fold Cross-Validated Paired t Test |
501 | ||||

5 � 2 cv Paired t Test |
502 | ||||

5 � 2 cv Paired F Test |
503 | ||||

Comparing Multiple Algorithms: Analysis of Variance | 504 | ||||

Comparison over Multiple Datasets | 508 | ||||

Comparing Two Algorithms | 509 | ||||

Multiple Algorithms | 511 | ||||

Notes | 512 | ||||

Exercises | 513 | ||||

References | 514 | ||||

Probability |
517 | ||||

Elements of Probability | 517 | ||||

Axioms of Probability | 518 | ||||

Conditional Probability | 518 | ||||

Random Variables | 519 | ||||

Probability Distribution and Density Functions | 519 | ||||

Joint Distribution and Density Functions | 520 | ||||

Conditional Distributions | 520 | ||||

Bayes' Rule | 521 | ||||

Expectation | 521 | ||||

Variance | 522 | ||||

Weak Law of Large Numbers | 523 | ||||

Special Random Variables | 523 | ||||

Bernoulli Distribution | 523 | ||||

Binomial Distribution | 524 | ||||

Multinomial Distribution | 524 | ||||

Uniform Distribution | 524 | ||||

Normal (Gaussian) Distribution | 525 | ||||

Chi-Square Distribution | 526 | ||||

t Distribution |
527 | ||||

F Distribution |
527 | ||||

References | 527 | ||||

Index |
529 |

## Also by this Author

## Introduction to Machine Learning, Second Edition

## Overview

The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data.

The second edition of *Introduction to Machine Learning* is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. In order to present a unified treatment of machine learning problems and solutions, it discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The text covers such topics as supervised learning, Bayesian decision theory, parametric methods, multivariate methods, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, and reinforcement learning.

New to the second edition are chapters on kernel machines, graphical models, and Bayesian estimation; expanded coverage of statistical tests in a chapter on design and analysis of machine learning experiments; case studies available on the Web (with downloadable results for instructors); and many additional exercises. All chapters have been revised and updated. *Introduction to Machine Learning* can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods.

## About the Author

Ethem Alpaydin is Professor in the Department of Computer Engineering at Bogaziçi University, Istanbul. He is the author of the widely used textbook *Introduction to Machine Learning*, now in its third edition (MIT Press).

## Reviews

*Computing Reviews*

## Endorsements

—

**Joaquin Qui**