Learning Outcomes
This article explains supervised learning and overfitting control in a CFA Level 2 context, including:
- understanding the role of labeled data, the distinction between regression and classification tasks, and how targets guide model training and prediction;
- describing the intuition, structure, and typical finance uses of key supervised algorithms, including penalized regression, SVM, KNN, CART, ensemble learning, and random forest;
- defining overfitting and generalization, and recognizing common symptoms of an overfit model in narrative, tabular, and graphical exam exhibits;
- applying methods to evaluate model performance, such as train–validation–test splits, cross-validation, and appropriate performance metrics for classification and regression problems;
- comparing the strengths, weaknesses, and data requirements of different algorithms when applied to noisy, high-dimensional financial datasets;
- selecting a suitable algorithm and complexity level for exam-style scenarios in credit risk, equity return forecasting, and fraud detection;
- interpreting how regularization, pruning, early stopping, feature selection, and ensemble methods help mitigate overfitting and improve model robustness;
- evaluating whether reported training and test results indicate an appropriately simple, well-generalized model or a data-mined specification that is unlikely to perform well out-of-sample in practice.
CFA Level 2 Syllabus
For the CFA Level 2 exam, you are required to understand supervised learning fundamentals and overfitting control, with a focus on the following syllabus points:
- Describing supervised machine learning, related tasks, and algorithm types (e.g., regression, classification)
- Identifying and explaining overfitting and generalization in machine learning models
- Comparing core supervised algorithms: penalized regression, support vector machine (SVM), k-nearest neighbor (KNN), classification and regression tree (CART), ensemble learning, and random forest
- Explaining methods to control overfitting, including regularization, pruning, cross-validation, and appropriate model complexity
- Choosing suitable algorithms and validation approaches for financial prediction problems and interpreting their performance
Test Your Knowledge
Attempt these questions before reading this article. If you find some difficult or cannot remember the answers, remember to look more closely at that area during your revision.
A buy-side quantitative analyst is building supervised learning models to predict corporate bond default and assign rating categories. She has data on 2,000 historical bonds, including issuer financial ratios, market data, and whether each bond defaulted within three years. She splits the data into training and test sets and experiments with several algorithms.
On one specification, a very deep decision tree achieves 99% accuracy on the training sample but only 60% accuracy on the test sample. On another specification, a simpler logistic regression achieves 88% training accuracy and 84% test accuracy.
The analyst is considering adding more features (ratios and market variables) and using cross-validation to tune hyperparameters such as tree depth, the penalty parameter in logistic regression, and the number of trees in a random forest.
1. Which statement best explains the performance of the very deep decision tree relative to the logistic regression model?
a) The tree is underfitting the data, leading to low variance and high bias.
b) The tree is overfitting the training data, leading to high variance and poor generalization.
c) The logistic regression is overfitting because its training accuracy is below 90%.
d) Both models have similar generalization because they are trained on the same sample.
-
The analyst wants a more reliable estimate of out-of-sample performance and an objective way to select the optimal penalty parameter for logistic regression. Which procedure is most appropriate?
a) Fit the model once on all available data and select the penalty giving the highest in-sample R2R^2R2.- Use k-fold cross-validation on the training data to choose the penalty that minimizes validation error.
- Randomly adjust the penalty until training error converges to zero.
- Increase the number of features until training accuracy is at least 99%.
-
The analyst also wants to assign each bond to one of several rating categories (AAA, AA, A, BBB, BB, B, CCC). Which algorithm is most appropriate for this multiclass classification task based on historical labeled data?
a) Principal components analysis (PCA)- Classification and regression tree (CART)
- K-means clustering
- Simple linear regression
-
To reduce the risk of overfitting when moving from in-sample research to live credit risk monitoring, which combination of actions is most appropriate?
a) Add more features, increase tree depth, and report only training accuracy.- Use penalized regression or tree pruning, perform cross-validation, and reserve a test set for final evaluation.
- Remove the test set to increase the training sample size and estimate a more complex model.
- Replace supervised learning with clustering to avoid the need for labeled data.
Introduction
Supervised machine learning is a family of algorithms designed to uncover patterns between known inputs and labeled outputs. These methods are widely used in asset management, credit scoring, fraud detection, and other finance applications. For the CFA exam, you must distinguish supervised learning from other approaches, recognize core supervised algorithms, understand how to evaluate model fit, and apply techniques to reduce overfitting—a major risk in predictive modeling.
Key Term: supervised learning
A class of machine learning methods where the algorithm is trained on historical data with known input (feature) and output (target) pairs. The goal is to predict future output values for new data based on learned input–output relationships.
Supervised learning sits alongside unsupervised learning (no target variable; e.g., clustering) and deep or neural-network-based methods. The Level 2 focus is on “classical” supervised algorithms and how to make them generalize well to new financial data.
Key Term: target variable
The output or dependent variable the model is trying to predict. It can be continuous (e.g., return), categorical (e.g., default vs. non-default), or ordinal (e.g., rating classes).
Key Term: feature
An input or explanatory variable used by the model to predict the target, such as valuation ratios, leverage, or volatility measures.
In practice, you rarely care about fitting historical data perfectly. You care about generalization: performance on new, unseen data. Controlling overfitting is therefore central to any responsible use of supervised learning in finance.
Key Term: generalization
The ability of a machine learning model to make accurate predictions on new, unseen data, not just on the data it was trained with.
Supervised Learning: Principles and Tasks
Supervised learning relies on labeled data: each observation consists of features (inputs) and a known outcome (target). The model learns a mapping from features to target during training and then applies this mapping to new cases.
Two core supervised learning tasks are:
- Regression: Predicting a numeric value (e.g., next-period stock return, credit spread change, expected net income).
- Classification: Assigning cases to discrete categories (e.g., default vs. non-default, “fraud” vs. “non-fraud,” or rating buckets).
The target’s type determines whether a regression or classification algorithm is appropriate. For example:
- Predicting the one-year probability of default as a number between 0 and 1 can be framed as regression.
- Predicting default vs. no default is a binary classification problem.
- Assigning credit ratings (AAA–CCC) is an ordinal, multiclass classification problem.
Because supervised learning uses historical data, how you split that data is critical.
Key Term: training set
A subset of the data used to fit or estimate the model’s parameters.
Key Term: validation set
A subset of the data, separate from the training set, used during model development for tasks such as hyperparameter tuning and model comparison.
Key Term: test set
A final hold-out subset of the data used only once for unbiased evaluation of the chosen model’s out-of-sample performance.
Key Term: hyperparameter
A model setting chosen by the researcher (not estimated directly from the data) that controls aspects of model complexity or behavior, such as the penalty strength in LASSO, the number of neighbors in KNN, or the depth of a tree.
A typical supervised learning workflow in exam questions:
- Split the sample into training, validation, and test sets (or use cross-validation on the training data).
- Train several candidate models on the training set.
- Use validation performance to select the best model and tune hyperparameters.
- Evaluate the final model once on the test set to approximate true out-of-sample performance.
High in-sample accuracy with poor test performance is a red flag for overfitting.
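Although Level 2 does not require code, a minimal sketch can make this workflow concrete. The example below uses Python’s scikit-learn on synthetic data; the 60/20/20 split proportions and the logistic regression baseline are illustrative assumptions, not curriculum prescriptions.

```python
# Minimal sketch of a train/validation/test workflow on synthetic data.
# The 60/20/20 split and the logistic regression baseline are illustrative
# assumptions, not a prescribed CFA methodology.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic "bond default" data: 2,000 cases, 10 features, binary target.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# First carve out a 20% test set, then split the remainder 75/25 into
# training and validation (0.25 of 80% = 20% of the full sample).
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Train accuracy:     ", model.score(X_train, y_train))
print("Validation accuracy:", model.score(X_val, y_val))
# The test set is touched only once, after model selection is final.
print("Test accuracy:      ", model.score(X_test, y_test))
```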
Key Term: overfitting
The situation where a model learns noise or random fluctuations in training data, resulting in high accuracy in-sample but poor performance on new (test) data.
Supervised Learning Algorithms
Several supervised learning models are relevant for CFA Level 2. Each has strengths and is suited to particular application types and data structures.
At a high level, the CFA curriculum suggests the following selection logic:
- For numerical prediction (regression) with approximately linear relationships and many features, use penalized regression.
- For classification with relatively simple (approximately linear) boundaries, use SVM or KNN.
- For complex, nonlinear problems, especially with interactions among features, consider tree-based methods (CART, random forest) or neural networks (beyond Level 2 focus).
Below, we summarize the main algorithms you need to recognize.
Penalized Regression (including LASSO)
Penalized regression is a natural extension of linear regression that addresses overfitting by adding a penalty for complexity.
Key Term: penalized regression
A regression approach that introduces a penalty for including additional or large coefficients, discouraging model complexity to reduce overfitting (e.g., LASSO, Ridge regression).
The model minimizes a combination of fit and penalty. For LASSO (least absolute shrinkage and selection operator), the objective is:

$$\min_{b_1, \ldots, b_K} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{k=1}^{K} \left| b_k \right|$$

where:
- $y_i$ is the target for observation $i$,
- $\hat{y}_i$ is the model prediction,
- $b_k$ are the coefficients on each feature,
- $\lambda$ is a hyperparameter controlling penalty strength.
Key Term: LASSO
A penalized regression method that minimizes the sum of squared errors plus a penalty proportional to the absolute values of coefficients, shrinking some coefficients exactly to zero and performing automatic variable selection.
Key points:
- Larger $\lambda$ leads to stronger shrinkage and a simpler (more parsimonious) model.
- LASSO can set some coefficients to zero, effectively performing feature selection.
- It is well suited when you have many potentially correlated predictive variables (e.g., dozens of valuation and accounting ratios) and want a stable, interpretable model.
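To see the shrinkage effect, the sketch below fits scikit-learn’s Lasso (whose alpha argument plays the role of $\lambda$) at two penalty strengths on synthetic data; the alpha values and data dimensions are arbitrary illustrative choices.

```python
# Sketch: LASSO shrinkage on synthetic data with 30 candidate features,
# of which only 5 are truly informative. sklearn's `alpha` corresponds
# to the penalty strength (lambda) in the objective above; the values
# here are arbitrary illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable scales

for alpha in (0.1, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_kept = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>5}: {n_kept} of 30 coefficients nonzero")
# A larger penalty drives more coefficients exactly to zero,
# performing automatic feature selection.
```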
Typical finance uses:
- Building parsimonious factor models for expected returns.
- Selecting key financial ratios for bankruptcy prediction.
- Estimating stable covariance matrices for portfolio optimization (via regularization).
Support Vector Machine (SVM)
Key Term: support vector machine (SVM)
A supervised algorithm that finds the optimal boundary (hyperplane) to separate classes with maximum margin, often used for binary classification.
SVM is a linear classification method that draws a separating boundary between two classes (e.g., “default” vs. “non-default”) in a high-dimensional feature space.
- The separating hyperplane is chosen to maximize the margin—the distance between the hyperplane and the nearest observations from each class.
- Observations that lie on the margin are called support vectors and determine the boundary.
Key Term: hyperplane
In an $n$-dimensional feature space, a flat decision boundary of dimension $n-1$ that separates classes (e.g., a line in 2D, a plane in 3D).
Key Term: margin
The distance between the separating hyperplane and the closest observations in each class; SVM seeks to maximize this margin.
SVM can be extended to handle nonlinear boundaries (using kernel functions), but at Level 2 the emphasis is on recognizing it as a robust linear classifier.
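A minimal sketch of a linear SVM classifier, assuming scikit-learn and synthetic data; the C setting (inverse regularization strength) and feature count are illustrative assumptions.

```python
# Sketch: a linear SVM for a binary "default / no default" style task.
# Feature scaling matters because the margin is distance-based; C=1.0
# (inverse regularization strength) is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# kernel="linear" keeps the decision boundary a hyperplane, matching
# the Level 2 emphasis on SVM as a linear classifier.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```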
Finance applications:
- Classifying issuers as “likely to default” vs. “not likely to default.”
- Classifying text or news tone as positive vs. negative.
- Identifying “short candidates” vs. “non-shorts” based on financial and market features.
k-Nearest Neighbor (KNN)
Key Term: k-nearest neighbor (KNN)
A nonparametric algorithm that assigns a prediction to a new observation based on the majority class or average value among its closest neighbors in the feature space.
KNN’s steps:
- Choose a distance metric (e.g., Euclidean distance) and a value for $k$ (a hyperparameter).
- For a new observation, find the $k$ nearest points in the training data.
- For classification, assign the majority class among the $k$ neighbors; for regression, average their target values.
Strengths:
- Very flexible and nonparametric—no assumption of linearity.
- Works well when there is abundant data and a meaningful notion of “distance” between observations.
Weaknesses:
- Sensitive to irrelevant features and scaling; features usually must be standardized.
- Prediction can be slow for very large training sets.
- Choice of $k$ is important: too small a $k$ implies high variance; too large a $k$ can blur distinctions.
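The sketch below illustrates the scaling weakness empirically, assuming scikit-learn and synthetic data: one feature is inflated to dominate raw distances, and two illustrative $k$ values are compared with and without standardization.

```python
# Sketch: KNN with and without feature standardization. The k values
# are illustrative; small k raises variance, large k raises bias.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X[:, 0] *= 1000.0  # one feature on a much larger scale, e.g., market cap
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for k in (1, 15):
    raw = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scaled = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=k))
    scaled.fit(X_train, y_train)
    print(f"k={k:>2}  raw: {raw.score(X_test, y_test):.2f}"
          f"  standardized: {scaled.score(X_test, y_test):.2f}")
# Without standardization, the large-scale feature dominates the
# distance metric and can degrade accuracy.
```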
Finance applications:
- Bankruptcy prediction (nearest neighbors in financial ratio space).
- Assigning bonds into rating buckets based on similarity to historical bonds.
- Forming customized indices by finding stocks most similar to a prototype portfolio.
Classification and Regression Tree (CART)
Key Term: classification and regression tree (CART)
A decision tree model that recursively splits data using input features to predict discrete classes (classification) or continuous values (regression).
A tree:
- Starts at a root node with all observations.
- Splits the data based on conditions on features (e.g., leverage > 60%).
- Continues splitting until a stopping rule is met (e.g., minimum node size, maximum depth).
- Produces a set of terminal nodes (leaves), each with a predicted class or value.
Advantages:
- Handles nonlinear relationships and complex interactions among features.
- Easy to visualize and interpret (“if–then” rules).
- Can be used for both regression and classification.
Disadvantages:
- A single, deep tree tends to overfit (high variance).
- Small changes in data can lead to very different trees.
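A short sketch of the variance problem, assuming scikit-learn and noisy synthetic data: a depth-limited tree is compared with an unrestricted one. The depth values and noise level are illustrative.

```python
# Sketch: how tree depth drives the training/test accuracy gap.
# Depth values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, flip_y=0.10,
                           random_state=3)  # flip_y injects label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

for depth in (3, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=3)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}"
          f"  test={tree.score(X_test, y_test):.2f}")
# The unrestricted tree fits the training noise (train accuracy near
# 1.00) but generalizes worse, mirroring the deep-tree scenario above.
```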
Finance applications:
- Determining rules for “successful vs. unsuccessful” IPOs.
- Screening equities into “buy,” “hold,” and “sell” buckets.
- Identifying suspicious transactions in fraud detection.
Ensemble Learning and Random Forest
Key Term: ensemble learning
A method that combines results from multiple models (or the same model trained on different data or with different settings) to improve prediction accuracy and robustness.
Ensemble methods often outperform single “strong” learners because they average out idiosyncratic model errors.
Two common ensemble approaches for trees are:
- Bagging (bootstrap aggregation): train many trees on bootstrapped samples and average their predictions.
- Boosting: train trees sequentially, each focusing on errors of prior trees.
Key Term: random forest
An ensemble algorithm that builds many decision trees on random sub-samples of features and cases, then aggregates their predictions to reduce variance and improve generalization.
Random forests:
- Use bootstrapped samples of observations for each tree.
- At each split, consider only a random subset of features, which decorrelates trees and improves ensemble diversity.
- Typically perform very well out-of-sample with limited tuning.
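The sketch below, assuming scikit-learn and the same kind of noisy synthetic data as before, contrasts a single deep tree with a random forest; the n_estimators and max_features settings are illustrative assumptions.

```python
# Sketch: a single deep tree vs. a random forest on the same noisy data.
# n_estimators and max_features choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.10,
                           random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

tree = DecisionTreeClassifier(random_state=4).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=4).fit(X_train, y_train)

print("Single deep tree, test:", round(tree.score(X_test, y_test), 2))
print("Random forest,    test:", round(forest.score(X_test, y_test), 2))
# Bootstrapped samples plus random feature subsets at each split
# decorrelate the trees; averaging them reduces variance.
```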
Finance applications:
- Credit scoring and probability-of-default modeling with many borrower features.
- Equity return or alpha forecasting using large factor libraries.
- Forecasting earnings surprises or rating downgrades using fundamentals and market variables.
Key Term: cross-validation
A method that partitions data into multiple subsets; the model trains on some subsets and validates on others to estimate out-of-sample performance and tune hyperparameters.
Key Term: regularization
The technique in regression or classification methods of applying penalties for including extra features or large coefficients, encouraging simpler, more generalizable models.
Key Term: pruning
The process in tree-based algorithms of removing branches that do not provide significant predictive improvement, reducing complexity and overfitting.
Worked Example 1.1
A credit risk analyst needs to predict whether a new loan applicant will default. They have a dataset of historical applicant features and default status. Which model(s) would be most appropriate and why?
Answer:
Classification-focused supervised algorithms such as logistic regression, SVM, CART, or random forest are suitable because the target (default vs. non-default) is binary. Tree-based ensembles (random forest) are particularly attractive when the relationship between features and default is nonlinear and complex.
Worked Example 1.2
An equity analyst wants to forecast next-year earnings-per-share (EPS) growth for firms in a sector. The dataset includes 50 candidate predictors, many of which are correlated. The analyst is concerned about overfitting. Which algorithm is most appropriate?
Answer:
Penalized regression methods such as LASSO are well suited. They handle many correlated predictors, shrink unimportant coefficients toward zero, and yield a parsimonious model that is less likely to overfit than an unpenalized multiple regression using all 50 variables.
Overfitting and Its Control
Overfitting is a frequent issue in financial modeling, often caused by excessive model complexity, too many features, or not enough data. Overfit models perform well with training data but poorly on validation or test samples.
Signs of overfitting:
- Model accuracy or $R^2$ is much higher on training than on validation/test data.
- The model’s predictions change significantly when small amounts of new data are added.
- In a learning curve plot, training error is very low but validation error remains high and does not decrease as more data are used.
From the CFA curriculum, total prediction error can be decomposed as:

$$\text{total error} = \text{bias error} + \text{variance error} + \text{base error}$$
Key Term: bias error
In-sample error arising because the model is too simple or misspecified, failing to capture true patterns (underfitting).
Key Term: variance error
Out-of-sample error arising because the model is too complex and overfits noise, making predictions unstable across samples.
Key Term: base error
Residual error due to irreducible noise in the data; even the best possible model cannot eliminate this.
There is a bias–variance trade-off:
- Simple models (e.g., linear regression with few predictors) tend to have higher bias but lower variance.
- Very flexible models (e.g., deep trees, overly complex ensembles) tend to have low bias but high variance.
- The goal is an optimal complexity that minimizes total error (bias + variance + base).
Key Term: learning curve
A plot of model performance (e.g., accuracy or error rate) on training and validation/test sets as a function of training sample size, used to diagnose high bias or high variance.
A robust, well-generalizing model will show training and validation performance converging to a satisfactory level as the training sample grows.
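A minimal sketch of computing (not plotting) a learning curve with scikit-learn’s learning_curve utility on synthetic data; the model choice and grid of training sizes are illustrative assumptions.

```python
# Sketch: computing a learning curve to diagnose bias vs. variance.
# The model and the grid of training sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=10, random_state=5)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:>5}  train={tr:.2f}  validation={va:.2f}")
# Converging train/validation scores as n grows indicate a model that
# generalizes; a persistent gap signals high variance (overfitting).
```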
Methods to Reduce Overfitting
Controlling overfitting is essential for building robust models and for correctly interpreting exam exhibits. Common methods include:
- Cross-validation: Split data into multiple training/validation folds. Train on some folds and validate on the remaining fold; repeat across folds. Use average validation performance to choose model type and hyperparameters (e.g., $\lambda$ in LASSO, number of neighbors in KNN, tree depth in CART); see the sketch after this list.
- Complexity penalties (regularization): Penalize large or numerous coefficients (as in LASSO and ridge regression) to discourage overly complex models.
- Pruning (for trees): Grow a sufficiently large tree, then cut back branches that do not significantly improve validation performance. This yields a smaller, more stable tree.
- Early stopping: For iterative algorithms (e.g., gradient-boosted trees, some neural networks), stop training when validation error stops improving, even if training error continues to decline.
Key Term: early stopping
A technique where training is halted once validation performance ceases to improve, preventing the model from overfitting the training data.
- Feature selection and dimension reduction: Remove irrelevant, noisy, or redundant features; use domain knowledge or techniques like penalized regression to keep only predictors that add true predictive signal.
Key Term: feature selection
The process of choosing a subset of relevant features that contribute meaningfully to prediction, improving model interpretability and reducing overfitting risk.
- Ensemble methods: Use bagging, random forests, or other ensembles to average across models, which can substantially reduce variance without greatly increasing bias.
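As a hedged illustration of cross-validated tuning, the sketch below uses scikit-learn’s GridSearchCV to select the logistic penalty strength (C, the inverse of regularization strength) and a tree-pruning parameter (ccp_alpha, cost-complexity pruning); the parameter grids are arbitrary illustrative assumptions.

```python
# Sketch: k-fold cross-validation to tune hyperparameters.
# The parameter grids (C for the logistic penalty, ccp_alpha for
# cost-complexity pruning of a tree) are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

# Penalized logistic regression: smaller C = stronger regularization.
logit_cv = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
logit_cv.fit(X_train, y_train)

# Tree pruning: larger ccp_alpha prunes more branches.
tree_cv = GridSearchCV(DecisionTreeClassifier(random_state=6),
                       {"ccp_alpha": [0.0, 0.001, 0.01]}, cv=5)
tree_cv.fit(X_train, y_train)

print("Best C:        ", logit_cv.best_params_)
print("Best ccp_alpha:", tree_cv.best_params_)
# The held-out test set is evaluated only once, after tuning.
print("Logit test accuracy:", logit_cv.score(X_test, y_test))
print("Tree  test accuracy:", tree_cv.score(X_test, y_test))
```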
In addition to these techniques, proper data splitting is important:
- Use training data to fit models.
- Use validation data (or cross-validation) to compare models and tune hyperparameters.
- Use the test set only once, at the end, to evaluate the chosen model. Repeatedly reusing the test set to choose models effectively turns it into a second validation set and leads to optimistic performance estimates.
Worked Example 1.3
A hedge fund data scientist is developing a random forest model to forecast quarterly earnings surprises. The model uses 200 features and 500 trees. It performs extremely well on historical data but poorly on recent out-of-sample quarters, with a large gap between training and test $R^2$.
What adjustments are most likely to improve generalization?
Answer:
The large performance gap indicates overfitting and high variance error. The data scientist can reduce overfitting by limiting tree depth, reducing the number of features considered at each split, or using fewer trees (if they are extremely deep), and by performing cross-validation to choose hyperparameters. They might also use feature selection or stronger regularization on upstream models that generate features.
Worked Example 1.4
An analyst fits three models to predict next-month equity returns using 15 candidate factors:
- Model A: Ordinary least squares (OLS) regression with all 15 factors.
- Model B: LASSO regression with an optimally tuned penalty $\lambda$.
- Model C: Deep CART model without pruning.
The exhibit reports:
- Model A: High training $R^2$; poor test $R^2$
- Model B: Lower training $R^2$ than A and C; the best test $R^2$ of the three
- Model C: Very high training $R^2$; negative test $R^2$
Which model is preferred, and why?
Answer:
Model B (LASSO) is preferred. Although its in-sample fit is lower than that of A and C, it delivers the best test $R^2$ and likely generalizes better. Models A and C are clearly overfit (high training $R^2$, poor or negative test $R^2$), illustrating that simpler, regularized models often outperform more complex ones out-of-sample.
Exam Warning
A common exam error is to judge a model using only its in-sample (training) accuracy. CFA exam questions will typically require you to evaluate both in-sample and out-of-sample (validation/test) results to assess generalization and overfitting correctly.
You may see:
- Tables showing training and test accuracy, confusion matrices, or error rates.
- Learning curves or plots of accuracy vs. model complexity.
Your task is to identify whether the model is appropriately complex or overfit/underfit.
Revision Tip
When reviewing model results:
- Compare performance metrics across training and validation/test data. Large discrepancies usually indicate overfitting.
- Consider whether the algorithm type matches the problem (regression vs. classification; linear vs. nonlinear).
- Evaluate whether regularization, pruning, or ensemble methods have been applied when the dataset is high dimensional or noisy.
Summary
Supervised learning uses labeled data to train models for classification and regression problems in finance. Core algorithms—penalized regression (especially LASSO), SVM, KNN, CART, and ensemble/random forest—offer different trade-offs between flexibility, interpretability, and data requirements.
Overfitting is a key concern: models that capture noise show excellent in-sample fit but poor out-of-sample performance. Effective practice requires:
- Understanding how to split data into training, validation, and test sets (or use cross-validation).
- Recognizing overfitting symptoms in exam exhibits (large training–test performance gaps).
- Applying regularization, pruning, early stopping, ensemble learning, and feature selection to control model complexity.
- Choosing algorithms sensibly for tasks such as credit risk modeling, equity return forecasting, and fraud detection.
For CFA Level 2, you are expected to interpret these methods conceptually and apply them to case-based problems, not to implement them in code.
Key Point Checklist
This article has covered the following key knowledge points:
- Explain supervised learning and tasks (regression and classification) using labeled data
- Describe how targets, features, and data splits (training, validation, test) guide model training and evaluation
- Summarize common supervised learning algorithms used in finance (penalized regression, SVM, KNN, CART, ensemble, random forest) and when each is appropriate
- Define overfitting, its risks, and consequences for model generalization, including the bias–variance trade-off
- List and explain main techniques to control overfitting in machine learning models, including cross-validation, regularization, pruning, early stopping, feature selection, and ensembles
- Recognize exam situations where evaluation of test-set accuracy and model simplicity are critical
- Interpret tables or graphs of training vs. test performance to judge whether a model is likely to generalize well
Key Terms and Concepts
- supervised learning
- target variable
- feature
- generalization
- training set
- validation set
- test set
- hyperparameter
- overfitting
- penalized regression
- LASSO
- support vector machine (SVM)
- hyperplane
- margin
- k-nearest neighbor (KNN)
- classification and regression tree (CART)
- ensemble learning
- random forest
- cross-validation
- regularization
- pruning
- bias error
- variance error
- base error
- learning curve
- early stopping
- feature selection