Learning Outcomes
This article explains how to evaluate and manage key violations of the classical linear regression assumptions in a multiple regression setting that is heavily tested in the CFA Level 2 exam. It focuses on three core problems—multicollinearity, heteroskedasticity, and autocorrelation—and clarifies how each one arises in applied return and risk modeling. The discussion shows how these violations affect coefficient precision, standard errors, hypothesis tests, and the reliability of forecasts, even when R² and overall model fit appear strong. The article explains how to recognize warning signs from regression output, residual plots, and test statistics, including the use of t- and F-tests, variance inflation factors (VIF), the Breusch-Pagan test, the Durbin-Watson statistic, and the Breusch-Godfrey test. It also details how to select and implement appropriate corrections such as dropping or combining explanatory variables, using robust (White) standard errors, and applying Newey–West or other heteroskedasticity- and autocorrelation-consistent (HAC) estimators. Throughout, the emphasis is on interpreting output in an exam-style context so that common CFA Level 2 question formats—conceptual, computational, and vignette-based—can be addressed efficiently by diagnosing the specific violation, linking it to its consequence for statistical inference, and choosing the most defensible corrective action.
CFA Level 2 Syllabus
For the CFA Level 2 exam, you are expected to understand issues that arise in multiple regression analysis and how they impact statistical inference, with a focus on the following syllabus points:
- Explain the impact of multicollinearity, heteroskedasticity, and autocorrelation on regression analysis and inference.
- Detect multicollinearity using t- and F-statistics or variance inflation factors (VIF).
- Identify heteroskedasticity and autocorrelation using residual plots and formal tests.
- Apply corrections and robust estimation methods where violations occur.
- Interpret the implications of regression assumption violations for real-world financial data.
Test Your Knowledge
Attempt these questions before reading this article. If you find some difficult or cannot remember the answers, remember to look more closely at that area during your revision.
- What is the primary effect of multicollinearity on standard errors and hypothesis tests in a multiple regression?
- How can you visually identify conditional heteroskedasticity in a residual plot?
- Which regression assumption does the Durbin-Watson test assess?
- What adjustment should be made to coefficient standard errors in the presence of serial correlation?
Introduction
Multiple regression models are powerful tools, but their reliability depends on several statistical assumptions. In practice, financial data often violate these assumptions, which undermines hypothesis tests, confidence intervals, and model interpretation. The most critical issues encountered in CFA-relevant financial analyses are multicollinearity among independent variables, heteroskedasticity of residuals, and autocorrelation (serial correlation) of errors. Recognizing, testing for, and correcting these problems is essential for robust analysis.
Key Term: multicollinearity
The situation where two or more independent variables (or their linear combinations) in a multiple regression are highly correlated, making it difficult to isolate their individual impacts on the dependent variable.

Key Term: heteroskedasticity
The condition where the variance of the regression residuals is not constant across all levels of the independent variables, potentially invalidating hypothesis tests.

Key Term: autocorrelation
Also known as serial correlation, this refers to the scenario where residuals from one observation are correlated with residuals from another, a common issue in time-series data.
Assumption Violations in Multiple Regression
1. Multicollinearity
Multicollinearity exists when independent variables are highly correlated with each other. While the overall fit of the regression (e.g., R²) may appear strong, multicollinearity makes standard errors larger and t-statistics smaller, which can mask the statistical significance of individual coefficients. It does not bias coefficient estimates but does make them unstable and unreliable.
Detection
- A significant overall F-statistic and high R², yet low t-statistics that make most individual coefficients appear insignificant.
- Variance Inflation Factor (VIF): for variable j, VIF_j = 1 / (1 − R_j²), where R_j² is from regressing variable j on the other independent variables. A VIF greater than 10 is a typical red flag, although lower cutoffs are sometimes used (see the sketch after this list).
- Pairwise correlations close to 1.0 (or -1.0) between independent variables.
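A minimal sketch of the classic multicollinearity pattern and a VIF check, using Python's statsmodels. The data, seed, and variable names (pe_ratio, market_cap) are hypothetical placeholders, not taken from the examples above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical example: two nearly collinear regressors driving the same dependent variable
rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.05 * rng.normal(size=n)          # almost a copy of x1
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"pe_ratio": x1, "market_cap": x2}))
results = sm.OLS(y, X).fit()

# Classic symptom: significant overall F-test but individually insignificant t-statistics
print(f"R-squared = {results.rsquared:.2f}, F p-value = {results.f_pvalue:.4f}")
print("Coefficient p-values:", results.pvalues.round(3).to_dict())

# VIF_j = 1 / (1 - R_j^2); values far above 10 confirm multicollinearity
for j, name in enumerate(X.columns):
    if name != "const":
        print(f"{name}: VIF = {variance_inflation_factor(X.values, j):.1f}")
```

Note that the overall fit looks strong even though neither coefficient is individually significant, which is exactly the pattern the detection checklist describes.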
Correction
- Drop or combine highly correlated variables.
- Use a different variable as a proxy.
- Increase sample size if possible.
Worked Example 1.1
A regression to predict stock returns uses both P/E ratio and Market Cap. The t-stats for both are insignificant, but the F-test is highly significant and R² = 0.85.
Answer:
Likely cause: multicollinearity between P/E and Market Cap. They may be measuring similar firm characteristics. Consider checking VIF and potentially dropping one variable.
2. Heteroskedasticity
Heteroskedasticity means the variance of the errors is not constant across observations. It is especially common in cross-sectional and financial return data.
- Unconditional heteroskedasticity is unrelated to the levels of the independent variables; although it violates the constant-variance assumption, it usually creates no serious problems for inference.
- Conditional heteroskedasticity occurs when variance increases or decreases with independent variable values—a serious issue for inference.
Detection (Heteroskedasticity)
- Residual plots: Spread of residuals increases or decreases systematically with the fitted or independent variable values (funnel shape).
- Breusch-Pagan test: regress the squared residuals on the independent variables; the statistic n × R² from that auxiliary regression follows a chi-square distribution with degrees of freedom equal to the number of independent variables under the null of no conditional heteroskedasticity. A significant chi-square statistic indicates conditional heteroskedasticity (a code sketch follows this list).
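The sketch below runs a Breusch-Pagan test in statsmodels on simulated data whose error variance is deliberately constructed to grow with the regressor; the data and parameter values are assumptions for illustration only.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data with conditional heteroskedasticity: error spread grows with x
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # residual variance widens with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan: regress squared residuals on the regressors; n*R^2 ~ chi-square(k) under the null
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(f"BP chi-square statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
# A small p-value rejects the null of homoskedasticity (conditional heteroskedasticity present)
```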
Effects
- Coefficient estimates remain unbiased, but standard errors are incorrect—often understated—causing Type I errors (false positives).
- F- and t-statistics are unreliable for hypothesis testing.
Correction (Heteroskedasticity)
- Use robust (heteroskedasticity-consistent) standard errors, often called White-corrected standard errors, and recalculate the t-statistics for hypothesis tests; a minimal sketch follows below.
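A minimal sketch of refitting with White-type (heteroskedasticity-consistent) standard errors in statsmodels. The "HC1" variant and the simulated data below are illustrative assumptions; other HC variants behave similarly.

```python
import numpy as np
import statsmodels.api as sm

# Same style of simulated heteroskedastic data as in the detection sketch above
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                  # conventional (possibly understated) standard errors
white = sm.OLS(y, X).fit(cov_type="HC1")  # White-corrected standard errors; coefficients unchanged

print("OLS SEs:  ", np.round(ols.bse, 4))
print("White SEs:", np.round(white.bse, 4))
print("Slope t-stat (OLS):  ", round(float(ols.tvalues[1]), 2))
print("Slope t-stat (White):", round(float(white.tvalues[1]), 2))
```

The point estimates are identical in both fits; only the standard errors, and therefore the t-statistics and p-values, change.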
Worked Example 1.2
You regress monthly stock returns on several financial ratios and plot the residuals against the predicted values. The plot widens as fitted values increase.
Answer:
The fan-shaped pattern in the residual plot suggests conditional heteroskedasticity. Use robust (White-corrected) standard errors for valid inference.
3. Autocorrelation
Autocorrelation (serial correlation) is most problematic in time-series data, but can appear in any data with ordering. First-order positive autocorrelation is most common and means a positive error in one period is likely followed by another in the next.
Detection (Autocorrelation)
- Residual plots over time showing systematic runs (positive or negative sequences).
- Durbin-Watson (DW) test for first-order autocorrelation: DW ≈ 2(1 − r), where r is the correlation between consecutive residuals. DW ≈ 2 means no autocorrelation; DW significantly below 2 implies positive autocorrelation, while DW significantly above 2 implies negative autocorrelation.
- Breusch-Godfrey (BG) test for higher-order or multiple-lag autocorrelation (both tests are illustrated in the sketch after this list).
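The following sketch computes both statistics in statsmodels on simulated data with AR(1) errors; the AR coefficient and the four-lag choice for the BG test are arbitrary illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Simulated time series with positively autocorrelated (AR(1)) errors
rng = np.random.default_rng(1)
n = 240
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)   # first-order serial correlation
y = 1.0 + 0.8 * x + e

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Durbin-Watson: approximately 2 * (1 - r); values well below 2 signal positive autocorrelation
print(f"Durbin-Watson = {durbin_watson(results.resid):.2f}")

# Breusch-Godfrey: tests for serial correlation up to the chosen number of lags
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(results, nlags=4)
print(f"Breusch-Godfrey chi-square = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
```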
Effects (Autocorrelation)
- Coefficient estimates are unbiased (if no lagged dependent variable), but standard errors are understated for positive autocorrelation.
- Increases risk of Type I errors because t-statistics and F-statistics are overstated.
- With lagged dependent variable present, coefficients may be inconsistent.
Correction (Autocorrelation)
- Use robust standard errors suitable for autocorrelated data, such as Newey–West (HAC) standard errors; see the sketch after this list.
- Check functional form or lag structure, or use models that capture serial dependence if it is structural (e.g., AR models).
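A minimal sketch of a Newey–West (HAC) refit in statsmodels; the simulated AR(1) data mirror the detection sketch above, and the lag length of 4 is an arbitrary choice for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with AR(1) errors, as in the detection sketch above
rng = np.random.default_rng(1)
n = 240
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 0.8 * x + e
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                                        # standard errors likely understated
nw = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})  # Newey-West (HAC) standard errors

print("OLS slope SE:        ", round(float(ols.bse[1]), 4))
print("Newey-West slope SE: ", round(float(nw.bse[1]), 4))
# Coefficients are identical; only the standard errors (and t-statistics) are adjusted
```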
Worked Example 1.3
You regress monthly returns on economic indicators. The Durbin-Watson statistic is 1.1. What does this indicate?
Answer:
DW = 1.1, well below 2, indicates positive first-order autocorrelation; the implied correlation between consecutive residuals is roughly r ≈ 1 − DW/2 = 0.45. Consider using Newey–West standard errors to correct the standard errors and t-tests, or respecify the model to capture the serial dependence.
Exam Warning
Focusing solely on hypothesis test results without checking regression assumptions is a critical error. CFA exam questions often require you to diagnose issues (like heteroskedasticity or autocorrelation) based on residual plots or test results. Always link observed issues to the correct statistical inference problem—in practice and on the exam.
Summary
Understanding and diagnosing multicollinearity, heteroskedasticity, and autocorrelation are essential for CFA candidates using multiple regression in finance. Each problem affects inference in a unique way:
- Multicollinearity inflates standard errors and renders t-tests unreliable for individual coefficients.
- Heteroskedasticity and autocorrelation violate assumptions about residuals, invalidating standard errors and test statistics.
- Robust standard errors (White for heteroskedasticity, Newey–West for autocorrelation) correct statistical inference, not the coefficient estimates themselves.
Proper detection and correction techniques must be applied before interpreting regression output for investment decisions.
Key Point Checklist
This article has covered the following key knowledge points:
- Explain the nature, detection, and effects of multicollinearity in multiple regression.
- Identify heteroskedasticity and apply robust standard errors for valid inference.
- Diagnose autocorrelation in regression residuals and correct with suitable standard error adjustments.
- Understand that violations impact the validity of hypothesis testing, not the coefficients themselves (unless a lagged dependent variable is included).
- Use the appropriate tests (VIF for multicollinearity, Breusch-Pagan for heteroskedasticity, Durbin-Watson/Breusch-Godfrey for autocorrelation).
Key Terms and Concepts
- multicollinearity
- heteroskedasticity
- autocorrelation