Advanced Macro Seminar

Machine Learning for Economics Research

Ching-Yang Lin

2026-05-12

Can We Use Machine Learning for Economics?

Machine learning is everywhere — recommendation systems, self-driving cars, language models.

But can we use it for economics research?

If it’s so powerful, why don’t we see it more in economics journals? Why doesn’t our econometrics curriculum cover it?

Two Obstacles (and Why They’re Shrinking)

1. Data volume

Traditional econometrics works with small, carefully collected datasets. ML needs large samples. But survey data is getting bigger (SCF: 55,000 households), administrative data is exploding, and text data is now usable.

2. Causality

Econometrics asks why — the causal effect of X on Y. ML asks what — the best prediction of Y from everything available.

But the gap is closing. Double/Debiased ML, Causal Forests, and SHAP decomposition are bridging prediction and causal inference.

This Course: Three Components

Component 1: What is ML and what does a research workflow look like?

The main algorithms, the process from data to evaluation to interpretation.

Component 2: How do we actually do it?

Modern AI coding assistants (Claude Code, Codex) let you build ML pipelines by describing what you want in natural language. Live demo today.

Component 3: Can ML handle causality?

Yes, increasingly. Methods that combine ML’s predictive power with the causal reasoning economists care about.

Part 1

Why Machine Learning?

What is Machine Learning?

A set of methods that learn patterns from data instead of being explicitly programmed.

Traditional programming:

Rules + Data → Output

Machine learning:

Data + Output → Rules (learned automatically)

The computer figures out the rules by seeing examples.

A Simple Example

Predicting whether a student passes an exam:

Study Hours   Attendance   Pass?
10            90%          Yes
2             40%          No
8             85%          Yes
3             50%          No
???           ???          ???

You can probably see the pattern. ML algorithms do this automatically, even with hundreds of variables.
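
Here is a minimal sketch of that idea in scikit-learn; the table's numbers are the training data, and the new student's values are hypothetical.

# Toy sketch: logistic regression on the exam table above
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 0.90], [2, 0.40], [8, 0.85], [3, 0.50]])  # hours, attendance
y = np.array([1, 0, 1, 0])                                    # 1 = pass

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[6, 0.70]]))        # hypothetical new student
print(model.predict_proba([[6, 0.70]]))  # predicted probability of passing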

Types of Learning

Supervised Learning (today’s focus)

  • We have a target we want to predict
  • The algorithm learns from labeled examples
  • Classification: predict a category
  • Regression: predict a number

Unsupervised Learning

  • No target variable
  • Find hidden structure in data
  • Clustering, dimensionality reduction
  • Not covered today

AI Systems You Already Use — Which Type?

System                           What It Does                              Learning Type
YouTube recommendations          Predicts which video you’ll click next    Supervised
Instagram / TikTok feed          Predicts which posts you’ll engage with   Supervised
Amazon “customers also bought”   Predicts what you’ll purchase             Supervised + Unsupervised
Email spam filter                Classifies email as spam or not           Supervised (classification)

AI Systems (continued)

System                      What It Does                             Learning Type
Self-driving cars           Detects objects, predicts trajectories   Supervised + Reinforcement
AlphaGo / chess engines     Learns winning strategies by playing     Reinforcement Learning
ChatGPT / Claude / Gemini   Predicts the next token in a sequence    Self-supervised*

*LLMs are trained on massive text without explicit labels — the “label” is the next word itself.

The Key Distinction

Supervised: Someone tells the model the right answer during training.

  • “This email IS spam” / “This email is NOT spam”
  • “This passenger survived” / “This passenger died”
  • “This household holds stocks” / “This household does not”

Unsupervised: No right answers — the model finds structure on its own.

  • “Group these customers into segments” (clustering)
  • “Find the main patterns in this high-dimensional data” (PCA)

Reinforcement: The model learns by trial and error, receiving rewards or penalties.

  • AlphaGo: win = reward, lose = penalty
  • Self-driving: stay in lane = reward, crash = penalty

Today we focus on supervised learning — the most common type in applied research.

Classification vs Regression

          Classification                 Regression
Target    Category (Yes/No, A/B/C)       Number (price, score)
Example   Does this person own stocks?   How much do they invest?
Output    Probability + label            Continuous value
Metrics   Accuracy, AUC, F1              RMSE, R²

Today: classification — predicting stock market participation.

Why ML in Economics?

Traditional econometrics asks: “What is the causal effect of X on Y?”

Machine learning asks: “Can we predict Y from all available information?”

These are different questions, and both are useful.

             Econometrics                    Machine Learning
Goal         Causal effect of one variable   Best prediction using all variables
Variables    Few, carefully chosen           Many, let the data decide
Evaluation   Coefficient significance        Out-of-sample prediction accuracy

Today’s Research Question

Can we predict who participates in the stock market?

  • Data: Survey of Consumer Finances (SCF), US Federal Reserve
  • 55,000 households, surveyed every 3 years (1992–2022)
  • Rich information: income, wealth, education, risk attitudes, demographics
  • Target: Does this household hold any stocks? (Yes/No)

This is a real research question from my own work.

The SCF Dataset at a Glance

Feature            Type          Examples
Demographics       Categorical   Education, Marital Status, Work Status
Financial          Numerical     Income, Net Worth, Total Assets, Debt
Risk attitudes     Categorical   Risk Aversion level
Housing            Binary        Home Ownership
Macro conditions   Numerical     VIX, Stock Returns, Unemployment
Target             Binary        Has_Total_Stock (0 or 1)

55,004 observations across 11 survey waves.

What We’ll Do Today

Part 1: Why Machine Learning?          <-- You are here
Part 2: The ML Process
Part 3: Live Demo with SCF Data
Part 4: Evaluation Deep Dive
Part 5: Model Comparison
Part 6: SHAP --- Understanding Predictions

By the end, you will have built a complete ML pipeline from scratch.

Part 2

The ML Process

The Workflow

1. Data  -->  2. Split  -->  3. Preprocess  -->  4. Train  -->  5. Evaluate  -->  6. Compare
                                  ^                                                   |
                                  |___________________ try another model _____________|

This workflow is the same regardless of which algorithm you use.

Master the process, and you can use any ML method.

Step 1: Get Your Data

  • Understand what each variable means
  • Check for missing values, outliers, data types
  • Think about the problem before writing code
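
As a sketch, the checks above take only a few lines of pandas, assuming df is the loaded SCF DataFrame (the actual loading code comes in Part 3):

# Quick data checks before modelling
print(df.shape)                          # observations, variables
print(df.dtypes.value_counts())          # numeric vs text columns
print(df.isna().sum().sort_values(ascending=False).head(10))  # most-missing columns
print(df.describe())                     # ranges reveal outliers and odd codings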

For SCF data:

  • Why might someone participate in the stock market?
  • Which variables would you use to predict this?
  • Are there variables that would be “cheating”? (e.g., Total_Stock_Value)

Step 2: Split the Data

Before anything else, split into training and test sets.

  • Training set (70%): used to build the model
  • Test set (30%): locked away until the very end

Why? To simulate real-world performance — the model must predict on data it has never seen.

  Full Data (55,004 rows)
        |
   +---------+---------+
   |                   |
Training            Test
38,503 rows      16,501 rows
  (70%)            (30%)

Why Split FIRST?

A common mistake:

“I’ll preprocess all the data, then split.”

This is wrong. If you compute the mean of all data and use it to fill missing values, your training step has already “seen” information from the test set.

This is called data leakage — your test results will be too optimistic.

Rule: Split first. Fit preprocessing on the training set only, then apply the same transformation to the test set.

(Pipelines handle this automatically.)

The Overfitting Problem

  • Underfitting: model is too simple, misses the pattern
  • Overfitting: model memorises noise, fails on new data
  • Good fit: captures the pattern, generalises well
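
A small synthetic-data sketch makes the gap visible: an unconstrained decision tree is nearly perfect on the data it memorised and clearly worse on held-out data.

# Overfitting demo on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print(f'Train accuracy: {tree.score(X_tr, y_tr):.2f}')  # ~1.00: memorised the data
print(f'Test accuracy:  {tree.score(X_te, y_te):.2f}')  # lower: generalises poorly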

Cross-Validation

Problem: A single train/test split might be lucky or unlucky.

Solution: Repeat the process multiple times.

Fold Data Split
1 Train Train Train Train Test
2 Train Train Train Test Train
3 Train Train Test Train Train
4 Train Test Train Train Train
5 Test Train Train Train Train

Each fold gets a score → report the average.
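
In scikit-learn this is a single call. A sketch, using the pipeline_lr object we build in Part 3:

# 5-fold cross-validation: five scores, one summary
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline_lr, X_train, y_train, cv=5, scoring='roc_auc')
print(f'AUC per fold: {scores.round(3)}')
print(f'Mean AUC: {scores.mean():.3f} +/- {scores.std():.3f}')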

Step 3: Preprocess

Real data is messy. Common problems:

Problem            Solution
Missing values     Impute (mean, median, or most frequent)
Different scales   Standardise (mean=0, std=1)
Categorical text   Encode (one-hot encoding)

In scikit-learn, we handle this with Pipeline and ColumnTransformer — we’ll see the code in Part 3.

Step 4: Train a Model

Plug in any algorithm:

  • Logistic Regression (simple, interpretable)
  • Random Forest (ensemble of decision trees)
  • Gradient Boosting (powerful, widely used)
  • Support Vector Machine
  • Neural Network

The beauty of scikit-learn’s Pipeline: swap the algorithm in one line.

Algorithm Intuitions: Logistic Regression

A linear model: a weighted sum of features passed through a sigmoid gives the probability of the positive class.

Model parameters (learned from data)    Weights (one per feature), intercept
Hyperparameters (set before training)   C (regularisation), penalty type

Algorithm Intuitions: Decision Tree

Easy to visualise and explain, but tends to overfit — memorises the training data.

Algorithm Intuitions: Random Forest

Build hundreds of trees on random subsets. Errors cancel out → robust ensemble.

Algorithm Intuitions: Gradient Boosting

Sequential error-correction. Often achieves the highest accuracy, but slower to train.

Algorithm Intuitions: SVM

Finds the boundary with the widest margin. The kernel trick handles non-linear cases.

Algorithm Intuitions: Neural Network

Each node: inputs × weights → activation function. Stacking layers = increasingly abstract features. LLMs (GPT, Claude, Gemini) are built on this architecture.

Unsupervised Learning: Clustering

No labels needed. The algorithm discovers structure — e.g., customer segments, country groups, household types.

Unsupervised Learning: Dimensionality Reduction (PCA)

Reduces many features to a few key dimensions — useful for visualisation and noise removal.
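
For reference only, since we won't use them today: a minimal sketch of both methods on synthetic stand-in data.

# Unsupervised sketches: clustering and PCA on synthetic data
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # stand-in for household features

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # no labels used
X_2d = PCA(n_components=2).fit_transform(X)      # 10 features -> 2 dimensions
print(np.bincount(clusters), X_2d.shape)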

The ML Library Landscape

Library                Type                Use For
scikit-learn           Traditional ML      Logistic regression, trees, SVM, pipelines
XGBoost / LightGBM     Gradient boosting   High-accuracy tabular data models
PyTorch / TensorFlow   Deep learning       Neural networks, custom architectures

All scikit-learn algorithms share a uniform .fit() / .predict() interface.

XGBoost/LightGBM are compatible with this interface too.
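
A sketch of that uniformity, assuming X_train, y_train, and X_test are already purely numeric (and that the xgboost package is installed):

# The same two calls, estimators from two different libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # third-party, same interface

for model in [LogisticRegression(max_iter=1000),
              RandomForestClassifier(),
              XGBClassifier()]:
    model.fit(X_train, y_train)          # identical interface everywhere
    preds = model.predict(X_test)
    print(type(model).__name__, preds[:5])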

Model Parameters vs Hyperparameters: Summary

Algorithm             Model Parameters (learned)              Hyperparameters (you set)
Logistic Regression   Weights, intercept                      C, penalty
Decision Tree         Split feature, threshold, leaf values   max_depth, min_samples_leaf
Random Forest         All trees’ splits + leaf values         n_estimators, max_depth
Gradient Boosting     All trees’ structure + leaf values      n_estimators, learning_rate
SVM                   Support vectors, boundary weights       C, kernel, gamma
Neural Network        All connection weights + biases         Layers, nodes, learning rate

Model parameters: the algorithm finds these by optimising on training data.

Hyperparameters: you choose these before training. We use GridSearchCV to find the best combination.

Step 5: Evaluate

Never evaluate on training data!

Use the test set (held out since Step 2):

  • Accuracy: what fraction did we get right?
  • Precision, Recall, F1: more nuanced metrics
  • ROC/AUC: overall ranking quality

We’ll dive deep in Part 4.

Step 6: Compare and Iterate

  • Try multiple algorithms
  • Compare their test performance
  • Select the best one

Important: The process is the same every time. Only the model changes.

pipeline = Pipeline([
    ('preprocessor', preprocessor),   # stays the same
    ('classifier',   NEW_MODEL)       # <-- swap this
])

Part 3

Live Demo: Predicting Stock Market Participation

Coding with AI Assistants in 2026

We won’t just type code from scratch — we’ll use AI coding assistants to help us build the ML pipeline step by step.

Two categories of tools:

Type                Examples                                       How It Works
Application (GUI)   Cursor, Codex (ChatGPT), Claude Desktop        IDE or chat interface, point-and-click
CLI (Terminal)      Claude Code, Codex CLI, OpenCode, Gemini CLI   Runs directly in your terminal, works alongside your code

Today we’ll use Claude Code in the terminal.

Setting Up the CLI Environment

macOS:

  • Open Terminal (built-in) or iTerm2
  • Install: npm install -g @anthropic-ai/claude-code
  • Run: claude in your project folder

Windows:

  • First install WSL (Windows Subsystem for Linux)
  • Open Windows Terminal → select your Linux distribution
  • Then install Claude Code the same way as macOS

Linux:

  • Use any terminal emulator
  • Same installation as macOS

What Can CLI AI Assistants Do?

In a CLI coding environment, the AI assistant can:

  • Read your files — it understands your project structure
  • Write and edit code — create scripts, fix bugs, refactor
  • Run commands — execute Python, install packages, run tests
  • Explain code — ask “what does this function do?”
  • Iterate — “change the model to Random Forest” and it modifies the code

You describe what you want in natural language, and the assistant builds the code.

Demo: Building a Pipeline with Claude Code

Instead of copy-pasting code blocks, we’ll give instructions like:

“Load the SCF dataset from data/raw/SCF_with_Macro_and_Weights.csv. Show me the shape and the first few rows.”

“Build a scikit-learn pipeline with logistic regression. Use Age, Income, Net_Worth, Total_Fin_Asset, and Total_Debt as numerical features. Use Education, Work_Status, Marital_Status, Home_Ownership, and Risk_Aversion as categorical features.”

“Run 5-fold cross-validation with AUC scoring and show the results.”

The assistant writes the code, runs it, and shows you the output — all in the terminal.

Why Use AI Assistants for ML?

  1. Faster iteration — describe what you want, get working code
  2. Fewer syntax errors — the assistant handles boilerplate
  3. Learning tool — ask “why did you use StandardScaler here?” and get an explanation
  4. Focus on the process — you think about what to do, the assistant handles how to write it

The code below shows what the assistant would generate at each step. In class, we’ll build this live.

Setup

Code
# Import everything we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    classification_report, roc_curve, auc,
    RocCurveDisplay, f1_score, accuracy_score
)
import warnings
warnings.filterwarnings('ignore')

Step 1: Load the Data

Code
# Load the SCF dataset
df = pd.read_csv('data/raw/SCF_with_Macro_and_Weights.csv')
print(f'Dataset shape: {df.shape}')
print(f'Survey years: {sorted(df["Year"].unique())}')
df.head()

Expected output:

Dataset shape: (55004, 65)
Survey years: [1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019, 2022]

Explore the Target Variable

Code
# How many households hold stocks?
print(df['Has_Total_Stock'].value_counts())
print(f'\nParticipation rate: {df["Has_Total_Stock"].mean():.1%}')
Code
# Participation rate over time
df.groupby('Year')['Has_Total_Stock'].mean().plot(
    kind='bar', figsize=(10, 4),
    title='Stock Market Participation Rate by Survey Year',
    ylabel='Participation Rate', xlabel='Year',
    color='steelblue', edgecolor='white'
)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

Step 2: Select Features

Code
# Feature selection --- based on economic reasoning
numerical_features = ['Age', 'Income', 'Net_Worth', 'Total_Fin_Asset', 'Total_Debt']
categorical_features = ['Education', 'Work_Status', 'Marital_Status',
                        'Home_Ownership', 'Risk_Aversion']

target = 'Has_Total_Stock'

# Select only the columns we need
X = df[numerical_features + categorical_features]
y = df[target]

print(f'Features: {X.shape[1]} ({len(numerical_features)} numerical, '
      f'{len(categorical_features)} categorical)')
print(f'Observations: {X.shape[0]:,}')
print(f'Target balance: {y.mean():.1%} hold stocks')

Why These Features?

Numerical features — things we can measure:

  • Age: lifecycle savings theory — older people accumulate more
  • Income: more income → more to invest
  • Net_Worth: wealth enables risk-taking
  • Total_Fin_Asset: financial sophistication proxy
  • Total_Debt: debt constrains investment capacity

Categorical features — characteristics:

  • Education: financial literacy, information access
  • Work_Status: employment stability
  • Risk_Aversion: willingness to bear stock market risk
  • Home_Ownership: existing asset base
  • Marital_Status: household decision-making

What We Deliberately Exclude

  • Total_Stock_Value, Direct_Stock_Value — that’s the answer!
  • Stock_Company_Count — also reveals the answer
  • Year — we want features that generalise
  • Macro variables (for now) — keep it simple first

Lesson: Feature selection requires domain knowledge, not just statistical criteria.

Step 3: Train-Test Split

Code
# 70% train, 30% test --- stratified to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y          # important for imbalanced classes!
)

print(f'Training set: {X_train.shape[0]:,} rows')
print(f'Test set:     {X_test.shape[0]:,} rows')
print(f'Train participation rate: {y_train.mean():.1%}')
print(f'Test participation rate:  {y_test.mean():.1%}')

stratify=y ensures both sets have the same proportion of stock holders.

Without this, one set might accidentally contain a higher or lower share of stock holders.

Step 4: Preprocessing Pipeline

Code
# Numerical features: fill missing --> standardise
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: fill missing --> one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine into one preprocessor
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

What Does This Preprocessor Do?

                        +-- Numerical Pipeline --+
  Age, Income,    -->   |  Impute median         |  -->  Age: 0.3, -1.2 ...
  Net_Worth ...         |  Standardise           |       Income: 1.5, -0.8 ...
                        +------------------------+

                        +-- Categorical Pipeline +
  Education,      -->   |  Impute most frequent  |  -->  Edu_College: 1, 0 ...
  Work_Status ...       |  One-hot encode        |       Work_Employed: 1, 0 ...
                        +------------------------+

All of this happens automatically inside the pipeline.

Step 5: Build the Full Pipeline

Code
# Chain preprocessing + model into one object
pipeline_lr = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

That’s it. Two components:

  1. preprocessor — handles all data cleaning
  2. classifier — the actual model

When you call pipeline_lr.fit(X_train, y_train):

  • It preprocesses X_train (impute → scale → encode)
  • Then trains the logistic regression on the processed data
  • All in one call

Why Pipelines Matter

Without a pipeline:

# BAD: manual preprocessing --- error-prone, leaky
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # easy to forget fit vs transform!
model.fit(X_train_processed, y_train)
model.predict(X_test_processed)

With a pipeline:

# GOOD: everything in one object --- no leakage possible
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

Pipelines prevent data leakage and reduce mistakes.

Step 6: Hyperparameter Tuning

Hyperparameters = settings you choose before training.

For logistic regression:

  • C: inverse regularisation strength (smaller C = stronger penalty on complex models)
Code
# Search over different hyperparameter values
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.01, 0.1, 1, 10, 100]
}

grid_lr = GridSearchCV(
    pipeline_lr,
    param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='roc_auc', # optimise for AUC
    n_jobs=-1,         # use all CPU cores
    verbose=1
)

grid_lr.fit(X_train, y_train)

Understanding the Parameter Names

The __ (double underscore) navigates the pipeline structure:

preprocessor__num__imputer__strategy
     |         |      |        |
     |         |      |        +-- the actual parameter
     |         |      +-- step name in numerical_pipeline
     |         +-- name in ColumnTransformer
     +-- name in outer Pipeline
Code
print(f'Best imputer: {grid_lr.best_params_["preprocessor__num__imputer__strategy"]}')
print(f'Best C: {grid_lr.best_params_["classifier__C"]}')
print(f'Best CV AUC: {grid_lr.best_score_:.4f}')

Step 7: Evaluate on Test Set

Code
# This is the FINAL evaluation --- only done once
y_pred_lr = grid_lr.predict(X_test)
y_proba_lr = grid_lr.predict_proba(X_test)[:, 1]

# Classification report
print(classification_report(y_test, y_pred_lr,
                            target_names=['No Stocks', 'Has Stocks']))
Code
# Confusion matrix
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred_lr,
    display_labels=['No Stocks', 'Has Stocks'],
    cmap='Blues', ax=ax
)
ax.set_title('Logistic Regression --- Confusion Matrix')
plt.tight_layout()
plt.show()

Reading the Confusion Matrix
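
In scikit-learn’s convention, rows are the actual classes and columns are the predicted classes:

                       Predicted: No Stocks    Predicted: Has Stocks
Actual: No Stocks      True Negative (TN)      False Positive (FP)
Actual: Has Stocks     False Negative (FN)     True Positive (TP)

The diagonal cells are correct predictions; the off-diagonal cells are the two kinds of errors. Part 4 builds every metric from these four counts.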

Part 4

Evaluation Deep Dive

Why Accuracy is Not Enough

Imagine a dataset where 90% of people do NOT hold stocks.

A model that always predicts “No Stocks” gets 90% accuracy!

But it is completely useless — it never identifies a stock holder.

We need more nuanced metrics that look at different types of errors.
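
You can check this with scikit-learn’s DummyClassifier. A sketch, reusing the train/test objects from Part 3:

# Majority-class baseline: high accuracy, useless recall
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
y_base = baseline.predict(X_test)
print(accuracy_score(y_test, y_base))  # equals the majority-class share
print(recall_score(y_test, y_base))    # 0.0 if holders are the minority class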

Four Fundamental Metrics

From the confusion matrix:

Metric                 Formula           Question It Answers
Sensitivity (Recall)   TP / (TP + FN)    Of all actual stock holders, how many did we find?
Specificity            TN / (TN + FP)    Of all non-holders, how many did we correctly identify?
Precision              TP / (TP + FP)    Of those we predicted as holders, how many actually are?
Accuracy               (TP + TN) / All   Overall, what fraction did we get right?

Sensitivity vs Specificity

  • Sensitivity: how good are we at finding stock holders?
  • Specificity: how good are we at finding non-holders?
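
A quick worked example with hypothetical counts (TP = 80, FN = 20, TN = 90, FP = 10):

# Hypothetical confusion-matrix counts, for illustration only
TP, FN, TN, FP = 80, 20, 90, 10

sensitivity = TP / (TP + FN)                   # 0.80: found 80% of holders
specificity = TN / (TN + FP)                   # 0.90: identified 90% of non-holders
precision   = TP / (TP + FP)                   # ~0.89: predicted holders mostly correct
accuracy    = (TP + TN) / (TP + FN + TN + FP)  # 0.85 overall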

The Threshold Tradeoff

Logistic regression outputs a probability (0 to 1).

We choose a threshold to convert it to a prediction:

  • Threshold = 0.5 (default): predict “Has Stocks” if probability > 50%
  • Threshold = 0.3: more aggressive — catches more holders, but more false alarms
  • Threshold = 0.7: more conservative — fewer false alarms, but misses more holders

There is always a tradeoff. Catching more positives = more false positives.
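
A sketch of moving the threshold by hand, reusing grid_lr and X_test from Part 3; .predict() itself always uses the 0.5 default:

# Custom decision thresholds from predicted probabilities
y_proba = grid_lr.predict_proba(X_test)[:, 1]     # P(Has Stocks)

y_pred_default    = (y_proba >= 0.5).astype(int)  # what .predict() returns
y_pred_aggressive = (y_proba >= 0.3).astype(int)  # catches more holders...
print('Extra predicted holders at 0.3:',
      y_pred_aggressive.sum() - y_pred_default.sum())  # ...at the cost of false alarms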

Visualising the Tradeoff

Lower the threshold → higher sensitivity, lower specificity.

ROC Curve

The ROC curve plots sensitivity vs (1 - specificity) at every threshold.

AUC (Area Under the Curve): one number summarising overall quality. Higher = better.

Precision vs Recall

Metric                   Focus                                            Use When…
Precision                Of predicted positives, how many are correct?    False positives are costly
Recall (= Sensitivity)   Of actual positives, how many did we find?       Missing positives is costly

Stock participation example:

  • High precision needed: acting on predictions costs money (targeted marketing)
  • High recall needed: missing a stock holder means missing important data

F1 Score

When you care about both precision and recall:

\[F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]

  • F1 = 1.0: perfect precision and recall
  • F1 = 0.0: either precision or recall is zero
  • F1 penalises models that sacrifice one for the other
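
A worked example with hypothetical values, precision = 0.80 and recall = 0.60:

# F1 is a harmonic mean: pulled toward the weaker of the two
precision, recall = 0.80, 0.60
f1 = 2 * precision * recall / (precision + recall)
print(f'{f1:.3f}')  # 0.686, below the arithmetic mean of 0.70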

Which Metric for Our Problem?

For predicting stock market participation:

  • AUC is a good default — evaluates across all thresholds
  • F1 if we need a single threshold prediction
  • Recall if we want to make sure we find all participants

In research, we typically report multiple metrics and let the reader judge.

Computing ROC for Our Model

Code
# ROC curve for logistic regression
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(
    grid_lr, X_test, y_test,
    name='Logistic Regression',
    ax=ax, color='steelblue', linewidth=2
)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random (AUC = 0.5)')
ax.set_title('ROC Curve --- Stock Market Participation', fontsize=14)
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

Part 5

Model Comparison

The Power of Pipelines

Remember our pipeline structure:

pipeline = Pipeline([
    ('preprocessor', preprocessor),   # SAME for all models
    ('classifier',   ???)             # <-- only this changes
])

To try a new model, we only change one line.

Gradient Boosting

Code
# Gradient Boosting --- a powerful ensemble method
pipeline_gb = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

param_grid_gb = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.05, 0.1],
    'classifier__max_depth': [3, 5]
}

grid_gb = GridSearchCV(
    pipeline_gb, param_grid_gb,
    cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
grid_gb.fit(X_train, y_train)

print(f'Best CV AUC: {grid_gb.best_score_:.4f}')
print(f'Best params: {grid_gb.best_params_}')

Random Forest

Code
# Random Forest --- another ensemble method
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

param_grid_rf = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [5, 10, None],
    'classifier__min_samples_leaf': [1, 5]
}

grid_rf = GridSearchCV(
    pipeline_rf, param_grid_rf,
    cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
grid_rf.fit(X_train, y_train)

print(f'Best CV AUC: {grid_rf.best_score_:.4f}')

The Swap Pattern

# The pattern: only the classifier changes
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest':       RandomForestClassifier(n_estimators=200),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=200),
    'SVM':                 SVC(probability=True),
    'KNN':                 KNeighborsClassifier(),
    'Neural Network':      MLPClassifier(max_iter=500),
}

for name, model in models.items():
    pipe = Pipeline([('preprocessor', preprocessor), ('classifier', model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
    print(f'{name:25s}  AUC = {scores.mean():.4f} +/- {scores.std():.4f}')

Compare ROC Curves

Code
# Side-by-side ROC curves
fig, ax = plt.subplots(figsize=(9, 7))

colors = {'Logistic Regression': 'steelblue',
          'Gradient Boosting': 'darkgreen',
          'Random Forest': 'darkorange'}

for name, grid in [('Logistic Regression', grid_lr),
                   ('Gradient Boosting', grid_gb),
                   ('Random Forest', grid_rf)]:
    RocCurveDisplay.from_estimator(
        grid, X_test, y_test,
        name=name, ax=ax,
        color=colors[name], linewidth=2
    )

ax.plot([0, 1], [0, 1], 'k--', alpha=0.3)
ax.set_title('Model Comparison --- ROC Curves', fontsize=14)
ax.legend(fontsize=11, loc='lower right')
plt.tight_layout()
plt.show()

Summary Table

Code
# Build comparison table
results = []
for name, grid in [('Logistic Regression', grid_lr),
                   ('Gradient Boosting', grid_gb),
                   ('Random Forest', grid_rf)]:
    y_pred = grid.predict(X_test)
    y_proba = grid.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    results.append({
        'Model': name,
        'CV AUC': f'{grid.best_score_:.4f}',
        'Test AUC': f'{auc(fpr, tpr):.4f}',
        'Test Accuracy': f'{accuracy_score(y_test, y_pred):.4f}',
        'Test F1': f'{f1_score(y_test, y_pred):.4f}'
    })

pd.DataFrame(results).set_index('Model')

Bias-Variance Tradeoff (Intuition)

Simple models (Logistic Regression)

  • High bias, low variance
  • Stable predictions
  • May miss complex patterns
  • Easy to interpret

Complex models (Gradient Boosting)

  • Low bias, high variance
  • Can capture subtle patterns
  • Risk of overfitting
  • Harder to interpret

No single model is always best. Compare fairly on test data and choose based on your goals.

Part 6

SHAP: Understanding Predictions

The Black Box Problem

We have a model that predicts stock participation with good AUC.

But why does it predict what it predicts?

  • Which features matter most?
  • How does each feature push the prediction up or down?
  • Are the patterns economically sensible?

SHAP (SHapley Additive exPlanations) answers these questions.

What Are SHAP Values?

For each prediction, SHAP tells you:

How much did each feature contribute to this particular prediction?

Based on Shapley values from cooperative game theory:

  • Each feature is a “player” in a game
  • The “payout” is the prediction
  • SHAP fairly distributes credit among features

Computing SHAP Values

Code
import shap

# Use the best model (e.g., Gradient Boosting)
best_model = grid_gb.best_estimator_

# Get the preprocessed test data
X_test_processed = best_model.named_steps['preprocessor'].transform(X_test)

# Get feature names after one-hot encoding
feature_names = (numerical_features +
    list(best_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_features)))

# Compute SHAP values
explainer = shap.TreeExplainer(best_model.named_steps['classifier'])
shap_values = explainer.shap_values(X_test_processed)

SHAP Summary Plot

Code
# Global feature importance: which features matter most?
shap.summary_plot(
    shap_values, X_test_processed,
    feature_names=feature_names,
    max_display=15,
    show=False
)
plt.title('SHAP Summary --- What Drives Stock Market Participation?', fontsize=13)
plt.tight_layout()
plt.show()

How to read this plot:

  • Features ranked by importance (top = most important)
  • Each dot = one observation
  • Red = high feature value, Blue = low
  • Right of center = pushes prediction toward “Has Stocks”

Interpreting the Results

Expected patterns (from economic theory):

Feature                Expected Effect              Why
Net Worth ↑            More likely to hold stocks   Wealth enables risk-taking
Income ↑               More likely                  More to invest
Education (College+)   More likely                  Financial literacy
Risk Aversion (high)   Less likely                  Unwilling to bear risk
Age                    Hump-shaped?                 Lifecycle savings

This is where ML meets economics: the model’s learned patterns should align with theory. If they don’t — that’s even more interesting.

SHAP for Individual Predictions

Code
# Explain a single prediction
idx = 0  # first test observation
shap.waterfall_plot(
    shap.Explanation(
        values=shap_values[idx],
        base_values=explainer.expected_value,
        data=X_test_processed[idx],
        feature_names=feature_names
    )
)

The waterfall plot shows:

  • Starting from the base rate (average prediction)
  • Each feature pushes the prediction up or down
  • The final prediction is the sum of all contributions

Why SHAP Matters for Research

  1. Validation: Do the model’s patterns match economic theory?
  2. Discovery: Are there unexpected patterns we should investigate?
  3. Communication: Explain complex models to non-technical audiences
  4. Fairness: Check if the model relies on problematic features

ML is not just about getting a high AUC.

It’s about learning something useful from the data.

Wrap-Up

Summary

What We Covered Today

1. Why ML?          --> Prediction from data, different from causal inference
2. The Process      --> Split --> Preprocess --> Train --> Evaluate --> Compare
3. Live Demo        --> Complete pipeline with SCF data
4. Evaluation       --> Confusion matrix, ROC/AUC, precision/recall, F1
5. Model Comparison --> Swap algorithms in one line with Pipeline
6. SHAP             --> Open the black box, connect to economic theory

The Three Big Ideas

1. Process over algorithms

The workflow is the same for any model. Master the process.

2. Pipelines prevent mistakes

Preprocessing + model in one object. No data leakage. Easy to swap.

3. Prediction is just the beginning

SHAP and interpretability connect ML back to understanding — which is the real goal of research.

For Your Own Projects

# The template you can reuse for any classification problem:

# 1. Load and explore data
# 2. Define X (features) and y (target)
# 3. Train-test split with stratify
# 4. Build preprocessing pipeline
# 5. Build full pipeline (preprocessor + model)
# 6. GridSearchCV for hyperparameter tuning
# 7. Evaluate on test set
# 8. Try other models (just swap the classifier)
# 9. Compare with ROC curves
# 10. Interpret with SHAP

Resources

Questions?