
Machine Learning for Economics Research
2026-05-12
Machine learning is everywhere — recommendation systems, self-driving cars, language models.
But can we use it for economics research?
If it’s so powerful, why don’t we see it more in economics journals? Why doesn’t our econometrics curriculum cover it?
1. Data volume
Traditional econometrics works with small, carefully collected datasets. ML needs large samples. But survey data is getting bigger (SCF: 55,000 households), administrative data is exploding, and text data is now usable.
2. Causality
Econometrics asks why — the causal effect of X on Y. ML asks what — the best prediction of Y from everything available.
But the gap is closing. Double/Debiased ML, Causal Forests, and SHAP decomposition are bridging prediction and causal inference.
Component 1: What is ML and what does a research workflow look like?
The main algorithms, the process from data to evaluation to interpretation.
Component 2: How do we actually do it?
Modern AI coding assistants (Claude Code, Codex) let you build ML pipelines by describing what you want in natural language. Live demo today.
Component 3: Can ML handle causality?
Yes, increasingly. Methods that combine ML’s predictive power with the causal reasoning economists care about.
A set of methods that learn patterns from data instead of being explicitly programmed.
Traditional programming:
Rules + Data → Output
Machine learning:
Data + Output → Rules (learned automatically)
The computer figures out the rules by seeing examples.
Predicting whether a student passes an exam:
| Study Hours | Attendance | Pass? |
|---|---|---|
| 10 | 90% | Yes |
| 2 | 40% | No |
| 8 | 85% | Yes |
| 3 | 50% | No |
| 6 | 70% | ??? |
You can probably see the pattern. ML algorithms do this automatically, even with hundreds of variables.
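A minimal sketch of this in scikit-learn (the numbers are the toy table above; the new student in the last row is hypothetical):

from sklearn.tree import DecisionTreeClassifier

# Toy exam data: [study hours, attendance], pass?
X = [[10, 0.90], [2, 0.40], [8, 0.85], [3, 0.50]]
y = ['Yes', 'No', 'Yes', 'No']

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[6, 0.70]]))  # new student: 6 hours, 70% attendance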
Supervised Learning (today’s focus)
Unsupervised Learning
| System | What It Does | Learning Type |
|---|---|---|
| YouTube recommendations | Predicts which video you’ll click next | Supervised |
| Instagram / TikTok feed | Predicts which posts you’ll engage with | Supervised |
| Amazon “customers also bought” | Predicts what you’ll purchase | Supervised + Unsupervised |
| Email spam filter | Classifies email as spam or not | Supervised (classification) |
| System | What It Does | Learning Type |
|---|---|---|
| Self-driving cars | Detects objects, predicts trajectories | Supervised + Reinforcement |
| AlphaGo / chess engines | Learns winning strategies by playing | Reinforcement Learning |
| ChatGPT / Claude / Gemini | Predicts the next token in a sequence | Self-supervised* |
*LLMs are trained on massive text without explicit labels — the “label” is the next word itself.
Supervised: Someone tells the model the right answer during training.
Unsupervised: No right answers — the model finds structure on its own.
Reinforcement: The model learns by trial and error, receiving rewards or penalties.
Today we focus on supervised learning — the most common type in applied research.
| | Classification | Regression |
|---|---|---|
| Target | Category (Yes/No, A/B/C) | Number (price, score) |
| Example | Does this person own stocks? | How much do they invest? |
| Output | Probability + label | Continuous value |
| Metrics | Accuracy, AUC, F1 | RMSE, R² |
Today: classification — predicting stock market participation.
Traditional econometrics asks: “What is the causal effect of X on Y?”
Machine learning asks: “Can we predict Y from all available information?”
These are different questions, and both are useful.
| | Econometrics | Machine Learning |
|---|---|---|
| Goal | Causal effect of one variable | Best prediction using all variables |
| Variables | Few, carefully chosen | Many, let the data decide |
| Evaluation | Coefficient significance | Out-of-sample prediction accuracy |
Can we predict who participates in the stock market?
This is a real research question from my own work.
| Feature | Type | Examples |
|---|---|---|
| Demographics | Categorical | Education, Marital Status, Work Status |
| Financial | Numerical | Income, Net Worth, Total Assets, Debt |
| Risk attitudes | Categorical | Risk Aversion level |
| Housing | Binary | Home Ownership |
| Macro conditions | Numerical | VIX, Stock Returns, Unemployment |
| Target | Binary | Has_Total_Stock (0 or 1) |
55,004 observations across 11 survey waves.
Part 1: Why Machine Learning? <-- You are here
Part 2: The ML Process
Part 3: Live Demo with SCF Data
Part 4: Evaluation Deep Dive
Part 5: Model Comparison
Part 6: SHAP --- Understanding Predictions
By the end, you will have built a complete ML pipeline from scratch.
1. Data --> 2. Split --> 3. Preprocess --> 4. Train --> 5. Evaluate --> 6. Compare
                                              ^                             |
                                              |_____ try another model _____|
This workflow is the same regardless of which algorithm you use.
Master the process, and you can use any ML method.
For SCF data: X is the set of household and macro features, and y is Has_Total_Stock (we must drop columns that reveal the answer, such as Total_Stock_Value).
Before anything else, split into training and test sets.
Why? To simulate real-world performance — the model must predict on data it has never seen.
        Full Data (55,004 rows)
                  |
        +---------+---------+
        |                   |
    Training              Test
  38,503 rows         16,501 rows
     (70%)               (30%)
A common mistake:
“I’ll preprocess all the data, then split.”
This is wrong. If you compute the mean of all data and use it to fill missing values, the test set has “seen” training data information.
This is called data leakage — your test results will be too optimistic.
Rule: Split first. Preprocess training and test sets separately.
(Pipelines handle this automatically.)
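A minimal sketch of what "preprocess separately" means in code (assuming X_train and X_test from a split; SimpleImputer stands in for any preprocessing step):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_train_imp = imputer.fit_transform(X_train)  # learn medians from training data only
X_test_imp = imputer.transform(X_test)        # reuse those medians: no leakage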

Problem: A single train/test split might be lucky or unlucky.
Solution: Repeat the process multiple times.
| Fold | Data Split |
|---|---|
| 1 | Train Train Train Train Test |
| 2 | Train Train Train Test Train |
| 3 | Train Train Test Train Train |
| 4 | Train Test Train Train Train |
| 5 | Test Train Train Train Train |
Each fold gets a score → report the average.
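In scikit-learn this is one call. A sketch, assuming pipe is a full preprocessing + model pipeline like the one built in Part 3:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
print(f'AUC per fold: {scores.round(4)}')
print(f'Mean AUC: {scores.mean():.4f} +/- {scores.std():.4f}')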
Real data is messy. Common problems:
| Problem | Solution |
|---|---|
| Missing values | Impute (mean, median, or most frequent) |
| Different scales | Standardise (mean=0, std=1) |
| Categorical text | Encode (one-hot encoding) |
In scikit-learn, we handle this with Pipeline and ColumnTransformer — we’ll see the code in Part 3.
Plug in any algorithm: the beauty of scikit-learn’s Pipeline is that you can swap the model in a single line.

| Learned from data (model parameters) | Set before training (hyperparameters) |
|---|---|
| Weights (one per feature), intercept | C (regularisation), penalty type |

Decision Trees: easy to visualise and explain, but tend to overfit — memorising the training data.

Random Forests: build hundreds of trees on random subsets. Errors cancel out → a robust ensemble.

Gradient Boosting: sequential error-correction. Often achieves the highest accuracy, but is slower to train.

Support Vector Machines: find the boundary with the widest margin. The kernel trick handles non-linear cases.

Neural Networks: each node computes inputs × weights → activation function. Stacking layers = increasingly abstract features. LLMs (GPT, Claude, Gemini) are built on this architecture.

Clustering: no labels needed. The algorithm discovers structure — e.g., customer segments, country groups, household types.

Dimensionality reduction (e.g., PCA): reduces many features to a few key dimensions — useful for visualisation and noise removal.
| Library | Type | Use For |
|---|---|---|
| scikit-learn | Traditional ML | Logistic regression, trees, SVM, pipelines |
| XGBoost / LightGBM | Gradient boosting | High-accuracy tabular data models |
| PyTorch / TensorFlow | Deep learning | Neural networks, custom architectures |
All scikit-learn algorithms share a uniform .fit() / .predict() interface.
XGBoost/LightGBM are compatible with this interface too.
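A sketch of that interface (the same three calls work for any estimator; X_train and the other arrays are defined later in the demo):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn model parameters
labels = model.predict(X_test)         # hard 0/1 predictions
probs = model.predict_proba(X_test)    # class probabilities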
| Algorithm | Model Parameters (learned) | Hyperparameters (you set) |
|---|---|---|
| Logistic Regression | Weights, intercept | C, penalty |
| Decision Tree | Split feature, threshold, leaf values | max_depth, min_samples_leaf |
| Random Forest | All trees’ splits + leaf values | n_estimators, max_depth |
| Gradient Boosting | All trees’ structure + leaf values | n_estimators, learning_rate |
| SVM | Support vectors, boundary weights | C, kernel, gamma |
| Neural Network | All connection weights + biases | Layers, nodes, learning rate |
Model parameters: the algorithm finds these by optimising on training data.
Hyperparameters: you choose these before training. We use GridSearchCV to find the best combination.
Never evaluate on training data!
Use the test set (held out since Step 2).
We’ll dive deep in Part 4.
Important: The process is the same every time. Only the model changes.
We won’t just type code from scratch — we’ll use AI coding assistants to help us build the ML pipeline step by step.
Two categories of tools:
| Type | Examples | How It Works |
|---|---|---|
| Application (GUI) | Cursor, Codex (ChatGPT), Claude Desktop | IDE or chat interface, point-and-click |
| CLI (Terminal) | Claude Code, Codex CLI, OpenCode, Gemini CLI | Run directly in your terminal, works alongside your code |
Today we’ll use Claude Code in the terminal.
Installation (macOS / Windows / Linux):
npm install -g @anthropic-ai/claude-code
Then run claude in your project folder.
In a CLI coding environment, the AI assistant can read and edit your files, run shell commands, execute code, and inspect the output to fix its own errors.
You describe what you want in natural language, and the assistant builds the code.
Instead of copy-pasting code blocks, we’ll give instructions like:
“Load the SCF dataset from data/raw/SCF_with_Macro_and_Weights.csv. Show me the shape and the first few rows.”
“Build a scikit-learn pipeline with logistic regression. Use Age, Income, Net_Worth, Total_Fin_Asset, and Total_Debt as numerical features. Use Education, Work_Status, Marital_Status, Home_Ownership, and Risk_Aversion as categorical features.”
“Run 5-fold cross-validation with AUC scoring and show the results.”
The assistant writes the code, runs it, and shows you the output — all in the terminal.
The code below shows what the assistant would generate at each step. In class, we’ll build this live.
# Import everything we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
confusion_matrix, ConfusionMatrixDisplay,
classification_report, roc_curve, auc,
RocCurveDisplay, f1_score, accuracy_score
)
import warnings
warnings.filterwarnings('ignore')

Expected output:
Dataset shape: (55004, 65)
Survey years: [1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019, 2022]
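That output would come from a load step like the following (a sketch; the CSV path is the one from the demo prompt above, and a Year column is assumed, as in the feature discussion below):

df = pd.read_csv('data/raw/SCF_with_Macro_and_Weights.csv')
print(f'Dataset shape: {df.shape}')
print(f"Survey years: {sorted(df['Year'].unique())}")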
# Feature selection --- based on economic reasoning
numerical_features = ['Age', 'Income', 'Net_Worth', 'Total_Fin_Asset', 'Total_Debt']
categorical_features = ['Education', 'Work_Status', 'Marital_Status',
'Home_Ownership', 'Risk_Aversion']
target = 'Has_Total_Stock'
# Select only the columns we need
X = df[numerical_features + categorical_features]
y = df[target]
print(f'Features: {X.shape[1]} ({len(numerical_features)} numerical, '
f'{len(categorical_features)} categorical)')
print(f'Observations: {X.shape[0]:,}')
print(f'Target balance: {y.mean():.1%} hold stocks')

Numerical features — things we can measure:
- Age: lifecycle savings theory — older people accumulate more
- Income: more income → more to invest
- Net_Worth: wealth enables risk-taking
- Total_Fin_Asset: financial sophistication proxy
- Total_Debt: debt constrains investment capacity

Categorical features — characteristics:
- Education: financial literacy, information access
- Work_Status: employment stability
- Risk_Aversion: willingness to bear stock market risk
- Home_Ownership: existing asset base
- Marital_Status: household decision-making

Excluded features:
- Total_Stock_Value, Direct_Stock_Value — that’s the answer!
- Stock_Company_Count — also reveals the answer
- Year — we want features that generalise

Lesson: Feature selection requires domain knowledge, not just statistical criteria.
# 70% train, 30% test --- stratified to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.3,
random_state=42,
stratify=y # important for imbalanced classes!
)
print(f'Training set: {X_train.shape[0]:,} rows')
print(f'Test set: {X_test.shape[0]:,} rows')
print(f'Train participation rate: {y_train.mean():.1%}')
print(f'Test participation rate: {y_test.mean():.1%}')

stratify=y ensures both sets have the same proportion of stock holders.
Without this, one set might accidentally have more or fewer.
# Numerical features: fill missing --> standardise
numerical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical features: fill missing --> one-hot encode
categorical_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine into one preprocessor
preprocessor = ColumnTransformer([
('num', numerical_pipeline, numerical_features),
('cat', categorical_pipeline, categorical_features)
])

                 +-- Numerical Pipeline ----+
Age, Income, --> | Impute median            | --> Age: 0.3, -1.2 ...
Net_Worth ...    | Standardise              |     Income: 1.5, -0.8 ...
                 +--------------------------+

                 +-- Categorical Pipeline --+
Education,   --> | Impute most frequent     | --> Edu_College: 1, 0 ...
Work_Status ...  | One-hot encode           |     Work_Employed: 1, 0 ...
                 +--------------------------+
All of this happens automatically inside the pipeline.
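The full pipeline, referenced below as pipeline_lr, then adds the model as a second step (a minimal sketch, consistent with the grid search that follows):

pipeline_lr = Pipeline([
    ('preprocessor', preprocessor),                     # all cleaning steps
    ('classifier', LogisticRegression(max_iter=1000))   # the model
])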
That’s it. Two components:
- preprocessor — handles all data cleaning
- classifier — the actual model

When you call pipeline_lr.fit(X_train, y_train):
1. The preprocessor is fitted on X_train (impute → scale → encode)
2. The classifier is trained on the transformed features

Without a pipeline, you must remember to fit every transformation on the training data only and reapply each one, in order, before every predict. With a pipeline, it is one fit() and one predict(): the steps run in the right order automatically.
Pipelines prevent data leakage and reduce mistakes.
Hyperparameters = settings you choose before training.
For logistic regression:
C: regularisation strength (how much to penalise complex models)

# Search over different hyperparameter values
param_grid = {
'preprocessor__num__imputer__strategy': ['mean', 'median'],
'classifier__C': [0.01, 0.1, 1, 10, 100]
}
grid_lr = GridSearchCV(
pipeline_lr,
param_grid,
cv=5, # 5-fold cross-validation
scoring='roc_auc', # optimise for AUC
n_jobs=-1, # use all CPU cores
verbose=1
)
grid_lr.fit(X_train, y_train)

The __ (double underscore) navigates the pipeline structure:
preprocessor__num__imputer__strategy
     |         |      |         |
     |         |      |         +-- the actual parameter
     |         |      +-- step name in numerical_pipeline
     |         +-- name in ColumnTransformer
     +-- name in outer Pipeline

Imagine a dataset where 90% of people do NOT hold stocks.
A model that always predicts “No Stocks” gets 90% accuracy!
But it is completely useless — it never identifies a stock holder.
We need more nuanced metrics that look at different types of errors.
From the confusion matrix:
| Metric | Formula | Question It Answers |
|---|---|---|
| Sensitivity (Recall) | TP / (TP + FN) | Of all actual stock holders, how many did we find? |
| Specificity | TN / (TN + FP) | Of all non-holders, how many did we correctly identify? |
| Precision | TP / (TP + FP) | Of those we predicted as holders, how many actually are? |
| Accuracy | (TP + TN) / All | Overall, what fraction did we get right? |
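A sketch of how to get these counts for our model (grid_lr is the tuned pipeline from Part 3):

y_pred = grid_lr.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['No stocks', 'Stocks']).plot()
plt.show()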

Logistic regression outputs a probability (0 to 1).
We choose a threshold to convert it to a prediction: by default, probability ≥ 0.5 means predict “holds stocks”, otherwise “does not hold”. But we can move this cutoff.
There is always a tradeoff. Catching more positives = more false positives.
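A quick sketch of the tradeoff, sweeping the threshold on grid_lr’s predicted probabilities:

y_proba = grid_lr.predict_proba(X_test)[:, 1]
for thresh in (0.3, 0.5, 0.7):
    y_pred_t = (y_proba >= thresh).astype(int)
    print(f'threshold={thresh}: '
          f'accuracy={accuracy_score(y_test, y_pred_t):.3f}, '
          f'F1={f1_score(y_test, y_pred_t):.3f}')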

Lower the threshold → higher sensitivity, lower specificity.
The ROC curve plots sensitivity vs (1 - specificity) at every threshold.

AUC (Area Under the Curve): one number summarising overall quality. Higher = better.
| Metric | Focus | Use When… |
|---|---|---|
| Precision | Of predicted positives, how many are correct? | False positives are costly |
| Recall (= Sensitivity) | Of actual positives, how many did we find? | Missing positives is costly |
Stock participation example: if the model is used to target households for investor education, false positives waste outreach on non-holders (precision matters more); if the goal is to reach every household exposed to market risk, missing actual holders is worse (recall matters more).
When you care about both precision and recall:
\[F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
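For example, precision 0.8 and recall 0.6 give \(F_1 = 2 \times \frac{0.8 \times 0.6}{0.8 + 0.6} = \frac{0.96}{1.4} \approx 0.69\), closer to the lower of the two numbers (the point of the harmonic mean).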
For predicting stock market participation, no single metric is decisive. In research, we typically report multiple metrics and let the reader judge.
# ROC curve for logistic regression
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(
grid_lr, X_test, y_test,
name='Logistic Regression',
ax=ax, color='steelblue', linewidth=2
)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random (AUC = 0.5)')
ax.set_title('ROC Curve --- Stock Market Participation', fontsize=14)
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

Remember our pipeline structure:
To try a new model, we only change one line.
# Gradient Boosting --- a powerful ensemble method
pipeline_gb = Pipeline([
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
param_grid_gb = {
'classifier__n_estimators': [100, 200],
'classifier__learning_rate': [0.05, 0.1],
'classifier__max_depth': [3, 5]
}
grid_gb = GridSearchCV(
pipeline_gb, param_grid_gb,
cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
grid_gb.fit(X_train, y_train)
print(f'Best CV AUC: {grid_gb.best_score_:.4f}')
print(f'Best params: {grid_gb.best_params_}')

# Random Forest --- another ensemble method
pipeline_rf = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
])
param_grid_rf = {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [5, 10, None],
'classifier__min_samples_leaf': [1, 5]
}
grid_rf = GridSearchCV(
pipeline_rf, param_grid_rf,
cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
grid_rf.fit(X_train, y_train)
print(f'Best CV AUC: {grid_rf.best_score_:.4f}')

# The pattern: only the classifier changes
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=200),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=200),
'SVM': SVC(probability=True),
'KNN': KNeighborsClassifier(),
'Neural Network': MLPClassifier(max_iter=500),
}
for name, model in models.items():
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', model)])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
print(f'{name:25s} AUC = {scores.mean():.4f} +/- {scores.std():.4f}')

# Side-by-side ROC curves
fig, ax = plt.subplots(figsize=(9, 7))
colors = {'Logistic Regression': 'steelblue',
'Gradient Boosting': 'darkgreen',
'Random Forest': 'darkorange'}
for name, grid in [('Logistic Regression', grid_lr),
('Gradient Boosting', grid_gb),
('Random Forest', grid_rf)]:
RocCurveDisplay.from_estimator(
grid, X_test, y_test,
name=name, ax=ax,
color=colors[name], linewidth=2
)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.3)
ax.set_title('Model Comparison --- ROC Curves', fontsize=14)
ax.legend(fontsize=11, loc='lower right')
plt.tight_layout()
plt.show()

# Build comparison table
results = []
for name, grid in [('Logistic Regression', grid_lr),
('Gradient Boosting', grid_gb),
('Random Forest', grid_rf)]:
y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
results.append({
'Model': name,
'CV AUC': f'{grid.best_score_:.4f}',
'Test AUC': f'{auc(fpr, tpr):.4f}',
'Test Accuracy': f'{accuracy_score(y_test, y_pred):.4f}',
'Test F1': f'{f1_score(y_test, y_pred):.4f}'
})
pd.DataFrame(results).set_index('Model')

Simple models (Logistic Regression): interpretable coefficients and fast training, but they only capture the patterns you specify.
Complex models (Gradient Boosting): capture non-linearities and interactions, often the highest accuracy, but slower to tune and harder to interpret.
No single model is always best. Compare fairly on test data and choose based on your goals.
We have a model that predicts stock participation with good AUC.
But why does it predict what it predicts? Which features drive the predictions, and in which direction?
SHAP (SHapley Additive exPlanations) answers these questions.
For each prediction, SHAP tells you:
How much did each feature contribute to this particular prediction?
Based on Shapley values from cooperative game theory: a feature’s contribution is its average marginal effect on the prediction, taken over all possible orders in which features could be added.
import shap
# Use the best model (e.g., Gradient Boosting)
best_model = grid_gb.best_estimator_
# Get the preprocessed test data
X_test_processed = best_model.named_steps['preprocessor'].transform(X_test)
# Get feature names after one-hot encoding
feature_names = (numerical_features +
list(best_model.named_steps['preprocessor']
.named_transformers_['cat']
.named_steps['encoder']
.get_feature_names_out(categorical_features)))
# Compute SHAP values
explainer = shap.TreeExplainer(best_model.named_steps['classifier'])
shap_values = explainer.shap_values(X_test_processed)
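A summary plot then shows every feature’s global importance and direction of effect (a sketch; summary_plot is the standard shap call):

shap.summary_plot(shap_values, X_test_processed, feature_names=feature_names)

How to read this plot: each row is a feature, each dot one household. The horizontal position is the SHAP value (how far that feature pushed the household’s prediction up or down), and the colour is the feature’s own value (red = high, blue = low).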
Expected patterns (from economic theory):
| Feature | Expected Effect | Why |
|---|---|---|
| Net Worth ↑ | More likely to hold stocks | Wealth enables risk-taking |
| Income ↑ | More likely | More to invest |
| Education (College+) | More likely | Financial literacy |
| Risk Aversion (high) | Less likely | Unwilling to bear risk |
| Age | Hump-shaped? | Lifecycle savings |
This is where ML meets economics: the model’s learned patterns should align with theory. If they don’t — that’s even more interesting.
The waterfall plot shows how one household’s prediction is built up: starting from the base value (the average model output), each feature’s SHAP value pushes the prediction up or down until it reaches the final output.
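A sketch of the corresponding call (shap.Explanation and shap.plots.waterfall are the standard shap API; household index 0 is arbitrary):

base_value = np.ravel(explainer.expected_value)[0]  # average model output (log-odds)
explanation = shap.Explanation(
    values=shap_values[0],              # one household's SHAP values
    base_values=base_value,
    data=X_test_processed[0],
    feature_names=feature_names
)
shap.plots.waterfall(explanation)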
ML is not just about getting a high AUC.
It’s about learning something useful from the data.
1. Why ML? --> Prediction from data, different from causal inference
2. The Process --> Split --> Preprocess --> Train --> Evaluate --> Compare
3. Live Demo --> Complete pipeline with SCF data
4. Evaluation --> Confusion matrix, ROC/AUC, precision/recall, F1
5. Model Comparison --> Swap algorithms in one line with Pipeline
6. SHAP --> Open the black box, connect to economic theory
1. Process over algorithms
The workflow is the same for any model. Master the process.
2. Pipelines prevent mistakes
Preprocessing + model in one object. No data leakage. Easy to swap.
3. Prediction is just the beginning
SHAP and interpretability connect ML back to understanding — which is the real goal of research.
# The template you can reuse for any classification problem:
# 1. Load and explore data
# 2. Define X (features) and y (target)
# 3. Train-test split with stratify
# 4. Build preprocessing pipeline
# 5. Build full pipeline (preprocessor + model)
# 6. GridSearchCV for hyperparameter tuning
# 7. Evaluate on test set
# 8. Try other models (just swap the classifier)
# 9. Compare with ROC curves
# 10. Interpret with SHAP