ML / Classification Project
Bank Term Deposit Prediction
A two-stage machine learning pipeline predicting whether a bank client will subscribe to a term deposit after a phone marketing campaign.
Project objective
Maximise recall, with F2 as the guardrail.
The positive class is small, so accuracy alone is misleading. The model is designed to catch as many likely subscribers as possible, while the F2 score, which weights recall more heavily than precision, keeps precision from collapsing.
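For reference, F2 is the β = 2 case of the general F-beta score. The worked numbers below plug in the precision and recall reported for Logistic Reg. in the Stage 1 full metrics table; the symbols P (precision) and R (recall) are shorthand introduced here.

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_2 = \frac{5\,P\,R}{4P + R}

% Logistic Reg., pre-tuning: P = 0.278, R = 0.878
F_2 = \frac{5 \times 0.278 \times 0.878}{4 \times 0.278 + 0.878}
    = \frac{1.2204}{1.990} \approx 0.613
```

The small gap to the reported 61.4% comes from using rounded inputs.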
| Metric | Role |
|---|---|
| Recall | Primary |
| F2 | Guardrail |
| ROC-AUC | Tracked |
| Lift | Business |
Problem understanding
Predict subscriber intent.
Only 11.7% of clients subscribed. A naive model that predicts “no” for everyone would still reach about 88% accuracy while finding no subscribers at all.
Because missing a real subscriber is worse than making an extra call, the pipeline prioritises recall and uses F2 as the model-selection metric.
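The accuracy trap described above can be checked in a few lines. This is a toy sketch: the 1,000 synthetic labels are an assumption chosen only to mirror the 11.7% positive rate, not the project's data.

```python
# Toy labels mirroring the 11.7% positive rate (117 subscribers out of 1,000).
labels = [1] * 117 + [0] * 883

# A naive model that always predicts "no" (class 0).
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall: share of real subscribers the model actually finds.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

High accuracy with zero recall is exactly the failure mode that a recall-first, F2-guarded selection is meant to rule out.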
Target distribution
Dataset split
70 / 15 / 15 · stratified on target
45,211 records · random_state = 42
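A minimal standard-library sketch of a stratified 70 / 15 / 15 split follows. The function name `stratified_split` and the toy labels are hypothetical; the project itself may well rely on a library splitter, so this only illustrates the idea of splitting each class separately with a fixed seed.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with class shares preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy target with the same 11.7% positive rate as the dataset.
labels = [1] * 117 + [0] * 883
train, val, test = stratified_split(labels)
train_positive_rate = sum(labels[i] for i in train) / len(train)
print(len(train), len(val), len(test), round(train_positive_rate, 3))
```

Because each class is split on its own, every subset keeps roughly the same share of subscribers as the full dataset.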
Input features
15 features + 1 dropped
Compact feature summary with type, encoding notes, and missing-value strategy.
| Feature | Type | Description | Values / Encoding | Missing strategy |
|---|---|---|---|---|
| age | numeric | Client age in years. | 18 – 95 | No special missing-value handling required. |
Exploratory data analysis
What the data tells us
The main signals are class imbalance, right-skewed numeric features, and a strong duration effect.
Numeric feature IQR distributions
| Feature | Range (min → max) | Outliers |
|---|---|---|
| age | 18 → 95 | 4.9% |
| balance | -8,019 → 102,127 | 9.8% |
| day | 1 → 31 | 0% |
| duration | 0 → 4,918 | 7.1% |
| campaign | 1 → 63 | 8.2% |
| pdays | -1 → 871 | 12.4% |
| previous | 0 → 275 | 14.3% |
For most records pdays = −1, meaning the client was never previously contacted. The duration, balance, campaign, and previous features are strongly right-skewed.
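The outlier percentages above are presented under an IQR heading, so a plausible reading is Tukey's 1.5 × IQR rule. The sketch below assumes that rule; the function name `iqr_outlier_share`, the multiplier `k`, and the sample values are illustrative rather than taken from the project.

```python
from statistics import quantiles

def iqr_outlier_share(values, k=1.5):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between observed data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return len(outliers) / len(values)

# Small right-skewed sample: one extreme value in ten.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 60]
share = iqr_outlier_share(sample)
print(f"{share:.1%}")
```

A long right tail, as in duration or previous, pushes many points above the upper fence, which would be consistent with the higher outlier shares of the skewed features.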
Feature types breakdown
| Type | Count |
|---|---|
| Numeric | 7 |
| Categorical | 4 |
| Binary | 3 |
| Ordinal | 2 |
Charts: job distribution, contact month, education level, marital status.
Missing value summary
| Feature | Missing | % | Strategy | Fill |
|---|---|---|---|---|
| job | 74 | 0.16% | Mode fill | 'blue-collar' |
| education | 174 | 0.39% | Mode fill | 'secondary' |
| contact | 1,324 | 2.93% | Mode fill | 'unknown' |
| poutcome | 36,959 | 81.7% | DROPPED | — |
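The mode-fill strategy in the table can be sketched with the standard library. The helper names `fit_modes` and `fill_missing` and the toy rows are hypothetical; the point is that the fill value is learned from the training rows only and then reused on other splits.

```python
from collections import Counter

def fit_modes(train_rows, columns):
    """Learn the most frequent non-missing value per column from training rows."""
    modes = {}
    for col in columns:
        observed = [row[col] for row in train_rows if row[col] is not None]
        modes[col] = Counter(observed).most_common(1)[0][0]
    return modes

def fill_missing(rows, modes):
    """Return copies of rows with missing values replaced by the learned modes."""
    return [
        {**row, **{c: m for c, m in modes.items() if row.get(c) is None}}
        for row in rows
    ]

train_rows = [
    {"job": "blue-collar", "education": "secondary"},
    {"job": "blue-collar", "education": None},
    {"job": "admin.", "education": "secondary"},
    {"job": None, "education": "tertiary"},
]
val_rows = [{"job": None, "education": None}]

modes = fit_modes(train_rows, ["job", "education"])
filled_val = fill_missing(val_rows, modes)
print(modes, filled_val)
```

Storing `modes` alongside the model lets the same fill values be applied at inference time.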
Data preprocessing
Cleaning, encoding & balancing
Every transformation is fitted on the training set only and then saved, which avoids leakage and keeps inference reproducible.
Fill missing values
job · 'blue-collar' · 74 rows
education · 'secondary' · 174 rows
contact · 'unknown' · 1,324 rows
poutcome · DROPPED · 36,959 rows
SMOTENC oversampling
Applied only on the training set to avoid leakage. It balances the minority class while respecting categorical feature boundaries.
| Setting | Value |
|---|---|
| Strategy | 0.25 |
| Scope | Train only |
| Handles | Mixed types |
| Purpose | Recall boost |
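To make the 0.25 strategy concrete, the sketch below assumes it plays the role of imbalanced-learn's float `sampling_strategy`, that is, the target ratio of minority to majority rows after resampling. That mapping is an assumption, and the function name and class counts are illustrative. For context, an 11.7% positive rate corresponds to a starting ratio of roughly 0.13.

```python
def synthetic_rows_needed(n_minority, n_majority, ratio=0.25):
    """Approximate number of synthetic minority rows needed to reach the target ratio."""
    target_minority = int(n_majority * ratio)
    return max(0, target_minority - n_minority)

# Hypothetical training-set class counts (not the project's real numbers).
n_minority, n_majority = 130, 1000
needed = synthetic_rows_needed(n_minority, n_majority)

# Positive share of the training set after oversampling.
new_share = (n_minority + needed) / (n_minority + needed + n_majority)
print(needed, new_share)
```

A ratio of 0.25 leaves the training set imbalanced on purpose (about one positive in five rows), which boosts recall without fully equalising the classes.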
Feature encoding
StandardScaler
Fitted on the training data only, then applied to validation and test sets.
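A minimal sketch of train-only scaling using the standard library. The helper names and the toy ages are illustrative; like scikit-learn's StandardScaler, it standardises with the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn mean and population standard deviation from training data only."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

train_age = [20, 30, 40, 50, 60]
val_age = [40, 75]  # scaled with the training statistics, never refitted

mean, std = fit_scaler(train_age)
scaled_train = transform(train_age, mean, std)
scaled_val = transform(val_age, mean, std)
print(mean, round(std, 3), scaled_val)
```

Validation and test values may fall outside the training range after scaling; that is expected and is the price of not leaking their statistics into the fit.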
Scaled numeric columns
age, balance, day, duration, campaign, pdays, previous
Pipeline
How the model pipeline works
Six reproducible stages from raw data to final model evaluation.
Problem & Goal
Predict term deposit subscription. Optimise for recall using F2 as the tuning guardrail.
EDA
Explore imbalance, distributions, outliers, missing values, and target correlations.
Preprocessing
Fill missing values, drop high-missing columns, encode categorical features, scale numeric columns.
Stage 1 Training
Train seven classifiers and shortlist the strongest candidates by recall and F2.
Stage 2 Tuning
Run a finer search on the winners and compare tuned models on validation performance.
Evaluation
Review final metrics, confusion matrices, feature importance, radar comparison, and lift curve.
Model training
Two-stage model selection
Stage 1 evaluates seven classifiers. Stage 2 fine-tunes the shortlisted winners.
Stage 1 — all models evaluated
Seven classifiers trained on preprocessed data
Recall vs F2 — pre-tuning
Full metrics table
| Model | Recall | F2 | Precision | Accuracy | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Reg. | 87.8% | 61.4% | 27.8% | 71.9% | 86.5% |
| Grad. Boost | 66.5% | 55.2% | 32.9% | 80.2% | 83.0% |
| Extra Trees | 76.1% | 54.8% | 25.9% | 71.7% | 81.1% |
| Rand. Forest | 30.2% | 30.7% | 31.0% | 82.8% | 76.6% |
| KNN | 35.4% | 35.1% | 25.1% | 76.8% | 73.0% |
| Dec. Tree | 39.3% | 38.9% | 20.0% | 73.2% | 66.8% |
| Baseline | 19.4% | 17.0% | 11.4% | 72.9% | 49.7% |
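The shortlisting step can be illustrated directly on the table above. The exact rule is not stated in the source, so the sketch assumes "top three by F2, ties broken by recall"; the function name `shortlist` and `top_k` are hypothetical.

```python
# (model, recall %, F2 %) copied from the Stage 1 full metrics table.
stage1 = [
    ("Logistic Reg.", 87.8, 61.4),
    ("Grad. Boost", 66.5, 55.2),
    ("Extra Trees", 76.1, 54.8),
    ("Rand. Forest", 30.2, 30.7),
    ("KNN", 35.4, 35.1),
    ("Dec. Tree", 39.3, 38.9),
    ("Baseline", 19.4, 17.0),
]

def shortlist(results, top_k=3):
    """Rank by F2 first, then recall, and keep the top_k model names."""
    ranked = sorted(results, key=lambda r: (r[2], r[1]), reverse=True)
    return [name for name, _, _ in ranked[:top_k]]

winners = shortlist(stage1)
print(winners)
```

Under this assumed rule the shortlist is Logistic Regression, Gradient Boosting, and Extra Trees, the same three models that appear in the tuning tables.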
Model selection
Recall-specificity goodness plot
Closer to the top-right ideal point means stronger recall with fewer false positives.
Hyperparameter search — Stage 1
GridSearchCV · cv=3 · F2 scorer
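A minimal scikit-learn sketch of a grid search scored by F2 with three folds. The synthetic dataset, the logistic-regression estimator settings, and the `C` grid are assumptions made for illustration; the project's actual grids are not shown in the source.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=42
)

# F2 scorer: beta=2 weights recall more heavily than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Hypothetical grid over the regularisation strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid=param_grid,
    scoring=f2_scorer,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the scorer is F2, the search picks the setting with the best recall-weighted balance rather than the highest accuracy.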
Before Stage 1 tuning
After Stage 1 tuning
Improvement after Stage 1 tuning
| Model | Recall before | Recall after | Δ Recall (pp) | F2 before | F2 after | Δ F2 (pp) |
|---|---|---|---|---|---|---|
| LR ★ | 87.8% | 90.8% | +3.0 | 61.4% | 64.9% | +3.5 |
| Grad. Boost | 66.5% | 71.0% | +4.5 | 55.2% | 58.1% | +2.9 |
| Extra Trees | 76.1% | 79.2% | +3.1 | 54.8% | 56.7% | +1.9 |
Hyperparameter search — Stage 2
Finer grid on Stage 1 winners
Before Stage 2 tuning
After Stage 2 tuning
Improvement after Stage 2 tuning
| Model | Recall before | Recall after | Δ Recall (pp) | F2 before | F2 after | Δ F2 (pp) |
|---|---|---|---|---|---|---|
| LR Stage 1 ★ | 80.6% | 90.8% | +10.2 | 56.2% | 64.9% | +8.7 |
| Grad. Boost | 74.1% | 71.0% | -3.1 | 53.1% | 58.1% | +5.0 |
| Extra Trees | — | 79.2% | — | — | 56.7% | — |
| Baseline | — | 19.4% | — | — | 17.0% | — |
Final results
Model evaluation
The final model, LR (★), prioritises finding subscribers: it has the highest recall (90.8%) and also the highest F2 (64.9%) among the compared models.
Final Recall vs F2
Final metrics table
| Model | Recall | F2 | Precision | Accuracy | ROC-AUC |
|---|---|---|---|---|---|
| LR Stage 1 ★ | 90.8% | 64.9% | 26.9% | 69.3% | 87.3% |
| Grad. Boost | 71.0% | 58.1% | 32.1% | 79.6% | 84.3% |
| Extra Trees | 79.2% | 56.7% | 25.5% | 70.6% | 82.0% |
| Baseline | 19.4% | 17.0% | 11.4% | 72.9% | 49.7% |
Confusion matrices — final models
| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| LR S1 ★ | 4,181 (61.6%) | 1,807 (26.6%) | 97 (1.4%) | 697 (10.3%) |
| GB S1 | 4,913 (72.4%) | 1,075 (15.9%) | 266 (3.9%) | 528 (7.8%) |
| ET S2 | 4,257 (62.8%) | 1,731 (25.5%) | 190 (2.8%) | 604 (8.9%) |
| Baseline | 4,787 (70.6%) | 1,201 (17.7%) | 640 (9.4%) | 154 (2.3%) |
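The headline metrics can be recomputed from confusion-matrix counts. The sketch below uses the LR counts listed above (TN 4,181, FP 1,807, FN 97, TP 697); the function name `metrics_from_counts` is illustrative.

```python
def metrics_from_counts(tn, fp, fn, tp, beta=2.0):
    """Derive recall, precision, accuracy, and F-beta from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    b2 = beta ** 2
    # Count form of F-beta: (1+b2)*TP / ((1+b2)*TP + b2*FN + FP)
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_beta": f_beta,
    }

lr = metrics_from_counts(tn=4181, fp=1807, fn=97, tp=697)
for name, value in lr.items():
    print(f"{name}: {value:.1%}")
```

With these counts the result is 87.8% recall, 27.8% precision, 71.9% accuracy, and 61.4% F2. That lines up with the pre-tuning Logistic Reg. row of the Stage 1 full metrics table rather than the tuned 90.8% recall, so the matrices may have been captured before tuning.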
Multi-metric radar comparison
Cumulative lift curve
Dashed line = random baseline. Top deciles capture subscribers faster than random targeting.
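Cumulative lift compares the share of subscribers captured in the top-scored fraction of clients with the share that random targeting would capture. The sketch below is a standard-library illustration; the function name `cumulative_lift`, the scores, and the labels are made up for the example.

```python
def cumulative_lift(y_true, scores, fraction):
    """Lift when contacting only the top `fraction` of clients ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    captured = sum(label for _, label in ranked[:n_top])
    capture_rate = captured / sum(y_true)       # share of all subscribers found
    return capture_rate / (n_top / len(ranked))  # random targeting gives 1.0

# Toy example: 3 subscribers among 10 clients, scored by a hypothetical model.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

lift_top_20 = cumulative_lift(y_true, scores, 0.2)
lift_all = cumulative_lift(y_true, scores, 1.0)
print(round(lift_top_20, 2), lift_all)
```

A lift above 1 in the top deciles means calling the highest-scored clients first finds subscribers faster than the random baseline; at 100% of clients the lift is always 1.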
Feature importance
Feature importance table
| Feature | Importance | Rank |
|---|---|---|
| duration | 0.312 | #1 |
| balance | 0.118 | #2 |
| age | 0.094 | #3 |
| pdays | 0.087 | #4 |
| day | 0.076 | #5 |
| campaign | 0.068 | #6 |
| previous | 0.052 | #7 |
| month | 0.041 | #8 |
| job | 0.038 | #9 |
| education | 0.033 | #10 |