contagion model · type 1 stress · v2 — trend-aware

Fundamental Stress Pipeline

Computes a size-neutral, trend-aware stress score S(bank) ∈ [0,1] for the latest year using 2023–2025 data. The current position (2025) and the trajectory (slope across the three years) feed a single PCA, and the resulting score is ready to use as the initial shock vector in a contagion model.

Pipeline flow:
MongoDB, 2023–25 (raw input)
→ Flatten to (bank, year) (step 2)
→ Impute missing (step 3)
→ Winsorize 5–95% (step 4)
→ Z-score per year (step 5)
→ Flip signs (step 6)
→ Trend slopes (step 7, new)
→ PCA on 16 features (step 8)
→ S(bank) ∈ [0, 1] (output, 2025)

A bank's raw financial numbers cannot be compared directly. Two banks might both have strong fundamentals, but because one is ten times larger, its absolute Gross NPA and Net Profit numbers will dwarf the smaller bank's — creating a false signal of stress.

Size Bias — The Bug This Pipeline Solves

Canara Bank's Gross NPA: ₹46,159 cr  |  IDFC First's Gross NPA: ₹3,884 cr
Using absolute values, Canara scores stress = 1.0 even though its Net NPA % (0.7%) is lower than IDFC's (0.86%). Every metric must be a ratio — normalised by assets, deposits, or advances — so large and small banks are on equal footing.
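The effect can be reproduced with the two figures quoted above. This is a toy sketch: a two-bank min-max scaling stands in for the full scoring, and the column names are illustrative.

```python
import pandas as pd

# Figures quoted above; column names are illustrative.
banks = pd.DataFrame(
    {
        "gross_npa_cr": [46159.0, 3884.0],  # absolute value, grows with bank size
        "net_npa_pct": [0.70, 0.86],        # ratio to net advances, size-neutral
    },
    index=["Canara Bank", "IDFC First"],
)


def min_max(col: pd.Series) -> pd.Series:
    """Scale one column to [0, 1] across banks."""
    return (col - col.min()) / (col.max() - col.min())


ranked = banks.apply(min_max)
print(ranked)
```

On the absolute column Canara lands at the top of the scale; on the ratio column the ordering reverses, which is the argument for ratio-only inputs.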

v2 Addition — Trajectory Matters

A bank at NPA = 5% trending toward 8% is riskier than one stable at 5% — even if their 2025 snapshots look identical. v2 adds 8 trend-slope features (one per metric) computed from 2023→2024→2025 z-scores. PCA now sees both where the bank stands and where it is heading. Slopes are computed on the already-z-scored values so they capture relative deterioration vs. peers, not industry-wide macro moves.

01 Fetch & Flatten
Pull all bank documents from MongoDB performance_metrics. Each document contains yearly sub-objects (2023, 2024, 2025). Flatten into a single DataFrame with one row per (bank, year) pair, so each bank contributes up to three rows (1.5× the rows of the old 2-year version).
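A minimal sketch of the flattening step, working on in-memory dictionaries rather than a live MongoDB cursor. The document layout (a "bank" key plus one sub-object per year) and the metric values are assumptions for illustration; the real schema of performance_metrics is not shown here.

```python
import pandas as pd

YEARS = ["2023", "2024", "2025"]

# Hypothetical documents; field names and values are made up.
docs = [
    {
        "bank": "Bank A",
        "2023": {"Return on Assets (%)": 0.9},
        "2024": {"Return on Assets (%)": 1.0},
        "2025": {"Return on Assets (%)": 1.1},
    },
    {
        "bank": "Bank B",  # did not report 2023
        "2024": {"Return on Assets (%)": 0.4},
        "2025": {"Return on Assets (%)": 0.3},
    },
]


def flatten(documents, years=YEARS):
    """Return one row per (bank, year) pair that is present in the documents."""
    rows = []
    for doc in documents:
        for year in years:
            metrics = doc.get(year)
            if metrics is None:
                continue  # bank has no data for this year
            rows.append({"bank": doc["bank"], "year": year, **metrics})
    return pd.DataFrame(rows)


df = flatten(docs)
print(df)
```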
02 Handle Missing Values
Some banks don't report every metric every year. Missing values are imputed with the cross-sectional median for that year — a conservative choice that assigns average health, not false stress or false safety. Note that Provision Coverage Ratio has ~77% missing data; see the Data Quality section below.
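The per-year median fill can be sketched as follows. The column name roa and the values are illustrative; the point is that the median is taken within each year, not across the whole table.

```python
import numpy as np
import pandas as pd

METRICS = ["roa"]  # illustrative short name for one metric

df = pd.DataFrame(
    {
        "bank": ["A", "B", "C", "A", "B", "C"],
        "year": ["2024"] * 3 + ["2025"] * 3,
        "roa": [1.0, np.nan, 3.0, 0.5, 0.7, np.nan],
    }
)

for m in METRICS:
    # Median of the reporting banks in the same year, broadcast to every row.
    year_median = df.groupby("year")[m].transform("median")
    df[m] = df[m].fillna(year_median)

print(df)
```

Bank B's missing 2024 value becomes 2.0 (the 2024 median) and Bank C's missing 2025 value becomes 0.6 (the 2025 median).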
03 Winsorize (outlier fix)
Clip each metric at the 5th and 95th percentile within each year. Without this, a single distressed bank (e.g. NPA = 40% when peers are 1–3%) inflates the standard deviation and compresses everyone else's scores toward zero — destroying the signal.
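A sketch of per-year clipping at the 5th and 95th percentiles. The function name, column name, and sample values are illustrative, and pandas' default linear-interpolation quantile is assumed.

```python
import pandas as pd


def winsorize_by_year(df, metrics, lower=0.05, upper=0.95):
    """Clip each metric to its [lower, upper] quantile range within each year."""
    out = df.copy()
    for m in metrics:
        by_year = df.groupby("year")[m]
        lo = by_year.transform(lambda s: s.quantile(lower))
        hi = by_year.transform(lambda s: s.quantile(upper))
        out[m] = df[m].clip(lower=lo, upper=hi)
    return out


# Twenty ordinary banks plus one distressed outlier at 40%.
df = pd.DataFrame(
    {
        "year": ["2025"] * 21,
        "net_npa_pct": [1.0 + 0.1 * i for i in range(20)] + [40.0],
    }
)
clipped = winsorize_by_year(df, ["net_npa_pct"])
print(df["net_npa_pct"].std(), clipped["net_npa_pct"].std())
```

The outlier is pulled down to the 95th-percentile value, so the standard deviation used in the next step is no longer dominated by a single bank.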
04 Z-Score Normalise per Year (scale fix)
For each metric within each year: subtract the mean, divide by standard deviation. Normalising within each year means the benchmark is always the current year's peer group — a healthy NPA in 2023 may differ from 2025. Trend slopes are computed on these z-scores, so they capture relative moves vs. peers rather than absolute level shifts.
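A sketch of within-year z-scoring. It uses the sample standard deviation (pandas' default, ddof=1); whether the pipeline uses sample or population standard deviation is not stated, so treat that choice as an assumption.

```python
import pandas as pd


def zscore_by_year(df, metrics):
    """Standardise each metric against the same year's peer group."""
    out = df.copy()
    for m in metrics:
        by_year = df.groupby("year")[m]
        out[m] = (df[m] - by_year.transform("mean")) / by_year.transform("std")
    return out


df = pd.DataFrame(
    {
        "bank": list("ABCABC"),
        "year": ["2023"] * 3 + ["2025"] * 3,
        "net_npa_pct": [1.0, 2.0, 3.0, 2.0, 4.0, 9.0],
    }
)
z = zscore_by_year(df, ["net_npa_pct"])
print(z)
```

Bank A's raw NPA doubles between 2023 and 2025, yet it stays below its peers' average in both years, so its z-score remains negative. A metric with zero spread in a year would divide by zero and needs a guard in real use.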
05 Flip Signs for Direction Consistency
After z-scoring, a high value must always mean high stress. For "higher is better" metrics (CAR, ROA), a below-average bank has a negative z-score but is stressed, so the z-score is multiplied by −1. For "lower is better" metrics (NPA %, OpEx %), an above-average bank already has a positive z-score and needs no change. Formula: stressed_z = z × (−1 × direction), where direction = +1 for "higher is better" and −1 for "lower is better". This also makes trend slopes directionally consistent: positive slope = worsening.
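The sign flip can be sketched with a direction map. The short column names are illustrative; the convention assumed is direction = +1 for "higher is better" and −1 for "lower is better", which is what the formula stressed_z = z × (−1 × direction) requires.

```python
import pandas as pd

# +1: higher is better (CAR); -1: lower is better (NPA %). Names are illustrative.
DIRECTION = {"net_npa_pct": -1, "car": +1}


def align_direction(z, direction):
    """Apply stressed_z = z * (-1 * direction) so that high always means stressed."""
    stressed = z.copy()
    for metric, d in direction.items():
        stressed[metric] = z[metric] * (-1 * d)
    return stressed


z = pd.DataFrame(
    {"net_npa_pct": [0.8, -0.5], "car": [-1.2, 0.9]},
    index=["Weak Bank", "Strong Bank"],
)
stressed = align_direction(z, DIRECTION)
print(stressed)
```

The weak bank's below-average CAR (z = −1.2) becomes +1.2, while its above-average NPA (z = +0.8) is left unchanged.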
06 Compute Trend Features (new in v2)
For each bank, fit a linear slope across the 3 available years of sign-flipped z-scores for each metric. Produces 8 trend columns (trend_NPA, trend_CAR, …). Slope > 0 means the metric is worsening relative to peers year-on-year. Banks with only one year of data receive slope = 0. These slopes are then standardised (z-scored) before entering PCA so their scale matches the snapshot features.
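A sketch of the slope computation using an ordinary least-squares line from np.polyfit. Banks with a single year get slope 0, as stated above. How banks with exactly two years are handled is not specified; here the two-point line is used, which is an assumption. Column names are illustrative, and the input is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd


def trend_slopes(stressed, metrics):
    """Least-squares slope of each sign-flipped z-score over the years, per bank."""
    rows = {}
    for bank, grp in stressed.groupby("bank"):
        x = grp["year"].astype(int).to_numpy(dtype=float)
        x = x - x.mean()  # centring keeps the fit well conditioned
        rows[bank] = {
            f"trend_{m}": (
                float(np.polyfit(x, grp[m].to_numpy(dtype=float), 1)[0])
                if len(grp) >= 2
                else 0.0  # single year of data: no trend information
            )
            for m in metrics
        }
    return pd.DataFrame.from_dict(rows, orient="index")


stressed = pd.DataFrame(
    {
        "bank": ["A", "A", "A", "B", "C", "C", "C"],
        "year": ["2023", "2024", "2025", "2025", "2023", "2024", "2025"],
        "net_npa_pct": [0.0, 0.5, 1.0, 0.3, 1.0, 0.0, -1.0],
    }
)
slopes = trend_slopes(stressed, ["net_npa_pct"])

# Standardise across banks so slopes share the scale of the snapshot features.
trend_z = (slopes - slopes.mean()) / slopes.std(ddof=0)
print(slopes, trend_z, sep="\n")
```

Bank A worsens by 0.5 z-units per year, Bank C improves by 1.0 per year, and Bank B (one year only) gets 0.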
07 PCA: 16-Feature Stress Score (no arbitrary weights)
Take only the 2025 (LATEST_YEAR) rows and merge in the trend slopes → 16 features per bank (8 snapshot + 8 trend). Run PCA; PC1 is the dominant stress axis. The NPA loading is checked — if it is negative, scores are flipped. PC1 loadings are printed with a [trend] prefix on trend features so you can see which dimensions drive the score.
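A sketch of the PC1 extraction and sign check using scikit-learn's PCA. To stay short it uses 4 synthetic features (2 snapshot, 2 trend) instead of 16; the feature names, the synthetic data, and the helper name pc1_stress are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pc1_stress(features, anchor):
    """Project banks on PC1 and orient it so the anchor loading is positive."""
    pca = PCA(n_components=1)
    scores = pca.fit_transform(features.to_numpy())[:, 0]
    loadings = pd.Series(pca.components_[0], index=features.columns)
    if loadings[anchor] < 0:  # PCA sign is arbitrary; flip so high = stressed
        scores, loadings = -scores, -loadings
    return pd.Series(scores, index=features.index, name="pc1"), loadings


# Synthetic banks driven by one latent stress factor plus small noise.
rng = np.random.default_rng(0)
latent = np.linspace(-1.0, 1.0, 10)
weights = np.array([1.0, 0.8, 0.6, 0.5])
cols = ["net_npa_pct", "car", "trend_net_npa_pct", "trend_car"]
features = pd.DataFrame(
    np.outer(latent, weights) + rng.normal(0.0, 0.01, size=(10, 4)),
    columns=cols,
    index=[f"Bank {i}" for i in range(10)],
)

pc1, loadings = pc1_stress(features, anchor="net_npa_pct")
for name, w in loadings.items():
    tag = "[trend] " if name.startswith("trend_") else ""
    print(f"{tag}{name}: {w:+.3f}")
```

Because the sign of a principal component is arbitrary, the anchor check is what guarantees that a larger PC1 score means more stress.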
08 Normalise to [0, 1] & Save (output)
Min-max scale PC1 scores across all banks in the 2025 universe. Score of 0.0 = least stressed. Score of 1.0 = most stressed. Save to fundamental_stress_scores.csv. The column fundamental_stress_normalized is S(bank) — your initial shock vector for contagion. One row per bank, no per-year duplication.
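A sketch of the final scaling. The PC1 values are made up; the column name fundamental_stress_normalized is the one named above. The CSV is rendered to a string here, and passing a path such as fundamental_stress_scores.csv to to_csv would write the file instead.

```python
import pandas as pd


def normalise_scores(pc1: pd.Series) -> pd.Series:
    """Min-max scale PC1 so 0.0 is the least and 1.0 the most stressed bank."""
    return (pc1 - pc1.min()) / (pc1.max() - pc1.min())


pc1 = pd.Series([-2.0, 0.5, 3.0], index=["Bank A", "Bank B", "Bank C"])  # made-up values

out = pd.DataFrame(
    {
        "bank": pc1.index,
        "fundamental_stress_normalized": normalise_scores(pc1).to_numpy(),
    }
)
csv_text = out.to_csv(index=False)  # one row per bank
print(csv_text)
```

Min-max scaling is relative to the 2025 universe: adding or removing a bank at either extreme shifts every other bank's score.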

All metrics are percentages or ratios normalised by a size denominator (assets, deposits, advances). Absolute values like Gross NPA and Net Profit were deliberately excluded to prevent size bias. Each metric also generates a corresponding trend feature in v2.

Net NPA as % to Net Advances: ↓ lower is better
Capital Adequacy Ratio (Basel-III): ↑ higher is better
Provision Coverage Ratio (%): ↑ higher is better
Credit Deposit Ratio: ↓ lower is better
Investment Deposit Ratio: ↑ higher is better
Return on Assets (%): ↑ higher is better
Spread as % of Total Assets: ↑ higher is better
Operating Expenses as % to Total Expenses: ↓ lower is better

The pipeline reports missing percentages before imputation. Understanding these is critical — high missingness means the imputed median is doing the work, not real data.

Metric                                      | Missing % | Verdict
Net NPA as % to Net Advances                | 11.8%     | ACCEPTABLE
Capital Adequacy Ratio (Basel-III)          | 11.8%     | ACCEPTABLE
Provision Coverage Ratio (%)                | 77.3%     | PROBLEM
Credit Deposit Ratio                        | 19.3%     | MODERATE
Investment Deposit Ratio                    | 13.7%     | ACCEPTABLE
Return on Assets (%)                        | 11.8%     | ACCEPTABLE
Spread as % of Total Assets                 | 11.8%     | ACCEPTABLE
Operating Expenses as % to Total Expenses   | 11.8%     | ACCEPTABLE
Provision Coverage Ratio — Action Required

77.3% of Provision Coverage Ratio values are filled with the cross-sectional median, meaning only about 23% of values were actually reported. The imputed values are identical for all non-reporters within a year, so the metric contributes almost no unique signal to PCA. Consider either sourcing this data from a more complete dataset, or removing PCR from METRICS until coverage improves. The 11–14% missing on most other metrics is expected: foreign/small banks and cooperative banks often skip IBA sub-metrics. Credit Deposit Ratio, at 19.3%, sits somewhat higher and is flagged MODERATE.
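A pre-imputation missingness report along these lines could be sketched as below. The verdict cut-offs (15% and 50%) are assumptions chosen to be consistent with the table above; the pipeline's actual thresholds are not stated. Column names and data are illustrative.

```python
import numpy as np
import pandas as pd


def missing_report(df, metrics, moderate=15.0, problem=50.0):
    """Percentage of missing values per metric, with a coarse verdict.

    The cut-offs are hypothetical, not taken from the pipeline.
    """
    pct = df[metrics].isna().mean() * 100

    def verdict(p):
        if p > problem:
            return "PROBLEM"
        if p > moderate:
            return "MODERATE"
        return "ACCEPTABLE"

    return pd.DataFrame({"missing_pct": pct.round(1), "verdict": pct.map(verdict)})


df = pd.DataFrame(
    {
        "car": [12.0] * 9 + [np.nan],            # 10% missing
        "cd_ratio": [70.0] * 8 + [np.nan] * 2,   # 20% missing
        "pcr": [80.0] * 2 + [np.nan] * 8,        # 80% missing
    }
)
report = missing_report(df, ["car", "cd_ratio", "pcr"])
print(report)
```

Running the report before imputation matters: once the medians are filled in, the gaps are no longer visible in the data.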

// Step 1: Winsorize (per metric j, within year t)
x̃ᵢⱼₜ = clip(xᵢⱼₜ, p05ⱼₜ, p95ⱼₜ)
 
// Step 2: Z-score within each year t
zᵢⱼₜ = (x̃ᵢⱼₜ − μⱼₜ) / σⱼₜ  // μ, σ computed within year t
 
// Step 3: Align direction (high z = high stress)
sᵢⱼₜ = zᵢⱼₜ × (−1 × dⱼ)  // dⱼ = +1 if higher is better, −1 if lower is better
 
// Step 4: Trend slope per metric (new in v2)
βᵢⱼ = slope of {sᵢⱼ,2023, sᵢⱼ,2024, sᵢⱼ,2025}  // linear regression; β > 0 = worsening
 
// Step 5: PCA on 16-feature vector per bank (LATEST_YEAR only)
vᵢ = [ sᵢ,2025 (8)  ||  std(βᵢ) (8) ]
Fᵢ = wᵀ · vᵢ  // w = PC1 loadings, learned from data
 
// Step 6: Normalise
S(bank) = (Fᵢ − min F) / (max F − min F)  ∈ [0, 1]

Example of what the ranked output looks like (one row per bank, no per-year duplication). Higher bar = more stressed relative to the 2025 peer universe, incorporating 2023–2025 trajectory.

[Chart: banks ranked by S(bank); axis runs from 0.0 (healthiest) to 1.0 (most stressed)]
Output Used As
df['fundamental_stress_normalized']

→ Initial shock vector S(bank) [one score per bank, 2025]
→ One scalar per node in your contagion graph
→ Higher S = bank enters the cascade more stressed
→ Also usable as threshold: S > 0.6 = contagion-eligible
→ Score reflects BOTH current position AND 3-year trajectory
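A sketch of how the output column might be consumed downstream. The scores are made up, and the 0.6 cut-off is the one quoted above; how a contagion model actually ingests the vector is outside this pipeline.

```python
import pandas as pd

THRESHOLD = 0.6  # contagion-eligibility cut-off quoted above

# Made-up scores in the shape of fundamental_stress_scores.csv.
scores = pd.DataFrame(
    {
        "bank": ["Bank A", "Bank B", "Bank C"],
        "fundamental_stress_normalized": [0.15, 0.62, 1.0],
    }
)

# One scalar per node: the initial shock vector S(bank).
shock = scores.set_index("bank")["fundamental_stress_normalized"]

# Banks that enter the cascade as contagion-eligible.
eligible = shock[shock > THRESHOLD].index.tolist()
print(eligible)
```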
# Install dependencies
pip install pymongo pandas scikit-learn numpy python-dotenv

# .env file (repo root)
MONGO_URI=mongodb+srv://your-uri
MONGO_DB=financial_kg
MONGO_COLLECTION=performance_metrics

# Run
make run-bank-stress-mapper
# or: python engine/stress/bank_stress_mapper.py
# → outputs: fundamental_stress_scores.csv (one row per bank)

# Config (top of bank_stress_mapper.py)
YEARS = ["2023", "2024", "2025"]   # years used for trend computation
LATEST_YEAR = "2025"               # year whose score is output