Fundamental Stress Pipeline

end-to-end flow

raw input: MongoDB 2023–25
→ step 2: Flatten (bank, year)
→ step 3: Impute Missing
→ step 4: Winsorize 5–95%
→ step 5: Z-Score per Year
→ step 6: Flip Signs
→ step 7 (new): Trend Slopes
→ step 8: PCA, 16 feat.
→ output, 2025: S(bank) [0, 1]

the core problem

A bank's raw financial numbers cannot be compared directly. Two banks might both have strong fundamentals, but because one is ten times larger, its absolute Gross NPA and Net Profit numbers will dwarf the smaller bank's — creating a false signal of stress.

Size Bias — The Bug This Pipeline Solves

Canara Bank's Gross NPA: ₹46,159 cr | IDFC First's Gross NPA: ₹3,884 cr
Using absolute values, Canara scores stress = 1.0 even though its Net NPA % (0.7%) is lower than IDFC's (0.86%). Every metric must be a ratio — normalised by assets, deposits, or advances — so large and small banks are on equal footing.
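The size-bias effect can be reproduced with the two banks quoted above. This is a minimal sketch that uses only the figures from the text; the two-bank min-max scaling is an illustrative assumption, not necessarily how the pipeline produced the stress = 1.0 result.

```python
# Figures quoted in the text; scaling over just two banks is illustrative only.
banks = {
    "Canara Bank": {"gross_npa_cr": 46159.0, "net_npa_pct": 0.70},
    "IDFC First": {"gross_npa_cr": 3884.0, "net_npa_pct": 0.86},
}


def min_max(values):
    """Scale a dict of name -> value to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


abs_score = min_max({b: m["gross_npa_cr"] for b, m in banks.items()})
ratio_score = min_max({b: m["net_npa_pct"] for b, m in banks.items()})

print(abs_score)    # Canara ranks as most stressed on the absolute figure
print(ratio_score)  # IDFC First ranks as most stressed on the ratio
```

The ranking reverses once the metric is a ratio, which is the reason every input to the pipeline is size-normalised.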

v2 Addition — Trajectory Matters

A bank at NPA = 5% trending toward 8% is riskier than one stable at 5% — even if their 2025 snapshots look identical. v2 adds 8 trend-slope features (one per metric) computed from 2023→2024→2025 z-scores. PCA now sees both where the bank stands and where it is heading. Slopes are computed on the already-z-scored values so they capture relative deterioration vs. peers, not industry-wide macro moves.

pipeline steps

01

Fetch & Flatten

Pull all bank documents from MongoDB performance_metrics. Each document contains yearly sub-objects (2023, 2024, 2025). Flatten into a single DataFrame with one row per (bank, year) pair, so each bank contributes up to three rows (the old version covered two years).
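A minimal sketch of the flatten step, assuming each document carries a bank identifier plus one sub-object per year. The `bank` key and the `roa` metric name are hypothetical; only the yearly sub-objects are described in the text. The documents are inlined here rather than fetched from MongoDB.

```python
import pandas as pd

YEARS = ["2023", "2024", "2025"]

# Hypothetical documents: the "bank" key and metric names are assumptions.
docs = [
    {"bank": "Bank A", "2023": {"roa": 0.9}, "2024": {"roa": 1.0}, "2025": {"roa": 1.1}},
    {"bank": "Bank B", "2024": {"roa": 0.4}, "2025": {"roa": 0.3}},  # no 2023 report
]


def flatten(documents, years=YEARS):
    """Return one row per (bank, year) pair present in the documents."""
    rows = []
    for doc in documents:
        for year in years:
            metrics = doc.get(year)
            if isinstance(metrics, dict):  # skip years the bank did not report
                rows.append({"bank": doc["bank"], "year": year, **metrics})
    return pd.DataFrame(rows)


flat = flatten(docs)
print(flat)
```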

02

Handle Missing Values

Some banks don't report every metric every year. Missing values are imputed with the cross-sectional median for that year — a conservative choice that assigns average health, not false stress or false safety. Note that Provision Coverage Ratio has ~77% missing data; see the Data Quality section below.
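The imputation described above could look roughly like this in pandas. It is a sketch, assuming a flattened frame with a `year` column; the `roa` column name is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bank": ["A", "B", "C", "A", "B", "C"],
    "year": ["2024"] * 3 + ["2025"] * 3,
    "roa": [1.0, np.nan, 3.0, 2.0, 4.0, np.nan],
})


def impute_year_median(frame, metrics):
    """Fill NaNs in each metric with the median of the same year."""
    out = frame.copy()
    for col in metrics:
        # transform("median") ignores NaNs and broadcasts back to every row
        year_median = out.groupby("year")[col].transform("median")
        out[col] = out[col].fillna(year_median)
    return out


imputed = impute_year_median(df, ["roa"])
print(imputed)
```

Bank B's 2024 gap is filled with the 2024 median and Bank C's 2025 gap with the 2025 median, so a non-reporter never borrows a value from a different year.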

03

Winsorize (outlier fix)

Clip each metric at the 5th and 95th percentile within each year. Without this, a single distressed bank (e.g. NPA = 40% when peers are 1–3%) inflates the standard deviation and compresses everyone else's scores toward zero — destroying the signal.
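A sketch of per-year winsorizing, assuming the same flattened frame layout. The function name and the `net_npa_pct` column are hypothetical; the 5th/95th percentile bounds come from the text.

```python
import pandas as pd


def winsorize_by_year(frame, metrics, lower=0.05, upper=0.95):
    """Clip each metric to its [lower, upper] quantile range within each year."""
    out = frame.copy()
    for col in metrics:
        grouped = out.groupby("year")[col]
        lo = grouped.transform(lambda s: s.quantile(lower))
        hi = grouped.transform(lambda s: s.quantile(upper))
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


# Twenty ordinary banks (NPA 1.0% to 2.9%) plus one distressed bank at 40%.
df = pd.DataFrame({
    "year": ["2025"] * 21,
    "net_npa_pct": [1.0 + 0.1 * i for i in range(20)] + [40.0],
})
clipped = winsorize_by_year(df, ["net_npa_pct"])

# The standard deviation drops sharply once the 40% outlier is clipped.
print(df["net_npa_pct"].std(), clipped["net_npa_pct"].std())
```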

04

Z-Score Normalise per Year (scale fix)

For each metric within each year: subtract the mean, divide by standard deviation. Normalising within each year means the benchmark is always the current year's peer group — a healthy NPA in 2023 may differ from 2025. Trend slopes are computed on these z-scores, so they capture relative moves vs. peers rather than absolute level shifts.
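One way to z-score within each year is sketched below. The use of the population standard deviation (`ddof=0`) and the zero-variance guard are assumptions; the text does not specify either.

```python
import numpy as np
import pandas as pd


def zscore_by_year(frame, metrics):
    """Subtract the year's mean and divide by the year's standard deviation."""
    out = frame.copy()
    for col in metrics:
        grouped = out.groupby("year")[col]
        mean = grouped.transform("mean")
        # Population std (ddof=0) is an assumption, not stated in the text.
        std = grouped.transform(lambda s: s.std(ddof=0))
        # A metric that is constant within a year has std 0; map it to z = 0.
        z = (out[col] - mean) / std.replace(0, np.nan)
        out[col] = z.fillna(0.0)
    return out


df = pd.DataFrame({
    "bank": ["A", "B", "C"] * 2,
    "year": ["2023"] * 3 + ["2025"] * 3,
    "net_npa_pct": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})
z = zscore_by_year(df, ["net_npa_pct"])
print(z)
```

In this toy data NPA levels double between 2023 and 2025, yet each bank keeps the same z-score because its position relative to that year's peers is unchanged.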

05

Flip Signs for Direction Consistency

After z-scoring, HIGH value must always mean HIGH STRESS. For "higher is better" metrics (CAR, ROA), a below-average bank has a negative z-score but is stressed, so multiply by −1. For "lower is better" metrics (NPA %, OpEx %), an above-average bank already has a positive z-score and is left unchanged. Formula: stressed_z = z × (−1 × direction), where direction = +1 for "higher is better" metrics and −1 for "lower is better" metrics. This also makes trend slopes directionally consistent: positive slope = worsening.
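The sign flip can be sketched as a lookup of one direction flag per metric. The column names and the `DIRECTION` mapping below are hypothetical; the convention (+1 for "higher is better", −1 for "lower is better") follows from the formula in the text.

```python
import pandas as pd

# +1 = higher is better, -1 = lower is better (column names are illustrative).
DIRECTION = {"net_npa_pct": -1, "car": +1}


def align_direction(frame, direction):
    """Apply stressed_z = z * (-1 * direction) so that high always means stressed."""
    out = frame.copy()
    for col, d in direction.items():
        out[col] = out[col] * (-1 * d)
    return out


z = pd.DataFrame({
    "bank": ["A", "B"],
    "net_npa_pct": [1.5, -0.5],  # A has above-average NPA: already "stressed"
    "car": [-1.0, 2.0],          # A has below-average capital: must flip
})
stressed = align_direction(z, DIRECTION)
print(stressed)
```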

06

Compute Trend Features (new in v2)

For each bank, fit a linear slope across the available years (up to 3) of sign-flipped z-scores for each metric. This produces 8 trend columns (trend_NPA, trend_CAR, …). Slope > 0 means the metric is worsening relative to peers year-on-year. Banks with only one year of data receive slope = 0. These slopes are then standardised (z-scored) before entering PCA so their scale matches the snapshot features.
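A sketch of the trend step using a least-squares line per bank. The helper names, the `npa` column, and the choice of `ddof=0` when standardising are assumptions; the slope = 0 rule for single-year banks is taken from the text.

```python
import numpy as np
import pandas as pd


def raw_slopes(frame, metrics):
    """Least-squares slope of each metric over the years, one row per bank."""
    rows = {}
    for bank, grp in frame.groupby("bank"):
        x = grp["year"].astype(int).to_numpy(dtype=float)
        x = x - x.mean()  # centring the years keeps the fit well conditioned
        row = {}
        for m in metrics:
            if len(grp) > 1:
                row[f"trend_{m}"] = float(np.polyfit(x, grp[m].to_numpy(), 1)[0])
            else:
                row[f"trend_{m}"] = 0.0  # single-year banks get a flat trend
        rows[bank] = row
    return pd.DataFrame.from_dict(rows, orient="index")


def standardise(slopes):
    """Z-score the slopes across banks so they match the snapshot scale."""
    std = slopes.std(ddof=0).replace(0, np.nan)
    return ((slopes - slopes.mean()) / std).fillna(0.0)


df = pd.DataFrame({
    "bank": ["A"] * 3 + ["B"] * 3 + ["C"],
    "year": ["2023", "2024", "2025"] * 2 + ["2025"],
    "npa": [0.0, 0.5, 1.0, 1.0, 0.5, 0.0, 0.2],  # sign-flipped z-scores
})
slopes = raw_slopes(df, ["npa"])
std_slopes = standardise(slopes)
print(slopes)
print(std_slopes)
```

Bank A worsens by half a standard deviation per year, Bank B improves at the same rate, and Bank C, with only 2025 data, receives a slope of zero.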

07

PCA: 16-Feature Stress Score (no arbitrary weights)

Take only the 2025 (LATEST_YEAR) rows and merge in the trend slopes → 16 features per bank (8 snapshot + 8 trend). Run PCA; PC1 is the dominant stress axis. The NPA loading is checked — if it is negative, scores are flipped. PC1 loadings are printed with a [trend] prefix on trend features so you can see which dimensions drive the score.
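The PC1 extraction and sign check can be sketched as follows. To keep the example dependency-light it uses NumPy's SVD in place of scikit-learn's PCA (the first right singular vector of the centred matrix is the PC1 loading vector), and it uses 3 made-up features rather than the pipeline's 16. The function name and `anchor_col` parameter are hypothetical.

```python
import numpy as np


def pc1_scores(X, anchor_col):
    """Project rows of X onto PC1, oriented so the anchor feature loads positively."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    w = vt[0]
    if w[anchor_col] < 0:  # the sign of a principal axis is arbitrary
        w = -w
    return Xc @ w, w


# Four banks, three illustrative features; column 0 plays the role of NPA.
X = np.array([
    [-1.0, -0.8, 0.1],
    [0.0, 0.1, -0.2],
    [0.5, 0.4, 0.0],
    [1.5, 1.2, 0.1],
])
scores, loadings = pc1_scores(X, anchor_col=0)
print(loadings)
print(scores)
```

Anchoring the sign on the NPA loading matters because PCA alone cannot tell "stress" from "health"; without the check, the ranking could come out inverted.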

08

Normalise to [0, 1] & Save (output)

Min-max scale PC1 scores across all banks in the 2025 universe. Score of 0.0 = least stressed. Score of 1.0 = most stressed. Save to fundamental_stress_scores.csv. The column fundamental_stress_normalized is S(bank) — your initial shock vector for contagion. One row per bank, no per-year duplication.
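A sketch of the final scaling, assuming the PC1 scores arrive as a pandas Series indexed by bank. The PC1 values and the `bank` index name are made up; the output column name comes from the text. The CSV is rendered to a string here rather than written to disk.

```python
import pandas as pd


def normalise_scores(pc1):
    """Min-max scale PC1 scores to [0, 1]; a constant input maps to 0."""
    span = pc1.max() - pc1.min()
    if span == 0:
        return pd.Series(0.0, index=pc1.index)
    return (pc1 - pc1.min()) / span


# Illustrative PC1 scores, not real output.
pc1 = pd.Series({"Bank A": -1.6, "Bank B": -0.3, "Bank C": 0.3, "Bank D": 1.6})
out = pd.DataFrame({"fundamental_stress_normalized": normalise_scores(pc1)})
out.index.name = "bank"

# Pass a path such as "fundamental_stress_scores.csv" to write the file instead.
csv_text = out.to_csv()
print(csv_text)
```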

metrics used — ratios only

All metrics are percentages or ratios normalised by a size denominator (assets, deposits, advances). Absolute values like Gross NPA and Net Profit were deliberately excluded to prevent size bias. Each metric also generates a corresponding trend feature in v2.

Metric	Direction
Net NPA as % to Net Advances	↓ lower better
Capital Adequacy Ratio (Basel-III)	↑ higher better
Provision Coverage Ratio (%)	↑ higher better
Credit Deposit Ratio	↓ lower better
Investment Deposit Ratio	↑ higher better
Return on Assets (%)	↑ higher better
Spread as % of Total Assets	↑ higher better
Operating Expenses as % to Total Expenses	↓ lower better

data quality — missing value report

The pipeline reports missing percentages before imputation. Understanding these is critical — high missingness means the imputed median is doing the work, not real data.

Metric	Missing %	Verdict
Net NPA as % to Net Advances	11.8%	ACCEPTABLE
Capital Adequacy Ratio (Basel-III)	11.8%	ACCEPTABLE
Provision Coverage Ratio (%)	77.3%	PROBLEM
Credit Deposit Ratio	19.3%	MODERATE
Investment Deposit Ratio	13.7%	ACCEPTABLE
Return on Assets (%)	11.8%	ACCEPTABLE
Spread as % of Total Assets	11.8%	ACCEPTABLE
Operating Expenses as % to Total Expenses	11.8%	ACCEPTABLE

Provision Coverage Ratio — Action Required

77.3% of Provision Coverage Ratio values are filled with the cross-sectional median, meaning only ~23% of values were actually reported. The imputed values are identical for all non-reporters within a year, contributing near-zero unique signal to PCA. Consider either sourcing this data from a more complete dataset, or removing PCR from METRICS until coverage improves. The 11–14% missing on most other metrics is expected: foreign/small banks and cooperative banks often skip IBA sub-metrics. Credit Deposit Ratio, at 19.3%, sits above that range and is flagged MODERATE.
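A missing-value report of this shape could be produced as sketched below. The 15% and 30% cut-offs are assumptions chosen only to be consistent with the verdicts in the table above; the actual thresholds are not stated. Column names are illustrative.

```python
import numpy as np
import pandas as pd


def missing_report(frame, metrics, acceptable_max=15.0, moderate_max=30.0):
    """Percentage of missing values per metric, with a coarse verdict."""
    pct = frame[metrics].isna().mean() * 100.0

    def verdict(p):
        # Cut-offs are hypothetical, not taken from the pipeline.
        if p <= acceptable_max:
            return "ACCEPTABLE"
        if p <= moderate_max:
            return "MODERATE"
        return "PROBLEM"

    return pd.DataFrame({"missing_pct": pct.round(1), "verdict": pct.map(verdict)})


df = pd.DataFrame({
    "roa": [1.0, 0.8, 1.2, 0.9],
    "cd_ratio": [70.0, np.nan, 75.0, 80.0],
    "pcr": [np.nan, np.nan, 65.0, np.nan],
})
report = missing_report(df, ["roa", "cd_ratio", "pcr"])
print(report)
```

Running this before imputation is what makes the report useful: once the medians are filled in, the gaps are no longer visible in the data.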

mathematical summary — v2

// Step 1: Winsorize (per metric j, within year t)

x̃ᵢⱼₜ = clip(xᵢⱼₜ, p05ⱼₜ, p95ⱼₜ)

// Step 2: Z-score within each year t

zᵢⱼₜ = (x̃ᵢⱼₜ − μⱼₜ) / σⱼₜ // μ, σ computed within year t

// Step 3: Align direction (high z = high stress)

sᵢⱼₜ = zᵢⱼₜ × (−1 × dⱼ) // dⱼ = +1 if higher is better, −1 if lower is better

// Step 4: Trend slope per metric (new in v2)

βᵢⱼ = slope of {sᵢⱼ,2023, sᵢⱼ,2024, sᵢⱼ,2025} // linear regression; β > 0 = worsening

// Step 5: PCA on 16-feature vector per bank (LATEST_YEAR only)

vᵢ = [ sᵢ,2025 (8) || std(βᵢ) (8) ] // || = concatenation; std(·) = standardised (z-scored) slopes

Fᵢ = wᵀ · vᵢ // w = PC1 loadings, learned from data

// Step 6: Normalise

S(bank) = (Fᵢ − min F) / (max F − min F) ∈ [0, 1]

illustrative output — 2025 trend-adjusted stress scores

Example of what the ranked output looks like (one row per bank, no per-year duplication). Higher bar = more stressed relative to the 2025 peer universe, incorporating 2023–2025 trajectory.

[Bar chart: one bar per bank, ranked by stress score. Axis runs from 0.0 (healthiest) through 0.5 to 1.0 (most stressed).]

how this feeds into contagion

Output Used As

df['fundamental_stress_normalized']

→ Initial shock vector S(bank) [one score per bank, 2025]
→ One scalar per node in your contagion graph
→ Higher S = bank enters the cascade more stressed
→ Also usable as threshold: S > 0.6 = contagion-eligible
→ Score reflects BOTH current position AND 3-year trajectory
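Reading the output into a contagion model might look like the sketch below. The scores are made up, and the `bank` column name is an assumption; the `fundamental_stress_normalized` column and the S > 0.6 threshold come from the text.

```python
import pandas as pd

# Illustrative scores, standing in for fundamental_stress_scores.csv.
scores = pd.DataFrame({
    "bank": ["Bank A", "Bank B", "Bank C", "Bank D"],
    "fundamental_stress_normalized": [0.0, 0.35, 0.72, 1.0],
})

# One scalar per node: the initial shock vector S(bank).
shock = scores.set_index("bank")["fundamental_stress_normalized"].to_dict()

# Banks above the threshold are treated as contagion-eligible.
THRESHOLD = 0.6
eligible = sorted(bank for bank, s in shock.items() if s > THRESHOLD)
print(eligible)
```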

installation

    # Install dependencies
    pip install pymongo pandas scikit-learn numpy python-dotenv

    # .env file (repo root)
    MONGO_URI=mongodb+srv://your-uri
    MONGO_DB=financial_kg
    MONGO_COLLECTION=performance_metrics

    # Run
    make run-bank-stress-mapper
    # or: python engine/stress/bank_stress_mapper.py
    # → outputs: fundamental_stress_scores.csv (one row per bank)

    # Config (top of bank_stress_mapper.py)
    YEARS = ["2023", "2024", "2025"]  # years used for trend computation
    LATEST_YEAR = "2025"              # year whose score is output
