Resilience Scorecard
Four indicators that characterize how a multi-agent system behaves under sustained faults: Fault Tolerance, NRR, Survival Rate, and Critical Time.
[!NOTE] Live observatory context. The Resilience Scorecard is the real-time instrument panel for interactive MAFIS sessions. It shows FT, NRR, Survival Rate, and Critical Time as the simulation runs. For batch experiment reporting, MAFIS uses a separate set of differential metrics (FT, CT, TWTE, AR, Cascade Depth, Rapidity); see the Fault Metrics documentation.
The Resilience Scorecard is computed live during the fault injection phase. It distills raw metrics into research-backed indicators that together answer: is this system resilient, degrading, or collapsing?
RESILIENCE SCORECARD
Fault Tolerance 0.82
NRR 0.91
Survival Rate 0.88
Critical Time 0.15
Composite Score 0.86 → RESILIENT
1. Fault Tolerance (FT)
How much throughput does the system retain under faults?
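$$\text{FT} = \frac{\bar{T}_{\text{fault}}}{T_{\text{baseline}}}$$

| Variable | Meaning |
|---|---|
| $\bar{T}_{\text{fault}}$ | Average throughput during fault injection |
| $T_{\text{baseline}}$ | Baseline throughput (fault-free) |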
- Range: 0 (no throughput under faults) → 1+ (throughput matches or exceeds baseline)
- Weight in composite: 40%
Origin: Adapted from Milner (2023), “Quantifying Fault Tolerance in Autonomous Multi-Robot Systems”, which defines fault tolerance as the ratio of degraded performance to nominal performance.
Real-life example: A warehouse fleet delivers 100 packages/hour normally. Under faults, it delivers 82/hour. FT = 0.82. A fleet manager uses this to decide: “Can I absorb a 3-robot failure during peak hours without missing SLAs?”
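As a minimal sketch of the ratio described above, the following Python function divides the mean throughput observed during fault injection by the fault-free baseline. The function name, its parameters, and the sample values are hypothetical and are not taken from the MAFIS code base.

```python
from statistics import mean
from typing import Sequence


def fault_tolerance(fault_throughput: Sequence[float], baseline: float) -> float:
    """Mean throughput under faults divided by the fault-free baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    # The result may exceed 1 when the faulted run outperforms the baseline.
    return mean(fault_throughput) / baseline


# Hypothetical per-hour samples averaging 82 packages against a baseline of 100.
print(round(fault_tolerance([80, 84, 82], 100.0), 2))  # 0.82
```

The sketch leaves values above 1 unclamped, matching the documented "1+" upper range.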
[!TIP] Animation concept: Two throughput bars side by side: a tall “baseline” bar and a shorter “under faults” bar. The ratio between them fills a gauge labeled “Fault Tolerance.”
2. NRR (Normalized Recovery Ratio)
How much time does the system spend recovering versus operating?
$$\text{NRR} = 1 - \frac{\text{MTTR}}{\text{MTBF}}$$

| Variable | Meaning |
|---|---|
| MTTR | Mean Time To Recovery (ticks to resume after a fault) |
| MTBF | Mean Time Between Faults (ticks between fault events) |
- Range: 0 (always recovering) → 1 (recovers instantly relative to fault frequency)
- Weight in composite: 35% (redistributed to other metrics when N/A)
- Requires at least 2 fault events to compute MTBF, and at least one cascade-affected agent for MTTR.
- N/A when: all fault events occur at the same tick (burst, zone outage) — MTBF is undefined with no inter-arrival intervals. Also N/A for permanent deaths with no cascade neighbors. When N/A, the composite score redistributes NRR’s weight: FT 65%, CT 35%.
Origin: Or (2025), “MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems”, which defines NRR as the uptime bound, proving that the steady-state operational fraction is at least NRR.
Real-life example: If a fleet takes 10 ticks to recover and faults occur every 100 ticks, NRR = 0.90, meaning the fleet is operational 90%+ of the time. If recovery takes 50 ticks with faults every 60 ticks, NRR = 0.17, meaning the fleet is almost always recovering and rarely productive.
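The arithmetic in this example, together with the N/A rule for same-tick faults, can be sketched in Python as follows. Both helper names are hypothetical. The sketch also makes two assumptions that the text does not confirm: faults sharing a tick are treated as a single event, and the result is clamped to the documented 0 to 1 range.

```python
from typing import Optional, Sequence


def mean_time_between_faults(fault_ticks: Sequence[int]) -> Optional[float]:
    """Mean gap between distinct fault ticks, or None when no gap exists."""
    ticks = sorted(set(fault_ticks))  # assumption: same-tick faults count once
    if len(ticks) < 2:
        return None  # burst or single fault: no inter-arrival interval
    gaps = [later - earlier for earlier, later in zip(ticks, ticks[1:])]
    return sum(gaps) / len(gaps)


def nrr(mttr: Optional[float], mtbf: Optional[float]) -> Optional[float]:
    """NRR = 1 - MTTR / MTBF, or None (N/A) when an input is undefined."""
    if mttr is None or mtbf is None or mtbf <= 0:
        return None
    # Assumption: clamp to the documented 0..1 range.
    return min(1.0, max(0.0, 1.0 - mttr / mtbf))


print(round(nrr(10, mean_time_between_faults([100, 200, 300])), 2))  # 0.9
print(round(nrr(50, 60), 2))                                        # 0.17
print(nrr(10, mean_time_between_faults([40, 40, 40])))              # None
```

Returning `None` rather than 0 keeps "not computable" distinct from "always recovering", which matters when the composite score redistributes NRR's weight.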
[!TIP] Animation concept: A timeline with alternating green (operational) and red (recovering) segments. The ratio of green to total fills a gauge labeled “NRR.” High NRR = mostly green.
3. Survival Rate (SR)
What fraction of the initial fleet is still alive?
| Variable | Meaning |
|---|---|
| alive agents | Agents not permanently killed by a fault |
| initial fleet size | Total agents at simulation start |
- Range: 0 (all agents dead) → 1 (no deaths)
- Displayed in scorecard UI but not included in the composite score — SR is a supporting indicator, not a composite input.
Interpretation: Survival Rate is a direct headcount. It separates fleet attrition (agents lost to faults) from throughput degradation (performance lost without deaths). A system can have high SR (few deaths) but low Fault Tolerance (widespread cascade stalls), or low SR (many deaths) but high FT if surviving agents absorb the work efficiently.
Real-life example: A warehouse fleet starts with 100 robots. After a wear-based fault wave, 88 remain alive. SR = 0.88. A fleet manager uses this alongside FT to distinguish: “Did my throughput drop because robots died, or because the survivors got stuck?”
[!TIP] Animation concept: A fleet health bar showing alive agents (green) versus dead agents (red) as a fraction of the original fleet. Faults chip away at the bar as agents die. SR = the green fraction.
4. Critical Time (CT)
How much time does the system spend in a critical state?
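$$\text{CT} = \frac{t_{\text{critical}}}{t_{\text{total}}}$$

| Variable | Meaning |
|---|---|
| $t_{\text{critical}}$ | Ticks where throughput < 50% of baseline |
| $t_{\text{total}}$ | Total ticks since first fault |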
- Range: 0 (never critical) → 1 (always critical)
- Weight in composite: 25% (inverted: $1 - \text{CT}$)
- Threshold: 50% of baseline throughput
Origin: Adapted from Ghasemieh (2024), “Transient Analysis of Fault-Tolerant Systems”, which uses time-below-threshold as a measure of system criticality during transient degradation.
Real-life example: After a cascade failure, throughput drops to 30% of baseline for 45 out of 300 ticks. CT = 0.15. A reliability engineer asks: “How often is my system dangerously degraded?” CT = 0.15 means “only 15% of the time,” which is acceptable. CT = 0.60 means “more often than not,” and redesign is needed.
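A minimal Python sketch of this time-below-threshold calculation is shown here, assuming one throughput sample per tick starting at the first fault. The function name, parameter names, and the synthetic series are illustrative only.

```python
from typing import Sequence


def critical_time(throughput: Sequence[float], baseline: float,
                  threshold: float = 0.5) -> float:
    """Fraction of ticks whose throughput is strictly below threshold * baseline."""
    if not throughput:
        raise ValueError("need at least one tick after the first fault")
    critical_ticks = sum(1 for t in throughput if t < threshold * baseline)
    return critical_ticks / len(throughput)


# Synthetic series: 45 ticks at 30% of baseline, then 255 ticks at 90%.
series = [30.0] * 45 + [90.0] * 255
print(critical_time(series, baseline=100.0))  # 0.15
```

The comparison is strict, so a tick at exactly 50% of baseline is not counted as critical, following the "throughput < 50% of baseline" wording.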
[!TIP] Animation concept: A throughput curve over time with a dashed red line at 50% baseline. The segments below the line flash red. CT = the red fraction of the total timeline.
Composite Score
Three metrics combine into a single resilience score. Survival Rate is shown in the UI but not included in the composite.
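When NRR is available (recurring faults, 2+ events with distinct ticks):

$$\text{Score} = 0.40 \cdot \text{FT} + 0.35 \cdot \text{NRR} + 0.25 \cdot (1 - \text{CT})$$

When NRR is N/A (burst/zone outage, or no cascade neighbors):

$$\text{Score} = 0.65 \cdot \text{FT} + 0.35 \cdot (1 - \text{CT})$$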
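The two weightings can be sketched in Python using the weights stated in the metric sections (FT 40%, NRR 35%, inverted CT 25%, or FT 65% and inverted CT 35% when NRR is N/A). The function name and the use of `None` to represent N/A are assumptions made for illustration. The sketch stops at the numeric score and does not map it to a verdict.

```python
from typing import Optional


def composite_score(ft: float, ct: float, nrr: Optional[float] = None) -> float:
    """Weighted resilience score; pass nrr=None when NRR is N/A."""
    ct_term = 1.0 - ct  # CT is inverted so that higher is better
    if nrr is None:
        # NRR's 35% weight is redistributed: FT 65%, CT 35%.
        return 0.65 * ft + 0.35 * ct_term
    return 0.40 * ft + 0.35 * nrr + 0.25 * ct_term


# Values from the sample scorecard at the top of this page.
print(round(composite_score(ft=0.82, ct=0.15, nrr=0.91), 2))  # 0.86
# Same run if NRR were N/A.
print(round(composite_score(ft=0.82, ct=0.15), 2))            # 0.83
```

Survival Rate is deliberately absent from the signature, since it is displayed but not a composite input.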
The verdict banner classifies the result:
| Score | Verdict | Meaning |
|---|---|---|
| | RESILIENT | System absorbs faults and recovers |
| | MODERATE | System is partially degraded |
| | DEGRADED | System is losing ground over time |
| | FRAGILE | System cannot sustain operation |
Export
All scorecard values plus the composite score are included in JSON/CSV exports for offline analysis and cross-run comparison.