
Fault Types

The five fault types in MAFIS, organized in a 3-category taxonomy (Recoverable, Permanent-distributed, Permanent-localized), and how FaultSource distinguishes automatic, manual, and scheduled injection.

MAFIS supports five fault types organized in a 3-category taxonomy based on duration and scope. All types go through the same cascade pipeline (ADG → BFS → replan), which makes their resilience metrics scientifically comparable regardless of how they were triggered.

3-Category Taxonomy

| Category | Types | Duration | Scope |
|---|---|---|---|
| Recoverable | TemporaryBlockage, Latency | Temporary (N ticks) | Cell or agent |
| Permanent-distributed | Overheat, Breakdown | Permanent | Individual agents, randomly distributed |
| Permanent-localized | PermanentZoneOutage | Permanent | Entire zone (contiguous area) |

Each category produces a distinct resilience signature:

  • Recoverable faults test how quickly the system adapts and recovers.
  • Permanent-distributed faults model fleet attrition. Individual agents die and become permanent obstacles, randomly distributed across the grid.
  • Permanent-localized faults eliminate entire zones from the operational map, testing global replanning capacity.

Full Fault Type Reference

| Type | Category | Agent State | Grid Effect | Duration | Recovery |
|---|---|---|---|---|---|
| Overheat | Permanent-distributed | Dead | Cell becomes obstacle | Permanent | None |
| Breakdown | Permanent-distributed | Dead | Cell becomes obstacle | Permanent | None |
| TemporaryBlockage | Recoverable | N/A (cell-based) | Cell becomes unwalkable | Configurable (N ticks) | Auto-removes after N ticks |
| Latency | Recoverable | Alive, degraded | None | Configurable (N ticks) | Agent resumes after N ticks |
| PermanentZoneOutage | Permanent-localized | Agents in zone die | Zone cells become obstacles | Permanent | None |
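
The type-to-category mapping in the table can be written down as a small lookup. The sketch below is illustrative only: `FaultType`, `FaultCategory`, `category`, and `is_permanent` are hypothetical names chosen for this example and may not match the MAFIS source.

```rust
// Hypothetical types mirroring the reference table; MAFIS's real definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FaultCategory {
    Recoverable,
    PermanentDistributed,
    PermanentLocalized,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FaultType {
    Overheat,
    Breakdown,
    TemporaryBlockage,
    Latency,
    PermanentZoneOutage,
}

impl FaultType {
    /// Category assignment as listed in the reference table.
    pub fn category(self) -> FaultCategory {
        match self {
            FaultType::TemporaryBlockage | FaultType::Latency => FaultCategory::Recoverable,
            FaultType::Overheat | FaultType::Breakdown => FaultCategory::PermanentDistributed,
            FaultType::PermanentZoneOutage => FaultCategory::PermanentLocalized,
        }
    }

    /// True when the fault never clears during a run.
    pub fn is_permanent(self) -> bool {
        self.category() != FaultCategory::Recoverable
    }
}
```

Keeping the mapping in one exhaustive `match` means the compiler flags any new fault type that has not been assigned a category.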

Overheat

Triggered when an agent’s accumulated heat exceeds overheat_threshold (see Heat System). The agent dies, its cell becomes a permanent obstacle, and all agents whose paths cross that cell must replan. Overheat faults are automatic, arising from sustained congestion and waiting.

Breakdown

A hardware death fault that fires with probability breakdown_probability on each tick, configurable via FaultConfig. Like Overheat, the agent dies and becomes a permanent obstacle. The distinction matters for analysis: Breakdown is stochastic and uncorrelated with congestion, whereas Overheat is caused by congestion. Both produce identical cascade consequences.

[!WARNING] Overheat and Breakdown are permanent. The agent dies and its cell becomes an obstacle for the remainder of the simulation. Plan your fault intensity accordingly.
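
To get a feel for how quickly permanent faults deplete a fleet, a back-of-the-envelope calculation can help. The sketch below assumes that breakdown_probability is applied as an independent trial per living agent per tick, which is an interpretation rather than something confirmed by the source. It ignores Overheat entirely, and the function names are hypothetical.

```rust
/// Probability that one agent is still alive after `ticks` ticks,
/// ASSUMING an independent Breakdown trial per agent per tick.
pub fn survival_probability(breakdown_probability: f64, ticks: u32) -> f64 {
    (1.0 - breakdown_probability).powi(ticks as i32)
}

/// Expected number of agents lost to Breakdown alone (Overheat is ignored).
pub fn expected_breakdown_losses(fleet_size: u32, breakdown_probability: f64, ticks: u32) -> f64 {
    fleet_size as f64 * (1.0 - survival_probability(breakdown_probability, ticks))
}
```

Under that assumption, a breakdown_probability of 0.001 over a 1000-tick run leaves each agent with a survival probability of about 0.37, so most of the fleet would be expected to die before the run ends.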

TemporaryBlockage

A cell-based fault, not agent-based. A cell becomes unwalkable for a configurable number of ticks (e.g., simulating a human walking through an aisle, a spill, or a dropped package). After N ticks, the cell automatically becomes walkable again. This is a new fault type (not present in earlier versions of MAFIS).

Agents whose paths cross the blocked cell must replan around it. When the blockage clears, agents are not automatically rerouted. They continue on their current paths, and the restored cell only becomes available to them again on the next replan cycle.
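
The countdown behaviour can be sketched as a map from blocked cells to remaining ticks. This is a minimal illustration, not the MAFIS implementation: `Blockages`, `Cell`, and the method names are hypothetical, and the sketch deliberately does not trigger any replanning when a cell clears.

```rust
use std::collections::HashMap;

type Cell = (i32, i32);

/// Hypothetical tracker for active blockages: cell -> ticks remaining.
#[derive(Default)]
pub struct Blockages {
    remaining: HashMap<Cell, u32>,
}

impl Blockages {
    /// Block `cell` for `ticks` ticks (N in the text); zero-length blockages are ignored.
    pub fn block(&mut self, cell: Cell, ticks: u32) {
        if ticks > 0 {
            self.remaining.insert(cell, ticks);
        }
    }

    /// A cell is walkable unless it has an active blockage.
    pub fn is_walkable(&self, cell: Cell) -> bool {
        !self.remaining.contains_key(&cell)
    }

    /// Advance one tick and return the cells that just became walkable again.
    pub fn tick(&mut self) -> Vec<Cell> {
        let mut cleared = Vec::new();
        self.remaining.retain(|cell, left| {
            *left -= 1;
            if *left == 0 {
                cleared.push(*cell);
                false // drop the entry: the blockage has expired
            } else {
                true
            }
        });
        cleared
    }
}
```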

Latency

An agent-level degradation fault. The affected agent executes Action::Wait for N consecutive ticks regardless of what the solver would assign. After N ticks, the agent resumes normal operation. The agent is alive and occupying a cell during latency. It is not an obstacle, but it is unresponsive to the planner.

Real-world analogy: a robot’s sensor system lags, a communication packet is dropped, or a software hang causes the robot to freeze briefly before recovering.

[!NOTE] Latency faults are the mildest fault type. The agent is alive, occupies its cell, and recovers automatically. Use them to study congestion propagation without permanent fleet attrition.
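
One way to picture a Latency fault is as a per-agent counter that overrides the solver's output while it is non-zero. In the sketch below, only `Action::Wait` comes from the text. The `Move` variant, `LatencyState`, and `apply` are hypothetical and exist purely for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Wait,
    Move(i32, i32), // hypothetical: a step by (dx, dy)
}

/// Hypothetical per-agent latency state.
pub struct LatencyState {
    pub ticks_remaining: u32,
}

impl LatencyState {
    /// Returns the action actually executed this tick: Wait while the fault
    /// is active, otherwise whatever the solver assigned.
    pub fn apply(&mut self, solver_action: Action) -> Action {
        if self.ticks_remaining > 0 {
            self.ticks_remaining -= 1;
            Action::Wait
        } else {
            solver_action
        }
    }
}
```

With `ticks_remaining` set to 2, the first two calls return `Action::Wait` and the third passes the solver's action through unchanged.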

PermanentZoneOutage

A permanent, localized fault that blocks a zone (the entire zone by default) at a configurable tick. The busiest zone is selected deterministically, and its walkable cells become permanent obstacles. Agents standing on blocked cells die immediately. All task assignments into the zone are invalidated.

Parameters:

  • at_tick: when the blockage fires (e.g., tick 100)
  • block_percent: fraction of zone cells to block (1–100%; default 100%)
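
The two parameters above can be combined into a simple selection step. The sketch below is an assumption-laden illustration: the source does not say how fractional cell counts are rounded or which cells are chosen when block_percent is below 100, so rounding up and taking cells in slice order are choices made here. `ZoneOutage`, `Cell`, and `cells_to_block` are hypothetical names.

```rust
type Cell = (i32, i32);

/// Hypothetical parameter bundle; field names follow the parameter list.
pub struct ZoneOutage {
    pub at_tick: u64,
    pub block_percent: u8, // 1..=100
}

impl ZoneOutage {
    /// Cells to turn into permanent obstacles at `tick`, or none if the fault
    /// does not fire on this tick. Rounding up and taking cells in slice order
    /// are ASSUMPTIONS of this sketch.
    pub fn cells_to_block(&self, tick: u64, zone_cells: &[Cell]) -> Vec<Cell> {
        if tick != self.at_tick {
            return Vec::new();
        }
        let percent = self.block_percent.clamp(1, 100) as usize;
        let count = (zone_cells.len() * percent + 99) / 100; // ceiling division
        zone_cells[..count].to_vec()
    }
}
```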

Real-world analogy: a fire in a warehouse aisle permanently closes an entire section; a structural collapse blocks a storage zone; a water leak forces evacuation of a delivery area.

[!WARNING] PermanentZoneOutage is the most destructive fault type. A 100% blockage removes all zone cells from the operational map for the remainder of the run.

This fault type tests a failure mode that prior work on k-robust MAPF and delay-based fault models does not cover.

FaultSource

All faults carry a FaultSource tag:

```rust
pub enum FaultSource {
    Automatic,  // System-generated via heat/probability
    Manual,     // Researcher-injected via UI
    Scheduled,  // From a FaultSchedule scenario
}
```

Manual faults are injected while the simulation is paused (click a robot → “Kill” / “Block for N ticks” / “Slow for N ticks”). They are tagged FaultSource::Manual so they can be distinguished in analysis and export, but their metrics are computed through the same cascade pipeline as automatic faults. A manual kill produces the same cascade depth, spread, and recovery dynamics as an automatic breakdown.

[!IMPORTANT] Because manual and automatic faults go through the identical cascade pipeline, manual injection experiments produce scientifically valid comparisons to automatic fault runs.
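
Since every fault carries its source tag, an analysis step can split exported results by source and compare them. The sketch below repeats the FaultSource enum with derives added so that it compiles on its own. `FaultRecord`, its fields other than `source`, and `mean_cascade_depth` are hypothetical and do not describe the actual MAFIS export format.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FaultSource {
    Automatic,
    Manual,
    Scheduled,
}

/// Hypothetical export record; only `source` comes from the documentation.
#[derive(Debug, Clone, PartialEq)]
pub struct FaultRecord {
    pub tick: u64,
    pub source: FaultSource,
    pub cascade_depth: u32,
}

/// Mean cascade depth of the faults with the given source, or None if there are none.
pub fn mean_cascade_depth(records: &[FaultRecord], source: FaultSource) -> Option<f64> {
    let depths: Vec<f64> = records
        .iter()
        .filter(|r| r.source == source)
        .map(|r| r.cascade_depth as f64)
        .collect();
    if depths.is_empty() {
        None
    } else {
        Some(depths.iter().sum::<f64>() / depths.len() as f64)
    }
}
```

Returning `None` for an empty group avoids reporting a misleading mean of zero when a run contains no faults from that source.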

Fault Intensity Configuration

The rate of automatic fault generation is controlled by FaultConfig:

| Parameter | Effect |
|---|---|
| breakdown_probability | Per-tick probability that any living agent suffers a Breakdown |
| overheat_threshold | Heat level that triggers Overheat (lower = more frequent) |
| heat_per_wait | Heat accumulated per tick an agent waits |
| heat_per_move | Heat accumulated per tick an agent moves |
| congestion_heat_bonus | Extra heat added per nearby agent within congestion_heat_radius |
| heat_dissipation | Heat lost per tick when not congested |

The UI exposes fault intensity presets (Off / Low / Medium / High) that set these parameters together.
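
To make the interaction of the heat parameters concrete, here is a rough sketch of a single heat update. The field names follow the table, but the struct itself, the `f32` types, the order of operations, the floor at zero, and the reading of "not congested" as "no agents within congestion_heat_radius" are all assumptions. The Heat System page is the authoritative description.

```rust
/// Hypothetical mirror of the heat-related FaultConfig fields.
pub struct HeatConfig {
    pub overheat_threshold: f32,
    pub heat_per_wait: f32,
    pub heat_per_move: f32,
    pub congestion_heat_bonus: f32,
    pub heat_dissipation: f32,
}

/// One tick of heat bookkeeping. `nearby_agents` is the number of agents within
/// congestion_heat_radius. Returns the new heat and whether Overheat fires.
/// The order of operations and the floor at zero are ASSUMPTIONS.
pub fn step_heat(cfg: &HeatConfig, heat: f32, waited: bool, nearby_agents: u32) -> (f32, bool) {
    let base = if waited { cfg.heat_per_wait } else { cfg.heat_per_move };
    let mut next = heat + base + cfg.congestion_heat_bonus * nearby_agents as f32;
    if nearby_agents == 0 {
        // Dissipation only applies when the agent is not congested.
        next = (next - cfg.heat_dissipation).max(0.0);
    }
    (next, next > cfg.overheat_threshold)
}
```

In this model a waiting agent surrounded by neighbours heats up fastest, which is consistent with Overheat arising from sustained congestion and waiting.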

All Faults Through One Pipeline

Regardless of type or source, every fault goes through the same pipeline:

  1. Heat/FaultCheck phase: fault is registered, agent state updated, cell obstacle status updated
  2. ADG construction: Agent Dependency Graph identifies which agents’ paths are blocked by the new state
  3. BFS propagation: cascade depth and spread computed
  4. Replan phase: affected agents get new plans from the active solver
  5. Metrics: MTTR, cascade depth/spread, throughput delta computed in AnalysisSet::Metrics
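
Steps 2 and 3 can be illustrated with a plain breadth-first search over a dependency graph. The sketch below is not the MAFIS implementation. It assumes the ADG is given as an adjacency map from an agent to the agents whose plans depend on it, and it defines depth as the longest hop distance from the faulted agent and spread as the number of other agents reached. Both definitions, along with the names `cascade` and `AgentId`, are assumptions for this example.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type AgentId = u32;

/// BFS over a dependency graph where `adg[a]` lists the agents whose plans
/// depend on `a`. Returns (depth, spread): the longest hop distance from the
/// faulted agent and the number of other agents reached. Both definitions are
/// ASSUMPTIONS of this sketch.
pub fn cascade(adg: &HashMap<AgentId, Vec<AgentId>>, faulted: AgentId) -> (u32, usize) {
    let mut seen = HashSet::from([faulted]);
    let mut queue = VecDeque::from([(faulted, 0u32)]);
    let mut depth = 0;
    while let Some((agent, d)) = queue.pop_front() {
        depth = depth.max(d);
        // Agents with no dependents have no entry; treat that as an empty list.
        for &next in adg.get(&agent).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, d + 1));
            }
        }
    }
    (depth, seen.len() - 1)
}
```

For a graph where agent 0 blocks agents 1 and 2, and both of those block agent 3, a fault on agent 0 yields a depth of 2 and a spread of 3.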

See Cascade Propagation for the ADG pipeline in detail.