Observability Model
Responsibility View
This document defines how the platform monitors its health and alerts on failures. It answers the question: How do we know if something is wrong?
Signal Flow
The platform collects signals from three layers (hardware/OS nodes, workload pods, and ingress traffic), stores them as metrics and logs, and surfaces them through dashboards and alerts.
flowchart LR
    Nodes[Hardware/OS] --> Collector
    Pods[Workloads] --> Collector
    Ingress[Traffic] --> Collector
    Collector --> TSDB[(Metrics / Logs)]
    TSDB --> Dashboards[Visualization]
    TSDB --> AlertManager[Alerting]
    AlertManager --> Notification{Notification}
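The flow in the diagram can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the platform's actual implementation: the `Sample`, `Store`, and `AlertRule` types, the metric names, and the thresholds are all hypothetical stand-ins for the Collector, TSDB, and Alerting nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Sample:
    source: str   # "nodes", "pods", or "ingress", matching the diagram
    metric: str   # hypothetical metric name, e.g. "cpu_utilization"
    value: float


@dataclass
class Store:
    """In-memory stand-in for the TSDB node in the diagram."""
    samples: list = field(default_factory=list)

    def write(self, sample: Sample) -> None:
        self.samples.append(sample)

    def latest(self, metric: str) -> Optional[Sample]:
        # Scan from the newest sample backwards.
        for sample in reversed(self.samples):
            if sample.metric == metric:
                return sample
        return None


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float  # assumed semantics: fire when the latest value exceeds this


def evaluate(store: Store, rules: list, notify: Callable[[str], None]) -> None:
    """Stand-in for the alerting stage: check each rule, notify on breach."""
    for rule in rules:
        sample = store.latest(rule.metric)
        if sample is not None and sample.value > rule.threshold:
            notify(f"{rule.name}: {sample.metric}={sample.value} from {sample.source}")


# The collector step is reduced to direct writes into the store.
store = Store()
store.write(Sample("nodes", "cpu_utilization", 0.97))
store.write(Sample("ingress", "error_rate", 0.01))

sent = []  # notifications are captured in a list instead of being delivered
evaluate(
    store,
    [AlertRule("HighCPU", "cpu_utilization", 0.90),
     AlertRule("HighErrorRate", "error_rate", 0.05)],
    sent.append,
)
print(sent)
```

Only the CPU rule fires here, because the ingress error rate stays below its assumed threshold. Passing `notify` as a callable keeps the evaluation logic independent of the delivery channel.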
Core Signals
Metrics (Availability & Performance)
Numerical measurements sampled over time (CPU, memory, latency, error rate), used to determine the real-time health of a component.
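As an illustration of how two of these metrics might be derived from raw observations, the sketch below computes an error rate and a latency percentile. The function names, the nearest-rank percentile method, and the sample values are assumptions made for the example, not part of the platform.

```python
import math


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 12.0, 18.0, 17.0]

print(error_rate(200, 3))
print(percentile(latencies_ms, 50))
print(percentile(latencies_ms, 95))
```

The median hides the slow outlier while the 95th percentile exposes it, which is why latency is usually tracked as a high percentile, not an average.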
Logs (Context & Security)
Textual records of events, used for post-mortem analysis, security auditing, and troubleshooting complex failures.
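Logs are easier to search and audit when each record is structured. The sketch below uses Python's standard `logging` module to emit one JSON object per event. The logger name `platform.audit` and the `component` and `request_id` fields are hypothetical choices for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("component", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("platform.audit")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
log = build_logger(buffer)
log.warning("login failed", extra={"component": "ingress", "request_id": "req-123"})
print(buffer.getvalue().strip())
```

Carrying a request identifier on every record lets an investigator correlate events from different components during a post-mortem.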
Health Checks (Integrity)
Active probing of service endpoints (e.g., /healthz). The result determines whether a workload is ready to receive traffic or needs to be restarted.
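One way to turn probe results into a traffic or restart decision is to count consecutive failures. The sketch below is a simplified illustration under stated assumptions: the `ProbeTracker` type, the threshold of three failures, and the three state names are hypothetical, and `http_probe` is shown only as one possible way to query an endpoint such as /healthz.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # assumed value: restart after this many failures in a row
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> str:
        """Update the failure streak and return the resulting state."""
        if healthy:
            self.consecutive_failures = 0
            return "ready"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "restart"
        return "not_ready"


# Simulated probe results; a real loop might call http_probe(...) on a timer.
tracker = ProbeTracker(failure_threshold=3)
states = [tracker.record(ok) for ok in [True, False, False, True, False, False, False]]
print(states)
```

A single failed probe only withholds traffic, while a sustained streak triggers a restart, so a brief hiccup does not cause an unnecessary restart.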