Role: Operations

Responsibility

The Operations role covers the tools and processes required to maintain, monitor, and update the platform. It provides repeatability through GitOps and visibility through centralized metrics, logs, and alerting.

Key Guarantees

  • Observability: Centralized metrics, logs, and alerting for all platform components.
  • Change Management: All platform changes flow through Git (GitOps).
  • Disaster Recovery: Automated backups and verified restore paths for stateful data.
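
The Change Management guarantee rests on comparing the state declared in Git with what is actually running. The following is a minimal sketch of that drift check, not how Argo CD or Flux implement it. The function name `find_drift` and the flat key/value settings are hypothetical, chosen only for illustration.

```python
from typing import Any

_MISSING = object()  # sentinel: key absent on one side


def find_drift(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (desired_value, live_value)} for every setting that differs.

    A key that is absent on one side is reported with None on that side.
    (This sketch therefore cannot tell "absent" apart from an explicit None.)
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, _MISSING)
        have = live.get(key, _MISSING)
        if want != have:
            drift[key] = (
                None if want is _MISSING else want,
                None if have is _MISSING else have,
            )
    return drift


# Hypothetical example values: what Git declares vs. what the cluster reports.
desired = {"grafana.replicas": 2, "loki.retention": "30d"}
live = {"grafana.replicas": 1, "loki.retention": "30d", "debug.pod": "enabled"}

drift = find_drift(desired, live)
```

An empty result means the live state matches Git. Anything else is either reconciled automatically or surfaced for review, depending on how the GitOps tool is configured.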

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| Prometheus / Grafana / Loki | All stacks | Standard dashboards; large community; mature. | Can sprawl; needs retention planning. | K8s has the most turnkey packaging (kube-prometheus-stack). |
| GitOps (Argo CD / Flux) | K8s, K3s | Repeatability; drift control; clear audit trail. | Higher initial platform complexity. | Very mature on K8s; Nomad has options but less standardized. |
| Velero | K8s, K3s | K8s-native backups; CSI integration for snapshots. | K8s-specific. | Best for K8s/K3s clusters. |
| Restic | Any | General purpose; deduplication; encrypted backups. | More manual configuration on K8s. | Great for Nomad/ZFS/NAS-based approaches. |
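
For the Restic option, a "verified restore path" could be approximated by a backup, a repository integrity check, and a trial restore of the latest snapshot. The sketch below only builds the command lines. The repository location and directory paths are hypothetical placeholders, and running the commands assumes `restic` is installed, the repository is initialized, and a password is supplied (for example through the `RESTIC_PASSWORD` environment variable).

```python
import subprocess


def restic_commands(repo: str, source: str, restore_target: str) -> list[list[str]]:
    """Build a backup / check / trial-restore sequence for one repository."""
    return [
        # Back up the source directory into the repository.
        ["restic", "-r", repo, "backup", source],
        # Verify the repository's structural integrity.
        ["restic", "-r", repo, "check"],
        # Restore the most recent snapshot into a scratch directory.
        ["restic", "-r", repo, "restore", "latest", "--target", restore_target],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical paths, for illustration only.
commands = restic_commands("/srv/restic-repo", "/var/lib/app-data", "/tmp/restore-test")
```

Calling `run_all(commands)` from a scheduled job would exercise the restore path regularly. Comparing the restored files against the source is a further step this sketch leaves out.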

Typical Stack Pairings

  • K8s/K3s: Prometheus + Argo CD/Flux + Velero
  • Nomad/Other: Prometheus + Restic
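
The pairings above can be captured as a small lookup, for instance in a provisioning script that decides which operations tooling to install. This is an illustrative sketch. The names `STACK_PAIRINGS` and `recommended_stack` are hypothetical, and treating every unrecognized platform as "Nomad/Other" is an assumption drawn from the wording of the list.

```python
# Pairings as listed above; K8s and K3s share the same tooling.
_KUBERNETES_STACK = ["Prometheus", "Argo CD or Flux", "Velero"]
_OTHER_STACK = ["Prometheus", "Restic"]

STACK_PAIRINGS: dict[str, list[str]] = {
    "k8s": _KUBERNETES_STACK,
    "k3s": _KUBERNETES_STACK,
    "nomad": _OTHER_STACK,
}


def recommended_stack(platform: str) -> list[str]:
    """Return the typical operations stack for a platform name.

    Unknown platforms fall back to the "Nomad/Other" pairing (assumption).
    """
    key = platform.strip().lower()
    # Return a copy so callers cannot mutate the shared table.
    return list(STACK_PAIRINGS.get(key, _OTHER_STACK))
```

For example, `recommended_stack("K3s")` yields the Kubernetes pairing, while `recommended_stack("bare-metal")` falls back to Prometheus plus Restic.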