
Architecture Overview

Purpose

This document serves as the canonical source of truth for the Home Lab platform. It defines what exists and why, establishing the fundamental principles that guide all architectural decisions.

Definitions

Public Platform

The set of services and infrastructure components explicitly designed to be reachable from the public internet. This platform resides behind the Public Ingress and is subject to strict exposure policies.

Internal Platform

The core of the home lab, consisting of services reachable only from the local network (LAN) or via an authorized VPN connection. Access is defined by network presence and identity.

Trust Boundaries

Clear lines of demarcation between different security zones (Internet, LAN, VPN, Management). Every interaction across a boundary must be explicitly allowed and authenticated.

Platform Roles

The platform is composed of several stable roles that provide foundational services:

  • Edge & Boundary
  • Identity & Access
  • Connectivity & Naming
  • Storage & Persistence
  • Compute & Orchestration
  • Operations

Invariants

These rules are absolute and must not be violated by any implementation:

  1. Internal Isolation: Internal services are never internet-routable. No direct NAT or port-forwarding to internal services is permitted.
  2. Identity First: No service shall be exposed without an identity-aware proxy or native SSO integration unless explicitly justified in a Service Contract.
  3. Source of Truth: The Git repository is the sole authority for the state of the platform. Manual “hot-fixes” are technical debt that must be codified immediately.
  4. Data Durability: Critical data must exist in at least two physical locations at all times.

Non-Goals

  • Real-time global availability (HA is local/cluster-based, not geo-distributed).
  • Public hosting of third-party data.
  • Replacement of enterprise-grade cloud services for high-risk workloads.

System Context

Map View

The following diagram provides a high-level orientation of the actors and systems involved in the Home Lab ecosystem.

flowchart TD
    subgraph Users [Users]
        Family["Family User"]
        Owner["Admin"]
        Public["Public Visitor"]
    end

    PublicPlane["Public Platform (behind Public Ingress)"]
    InternalPlane["Internal Platform (LAN/VPN only)"]

    subgraph Control ["Change Automation"]
        Automation["CI/CD + IaC Pipelines"]
    end

    subgraph External ["External Dependencies"]
        DNS["Cloud DNS"]
        Internet["The Internet"]
    end

    Family -- "HTTPS / LAN / VPN" --> InternalPlane
    Owner -- "SSH / Git / HTTPS" --> InternalPlane
    Owner -- "Git / CI/CD" --> Automation
    Public -- "HTTPS" --> PublicPlane

    Automation -- "Deploys / Config" --> InternalPlane
    Automation -- "Deploys / Config" --> PublicPlane
    Automation -- "DNS record management (automation)" --> DNS

    PublicPlane -- "Traffic" --> Internet
    InternalPlane -- "Traffic" --> Internet

Actors & Systems

| Entity | Role | Description |
| --- | --- | --- |
| Family User | Internal User | Accesses personal services (Wiki, Photos, Chat) from within the LAN or via VPN. |
| Admin | Infrastructure Owner | Manages the platform, security, and service configurations via SSH, Git, and HTTPS. |
| Public Visitor | External User | Accesses public-facing content and websites hosted on the platform. |
| Public Platform | Public Plane | Internet-facing services reachable through the public ingress. |
| Internal Platform | Internal Plane | Core services and management endpoints reachable only from LAN or VPN. |
| Change Automation | Control Plane | CI/CD and IaC pipelines that apply platform changes and manage DNS records. |
| Cloud DNS | External System | Managed DNS provider (risu.tech) updated by automation for split-horizon or public resolution. |
| The Internet | Network | Public network through which external visitors arrive and internal resources are reached. |

Network Model v1 (Power-Constrained Phase)

Purpose

Document the as-built network state, the rationale behind it, and the intended evolution path. This is the baseline substrate for ingress, naming, and service exposure decisions.

As-Built Topology

Physical Topology

Internet
   |
ISP Modem (Bridge Mode)
   |
OpenWRT Router (Single NAT / DHCP / DNS)
   |
LAN Clients + Server Nodes
(IPMI connected on-demand only)

Logical Roles

| Role | Device/Service |
| --- | --- |
| Edge NAT | OpenWRT |
| DHCP Authority | OpenWRT |
| DNS | OpenWRT (AdGuard) |
| VPN Client Egress | OpenWRT (WireGuard → iVPN) |
| ISP Modem | Bridge mode only (no routing) |

IP Plan (Current)

  • Single flat LAN (one subnet).
  • DHCP and DNS are authoritative only on OpenWRT.
  • Specific CIDR, DHCP ranges, and static reservations live in OpenWRT config.

Trade-offs (Intentional)

  • No VLAN segmentation yet: Deferred due to hardware and power constraints.
  • No dedicated firewall: OpenWRT fulfills boundary duties for now.
  • No managed switch: The network spine is temporary; port/power constraints apply.
  • IPMI not always-on: Connected only when needed to conserve ports and power.

Evolution Roadmap

  • Phase 1 (Current): Single NAT/DHCP/DNS, flat LAN.
  • Phase 2: Add managed switch and introduce VLANs.
  • Phase 3: Dedicated firewall and segmented trust zones.

Network Evolution Plan (VLANs and Ingress Separation)

Purpose

Define the next phases for segmentation and ingress separation so the current flat LAN can evolve without disruptive renumbering.

Phase Targets

  • Keep the existing LAN (10.0.0.0/24) stable during transition.
  • Introduce clear trust boundaries: Clients, Servers, Management, DMZ, IoT, Guest, Lab.
  • Reserve address space and VIP ranges now to simplify later MetalLB/kube-vip usage.
  • Separate public and internal ingress paths, with split-horizon DNS.

VLANs and Subnets (Proposed)

| VLAN | Name | Subnet | Purpose | Typical Residents |
| --- | --- | --- | --- | --- |
| 10 | LAN | 10.0.0.0/24 | Default user network | PCs, phones, TVs |
| 20 | SERVER | 10.0.20.0/24 | App workloads, cluster nodes | Talos/K8s nodes, storage |
| 30 | MGMT | 10.0.30.0/24 | Out-of-band + admin | IPMI/BMC, switch/AP management |
| 40 | DMZ | 10.0.40.0/24 | Public-facing edge only | Public ingress VIPs / edge svc |
| 50 | IOT | 10.0.50.0/24 | Untrusted devices | Cameras, smart devices |
| 60 | GUEST | 10.0.60.0/24 | Visitor access | Guest Wi-Fi clients |
| 70 | LAB | 10.0.70.0/24 | Experiments | Test gear, ephemeral nodes |

DHCP and Gateway Plan (Examples)

Assuming router-on-a-stick (trunk to switch):

| VLAN | Gateway | DHCP Scope | Notes |
| --- | --- | --- | --- |
| 10 | 10.0.0.1 | 10.0.0.10–250 | Keep current allocations |
| 20 | 10.0.20.1 | 10.0.20.50–250 | Reserve low IPs for VIPs/statics |
| 30 | 10.0.30.1 | none or limited | Prefer static/reservations |
| 40 | 10.0.40.1 | none or limited | DMZ should be explicit |
| 50 | 10.0.50.1 | 10.0.50.50–250 | Tight egress rules |
| 60 | 10.0.60.1 | 10.0.60.50–250 | Internet only |
| 70 | 10.0.70.1 | optional | Lab isolation |
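
As a worked example, a minimal OpenWRT UCI sketch for bringing up the SERVER VLAN (20) with the scope above. The interface and device names (server, br-lan.20) are assumptions that depend on the actual switch and port layout:

```sh
# Sketch only: assumes a DSA-style switch with VLAN 20 tagged on br-lan.
uci set network.server=interface
uci set network.server.proto='static'
uci set network.server.device='br-lan.20'    # hypothetical device name
uci set network.server.ipaddr='10.0.20.1'
uci set network.server.netmask='255.255.255.0'

uci set dhcp.server=dhcp
uci set dhcp.server.interface='server'
uci set dhcp.server.start='50'               # scope starts at 10.0.20.50
uci set dhcp.server.limit='200'
uci set dhcp.server.leasetime='12h'

uci commit network && uci commit dhcp
/etc/init.d/network reload && /etc/init.d/dnsmasq restart
```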

Default Inter-VLAN Policy (Allow Only What Is Needed)

  • LAN (10) → Internal ingress/services (20): allow service ports.
  • LAN (10) → MGMT (30): deny, except specific admin workstation or VPN admin group.
  • VPN/Admin → MGMT (30): allow.
  • DMZ (40) → Servers (20): allow only public ingress backends.
  • IOT (50) → anywhere: deny by default, allow minimal egress if needed.
  • GUEST (60) → internal: deny (internet only).
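
A sketch of how one of these allows could look as an OpenWRT firewall rule, assuming zones named lan and server already exist (the zone names are hypothetical):

```sh
# Allow LAN (10) -> SERVER (20) on HTTPS only; other inter-VLAN traffic stays denied.
uci add firewall rule
uci set firewall.@rule[-1].name='Allow-LAN-to-SERVER-https'
uci set firewall.@rule[-1].src='lan'
uci set firewall.@rule[-1].dest='server'
uci set firewall.@rule[-1].proto='tcp'
uci set firewall.@rule[-1].dest_port='443'
uci set firewall.@rule[-1].target='ACCEPT'
uci commit firewall && /etc/init.d/firewall reload
```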

Ingress Separation Model

  • Public Ingress: Internet-reachable hostnames only; prefer placement in DMZ (VLAN 40) when available.
  • Internal Ingress: LAN/VPN-only hostnames; placed in SERVER (VLAN 20) or LAN (VLAN 10) during early phase.
  • Start with both ingress controllers in VLAN 20 (simpler); move Public Ingress VIPs to VLAN 40 when DMZ exists.

VIP Reservations (Examples)

  • Internal ingress VIPs: 10.0.20.10–10.0.20.19
  • Public ingress VIPs: 10.0.40.10–10.0.40.19
  • Gateways: .1, network services: .2–.9
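
If MetalLB ends up providing the L4 VIPs (see the Connectivity & Naming role), the internal reservation above could be declared roughly like this; the pool names are hypothetical:

```sh
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: internal-ingress          # hypothetical pool name
  namespace: metallb-system
spec:
  addresses:
    - 10.0.20.10-10.0.20.19       # internal ingress VIP range from above
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: internal-ingress-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - internal-ingress
EOF
```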

DNS Expectations (Split-Horizon)

  • Use the unified namespace *.risu.tech (per Exposure Policy and Split-Horizon ADR).
  • Internal-only names resolve to internal VIPs (e.g., wiki.risu.tech → 10.0.20.10 on LAN/VPN).
  • Public names resolve externally only when intentionally exposed (e.g., status.risu.tech).
  • Internal resolvers must not return public IPs for internal-only names.

Diagram (Ingress and Trust Zones)

flowchart TD
  Internet((Internet)) --> WAN[WAN]
  WAN --> Edge["Router/Firewall: OpenWRT now, dedicated later (policy gate)"]

  subgraph VLAN10[LAN 10 - 10.0.0.0/24]
    Clients[LAN Clients]
  end

  subgraph VLAN20[SERVER 20 - 10.0.20.0/24]
    Nodes[K8s/Talos Nodes]
    IntIngress[Internal Ingress VIPs]
    Services[Internal Services]
  end

  subgraph VLAN30[MGMT 30 - 10.0.30.0/24]
    IPMI[IPMI/BMC]
    NetMgmt[Switch/AP Mgmt]
  end

  subgraph VLAN40[DMZ 40 - 10.0.40.0/24]
    PubIngress[Public Ingress VIPs]
  end

  Edge --> VLAN10
  Edge --> VLAN20
  Edge --> VLAN30
  Edge --> VLAN40

  Clients --> IntIngress --> Services
  Internet -.->|Allowed 80/443 only via firewall/NAT| PubIngress --> Services

Migration Steps (Incremental)

  1. Current (flat): keep everything on 10.0.0.0/24, single DHCP (done).
  2. Add managed switch: trunk to router, keep most devices untagged on VLAN 10.
  3. Move servers to VLAN 20; keep clients on VLAN 10.
  4. Move management to VLAN 30 (static/reserved IPs).
  5. Add DMZ VLAN 40 for public ingress VIPs; expose only 80/443 as needed.

Platform Roles

The Home Lab platform is built upon a set of stable, well-defined roles. These roles represent the “bones” of the infrastructure—foundational capabilities that must remain stable regardless of which specific applications are running.

Role Catalog

Mapping Roles to Implementation

Each role is defined by its responsibilities and requirements. The specific technologies used to fulfill these roles (e.g., K3s, Authelia, Traefik) may evolve, but the roles themselves remain constant.

Role: Edge & Boundary

Responsibility

The Edge & Boundary role is the first line of defense. It is responsible for terminating public traffic and enforcing the transition from untrusted networks (Internet) to trusted networks (Home Network/VPN).

Key Guarantees

  • Traffic Termination: All public HTTPS traffic must terminate at the Edge.
  • L7 Load Balancing: Spreading requests across multiple “floating” service instances regardless of their physical node location.
  • Protocol Enforcement: Only authorized protocols (HTTPS, WireGuard) are permitted to cross the boundary.
  • Isolation: Publicly reachable services must be logically isolated from the internal-only platform.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Traefik | K3s, K8s, Nomad | Simple config, Let’s Encrypt, dynamic discovery, forward-auth. | Some features are Traefik-native; middleware sprawl. | Works with OIDC via forward-auth; requires standard headers. |
| NGINX Ingress | K8s, Talos | Very common, strong annotation ecosystem. | Auth relies on external proxies (oauth2-proxy); annotation-heavy. | Pairs well with oauth2-proxy; explicit ingress classes needed. |
| Caddy | Nomad, Small K8s | TLS automation; simple reverse proxy story. | Less “platformy” out of the box; varies by env. | Decide if identity is enforced here or at auth gateway. |

Typical Stack Pairings

  • K3s: Traefik (native feel)
  • Talos/K8s: NGINX Ingress (most common)
  • Nomad: Traefik or Caddy
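
As an illustration of the forward-auth pattern mentioned above, a Traefik middleware on Kubernetes might look roughly like this; the auth endpoint URL is a placeholder for whatever the chosen identity layer exposes:

```sh
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forward-auth
spec:
  forwardAuth:
    address: http://auth.risu.tech/verify   # placeholder auth-proxy endpoint
    authResponseHeaders:
      - X-Forwarded-User                    # standard header passed to the app
EOF
```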

Role: Identity & Access

Responsibility

The Identity role provides the “Who” for the entire platform. It manages user identities, credentials, and group memberships, and provides a unified authentication experience (SSO).

Key Guarantees

  • Centralized Truth: One directory for all human users.
  • MFA Enforcement: Critical services must require multifactor authentication.
  • SSO: Users should only need to authenticate once to access multiple platform services.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Authentik | All stacks | Flexible flows, “one IdP for everything”. | Operating an IdP (backup, upgrades, DB). | Choose enforcement at ingress: forward-auth, oauth2-proxy, or mesh-based. |
| Keycloak | K8s, Talos | Enterprise-grade, standard OIDC/SAML, great docs. | Heavy; tuning/upgrade complexity. | Pairs well with oauth2-proxy and standard OIDC clients. |
| Authelia | K3s, Nomad | Lightweight auth portal, simple 2FA, forward-auth. | Less of a “platform” than Authentik/Keycloak. | If OIDC is needed for apps, a full IdP might still be required. |

Typical Stack Pairings

  • Traefik: Authentik + forward-auth (or Authelia)
  • NGINX Ingress: Authentik/Keycloak + oauth2-proxy
  • Any: IdP + apps using OIDC directly (for “native SSO” apps)
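
With NGINX Ingress and oauth2-proxy, enforcement is typically annotation-driven. A sketch, with placeholder hostnames:

```sh
kubectl annotate ingress wiki \
  nginx.ingress.kubernetes.io/auth-url='https://oauth2.risu.tech/oauth2/auth' \
  nginx.ingress.kubernetes.io/auth-signin='https://oauth2.risu.tech/oauth2/start?rd=$escaped_request_uri'
```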

Role: Connectivity & Naming

Responsibility

This role ensures that users and services can find each other. It handles DNS resolution and internal routing, maintaining a consistent namespace across local and remote connections.

Key Guarantees

  • Unified Namespace: Use of *.risu.tech globally.
  • Split-Horizon DNS: Internal names resolve to internal IPs; external names point to the Edge.
  • Service Discovery: Automatic detection and registration of “floating” workloads.
  • L4 Load Balancing (VIP): Providing stable virtual IPs for cluster-wide services (like Ingress) to ensure they are reachable even if nodes fail.

Current Stack Choice

  • OpenWRT as the bootstrap resolver, Technitium DNS as the internal authority, and ExternalDNS for Kubernetes-driven automation. Details and runbooks live in Connectivity & Naming Stack.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| CoreDNS + ExternalDNS | K8s, K3s, Talos | K8s-native, clean in-cluster discovery. | Split-horizon needs careful design. | Decide “source of truth” (Git/IaC) and VPN DNS view. |
| Pi-hole / AdGuard Home | Any (External) | Easy local DNS + blocking; great for split-horizon. | Another stateful service; HA takes effort. | Ensure VPN hands out this DNS; avoid public leaks. |
| WireGuard / Tailscale | Any | Stable remote access. | Tailscale is managed-ish; WireGuard is DIY. | DNS distribution over VPN is the key integration point. |
| MetalLB / Kube-vip | K8s, K3s, Talos | Provides L4 LoadBalancer IPs on bare-metal. | Requires network support (ARP/BGP); configuration overhead. | Essential for giving the Ingress Controller a stable IP in a cluster. |
| Consul | Nomad | First-class in Nomad ecosystems. | Adds a control plane component. | Decide how Consul names map to your DNS naming scheme. |

Typical Stack Pairings

  • K8s: MetalLB/Kube-vip + CoreDNS + ExternalDNS + WireGuard/Tailscale
  • Nomad: Consul (+ DNS integration) + Traefik/Fabio + WireGuard/Tailscale
  • Hybrid: Pi-hole/AdGuard as “front” DNS for LAN/VPN regardless of orchestrator

Connectivity & Naming Stack (OpenWRT + Technitium + ExternalDNS)

Scope

Defines how names under risu.tech are resolved for LAN, VPN, and Kubernetes workloads, and how DNS automation and failure modes behave.

Goals

  • Internal names (e.g., wiki.risu.tech) resolve only on LAN/VPN.
  • Public names resolve from anywhere without exposing internal metadata.
  • Internal DNS is authoritative and automated from Kubernetes via ExternalDNS.
  • DNS outages degrade safely: public domains keep resolving and the platform can bootstrap without internal DNS.

Non-Goals

  • Multi-site or geo-distributed DNS.
  • Automating the public zone in this phase.
  • Making the router a permanent authoritative DNS platform.

Roles and Responsibilities

  • OpenWRT (bootstrap resolver): DHCP authority, default resolver for clients, recursion to public upstreams, conditional forward of internal zones to Technitium, local static overrides for recovery.
  • Technitium DNS (internal authority): Hosts authoritative internal records and optional recursion; reachable only from LAN/VPN; uses IP-based upstream configuration.
  • ExternalDNS (automation controller): Watches Kubernetes resources and reconciles allowed records into Technitium; limited to explicitly delegated hostnames.

Resolution Flows

  • Internal name (normal): Client → OpenWRT → Technitium → internal VIP/endpoint.
  • Public name: Client → OpenWRT → public recursive resolution.

Dependency-Loop Prevention

  • Principle: nothing required to bootstrap the platform should depend on Technitium.
  • Invariants:
    • Clients always use OpenWRT as resolver in Phase 1.
    • OpenWRT keeps minimal static records (Technitium VIP and internal ingress VIP) to reach recovery paths.
    • Technitium upstreams are configured by IP or forward recursion to OpenWRT by IP.
    • ExternalDNS targets Technitium by stable IP/VIP, not hostname.

Failure Behavior

  • Technitium down: Internal names fail except the static overrides; public names still resolve via OpenWRT.
  • ExternalDNS down: Existing records served; no new automation until it returns.
  • OpenWRT DNS down: Clients lose DNS (Phase 1 SPOF); acceptable until resolver redundancy is added.

Zone Strategy

  • Preferred: split-horizon risu.tech (same zone name internal and public).
  • Safety controls: Technitium not reachable from WAN; ExternalDNS constrained via annotation/label allowlists, TXT ownership, and domain filters; public DNS managed separately.
  • Alternative: internal sub-zone such as int.risu.tech if split-horizon proves risky.

Record Ownership

  • ExternalDNS-managed: Annotated Kubernetes services/ingresses that are allowed for automation.
  • Manually managed: Bootstrap overrides on OpenWRT, core infrastructure names, sensitive records.

Kubernetes Integration

  • CoreDNS handles in-cluster service discovery and does not depend on Technitium.
  • ExternalDNS maintains registry markers to avoid overwriting manual records.
  • Prefer VIP/stable IP for Technitium reachable from OpenWRT and workloads.
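
A sketch of the ExternalDNS flags implied by these constraints. The webhook provider for Technitium is an assumption (ExternalDNS has no built-in Technitium provider); the scoping flags themselves are standard:

```sh
# Container args for the ExternalDNS deployment (sketch).
# --provider=webhook assumes a Technitium webhook provider reachable at a stable IP.
external-dns \
  --source=ingress \
  --source=service \
  --domain-filter=risu.tech \
  --registry=txt \
  --txt-owner-id=homelab \
  --provider=webhook
```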

Testing

  • Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
  • Bring up cluster with Technitium delayed: control-plane access must work via OpenWRT overrides.
  • Kill Technitium: public DNS works; internal names fail as expected.
  • Kill ExternalDNS: existing internal names still resolve.
  • WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.
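
These checks translate directly into dig one-liners. A sketch assuming OpenWRT at 10.0.0.1 and an internal VIP in 10.0.20.0/24 (both placeholders):

```sh
# From a LAN/VPN client: internal names resolve to internal VIPs.
dig +short wiki.risu.tech @10.0.0.1     # expect a 10.0.20.x answer

# With Technitium down: public resolution must still work via OpenWRT.
dig +short codeberg.org @10.0.0.1

# From WAN (e.g., cellular): internal-only names must not resolve.
dig +short wiki.risu.tech               # expect no answer
```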

Implementation Notes

  • Keep OpenWRT as the client-facing resolver in Phase 1; migrate clients later only after resolver redundancy exists.
  • Favor IP-based configuration for anything that talks to DNS to avoid “DNS requires DNS” loops.
  • Use stable VIPs where possible so OpenWRT, Technitium, and ExternalDNS share a consistent target.

Role: Storage & Persistence

Responsibility

The Storage role manages the state of the platform. It provides persistent volumes to applications and ensures that data is replicated and backed up according to its criticality.

Key Guarantees

  • Data Durability: Protection against single-node or single-disk failure.
  • RPO/RTO Compliance: Backups must be performed and verified according to policy.
  • Abstraction: Applications should request storage via standard interfaces (e.g., PVCs) without knowing the underlying disk layout.
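
The Abstraction guarantee in practice: an application requests a volume by class and size only. A minimal sketch; the storage class name depends on the chosen fabric:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn    # placeholder; set by the chosen storage fabric
  resources:
    requests:
      storage: 10Gi
EOF
```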

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Longhorn | K3s, K8s | Easy replicated block storage. | Performance/latency tradeoffs. | Tie into snapshot/backup story (Velero). |
| Rook-Ceph | K8s, Talos | Powerful HA, block/file/object storage. | Complexity; resource hungry; learning curve. | Needs disciplined node disks and upgrade choreography. |
| ZFS | Nomad, K8s | Solid local storage, snapshots, replication. | Not a distributed fabric; HA is “replication + restore”. | Orchestration integration varies; great for “pet data”. |
| NFS / SMB | Any | Simple shared storage. | Central dependency; HA depends on NAS. | Backups are straightforward; locking semantics vary. |

Typical Stack Pairings

  • K3s: Longhorn
  • Talos/K8s: Rook-Ceph (for strong HA) or NFS (for simplicity)
  • Nomad: ZFS (host-based) + replication or NFS

Role: Compute & Orchestration

Responsibility

The Compute role provides the execution environment for all platform workloads. It handles scheduling, lifecycle management (start/stop/restart), and resource isolation between tenants.

Key Guarantees

  • Automated Scheduling: Workloads are placed on nodes based on resource availability and constraints.
  • Self-Healing: Automatic recovery of failed workloads.
  • Resource Isolation: Enforced limits on CPU, memory, and disk to prevent “noisy neighbor” effects.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| K3s | Pragmatic homelab | Lightweight Kubernetes; easy setup; bundled extras (Traefik). | Still Kubernetes complexity; some opinionated defaults. | Best with Traefik + Longhorn patterns. |
| Kubernetes on Talos | Enterprise-grade HA | Immutable OS; security; “OS as appliance” feel. | Steeper learning curve; non-traditional debugging. | Pairs with NGINX + Rook-Ceph + Velero. |
| Nomad | Simple/Flexible | Easy to run; handles non-container workloads; low overhead. | Smaller ecosystem than K8s; stateful patterns more DIY. | Best with Consul/Vault, ZFS host volumes, Traefik/Caddy. |

Typical Stack Pairings

  • K3s: Traefik + Longhorn
  • Talos/K8s: NGINX + Rook-Ceph
  • Nomad: Traefik + ZFS + Consul

Role: Operations

Responsibility

The Operations role encompasses the tools and processes required to maintain, monitor, and update the platform. It ensures repeatability through GitOps and visibility through observability.

Key Guarantees

  • Observability: Centralized metrics, logs, and alerting for all platform components.
  • Change Management: Git (GitOps) drives all platform changes.
  • Disaster Recovery: Automated backups and verified restore paths for stateful data.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Prometheus / Grafana / Loki | All stacks | Standard dashboards; large community; mature. | Can sprawl; needs retention planning. | K8s has the most turnkey packaging (kube-prometheus-stack). |
| GitOps (Argo CD / Flux) | K8s, K3s | Repeatability; drift control; clear audit trail. | Higher initial platform complexity. | Very mature on K8s; Nomad has options but less standardized. |
| Velero | K8s, K3s | K8s-native backups; CSI integration for snapshots. | K8s-specific. | Best for K8s/K3s clusters. |
| Restic | Any | General purpose; deduplication; encrypted backups. | More manual configuration on K8s. | Great for Nomad/ZFS/NAS-based approaches. |

Typical Stack Pairings

  • K8s/K3s: Prometheus + ArgoCD/Flux + Velero
  • Nomad/Other: Prometheus + Restic

Trust Boundaries & Access Model

Trust View

This document defines the network reachability and security posture of the platform. It answers the question: From where can traffic originate and where can it go?

System Boundaries

The platform is divided into distinct zones with hard boundaries.

flowchart LR
  Internet((Internet)) -->|HTTPS| PublicIngress[Public Ingress]
  PolicyNote["No inbound NAT or public path to Internal Ingress"]

  subgraph Home["Home Network Boundary"]
    LAN[LAN Clients] --> PrivateDNS[Private DNS]
    VPN[VPN Clients] --> PrivateDNS
    PrivateDNS --> InternalIngress
    InternalIngress --> Services[Internal Services]
    PublicIngress --> PublicServices[Public Services]
  end

  PublicDNS[Public DNS] --> PublicIngress

Zone Definitions

The Internet (Untrusted)

Any client originating outside the home network. Only allowed to communicate with the Public Ingress via HTTPS.

The Home Network (Trusted Boundary)

A secure zone containing both LAN and VPN clients.

  • LAN Clients: Physical devices connected to the home router.
  • VPN Clients: Remote devices with an active, authenticated tunnel.
    • Enrollment: Only verified devices are permitted to join the network.
    • Experience: Remote devices experience connectivity identical to local network access (Split-Horizon DNS + Private IPs).
    • Security: Encrypted communication channels are maintained for all remote traffic.

Internal Platform (Protected)

Services that are never exposed to the internet. Reachability is strictly limited to clients already inside the Home Network Boundary.

Reachability Matrix

| From \ To | Public Services | Internal Services | Management (SSH/Git) |
| --- | --- | --- | --- |
| Internet | HTTPS | ❌ Blocked | ❌ Blocked |
| LAN | HTTPS | HTTPS | Authorized Only |
| VPN | HTTPS | HTTPS | Authorized Only |

Key Security Postures

  • No Inbound NAT: There are no port-forwarding rules from the internet to internal service IPs.
  • Split-Horizon DNS: Service names (e.g., app.risu.tech) resolve to different IPs depending on whether the client is on the Internet or the Home Network.
  • Authenticated Ingress: All internal services require identity verification at the Ingress layer.

Identity & Login Model

Responsibility View

This document explains how identity and authentication work end-to-end. It answers the question: How does a user gain access to a service?

Authentication Flow

The following diagram illustrates the branching logic for user authentication, including session persistence and MFA requirements.

flowchart TD
    Start([User visits service.risu.tech]) --> Ingress[Ingress / Auth Proxy]
    Ingress --> CheckAppSession{Valid Session?}
    
    CheckAppSession -- "Yes" --> Forward[Forward to App]
    CheckAppSession -- "No" --> IdP[Redirect to Identity Provider]
    
    subgraph IdPFlow [Identity Provider]
        IdP --> CheckIdPSession{IdP Session exists?}
        CheckIdPSession -- "No" --> Login[User Credentials Prompt]
        Login --> Validate[Validate Credentials]
        Validate -- "Success" --> CheckMFA{MFA required?}
        Validate -- "Failure" --> Login
        
        CheckIdPSession -- "Yes" --> CheckMFA
        
        CheckMFA -- "Yes" --> MFAPrompt[MFA Challenge]
        MFAPrompt --> ValidateMFA[Validate MFA]
        ValidateMFA -- "Success" --> Authorize[Authorize Access]
        ValidateMFA -- "Failure" --> MFAPrompt
        
        CheckMFA -- "No" --> Authorize
    end
    
    Authorize --> Grant[Redirect with Session Cookie]
    Grant --> Ingress
    
    Forward --> Service[App Content]
    Service --> End([Success])

Functional Components

Auth Proxy / Ingress

The first line of defense. It intercepts requests and verifies the presence of a valid session cookie. If missing or expired, it handles the OIDC/SAML handshake with the IdP.

Identity Provider (IdP)

The source of truth for user accounts and groups. It manages credentials, MFA enrollment, and issues tokens upon successful authentication.

Application (Service)

The destination. Most applications are “auth-blind,” relying on the Auth Proxy to provide user information via headers (e.g., X-Forwarded-User).

Platform Policies

These policies define the “Hard Gates” and operational constraints that all platform implementations and services must satisfy. They ensure consistency, security, and durability across the environment.

Policy Index

  • Exposure Policy
  • Identity Policy
  • Backup Policy
  • Change Management Policy

Exposure Policy

Rules

This document defines how services are exposed to users and the network requirements for each exposure category.

Exposure Categories

Public

  • Definition: Services reachable from the internet.
  • DNS: Must resolve to the Public IP of the platform.
  • Auth: Must enforce SSO/MFA at the Ingress layer.
  • TLS: Must use valid, publicly trusted certificates.

Internal

  • Definition: Services reachable only from LAN or VPN.
  • DNS: Must resolve to a Private IP (RFC1918).
  • Auth: Must enforce SSO at the Ingress layer.
  • TLS: Should use certificates (internal or public CA).

VPN-Only

  • Definition: Services reachable only from VPN clients.
  • DNS: Must resolve to a Private IP (RFC1918) only on VPN resolvers.
  • Auth: Must enforce SSO at the Ingress layer.
  • TLS: Should use certificates (internal or public CA).

Management

  • Definition: Administrative endpoints (SSH, Git, control plane consoles).
  • DNS: Must resolve to management-only records or private IPs.
  • Auth: Must enforce MFA and privileged access controls.
  • TLS: Must use certificates (internal or public CA).

Mandatory Ingress Behavior

| Category | Allowed Ingress | Allowed Source Networks | DNS Resolution |
| --- | --- | --- | --- |
| Public | Public Ingress only | Internet | Public IP |
| Internal | Internal Ingress only | LAN + VPN | Private IP |
| VPN-Only | Internal Ingress only | VPN only | Private IP (VPN resolvers only) |
| Management | Management endpoints only | Admin LAN + VPN | Private IP / management records |

Mandatory Auth Requirements

| Category | Authentication | Authorization |
| --- | --- | --- |
| Public | SSO + MFA at Ingress | Group-based access (IdP) |
| Internal | SSO at Ingress | Group-based access (IdP) |
| VPN-Only | SSO at Ingress | Group-based access (IdP) |
| Management | MFA + privileged access | Admin-only groups, audited access |

Naming Rules

  • All services MUST use the *.risu.tech domain.
  • Internal service names MUST match their public counterparts (if they exist) to ensure a seamless user experience.
  • The platform uses Split-Horizon DNS to ensure that app.risu.tech resolves to the correct IP based on the client’s network location.

Traffic Constraints

  • Public Ingress MUST NOT route traffic to backends tagged as “Internal.”
  • Internal Ingress MUST drop any traffic originating from outside the Home Network Boundary.
  • No direct port-forwarding (NAT) to backend services is allowed. All traffic must pass through an Ingress controller.
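
One way to back the first constraint inside the cluster is a NetworkPolicy that only admits traffic from the internal ingress. A sketch with hypothetical namespace and labels:

```sh
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: internal-backends-only
  namespace: internal-apps            # hypothetical namespace for internal backends
spec:
  podSelector: {}                     # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              ingress-tier: internal  # hypothetical label on the internal ingress namespace
EOF
```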

Identity Policy

Rules

This document defines the rules all services and users must obey regarding identity and access.

Guarantees

  • Unified Login: A single set of credentials and session is used across all platform services.
  • MFA Enforcement: Multi-factor authentication is mandatory for all administrative access and any service exposed to the public internet (where supported).
  • Session Isolation: Authentication is handled by the platform, not the application, ensuring a uniform security posture.

Service Requirements

All services integrated into the platform MUST:

  1. Delegate Auth: Rely on the platform’s Identity Provider via OIDC, SAML, or Auth Proxy headers.
  2. Use Group-Based Access: Authorization should be based on IdP groups (e.g., admins, family), not individual user accounts.
  3. Support SSO: Be configured to allow seamless login via the platform session.

Auth Requirements by Exposure Category

| Category | Authentication | Authorization | Notes |
| --- | --- | --- | --- |
| Public | SSO + MFA enforced at ingress | IdP groups required | No anonymous access unless explicitly approved in a Service Contract. |
| Internal | SSO enforced at ingress | IdP groups required | Local accounts disallowed except break-glass. |
| VPN-Only | SSO enforced at ingress | IdP groups required | VPN enrollment required for network access. |
| Management | MFA required for all access | Admin-only groups | SSH keys or short-lived certs required for shell access. |

Management Access Rules

  • Administrative endpoints MUST be reachable only from Admin LAN or VPN networks.
  • SSH access MUST use keys or short-lived certificates; passwords are forbidden.
  • All management access MUST be attributable to a named admin identity and logged.

Negative Constraints

  • Services MUST NOT maintain their own local user databases for “standard” access.
  • Local “admin” or “break-glass” accounts MUST have high-entropy, randomly generated passwords stored in a secure vault.
  • Clear-text passwords MUST NEVER be stored in the Git repository.

Backup Policy

Rules

This document defines the rules for protecting data and ensuring its recoverability.

Data Tiers & RPO/RTO

| Tier | Description | RPO | RTO |
| --- | --- | --- | --- |
| Critical | Core identity, config, and family data. | 1 Hour | 4 Hours |
| Standard | Application data, media, and tools. | 24 Hours | 24 Hours |
| Disposable | Caches, logs, temporary files. | N/A | Best Effort |

Retention Rules

  • Critical Data: Must be backed up daily, with weekly offsite replication. Retain for 30 days minimum.
  • System Config: Must be backed up after every confirmed change (via Git).
  • Offsite Copies: At least one copy of critical data must be physically separated from the primary site.
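
If Velero is the chosen backup tool (see the Operations role), the Critical tier maps to roughly this schedule. The name and namespace list are hypothetical; the hourly cadence follows the 1-hour RPO and the 720h TTL the 30-day retention:

```sh
kubectl apply -f - <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-hourly          # hypothetical name
  namespace: velero
spec:
  schedule: "0 * * * *"          # hourly, matching the 1h RPO for Critical data
  template:
    includedNamespaces:
      - identity                 # hypothetical namespace holding Critical data
    ttl: 720h0m0s                # 30-day retention
EOF
```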

Verification Requirements

  • Automated Checks: Every backup job must report its status to the Observability platform.
  • Restore Drills: A manual restore test must be performed for each “Critical” service at least once every 6 months.
  • Immutability: Backups should be stored in a way that prevents modification or deletion by a compromised system (e.g., append-only mode).

Change Management Policy

Rules

This document defines how changes are made to the platform to ensure stability, auditability, and reproducibility.

The Source of Truth

The platform is defined entirely in code. The Git repository is the sole source of truth for:

  1. Infrastructure Configuration: YAML, HCL, and scripts.
  2. Architecture Decisions: ADRs in Markdown.
  3. Technical Documentation: This book.

Change Workflow

All changes (except for emergency “break-glass” scenarios) must follow this flow:

  1. Draft: Propose the change in a new branch.
  2. Review: Peer review or self-review (for minor changes).
  3. Merge: Merge into the main branch.
  4. Deploy: Automated CI/CD pipelines apply the change.

Documentation Requirements

  • Significant architectural shifts MUST be recorded as an ADR.
  • All service deployments MUST have a corresponding entry in the Service Catalog.
  • Manual configuration on nodes is strictly forbidden unless codified immediately after.

Secrets Management

  • Clear-text secrets MUST NEVER be committed to Git.
  • Use a dedicated secrets manager or encrypted storage (e.g., SOPS) for credentials.
  • Secrets MUST be rotated if a compromise is suspected or as per the defined rotation schedule.
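
A minimal SOPS example, assuming age keys; the recipient string is a placeholder:

```sh
# Encrypt before committing; only the encrypted file ever reaches Git.
sops --encrypt --age <age-public-key> secrets.yaml > secrets.enc.yaml

# Decrypt locally or in CI where the matching private key is available.
sops --decrypt secrets.enc.yaml > secrets.yaml
```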

Data Durability Model

Responsibility View

This document defines how data is stored, replicated, and protected. It answers the question: How is data kept safe and available?

Data Pipeline

The following diagram shows the lifecycle of data from the application to offsite storage.

flowchart TB
  App[Stateful App] --> Request["Storage Interface"]
  Request --> Storage[Storage Fabric]
  Storage --> Replicas["Replicated Copies (N>=2)"]
  Replicas --> Backup[Backup / Snapshot System]
  Backup --> Offsite[(Optional: Offsite Copy)]

Layers of Protection

Storage Fabric

The active storage layer (e.g., Ceph, ZFS, or RAID). It provides immediate availability and protection against single-drive or single-node failures via real-time replication.

Snapshots

Point-in-time, read-only views of the storage. These provide “undo” capability for accidental deletions or software corruption without requiring a full restore.

Backup System

A separate, immutable copy of the data stored on different physical media. This protects against catastrophic failure of the primary storage fabric.

Workload Orchestration Model

Responsibility View

This document defines how applications are deployed and managed across the platform. It answers the question: How are workloads kept running and healthy?

Orchestration Lifecycle

The platform automatically manages the lifecycle of applications, ensuring they are placed on suitable nodes and restarted if they fail.

flowchart TD
  Def[Workload Definition] --> Desired[(Desired State Store)]
  
  subgraph ControlPlane["Control Plane (Decides)"]
    Recon[Reconciler / Controller]
    Sched[Scheduler]
  end

  subgraph DataPlane["Data Plane (Runs)"]
    subgraph Nodes["Nodes"]
      A[Node Agent]
      B[Node Agent]
      C[Node Agent]
    end
    WL[Running Workloads]
  end

  %% Observe
  Nodes --> Obs[Health & Telemetry Signals]
  WL --> Obs

  %% Decide
  Desired --> Recon
  Obs --> Recon
  Recon -->|needs placement| Sched
  Sched -->|bind workload| Nodes

  %% Actuate
  Recon -->|start/stop/restart| Nodes
  Nodes -->|run| WL

Key Capabilities

Automated Scheduling

Workloads are assigned to nodes based on resource availability (CPU/RAM) and affinity rules. This ensures that no single node is overwhelmed while others are idle.

Self-Healing

If a node or a specific workload fails, the scheduler automatically attempts to restart the workload on a healthy node, minimizing downtime.

Resource Governance

Every workload must have defined resource requests and limits. This prevents a single “noisy neighbor” from consuming all cluster resources.
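
A sketch of that governance on a Kubernetes workload; the names and values are illustrative:

```sh
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels: { app: example-app }
  template:
    metadata:
      labels: { app: example-app }
    spec:
      containers:
        - name: app
          image: nginx:stable      # placeholder image
          resources:
            requests: { cpu: 100m, memory: 128Mi }  # scheduler placement input
            limits: { cpu: 500m, memory: 256Mi }    # hard cap against noisy neighbors
EOF
```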

Observability Model

Responsibility View

This document defines how the platform monitors its health and alerts on failures. It answers the question: How do we know if something is wrong?

Signal Flow

The platform collects signals from all layers and aggregates them into actionable dashboards and alerts.

flowchart LR
  Nodes[Hardware/OS] --> Collector
  Pods[Workloads] --> Collector
  Ingress[Traffic] --> Collector
  Collector --> TSDB[(Metrics / Logs)]
  TSDB --> Dashboards[Visualization]
  TSDB --> AlertManager[Alerting]
  AlertManager --> Notification{Notification}

Core Signals

Metrics (Availability & Performance)

Numerical data points (CPU, Memory, Latency, Error Rate) used to determine the real-time health of a component.

Logs (Context & Security)

Textual records of events. Used for post-mortem analysis, security auditing, and troubleshooting complex failures.

Health Checks (Integrity)

Active probing of service endpoints (e.g., /healthz). This determines if a workload is ready to receive traffic or needs to be restarted.
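
In Kubernetes terms these two outcomes map to two probe types. A sketch; the path and port are placeholders:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: nginx:stable                       # placeholder image
      readinessProbe:                           # ready to receive traffic?
        httpGet: { path: /healthz, port: 80 }
        periodSeconds: 10
      livenessProbe:                            # restart the container if it wedges
        httpGet: { path: /healthz, port: 80 }
        failureThreshold: 3
EOF
```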

Control Plane Model

Purpose

This model defines where configuration lives, how it is applied, and what runs continuously vs only during deploys.

Control Flow

flowchart LR
  Git[Git Repository] --> CICD[CI/CD Pipeline]
  CICD --> Apply[Apply Mechanism]
  Apply --> Cluster[Cluster State]

  subgraph BreakGlass["Break-Glass Path"]
    Admin[Admin Session] --> Manual[Manual Change]
  end

  Manual --> Cluster
  Manual -. "Post-codify in Git" .-> Git

Configuration Sources of Truth

  • Primary: Git repository (IaC, manifests, scripts, docs).
  • Secrets: Encrypted secrets store (referenced from Git, never committed in clear text).

Apply Mechanism

  • CI/CD: Executes validation, build, and apply steps on merge to main.
  • IaC Tooling: Terraform/Ansible/Helm (implementation TBD, interchangeable by contract).
  • Controllers: In-cluster controllers reconcile desired state continuously.

Continuous vs Deploy-Time

  • Continuous: Ingress controllers, identity proxy, DNS sync jobs, monitoring/alerting.
  • Deploy-Time: Schema migrations, config changes, new service rollouts.

Break-Glass Rules

  • Manual changes are allowed only for incident response.
  • Any manual change MUST be codified in Git immediately after stabilization.

Management Plane Model

Purpose

This model defines where administrative endpoints live, how administrators authenticate, and which networks can reach management services.

Management Reachability

flowchart LR
  Admin[Admin Operator] -->|SSH / Git / HTTPS| MgmtEndpoints[Management Endpoints]
  PolicyNote["No inbound path from Internet"]

  subgraph Home["Home Network Boundary"]
    AdminLAN[Admin LAN] --> MgmtEndpoints
    VPN[Admin via VPN] --> MgmtEndpoints
  end

  MgmtEndpoints --> ControlPlane[Control Plane Services]
  MgmtEndpoints --> Nodes[Cluster Nodes]

  Internet((Internet)) -.-> PolicyNote

Access Rules

  • Management endpoints are never exposed to the public internet.
  • Only admin devices on Admin LAN or VPN can reach management endpoints.
  • Administrative access requires MFA and membership in privileged IdP groups.

Authentication Requirements

  • SSH: Keys or short-lived certificates only; passwords are forbidden.
  • Git/HTTPS: SSO with MFA enforced; audit logging enabled.
  • Break-Glass: Emergency accounts are stored in a secure vault and rotated after use.

Implementation Selection

Purpose

Move from architectural roles to concrete implementation choices by evaluating how different options compose into functional platform stacks.

1) Role Implementation Matrix

The following matrix summarizes the primary implementation options for each architectural role. For detailed trade-offs and integration notes, refer to the individual role documents.

| Role | Implementation Options | Primary Best-Fit Stacks |
| --- | --- | --- |
| Edge & Boundary | Traefik, NGINX Ingress, Caddy | K3s, K8s, Nomad |
| Identity & Access | Authentik, Keycloak, Authelia | All stacks |
| Connectivity & Naming | CoreDNS, ExternalDNS, Pi-hole, Consul | K8s, Nomad |
| Storage & Persistence | Longhorn, Rook-Ceph, ZFS, NFS | K3s, K8s, Nomad |
| Compute & Orchestration | K3s, K8s (Talos), Nomad | – |
| Operations | Prom/Grafana/Loki, GitOps, Velero, Restic | All stacks |

2) Stack Assemblies

Instead of starting with pre-baked bundles, we derive platform “stacks” as compatible sets of implementations that naturally compose together.

The Pragmatic Homelab (K3s-based)

Focuses on ease of use and low overhead while maintaining Kubernetes compatibility.

  • Orchestrator: K3s
  • Ingress: Traefik (Forward-auth)
  • LB (L4): Klipper (bundled) or MetalLB
  • Identity: Authentik
  • Storage: Longhorn
  • Backups: Velero + Restic
  • Observability: Prometheus + Grafana + Loki

The Appliance Cluster (Talos/K8s-based)

Focuses on HA, security, and immutability.

  • Orchestrator: Kubernetes on Talos
  • Ingress: NGINX Ingress (OAuth2-proxy)
  • LB (L4): Kube-vip (Layer 2)
  • Identity: Authentik or Keycloak
  • Storage: Rook-Ceph
  • Backups: Velero (CSI Snapshots)
  • Observability: Prometheus + Grafana + Loki

The Flexible Scheduler (Nomad-based)

Focuses on simplicity and host-integrated storage.

  • Orchestrator: Nomad
  • Ingress: Traefik or Caddy
  • Discovery/LB: Consul + Fabio/Traefik
  • Identity: Authentik (Forward-auth)
  • Storage: ZFS (Host volumes + Replication)
  • Backups: Restic
  • Observability: Prometheus + Grafana + Loki

3) Selection Criteria & Validation

We evaluate these stacks against our Non-Functional Requirements and Policies.

Hard Gates

These are non-negotiable policy checks.

  • No Inbound NAT: Must support exposure via tunnels or relay (see Exposure Policy).
  • Identity-First: All exposure points must enforce IdP-backed auth (see Identity Policy).
  • Cluster Reachability: Load balancing (L4/L7) must be addressed at Day 1; “floating” workloads require a stable entry point to be usable.
  • Durability: Must meet RPO 1h / RTO 4h for critical data (see Backup Policy).

Acceptance Tests

  1. Internal DNS: internal.service.risu.tech resolves internally and is unreachable from WAN.
  2. VPN access: VPN client resolves internal names and can access internal ingress.
  3. Public isolation: public ingress serves only public services, never internal.
  4. Identity flow: auth proxy + IdP flow works end-to-end for internal and public routes.
  5. Stateful proof: dummy stateful service gets storage, replica, backup job signal, and a restore test plan.

Non-Functional Requirements

This document details the non-functional requirements (NFRs) that govern the design, implementation, and operation of the homelab infrastructure.

Security

  • Secure Boundary Enforcement: Private services must be strictly isolated to prevent accidental exposure to the public internet.
  • Identity & Access Management: A centralized identity provider must be utilized, supporting multifactor authentication (MFA).
  • Secrets Governance: All credentials and sensitive data must be managed through defined storage and rotation policies.
  • Network Segmentation: Traffic flow between services must be restricted according to clearly defined security policies.

Connectivity & Networking

  • Seamless Remote Access: Remote devices must maintain an experience identical to local network connectivity via secure VPN.
  • Naming Consistency: A unified naming scheme (*.risu.tech) must be maintained across both public and private services using split-horizon DNS.

Availability & Reliability

  • High Availability (HA): The system must remain operational across multiple nodes, ensuring service continuity and data consistency.
  • Workload Rescheduling: Applications must automatically relocate to healthy nodes in the event of hardware or software failure.
  • Data Persistence: The storage fabric must guarantee data consistency and replication across failure domains.

Data Protection

  • Resilient Backup: Critical data must be protected through immutable and offline copies.
  • Disaster Recovery: Restoration procedures must meet defined Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
  • Restore Verification: Backup integrity must be regularly validated through systematic restore testing.

Usability

  • Low-Friction UX: The infrastructure must provide an intuitive and accessible experience for non-technical users.
  • Single Sign-On (SSO): Authentication must be streamlined to minimize login prompts through a unified session.

Maintainability

  • Advanced Observability: Centralized logging and metrics must be implemented to facilitate rapid troubleshooting and performance analysis.
  • Reproducibility: The entire infrastructure configuration must be defined within a central source-of-truth repository.
  • Documentation: Maintenance tasks must be supported by clear, actionable runbooks.
  • Automated Documentation Delivery: The source of truth for documentation must be automatically built and deployed to ensure accessibility and consistency.

Pipelines

Pipelines are found in the .forgejo/workflows/ directory in the source code repository, utilizing Forgejo Actions.

  • docs_deploy: Build mdBook and deploy static HTML to the documentation server via rsync/SSH.
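
A sketch of what docs_deploy could look like. The runner label, secret name, and target host are assumptions; only the mdBook-build-plus-rsync shape comes from the pipeline description above:

```sh
cat > .forgejo/workflows/docs_deploy.yml <<'EOF'
# Sketch: build mdBook and rsync the static HTML to the internal docs server.
on:
  push:
    branches: [main]
    paths: ['doc/**']
  workflow_dispatch:

jobs:
  deploy:
    runs-on: docker                 # hypothetical runner label
    steps:
      - uses: actions/checkout@v4
      - run: mdbook build doc       # assumes mdbook is available on the runner
      - run: |
          echo "${{ secrets.DEPLOY_SSH_KEY }}" > key && chmod 600 key
          rsync -az -e "ssh -i key" doc/book/ deploy@docs.internal:/var/www/doc/
EOF
```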

Service Catalog

This catalog contains the citizens of the platform. Each service is defined by a contract that specifies its requirements and how it fits into the platform’s architecture.

Catalog Entries

  • Service Contract (template)
  • OpenWRT Bootstrap Resolver
  • Technitium DNS
  • ExternalDNS

Lifecycle

Each service must satisfy the platform rules defined in Architecture Overview before it is shipped.

Service Contract: (Template)

Ownership

  • Owner:
  • Steward:

Purpose

What problem does it solve for the family/me?

Exposure

  • Category: Public | Internal | VPN-only | Management
  • Ingress: Public | Internal | Management
  • DNS names:

Identity

  • AuthN: SSO required | SSO + MFA required | local accounts (justify)
  • AuthZ: IdP group(s) required
  • Break-glass account: yes/no (location)

Data

  • Persistence: ephemeral | persistent
  • Data class: disposable | standard | critical
  • Estimated storage growth:

Network

  • Allowed source networks: Internet | LAN | VPN | Admin LAN
  • Egress requirements:

Availability

  • HA required: yes/no
  • Acceptable downtime:

Backup

  • Tier: none | standard | critical
  • Restore test cadence:

Dependencies

  • Needs database:
  • Needs object storage:
  • Needs SMTP:
  • Other:

Observability

  • Metrics:
  • Logs:
  • Alerts:

Change Control

  • Deployment method:
  • Rollback plan:

Notes / Risks

What could go wrong?

Service Contract: OpenWRT Bootstrap Resolver

Purpose

Authoritative DHCP/DNS front door for LAN/VPN clients; performs public recursion and conditionally forwards internal zones to Technitium while holding static overrides for recovery.

Exposure

  • Category: Internal | VPN-only
  • Ingress: Management
  • DNS names: distributed via DHCP; management UI reachable via static IP

Identity

  • AuthN: Local admin accounts
  • AuthZ: Admin account required for configuration changes
  • Break-glass account: Yes (documented in password vault)

Data

  • Persistence: Persistent (config backups required)
  • Data class: Standard
  • Estimated storage growth: Negligible

Network

  • Allowed source networks: LAN, VPN
  • Egress requirements: Public DNS upstreams; Internet for firmware updates

Availability

  • HA required: No (Phase 1 single resolver)
  • Acceptable downtime: Short maintenance windows; restores must be priority

Backup

  • Tier: Standard (export config before/after major changes)
  • Restore test cadence: After firmware updates or quarterly

Dependencies

  • Needs database: No
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable upstream DNS IPs

Observability

  • Metrics: DNS query/error counters (if available)
  • Logs: DNS and DHCP logs
  • Alerts: Loss of upstream resolution; DHCP pool exhaustion

Change Control

  • Deployment method: OpenWRT config/UI + git-backed config export
  • Rollback plan: Restore last known-good config backup

Notes / Risks

Phase 1 single point of failure for DNS; keep static overrides for Technitium and ingress VIP to enable recovery.

Service Contract: Technitium DNS

Purpose

Authoritative DNS for internal service names, serving LAN/VPN clients and Kubernetes-ingress endpoints; optional recursion or forwarding to OpenWRT.

Exposure

  • Category: Internal | VPN-only
  • Ingress: Internal
  • DNS names: dns.risu.tech (internal-only)

Identity

  • AuthN: Local admin accounts
  • AuthZ: Admin role required for zone changes
  • Break-glass account: Yes (stored in password vault)

Data

  • Persistence: Persistent (zones/config)
  • Data class: Standard
  • Estimated storage growth: Minimal

Network

  • Allowed source networks: LAN, VPN, cluster nodes
  • Egress requirements: Upstream DNS IPs (public or OpenWRT)

Availability

  • HA required: High (for internal service resolution) but not required for platform bootstrap
  • Acceptable downtime: Minutes; recovery path via OpenWRT static overrides

Backup

  • Tier: Standard (regular export of zones/config)
  • Restore test cadence: After major upgrades or quarterly

Dependencies

  • Needs database: No (embedded)
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable Service IP/VIP; upstream DNS reachable by IP

Observability

  • Metrics: Query rate, NXDOMAIN/servfail counts
  • Logs: Query/zone change logs
  • Alerts: Service availability; zone integrity errors

Change Control

  • Deployment method: Kubernetes (Talos) workload
  • Rollback plan: Redeploy previous version and restore last config backup

Notes / Risks

Must avoid DNS self-dependency: configure all upstreams and ExternalDNS endpoints by IP; keep WAN exposure disabled.

Service Contract: ExternalDNS

Purpose

Automate internal DNS records by reconciling annotated Kubernetes resources into Technitium with clear ownership boundaries.

Exposure

  • Category: Internal (cluster-only)
  • Ingress: Internal
  • DNS names: None (API-driven)

Identity

  • AuthN: Kubernetes service account
  • AuthZ: ClusterRole scoped to read ingress/service resources
  • Break-glass account: Not applicable

Data

  • Persistence: Ephemeral
  • Data class: Standard
  • Estimated storage growth: None

Network

  • Allowed source networks: Cluster nodes
  • Egress requirements: Technitium Service IP/VIP; Kubernetes API

Availability

  • HA required: No (automation only)
  • Acceptable downtime: Hours; existing records continue to resolve

Backup

  • Tier: None (state is declarative via Kubernetes + Technitium registry)
  • Restore test cadence: Not required

Dependencies

  • Needs database: No
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable Technitium IP/VIP; domain filters/ownership registry configured

Observability

  • Metrics: Reconciliation success/fail counts
  • Logs: Controller logs for record changes
  • Alerts: Persistent reconciliation failures

Change Control

  • Deployment method: Kubernetes deployment/helm/manifest
  • Rollback plan: Revert deployment manifest/helm release

Notes / Risks

Restrict domain filters and ownership to internal hostnames to avoid accidental public zone changes.

Runbooks

Operational runbooks for the homelab platform. Each runbook is designed to be copy-paste friendly and scoped to a single failure or procedure.

Catalog

  • DNS Bootstrap & Recovery (OpenWRT + Technitium + ExternalDNS)

Runbook: DNS Bootstrap & Recovery (OpenWRT + Technitium + ExternalDNS)

Purpose

Bring up or restore internal DNS while avoiding dependency loops. Applies to split-horizon risu.tech with OpenWRT as bootstrap resolver, Technitium as internal authority, and ExternalDNS for automation.

Preconditions

  • OpenWRT reachable with admin access.
  • Reserved stable IPs/VIPs for Technitium and internal ingress.
  • Access to Kubernetes cluster (Talos) for Technitium/ExternalDNS deployments.

Bootstrap Steps (greenfield or re-seed)

  1. OpenWRT
    • Ensure DHCP is enabled and advertises itself as DNS.
    • Verify public recursion works using upstream DNS IPs.
  2. Static overrides on OpenWRT
    • Add host overrides (see the UCI sketch after these steps):
      • dns.risu.tech → Technitium IP/VIP
      • ingress-internal.risu.tech → internal ingress VIP (optional but recommended)
  3. Deploy Technitium
    • Deploy to the cluster with a stable Service IP/VIP.
    • Configure upstream resolvers by IP (public) or forward recursion to OpenWRT by IP.
    • Keep WAN exposure disabled.
  4. Conditional forward on OpenWRT
    • Add forward rule: risu.tech → Technitium IP/VIP.
  5. Deploy ExternalDNS
    • Scope with domain filters/ownership registry to internal hostnames only.
    • Set provider endpoint to the Technitium IP/VIP (not hostname).
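
A UCI sketch for steps 2 and 4, assuming Technitium at 10.0.20.2 and the internal ingress VIP at 10.0.20.10 (placeholder addresses):

```sh
# Step 2: static host overrides so recovery paths survive a Technitium outage.
uci add_list dhcp.@dnsmasq[0].address='/dns.risu.tech/10.0.20.2'
uci add_list dhcp.@dnsmasq[0].address='/ingress-internal.risu.tech/10.0.20.10'

# Step 4: conditionally forward the internal zone to Technitium.
uci add_list dhcp.@dnsmasq[0].server='/risu.tech/10.0.20.2'

uci commit dhcp && /etc/init.d/dnsmasq restart
```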

Recovery: Technitium Down

  1. From a LAN/VPN client, confirm public DNS still works via OpenWRT.
  2. Use OpenWRT static overrides to reach the cluster ingress/UI.
  3. Restart Technitium workload; restore config/zones if needed.
  4. Validate conditional forwarding resumes and internal names resolve.

Recovery: ExternalDNS Down

  1. Confirm Technitium answers existing records.
  2. Restart ExternalDNS deployment; check logs for reconciliation success.

Recovery: OpenWRT DNS Down

  1. Clients lose DNS; bring OpenWRT back first (single resolver in Phase 1).
  2. Verify DHCP/DNS service restores; re-check conditional forward to Technitium.

Verification & Tests

  • Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
  • Start cluster with Technitium intentionally delayed: control plane reachable via overrides.
  • Kill Technitium: public DNS works; internal names fail (expected).
  • Kill ExternalDNS: existing internal names resolve; no new records created.
  • WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.

Notes

  • Keep all DNS dependencies by IP to avoid “DNS needs DNS.”
  • Once resolver redundancy exists, you may move clients to Technitium directly; update this runbook accordingly.

Architecture Decision Records

This directory contains a historical log of significant architectural decisions made throughout the evolution of the homelab project. Each record details the context, decision, and resulting consequences to provide transparency and rationale for the system’s design.

Records Index

  • ADR 0001: Use Codeberg as Public Git Host
  • ADR 0002: Record Architecture Decisions
  • ADR 0003: Split-Horizon DNS for Unified Naming
  • ADR 0004: Documentation Delivery System
  • ADR 0005: No Inbound NAT for Internal Services
  • ADR 0006: Identity-First Ingress for Service Access

ADR 0001: Use Codeberg as Public Git Host

Status

Accepted

Context

The homelab project requires a public git repository to host its architecture documentation, infrastructure-as-code (IaC), and potentially public-facing service configurations. This host serves as the “public face” of the project and must align with the project’s values regarding open source, privacy, and community-driven infrastructure.

While a self-hosted instance (e.g., Forgejo/Gitea) will be used for internal management and private code, a reliable public host is needed for:

  • Public visibility and collaboration.
  • External CI/CD triggers (e.g., for documentation deployment).
  • Mirroring and redundancy for critical configurations.

Decision

We will use Codeberg as the primary public git host for the homelab project.

Codeberg is chosen because:

  • It is based on Forgejo (a community fork of Gitea), which aligns with our internal management plane preferences.
  • It is a non-profit, community-driven platform that prioritizes privacy and freedom.
  • It provides a reliable, high-performance environment for hosting public repositories without the commercial baggage of larger platforms.

Consequences

  • The homelab repository (and associated subprojects) will be maintained on Codeberg.
  • Automation for documentation deployment (mdBook) will be integrated with Codeberg’s CI/CD (Woodpecker or Forgejo Actions) or triggered by Codeberg webhooks.
  • Public contributions and issues will be managed via the Codeberg interface.
  • Secret management must be strictly enforced to ensure no private credentials are leaked to the public Codeberg repositories.

ADR 0002: Record Architecture Decisions

Status

Accepted

Context

A formal mechanism is required to document architectural decisions made during the development and evolution of the homelab project. This ensures long-term consistency, provides critical context for future modifications, and facilitates knowledge transfer.

Decision

The project will utilize Architecture Decision Records (ADRs) to document significant architectural choices. These records will be maintained within the doc/src/adr/ directory, following a sequential numbering scheme.

Consequences

  • Enhanced Transparency: Provides clear visibility into the reasoning behind key architectural choices.
  • Historical Context: Establishes a permanent record of the system’s evolution.
  • Sustainable Maintenance: Facilitates easier onboarding and long-term system maintenance by preserving intent.

ADR 0003: Split-Horizon DNS for Unified Naming

Status

Accepted

Context

The project requires a unified naming scheme (*.risu.tech) that functions seamlessly across both public and private services. Key requirements include maintaining strict isolation for private services and providing a frictionless remote access experience that mirrors local network connectivity.

Decision

We will implement a split-horizon DNS architecture:

  • Public DNS Authority: Resolves records exclusively for public-facing endpoints.
  • Private DNS Authority: Resolves records for internal services and serves as the primary authority for LAN and VPN clients.
  • Context-Aware Routing: Ingress controllers will enforce hostname-based routing determined by the traffic’s origin (public vs. private).
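
To make the split concrete, the following illustrative sketch shows how the same name can resolve differently per horizon. All hostnames and addresses here are hypothetical placeholders, not records from the actual zones:

```yaml
# Illustrative only; names and addresses are hypothetical placeholders.
# Public authority: answers internet queries, lists only exposed names.
public_zone:
  blog.risu.tech: 203.0.113.10    # public ingress VIP
  # wiki.risu.tech is intentionally absent: public queries get NXDOMAIN.

# Private authority: answers LAN/VPN queries for the same zone.
private_zone:
  blog.risu.tech: 10.0.20.10      # internal ingress VIP, avoids hairpin NAT
  wiki.risu.tech: 10.0.20.10
```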

Consequences

  • Unified User Experience: Users utilize consistent service names regardless of their physical or network location.
  • Enhanced Security Profile: Internal service names and metadata are not exposed to public DNS.
  • Operational Complexity: Requires the management and synchronization of two distinct sets of DNS records.

ADR 0004: Documentation Delivery System

Status

Delayed (time constraints on the runners prevent cargo from compiling the dependencies; a workaround is needed)

Context

Infrastructure documentation must be easily accessible to all authorized users and updated automatically to reflect the current state of the repository. The documentation is authored in Markdown and managed by mdBook. We need a robust pipeline to build and deliver this documentation to a private destination on an internal server.

Decision

We will implement an automated documentation delivery system with the following components:

  • Source of Truth: The homelab repository on Codeberg.
  • Build Engine: Forgejo Actions (using Forgejo Runners), triggered on pushes to the main branch (specifically for changes within the doc/ directory) or via manual trigger (workflow_dispatch).
  • Single-Target Delivery:
    • Private: Automated deployment to an internal server at /var/www/doc via SSH/rsync for local access.
  • Security: SSH-based deployment will use a dedicated, restricted user and an SSH key stored as a secret in the CI environment.
  • Serving: Nginx will be used to serve the static HTML output on the internal server.
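
A minimal workflow sketch follows. The runner label, deploy user, internal hostname, and secret name are assumptions for illustration, not values taken from the repository:

```yaml
# Hypothetical sketch of the delivery workflow; runner label, hostname,
# deploy user, and secret name are placeholders.
name: deploy-docs
on:
  push:
    branches: [main]
    paths: ["doc/**"]
  workflow_dispatch: {}

jobs:
  deploy:
    runs-on: docker                  # assumed Forgejo Runner label
    steps:
      - uses: actions/checkout@v4
      - name: Build the book
        # Assumes a runner image with a prebuilt mdBook binary.
        run: mdbook build doc
      - name: Deploy to the internal server
        run: |
          # Materialize the restricted deploy key with safe permissions.
          install -m 600 /dev/null key
          printf '%s' "${{ secrets.DOCS_DEPLOY_KEY }}" > key
          rsync -az --delete -e "ssh -i key -o StrictHostKeyChecking=accept-new" \
            doc/book/ docs-deploy@docs.internal:/var/www/doc/
```

Using a prebuilt mdBook binary (or a container image that ships one) also sidesteps the cargo compile-time limits noted in the Status above.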

Consequences

  • Automated Consistency: Documentation is guaranteed to be up-to-date with the repository’s main branch.
  • Reduced Complexity: Focusing on a single, internal delivery target simplifies the pipeline and avoids dependency on external “best-effort” services.
  • Standardized Process: Leverages Forgejo Actions, providing compatibility with GitHub Actions-style workflows and existing Runner infrastructure.
  • Secret Management: Requires careful handling of SSH keys within the CI platform.

ADR 0005: No Inbound NAT for Internal Services

Status

Accepted

Context

The platform hosts both public and internal services. Internal services must never be internet-routable to preserve a strong trust boundary. The architecture already assumes split-horizon DNS and internal ingress controls, but the routing posture must be explicit and enforceable.

Decision

There will be no inbound NAT or port-forwarding from the internet to internal service IPs. All internal services are reachable only from LAN or VPN networks through the internal ingress.

Consequences

  • Internet-originated traffic can never reach internal services directly.
  • Public exposure is limited to explicitly designated public services via the public ingress.
  • Network policies and firewall rules must reflect the absence of inbound NAT.

ADR 0006: Identity-First Ingress for Service Access

Status

Accepted

Context

The platform exposes services to multiple audiences (public, internal, VPN-only, management). To enforce consistent access control and auditing, authentication should be centralized and uniform rather than implemented independently by each service.

Decision

All services must be fronted by an ingress layer that enforces identity at the platform level. Services must integrate with the platform Identity Provider via SSO (OIDC/SAML) or trusted auth proxy headers, with MFA required for public and management access.
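
As a concrete sketch, an ingress-nginx resource can delegate every request to the IdP via forward-auth annotations. The hostnames and service names below are hypothetical, and the outpost paths follow Authentik's documented nginx integration (see ADR 0008):

```yaml
# Minimal sketch assuming ingress-nginx and an Authentik outpost; the
# hostnames and service names are hypothetical placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wiki
  annotations:
    # Every request is authenticated by the IdP before reaching the app.
    nginx.ingress.kubernetes.io/auth-url: "http://ak-outpost.auth.svc.cluster.local:9000/outpost.goauthentik.io/auth/nginx"
    # Unauthenticated users are redirected into the SSO login flow.
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.risu.tech/outpost.goauthentik.io/start?rd=$escaped_request_uri"
spec:
  ingressClassName: internal        # LAN/VPN-only ingress class
  rules:
    - host: wiki.risu.tech
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wiki
                port:
                  number: 80
```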

Consequences

  • Services must not expose unauthenticated endpoints unless explicitly approved in a Service Contract.
  • The ingress layer becomes a critical security control that must be monitored and hardened.
  • Service onboarding requires identity integration as a first-class step.

ADR 0007: Kubernetes with TalosOS

Status

Accepted

Context

The homelab platform targets a multi-node server environment with room for future capability expansion (for example, optional non-default plugins). K3s was considered, but its optimization for edge/IoT and bundled defaults are less aligned with the desired flexibility. Nomad was also evaluated for its simplicity and support for both containerized and non-containerized workloads. In this environment, infrastructure-as-code and an immutable OS reduce Nomad’s operational advantages, and non-containerized workloads are unlikely.

Decision

Adopt a full Kubernetes stack running on TalosOS as the base orchestration platform.
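
For orientation, a heavily trimmed Talos machine configuration might look like the sketch below. The values are hypothetical; `talosctl gen config` generates the real, complete file:

```yaml
# Trimmed sketch of a Talos machine config; all values are hypothetical.
version: v1alpha1
machine:
  type: controlplane            # or "worker"
  network:
    hostname: cp-01
  install:
    disk: /dev/sda              # target disk for the immutable OS image
cluster:
  controlPlane:
    endpoint: https://10.0.10.5:6443
```

Because the host OS is declared rather than hand-configured, node changes flow through the same Git-reviewed IaC path as the rest of the platform.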

Consequences

  • Ecosystem Flexibility: Kubernetes provides a broad ecosystem, extension points, and standard service discovery and load-balancing patterns.
  • Operational Model: TalosOS delivers an immutable, API-managed Kubernetes host OS and supports extensions and secure networking (for example, KubeSpan).
  • Complexity Trade-off: Operational complexity is higher than Nomad in isolation, but is mitigated by IaC and TalosOS automation.
  • Workload Standardization: Workloads will be standardized on containers unless a future ADR explicitly permits exceptions.

ADR 0008: Adopt Authentik as Central Identity Provider

Status

Accepted

Context

The platform needs a centralized identity and access solution that:

  • Supports SSO and MFA.
  • Protects both modern apps (OIDC/SAML) and legacy apps without federation support.
  • Integrates cleanly with the Edge/Boundary reverse proxy and internal DNS.
  • Is reproducible and manageable as code in a self-hosted environment.

Candidates included Authentik, Authelia, Zitadel, and Keycloak. The key differentiator is robust proxy-based enforcement combined with standards-based federation in a single system.

Decision

Adopt Authentik as the platform’s central IdP and access control system:

  • Use OIDC/SAML for apps that natively support federation.
  • Use Authentik proxy/outposts to protect web apps without OIDC/SAML.
  • Enforce MFA via Authentik policies/flows, with step-up where appropriate.

Consequences

  • Centralized Access: Consistent login/MFA experience across nearly all services.
  • Coverage for Legacy Apps: Proxy enforcement reduces per-app auth workarounds.
  • Critical Dependency: Authentik downtime can block access to protected services; monitoring and break-glass access are required.
  • Operational Discipline: Flows, policies, and outposts require configuration-as-code to avoid drift.
  • Container Standardization: Authentik becomes a core platform service and must meet backup/restore and upgrade standards.

Alternatives Considered

  • Keycloak + oauth2-proxy: Mature IdP, but requires additional gateway components.
  • Authelia: Strong proxy gate, weaker as a full IdP with rich flows.
  • Zitadel: Modern OIDC UX, but proxy protection is not a core feature.

ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary

Status

Accepted

Context

The network previously had both the ISP gateway and OpenWRT serving DHCP on the same subnet. This created an ambiguous boundary and undermined consistent policy enforcement at the edge.

Decision Drivers

  • Avoid non-deterministic gateway assignment and client routing.
  • Ensure consistent DNS behavior to support split-horizon.
  • Prepare for future HA/VIP routing patterns without conflicting DHCP sources.
  • Maintain a clear, singular security boundary for policy enforcement.

Decision

  • Place the ISP router/modem into bridge mode.
  • Make OpenWRT the sole DHCP and NAT authority for the subnet.
  • Keep IPMI disconnected by default due to switch-port exhaustion and power constraints; connect only when needed.

Consequences

  • Single Boundary: A single NAT/DHCP boundary improves policy enforcement and troubleshooting.
  • Predictable Clients: Gateway and DNS assignment become deterministic.
  • Future Migration: Simplifies future migration to a dedicated firewall or HA topology.
  • Operational Trade-off: IPMI access is on-demand rather than always available.

ADR 0010: Prefer Perimeter Firewall with Dual Ingress for Exposure

Status

Accepted

Context

Three exposure stacks were evaluated:

  • Model A — Perimeter firewall (OpenWRT now, upgradable later) owns routing/NAT; Kubernetes hosts two ingress controllers (internal-only and public).
  • Model B — Kubernetes-native edge using Gateway API with CNI-integrated data plane (e.g., Cilium) to terminate edge traffic directly on the cluster.
  • Model C — Cloud tunnel/overlay (e.g., Cloudflare Tunnel, Tailscale Funnel) to expose services without direct inbound paths.

The homelab prioritizes a clear internal/public boundary, minimal external dependencies, and the ability to swap in a dedicated firewall when hardware/power constraints ease. Existing OpenWRT already acts as the single boundary (see ADR 0009), and split-horizon DNS is assumed (ADR 0003). Identity-first ingress is required for user-facing access (ADR 0006).

Decision Drivers

  • Preserve a single, enforceable perimeter where north-south policy and logging live.
  • Keep internal ingress paths isolated from public ingress while supporting split-horizon DNS.
  • Allow future replacement of OpenWRT with a dedicated firewall without re-architecting cluster ingress.
  • Avoid new external dependencies for routine access; tolerate them only as scoped exceptions.
  • Fit power/port constraints and current hardware while enabling later VLAN/DMZ phases.

Considered Options

Model A — Perimeter Firewall + Dual Ingress

  • Pros: Clear boundary; firewall enforces 80/443 exposure; ingress controllers stay inside the cluster; works with current OpenWRT and future firewall/DMZ; keeps routing off the control plane.
  • Cons: Requires hairpin/port-forward rules and VIP management; firewall must forward to cluster nodes.

Model B — Kubernetes-Native Edge (Gateway API + CNI data plane)

  • Pros: Uniform policy definition inside K8s; fewer port-forwards; rich L7 features.
  • Cons: Pushes the trust boundary into the cluster; cluster health becomes a prerequisite for edge routing; complicates future dedicated firewall insertion; higher operational complexity today.

Model C — Cloud Tunnel / Overlay Exposure

  • Pros: Quick public exposure; hides home IP; minimal edge config.
  • Cons: Adds third-party dependency and opaque failure modes; blurs boundary and bypasses local policy/logging; harder to reason about internal vs. public reachability.

Decision

Adopt Model A (Perimeter firewall + dual ingress):

  • Keep routing/NAT/policy on the perimeter firewall (OpenWRT now; replaceable with a dedicated firewall later) and continue to expose only the minimal ports (80/443) required for public ingress.
  • Run two ingress controllers in the cluster:
    • Internal Ingress: LAN/VPN-only, resolves via split-horizon DNS to an internal VIP.
    • Public Ingress: Receives only firewall-forwarded 80/443 traffic to a public VIP; backs the small set of intentionally exposed hostnames.
  • Use identity-first auth at ingress per ADR 0006; no generic port-forwarding to services.
  • Allow cloud tunnels only as scoped, documented exceptions (e.g., break-glass outbound-only tunnels) with explicit change control.
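
A sketch of the dual-VIP wiring is shown below. The addresses, names, namespace, and the MetalLB-style annotation are assumptions about the load-balancer implementation, not settled configuration:

```yaml
# Sketch of the dual-VIP split; addresses, names, and the MetalLB-style
# annotation are assumptions about the environment.
apiVersion: v1
kind: Service
metadata:
  name: ingress-internal
  namespace: ingress
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.20.10"   # internal VIP, LAN/VPN only
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/instance: ingress-internal
  ports:
    - name: https
      port: 443
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-public
  namespace: ingress
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.30.10"   # public VIP; firewall forwards only 80/443 here
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/instance: ingress-public
  ports:
    - name: http
      port: 80
    - name: https
      port: 443
```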

Consequences

  • Boundary Clarity: North-south enforcement, logging, and DDoS controls stay at the perimeter; internal ingress remains shielded from the internet.
  • Upgrade Path: A future dedicated firewall or DMZ VLAN can replace OpenWRT without reworking cluster ingress (aligns with the Network Evolution Plan).
  • Operational Simplicity: Fewer moving parts at the edge; ingress lifecycle stays inside Kubernetes, where certificates and auth already live.
  • Constraints-Friendly: Works within current power/port limits; no requirement to run edge data plane on K8s nodes.
  • Risk: Firewall misconfiguration could still overexpose services; disciplined VIP/reservation management and monitoring of port-forwards are required.

Implementation Notes / Next Steps

  • Reserve VIPs for internal/public ingress in the SERVER/DMZ ranges defined in the Network Evolution Plan.
  • Maintain firewall rules: 80/443 to public ingress VIP only; no generic NAT for internal services (per ADR 0005).
  • Keep split-horizon DNS records aligned with the two ingress VIPs.
  • Document any exception tunnels with owners, scope, and teardown criteria.

ADR 0011: ExternalDNS + Technitium for Internal DNS Automation

Status

Accepted

Context

Internal DNS needs to provide LAN/VPN-only resolution for service hostnames while remaining automatable from Kubernetes. The solution must avoid bootstrap dependency loops (DNS needing DNS) and keep public DNS management separate from internal records.

Decision

Adopt Technitium as the internal authoritative DNS service and use ExternalDNS to reconcile annotated Kubernetes resources into Technitium. Keep OpenWRT as the client-facing bootstrap resolver, providing public recursion and conditional forwarding to Technitium with minimal static overrides for recovery.
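
As a sketch of the reconciliation contract: a workload publishes its desired hostname via annotation, and ExternalDNS is scoped so it can only touch the internal zone. The flags shown are standard ExternalDNS options; the Technitium webhook adapter and all names are assumptions about the deployment:

```yaml
# Sketch only: the annotation and flags are standard ExternalDNS usage,
# but the Technitium webhook adapter and names are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: wiki
  annotations:
    external-dns.alpha.kubernetes.io/hostname: wiki.risu.tech
spec:
  type: LoadBalancer
  selector:
    app: wiki
  ports:
    - port: 80
---
# Relevant ExternalDNS container arguments, scoped tightly:
#   --source=service
#   --source=ingress
#   --domain-filter=risu.tech     # never reconcile other zones
#   --policy=upsert-only          # never delete records it does not own
#   --provider=webhook            # assumed Technitium webhook adapter
```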

Consequences

  • Enables automated, authoritative internal DNS with clear ownership boundaries.
  • Avoids DNS dependency loops by using IP-based upstreams and keeping clients pointed at OpenWRT.
  • Increases operational complexity compared to static DNS; requires guardrails for split-horizon risu.tech and tight scoping of ExternalDNS domain filters.