
Architecture Overview

Purpose

This document serves as the canonical source of truth for the Home Lab platform. It defines what exists and why, establishing the fundamental principles that guide all architectural decisions.

Definitions

Public Platform

The set of services and infrastructure components explicitly designed to be reachable from the public internet. This platform resides behind the Public Ingress and is subject to strict exposure policies.

Internal Platform

The core of the home lab, consisting of services reachable only from the local network (LAN) or via an authorized VPN connection. Access is defined by network presence and identity.

Trust Boundaries

Clear lines of demarcation between different security zones (Internet, LAN, VPN, Management). Every interaction across a boundary must be explicitly allowed and authenticated.

Platform Roles

The platform is composed of several stable roles that provide foundational services:

  • Edge & Boundary
  • Identity & Access
  • Connectivity & Naming
  • Storage & Persistence
  • Compute & Orchestration
  • Operations

Invariants

These rules are absolute and must not be violated by any implementation:

  1. Internal Isolation: Internal services are never internet-routable. No direct NAT or port-forwarding to internal services is permitted.
  2. Identity First: No service shall be exposed without an identity-aware proxy or native SSO integration unless explicitly justified in a Service Contract.
  3. Source of Truth: The Git repository is the sole authority for the state of the platform. Manual “hot-fixes” are technical debt that must be codified immediately.
  4. Data Durability: Critical data must exist in at least two physical locations at all times.

Non-Goals

  • Real-time global availability (HA is local/cluster-based, not geo-distributed).
  • Public hosting of third-party data.
  • Replacement of enterprise-grade cloud services for high-risk workloads.

System Context

Map View

The following diagram provides a high-level orientation of the actors and systems involved in the Home Lab ecosystem.

flowchart TD
    subgraph Users [Users]
        Family["Family User"]
        Owner["Admin"]
        Public["Public Visitor"]
    end

    PublicPlane["Public Platform (behind Public Ingress)"]
    InternalPlane["Internal Platform (LAN/VPN only)"]

    subgraph Control ["Change Automation"]
        Automation["CI/CD + IaC Pipelines"]
    end

    subgraph External ["External Dependencies"]
        DNS["Cloud DNS"]
        Internet["The Internet"]
    end

    Family -- "HTTPS / LAN / VPN" --> InternalPlane
    Owner -- "SSH / Git / HTTPS" --> InternalPlane
    Owner -- "Git / CI/CD" --> Automation
    Public -- "HTTPS" --> PublicPlane

    Automation -- "Deploys / Config" --> InternalPlane
    Automation -- "Deploys / Config" --> PublicPlane
    Automation -- "DNS record management (automation)" --> DNS

    PublicPlane -- "Traffic" --> Internet
    InternalPlane -- "Traffic" --> Internet

Actors & Systems

| Entity | Role | Description |
| --- | --- | --- |
| Family User | Internal User | Accesses personal services (Wiki, Photos, Chat) from within the LAN or via VPN. |
| Admin | Infrastructure Owner | Manages the platform, security, and service configurations via SSH, Git, and HTTPS. |
| Public Visitor | External User | Accesses public-facing content and websites hosted on the platform. |
| Public Platform | Public Plane | Internet-facing services reachable through the public ingress. |
| Internal Platform | Internal Plane | Core services and management endpoints reachable only from LAN or VPN. |
| Change Automation | Control Plane | CI/CD and IaC pipelines that apply platform changes and manage DNS records. |
| Cloud DNS | External System | Managed DNS provider (risu.tech) updated by automation for split-horizon or public resolution. |
| The Internet | Network | Public network through which external visitors arrive and internal resources are reached. |

Network Model v1 (Power-Constrained Phase)

Purpose

Document the as-built network state, the rationale behind it, and the intended evolution path. This is the baseline substrate for ingress, naming, and service exposure decisions.

As-Built Topology

Physical Topology

Internet
   |
ISP Modem (Bridge Mode)
   |
OpenWRT Router (Single NAT / DHCP / DNS)
   |
LAN Clients + Server Nodes
(IPMI connected on-demand only)

Logical Roles

| Role | Device/Service |
| --- | --- |
| Edge NAT | OpenWRT |
| DHCP Authority | OpenWRT |
| DNS | OpenWRT (AdGuard) |
| VPN Client Egress | OpenWRT (WireGuard → iVPN) |
| ISP Modem | Bridge mode only (no routing) |

IP Plan (Current)

  • Single flat LAN (one subnet).
  • DHCP and DNS are authoritative only on OpenWRT.
  • Specific CIDR, DHCP ranges, and static reservations live in OpenWRT config.

Trade-offs (Intentional)

  • No VLAN segmentation yet: Deferred due to hardware and power constraints.
  • No dedicated firewall: OpenWRT fulfills boundary duties for now.
  • No managed switch: The network spine is temporary; port/power constraints apply.
  • IPMI not always-on: Connected only when needed to conserve ports and power.

Evolution Roadmap

  • Phase 1 (Current): Single NAT/DHCP/DNS, flat LAN.
  • Phase 2: Add managed switch and introduce VLANs.
  • Phase 3: Dedicated firewall and segmented trust zones.

Network Evolution Plan (VLANs and Ingress Separation)

Purpose

Define the next phases for segmentation and ingress separation so the current flat LAN can evolve without disruptive renumbering.

Phase Targets

  • Keep the existing LAN (10.0.0.0/24) stable during transition.
  • Introduce clear trust boundaries: Clients, Servers, Management, DMZ, IoT, Guest, Lab.
  • Reserve address space and VIP ranges now to simplify later MetalLB/kube-vip usage.
  • Separate public and internal ingress paths, with split-horizon DNS.

VLANs and Subnets (Proposed)

| VLAN | Name | Subnet | Purpose | Typical Residents |
| --- | --- | --- | --- | --- |
| 10 | LAN | 10.0.0.0/24 | Default user network | PCs, phones, TVs |
| 20 | SERVER | 10.0.20.0/24 | App workloads, cluster nodes | Talos/K8s nodes, storage |
| 30 | MGMT | 10.0.30.0/24 | Out-of-band + admin | IPMI/BMC, switch/AP management |
| 40 | DMZ | 10.0.40.0/24 | Public-facing edge only | Public ingress VIPs / edge svc |
| 50 | IOT | 10.0.50.0/24 | Untrusted devices | Cameras, smart devices |
| 60 | GUEST | 10.0.60.0/24 | Visitor access | Guest Wi-Fi clients |
| 70 | LAB | 10.0.70.0/24 | Experiments | Test gear, ephemeral nodes |

DHCP and Gateway Plan (Examples)

Assuming router-on-a-stick (trunk to switch):

| VLAN | Gateway | DHCP Scope | Notes |
| --- | --- | --- | --- |
| 10 | 10.0.0.1 | 10.0.0.10–250 | Keep current allocations |
| 20 | 10.0.20.1 | 10.0.20.50–250 | Reserve low IPs for VIPs/statics |
| 30 | 10.0.30.1 | none or limited | Prefer static/reservations |
| 40 | 10.0.40.1 | none or limited | DMZ should be explicit |
| 50 | 10.0.50.1 | 10.0.50.50–250 | Tight egress rules |
| 60 | 10.0.60.1 | 10.0.60.50–250 | Internet only |
| 70 | 10.0.70.1 | optional | Lab isolation |
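
As a worked example, a minimal OpenWRT UCI sketch for bringing up the SERVER VLAN (20) with the scope above. The interface and device names (server, br-lan.20) are assumptions that depend on the actual switch and port layout:

```sh
# Sketch only: assumes a DSA-style switch with VLAN 20 tagged on br-lan.
uci set network.server=interface
uci set network.server.proto='static'
uci set network.server.device='br-lan.20'    # hypothetical device name
uci set network.server.ipaddr='10.0.20.1'
uci set network.server.netmask='255.255.255.0'

uci set dhcp.server=dhcp
uci set dhcp.server.interface='server'
uci set dhcp.server.start='50'               # scope starts at 10.0.20.50
uci set dhcp.server.limit='200'
uci set dhcp.server.leasetime='12h'

uci commit network && uci commit dhcp
/etc/init.d/network reload && /etc/init.d/dnsmasq restart
```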

Default Inter-VLAN Policy (Allow Only What Is Needed)

  • LAN (10) → Internal ingress/services (20): allow service ports.
  • LAN (10) → MGMT (30): deny, except specific admin workstation or VPN admin group.
  • VPN/Admin → MGMT (30): allow.
  • DMZ (40) → Servers (20): allow only public ingress backends.
  • IOT (50) → anywhere: deny by default, allow minimal egress if needed.
  • GUEST (60) → internal: deny (internet only).
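
A sketch of how one of these allows could look as an OpenWRT firewall rule, assuming zones named lan and server already exist (the zone names are hypothetical):

```sh
# Allow LAN (10) -> SERVER (20) on HTTPS only; other inter-VLAN traffic stays denied.
uci add firewall rule
uci set firewall.@rule[-1].name='Allow-LAN-to-SERVER-https'
uci set firewall.@rule[-1].src='lan'
uci set firewall.@rule[-1].dest='server'
uci set firewall.@rule[-1].proto='tcp'
uci set firewall.@rule[-1].dest_port='443'
uci set firewall.@rule[-1].target='ACCEPT'
uci commit firewall && /etc/init.d/firewall reload
```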

Ingress Separation Model

  • Public Ingress: Internet-reachable hostnames only; prefer placement in DMZ (VLAN 40) when available.
  • Internal Ingress: LAN/VPN-only hostnames; placed in SERVER (VLAN 20) or LAN (VLAN 10) during early phase.
  • Start with both ingress controllers in VLAN 20 (simpler); move Public Ingress VIPs to VLAN 40 when DMZ exists.

VIP Reservations (Examples)

  • Internal ingress VIPs: 10.0.20.10–10.0.20.19
  • Public ingress VIPs: 10.0.40.10–10.0.40.19
  • Gateways: .1, network services: .2–.9
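
If MetalLB ends up providing the L4 VIPs (see the Connectivity & Naming role), the internal reservation above could be declared roughly like this; the pool names are hypothetical:

```sh
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: internal-ingress          # hypothetical pool name
  namespace: metallb-system
spec:
  addresses:
    - 10.0.20.10-10.0.20.19       # internal ingress VIP range from above
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: internal-ingress-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - internal-ingress
EOF
```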

DNS Expectations (Split-Horizon)

  • Use the unified namespace *.risu.tech (per Exposure Policy and Split-Horizon ADR).
  • Internal-only names resolve to internal VIPs (e.g., wiki.risu.tech → 10.0.20.10 on LAN/VPN).
  • Public names resolve externally only when intentionally exposed (e.g., status.risu.tech).
  • Internal resolvers must not return public IPs for internal-only names.

Diagram (Ingress and Trust Zones)

flowchart TD
  Internet((Internet)) --> WAN[WAN]
  WAN --> Edge["Router/Firewall: OpenWRT now, dedicated later (policy gate)"]

  subgraph VLAN10[LAN 10 - 10.0.0.0/24]
    Clients[LAN Clients]
  end

  subgraph VLAN20[SERVER 20 - 10.0.20.0/24]
    Nodes[K8s/Talos Nodes]
    IntIngress[Internal Ingress VIPs]
    Services[Internal Services]
  end

  subgraph VLAN30[MGMT 30 - 10.0.30.0/24]
    IPMI[IPMI/BMC]
    NetMgmt[Switch/AP Mgmt]
  end

  subgraph VLAN40[DMZ 40 - 10.0.40.0/24]
    PubIngress[Public Ingress VIPs]
  end

  Edge --> VLAN10
  Edge --> VLAN20
  Edge --> VLAN30
  Edge --> VLAN40

  Clients --> IntIngress --> Services
  Internet -.->|Allowed 80/443 only via firewall/NAT| PubIngress --> Services

Migration Steps (Incremental)

  1. Current (flat): keep everything on 10.0.0.0/24, single DHCP (done).
  2. Add managed switch: trunk to router, keep most devices untagged on VLAN 10.
  3. Move servers to VLAN 20; keep clients on VLAN 10.
  4. Move management to VLAN 30 (static/reserved IPs).
  5. Add DMZ VLAN 40 for public ingress VIPs; expose only 80/443 as needed.

Platform Roles

The Home Lab platform is built upon a set of stable, well-defined roles. These roles represent the “bones” of the infrastructure—foundational capabilities that must remain stable regardless of which specific applications are running.

Role Catalog

Mapping Roles to Implementation

Each role is defined by its responsibilities and requirements. The specific technologies used to fulfill these roles (e.g., K3s, Authelia, Traefik) may evolve, but the roles themselves remain constant.

Role: Edge & Boundary

Responsibility

The Edge & Boundary role is the first line of defense. It is responsible for terminating public traffic and enforcing the transition from untrusted networks (Internet) to trusted networks (Home Network/VPN).

Key Guarantees

  • Traffic Termination: All public HTTPS traffic must terminate at the Edge.
  • L7 Load Balancing: Spreading requests across multiple “floating” service instances regardless of their physical node location.
  • Protocol Enforcement: Only authorized protocols (HTTPS, WireGuard) are permitted to cross the boundary.
  • Isolation: Publicly reachable services must be logically isolated from the internal-only platform.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Traefik | K3s, K8s, Nomad | Simple config, Let’s Encrypt, dynamic discovery, forward-auth. | Some features are Traefik-native; middleware sprawl. | Works with OIDC via forward-auth; requires standard headers. |
| NGINX Ingress | K8s, Talos | Very common, strong annotation ecosystem. | Auth relies on external proxies (oauth2-proxy); annotation-heavy. | Pairs well with oauth2-proxy; explicit ingress classes needed. |
| Caddy | Nomad, Small K8s | TLS automation; simple reverse proxy story. | Less “platformy” out of the box; varies by env. | Decide if identity is enforced here or at auth gateway. |

Typical Stack Pairings

  • K3s: Traefik (native feel)
  • Talos/K8s: NGINX Ingress (most common)
  • Nomad: Traefik or Caddy
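
As an illustration of the forward-auth pattern mentioned above, a Traefik middleware on Kubernetes might look roughly like this; the auth endpoint URL is a placeholder for whatever the chosen identity layer exposes:

```sh
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forward-auth
spec:
  forwardAuth:
    address: http://auth.risu.tech/verify   # placeholder auth-proxy endpoint
    authResponseHeaders:
      - X-Forwarded-User                    # standard header passed to the app
EOF
```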

Role: Identity & Access

Responsibility

The Identity role provides the “Who” for the entire platform. It manages user identities, credentials, and group memberships, and provides a unified authentication experience (SSO).

Key Guarantees

  • Centralized Truth: One directory for all human users.
  • MFA Enforcement: Critical services must require multifactor authentication.
  • SSO: Users should only need to authenticate once to access multiple platform services.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Authentik | All stacks | Flexible flows, “one IdP for everything”. | Operating an IdP (backup, upgrades, DB). | Choose enforcement at ingress: forward-auth, oauth2-proxy, or mesh-based. |
| Keycloak | K8s, Talos | Enterprise-grade, standard OIDC/SAML, great docs. | Heavy; tuning/upgrade complexity. | Pairs well with oauth2-proxy and standard OIDC clients. |
| Authelia | K3s, Nomad | Lightweight auth portal, simple 2FA, forward-auth. | Less of a “platform” than Authentik/Keycloak. | If OIDC is needed for apps, a full IdP might still be required. |

Typical Stack Pairings

  • Traefik: Authentik + forward-auth (or Authelia)
  • NGINX Ingress: Authentik/Keycloak + oauth2-proxy
  • Any: IdP + apps using OIDC directly (for “native SSO” apps)
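
With NGINX Ingress and oauth2-proxy, enforcement is typically annotation-driven. A sketch, with placeholder hostnames:

```sh
kubectl annotate ingress wiki \
  nginx.ingress.kubernetes.io/auth-url='https://oauth2.risu.tech/oauth2/auth' \
  nginx.ingress.kubernetes.io/auth-signin='https://oauth2.risu.tech/oauth2/start?rd=$escaped_request_uri'
```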

Role: Connectivity & Naming

Responsibility

This role ensures that users and services can find each other. It handles DNS resolution and internal routing, maintaining a consistent namespace across local and remote connections.

Key Guarantees

  • Unified Namespace: Use of *.risu.tech globally.
  • Split-Horizon DNS: Internal names resolve to internal IPs; external names point to the Edge.
  • Service Discovery: Automatic detection and registration of “floating” workloads.
  • L4 Load Balancing (VIP): Providing stable virtual IPs for cluster-wide services (like Ingress) to ensure they are reachable even if nodes fail.

Current Stack Choice

  • OpenWRT as the bootstrap resolver, Technitium DNS as the internal authority, and ExternalDNS for Kubernetes-driven automation. Details and runbooks live in Connectivity & Naming Stack.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| CoreDNS + ExternalDNS | K8s, K3s, Talos | K8s-native, clean in-cluster discovery. | Split-horizon needs careful design. | Decide “source of truth” (Git/IaC) and VPN DNS view. |
| Pi-hole / AdGuard Home | Any (External) | Easy local DNS + blocking; great for split-horizon. | Another stateful service; HA takes effort. | Ensure VPN hands out this DNS; avoid public leaks. |
| WireGuard / Tailscale | Any | Stable remote access. | Tailscale is managed-ish; WireGuard is DIY. | DNS distribution over VPN is the key integration point. |
| MetalLB / Kube-vip | K8s, K3s, Talos | Provides L4 LoadBalancer IPs on bare-metal. | Requires network support (ARP/BGP); configuration overhead. | Essential for giving the Ingress Controller a stable IP in a cluster. |
| Consul | Nomad | First-class in Nomad ecosystems. | Adds a control plane component. | Decide how Consul names map to your DNS naming scheme. |

Typical Stack Pairings

  • K8s: MetalLB/Kube-vip + CoreDNS + ExternalDNS + WireGuard/Tailscale
  • Nomad: Consul (+ DNS integration) + Traefik/Fabio + WireGuard/Tailscale
  • Hybrid: Pi-hole/AdGuard as “front” DNS for LAN/VPN regardless of orchestrator

Connectivity & Naming Stack (OpenWRT + Technitium + ExternalDNS)

Scope

Defines how names under risu.tech are resolved for LAN, VPN, and Kubernetes workloads, and how DNS automation and failure modes behave.

Goals

  • Internal names (e.g., wiki.risu.tech) resolve only on LAN/VPN.
  • Public names resolve from anywhere without exposing internal metadata.
  • Internal DNS is authoritative and automated from Kubernetes via ExternalDNS.
  • DNS outages degrade safely: public domains keep resolving and the platform can bootstrap without internal DNS.

Non-Goals

  • Multi-site or geo-distributed DNS.
  • Automating the public zone in this phase.
  • Making the router a permanent authoritative DNS platform.

Roles and Responsibilities

  • OpenWRT (bootstrap resolver): DHCP authority, default resolver for clients, recursion to public upstreams, conditional forward of internal zones to Technitium, local static overrides for recovery.
  • Technitium DNS (internal authority): Hosts authoritative internal records and optional recursion; reachable only from LAN/VPN; uses IP-based upstream configuration.
  • ExternalDNS (automation controller): Watches Kubernetes resources and reconciles allowed records into Technitium; limited to explicitly delegated hostnames.

Resolution Flows

  • Internal name (normal): Client → OpenWRT → Technitium → internal VIP/endpoint.
  • Public name: Client → OpenWRT → public recursive resolution.

Dependency-Loop Prevention

  • Principle: nothing required to bootstrap the platform should depend on Technitium.
  • Invariants:
    • Clients always use OpenWRT as resolver in Phase 1.
    • OpenWRT keeps minimal static records (Technitium VIP and internal ingress VIP) to reach recovery paths.
    • Technitium upstreams are configured by IP or forward recursion to OpenWRT by IP.
    • ExternalDNS targets Technitium by stable IP/VIP, not hostname.

Failure Behavior

  • Technitium down: Internal names fail except the static overrides; public names still resolve via OpenWRT.
  • ExternalDNS down: Existing records served; no new automation until it returns.
  • OpenWRT DNS down: Clients lose DNS (Phase 1 SPOF); acceptable until resolver redundancy is added.

Zone Strategy

  • Preferred: split-horizon risu.tech (same zone name internal and public).
  • Safety controls: Technitium not reachable from WAN; ExternalDNS constrained via annotation/label allowlists, TXT ownership, and domain filters; public DNS managed separately.
  • Alternative: internal sub-zone such as int.risu.tech if split-horizon proves risky.

Record Ownership

  • ExternalDNS-managed: Annotated Kubernetes services/ingresses that are allowed for automation.
  • Manually managed: Bootstrap overrides on OpenWRT, core infrastructure names, sensitive records.

Kubernetes Integration

  • CoreDNS handles in-cluster service discovery and does not depend on Technitium.
  • ExternalDNS maintains registry markers to avoid overwriting manual records.
  • Prefer VIP/stable IP for Technitium reachable from OpenWRT and workloads.
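
A sketch of the ExternalDNS flags implied by these constraints. The webhook provider for Technitium is an assumption (ExternalDNS has no built-in Technitium provider); the scoping flags themselves are standard:

```sh
# Container args for the ExternalDNS deployment (sketch).
# --provider=webhook assumes a Technitium webhook provider reachable at a stable IP.
external-dns \
  --source=ingress \
  --source=service \
  --domain-filter=risu.tech \
  --registry=txt \
  --txt-owner-id=homelab \
  --provider=webhook
```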

Testing

  • Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
  • Bring up cluster with Technitium delayed: control-plane access must work via OpenWRT overrides.
  • Kill Technitium: public DNS works; internal names fail as expected.
  • Kill ExternalDNS: existing internal names still resolve.
  • WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.
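
These checks translate directly into dig one-liners. A sketch assuming OpenWRT at 10.0.0.1 and an internal VIP in 10.0.20.0/24 (both placeholders):

```sh
# From a LAN/VPN client: internal names resolve to internal VIPs.
dig +short wiki.risu.tech @10.0.0.1     # expect a 10.0.20.x answer

# With Technitium down: public resolution must still work via OpenWRT.
dig +short codeberg.org @10.0.0.1

# From WAN (e.g., cellular): internal-only names must not resolve.
dig +short wiki.risu.tech               # expect no answer
```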

Implementation Notes

  • Keep OpenWRT as the client-facing resolver in Phase 1; migrate clients later only after resolver redundancy exists.
  • Favor IP-based configuration for anything that talks to DNS to avoid “DNS requires DNS” loops.
  • Use stable VIPs where possible so OpenWRT, Technitium, and ExternalDNS share a consistent target.

Role: Storage & Persistence

Responsibility

The Storage role manages the state of the platform. It provides persistent volumes to applications and ensures that data is replicated and backed up according to its criticality.

Key Guarantees

  • Data Durability: Protection against single-node or single-disk failure.
  • RPO/RTO Compliance: Backups must be performed and verified according to policy.
  • Abstraction: Applications should request storage via standard interfaces (e.g., PVCs) without knowing the underlying disk layout.
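
The Abstraction guarantee in practice: an application requests a volume by class and size only. A minimal sketch; the storage class name depends on the chosen fabric:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn    # placeholder; set by the chosen storage fabric
  resources:
    requests:
      storage: 10Gi
EOF
```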

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Longhorn | K3s, K8s | Easy replicated block storage. | Performance/latency tradeoffs. | Tie into snapshot/backup story (Velero). |
| Rook-Ceph | K8s, Talos | Powerful HA, block/file/object storage. | Complexity; resource hungry; learning curve. | Needs disciplined node disks and upgrade choreography. |
| ZFS | Nomad, K8s | Solid local storage, snapshots, replication. | Not a distributed fabric; HA is “replication + restore”. | Orchestration integration varies; great for “pet data”. |
| NFS / SMB | Any | Simple shared storage. | Central dependency; HA depends on NAS. | Backups are straightforward; locking semantics vary. |

Typical Stack Pairings

  • K3s: Longhorn
  • Talos/K8s: Rook-Ceph (for strong HA) or NFS (for simplicity)
  • Nomad: ZFS (host-based) + replication or NFS

Role: Compute & Orchestration

Responsibility

The Compute role provides the execution environment for all platform workloads. It handles scheduling, lifecycle management (start/stop/restart), and resource isolation between tenants.

Key Guarantees

  • Automated Scheduling: Workloads are placed on nodes based on resource availability and constraints.
  • Self-Healing: Automatic recovery of failed workloads.
  • Resource Isolation: Enforced limits on CPU, memory, and disk to prevent “noisy neighbor” effects.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| K3s | Pragmatic homelab | Lightweight Kubernetes; easy setup; bundled extras (Traefik). | Still Kubernetes complexity; some opinionated defaults. | Best with Traefik + Longhorn patterns. |
| Kubernetes on Talos | Enterprise-grade HA | Immutable OS; security; “OS as appliance” feel. | Steeper learning curve; non-traditional debugging. | Pairs with NGINX + Rook-Ceph + Velero. |
| Nomad | Simple/Flexible | Easy to run; handles non-container workloads; low overhead. | Smaller ecosystem than K8s; stateful patterns more DIY. | Best with Consul/Vault, ZFS host volumes, Traefik/Caddy. |

Typical Stack Pairings

  • K3s: Traefik + Longhorn
  • Talos/K8s: NGINX + Rook-Ceph
  • Nomad: Traefik + ZFS + Consul

Role: Operations

Responsibility

The Operations role encompasses the tools and processes required to maintain, monitor, and update the platform. It ensures repeatability through GitOps and visibility through observability.

Key Guarantees

  • Observability: Centralized metrics, logs, and alerting for all platform components.
  • Change Management: Git (GitOps) drives all platform changes.
  • Disaster Recovery: Automated backups and verified restore paths for stateful data.

Implementation Options

| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
| --- | --- | --- | --- | --- |
| Prometheus / Grafana / Loki | All stacks | Standard dashboards; large community; mature. | Can sprawl; needs retention planning. | K8s has the most turnkey packaging (kube-prometheus-stack). |
| GitOps (Argo CD / Flux) | K8s, K3s | Repeatability; drift control; clear audit trail. | Higher initial platform complexity. | Very mature on K8s; Nomad has options but less standardized. |
| Velero | K8s, K3s | K8s-native backups; CSI integration for snapshots. | K8s-specific. | Best for K8s/K3s clusters. |
| Restic | Any | General purpose; deduplication; encrypted backups. | More manual configuration on K8s. | Great for Nomad/ZFS/NAS-based approaches. |

Typical Stack Pairings

  • K8s/K3s: Prometheus + ArgoCD/Flux + Velero
  • Nomad/Other: Prometheus + Restic

Trust Boundaries & Access Model

Trust View

This document defines the network reachability and security posture of the platform. It answers the question: From where can traffic originate and where can it go?

System Boundaries

The platform is divided into distinct zones with hard boundaries.

flowchart LR
  Internet((Internet)) -->|HTTPS| PublicIngress[Public Ingress]
  PolicyNote["No inbound NAT or public path to Internal Ingress"]

  subgraph Home["Home Network Boundary"]
    LAN[LAN Clients] --> PrivateDNS[Private DNS]
    VPN[VPN Clients] --> PrivateDNS
    PrivateDNS --> InternalIngress
    InternalIngress --> Services[Internal Services]
    PublicIngress --> PublicServices[Public Services]
  end

  PublicDNS[Public DNS] --> PublicIngress

Zone Definitions

The Internet (Untrusted)

Any client originating outside the home network. Only allowed to communicate with the Public Ingress via HTTPS.

The Home Network (Trusted Boundary)

A secure zone containing both LAN and VPN clients.

  • LAN Clients: Physical devices connected to the home router.
  • VPN Clients: Remote devices with an active, authenticated tunnel.
    • Enrollment: Only verified devices are permitted to join the network.
    • Experience: Remote devices experience connectivity identical to local network access (Split-Horizon DNS + Private IPs).
    • Security: Encrypted communication channels are maintained for all remote traffic.

Internal Platform (Protected)

Services that are never exposed to the internet. Reachability is strictly limited to clients already inside the Home Network Boundary.

Reachability Matrix

| From \ To | Public Services | Internal Services | Management (SSH/Git) |
| --- | --- | --- | --- |
| Internet | HTTPS | ❌ Blocked | ❌ Blocked |
| LAN | HTTPS | HTTPS | Authorized Only |
| VPN | HTTPS | HTTPS | Authorized Only |

Key Security Postures

  • No Inbound NAT: There are no port-forwarding rules from the internet to internal service IPs.
  • Split-Horizon DNS: Service names (e.g., app.risu.tech) resolve to different IPs depending on whether the client is on the Internet or the Home Network.
  • Authenticated Ingress: All internal services require identity verification at the Ingress layer.

Identity & Login Model

Responsibility View

This document explains how identity and authentication work end-to-end. It answers the question: How does a user gain access to a service?

Authentication Flow

The following diagram illustrates the branching logic for user authentication, including session persistence and MFA requirements.

flowchart TD
    Start([User visits service.risu.tech]) --> Ingress[Ingress / Auth Proxy]
    Ingress --> CheckAppSession{Valid Session?}
    
    CheckAppSession -- "Yes" --> Forward[Forward to App]
    CheckAppSession -- "No" --> IdP[Redirect to Identity Provider]
    
    subgraph IdPFlow [Identity Provider]
        IdP --> CheckIdPSession{IdP Session exists?}
        CheckIdPSession -- "No" --> Login[User Credentials Prompt]
        Login --> Validate[Validate Credentials]
        Validate -- "Success" --> CheckMFA{MFA required?}
        Validate -- "Failure" --> Login
        
        CheckIdPSession -- "Yes" --> CheckMFA
        
        CheckMFA -- "Yes" --> MFAPrompt[MFA Challenge]
        MFAPrompt --> ValidateMFA[Validate MFA]
        ValidateMFA -- "Success" --> Authorize[Authorize Access]
        ValidateMFA -- "Failure" --> MFAPrompt
        
        CheckMFA -- "No" --> Authorize
    end
    
    Authorize --> Grant[Redirect with Session Cookie]
    Grant --> Ingress
    
    Forward --> Service[App Content]
    Service --> End([Success])

Functional Components

Auth Proxy / Ingress

The first line of defense. It intercepts requests and verifies the presence of a valid session cookie. If missing or expired, it handles the OIDC/SAML handshake with the IdP.

Identity Provider (IdP)

The source of truth for user accounts and groups. It manages credentials, MFA enrollment, and issues tokens upon successful authentication.

Application (Service)

The destination. Most applications are “auth-blind,” relying on the Auth Proxy to provide user information via headers (e.g., X-Forwarded-User).

Platform Policies

These policies define the “Hard Gates” and operational constraints that all platform implementations and services must satisfy. They ensure consistency, security, and durability across the environment.

Policy Index

  • Exposure Policy
  • Identity Policy
  • Backup Policy
  • Change Management Policy

Exposure Policy

Rules

This document defines how services are exposed to users and the network requirements for each exposure category.

Exposure Categories

Public

  • Definition: Services reachable from the internet.
  • DNS: Must resolve to the Public IP of the platform.
  • Auth: Must enforce SSO/MFA at the Ingress layer.
  • TLS: Must use valid, publicly trusted certificates.

Internal

  • Definition: Services reachable only from LAN or VPN.
  • DNS: Must resolve to a Private IP (RFC1918).
  • Auth: Must enforce SSO at the Ingress layer.
  • TLS: Should use certificates (internal or public CA).

VPN-Only

  • Definition: Services reachable only from VPN clients.
  • DNS: Must resolve to a Private IP (RFC1918) only on VPN resolvers.
  • Auth: Must enforce SSO at the Ingress layer.
  • TLS: Should use certificates (internal or public CA).

Management

  • Definition: Administrative endpoints (SSH, Git, control plane consoles).
  • DNS: Must resolve to management-only records or private IPs.
  • Auth: Must enforce MFA and privileged access controls.
  • TLS: Must use certificates (internal or public CA).

Mandatory Ingress Behavior

| Category | Allowed Ingress | Allowed Source Networks | DNS Resolution |
| --- | --- | --- | --- |
| Public | Public Ingress only | Internet | Public IP |
| Internal | Internal Ingress only | LAN + VPN | Private IP |
| VPN-Only | Internal Ingress only | VPN only | Private IP (VPN resolvers only) |
| Management | Management endpoints only | Admin LAN + VPN | Private IP / management records |

Mandatory Auth Requirements

| Category | Authentication | Authorization |
| --- | --- | --- |
| Public | SSO + MFA at Ingress | Group-based access (IdP) |
| Internal | SSO at Ingress | Group-based access (IdP) |
| VPN-Only | SSO at Ingress | Group-based access (IdP) |
| Management | MFA + privileged access | Admin-only groups, audited access |

Naming Rules

  • All services MUST use the *.risu.tech domain.
  • Internal service names MUST match their public counterparts (if they exist) to ensure a seamless user experience.
  • The platform uses Split-Horizon DNS to ensure that app.risu.tech resolves to the correct IP based on the client’s network location.

Traffic Constraints

  • Public Ingress MUST NOT route traffic to backends tagged as “Internal.”
  • Internal Ingress MUST drop any traffic originating from outside the Home Network Boundary.
  • No direct port-forwarding (NAT) to backend services is allowed. All traffic must pass through an Ingress controller.
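
One way to back the first constraint inside the cluster is a NetworkPolicy that only admits traffic from the internal ingress. A sketch with hypothetical namespace and labels:

```sh
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: internal-backends-only
  namespace: internal-apps            # hypothetical namespace for internal backends
spec:
  podSelector: {}                     # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              ingress-tier: internal  # hypothetical label on the internal ingress namespace
EOF
```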

Identity Policy

Rules

This document defines the rules all services and users must obey regarding identity and access.

Guarantees

  • Unified Login: A single set of credentials and session is used across all platform services.
  • MFA Enforcement: Multi-factor authentication is mandatory for all administrative access and any service exposed to the public internet (where supported).
  • Session Isolation: Authentication is handled by the platform, not the application, ensuring a uniform security posture.

Service Requirements

All services integrated into the platform MUST:

  1. Delegate Auth: Rely on the platform’s Identity Provider via OIDC, SAML, or Auth Proxy headers.
  2. Use Group-Based Access: Authorization should be based on IdP groups (e.g., admins, family), not individual user accounts.
  3. Support SSO: Be configured to allow seamless login via the platform session.

Auth Requirements by Exposure Category

| Category | Authentication | Authorization | Notes |
| --- | --- | --- | --- |
| Public | SSO + MFA enforced at ingress | IdP groups required | No anonymous access unless explicitly approved in a Service Contract. |
| Internal | SSO enforced at ingress | IdP groups required | Local accounts disallowed except break-glass. |
| VPN-Only | SSO enforced at ingress | IdP groups required | VPN enrollment required for network access. |
| Management | MFA required for all access | Admin-only groups | SSH keys or short-lived certs required for shell access. |

Management Access Rules

  • Administrative endpoints MUST be reachable only from Admin LAN or VPN networks.
  • SSH access MUST use keys or short-lived certificates; passwords are forbidden.
  • All management access MUST be attributable to a named admin identity and logged.

Negative Constraints

  • Services MUST NOT maintain their own local user databases for “standard” access.
  • Local “admin” or “break-glass” accounts MUST have high-entropy, randomly generated passwords stored in a secure vault.
  • Clear-text passwords MUST NEVER be stored in the Git repository.

Backup Policy

Rules

This document defines the rules for protecting data and ensuring its recoverability.

Data Tiers & RPO/RTO

| Tier | Description | RPO | RTO |
| --- | --- | --- | --- |
| Critical | Core identity, config, and family data. | 1 Hour | 4 Hours |
| Standard | Application data, media, and tools. | 24 Hours | 24 Hours |
| Disposable | Caches, logs, temporary files. | N/A | Best Effort |

Retention Rules

  • Critical Data: Must be backed up daily, with weekly offsite replication. Retain for 30 days minimum.
  • System Config: Must be backed up after every confirmed change (via Git).
  • Offsite Copies: At least one copy of critical data must be physically separated from the primary site.
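
If Velero is the chosen backup tool (see the Operations role), the Critical tier maps to roughly this schedule. The name and namespace list are hypothetical; the hourly cadence follows the 1-hour RPO and the 720h TTL the 30-day retention:

```sh
kubectl apply -f - <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-hourly          # hypothetical name
  namespace: velero
spec:
  schedule: "0 * * * *"          # hourly, matching the 1h RPO for Critical data
  template:
    includedNamespaces:
      - identity                 # hypothetical namespace holding Critical data
    ttl: 720h0m0s                # 30-day retention
EOF
```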

Verification Requirements

  • Automated Checks: Every backup job must report its status to the Observability platform.
  • Restore Drills: A manual restore test must be performed for each “Critical” service at least once every 6 months.
  • Immutability: Backups should be stored in a way that prevents modification or deletion by a compromised system (e.g., append-only mode).

Change Management Policy

Rules

This document defines how changes are made to the platform to ensure stability, auditability, and reproducibility.

The Source of Truth

The platform is defined entirely in code. The Git repository is the sole source of truth for:

  1. Infrastructure Configuration: YAML, HCL, and scripts.
  2. Architecture Decisions: ADRs in Markdown.
  3. Technical Documentation: This book.

Change Workflow

All changes (except for emergency “break-glass” scenarios) must follow this flow:

  1. Draft: Propose the change in a new branch.
  2. Review: Peer review or self-review (for minor changes).
  3. Merge: Merge into the main branch.
  4. Deploy: Automated CI/CD pipelines apply the change.

Documentation Requirements

  • Significant architectural shifts MUST be recorded as an ADR.
  • All service deployments MUST have a corresponding entry in the Service Catalog.
  • Manual configuration on nodes is strictly forbidden unless codified immediately after.

Secrets Management

  • Clear-text secrets MUST NEVER be committed to Git.
  • Use a dedicated secrets manager or encrypted storage (e.g., SOPS) for credentials.
  • Secrets MUST be rotated if a compromise is suspected or as per the defined rotation schedule.
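
A minimal SOPS example, assuming age keys; the recipient string is a placeholder:

```sh
# Encrypt before committing; only the encrypted file ever reaches Git.
sops --encrypt --age <age-public-key> secrets.yaml > secrets.enc.yaml

# Decrypt locally or in CI where the matching private key is available.
sops --decrypt secrets.enc.yaml > secrets.yaml
```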

Data Durability Model

Responsibility View

This document defines how data is stored, replicated, and protected. It answers the question: How is data kept safe and available?

Data Pipeline

The following diagram shows the lifecycle of data from the application to offsite storage.

flowchart TB
  App[Stateful App] --> Request["Storage Interface"]
  Request --> Storage[Storage Fabric]
  Storage --> Replicas["Replicated Copies (N>=2)"]
  Replicas --> Backup[Backup / Snapshot System]
  Backup --> Offsite[(Optional: Offsite Copy)]

Layers of Protection

Storage Fabric

The active storage layer (e.g., Ceph, ZFS, or RAID). It provides immediate availability and protection against single-drive or single-node failures via real-time replication.

Snapshots

Point-in-time, read-only views of the storage. These provide “undo” capability for accidental deletions or software corruption without requiring a full restore.

Backup System

A separate, immutable copy of the data stored on different physical media. This protects against catastrophic failure of the primary storage fabric.

Workload Orchestration Model

Responsibility View

This document defines how applications are deployed and managed across the platform. It answers the question: How are workloads kept running and healthy?

Orchestration Lifecycle

The platform automatically manages the lifecycle of applications, ensuring they are placed on suitable nodes and restarted if they fail.

flowchart TD
  Def[Workload Definition] --> Desired[(Desired State Store)]
  
  subgraph ControlPlane["Control Plane (Decides)"]
    Recon[Reconciler / Controller]
    Sched[Scheduler]
  end

  subgraph DataPlane["Data Plane (Runs)"]
    subgraph Nodes["Nodes"]
      A[Node Agent]
      B[Node Agent]
      C[Node Agent]
    end
    WL[Running Workloads]
  end

  %% Observe
  Nodes --> Obs[Health & Telemetry Signals]
  WL --> Obs

  %% Decide
  Desired --> Recon
  Obs --> Recon
  Recon -->|needs placement| Sched
  Sched -->|bind workload| Nodes

  %% Actuate
  Recon -->|start/stop/restart| Nodes
  Nodes -->|run| WL

Key Capabilities

Automated Scheduling

Workloads are assigned to nodes based on resource availability (CPU/RAM) and affinity rules. This ensures that no single node is overwhelmed while others are idle.

Self-Healing

If a node or a specific workload fails, the scheduler automatically attempts to restart the workload on a healthy node, minimizing downtime.

Resource Governance

Every workload must have defined resource requests and limits. This prevents a single “noisy neighbor” from consuming all cluster resources.
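
A sketch of that governance on a Kubernetes workload; the names and values are illustrative:

```sh
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels: { app: example-app }
  template:
    metadata:
      labels: { app: example-app }
    spec:
      containers:
        - name: app
          image: nginx:stable      # placeholder image
          resources:
            requests: { cpu: 100m, memory: 128Mi }  # scheduler placement input
            limits: { cpu: 500m, memory: 256Mi }    # hard cap against noisy neighbors
EOF
```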

Observability Model

Responsibility View

This document defines how the platform monitors its health and alerts on failures. It answers the question: How do we know if something is wrong?

Signal Flow

The platform collects signals from all layers and aggregates them into actionable dashboards and alerts.

flowchart LR
  Nodes[Hardware/OS] --> Collector
  Pods[Workloads] --> Collector
  Ingress[Traffic] --> Collector
  Collector --> TSDB[(Metrics / Logs)]
  TSDB --> Dashboards[Visualization]
  TSDB --> AlertManager[Alerting]
  AlertManager --> Notification{Notification}

Core Signals

Metrics (Availability & Performance)

Numerical data points (CPU, Memory, Latency, Error Rate) used to determine the real-time health of a component.

Logs (Context & Security)

Textual records of events. Used for post-mortem analysis, security auditing, and troubleshooting complex failures.

Health Checks (Integrity)

Active probing of service endpoints (e.g., /healthz). This determines if a workload is ready to receive traffic or needs to be restarted.
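
In Kubernetes terms these two outcomes map to two probe types. A sketch; the path and port are placeholders:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: nginx:stable                       # placeholder image
      readinessProbe:                           # ready to receive traffic?
        httpGet: { path: /healthz, port: 80 }
        periodSeconds: 10
      livenessProbe:                            # restart the container if it wedges
        httpGet: { path: /healthz, port: 80 }
        failureThreshold: 3
EOF
```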

Control Plane Model

Purpose

This model defines where configuration lives, how it is applied, and what runs continuously vs only during deploys.

Control Flow

flowchart LR
  Git[Git Repository] --> CICD[CI/CD Pipeline]
  CICD --> Apply[Apply Mechanism]
  Apply --> Cluster[Cluster State]

  subgraph BreakGlass["Break-Glass Path"]
    Admin[Admin Session] --> Manual[Manual Change]
  end

  Manual --> Cluster
  Manual -. "Post-codify in Git" .-> Git

Configuration Sources of Truth

  • Primary: Git repository (IaC, manifests, scripts, docs).
  • Secrets: Encrypted secrets store (referenced from Git, never committed in clear text).

Apply Mechanism

  • CI/CD: Executes validation, build, and apply steps on merge to main.
  • IaC Tooling: Terraform/Ansible/Helm (implementation TBD, interchangeable by contract).
  • Controllers: In-cluster controllers reconcile desired state continuously.

Continuous vs Deploy-Time

  • Continuous: Ingress controllers, identity proxy, DNS sync jobs, monitoring/alerting.
  • Deploy-Time: Schema migrations, config changes, new service rollouts.

Break-Glass Rules

  • Manual changes are allowed only for incident response.
  • Any manual change MUST be codified in Git immediately after stabilization.

Management Plane Model

Purpose

This model defines where administrative endpoints live, how administrators authenticate, and which networks can reach management services.

Management Reachability

flowchart LR
  Admin[Admin Operator] -->|SSH / Git / HTTPS| MgmtEndpoints[Management Endpoints]
  PolicyNote["No inbound path from Internet"]

  subgraph Home["Home Network Boundary"]
    AdminLAN[Admin LAN] --> MgmtEndpoints
    VPN[Admin via VPN] --> MgmtEndpoints
  end

  MgmtEndpoints --> ControlPlane[Control Plane Services]
  MgmtEndpoints --> Nodes[Cluster Nodes]

  Internet((Internet)) -.-> PolicyNote

Access Rules

  • Management endpoints are never exposed to the public internet.
  • Only admin devices on Admin LAN or VPN can reach management endpoints.
  • Administrative access requires MFA and membership in privileged IdP groups.

Authentication Requirements

  • SSH: Keys or short-lived certificates only; passwords are forbidden.
  • Git/HTTPS: SSO with MFA enforced; audit logging enabled.
  • Break-Glass: Emergency accounts are stored in a secure vault and rotated after use.

Implementation Selection

Purpose

Move from architectural roles to concrete implementation choices by evaluating how different options compose into functional platform stacks.

1) Role Implementation Matrix

The following matrix summarizes the primary implementation options for each architectural role. For detailed trade-offs and integration notes, refer to the individual role documents.

| Role | Implementation Options | Primary Best-Fit Stacks |
| --- | --- | --- |
| Edge & Boundary | Traefik, NGINX Ingress, Caddy | K3s, K8s, Nomad |
| Identity & Access | Authentik, Keycloak, Authelia | All stacks |
| Connectivity & Naming | CoreDNS, ExternalDNS, Pi-hole, Consul | K8s, Nomad |
| Storage & Persistence | Longhorn, Rook-Ceph, ZFS, NFS | K3s, K8s, Nomad |
| Compute & Orchestration | K3s, K8s (Talos), Nomad | – |
| Operations | Prom/Grafana/Loki, GitOps, Velero, Restic | All stacks |

2) Stack Assemblies

Instead of starting with pre-baked bundles, we derive platform “stacks” as compatible sets of implementations that naturally compose together.

The Pragmatic Homelab (K3s-based)

Focuses on ease of use and low overhead while maintaining Kubernetes compatibility.

  • Orchestrator: K3s
  • Ingress: Traefik (Forward-auth)
  • LB (L4): Klipper (bundled) or MetalLB
  • Identity: Authentik
  • Storage: Longhorn
  • Backups: Velero + Restic
  • Observability: Prometheus + Grafana + Loki

The Appliance Cluster (Talos/K8s-based)

Focuses on HA, security, and immutability.

  • Orchestrator: Kubernetes on Talos
  • Ingress: NGINX Ingress (OAuth2-proxy)
  • LB (L4): Kube-vip (Layer 2)
  • Identity: Authentik or Keycloak
  • Storage: Rook-Ceph
  • Backups: Velero (CSI Snapshots)
  • Observability: Prometheus + Grafana + Loki

The Flexible Scheduler (Nomad-based)

Focuses on simplicity and host-integrated storage.

  • Orchestrator: Nomad
  • Ingress: Traefik or Caddy
  • Discovery/LB: Consul + Fabio/Traefik
  • Identity: Authentik (Forward-auth)
  • Storage: ZFS (Host volumes + Replication)
  • Backups: Restic
  • Observability: Prometheus + Grafana + Loki

3) Selection Criteria & Validation

We evaluate these stacks against our Non-Functional Requirements and Policies.

Hard Gates

These are non-negotiable policy checks.

  • No Inbound NAT: Must support exposure via tunnels or relay (see Exposure Policy).
  • Identity-First: All exposure points must enforce IdP-backed auth (see Identity Policy).
  • Cluster Reachability: Load balancing (L4/L7) must be addressed at Day 1; “floating” workloads require a stable entry point to be usable.
  • Durability: Must meet RPO 1h / RTO 4h for critical data (see Backup Policy).

Acceptance Tests

  1. Internal DNS: internal.service.risu.tech resolves internally and is unreachable from WAN.
  2. VPN access: VPN client resolves internal names and can access internal ingress.
  3. Public isolation: public ingress serves only public services, never internal.
  4. Identity flow: auth proxy + IdP flow works end-to-end for internal and public routes.
  5. Stateful proof: dummy stateful service gets storage, replica, backup job signal, and a restore test plan.

Non-Functional Requirements

This document details the non-functional requirements (NFRs) that govern the design, implementation, and operation of the homelab infrastructure.

Security

  • Secure Boundary Enforcement: Private services must be strictly isolated to prevent accidental exposure to the public internet.
  • Identity & Access Management: A centralized identity provider must be utilized, supporting multifactor authentication (MFA).
  • Secrets Governance: All credentials and sensitive data must be managed through defined storage and rotation policies.
  • Network Segmentation: Traffic flow between services must be restricted according to clearly defined security policies.

Connectivity & Networking

  • Seamless Remote Access: Remote devices must maintain an experience identical to local network connectivity via secure VPN.
  • Naming Consistency: A unified naming scheme (*.risu.tech) must be maintained across both public and private services using split-horizon DNS.

Availability & Reliability

  • High Availability (HA): The system must remain operational across multiple nodes, ensuring service continuity and data consistency.
  • Workload Rescheduling: Applications must automatically relocate to healthy nodes in the event of hardware or software failure.
  • Data Persistence: The storage fabric must guarantee data consistency and replication across failure domains.

Data Protection

  • Resilient Backup: Critical data must be protected through immutable and offline copies.
  • Disaster Recovery: Restoration procedures must meet defined Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
  • Restore Verification: Backup integrity must be regularly validated through systematic restore testing.

Usability

  • Low-Friction UX: The infrastructure must provide an intuitive and accessible experience for non-technical users.
  • Single Sign-On (SSO): Authentication must be streamlined to minimize login prompts through a unified session.

Maintainability

  • Advanced Observability: Centralized logging and metrics must be implemented to facilitate rapid troubleshooting and performance analysis.
  • Reproducibility: The entire infrastructure configuration must be defined within a central source-of-truth repository.
  • Documentation: Maintenance tasks must be supported by clear, actionable runbooks.
  • Automated Documentation Delivery: The source of truth for documentation must be automatically built and deployed to ensure accessibility and consistency.

Pipelines

Pipelines are found in the .forgejo/workflows/ directory in the source code repository, utilizing Forgejo Actions.

  • docs_deploy: Build mdBook and deploy static HTML to the documentation server via rsync/SSH.
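
A sketch of what docs_deploy could look like. The runner label, secret name, and target host are assumptions; only the mdBook-build-plus-rsync shape comes from the pipeline description above:

```sh
cat > .forgejo/workflows/docs_deploy.yml <<'EOF'
# Sketch: build mdBook and rsync the static HTML to the internal docs server.
on:
  push:
    branches: [main]
    paths: ['doc/**']
  workflow_dispatch:

jobs:
  deploy:
    runs-on: docker                 # hypothetical runner label
    steps:
      - uses: actions/checkout@v4
      - run: mdbook build doc       # assumes mdbook is available on the runner
      - run: |
          echo "${{ secrets.DEPLOY_SSH_KEY }}" > key && chmod 600 key
          rsync -az -e "ssh -i key" doc/book/ deploy@docs.internal:/var/www/doc/
EOF
```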

Service Catalog

This catalog contains the citizens of the platform. Each service is defined by a contract that specifies its requirements and how it fits into the platform’s architecture.

Catalog Entries

  • Service Contract (template)
  • OpenWRT Bootstrap Resolver
  • Technitium DNS
  • ExternalDNS

Lifecycle

Each service must satisfy the platform rules defined in Architecture Overview before it is shipped.

Service Contract: (Template)

Ownership

  • Owner:
  • Steward:

Purpose

What problem does it solve for the family/me?

Exposure

  • Category: Public | Internal | VPN-only | Management
  • Ingress: Public | Internal | Management
  • DNS names:

Identity

  • AuthN: SSO required | SSO + MFA required | local accounts (justify)
  • AuthZ: IdP group(s) required
  • Break-glass account: yes/no (location)

Data

  • Persistence: ephemeral | persistent
  • Data class: disposable | standard | critical
  • Estimated storage growth:

Network

  • Allowed source networks: Internet | LAN | VPN | Admin LAN
  • Egress requirements:

Availability

  • HA required: yes/no
  • Acceptable downtime:

Backup

  • Tier: none | standard | critical
  • Restore test cadence:

Dependencies

  • Needs database:
  • Needs object storage:
  • Needs SMTP:
  • Other:

Observability

  • Metrics:
  • Logs:
  • Alerts:

Change Control

  • Deployment method:
  • Rollback plan:

Notes / Risks

What could go wrong?

Service Contract: OpenWRT Bootstrap Resolver

Purpose

Authoritative DHCP/DNS front door for LAN/VPN clients; performs public recursion and conditionally forwards internal zones to Technitium while holding static overrides for recovery.

Exposure

  • Category: Internal | VPN-only
  • Ingress: Management
  • DNS names: distributed via DHCP; management UI reachable via static IP

Identity

  • AuthN: Local admin accounts
  • AuthZ: Admin account required for configuration changes
  • Break-glass account: Yes (documented in password vault)

Data

  • Persistence: Persistent (config backups required)
  • Data class: Standard
  • Estimated storage growth: Negligible

Network

  • Allowed source networks: LAN, VPN
  • Egress requirements: Public DNS upstreams; Internet for firmware updates

Availability

  • HA required: No (Phase 1 single resolver)
  • Acceptable downtime: Short maintenance windows; restores must be priority

Backup

  • Tier: Standard (export config before/after major changes)
  • Restore test cadence: After firmware updates or quarterly

Dependencies

  • Needs database: No
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable upstream DNS IPs

Observability

  • Metrics: DNS query/error counters (if available)
  • Logs: DNS and DHCP logs
  • Alerts: Loss of upstream resolution; DHCP pool exhaustion

Change Control

  • Deployment method: OpenWRT config/UI + git-backed config export
  • Rollback plan: Restore last known-good config backup

Notes / Risks

Phase 1 single point of failure for DNS; keep static overrides for Technitium and ingress VIP to enable recovery.

Service Contract: Technitium DNS

Purpose

Authoritative DNS for internal service names, serving LAN/VPN clients and Kubernetes-ingress endpoints; optional recursion or forwarding to OpenWRT.

Exposure

  • Category: Internal | VPN-only
  • Ingress: Internal
  • DNS names: dns.risu.tech (internal-only)

Identity

  • AuthN: Local admin accounts
  • AuthZ: Admin role required for zone changes
  • Break-glass account: Yes (stored in password vault)

Data

  • Persistence: Persistent (zones/config)
  • Data class: Standard
  • Estimated storage growth: Minimal

Network

  • Allowed source networks: LAN, VPN, cluster nodes
  • Egress requirements: Upstream DNS IPs (public or OpenWRT)

Availability

  • HA required: High (for internal service resolution) but not required for platform bootstrap
  • Acceptable downtime: Minutes; recovery path via OpenWRT static overrides

Backup

  • Tier: Standard (regular export of zones/config)
  • Restore test cadence: After major upgrades or quarterly

Dependencies

  • Needs database: No (embedded)
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable Service IP/VIP; upstream DNS reachable by IP

Observability

  • Metrics: Query rate, NXDOMAIN/servfail counts
  • Logs: Query/zone change logs
  • Alerts: Service availability; zone integrity errors

Change Control

  • Deployment method: Kubernetes (Talos) workload
  • Rollback plan: Redeploy previous version and restore last config backup

Notes / Risks

Must avoid DNS self-dependency: configure all upstreams and ExternalDNS endpoints by IP; keep WAN exposure disabled.

Service Contract: ExternalDNS

Purpose

Automate internal DNS records by reconciling annotated Kubernetes resources into Technitium with clear ownership boundaries.

Exposure

  • Category: Internal (cluster-only)
  • Ingress: Internal
  • DNS names: None (API-driven)

Identity

  • AuthN: Kubernetes service account
  • AuthZ: ClusterRole scoped to read ingress/service resources
  • Break-glass account: Not applicable

Data

  • Persistence: Ephemeral
  • Data class: Standard
  • Estimated storage growth: None

Network

  • Allowed source networks: Cluster nodes
  • Egress requirements: Technitium Service IP/VIP; Kubernetes API

Availability

  • HA required: No (automation only)
  • Acceptable downtime: Hours; existing records continue to resolve

Backup

  • Tier: None (state is declarative via Kubernetes + Technitium registry)
  • Restore test cadence: Not required

Dependencies

  • Needs database: No
  • Needs object storage: No
  • Needs SMTP: No
  • Other: Stable Technitium IP/VIP; domain filters/ownership registry configured

Observability

  • Metrics: Reconciliation success/fail counts
  • Logs: Controller logs for record changes
  • Alerts: Persistent reconciliation failures

Change Control

  • Deployment method: Kubernetes deployment/helm/manifest
  • Rollback plan: Revert deployment manifest/helm release

Notes / Risks

Restrict domain filters and ownership to internal hostnames to avoid accidental public zone changes.

Runbooks

Operational runbooks for the homelab platform. Each runbook is designed to be copy-paste friendly and scoped to a single failure or procedure.

Catalog

  • DNS Bootstrap & Recovery (OpenWRT + Technitium + ExternalDNS)

Runbook: DNS Bootstrap & Recovery (OpenWRT + Technitium + ExternalDNS)

Purpose

Bring up or restore internal DNS while avoiding dependency loops. Applies to split-horizon risu.tech with OpenWRT as bootstrap resolver, Technitium as internal authority, and ExternalDNS for automation.

Preconditions

  • OpenWRT reachable with admin access.
  • Reserved stable IPs/VIPs for Technitium and internal ingress.
  • Access to Kubernetes cluster (Talos) for Technitium/ExternalDNS deployments.

Bootstrap Steps (greenfield or re-seed)

  1. OpenWRT
    • Ensure DHCP is enabled and advertises itself as DNS.
    • Verify public recursion works using upstream DNS IPs.
  2. Static overrides on OpenWRT
    • Add host overrides (see the UCI sketch after these steps):
      • dns.risu.tech → Technitium IP/VIP
      • ingress-internal.risu.tech → internal ingress VIP (optional but recommended)
  3. Deploy Technitium
    • Deploy to the cluster with a stable Service IP/VIP.
    • Configure upstream resolvers by IP (public) or forward recursion to OpenWRT by IP.
    • Keep WAN exposure disabled.
  4. Conditional forward on OpenWRT
    • Add forward rule: risu.tech → Technitium IP/VIP.
  5. Deploy ExternalDNS
    • Scope with domain filters/ownership registry to internal hostnames only.
    • Set provider endpoint to the Technitium IP/VIP (not hostname).
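
A UCI sketch for steps 2 and 4, assuming Technitium at 10.0.20.2 and the internal ingress VIP at 10.0.20.10 (placeholder addresses):

```sh
# Step 2: static host overrides so recovery paths survive a Technitium outage.
uci add_list dhcp.@dnsmasq[0].address='/dns.risu.tech/10.0.20.2'
uci add_list dhcp.@dnsmasq[0].address='/ingress-internal.risu.tech/10.0.20.10'

# Step 4: conditionally forward the internal zone to Technitium.
uci add_list dhcp.@dnsmasq[0].server='/risu.tech/10.0.20.2'

uci commit dhcp && /etc/init.d/dnsmasq restart
```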

Recovery: Technitium Down

  1. From a LAN/VPN client, confirm public DNS still works via OpenWRT.
  2. Use OpenWRT static overrides to reach the cluster ingress/UI.
  3. Restart Technitium workload; restore config/zones if needed.
  4. Validate conditional forwarding resumes and internal names resolve.

Recovery: ExternalDNS Down

  1. Confirm Technitium answers existing records.
  2. Restart ExternalDNS deployment; check logs for reconciliation success.

Recovery: OpenWRT DNS Down

  1. Clients lose DNS; bring OpenWRT back first (single resolver in Phase 1).
  2. Verify DHCP/DNS service restores; re-check conditional forward to Technitium.

Verification & Tests

  • Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
  • Start cluster with Technitium intentionally delayed: control plane reachable via overrides.
  • Kill Technitium: public DNS works; internal names fail (expected).
  • Kill ExternalDNS: existing internal names resolve; no new records created.
  • WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.

Notes

  • Keep all DNS dependencies by IP to avoid “DNS needs DNS.”
  • Once resolver redundancy exists, you may move clients to Technitium directly; update this runbook accordingly.

Architecture Decision Records

This directory contains a historical log of significant architectural decisions made throughout the evolution of the homelab project. Each record details the context, decision, and resulting consequences to provide transparency and rationale for the system’s design.

Records Index

  • ADR 0001: Use Codeberg as Public Git Host
  • ADR 0002: Record Architecture Decisions
  • ADR 0003: Split-Horizon DNS for Unified Naming
  • ADR 0004: Documentation Delivery System
  • ADR 0005: No Inbound NAT for Internal Services
  • ADR 0006: Identity-First Ingress for Service Access

ADR 0001: Use Codeberg as Public Git Host

Status

Accepted

Context

The homelab project requires a public git repository to host its architecture documentation, infrastructure-as-code (IaC), and potentially public-facing service configurations. This host serves as the “public face” of the project and must align with the project’s values regarding open source, privacy, and community-driven infrastructure.

While a self-hosted instance (e.g., Forgejo/Gitea) will be used for internal management and private code, a reliable public host is needed for:

  • Public visibility and collaboration.
  • External CI/CD triggers (e.g., for documentation deployment).
  • Mirroring and redundancy for critical configurations.

Decision

We will use Codeberg as the primary public git host for the homelab project.

Codeberg is chosen because:

  • It is based on Forgejo (a community fork of Gitea), which aligns with our internal management plane preferences.
  • It is a non-profit, community-driven platform that prioritizes privacy and freedom.
  • It provides a reliable, high-performance environment for hosting public repositories without the commercial baggage of larger platforms.

Consequences

  • The homelab repository (and associated subprojects) will be maintained on Codeberg.
  • Automation for documentation deployment (mdBook) will be integrated with Codeberg’s CI/CD (Woodpecker or Forgejo Actions) or triggered by Codeberg webhooks.
  • Public contributions and issues will be managed via the Codeberg interface.
  • Secret management must be strictly enforced to ensure no private credentials are leaked to the public Codeberg repositories.

ADR 0002: Record Architecture Decisions

Status

Accepted

Context

A formal mechanism is required to document architectural decisions made during the development and evolution of the homelab project. This ensures long-term consistency, provides critical context for future modifications, and facilitates knowledge transfer.

Decision

The project will utilize Architecture Decision Records (ADRs) to document significant architectural choices. These records will be maintained within the doc/src/adr/ directory, following a sequential numbering scheme.

Consequences

  • Enhanced Transparency: Provides clear visibility into the reasoning behind key architectural choices.
  • Historical Context: Establishes a permanent record of the system’s evolution.
  • Sustainable Maintenance: Facilitates easier onboarding and long-term system maintenance by preserving intent.

ADR 0003: Split-Horizon DNS for Unified Naming

Status

Accepted

Context

The project requires a unified naming scheme (*.risu.tech) that functions seamlessly across both public and private services. Key requirements include maintaining strict isolation for private services and providing a frictionless remote access experience that mirrors local network connectivity.

Decision

We will implement a split-horizon DNS architecture:

  • Public DNS Authority: Resolves records exclusively for public-facing endpoints.
  • Private DNS Authority: Resolves records for internal services and serves as the primary authority for LAN and VPN clients.
  • Context-Aware Routing: Ingress controllers will enforce hostname-based routing determined by the traffic’s origin (public vs. private).
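
To make the split concrete, the following illustrative sketch shows how the same name can resolve differently per horizon. All hostnames and addresses here are hypothetical placeholders, not records from the actual zones:

```yaml
# Illustrative only; names and addresses are hypothetical placeholders.
# Public authority: answers internet queries, lists only exposed names.
public_zone:
  blog.risu.tech: 203.0.113.10    # public ingress VIP
  # wiki.risu.tech is intentionally absent: public queries get NXDOMAIN.

# Private authority: answers LAN/VPN queries for the same zone.
private_zone:
  blog.risu.tech: 10.0.20.10      # internal ingress VIP, avoids hairpin NAT
  wiki.risu.tech: 10.0.20.10
```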

Consequences

  • Unified User Experience: Users utilize consistent service names regardless of their physical or network location.
  • Enhanced Security Profile: Internal service names and metadata are not exposed to public DNS.
  • Operational Complexity: Requires the management and synchronization of two distinct sets of DNS records.

ADR 0004: Documentation Delivery System

Status

Delayed (time constraints on the runners prevent cargo from compiling the dependencies; a workaround is needed)

Context

Infrastructure documentation must be easily accessible to all authorized users and updated automatically to reflect the current state of the repository. The documentation is authored in Markdown and managed by mdBook. We need a robust pipeline to build and deliver this documentation to a private destination on an internal server.

Decision

We will implement an automated documentation delivery system with the following components:

  • Source of Truth: The homelab repository on Codeberg.
  • Build Engine: Forgejo Actions (using Forgejo Runners), triggered on pushes to the main branch (specifically for changes within the doc/ directory) or via manual trigger (workflow_dispatch).
  • Single-Target Delivery:
    • Private: Automated deployment to an internal server at /var/www/doc via SSH/rsync for local access.
  • Security: SSH-based deployment will use a dedicated, restricted user and an SSH key stored as a secret in the CI environment.
  • Serving: Nginx will be used to serve the static HTML output on the internal server.
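
A minimal workflow sketch follows. The runner label, deploy user, internal hostname, and secret name are assumptions for illustration, not values taken from the repository:

```yaml
# Hypothetical sketch of the delivery workflow; runner label, hostname,
# deploy user, and secret name are placeholders.
name: deploy-docs
on:
  push:
    branches: [main]
    paths: ["doc/**"]
  workflow_dispatch: {}

jobs:
  deploy:
    runs-on: docker                  # assumed Forgejo Runner label
    steps:
      - uses: actions/checkout@v4
      - name: Build the book
        # Assumes a runner image with a prebuilt mdBook binary.
        run: mdbook build doc
      - name: Deploy to the internal server
        run: |
          # Materialize the restricted deploy key with safe permissions.
          install -m 600 /dev/null key
          printf '%s' "${{ secrets.DOCS_DEPLOY_KEY }}" > key
          rsync -az --delete -e "ssh -i key -o StrictHostKeyChecking=accept-new" \
            doc/book/ docs-deploy@docs.internal:/var/www/doc/
```

Using a prebuilt mdBook binary (or a container image that ships one) also sidesteps the cargo compile-time limits noted in the Status above.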

Consequences

  • Automated Consistency: Documentation is guaranteed to be up-to-date with the repository’s main branch.
  • Reduced Complexity: Focusing on a single, internal delivery target simplifies the pipeline and avoids dependency on external “best-effort” services.
  • Standardized Process: Leverages Forgejo Actions, providing compatibility with GitHub Actions-style workflows and existing Runner infrastructure.
  • Secret Management: Requires careful handling of SSH keys within the CI platform.

ADR 0005: No Inbound NAT for Internal Services

Status

Accepted

Context

The platform hosts both public and internal services. Internal services must never be internet-routable to preserve a strong trust boundary. The architecture already assumes split-horizon DNS and internal ingress controls, but the routing posture must be explicit and enforceable.

Decision

There will be no inbound NAT or port-forwarding from the internet to internal service IPs. All internal services are reachable only from LAN or VPN networks through the internal ingress.

Consequences

  • Internet-originated traffic can never reach internal services directly.
  • Public exposure is limited to explicitly designated public services via the public ingress.
  • Network policies and firewall rules must reflect the absence of inbound NAT.

ADR 0006: Identity-First Ingress for Service Access

Status

Accepted

Context

The platform exposes services to multiple audiences (public, internal, VPN-only, management). To enforce consistent access control and auditing, authentication should be centralized and uniform rather than implemented independently by each service.

Decision

All services must be fronted by an ingress layer that enforces identity at the platform level. Services must integrate with the platform Identity Provider via SSO (OIDC/SAML) or trusted auth proxy headers, with MFA required for public and management access.
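
As a concrete sketch, an ingress-nginx resource can delegate every request to the IdP via forward-auth annotations. The hostnames and service names below are hypothetical, and the outpost paths follow Authentik's documented nginx integration (see ADR 0008):

```yaml
# Minimal sketch assuming ingress-nginx and an Authentik outpost; the
# hostnames and service names are hypothetical placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wiki
  annotations:
    # Every request is authenticated by the IdP before reaching the app.
    nginx.ingress.kubernetes.io/auth-url: "http://ak-outpost.auth.svc.cluster.local:9000/outpost.goauthentik.io/auth/nginx"
    # Unauthenticated users are redirected into the SSO login flow.
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.risu.tech/outpost.goauthentik.io/start?rd=$escaped_request_uri"
spec:
  ingressClassName: internal        # LAN/VPN-only ingress class
  rules:
    - host: wiki.risu.tech
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wiki
                port:
                  number: 80
```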

Consequences

  • Services must not expose unauthenticated endpoints unless explicitly approved in a Service Contract.
  • The ingress layer becomes a critical security control that must be monitored and hardened.
  • Service onboarding requires identity integration as a first-class step.

ADR 0007: Kubernetes with TalosOS

Status

Accepted

Context

The homelab platform targets a multi-node server environment with room for future capability expansion (for example, optional non-default plugins). K3s was considered, but its optimization for edge/IoT and bundled defaults are less aligned with the desired flexibility. Nomad was also evaluated for its simplicity and support for both containerized and non-containerized workloads. In this environment, infrastructure-as-code and an immutable OS reduce Nomad’s operational advantages, and non-containerized workloads are unlikely.

Decision

Adopt a full Kubernetes stack running on TalosOS as the base orchestration platform.
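
For orientation, a heavily trimmed Talos machine configuration might look like the sketch below. The values are hypothetical; `talosctl gen config` generates the real, complete file:

```yaml
# Trimmed sketch of a Talos machine config; all values are hypothetical.
version: v1alpha1
machine:
  type: controlplane            # or "worker"
  network:
    hostname: cp-01
  install:
    disk: /dev/sda              # target disk for the immutable OS image
cluster:
  controlPlane:
    endpoint: https://10.0.10.5:6443
```

Because the host OS is declared rather than hand-configured, node changes flow through the same Git-reviewed IaC path as the rest of the platform.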

Consequences

  • Ecosystem Flexibility: Kubernetes provides a broad ecosystem, extension points, and standard service discovery and load-balancing patterns.
  • Operational Model: TalosOS delivers an immutable, API-managed Kubernetes host OS and supports extensions and secure networking (for example, KubeSpan).
  • Complexity Trade-off: Operational complexity is higher than Nomad in isolation, but is mitigated by IaC and TalosOS automation.
  • Workload Standardization: Workloads will be standardized on containers unless a future ADR explicitly permits exceptions.

ADR 0008: Adopt Authentik as Central Identity Provider

Status

Accepted

Context

The platform needs a centralized identity and access solution that:

  • Supports SSO and MFA.
  • Protects both modern apps (OIDC/SAML) and legacy apps without federation support.
  • Integrates cleanly with the Edge/Boundary reverse proxy and internal DNS.
  • Is reproducible and manageable as code in a self-hosted environment.

Candidates included Authentik, Authelia, Zitadel, and Keycloak. The key differentiator is robust proxy-based enforcement combined with standards-based federation in a single system.

Decision

Adopt Authentik as the platform’s central IdP and access control system:

  • Use OIDC/SAML for apps that natively support federation.
  • Use Authentik proxy/outposts to protect web apps without OIDC/SAML.
  • Enforce MFA via Authentik policies/flows, with step-up where appropriate.

Consequences

  • Centralized Access: Consistent login/MFA experience across nearly all services.
  • Coverage for Legacy Apps: Proxy enforcement reduces per-app auth workarounds.
  • Critical Dependency: Authentik downtime can block access to protected services; monitoring and break-glass access are required.
  • Operational Discipline: Flows, policies, and outposts require configuration-as-code to avoid drift.
  • Container Standardization: Authentik becomes a core platform service and must meet backup/restore and upgrade standards.

Alternatives Considered

  • Keycloak + oauth2-proxy: Mature IdP, but requires additional gateway components.
  • Authelia: Strong proxy gate, weaker as a full IdP with rich flows.
  • Zitadel: Modern OIDC UX, but proxy protection is not a core feature.

ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary

Status

Accepted

Context

The network previously had both the ISP gateway and OpenWRT serving DHCP on the same subnet. This created an ambiguous boundary and undermined consistent policy enforcement at the edge.

Decision Drivers

  • Avoid non-deterministic gateway assignment and client routing.
  • Ensure consistent DNS behavior to support split-horizon.
  • Prepare for future HA/VIP routing patterns without conflicting DHCP sources.
  • Maintain a clear, singular security boundary for policy enforcement.

Decision

  • Place the ISP router/modem into bridge mode.
  • Make OpenWRT the sole DHCP and NAT authority for the subnet.
  • Keep IPMI disconnected by default due to switch-port exhaustion and power constraints; connect only when needed.

Consequences

  • Single Boundary: A single NAT/DHCP boundary improves policy enforcement and troubleshooting.
  • Predictable Clients: Gateway and DNS assignment become deterministic.
  • Future Migration: Simplifies future migration to a dedicated firewall or HA topology.
  • Operational Trade-off: IPMI access is on-demand rather than always available.

ADR 0010: Prefer Perimeter Firewall with Dual Ingress for Exposure

Status

Accepted

Context

Three exposure stacks were evaluated:

  • Model A — Perimeter firewall (OpenWRT now, upgradable later) owns routing/NAT; Kubernetes hosts two ingress controllers (internal-only and public).
  • Model B — Kubernetes-native edge using Gateway API with CNI-integrated data plane (e.g., Cilium) to terminate edge traffic directly on the cluster.
  • Model C — Cloud tunnel/overlay (e.g., Cloudflare Tunnel, Tailscale Funnel) to expose services without direct inbound paths.

The homelab prioritizes a clear internal/public boundary, minimal external dependencies, and the ability to swap in a dedicated firewall when hardware/power constraints ease. Existing OpenWRT already acts as the single boundary (see ADR 0009), and split-horizon DNS is assumed (ADR 0003). Identity-first ingress is required for user-facing access (ADR 0006).

Decision Drivers

  • Preserve a single, enforceable perimeter where north-south policy and logging live.
  • Keep internal ingress paths isolated from public ingress while supporting split-horizon DNS.
  • Allow future replacement of OpenWRT with a dedicated firewall without re-architecting cluster ingress.
  • Avoid new external dependencies for routine access; tolerate them only as scoped exceptions.
  • Fit power/port constraints and current hardware while enabling later VLAN/DMZ phases.

Considered Options

Model A — Perimeter Firewall + Dual Ingress

  • Pros: Clear boundary; firewall enforces 80/443 exposure; ingress controllers stay inside the cluster; works with current OpenWRT and future firewall/DMZ; keeps routing off the control plane.
  • Cons: Requires hairpin/port-forward rules and VIP management; firewall must forward to cluster nodes.

Model B — Kubernetes-Native Edge (Gateway API + CNI data plane)

  • Pros: Uniform policy definition inside K8s; fewer port-forwards; rich L7 features.
  • Cons: Pushes the trust boundary into the cluster; cluster health becomes a prerequisite for edge routing; complicates future dedicated firewall insertion; higher operational complexity today.

Model C — Cloud Tunnel / Overlay Exposure

  • Pros: Quick public exposure; hides home IP; minimal edge config.
  • Cons: Adds third-party dependency and opaque failure modes; blurs boundary and bypasses local policy/logging; harder to reason about internal vs. public reachability.

Decision

Adopt Model A (Perimeter firewall + dual ingress):

  • Keep routing/NAT/policy on the perimeter firewall (OpenWRT now; replaceable with a dedicated firewall later) and continue to expose only the minimal ports (80/443) required for public ingress.
  • Run two ingress controllers in the cluster:
    • Internal Ingress: LAN/VPN-only, resolves via split-horizon DNS to an internal VIP.
    • Public Ingress: Receives only firewall-forwarded 80/443 traffic to a public VIP; backs the small set of intentionally exposed hostnames.
  • Use identity-first auth at ingress per ADR 0006; no generic port-forwarding to services.
  • Allow cloud tunnels only as scoped, documented exceptions (e.g., break-glass outbound-only tunnels) with explicit change control.
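
A sketch of the dual-VIP wiring is shown below. The addresses, names, namespace, and the MetalLB-style annotation are assumptions about the load-balancer implementation, not settled configuration:

```yaml
# Sketch of the dual-VIP split; addresses, names, and the MetalLB-style
# annotation are assumptions about the environment.
apiVersion: v1
kind: Service
metadata:
  name: ingress-internal
  namespace: ingress
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.20.10"   # internal VIP, LAN/VPN only
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/instance: ingress-internal
  ports:
    - name: https
      port: 443
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-public
  namespace: ingress
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.30.10"   # public VIP; firewall forwards only 80/443 here
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/instance: ingress-public
  ports:
    - name: http
      port: 80
    - name: https
      port: 443
```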

Consequences

  • Boundary Clarity: North-south enforcement, logging, and DDoS controls stay at the perimeter; internal ingress remains shielded from the internet.
  • Upgrade Path: A future dedicated firewall or DMZ VLAN can replace OpenWRT without reworking cluster ingress (aligns with the Network Evolution Plan).
  • Operational Simplicity: Fewer moving parts at the edge; ingress lifecycle stays inside Kubernetes, where certificates and auth already live.
  • Constraints-Friendly: Works within current power/port limits; no requirement to run edge data plane on K8s nodes.
  • Risk: Firewall misconfiguration could still overexpose services; disciplined VIP/reservation management and monitoring of port-forwards are required.

Implementation Notes / Next Steps

  • Reserve VIPs for internal/public ingress in the SERVER/DMZ ranges defined in the Network Evolution Plan.
  • Maintain firewall rules: 80/443 to public ingress VIP only; no generic NAT for internal services (per ADR 0005).
  • Keep split-horizon DNS records aligned with the two ingress VIPs.
  • Document any exception tunnels with owners, scope, and teardown criteria.

ADR 0011: ExternalDNS + Technitium for Internal DNS Automation

Status

Accepted

Context

Internal DNS needs to provide LAN/VPN-only resolution for service hostnames while remaining automatable from Kubernetes. The solution must avoid bootstrap dependency loops (DNS needing DNS) and keep public DNS management separate from internal records.

Decision

Adopt Technitium as the internal authoritative DNS service and use ExternalDNS to reconcile annotated Kubernetes resources into Technitium. Keep OpenWRT as the client-facing bootstrap resolver, providing public recursion and conditional forwarding to Technitium with minimal static overrides for recovery.
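
As a sketch of the reconciliation contract: a workload publishes its desired hostname via annotation, and ExternalDNS is scoped so it can only touch the internal zone. The flags shown are standard ExternalDNS options; the Technitium webhook adapter and all names are assumptions about the deployment:

```yaml
# Sketch only: the annotation and flags are standard ExternalDNS usage,
# but the Technitium webhook adapter and names are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: wiki
  annotations:
    external-dns.alpha.kubernetes.io/hostname: wiki.risu.tech
spec:
  type: LoadBalancer
  selector:
    app: wiki
  ports:
    - port: 80
---
# Relevant ExternalDNS container arguments, scoped tightly:
#   --source=service
#   --source=ingress
#   --domain-filter=risu.tech     # never reconcile other zones
#   --policy=upsert-only          # never delete records it does not own
#   --provider=webhook            # assumed Technitium webhook adapter
```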

Consequences

  • Enables automated, authoritative internal DNS with clear ownership boundaries.
  • Avoids DNS dependency loops by using IP-based upstreams and keeping clients pointed at OpenWRT.
  • Increases operational complexity compared to static DNS; requires guardrails for split-horizon risu.tech and tight scoping of ExternalDNS domain filters.