Architecture Overview
Purpose
This document serves as the canonical source of truth for the Home Lab platform. It defines what exists and why, establishing the fundamental principles that guide all architectural decisions.
Definitions
Public Platform
The set of services and infrastructure components explicitly designed to be reachable from the public internet. This platform resides behind the Public Ingress and is subject to strict exposure policies.
Internal Platform
The core of the home lab, consisting of services reachable only from the local network (LAN) or via an authorized VPN connection. Access is defined by network presence and identity.
Trust Boundaries
Clear lines of demarcation between different security zones (Internet, LAN, VPN, Management). Every interaction across a boundary must be explicitly allowed and authenticated.
Platform Roles
The platform is composed of several stable roles that provide foundational services:
- Connectivity: DNS, Ingress, and VPN.
- Identity: Centralized authentication and SSO.
- Persistence: Replicated storage and backups.
- Operations: Monitoring, alerting, and change management.
Invariants
These rules are absolute and must not be violated by any implementation:
- Internal Isolation: Internal services are never internet-routable. No direct NAT or port-forwarding to internal services is permitted.
- Identity First: No service shall be exposed without an identity-aware proxy or native SSO integration unless explicitly justified in a Service Contract.
- Source of Truth: The Git repository is the sole authority for the state of the platform. Manual “hot-fixes” are technical debt that must be codified immediately.
- Data Durability: Critical data must exist in at least two physical locations at all times.
Non-Goals
- Real-time global availability (HA is local/cluster-based, not geo-distributed).
- Public hosting of third-party data.
- Replacement of enterprise-grade cloud services for high-risk workloads.
System Context
Map View
The following diagram provides a high-level orientation of the actors and systems involved in the Home Lab ecosystem.
```mermaid
flowchart TD
    subgraph Users [Users]
        Family["Family User"]
        Owner["Admin"]
        Public["Public Visitor"]
    end
    PublicPlane["Public Platform (behind Public Ingress)"]
    InternalPlane["Internal Platform (LAN/VPN only)"]
    subgraph Control ["Change Automation"]
        Automation["CI/CD + IaC Pipelines"]
    end
    subgraph External ["External Dependencies"]
        DNS["Cloud DNS"]
        Internet["The Internet"]
    end
    Family -- "HTTPS / LAN / VPN" --> InternalPlane
    Owner -- "SSH / Git / HTTPS" --> InternalPlane
    Owner -- "Git / CI/CD" --> Automation
    Public -- "HTTPS" --> PublicPlane
    Automation -- "Deploys / Config" --> InternalPlane
    Automation -- "Deploys / Config" --> PublicPlane
    Automation -- "DNS record management (automation)" --> DNS
    PublicPlane -- "Traffic" --> Internet
    InternalPlane -- "Traffic" --> Internet
```
Actors & Systems
| Entity | Role | Description |
|---|---|---|
| Family User | Internal User | Accesses personal services (Wiki, Photos, Chat) from within the LAN or via VPN. |
| Admin | Infrastructure Owner | Manages the platform, security, and service configurations via SSH, Git, and HTTPS. |
| Public Visitor | External User | Accesses public-facing content and websites hosted on the platform. |
| Public Platform | Public Plane | Internet-facing services reachable through the public ingress. |
| Internal Platform | Internal Plane | Core services and management endpoints reachable only from LAN or VPN. |
| Change Automation | Control Plane | CI/CD and IaC pipelines that apply platform changes and manage DNS records. |
| Cloud DNS | External System | Managed DNS provider hosting the risu.tech zone, updated by automation for split-horizon or public resolution. |
| The Internet | Network | Public network through which external visitors arrive and internal resources are reached. |
Network Model v1 (Power-Constrained Phase)
Purpose
Document the as-built network state, the rationale behind it, and the intended evolution path. This is the baseline substrate for ingress, naming, and service exposure decisions.
As-Built Topology
Physical Topology
```
Internet
   |
ISP Modem (Bridge Mode)
   |
OpenWRT Router (Single NAT / DHCP / DNS)
   |
LAN Clients + Server Nodes
(IPMI connected on-demand only)
```
Logical Roles
| Role | Device/Service |
|---|---|
| Edge NAT | OpenWRT |
| DHCP Authority | OpenWRT |
| DNS | OpenWRT (AdGuard) |
| VPN Client Egress | OpenWRT (WireGuard -> iVPN) |
| ISP Modem | Bridge mode only (no routing) |
IP Plan (Current)
- Single flat LAN (one subnet).
- DHCP and DNS are authoritative only on OpenWRT.
- Specific CIDR, DHCP ranges, and static reservations live in OpenWRT config.
Trade-offs (Intentional)
- No VLAN segmentation yet: Deferred due to hardware and power constraints.
- No dedicated firewall: OpenWRT fulfills boundary duties for now.
- No managed switch: The network spine is temporary; port/power constraints apply.
- IPMI not always-on: Connected only when needed to conserve ports and power.
Evolution Roadmap
- Phase 1 (Current): Single NAT/DHCP/DNS, flat LAN.
- Phase 2: Add managed switch and introduce VLANs.
- Phase 3: Dedicated firewall and segmented trust zones.
Related Decisions
- ADR 0005: No Inbound NAT for Internal Services
- ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary
Network Evolution Plan (VLANs and Ingress Separation)
Purpose
Define the next phases for segmentation and ingress separation so the current flat LAN can evolve without disruptive renumbering.
Phase Targets
- Keep the existing LAN (`10.0.0.0/24`) stable during transition.
- Introduce clear trust boundaries: Clients, Servers, Management, DMZ, IoT, Guest, Lab.
- Reserve address space and VIP ranges now to simplify later MetalLB/kube-vip usage.
- Separate public and internal ingress paths, with split-horizon DNS.
VLANs and Subnets (Proposed)
| VLAN | Name | Subnet | Purpose | Typical Residents |
|---|---|---|---|---|
| 10 | LAN | 10.0.0.0/24 | Default user network | PCs, phones, TVs |
| 20 | SERVER | 10.0.20.0/24 | App workloads, cluster nodes | Talos/K8s nodes, storage |
| 30 | MGMT | 10.0.30.0/24 | Out-of-band + admin | IPMI/BMC, switch/AP management |
| 40 | DMZ | 10.0.40.0/24 | Public-facing edge only | Public ingress VIPs / edge svc |
| 50 | IOT | 10.0.50.0/24 | Untrusted devices | Cameras, smart devices |
| 60 | GUEST | 10.0.60.0/24 | Visitor access | Guest Wi-Fi clients |
| 70 | LAB | 10.0.70.0/24 | Experiments | Test gear, ephemeral nodes |
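The VLAN plan above can be sanity-checked mechanically. A minimal sketch using Python's `ipaddress` module, verifying that no two proposed subnets overlap before any renumbering work begins (the `VLAN_PLAN` dictionary simply transcribes the table):

```python
import ipaddress

# Proposed VLAN plan from the table above (VLAN id -> subnet).
VLAN_PLAN = {
    10: "10.0.0.0/24",    # LAN
    20: "10.0.20.0/24",   # SERVER
    30: "10.0.30.0/24",   # MGMT
    40: "10.0.40.0/24",   # DMZ
    50: "10.0.50.0/24",   # IOT
    60: "10.0.60.0/24",   # GUEST
    70: "10.0.70.0/24",   # LAB
}

def validate_plan(plan):
    """Return True iff no two subnets in the plan overlap."""
    nets = {vlan: ipaddress.ip_network(cidr) for vlan, cidr in plan.items()}
    pairs = [(a, b) for a in nets for b in nets if a < b]
    return all(not nets[a].overlaps(nets[b]) for a, b in pairs)
```

Running this against the table confirms the seven subnets are disjoint; a plan that accidentally mixed `10.0.0.0/16` with any of the `/24`s would fail.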
DHCP and Gateway Plan (Examples)
Assuming router-on-a-stick (trunk to switch):
| VLAN | Gateway | DHCP Scope | Notes |
|---|---|---|---|
| 10 | 10.0.0.1 | 10.0.0.10–250 | Keep current allocations |
| 20 | 10.0.20.1 | 10.0.20.50–250 | Reserve low IPs for VIPs/statics |
| 30 | 10.0.30.1 | none or limited | Prefer static/reservations |
| 40 | 10.0.40.1 | none or limited | DMZ should be explicit |
| 50 | 10.0.50.1 | 10.0.50.50–250 | Tight egress rules |
| 60 | 10.0.60.1 | 10.0.60.50–250 | Internet only |
| 70 | 10.0.70.1 | optional | Lab isolation |
Default Inter-VLAN Policy (Allow Only What Is Needed)
- LAN (10) → Internal ingress/services (20): allow service ports.
- LAN (10) → MGMT (30): deny, except specific admin workstation or VPN admin group.
- VPN/Admin → MGMT (30): allow.
- DMZ (40) → Servers (20): allow only public ingress backends.
- IOT (50) → anywhere: deny by default, allow minimal egress if needed.
- GUEST (60) → internal: deny (internet only).
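The policy above is deny-by-default with a short explicit allowlist, which can be sketched as a rule table. Zone names and the `VPN_ADMIN` pseudo-zone are illustrative stand-ins, not actual firewall configuration:

```python
# Deny-by-default inter-VLAN policy sketch. Zone names and the
# "VPN_ADMIN" pseudo-zone are illustrative, not real firewall config.
ALLOW_RULES = {
    ("LAN", "SERVER"),      # LAN -> internal ingress/services (service ports)
    ("VPN_ADMIN", "MGMT"),  # VPN/admin group -> management
    ("DMZ", "SERVER"),      # DMZ -> public ingress backends only
}

def is_allowed(src_zone: str, dst_zone: str) -> bool:
    """Deny by default; allow only explicitly listed zone pairs."""
    if src_zone == dst_zone:
        return True  # intra-VLAN traffic is not filtered here
    return (src_zone, dst_zone) in ALLOW_RULES
```

Any pair not in the allowlist (GUEST to SERVER, LAN to MGMT, IOT to anything) falls through to deny, matching the bullets above.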
Ingress Separation Model
- Public Ingress: Internet-reachable hostnames only; prefer placement in DMZ (VLAN 40) when available.
- Internal Ingress: LAN/VPN-only hostnames; placed in SERVER (VLAN 20) or LAN (VLAN 10) during early phase.
- Start with both ingress controllers in VLAN 20 (simpler); move Public Ingress VIPs to VLAN 40 when DMZ exists.
VIP Reservations (Examples)
- Internal ingress VIPs: `10.0.20.10`–`10.0.20.19`
- Public ingress VIPs: `10.0.40.10`–`10.0.40.19`
- Gateways: `.1`; network services: `.2`–`.9`
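A hedged sketch of how these reservations could be validated: each VIP range must sit inside its VLAN's subnet and stay clear of the reserved `.1`–`.9` block (gateway plus network services). The helper names are illustrative:

```python
import ipaddress

def vip_range(first: str, last: str):
    """Expand an inclusive VIP range into individual addresses."""
    lo, hi = ipaddress.ip_address(first), ipaddress.ip_address(last)
    return [ipaddress.ip_address(i) for i in range(int(lo), int(hi) + 1)]

def range_is_safe(first: str, last: str, subnet: str) -> bool:
    """VIPs must sit inside the subnet and above the reserved .1-.9 block."""
    net = ipaddress.ip_network(subnet)
    return all(v in net and int(v) - int(net.network_address) >= 10
               for v in vip_range(first, last))
```

Both example reservations above pass this check; a range dipping into `.2`–`.9` or placed in the wrong subnet would not.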
DNS Expectations (Split-Horizon)
- Use the unified namespace `*.risu.tech` (per Exposure Policy and Split-Horizon ADR).
- Internal-only names resolve to internal VIPs (e.g., `wiki.risu.tech` → `10.0.20.10` on LAN/VPN).
- Public names resolve externally only when intentionally exposed (e.g., `status.risu.tech`).
- Internal resolvers must not return public IPs for internal-only names.
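The split-horizon expectations above reduce to a simple view-dependent lookup. A minimal sketch, assuming two record sets per view (the records and the `203.0.113.10` documentation-prefix IP are illustrative):

```python
# Split-horizon view sketch: records and views are illustrative.
INTERNAL_RECORDS = {"wiki.risu.tech": "10.0.20.10"}    # LAN/VPN view only
PUBLIC_RECORDS = {"status.risu.tech": "203.0.113.10"}  # intentionally exposed

def resolve(name: str, view: str):
    """Return an IP for the requesting view, or None (NXDOMAIN)."""
    if view in ("lan", "vpn"):
        # Internal clients see internal VIPs first, then public names.
        return INTERNAL_RECORDS.get(name) or PUBLIC_RECORDS.get(name)
    # External clients never see internal-only names.
    return PUBLIC_RECORDS.get(name)
```

Note the last invariant is encoded structurally: the internet view simply has no path to `INTERNAL_RECORDS`, so an internal-only name can never leak a public answer.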
Diagram (Ingress and Trust Zones)
```mermaid
flowchart TD
    Internet((Internet)) --> WAN[WAN]
    WAN --> Edge["Router/Firewall: OpenWRT now, dedicated later (policy gate)"]
    subgraph VLAN10[LAN 10 - 10.0.0.0/24]
        Clients[LAN Clients]
    end
    subgraph VLAN20[SERVER 20 - 10.0.20.0/24]
        Nodes[K8s/Talos Nodes]
        IntIngress[Internal Ingress VIPs]
        Services[Internal Services]
    end
    subgraph VLAN30[MGMT 30 - 10.0.30.0/24]
        IPMI[IPMI/BMC]
        NetMgmt[Switch/AP Mgmt]
    end
    subgraph VLAN40[DMZ 40 - 10.0.40.0/24]
        PubIngress[Public Ingress VIPs]
    end
    Edge --> VLAN10
    Edge --> VLAN20
    Edge --> VLAN30
    Edge --> VLAN40
    Clients --> IntIngress --> Services
    Internet -.->|Allowed 80/443 only via firewall/NAT| PubIngress --> Services
```
Migration Steps (Incremental)
- Current (flat): keep everything on `10.0.0.0/24`, single DHCP (done).
- Add managed switch: trunk to router, keep most devices untagged on VLAN 10.
- Move servers to VLAN 20; keep clients on VLAN 10.
- Move management to VLAN 30 (static/reserved IPs).
- Add DMZ VLAN 40 for public ingress VIPs; expose only 80/443 as needed.
Related Decisions
- ADR 0005: No Inbound NAT for Internal Services
- ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary
- ADR 0003: Split-Horizon DNS
Platform Roles
The Home Lab platform is built upon a set of stable, well-defined roles. These roles represent the “bones” of the infrastructure—foundational capabilities that must remain stable regardless of which specific applications are running.
Role Catalog
- Edge & Boundary: Defining where the public internet stops and the private network begins.
- Identity & Access: Providing a single source of truth for users and permissions.
- Connectivity & Naming: Ensuring services are reachable via consistent, human-readable names.
- Storage & Persistence: Guaranteeing data durability and availability across the cluster.
- Compute & Orchestration: Managing the lifecycle of containerized workloads.
- Operations: Handling observability, change management, and backups.
Mapping Roles to Implementation
Each role is defined by its responsibilities and requirements. The specific technologies used to fulfill these roles (e.g., K3s, Authelia, Traefik) may evolve, but the roles themselves remain constant.
Role: Edge & Boundary
Responsibility
The Edge & Boundary role is the first line of defense. It is responsible for terminating public traffic and enforcing the transition from untrusted networks (Internet) to trusted networks (Home Network/VPN).
Key Guarantees
- Traffic Termination: All public HTTPS traffic must terminate at the Edge.
- L7 Load Balancing: Spreading requests across multiple “floating” service instances regardless of their physical node location.
- Protocol Enforcement: Only authorized protocols (HTTPS, WireGuard) are permitted to cross the boundary.
- Isolation: Publicly reachable services must be logically isolated from the internal-only platform.
Related Models & Policies
- Trust Boundaries & Access Model
- Exposure Policy
- Network Model v1 (Power-Constrained Phase)
- Network Evolution Plan (VLANs and Ingress Separation)
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| Traefik | K3s, K8s, Nomad | Simple config, Let’s Encrypt, dynamic discovery, forward-auth. | Some features are Traefik-native; middleware sprawl. | Works with OIDC via forward-auth; requires standard headers. |
| NGINX Ingress | K8s, Talos | Very common, strong annotation ecosystem. | Auth relies on external proxies (oauth2-proxy); annotation-heavy. | Pairs well with oauth2-proxy; explicit ingress classes needed. |
| Caddy | Nomad, Small K8s | TLS automation; simple reverse proxy story. | Less “platformy” out of the box; varies by env. | Decide if identity is enforced here or at auth gateway. |
Typical Stack Pairings
- K3s: Traefik (native feel)
- Talos/K8s: NGINX Ingress (most common)
- Nomad: Traefik or Caddy
Role: Identity & Access
Responsibility
The Identity role provides the “Who” for the entire platform. It manages user identities, credentials, and group memberships, and provides a unified authentication experience (SSO).
Key Guarantees
- Centralized Truth: One directory for all human users.
- MFA Enforcement: Critical services must require multi-factor authentication.
- SSO: Users should only need to authenticate once to access multiple platform services.
Related Models & Policies
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| Authentik | All stacks | Flexible flows, “one IdP for everything”. | Operating an IdP (backup, upgrades, DB). | Choose enforcement at ingress: forward-auth, oauth2-proxy, or mesh-based. |
| Keycloak | K8s, Talos | Enterprise-grade, standard OIDC/SAML, great docs. | Heavy; tuning/upgrade complexity. | Pairs well with oauth2-proxy and standard OIDC clients. |
| Authelia | K3s, Nomad | Lightweight auth portal, simple 2FA, forward-auth. | Less of a “platform” than Authentik/Keycloak. | If OIDC is needed for apps, a full IdP might still be required. |
Typical Stack Pairings
- Traefik: Authentik + forward-auth (or Authelia)
- NGINX Ingress: Authentik/Keycloak + oauth2-proxy
- Any: IdP + apps using OIDC directly (for “native SSO” apps)
Role: Connectivity & Naming
Responsibility
This role ensures that users and services can find each other. It handles DNS resolution and internal routing, maintaining a consistent namespace across local and remote connections.
Key Guarantees
- Unified Namespace: Use of `*.risu.tech` globally.
- Split-Horizon DNS: Internal names resolve to internal IPs; external names point to the Edge.
- Service Discovery: Automatic detection and registration of “floating” workloads.
- L4 Load Balancing (VIP): Providing stable virtual IPs for cluster-wide services (like Ingress) to ensure they are reachable even if nodes fail.
Current Stack Choice
- OpenWRT as the bootstrap resolver, Technitium DNS as the internal authority, and ExternalDNS for Kubernetes-driven automation. Details and runbooks live in Connectivity & Naming Stack.
Related Models & Policies
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| CoreDNS + ExternalDNS | K8s, K3s, Talos | K8s-native, clean in-cluster discovery. | Split-horizon needs careful design. | Decide “source of truth” (Git/IaC) and VPN DNS view. |
| Pi-hole / AdGuard Home | Any (External) | Easy local DNS + blocking; great for split-horizon. | Another stateful service; HA takes effort. | Ensure VPN hands out this DNS; avoid public leaks. |
| WireGuard / Tailscale | Any | Stable remote access. | Tailscale is managed-ish; WireGuard is DIY. | DNS distribution over VPN is the key integration point. |
| MetalLB / Kube-vip | K8s, K3s, Talos | Provides L4 LoadBalancer IPs on bare-metal. | Requires network support (ARP/BGP); configuration overhead. | Essential for giving the Ingress Controller a stable IP in a cluster. |
| Consul | Nomad | First-class in Nomad ecosystems. | Adds a control plane component. | Decide how Consul names map to your DNS naming scheme. |
Typical Stack Pairings
- K8s: MetalLB/Kube-vip + CoreDNS + ExternalDNS + WireGuard/Tailscale
- Nomad: Consul (+ DNS integration) + Traefik/Fabio + WireGuard/Tailscale
- Hybrid: Pi-hole/AdGuard as “front” DNS for LAN/VPN regardless of orchestrator
Connectivity & Naming Stack (OpenWRT + Technitium + ExternalDNS)
Scope
Defines how names under risu.tech are resolved for LAN, VPN, and Kubernetes workloads, and how DNS automation and failure modes behave.
Goals
- Internal names (e.g., `wiki.risu.tech`) resolve only on LAN/VPN.
- Public names resolve from anywhere without exposing internal metadata.
- Internal DNS is authoritative and automated from Kubernetes via ExternalDNS.
- DNS outages degrade safely: public domains keep resolving and the platform can bootstrap without internal DNS.
Non-Goals
- Multi-site or geo-distributed DNS.
- Automating the public zone in this phase.
- Making the router a permanent authoritative DNS platform.
Roles and Responsibilities
- OpenWRT (bootstrap resolver): DHCP authority, default resolver for clients, recursion to public upstreams, conditional forward of internal zones to Technitium, local static overrides for recovery.
- Technitium DNS (internal authority): Hosts authoritative internal records and optional recursion; reachable only from LAN/VPN; uses IP-based upstream configuration.
- ExternalDNS (automation controller): Watches Kubernetes resources and reconciles allowed records into Technitium; limited to explicitly delegated hostnames.
Resolution Flows
- Internal name (normal): Client → OpenWRT → Technitium → internal VIP/endpoint.
- Public name: Client → OpenWRT → public recursive resolution.
Dependency-Loop Prevention
- Principle: nothing required to bootstrap the platform should depend on Technitium.
- Invariants:
- Clients always use OpenWRT as resolver in Phase 1.
- OpenWRT keeps minimal static records (Technitium VIP and internal ingress VIP) to reach recovery paths.
- Technitium upstreams are configured by IP or forward recursion to OpenWRT by IP.
- ExternalDNS targets Technitium by stable IP/VIP, not hostname.
Failure Behavior
- Technitium down: Internal names fail except the static overrides; public names still resolve via OpenWRT.
- ExternalDNS down: Existing records served; no new automation until it returns.
- OpenWRT DNS down: Clients lose DNS (Phase 1 SPOF); acceptable until resolver redundancy is added.
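The dependency-loop invariants and the failure table above can be exercised with a toy model of the Phase 1 chain (client → OpenWRT → Technitium or public upstreams). The override hostnames are hypothetical placeholders for the bootstrap records, and the model deliberately simplifies split-horizon by treating non-`risu.tech` names as public:

```python
def lookup(name, openwrt_up=True, technitium_up=True,
           static_overrides=("technitium.risu.tech", "ingress.risu.tech")):
    """Toy model of the Phase 1 chain: client -> OpenWRT -> Technitium/public.

    Hostnames in static_overrides are illustrative bootstrap records.
    Returns the component that answered, or None on failure.
    """
    if not openwrt_up:
        return None                  # Phase 1 SPOF: clients lose all DNS
    if not name.endswith("risu.tech"):
        return "public-upstream"     # public recursion still works
    if name in static_overrides:
        return "openwrt-static"      # recovery path, no Technitium needed
    return "technitium" if technitium_up else None
```

The model reproduces the documented behavior: with Technitium down, public names and the static overrides still resolve while other internal names fail; with OpenWRT down, everything fails.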
Zone Strategy
- Preferred: split-horizon `risu.tech` (same zone name internal and public).
- Safety controls: Technitium not reachable from WAN; ExternalDNS constrained via annotation/label allowlists, TXT ownership, and domain filters; public DNS managed separately.
- Alternative: internal sub-zone such as `int.risu.tech` if split-horizon proves risky.
Record Ownership
- ExternalDNS-managed: Annotated Kubernetes services/ingresses that are allowed for automation.
- Manually managed: Bootstrap overrides on OpenWRT, core infrastructure names, sensitive records.
Kubernetes Integration
- CoreDNS handles in-cluster service discovery and does not depend on Technitium.
- ExternalDNS maintains registry markers to avoid overwriting manual records.
- Prefer VIP/stable IP for Technitium reachable from OpenWRT and workloads.
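The ownership split above (automation-managed vs. manually managed records) hinges on registry markers. A hedged sketch of the gating rule, with an invented marker format rather than ExternalDNS's actual TXT registry encoding:

```python
# Sketch of ownership gating for automated records. The registry
# marker format is illustrative, not ExternalDNS's actual wire format.
OWNER_ID = "external-dns-homelab"

def may_update(name: str, registry: dict) -> bool:
    """Only touch records we created (marked) or that don't exist yet."""
    marker = registry.get(name)
    return marker is None or marker == OWNER_ID
```

The effect is the one required above: bootstrap overrides, core infrastructure names, and other manually marked records are never overwritten by automation.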
Testing
- Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
- Bring up cluster with Technitium delayed: control-plane access must work via OpenWRT overrides.
- Kill Technitium: public DNS works; internal names fail as expected.
- Kill ExternalDNS: existing internal names still resolve.
- WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.
Implementation Notes
- Keep OpenWRT as the client-facing resolver in Phase 1; migrate clients later only after resolver redundancy exists.
- Favor IP-based configuration for anything that talks to DNS to avoid “DNS requires DNS” loops.
- Use stable VIPs where possible so OpenWRT, Technitium, and ExternalDNS share a consistent target.
Related Documents
- Role: Connectivity & Naming
- ADR 0011: ExternalDNS + Technitium for Internal DNS Automation
- Service contracts: OpenWRT, Technitium DNS, ExternalDNS
- Runbooks: DNS Bootstrap & Recovery
Role: Storage & Persistence
Responsibility
The Storage role manages the state of the platform. It provides persistent volumes to applications and ensures that data is replicated and backed up according to its criticality.
Key Guarantees
- Data Durability: Protection against single-node or single-disk failure.
- RPO/RTO Compliance: Backups must be performed and verified according to policy.
- Abstraction: Applications should request storage via standard interfaces (e.g., PVCs) without knowing the underlying disk layout.
Related Models & Policies
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| Longhorn | K3s, K8s | Easy replicated block storage. | Performance/latency tradeoffs. | Tie into snapshot/backup story (Velero). |
| Rook-Ceph | K8s, Talos | Powerful HA, block/file/object storage. | Complexity; resource hungry; learning curve. | Needs disciplined node disks and upgrade choreography. |
| ZFS | Nomad, K8s | Solid local storage, snapshots, replication. | Not a distributed fabric; HA is “replication + restore”. | Orchestration integration varies; great for “pet data”. |
| NFS / SMB | Any | Simple shared storage. | Central dependency; HA depends on NAS. | Backups are straightforward; locking semantics vary. |
Typical Stack Pairings
- K3s: Longhorn
- Talos/K8s: Rook-Ceph (for strong HA) or NFS (for simplicity)
- Nomad: ZFS (host-based) + replication or NFS
Role: Compute & Orchestration
Responsibility
The Compute role provides the execution environment for all platform workloads. It handles scheduling, lifecycle management (start/stop/restart), and resource isolation between tenants.
Key Guarantees
- Automated Scheduling: Workloads are placed on nodes based on resource availability and constraints.
- Self-Healing: Automatic recovery of failed workloads.
- Resource Isolation: Enforced limits on CPU, memory, and disk to prevent “noisy neighbor” effects.
Related Models & Policies
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| K3s | Pragmatic homelab | Lightweight Kubernetes; easy setup; bundled extras (Traefik). | Still Kubernetes complexity; some opinionated defaults. | Best with Traefik + Longhorn patterns. |
| Kubernetes on Talos | Enterprise-grade HA | Immutable OS; security; “OS as appliance” feel. | Steeper learning curve; non-traditional debugging. | Pairs with NGINX + Rook-Ceph + Velero. |
| Nomad | Simple/Flexible | Easy to run; handles non-container workloads; low overhead. | Smaller ecosystem than K8s; stateful patterns more DIY. | Best with Consul/Vault, ZFS host volumes, Traefik/Caddy. |
Typical Stack Pairings
- K3s: Traefik + Longhorn
- Talos/K8s: NGINX + Rook-Ceph
- Nomad: Traefik + ZFS + Consul
Role: Operations
Responsibility
The Operations role encompasses the tools and processes required to maintain, monitor, and update the platform. It ensures repeatability through GitOps and visibility through observability.
Key Guarantees
- Observability: Centralized metrics, logs, and alerting for all platform components.
- Change Management: Git (GitOps) drives all platform changes.
- Disaster Recovery: Automated backups and verified restore paths for stateful data.
Related Models & Policies
Implementation Options
| Option | Best Fit | Good At | Costs / Risks | Integration Notes |
|---|---|---|---|---|
| Prometheus / Grafana / Loki | All stacks | Standard dashboards; large community; mature. | Can sprawl; needs retention planning. | K8s has the most turnkey packaging (kube-prometheus-stack). |
| GitOps (Argo CD / Flux) | K8s, K3s | Repeatability; drift control; clear audit trail. | Higher initial platform complexity. | Very mature on K8s; Nomad has options but less standardized. |
| Velero | K8s, K3s | K8s-native backups; CSI integration for snapshots. | K8s-specific. | Best for K8s/K3s clusters. |
| Restic | Any | General purpose; deduplication; encrypted backups. | More manual configuration on K8s. | Great for Nomad/ZFS/NAS-based approaches. |
Typical Stack Pairings
- K8s/K3s: Prometheus + ArgoCD/Flux + Velero
- Nomad/Other: Prometheus + Restic
Trust Boundaries & Access Model
Trust View
This document defines the network reachability and security posture of the platform. It answers the question: From where can traffic originate and where can it go?
System Boundaries
The platform is divided into distinct zones with hard boundaries.
```mermaid
flowchart LR
    Internet((Internet)) -->|HTTPS| PublicIngress[Public Ingress]
    PolicyNote["No inbound NAT or public path to Internal Ingress"]
    subgraph Home["Home Network Boundary"]
        LAN[LAN Clients] --> PrivateDNS[Private DNS]
        VPN[VPN Clients] --> PrivateDNS
        PrivateDNS --> InternalIngress
        InternalIngress --> Services[Internal Services]
        PublicIngress --> PublicServices[Public Services]
    end
    PublicDNS[Public DNS] --> PublicIngress
```
Zone Definitions
The Internet (Untrusted)
Any client originating outside the home network. Only allowed to communicate with the Public Ingress via HTTPS.
The Home Network (Trusted Boundary)
A secure zone containing both LAN and VPN clients.
- LAN Clients: Physical devices connected to the home router.
- VPN Clients: Remote devices with an active, authenticated tunnel.
- Enrollment: Only verified devices are permitted to join the network.
- Experience: Remote devices experience connectivity identical to local network access (Split-Horizon DNS + Private IPs).
- Security: Encrypted communication channels are maintained for all remote traffic.
Internal Platform (Protected)
Services that are never exposed to the internet. Reachability is strictly limited to clients already inside the Home Network Boundary.
Reachability Matrix
| From \ To | Public Services | Internal Services | Management (SSH/Git) |
|---|---|---|---|
| Internet | HTTPS | ❌ Blocked | ❌ Blocked |
| LAN | HTTPS | HTTPS | Authorized Only |
| VPN | HTTPS | HTTPS | Authorized Only |
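The matrix above is small enough to encode directly, with unknown source/destination pairs falling back to blocked (deny by default). Zone keys below are illustrative lowercase labels for the matrix rows and columns:

```python
# Reachability matrix from the table above, encoded as a lookup.
REACHABILITY = {
    ("internet", "public"):   "https",
    ("internet", "internal"): "blocked",
    ("internet", "mgmt"):     "blocked",
    ("lan", "public"):        "https",
    ("lan", "internal"):      "https",
    ("lan", "mgmt"):          "authorized-only",
    ("vpn", "public"):        "https",
    ("vpn", "internal"):      "https",
    ("vpn", "mgmt"):          "authorized-only",
}

def reachability(src: str, dst: str) -> str:
    """Unknown pairs fall back to 'blocked' (deny by default)."""
    return REACHABILITY.get((src, dst), "blocked")
```

Encoding the matrix this way also makes the key postures testable: there is no tuple that lets the internet reach internal services or management.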
Key Security Postures
- No Inbound NAT: There are no port-forwarding rules from the internet to internal service IPs.
- Split-Horizon DNS: Service names (e.g., `app.risu.tech`) resolve to different IPs depending on whether the client is on the Internet or the Home Network.
- Authenticated Ingress: All internal services require identity verification at the Ingress layer.
Identity & Login Model
Responsibility View
This document explains how identity and authentication work end-to-end. It answers the question: How does a user gain access to a service?
Authentication Flow
The following diagram illustrates the branching logic for user authentication, including session persistence and MFA requirements.
```mermaid
flowchart TD
    Start([User visits service.risu.tech]) --> Ingress[Ingress / Auth Proxy]
    Ingress --> CheckAppSession{Valid Session?}
    CheckAppSession -- "Yes" --> Forward[Forward to App]
    CheckAppSession -- "No" --> IdP[Redirect to Identity Provider]
    subgraph IdPFlow [Identity Provider]
        IdP --> CheckIdPSession{IdP Session exists?}
        CheckIdPSession -- "No" --> Login[User Credentials Prompt]
        Login --> Validate[Validate Credentials]
        Validate -- "Success" --> CheckMFA{MFA required?}
        Validate -- "Failure" --> Login
        CheckIdPSession -- "Yes" --> CheckMFA
        CheckMFA -- "Yes" --> MFAPrompt[MFA Challenge]
        MFAPrompt --> ValidateMFA[Validate MFA]
        ValidateMFA -- "Success" --> Authorize[Authorize Access]
        ValidateMFA -- "Failure" --> MFAPrompt
        CheckMFA -- "No" --> Authorize
    end
    Authorize --> Grant[Redirect with Session Cookie]
    Grant --> Ingress
    Forward --> Service[App Content]
    Service --> End([Success])
```
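The branching logic above can be condensed into a pure decision function. This is a sketch of the flow's outcomes, not any particular proxy's implementation; retries are collapsed into a single `"retry"` result rather than looping:

```python
def authenticate(has_app_session: bool, has_idp_session: bool,
                 creds_ok: bool, mfa_required: bool, mfa_ok: bool) -> str:
    """Condensed sketch of the authentication flow diagrammed above.

    Returns 'forward' (valid session), 'grant' (fresh login succeeds),
    or 'retry' (credentials or MFA challenge failed).
    """
    if has_app_session:
        return "forward"             # valid cookie: straight to the app
    if not has_idp_session and not creds_ok:
        return "retry"               # back to the credentials prompt
    if mfa_required and not mfa_ok:
        return "retry"               # back to the MFA challenge
    return "grant"                   # authorize; redirect with session cookie
```

Note that an existing IdP session skips the credentials prompt but not the MFA decision, mirroring the "IdP Session exists? Yes → MFA required?" edge in the diagram.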
Functional Components
Auth Proxy / Ingress
The first line of defense. It intercepts requests and verifies the presence of a valid session cookie. If missing or expired, it handles the OIDC/SAML handshake with the IdP.
Identity Provider (IdP)
The source of truth for user accounts and groups. It manages credentials, MFA enrollment, and issues tokens upon successful authentication.
Application (Service)
The destination. Most applications are “auth-blind,” relying on the Auth Proxy to provide user information via headers (e.g., `X-Forwarded-User`).
Platform Policies
These policies define the “Hard Gates” and operational constraints that all platform implementations and services must satisfy. They ensure consistency, security, and durability across the environment.
Policy Index
- Exposure Policy: Defining how services are made reachable and protecting the boundary.
- Identity Policy: Mandating identity-first access for all platform components.
- Backup Policy: Setting requirements for data durability (RPO/RTO).
- Change Management Policy: Defining how the platform and its services are updated.
Exposure Policy
Rules
This document defines how services are exposed to users and the network requirements for each exposure category.
Exposure Categories
Public
- Definition: Services reachable from the internet.
- DNS: Must resolve to the Public IP of the platform.
- Auth: Must enforce SSO/MFA at the Ingress layer.
- TLS: Must use valid, publicly trusted certificates.
Internal
- Definition: Services reachable only from LAN or VPN.
- DNS: Must resolve to a Private IP (RFC1918).
- Auth: Must enforce SSO at the Ingress layer.
- TLS: Should use certificates (internal or public CA).
VPN-Only
- Definition: Services reachable only from VPN clients.
- DNS: Must resolve to a Private IP (RFC1918) only on VPN resolvers.
- Auth: Must enforce SSO at the Ingress layer.
- TLS: Should use certificates (internal or public CA).
Management
- Definition: Administrative endpoints (SSH, Git, control plane consoles).
- DNS: Must resolve to management-only records or private IPs.
- Auth: Must enforce MFA and privileged access controls.
- TLS: Must use certificates (internal or public CA).
Mandatory Ingress Behavior
| Category | Allowed Ingress | Allowed Source Networks | DNS Resolution |
|---|---|---|---|
| Public | Public Ingress only | Internet | Public IP |
| Internal | Internal Ingress only | LAN + VPN | Private IP |
| VPN-Only | Internal Ingress only | VPN only | Private IP (VPN resolvers only) |
| Management | Management endpoints only | Admin LAN + VPN | Private IP / management records |
Mandatory Auth Requirements
| Category | Authentication | Authorization |
|---|---|---|
| Public | SSO + MFA at Ingress | Group-based access (IdP) |
| Internal | SSO at Ingress | Group-based access (IdP) |
| VPN-Only | SSO at Ingress | Group-based access (IdP) |
| Management | MFA + privileged access | Admin-only groups, audited access |
Naming Rules
- All services MUST use the `*.risu.tech` domain.
- Internal service names MUST match their public counterparts (if they exist) to ensure a seamless user experience.
- The platform uses Split-Horizon DNS to ensure that `app.risu.tech` resolves to the correct IP based on the client’s network location.
Traffic Constraints
- Public Ingress MUST NOT route traffic to backends tagged as “Internal.”
- Internal Ingress MUST drop any traffic originating from outside the Home Network Boundary.
- No direct port-forwarding (NAT) to backend services is allowed. All traffic must pass through an Ingress controller.
Identity Policy
Rules
This document defines the rules all services and users must obey regarding identity and access.
Guarantees
- Unified Login: A single set of credentials and session is used across all platform services.
- MFA Enforcement: Multi-factor authentication is mandatory for all administrative access and any service exposed to the public internet (where supported).
- Session Isolation: Authentication is handled by the platform, not the application, ensuring a uniform security posture.
Service Requirements
All services integrated into the platform MUST:
- Delegate Auth: Rely on the platform’s Identity Provider via OIDC, SAML, or Auth Proxy headers.
- Use Group-Based Access: Authorization should be based on IdP groups (e.g., `admins`, `family`), not individual user accounts.
- Support SSO: Be configured to allow seamless login via the platform session.
Auth Requirements by Exposure Category
| Category | Authentication | Authorization | Notes |
|---|---|---|---|
| Public | SSO + MFA enforced at ingress | IdP groups required | No anonymous access unless explicitly approved in a Service Contract. |
| Internal | SSO enforced at ingress | IdP groups required | Local accounts disallowed except break-glass. |
| VPN-Only | SSO enforced at ingress | IdP groups required | VPN enrollment required for network access. |
| Management | MFA required for all access | Admin-only groups | SSH keys or short-lived certs required for shell access. |
Management Access Rules
- Administrative endpoints MUST be reachable only from Admin LAN or VPN networks.
- SSH access MUST use keys or short-lived certificates; passwords are forbidden.
- All management access MUST be attributable to a named admin identity and logged.
Negative Constraints
- Services MUST NOT maintain their own local user databases for “standard” access.
- Local “admin” or “break-glass” accounts MUST have high-entropy, randomly generated passwords stored in a secure vault.
- Clear-text passwords MUST NEVER be stored in the Git repository.
Backup Policy
Rules
This document defines the rules for protecting data and ensuring its recoverability.
Data Tiers & RPO/RTO
| Tier | Description | RPO | RTO |
|---|---|---|---|
| Critical | Core identity, config, and family data. | 1 Hour | 4 Hours |
| Standard | Application data, media, and tools. | 24 Hours | 24 Hours |
| Disposable | Caches, logs, temporary files. | N/A | Best Effort |
Retention Rules
- Critical Data: Must be backed up daily, with weekly offsite replication. Retain for 30 days minimum.
- System Config: Must be backed up after every confirmed change (via Git).
- Offsite Copies: At least one copy of critical data must be physically separated from the primary site.
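As a sketch of how the Critical tier’s RPO and the 30-day retention rule could be expressed in practice, here is a Velero `Schedule` (Velero appears later in this book as one backup option); the namespace names are illustrative, not prescribed.

```yaml
# Sketch: hourly backup of Critical-tier workloads (meets RPO <= 1 hour).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-hourly
  namespace: velero
spec:
  schedule: "0 * * * *"   # every hour, on the hour
  template:
    includedNamespaces:   # illustrative Critical-tier namespaces
      - identity
      - family-data
    ttl: 720h             # retain 30 days, per the retention rules
```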
Verification Requirements
- Automated Checks: Every backup job must report its status to the Observability platform.
- Restore Drills: A manual restore test must be performed for each “Critical” service at least once every 6 months.
- Immutability: Backups should be stored in a way that prevents modification or deletion by a compromised system (e.g., append-only mode).
Change Management Policy
Rules
This document defines how changes are made to the platform to ensure stability, auditability, and reproducibility.
The Source of Truth
The platform is defined entirely in code. The Git repository is the sole source of truth for:
- Infrastructure Configuration: YAML, HCL, and scripts.
- Architecture Decisions: ADRs in Markdown.
- Technical Documentation: This book.
Change Workflow
All changes (except for emergency “break-glass” scenarios) must follow this flow:
- Draft: Propose the change in a new branch.
- Review: Peer review or self-review (for minor changes).
- Merge: Merge into the `main` branch.
- Deploy: Automated CI/CD pipelines apply the change.
Documentation Requirements
- Significant architectural shifts MUST be recorded as an ADR.
- All service deployments MUST have a corresponding entry in the Service Catalog.
- Manual configuration on nodes is strictly forbidden unless codified immediately after.
Secrets Management
- Clear-text secrets MUST NEVER be committed to Git.
- Use a dedicated secrets manager or encrypted storage (e.g., SOPS) for credentials.
- Secrets MUST be rotated if a compromise is suspected or as per the defined rotation schedule.
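The SOPS rule above can be enforced at the repository root with a `.sops.yaml` policy so sensitive files are always encrypted before commit. This is a hedged sketch: the path pattern, value regex, and age recipient are placeholders.

```yaml
# Sketch: .sops.yaml creation rules so secret values never land in Git in clear text.
creation_rules:
  - path_regex: .*/secrets/.*\.ya?ml$
    # Encrypt only the sensitive values, leaving keys readable for review.
    encrypted_regex: ^(data|stringData|password|token)$
    # age public key of the operations vault (placeholder value).
    age: age1examplepublickeyplaceholder0000000000000000000000000000000
```

With this in place, `sops` refuses to write matching files unencrypted, turning the “never commit clear-text secrets” rule into a mechanical check rather than a convention.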
Data Durability Model
Responsibility View
This document defines how data is stored, replicated, and protected. It answers the question: How is data kept safe and available?
Data Pipeline
The following diagram shows the lifecycle of data from the application to offsite storage.
```mermaid
flowchart TB
    App[Stateful App] --> Request[Storage Interface]
    Request --> Storage[Storage Fabric]
    Storage --> Replicas["Replicated Copies (N>=2)"]
    Replicas --> Backup[Backup / Snapshot System]
    Backup --> Offsite[(Optional: Offsite Copy)]
```
Layers of Protection
Storage Fabric
The active storage layer (e.g., Ceph, ZFS, or RAID). It provides immediate availability and protection against single-drive or single-node failures via real-time replication.
Snapshots
Point-in-time, read-only views of the storage. These provide “undo” capability for accidental deletions or software corruption without requiring a full restore.
Backup System
A separate, immutable copy of the data stored on different physical media. This protects against catastrophic failure of the primary storage fabric.
Workload Orchestration Model
Responsibility View
This document defines how applications are deployed and managed across the platform. It answers the question: How are workloads kept running and healthy?
Orchestration Lifecycle
The platform automatically manages the lifecycle of applications, ensuring they are placed on suitable nodes and restarted if they fail.
```mermaid
flowchart TD
    Def[Workload Definition] --> Desired[(Desired State Store)]
    subgraph ControlPlane["Control Plane (Decides)"]
        Recon[Reconciler / Controller]
        Sched[Scheduler]
    end
    subgraph DataPlane["Data Plane (Runs)"]
        subgraph Nodes["Nodes"]
            A[Node Agent]
            B[Node Agent]
            C[Node Agent]
        end
        WL[Running Workloads]
    end
    %% Observe
    Nodes --> Obs[Health & Telemetry Signals]
    WL --> Obs
    %% Decide
    Desired --> Recon
    Obs --> Recon
    Recon -->|needs placement| Sched
    Sched -->|bind workload| Nodes
    %% Actuate
    Recon -->|start/stop/restart| Nodes
    Nodes -->|run| WL
```
Key Capabilities
Automated Scheduling
Workloads are assigned to nodes based on resource availability (CPU/RAM) and affinity rules. This ensures that no single node is overwhelmed while others are idle.
Self-Healing
If a node or a specific workload fails, the scheduler automatically attempts to restart the workload on a healthy node, minimizing downtime.
Resource Governance
Every workload must have defined resource requests and limits. This prevents a single “noisy neighbor” from consuming all cluster resources.
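In Kubernetes terms, the resource-governance rule maps directly to per-container requests and limits. The manifest below is an illustrative sketch; the image name and the specific values are placeholders, not platform defaults.

```yaml
# Sketch: a Deployment with the mandatory requests/limits per container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:1.0   # placeholder image
          resources:
            requests:              # guaranteed share; used by the scheduler for placement
              cpu: 100m
              memory: 128Mi
            limits:                # hard cap; stops a noisy neighbor from starving the node
              cpu: 500m
              memory: 256Mi
```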
Observability Model
Responsibility View
This document defines how the platform monitors its health and alerts on failures. It answers the question: How do we know if something is wrong?
Signal Flow
The platform collects signals from all layers and aggregates them into actionable dashboards and alerts.
```mermaid
flowchart LR
    Nodes[Hardware/OS] --> Collector
    Pods[Workloads] --> Collector
    Ingress[Traffic] --> Collector
    Collector --> TSDB[(Metrics / Logs)]
    TSDB --> Dashboards[Visualization]
    TSDB --> AlertManager[Alerting]
    AlertManager --> Notification{Notification}
```
Core Signals
Metrics (Availability & Performance)
Numerical data points (CPU, Memory, Latency, Error Rate) used to determine the real-time health of a component.
Logs (Context & Security)
Textual records of events. Used for post-mortem analysis, security auditing, and troubleshooting complex failures.
Health Checks (Integrity)
Active probing of service endpoints (e.g., /healthz). This determines if a workload is ready to receive traffic or needs to be restarted.
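In Kubernetes, the `/healthz` probing described above maps to liveness and readiness probes: the first restarts a dead workload, the second gates traffic. The port and timings below are illustrative assumptions.

```yaml
# Sketch: pod-spec fragment wiring /healthz into liveness and readiness probes.
containers:
  - name: app
    image: example/app:1.0   # placeholder image
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15      # restart the container if this keeps failing
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5       # remove from service endpoints while failing
```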
Control Plane Model
Purpose
This model defines where configuration lives, how it is applied, and what runs continuously vs only during deploys.
Control Flow
```mermaid
flowchart LR
    Git[Git Repository] --> CICD[CI/CD Pipeline]
    CICD --> Apply[Apply Mechanism]
    Apply --> Cluster[Cluster State]
    subgraph BreakGlass["Break-Glass Path"]
        Admin[Admin Session] --> Manual[Manual Change]
    end
    Manual --> Cluster
    Manual -. "Post-codify in Git" .-> Git
```
Configuration Sources of Truth
- Primary: Git repository (IaC, manifests, scripts, docs).
- Secrets: Encrypted secrets store (referenced from Git, never committed in clear text).
Apply Mechanism
- CI/CD: Executes validation, build, and apply steps on merge to `main`.
- IaC Tooling: Terraform/Ansible/Helm (implementation TBD, interchangeable by contract).
- Controllers: In-cluster controllers reconcile desired state continuously.
Continuous vs Deploy-Time
- Continuous: Ingress controllers, identity proxy, DNS sync jobs, monitoring/alerting.
- Deploy-Time: Schema migrations, config changes, new service rollouts.
Break-Glass Rules
- Manual changes are allowed only for incident response.
- Any manual change MUST be codified in Git immediately after stabilization.
Management Plane Model
Purpose
This model defines where administrative endpoints live, how administrators authenticate, and which networks can reach management services.
Management Reachability
```mermaid
flowchart LR
    Admin[Admin Operator] -->|SSH / Git / HTTPS| MgmtEndpoints[Management Endpoints]
    PolicyNote["No inbound path from Internet"]
    subgraph Home["Home Network Boundary"]
        AdminLAN[Admin LAN] --> MgmtEndpoints
        VPN[Admin via VPN] --> MgmtEndpoints
    end
    MgmtEndpoints --> ControlPlane[Control Plane Services]
    MgmtEndpoints --> Nodes[Cluster Nodes]
    Internet((Internet)) -.-> PolicyNote
```
Access Rules
- Management endpoints are never exposed to the public internet.
- Only admin devices on Admin LAN or VPN can reach management endpoints.
- Administrative access requires MFA and membership in privileged IdP groups.
Authentication Requirements
- SSH: Keys or short-lived certificates only; passwords are forbidden.
- Git/HTTPS: SSO with MFA enforced; audit logging enabled.
- Break-Glass: Emergency accounts are stored in a secure vault and rotated after use.
Implementation Selection
Purpose
Move from architectural roles to concrete implementation choices by evaluating how different options compose into functional platform stacks.
1) Role Implementation Matrix
The following matrix summarizes the primary implementation options for each architectural role. For detailed trade-offs and integration notes, refer to the individual role documents.
| Role | Implementation Options | Primary Best-Fit Stacks |
|---|---|---|
| Edge & Boundary | Traefik, NGINX Ingress, Caddy | K3s, K8s, Nomad |
| Identity & Access | Authentik, Keycloak, Authelia | All Stacks |
| Connectivity & Naming | CoreDNS, ExternalDNS, Pi-hole, Consul | K8s, Nomad |
| Storage & Persistence | Longhorn, Rook-Ceph, ZFS, NFS | K3s, K8s, Nomad |
| Compute & Orchestration | K3s, K8s (Talos), Nomad | - |
| Operations | Prom/Grafana/Loki, GitOps, Velero, Restic | All Stacks |
2) Stack Assemblies
Instead of starting with pre-baked bundles, we derive platform “stacks” as compatible sets of implementations that naturally compose together.
The Pragmatic Homelab (K3s-based)
Focuses on ease of use and low overhead while maintaining Kubernetes compatibility.
- Orchestrator: K3s
- Ingress: Traefik (Forward-auth)
- LB (L4): Klipper (bundled) or MetalLB
- Identity: Authentik
- Storage: Longhorn
- Backups: Velero + Restic
- Observability: Prometheus + Grafana + Loki
The Appliance Cluster (Talos/K8s-based)
Focuses on HA, security, and immutability.
- Orchestrator: Kubernetes on Talos
- Ingress: NGINX Ingress (OAuth2-proxy)
- LB (L4): Kube-vip (Layer 2)
- Identity: Authentik or Keycloak
- Storage: Rook-Ceph
- Backups: Velero (CSI Snapshots)
- Observability: Prometheus + Grafana + Loki
The Flexible Scheduler (Nomad-based)
Focuses on simplicity and host-integrated storage.
- Orchestrator: Nomad
- Ingress: Traefik or Caddy
- Discovery/LB: Consul + Fabio/Traefik
- Identity: Authentik (Forward-auth)
- Storage: ZFS (Host volumes + Replication)
- Backups: Restic
- Observability: Prometheus + Grafana + Loki
3) Selection Criteria & Validation
We evaluate these stacks against our Non-Functional Requirements and Policies.
Hard Gates
These are non-negotiable policy checks.
- No Inbound NAT: Must support exposure via tunnels or relays (see Exposure Policy).
- Identity-First: All exposure points must enforce IdP-backed auth (see Identity Policy).
- Cluster Reachability: Load balancing (L4/L7) must be addressed on Day 1; “floating” workloads require a stable entry point to be usable.
- Durability: Must meet RPO 1h / RTO 4h for critical data (see Backup Policy).
Acceptance Tests
- Internal DNS: `internal.service.risu.tech` resolves internally and is unreachable from WAN.
- VPN access: VPN client resolves internal names and can access internal ingress.
- Public isolation: public ingress serves only public services, never internal.
- Identity flow: auth proxy + IdP flow works end-to-end for internal and public routes.
- Stateful proof: dummy stateful service gets storage, replica, backup job signal, and a restore test plan.
Non-Functional Requirements
This document details the non-functional requirements (NFRs) that govern the design, implementation, and operation of the homelab infrastructure.
Security
- Secure Boundary Enforcement: Private services must be strictly isolated to prevent accidental exposure to the public internet.
- Identity & Access Management: A centralized identity provider must be utilized, supporting multifactor authentication (MFA).
- Secrets Governance: All credentials and sensitive data must be managed through defined storage and rotation policies.
- Network Segmentation: Traffic flow between services must be restricted according to clearly defined security policies.
Connectivity & Networking
- Seamless Remote Access: Remote devices must maintain an experience identical to local network connectivity via secure VPN.
- Naming Consistency: A unified naming scheme (`*.risu.tech`) must be maintained across both public and private services using split-horizon DNS.
Availability & Reliability
- High Availability (HA): The system must remain operational across multiple nodes, ensuring service continuity and data consistency.
- Workload Rescheduling: Applications must automatically relocate to healthy nodes in the event of hardware or software failure.
- Data Persistence: The storage fabric must guarantee data consistency and replication across failure domains.
Data Protection
- Resilient Backup: Critical data must be protected through immutable and offline copies.
- Disaster Recovery: Restoration procedures must meet defined Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
- Restore Verification: Backup integrity must be regularly validated through systematic restore testing.
Usability
- Low-Friction UX: The infrastructure must provide an intuitive and accessible experience for non-technical users.
- Single Sign-On (SSO): Authentication must be streamlined to minimize login prompts through a unified session.
Maintainability
- Advanced Observability: Centralized logging and metrics must be implemented to facilitate rapid troubleshooting and performance analysis.
- Reproducibility: The entire infrastructure configuration must be defined within a central source-of-truth repository.
- Documentation: Maintenance tasks must be supported by clear, actionable runbooks.
- Automated Documentation Delivery: The source of truth for documentation must be automatically built and deployed to ensure accessibility and consistency.
Pipelines
Pipelines are found in the .forgejo/workflows/ directory in the source code repository, utilizing Forgejo Actions.
- docs_deploy: Build mdBook and deploy static HTML to the documentation server via rsync/SSH.
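A minimal sketch of the `docs_deploy` workflow, in the GitHub Actions-compatible syntax Forgejo Actions uses. The runner label, deploy user, internal hostname, and secret name are assumptions; the `/var/www/doc` target comes from ADR 0004.

```yaml
# Sketch: .forgejo/workflows/docs_deploy.yaml (runner label, host, and secret name are placeholders).
name: docs_deploy
on:
  push:
    branches: [main]
    paths:
      - "doc/**"
  workflow_dispatch:
jobs:
  deploy:
    runs-on: docker
    steps:
      - uses: actions/checkout@v4
      - name: Build static HTML
        run: mdbook build doc   # assumes mdBook is available on the runner image
      - name: Deploy via rsync over SSH
        env:
          DEPLOY_KEY: ${{ secrets.DOCS_DEPLOY_KEY }}   # restricted deploy user's key
        run: |
          mkdir -p ~/.ssh
          printf '%s\n' "$DEPLOY_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
          rsync -az --delete -e "ssh -o StrictHostKeyChecking=accept-new" \
            doc/book/ deploy@docs.internal:/var/www/doc/
```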
Service Catalog
This catalog contains the citizens of the platform. Each service is defined by a contract that specifies its requirements and how it fits into the platform’s architecture.
Catalog Entries
Lifecycle
Each service must satisfy the platform rules defined in Architecture Overview before it is shipped.
Service Contract:
Ownership
- Owner:
- Steward:
Purpose
What problem does it solve for the family/me?
Exposure
- Category: Public | Internal | VPN-only | Management
- Ingress: Public | Internal | Management
- DNS names:
Identity
- AuthN: SSO required | SSO + MFA required | local accounts (justify)
- AuthZ: IdP group(s) required
- Break-glass account: yes/no (location)
Data
- Persistence: ephemeral | persistent
- Data class: disposable | standard | critical
- Estimated storage growth:
Network
- Allowed source networks: Internet | LAN | VPN | Admin LAN
- Egress requirements:
Availability
- HA required: yes/no
- Acceptable downtime:
Backup
- Tier: none | standard | critical
- Restore test cadence:
Dependencies
- Needs database:
- Needs object storage:
- Needs SMTP:
- Other:
Observability
- Metrics:
- Logs:
- Alerts:
Change Control
- Deployment method:
- Rollback plan:
Notes / Risks
What could go wrong?
Service Contract: OpenWRT Bootstrap Resolver
Purpose
Authoritative DHCP/DNS front door for LAN/VPN clients; performs public recursion and conditionally forwards internal zones to Technitium while holding static overrides for recovery.
Exposure
- Category: Internal | VPN-only
- Ingress: Management
- DNS names: distributed via DHCP; management UI reachable via static IP
Identity
- AuthN: Local admin accounts
- AuthZ: Admin account required for configuration changes
- Break-glass account: Yes (documented in password vault)
Data
- Persistence: Persistent (config backups required)
- Data class: Standard
- Estimated storage growth: Negligible
Network
- Allowed source networks: LAN, VPN
- Egress requirements: Public DNS upstreams; Internet for firmware updates
Availability
- HA required: No (Phase 1 single resolver)
- Acceptable downtime: Short maintenance windows; restores must be a priority
Backup
- Tier: Standard (export config before/after major changes)
- Restore test cadence: After firmware updates or quarterly
Dependencies
- Needs database: No
- Needs object storage: No
- Needs SMTP: No
- Other: Stable upstream DNS IPs
Observability
- Metrics: DNS query/error counters (if available)
- Logs: DNS and DHCP logs
- Alerts: Loss of upstream resolution; DHCP pool exhaustion
Change Control
- Deployment method: OpenWRT config/UI + git-backed config export
- Rollback plan: Restore last known-good config backup
Notes / Risks
Phase 1 single point of failure for DNS; keep static overrides for Technitium and ingress VIP to enable recovery.
Service Contract: Technitium DNS
Purpose
Authoritative DNS for internal service names, serving LAN/VPN clients and Kubernetes-ingress endpoints; optional recursion or forwarding to OpenWRT.
Exposure
- Category: Internal | VPN-only
- Ingress: Internal
- DNS names: `dns.risu.tech` (internal-only)
Identity
- AuthN: Local admin accounts
- AuthZ: Admin role required for zone changes
- Break-glass account: Yes (stored in password vault)
Data
- Persistence: Persistent (zones/config)
- Data class: Standard
- Estimated storage growth: Minimal
Network
- Allowed source networks: LAN, VPN, cluster nodes
- Egress requirements: Upstream DNS IPs (public or OpenWRT)
Availability
- HA required: Desirable for internal service resolution, but not required for platform bootstrap
- Acceptable downtime: Minutes; recovery path via OpenWRT static overrides
Backup
- Tier: Standard (regular export of zones/config)
- Restore test cadence: After major upgrades or quarterly
Dependencies
- Needs database: No (embedded)
- Needs object storage: No
- Needs SMTP: No
- Other: Stable Service IP/VIP; upstream DNS reachable by IP
Observability
- Metrics: Query rate, NXDOMAIN/servfail counts
- Logs: Query/zone change logs
- Alerts: Service availability; zone integrity errors
Change Control
- Deployment method: Kubernetes (Talos) workload
- Rollback plan: Redeploy previous version and restore last config backup
Notes / Risks
Must avoid DNS self-dependency: configure all upstreams and ExternalDNS endpoints by IP; keep WAN exposure disabled.
Service Contract: ExternalDNS
Purpose
Automate internal DNS records by reconciling annotated Kubernetes resources into Technitium with clear ownership boundaries.
Exposure
- Category: Internal (cluster-only)
- Ingress: Internal
- DNS names: None (API-driven)
Identity
- AuthN: Kubernetes service account
- AuthZ: ClusterRole scoped to read ingress/service resources
- Break-glass account: Not applicable
Data
- Persistence: Ephemeral
- Data class: Standard
- Estimated storage growth: None
Network
- Allowed source networks: Cluster nodes
- Egress requirements: Technitium Service IP/VIP; Kubernetes API
Availability
- HA required: No (automation only)
- Acceptable downtime: Hours; existing records continue to resolve
Backup
- Tier: None (state is declarative via Kubernetes + Technitium registry)
- Restore test cadence: Not required
Dependencies
- Needs database: No
- Needs object storage: No
- Needs SMTP: No
- Other: Stable Technitium IP/VIP; domain filters/ownership registry configured
Observability
- Metrics: Reconciliation success/fail counts
- Logs: Controller logs for record changes
- Alerts: Persistent reconciliation failures
Change Control
- Deployment method: Kubernetes deployment/helm/manifest
- Rollback plan: Revert deployment manifest/helm release
Notes / Risks
Restrict domain filters and ownership to internal hostnames to avoid accidental public zone changes.
Runbooks
Operational runbooks for the homelab platform. Each runbook is designed to be copy-paste friendly and scoped to a single failure or procedure.
Catalog
Runbook: DNS Bootstrap & Recovery (OpenWRT + Technitium + ExternalDNS)
Purpose
Bring up or restore internal DNS while avoiding dependency loops. Applies to split-horizon risu.tech with OpenWRT as bootstrap resolver, Technitium as internal authority, and ExternalDNS for automation.
Preconditions
- OpenWRT reachable with admin access.
- Reserved stable IPs/VIPs for Technitium and internal ingress.
- Access to Kubernetes cluster (Talos) for Technitium/ExternalDNS deployments.
Bootstrap Steps (greenfield or re-seed)
- OpenWRT
  - Ensure DHCP is enabled and advertises itself as DNS.
  - Verify public recursion works using upstream DNS IPs.
- Static overrides on OpenWRT
  - Add host overrides:
    - `dns.risu.tech` → Technitium IP/VIP
    - `ingress-internal.risu.tech` → internal ingress VIP (optional but recommended)
- Deploy Technitium
  - Deploy to the cluster with a stable Service IP/VIP.
  - Configure upstream resolvers by IP (public) or forward recursion to OpenWRT by IP.
  - Keep WAN exposure disabled.
- Conditional forward on OpenWRT
  - Add forward rule: `risu.tech` → Technitium IP/VIP.
- Deploy ExternalDNS
  - Scope with domain filters/ownership registry to internal hostnames only.
  - Set provider endpoint to the Technitium IP/VIP (not hostname).
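The ExternalDNS scoping described above can be sketched as container arguments. Note the assumption: ExternalDNS has no built-in Technitium provider, so this sketch presumes a webhook bridge, and the bridge URL is a placeholder IP (kept as an IP, per the no-self-dependency rule).

```yaml
# Sketch: ExternalDNS container args scoped to internal names only.
args:
  - --source=ingress
  - --source=service
  - --domain-filter=risu.tech        # never touch zones outside the internal domain
  - --registry=txt                   # ownership registry so records can be reclaimed safely
  - --txt-owner-id=homelab-internal
  - --provider=webhook               # assumed webhook bridge to Technitium
  - --webhook-provider-url=http://10.0.0.53:8888   # placeholder bridge IP, not a hostname
```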
Recovery: Technitium Down
- From a LAN/VPN client, confirm public DNS still works via OpenWRT.
- Use OpenWRT static overrides to reach the cluster ingress/UI.
- Restart Technitium workload; restore config/zones if needed.
- Validate conditional forwarding resumes and internal names resolve.
Recovery: ExternalDNS Down
- Confirm Technitium answers existing records.
- Restart ExternalDNS deployment; check logs for reconciliation success.
Recovery: OpenWRT DNS Down
- Clients lose DNS; bring OpenWRT back first (single resolver in Phase 1).
- Verify DHCP/DNS service restores; re-check conditional forward to Technitium.
Verification & Tests
- Power-cycle the cluster with OpenWRT up: public DNS must still resolve.
- Start cluster with Technitium intentionally delayed: control plane reachable via overrides.
- Kill Technitium: public DNS works; internal names fail (expected).
- Kill ExternalDNS: existing internal names resolve; no new records created.
- WAN test: internal-only names do not resolve from cellular; LAN/VPN resolve to internal VIPs.
Notes
- Keep all DNS dependencies by IP to avoid “DNS needs DNS.”
- Once resolver redundancy exists, you may move clients to Technitium directly; update this runbook accordingly.
Architecture Decision Records
This directory contains a historical log of significant architectural decisions made throughout the evolution of the homelab project. Each record details the context, decision, and resulting consequences to provide transparency and rationale for the system’s design.
Records Index
- ADR 0001: Use Codeberg as Public Git Host
- ADR 0002: Record Architecture Decisions
- ADR 0003: Split-Horizon DNS for Unified Naming
- ADR 0004: Documentation Delivery System
- ADR 0005: No Inbound NAT for Internal Services
- ADR 0006: Identity-First Ingress for Service Access
- ADR 0007: Kubernetes with TalosOS
- ADR 0008: Adopt Authentik as Central Identity Provider
- ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary
- ADR 0010: Prefer Perimeter Firewall with Dual Ingress for Exposure
- ADR 0011: ExternalDNS + Technitium for Internal DNS Automation
ADR 0001: Use Codeberg as Public Git Host
Status
Accepted
Context
The homelab project requires a public git repository to host its architecture documentation, infrastructure-as-code (IaC), and potentially public-facing service configurations. This host serves as the “public face” of the project and must align with the project’s values regarding open source, privacy, and community-driven infrastructure.
While a self-hosted instance (e.g., Forgejo/Gitea) will be used for internal management and private code, a reliable public host is needed for:
- Public visibility and collaboration.
- External CI/CD triggers (e.g., for documentation deployment).
- Mirroring and redundancy for critical configurations.
Decision
We will use Codeberg as the primary public git host for the homelab project.
Codeberg is chosen because:
- It is based on Forgejo (a community fork of Gitea), which aligns with our internal management plane preferences.
- It is a non-profit, community-driven platform that prioritizes privacy and freedom.
- It provides a reliable, high-performance environment for hosting public repositories without the commercial baggage of larger platforms.
Consequences
- The
homelabrepository (and associated subprojects) will be maintained on Codeberg. - Automation for documentation deployment (mdBook) will be integrated with Codeberg’s CI/CD (Woodpecker or Forgejo Actions) or triggered by Codeberg webhooks.
- Public contributions and issues will be managed via the Codeberg interface.
- Secret management must be strictly enforced to ensure no private credentials are leaked to the public Codeberg repositories.
ADR 0002: Record Architecture Decisions
Status
Accepted
Context
A formal mechanism is required to document architectural decisions made during the development and evolution of the homelab project. This ensures long-term consistency, provides critical context for future modifications, and facilitates knowledge transfer.
Decision
The project will utilize Architecture Decision Records (ADRs) to document significant architectural choices. These records will be maintained within the doc/src/adr/ directory, following a sequential numbering scheme.
Consequences
- Enhanced Transparency: Provides clear visibility into the reasoning behind key architectural choices.
- Historical Context: Establishes a permanent record of the system’s evolution.
- Sustainable Maintenance: Facilitates easier onboarding and long-term system maintenance by preserving intent.
ADR 0003: Split-Horizon DNS for Unified Naming
Status
Accepted
Context
The project requires a unified naming scheme (*.risu.tech) that functions seamlessly across both public and private services. Key requirements include maintaining strict isolation for private services and providing a frictionless remote access experience that mirrors local network connectivity.
Decision
We will implement a split-horizon DNS architecture:
- Public DNS Authority: Resolves records exclusively for public-facing endpoints.
- Private DNS Authority: Resolves records for internal services and serves as the primary authority for LAN and VPN clients.
- Context-Aware Routing: Ingress controllers will enforce hostname-based routing determined by the traffic’s origin (public vs. private).
Consequences
- Unified User Experience: Users utilize consistent service names regardless of their physical or network location.
- Enhanced Security Profile: Internal service names and metadata are not exposed to public DNS.
- Operational Complexity: Requires the management and synchronization of two distinct sets of DNS records.
ADR 0004: Documentation Delivery System
Status
Delayed (time constraints on CI runners prevent cargo from compiling the dependencies; a workaround is needed)
Context
Infrastructure documentation must be easily accessible to all authorized users and updated automatically to reflect the current state of the repository. The documentation is authored in Markdown and managed by mdBook. We need a robust pipeline to build and deliver this documentation to a private (internal server) destination.
Decision
We will implement an automated documentation delivery system with the following components:
- Source of Truth: The `homelab` repository on Codeberg.
- Build Engine: Forgejo Actions (using Forgejo Runners), triggered on pushes to the `main` branch (specifically for changes within the `doc/` directory) or via manual trigger (`workflow_dispatch`).
- Single-Target Delivery:
  - Private: Automated deployment to an internal server at `/var/www/doc` via SSH/rsync for local access.
- Security: SSH-based deployment will use a dedicated, restricted user and an SSH key stored as a secret in the CI environment.
- Serving: Nginx will be used to serve the static HTML output on the internal server.
Consequences
- Automated Consistency: Documentation is guaranteed to be up-to-date with the repository’s `main` branch.
- Reduced Complexity: Focusing on a single, internal delivery target simplifies the pipeline and avoids dependency on external “best-effort” services.
- Standardized Process: Leverages Forgejo Actions, providing compatibility with GitHub Actions-style workflows and existing Runner infrastructure.
- Secret Management: Requires careful handling of SSH keys within the CI platform.
ADR 0005: No Inbound NAT for Internal Services
Status
Accepted
Context
The platform hosts both public and internal services. Internal services must never be internet-routable to preserve a strong trust boundary. The architecture already assumes split-horizon DNS and internal ingress controls, but the routing posture must be explicit and enforceable.
Decision
There will be no inbound NAT or port-forwarding from the internet to internal service IPs. All internal services are reachable only from LAN or VPN networks through the internal ingress.
Consequences
- Internet-originated traffic can never reach internal services directly.
- Public exposure is limited to explicitly designated public services via the public ingress.
- Network policies and firewall rules must reflect the absence of inbound NAT.
ADR 0006: Identity-First Ingress for Service Access
Status
Accepted
Context
The platform exposes services to multiple audiences (public, internal, VPN-only, management). To enforce consistent access control and auditing, authentication should be centralized and uniform rather than implemented independently by each service.
Decision
All services must be fronted by an ingress layer that enforces identity at the platform level. Services must integrate with the platform Identity Provider via SSO (OIDC/SAML) or trusted auth proxy headers, with MFA required for public and management access.
Consequences
- Services must not expose unauthenticated endpoints unless explicitly approved in a Service Contract.
- The ingress layer becomes a critical security control that must be monitored and hardened.
- Service onboarding requires identity integration as a first-class step.
ADR 0007: Kubernetes with TalosOS
Status
Accepted
Context
The homelab platform targets a multi-node server environment with room for future capability expansion (for example, optional non-default plugins). K3s was considered, but its optimization for edge/IoT and bundled defaults are less aligned with the desired flexibility. Nomad was also evaluated for its simplicity and support for both containerized and non-containerized workloads. In this environment, infrastructure-as-code and an immutable OS reduce Nomad’s operational advantages, and non-containerized workloads are unlikely.
Decision
Adopt a full Kubernetes stack running on TalosOS as the base orchestration platform.
Consequences
- Ecosystem Flexibility: Kubernetes provides a broad ecosystem, extension points, and standard service discovery and load-balancing patterns.
- Operational Model: TalosOS delivers an immutable, API-managed Kubernetes host OS and supports extensions and secure networking (for example, KubeSpan).
- Complexity Trade-off: Operational complexity is higher than Nomad in isolation, but is mitigated by IaC and TalosOS automation.
- Workload Standardization: Workloads will be standardized on containers unless a future ADR explicitly permits exceptions.
ADR 0008: Adopt Authentik as Central Identity Provider
Status
Accepted
Context
The platform needs a centralized identity and access solution that:
- Supports SSO and MFA.
- Protects both modern apps (OIDC/SAML) and legacy apps without federation support.
- Integrates cleanly with the Edge/Boundary reverse proxy and internal DNS.
- Is reproducible and manageable as code in a self-hosted environment.
Candidates included Authentik, Authelia, Zitadel, and Keycloak. The key differentiator is robust proxy-based enforcement combined with standards-based federation in a single system.
Decision
Adopt Authentik as the platform’s central IdP and access control system:
- Use OIDC/SAML for apps that natively support federation.
- Use Authentik proxy/outposts to protect web apps without OIDC/SAML.
- Enforce MFA via Authentik policies/flows, with step-up where appropriate.
Consequences
- Centralized Access: Consistent login/MFA experience across nearly all services.
- Coverage for Legacy Apps: Proxy enforcement reduces per-app auth workarounds.
- Critical Dependency: Authentik downtime can block access to protected services; monitoring and break-glass access are required.
- Operational Discipline: Flows, policies, and outposts require configuration-as-code to avoid drift.
- Core Platform Service: Authentik becomes a core platform service and must meet backup/restore and upgrade standards.
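The operational-discipline consequence maps naturally onto Authentik's declarative blueprints. A rough sketch of registering an OIDC provider and application as code follows; model names and fields should be verified against the Authentik blueprint schema, and every identifier here is illustrative:

```yaml
# Illustrative Authentik blueprint: one OIDC provider plus its application.
version: 1
metadata:
  name: example-app
entries:
  - model: authentik_providers_oauth2.oauth2provider
    identifiers:
      name: example-app
    attrs:
      client_type: confidential
      redirect_uris: https://app.example.internal/oauth/callback
  - model: authentik_core.application
    identifiers:
      slug: example-app
    attrs:
      name: Example App
      provider: !Find [authentik_providers_oauth2.oauth2provider, [name, example-app]]
```

Keeping flows, providers, and outposts in blueprints like this lets the Git repository remain the source of truth for identity configuration, per the platform invariants.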
Alternatives Considered
- Keycloak + oauth2-proxy: Mature IdP, but requires additional gateway components.
- Authelia: Strong proxy gate, weaker as a full IdP with rich flows.
- Zitadel: Modern OIDC UX, proxy protection is not a core feature.
ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary
Status
Accepted
Context
The network previously had both the ISP gateway and OpenWRT serving DHCP on the same subnet. This created an ambiguous boundary and undermined consistent policy enforcement at the edge.
Decision Drivers
- Avoid non-deterministic gateway assignment and client routing.
- Ensure consistent DNS behavior to support split-horizon.
- Prepare for future HA/VIP routing patterns without conflicting DHCP sources.
- Maintain a clear, singular security boundary for policy enforcement.
Decision
- Place the ISP router/modem into bridge mode.
- Make OpenWRT the sole DHCP and NAT authority for the subnet.
- Keep IPMI disconnected by default due to port exhaustion and power constraints; connect only when needed.
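With the ISP device in bridge mode, OpenWRT's `/etc/config/dhcp` becomes the only DHCP source on the segment. A minimal sketch with illustrative pool values:

```
# /etc/config/dhcp on OpenWRT (illustrative values)
config dnsmasq
	option domainneeded '1'
	option local '/lan/'
	option domain 'lan'
	option authoritative '1'   # sole DHCP authority on this segment

config dhcp 'lan'
	option interface 'lan'
	option start '100'         # lease pool start offset
	option limit '150'         # pool size
	option leasetime '12h'
```

The `authoritative` flag is appropriate precisely because ADR 0009 guarantees no competing DHCP server exists.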
Consequences
- Single Boundary: A single NAT/DHCP boundary improves policy enforcement and troubleshooting.
- Predictable Clients: Gateway and DNS assignment become deterministic.
- Future Migration: Simplifies future migration to a dedicated firewall or HA topology.
- Operational Trade-off: IPMI access is on-demand rather than always available.
ADR 0010: Prefer Perimeter Firewall with Dual Ingress for Exposure
Status
Accepted
Context
Three exposure stacks were evaluated:
- Model A — Perimeter firewall (OpenWRT now, upgradable later) owns routing/NAT; Kubernetes hosts two ingress controllers (internal-only and public).
- Model B — Kubernetes-native edge using Gateway API with CNI-integrated data plane (e.g., Cilium) to terminate edge traffic directly on the cluster.
- Model C — Cloud tunnel/overlay (e.g., Cloudflare Tunnel, Tailscale Funnel) to expose services without direct inbound paths.
The homelab prioritizes a clear internal/public boundary, minimal external dependencies, and the ability to swap in a dedicated firewall when hardware/power constraints ease. Existing OpenWRT already acts as the single boundary (see ADR 0009), and split-horizon DNS is assumed (ADR 0003). Identity-first ingress is required for user-facing access (ADR 0006).
Decision Drivers
- Preserve a single, enforceable perimeter where north-south policy and logging live.
- Keep internal ingress paths isolated from public ingress while supporting split-horizon DNS.
- Allow future replacement of OpenWRT with a dedicated firewall without re-architecting cluster ingress.
- Avoid new external dependencies for routine access; tolerate them only as scoped exceptions.
- Fit power/port constraints and current hardware while enabling later VLAN/DMZ phases.
Considered Options
Model A — Perimeter Firewall + Dual Ingress
- Pros: Clear boundary; firewall enforces 80/443 exposure; ingress controllers stay inside the cluster; works with current OpenWRT and future firewall/DMZ; keeps routing off the control plane.
- Cons: Requires hairpin/port-forward rules and VIP management; firewall must forward to cluster nodes.
Model B — Kubernetes-Native Edge (Gateway API + CNI data plane)
- Pros: Uniform policy definition inside K8s; fewer port-forwards; rich L7 features.
- Cons: Pushes the trust boundary into the cluster; cluster health becomes prerequisite for edge routing; complicates future dedicated firewall insertion; higher operational complexity today.
Model C — Cloud Tunnel / Overlay Exposure
- Pros: Quick public exposure; hides home IP; minimal edge config.
- Cons: Adds third-party dependency and opaque failure modes; blurs boundary and bypasses local policy/logging; harder to reason about internal vs. public reachability.
Decision
Adopt Model A (Perimeter firewall + dual ingress):
- Keep routing/NAT/policy on the perimeter firewall (OpenWRT now; replaceable with a dedicated firewall later) and continue to expose only the minimal ports (80/443) required for public ingress.
- Run two ingress controllers in the cluster:
- Internal Ingress: LAN/VPN-only, resolves via split-horizon DNS to an internal VIP.
- Public Ingress: Receives only firewall-forwarded 80/443 traffic to a public VIP; backs the small set of intentionally exposed hostnames.
- Use identity-first auth at ingress per ADR 0006; no generic port-forwarding to services.
- Allow cloud tunnels only as scoped, documented exceptions (e.g., break-glass outbound-only tunnels) with explicit change control.
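One way to realize the dual-ingress split is two ingress controller deployments distinguished by IngressClass, each pinned to its own VIP. The sketch below assumes MetalLB for VIP assignment; class names, namespaces, and addresses are placeholders:

```yaml
# Two IngressClasses backed by separate controller deployments (illustrative).
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: internal
spec:
  controller: k8s.io/ingress-nginx-internal
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: public
spec:
  controller: k8s.io/ingress-nginx-public
---
# Public controller's Service pinned to the VIP the firewall forwards 80/443 to.
apiVersion: v1
kind: Service
metadata:
  name: ingress-public
  namespace: ingress
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.1.240   # hypothetical public VIP
spec:
  type: LoadBalancer
  selector:
    app: ingress-nginx-public
  ports:
    - name: https
      port: 443
      targetPort: 443
```

Workloads then opt into exposure explicitly by setting `ingressClassName`, which keeps internal services unreachable from the public path by default.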
Consequences
- Boundary Clarity: North-south enforcement, logging, and DDoS controls stay at the perimeter; internal ingress remains shielded from the internet.
- Upgrade Path: A future dedicated firewall or DMZ VLAN can replace OpenWRT without reworking cluster ingress (aligns with the Network Evolution Plan).
- Operational Simplicity: Fewer moving parts at the edge; ingress lifecycle stays inside Kubernetes, where certificates and auth already live.
- Constraints-Friendly: Works within current power/port limits; no requirement to run edge data plane on K8s nodes.
- Risk: Firewall misconfiguration could still overexpose services; disciplined VIP/reservation management and monitoring of port-forwards are required.
Implementation Notes / Next Steps
- Reserve VIPs for internal/public ingress in the SERVER/DMZ ranges defined in the Network Evolution Plan.
- Maintain firewall rules: 80/443 to public ingress VIP only; no generic NAT for internal services (per ADR 0005).
- Keep split-horizon DNS records aligned with the two ingress VIPs.
- Document any exception tunnels with owners, scope, and teardown criteria.
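The "80/443 to the public ingress VIP only" rule can be expressed as the sole inbound redirects in OpenWRT's firewall configuration. The VIP below is a placeholder matching no real deployment:

```
# /etc/config/firewall on OpenWRT: the only permitted inbound NAT rules.
config redirect
	option name 'public-ingress-http'
	option target 'DNAT'
	option src 'wan'
	option src_dport '80'
	option dest 'lan'
	option dest_ip '192.168.1.240'   # hypothetical public ingress VIP
	option dest_port '80'

config redirect
	option name 'public-ingress-https'
	option target 'DNAT'
	option src 'wan'
	option src_dport '443'
	option dest 'lan'
	option dest_ip '192.168.1.240'
	option dest_port '443'
```

Any additional `config redirect` stanza appearing here is, by ADR 0005, a policy violation and should be flagged in review.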
Related Decisions
- ADR 0003: Split-Horizon DNS for Unified Naming
- ADR 0005: No Inbound NAT for Internal Services
- ADR 0006: Identity-First Ingress for Service Access
- ADR 0009: Eliminate Dual DHCP and Establish a Single Boundary
ADR 0011: ExternalDNS + Technitium for Internal DNS Automation
Status
Accepted
Context
Internal DNS needs to provide LAN/VPN-only resolution for service hostnames while remaining automatable from Kubernetes. The solution must avoid bootstrap dependency loops (DNS needing DNS) and keep public DNS management separate from internal records.
Decision
Adopt Technitium as the internal authoritative DNS service and use ExternalDNS to reconcile annotated Kubernetes resources into Technitium. Keep OpenWRT as the client-facing bootstrap resolver, providing public recursion and conditional forwarding to Technitium with minimal static overrides for recovery.
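Technitium accepts RFC 2136 dynamic updates, so one plausible wiring is ExternalDNS's `rfc2136` provider with a tightly scoped domain filter. The flags are real ExternalDNS options, but the zone, addresses, and key names below are illustrative assumptions:

```yaml
# Illustrative ExternalDNS container args targeting Technitium via RFC 2136.
args:
  - --source=ingress
  - --provider=rfc2136
  - --rfc2136-host=10.0.0.53          # Technitium, addressed by IP to avoid DNS loops
  - --rfc2136-port=53
  - --rfc2136-zone=example.internal
  - --rfc2136-tsig-keyname=externaldns
  - --rfc2136-tsig-secret-alg=hmac-sha256
  - --rfc2136-tsig-secret=$(TSIG_SECRET)
  - --domain-filter=example.internal  # guardrail: reconcile only the internal zone
  - --registry=txt
  - --txt-owner-id=homelab
```

The `--domain-filter` and TXT-registry ownership markers are the guardrails that keep ExternalDNS from touching records outside its scope.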
Consequences
- Enables automated, authoritative internal DNS with clear ownership boundaries.
- Avoids DNS dependency loops by using IP-based upstreams and keeping clients pointed at OpenWRT.
- Increases operational complexity compared to static DNS; requires guardrails for split-horizon risks and tight scoping of ExternalDNS domain filters.