The Underdog Stack: How Luna Runs on Free Inference

The Donna Device from Suits

If you’ve watched Suits, you already get it. (incase you’re uncultured)

I wanted that. Something omnipresent, anticipatory, works across every context. I called it Luna instead of The Donna because the vision was never just one interface. A shortcut, a WhatsApp message, a Slack command, an n8n automation. Same brain, different doors.

The last post covered the front door: an iOS shortcut on my lock screen, one tap, voice memo fired into a backend. This post is about what is behind that door, what it actually runs on, and why I am rebuilding it before adding anything new.

The Architecture (Right Now)

Luna AI Graph as Mermaid Chart

Luna is a LangGraph multi-agent system. Every message comes in, gets loaded with chat history from Redis, and hits a router. The router classifies intent and fires to one of four agents:

Notion agent: reads and writes to my workspace. Tasks, projects, fitness logs, notes.
Calendar agent (Donna): manages Google Calendar. Scheduling, busy slots, edits, removals.
Fitness tracker (Rocky): dedicated logging agent for workouts and activity.
General agent: everything else. Q&A, quick lookups, conversations.

Each agent runs its tools, returns a response, and the result gets pushed back to Redis with the updated conversation history. Clean, stateless agents. Stateful conversations.

Langsmith Dashboard for Tools

The observability and feedback loops are taken care of via Langsmith (more about this, in a later post).

That is the current version. It works. And it has one problem.

Where It Breaks

Say you tell Luna: “Add a task to check out Wan2.2 and block two hours for it on Friday, after work.”

One sentence. Two agents. Right now the router picks one and sends it there. The Notion agent creates the task. The calendar block never happens. Or vice versa.

The routing is a single match statement. One task, one destination. No mechanism for agents to talk to each other, no fan-out for overlapping intent, no synthesis layer for dual-delegation.

Usage data made this obvious. It kept showing up in the logs. Luna v2 fixes this: the router identifies multi-agent tasks, dispatches to multiple nodes in parallel, and synthesizes a single response. LangGraph supports it. The current graph just does not use it yet.

The Underdog Stack

Luna runs on a repurposed college laptop. Ubuntu, homelab, Cloudflare Tunnel. Infrastructure cost: electricity. API cost: close to zero.

Groq runs three jobs. Whisper handles transcription. The router uses gpt-oss-20b with structured output and strict mode for fast, reliable intent classification. The lightweight agents (general Q&A, calendar, fitness) also run on gpt-oss-120b. Fast enough to feel instant. Free tier covers everything at personal usage volumes.

GLM-4.5 Air by Z-AI handles the heavy Notion agent. Ten tools, full datasource context, real read-write operations against my workspace. Purpose-built for agentic applications, MoE architecture, 131K context window, and a thinking/non-thinking toggle depending on whether the flow needs reasoning or just speed. Reasoning is explicitly turned off in the Notion agent config. For tool-calling flows you want decisiveness, not deliberation. A Chinese lab’s open-source model is running the most critical part of this system for free. That is worth saying out loud.

Before GLM-4.5 Air, this slot was Step 3.5 Flash by StepFun (really slept on, in my opinion). 196 billion total parameters with only 11 billion active per token via sparse MoE. Multi-Token Prediction generating 4 tokens per forward pass, hitting 100 to 300 tokens per second in typical usage. It reasoned like a large model and moved like a small one. The free tier on OpenRouter dried up last month. GLM stepped in and has not missed a beat.

Both of these models come from labs that do not get the coverage they deserve. If you are building anything agentic and have not looked at either of them, you should.

Four OpenRouter API keys rotate automatically via a KeyRotator. When one hits a rate limit, the next one picks up. Rate limits per key become irrelevant at personal usage volumes.

Gemini sits at the bottom via ModelFallbackMiddleware. GLM fails, Gemini catches it. Gemini fails, Groq catches it. Three layers of free inference before a single rupee gets spent.

This is not cheapness. It is a deliberate tiered inference strategy. Fast model for routing, capable free model for heavy tool use, two fallback layers below it.

We call that the finest form of Jugaad

What Is Next

Luna v2 is the immediate priority: rewire tools, multi-agent dispatch, inter-agent communication, proper handling of tasks that span multiple domains.

After that: a WhatsApp interface already in progress, then Slack, then the n8n automation layer expanding significantly. Same brain. More doors.

The shortcut was always the starting point. The architecture has to be built for omnipresence before any new interface gets added.

The brain comes first. The ears come later.

The Architecture (Right Now)

Where It Breaks

The Underdog Stack

What Is Next

Linked mentions