
# Managing a homelab's LLM API keys: easy in LiteLLM, hard to hunt down

> [!tip] TLDR
> **The why.** I had one OpenRouter key labeled "Mac" doing double duty across a five-machine homelab. It was the gateway's upstream key and the key behind a dozen direct callers. Every gateway call authenticated with one shared master key, so my spend logs showed `user_api_key_alias = null` on every row. I could not tell which workload spent what. Before I could rotate or revoke anything I had to find every place that one key lived.
>
> **The shape.** A LiteLLM gateway runs as one front door on the DGX Spark. Every workload gets its own capped virtual key minted with a single curl. Provider secrets live once in the gateway environment; every consumer (squad agents, the investing agent, Claude Code shell launchers, hotkey scripts) holds only its own virtual key and points its `base_url` at the gateway. Spend lands in a Postgres `LiteLLM_SpendLogs` table keyed by the virtual key's alias.
>
> **The hard part.** Not the gateway. The gateway was a weekend of easy wins. The work was archaeological: one shared key turned out to be deployed in six places across five hosts, config files lied about whether spend was even being logged, a flat-fee subscription cost center was invisible to the spend audit entirely, and a $111 spike that looked like a leak was a legitimate one-time reindex. You cannot safely revoke a key until you have mapped every caller, and the map is hard to build.
>
> **Reproduction prompt for Claude Code:**
>
> > Help me put a LiteLLM proxy gateway in front of every LLM call in my homelab so I get per-workload spend attribution. Run LiteLLM as a systemd service on one host with a Postgres backend so `LiteLLM_SpendLogs` persists. Keep all provider API keys (OpenRouter, Anthropic, OpenAI, Gemini) only in the gateway's environment file (root-owned, 0600), never on the consumer machines. Mint one virtual key per workload with `POST /key/generate` setting `key_alias`, `max_budget`, and `budget_duration`, and store each in my password manager. Point every consumer's `base_url` at the gateway and its `api_key` at that consumer's virtual key. Then write me a nightly audit that joins `SpendLogs.api_key` to the verification token's `key_alias` so I can see spend broken down by workload and model. Important: read the gateway's live process environment, not just the YAML, because env vars override the config file.

Routing every LLM call through one gateway and attributing the spend turned out to be the easy part. A virtual key is one curl. A model route is one line of YAML. Attribution falls out into a Postgres table for free. The genuinely hard part was the hunt that came first: finding every caller of one overloaded API key, scattered across five machines, before I could safely revoke it.

## The mess before

I run LLM workloads across a Mac M4 Max, a DGX Spark, a mac-mini, a Raspberry Pi, and two Hetzner VPS. By the time I sat down to clean this up, the credential situation had drifted into a state I would have flagged in any code review.

One OpenRouter key, labeled "Mac" in the dashboard, was doing two jobs at once. It was the gateway's upstream key, and it was also the raw key behind a dozen direct callers on various hosts. On top of that, nine separate direct provider keys (OpenRouter, Anthropic, OpenAI, Gemini, Perplexity, xAI, MiniMax, Kimi, z.ai) were sprinkled across the fleet wherever some script or agent had needed one.

The attribution was worse than the sprawl. Every call through the gateway authenticated with a single shared master key. My spend logs had 190,000 rows going back months, and every single one showed `user_api_key_alias = null`. Per-workload attribution was zero. The best I could do was guess from the requester IP and the User-Agent string, which is exactly as useful as it sounds when half your traffic comes from localhost.

## The easy part: the gateway

The fix in principle is a [LiteLLM](https://docs.litellm.ai/) proxy. One front door on the DGX. Every consumer points its OpenAI-compatible `base_url` at the gateway and authenticates with a virtual key instead of a raw provider key.

Minting a virtual key per workload is a single request:

```bash
# Quote the URL so the shell does not read the <...> placeholder as a redirect.
curl "http://<your-tailnet-ip>:4000/key/generate" \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key_alias":"vk-hermes","max_budget":5,"budget_duration":"30d"}'
```

The `key_alias` is the whole point. It becomes the attribution axis. In the spend table, `SpendLogs.api_key` joins to the verification token, and the token's `key_alias` names the consumer on every row.

Adding a new model is one entry in a YAML keyed by `model_name`:

```yaml
model_list:
  - model_name: gemini-flash-lite-latest
    litellm_params:
      model: gemini/gemini-2.5-flash-lite
      api_key: os.environ/GEMINI_API_KEY
```

Provider secrets live exactly once, in the gateway's environment file (root-owned, mode 0600). Every consumer, whether it is a squad agent, the investing agent, a Claude Code shell launcher, or a hotkey script, holds only its own capped virtual key. None of the boxes hold a provider secret anymore. Rotation happens in one place. Caps are per-workload. Attribution is automatic.

That part took a weekend, and most of the weekend was reading docs.

## The hard part: the archaeology

Then came the real work, which the gateway docs do not warn you about.

You cannot revoke the old shared "Mac" key until you know every place it lives, because the moment you revoke it, everything still using it breaks at once. So the first job is not building, it is finding.

When I finished mapping, that one key turned out to be deployed in six places across five hosts: the gateway's own environment, one squad agent's `secrets.env` plus a duplicate `.env` on the same box, three more squad agents' `secrets.env` files, and the investing agent resolving it from a secret store at runtime.
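
To confirm that several files hold the same secret without ever printing it, one option is to compare truncated SHA-256 fingerprints. The sketch below is a minimal illustration, not the exact procedure used here: the location labels and key values are hypothetical placeholders, and the 12-character truncation is an arbitrary choice that keeps output readable while still making accidental collisions very unlikely.

```python
import hashlib


def fingerprint(secret: str) -> str:
    """Return a short SHA-256 fingerprint that is safe to print or log."""
    # Strip whitespace so a trailing newline in one file does not hide a match.
    digest = hashlib.sha256(secret.strip().encode("utf-8")).hexdigest()
    return digest[:12]


def group_by_fingerprint(found: dict[str, str]) -> dict[str, list[str]]:
    """Map each fingerprint to the sorted locations that hold that secret."""
    groups: dict[str, list[str]] = {}
    for location, secret in found.items():
        groups.setdefault(fingerprint(secret), []).append(location)
    return {fp: sorted(locations) for fp, locations in groups.items()}


if __name__ == "__main__":
    # Hypothetical locations and dummy values. In practice each value would be
    # read from the file or secret store on its own host and never echoed.
    found = {
        "dgx:gateway.env": "sk-or-dummy-shared",
        "squad-1:secrets.env": "sk-or-dummy-shared\n",
        "squad-2:secrets.env": "sk-or-dummy-other",
    }
    for fp, locations in group_by_fingerprint(found).items():
        print(fp, locations)
```

Any fingerprint that maps to more than one location is a shared key, and only the fingerprints ever need to leave the host.
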
I fingerprinted each by sha256 (never dumping the value) to prove they were the same key and not coincidentally similar ones.

Building that map was the hard part. Several gotchas made it harder.

### Config files lie. The running process is ground truth.

The gateway YAML said `database_url: null` and `disable_spend_logs: true`. Read literally, that means the gateway logs nothing. Reading the live process environment told a different story: `DATABASE_URL` was set, pointing at a Postgres container, and the table already had 190,000 rows. The env vars were overriding the YAML, and the YAML was stale fiction. I had even carried a note in my own working memory that said "this gateway has no database," which I had to correct on the spot.

> [!warning] Verify the live process, not the config file
> Before you assert what a service does, read its live environment (`/proc/<pid>/environ` on Linux), not the config file someone last edited. A service started before the most recent config change will not reflect that change, and environment variables silently override the file. This single habit caught two separate wrong conclusions in this project.

### That same flag had silently broken logging for 34 hours

The `disable_spend_logs: true` line was not just stale. At some point a restart had picked it up, and the gateway had stopped writing spend rows for about 34 hours before I noticed. The dashboard looked green the whole time. A gateway that logs nothing looks identical to a gateway with no traffic. The only tell was that the newest row was a day and a half old, which you only see if you go looking.

### Subscription-backed providers are invisible to a spend audit

This one reframed the whole exercise. My squad agents mostly run on flat-fee coding subscriptions (GLM via z.ai, Kimi, MiniMax). Those calls never touch the metered OpenRouter bill.
An audit that only looks at OpenRouter spend completely misses the biggest cost center, because the biggest cost center is a fixed monthly fee that does not produce per-call rows anywhere.

The lesson: a spend audit measures metered spend, not total spend. Know which of your providers bill per token and which bill per month, or you will optimize the wrong thing.

### A $111 spike that looked like a leak

The 30-day metered spend was $118.60. Of that, $111 was a single two-day spike. My first instinct was a runaway loop or a leaked key. The suspected culprit, Perplexity's sonar-pro, turned out to have cost $0.60 in the same window.

The real source was a legitimate one-time vault vector-search reindex: gemini-flash-lite over roughly 545K-token contexts with tiny outputs, split across the DGX and the Mac. Entirely expected work, completely benign, but nothing flagged it in real time. The baseline is about $0.30 a day, so a $60-a-day spike should have paged me and did not. That gap became its own follow-up: a nightly anomaly detector.

### Network ACL, not firewall

When I went to migrate the squad nodes, the gateway port timed out from those hosts. The squad machines are Tailscale tagged devices. Ping to the DGX worked fine. `ufw` allowed the tailnet. But TCP to the gateway port hung. The block was a Tailscale ACL denying tagged nodes access to that port, not a host firewall.

**Ping passing while a TCP port times out is a tell worth memorizing: once the host firewall checks out, that pattern points at an ACL denial, not a dead service, which would refuse the connection instead of hanging.**

## The discipline that prevents an outage

The thing that keeps you from taking down the whole fleet is not pinning every last caller before you start. It is a simple rule: never revoke a key that still shows traffic.

The safe sequence:

1. Mint the new named virtual keys.
2. Deploy them alongside the old shared key, one host at a time.
3. Watch the old key's "last used" timestamp on the dashboard.
4. Migrate hosts until that timestamp goes quiet and stays quiet.
5. Only then revoke the old key.

Per-consumer named keys turn the dashboard into a live map of who has not migrated yet. Every key still showing recent traffic is a host you have not finished moving. The migration becomes a checklist you can read off the spend logs instead of a leap of faith.

## The coda: my own shell launchers

The last stragglers were mine. I have a set of Claude Code shell launchers, aliases that flip Claude Code onto GLM, Kimi, MiniMax, or a local Qwen through the gateway. They had all been sharing the master key.

I moved each onto a per-host virtual key (`vk-claude-code-mac`, `-dgx`, `-mini`), selected at runtime by hostname inside the one canonical shell-rc block that syncs across machines, with a graceful fallback to the master key on any host that does not match. Now the spend logs name which machine drove a non-default Claude Code session, and the model column still tells me which provider mode it was in.

One last gotcha closed the loop. That shell-rc block is synced to the DGX, which has no zsh installed. Running a whole-file `bash -n` syntax check on it reports false errors on zsh-only syntax like `setopt`. The fix is to validate the edited block in isolation rather than syntax-checking the entire synced file under the wrong shell.

## End state

Where it landed:

- About 13 capped virtual keys, one per workload.
- Two OpenRouter keys (a capped gateway upstream and a tiny direct fallback) instead of one sprawling shared key.
- A nightly spend digest that attributes by virtual key and model, and watches the old key for stragglers.
- Every box holding only its own key, no provider secrets on the consumer machines.

## The takeaway

If you run more than one machine and more than one model provider, put a gateway in front before you have ten keys in twelve places. The routing and the attribution are trivial. The migration cost is entirely in the archaeology you are deferring by waiting.
Build the gateway first, while there are three callers to move instead of thirteen. The weekend of easy wins is always available. The hunt only gets longer.
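
A footnote on the $111 spike: a nightly anomaly check of the kind mentioned there does not need to be elaborate. The sketch below is one possible starting point, not the detector actually running in this homelab. It assumes daily totals have already been aggregated from the spend table into a plain dict, and the `factor` and `floor` thresholds are illustrative guesses. It compares each day to the median rather than the mean, so the spike itself cannot drag the baseline up.

```python
from statistics import median


def find_spikes(daily_spend: dict[str, float], factor: float = 10.0,
                floor: float = 1.0) -> list[str]:
    """Return the days whose spend exceeds both `floor` and `factor` times the median day."""
    if not daily_spend:
        return []
    baseline = median(daily_spend.values())
    # The floor keeps a near-zero baseline from flagging trivial amounts.
    threshold = max(floor, factor * baseline)
    return sorted(day for day, usd in daily_spend.items() if usd > threshold)


if __name__ == "__main__":
    # Hypothetical 30-day window shaped like the one in this post:
    # roughly $0.30 a day, plus a two-day reindex spike.
    spend = {f"day-{i:02d}": 0.30 for i in range(28)}
    spend["day-28"] = 55.50
    spend["day-29"] = 55.50
    print(find_spikes(spend))
```

A check like this cannot tell a benign reindex from a leaked key. It only makes sure someone looks the same night instead of weeks later.
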