mneme-semantic-recall - AIXplore

# Mneme: Semantic Recall for Your Claude Code Sessions > [!tip] TLDR > **The why.** Claude Code writes every session to a JSONL file under `~/.claude/projects/`. After a few months you have hundreds, and `grep` doesn't help because you remember what you were doing, not the words you used. The history is on disk and unreachable. > > **The shape.** Mneme indexes the JSONL into LanceDB, exposes four search modes (vector, fts, hybrid, rerank), runs entirely on your laptop. CLI or Claude Code skill. No API keys, no daemon, no cloud round trip. On a real index of 9,145 sessions / 125,062 chunks, rerank mode hits Recall@5 of 0.926 and MRR of 0.864. Repo: [github.com/BioInfo/mneme](https://github.com/BioInfo/mneme). > > **The hard part.** Not the indexer. Trusting the result. The first overnight A/B I ran swapped in `BAAI/bge-m3` (four times the params, 1024-dim, multilingual), and instinct said it would win. The eval phase blew past 36 GB of RAM and panicked the laptop kernel. The rerun on a bigger box took ten hours and said: bge-m3 ties rerank recall and loses MRR by 0.018, loses vector-only MRR by 0.151. The bigger model bought nothing at the top of the ladder and was worse everywhere else. Without the harness I'd have switched on instinct and paid for it forever. The tool isn't the search, it's the discipline to measure it. > > **Reproduction prompt for Claude Code:** > > > Build me a small Python tool that ingests Claude Code session JSONL from `~/.claude/projects/`, chunks the conversational text (skip thinking blocks and tool-result noise, 20 to 2000 chars per chunk), and indexes the chunks into LanceDB with both dense vectors and an FTS index. Embedder default `nomic-embed-text-v1.5`, reranker default `BAAI/bge-reranker-v2-m3`, both via sentence-transformers. Four search modes behind one CLI: `index`, `search`, `stats`. Wrap it as a Claude Code skill so I can ask in-session and the model calls search. Ship an eval harness with Recall@k and MRR so changes can be judged by numbers, including an A/B mode that builds an isolated index for a candidate embedder and prints the delta against the baseline. ## The problem `~/.claude/projects/` is one of the most underused folders on every Claude Code user's machine. Every session you've ever run lives there as a JSONL file: the conversation, the code, the dead ends, the decisions. Months of work, byte-for-byte, sitting right there. Six months later, you go looking for that one thing you remember solving. You can't find it. Keyword search doesn't fix this. `grep "litellm"` returns four hundred files because that string shows up offhand in dozens of unrelated sessions. The thing you're looking for is *the time you traced 401s from a direct-API route through the gateway*, not *any session that contains the substring litellm*. Conversational corpora resist lexical search by nature: people don't restate the title of their work mid-sentence, they tell each other what they're doing in five different framings, and the framing you'll later remember is rarely the framing on disk. That's a vector search problem. It should be local, because nobody should upload their engineering history to a third-party API. It should be cheap enough to call mid-conversation. And it should live next to Claude Code itself so a lookup doesn't cost a context switch. Same instinct as [[Practical Applications/debugging-million-token-meeting-notes|debugging-million-token-meeting-notes]], pointed at a different corpus. That's mneme. ## What it is A single-user local tool. The embedder runs on your laptop (`nomic-embed-text-v1.5`, 768-dim, MPS or CUDA or CPU). The store runs on your laptop (LanceDB, file-backed, no process). The reranker runs on your laptop (`bge-reranker-v2-m3`, downloads once on first use). Nothing leaves the machine. You index when you want. First index pulls weights and chews through your whole `~/.claude/projects/`. Every run after that is incremental, only the files whose mtime changed get re-embedded. You search from a CLI, or you wire it into Claude Code as a skill so the model can call it for you when you ask about past work. ## How it works ```mermaid flowchart TD A["Session JSONL files <code>~/.claude/projects/</code>"] --> B["Parser extract message text, skip tool-result noise + thinking blocks"] B --> C["Chunker 20–2000 char chunks, attach session metadata"] C --> D["Embedder nomic-embed-text-v1.5 (768-dim)"] D --> E["LanceDB local vector store + FTS index"] E --> F["Search ladder"] F --> V[vector] V --> T[fts] T --> H[hybrid] H --> R[rerank] ``` The four modes, in order of cost and quality: - **`vector`**: dense cosine. Cheapest. Use when you remember meaning and not words. - **`fts`**: BM25. Use when you remember an exact term. - **`hybrid`**: dense + sparse fused. The default I'd reach for. - **`rerank`**: hybrid candidates rescored by a cross-encoder. Slowest, best. The difference between "in the top 5" and "at position 1." If you're only running one mode, run rerank. The added latency is a couple of seconds and a one-time model download. Worth it. ## How I use it The flow that's stuck for me is the skill-wrapped one (the same Claude Code skill pattern covered in [[Practical Applications/claude-skills-vs-mcp-servers|claude-skills-vs-mcp-servers]]). Mneme registers as a Claude Code skill whose description names the trigger phrases ("what did we work on with X", "find the session where Y", "remind me about that thing with Z"). When I drop one of those mid-conversation, Claude calls the skill, the skill shells out to `mneme.cli search`, the top sessions come back inline. I don't leave the conversation to look something up. The CLI is there when I want a wider view: ```bash ./venv/bin/python -m mneme.cli search "that database migration we did" --sessions 10 ``` Last week I used it to pull back a January debugging session about gateway 401s, a March prompt I'd written for an agent eval, and a one-off shell pattern I'd built in February and lost. None of those were findable by name. All of them came back on the first query. That's the value. Not "AI-powered search." Just: the history Claude Code already keeps becomes a thing you can actually use. ## Proving it works Search quality is easy to feel and hard to measure. Mneme ships an eval so the measurement isn't optional. Three pieces: `eval/make_queryset.py` reads your recent sessions and emits a JSONL template. You replace each placeholder with a query you'd actually type, and label it with the session ID that should win. `eval/run_eval.py` runs vector, fts, hybrid, and rerank against your queryset and writes Recall@1, @5, @10 and MRR per mode to `eval/results/<timestamp>.json`, plus the verbatim queries each mode missed (the miss list is usually more useful than the numbers). `experiments/embedder_ab.py` builds an isolated index with a candidate embedder, runs the same queryset, and prints the delta against the latest baseline. The live index is never touched. The README's numbers come from running this against my own corpus: ![[mneme-modes-by-metric.png]] Vector-only at 0.802 MRR is solid on its own. FTS lags because the queries don't share enough exact terms with the chunks. Hybrid recovers the recall and rerank converts that recall into precision. The R@5 ceiling is 0.926; the two queries both rerank-pipelines miss aren't an embedder problem, they're a queryset/chunking problem (more on this below). The harness paid for itself the first time I used it. I scheduled an overnight A/B against `BAAI/bge-m3` (four times the params, multilingual, 1024-dim instead of 768). Instinct said it should beat nomic. It tied rerank recall and lost MRR by 0.018. At vector-only it lost MRR by 0.151. Bigger model, worse number, the answer was sitting in `eval/results/`. Without the harness I'd have switched on instinct and paid for it forever. ## Setting it up About twenty minutes if your Python is in shape. First index takes longer because the model has to download and your history has to embed. ```bash git clone https://github.com/BioInfo/mneme.git cd mneme python -m venv venv source venv/bin/activate pip install -r requirements.txt cp config.example.yaml config.yaml # point it at your ~/.claude/projects/ and pick a device. # mps on Apple Silicon, cuda if you have one, cpu fallback. ./venv/bin/python -m mneme.cli index ./venv/bin/python -m mneme.cli search "your query here" ``` If you want it callable from inside Claude Code, drop a `SKILL.md` and a one-line `search.sh` into `~/.claude/skills/mneme/`. The README has the exact two files. Three minutes of wiring, then the model can call mneme whenever you ask about past work. If you want your own numbers, generate a queryset and run `eval/run_eval.py`. The queryset is gitignored by default because it references your session IDs; what you produce stays local. > [!warning] Your sessions contain secrets > Claude Code transcripts will include any API key, password, or token you've typed or pasted during a real debugging session. Mneme's LanceDB store sits on disk as plain (compressed) vectors plus the source text. Treat the `data/` directory as sensitive: don't sync it to a public location, don't ship it with a backup that goes to a shared bucket, and gitignore stays gitignored. The whole tool runs locally precisely so this content doesn't leave the machine. ## What it doesn't do A few things you should know before adopting: Fragmented-mention queries still fail. When a recurring topic spans twenty sessions under five different framings, chunk-level rerank can't recover a query that doesn't lexically anchor to any one chunk. Both embedders I've tested miss the same queries here. I think the answer is hierarchical retrieval (session-level summaries above chunk-level vectors); I don't think it's another embedder. That's the next experiment in the repo. There's no cross-machine sync. The LanceDB store is local. Second box means re-index. I'm holding off on building sync until I've tried hierarchical retrieval, because hierarchy changes the storage shape and I don't want to design the sync twice. Schema migration is manual. If you change embedder dimension (768 to 1024) you have to `index --full`. The A/B harness does this cleanly in an isolated dir so the live index isn't affected. ## Closing If you use Claude Code daily, you're already producing a corpus you can search. Mneme is twenty minutes of setup between that corpus and being able to use it. The repo is public, the README walks the install, and the eval harness is there so you don't have to take the performance numbers on my word. [github.com/BioInfo/mneme](https://github.com/BioInfo/mneme)