FornixDB — Persistent, Human-Like Memory for AI

2026 — ITDT LLC AI Systems Design

By Danny Thornton — ITDT LLC, in collaboration with Claude Code (Anthropic, 2026)

FornixDB is built. The architecture described below is now a working, open-source memory store — local, private, and model-agnostic — in daily use and tested across Claude Code and a local Qwen-72B agent. The fornix is the brain’s memory tract; FornixDB is exactly that for an AI: the store of traces and the path a thinking model reaches them through, never the thinker.

FornixDB on GitHub: source, install, and the integration guide.


See It Running

A live, unscripted demo — FornixDB as the persistent memory behind Elira, a local voice assistant. Everything runs on-device: no account, no cloud, no telemetry, a single local file you own.

Recorded 2026 · runs entirely on local hardware.

The Problem

The problem with AI memory is structural. Current AI assistants — even agentic ones doing real engineering work — carry no genuine memory between sessions. What looks like memory is a directory of notes, loaded in full each time, that grows stale and strains context. Three structural failures define the gap:

No time axis. The AI cannot tell you what happened last Thursday. Memories are subject-keyed only; episodic recall is missing entirely.

No relevance ranking. The entire memory index loads every session, whether relevant to the current task or not. It strains the context budget and does not scale as the corpus grows.

No automatic capture. Memories exist only when the AI stops and hand-writes a file. The richest detail — the reasoning buried inside a long working session — evaporates at summarization. Once compaction happens, that detail is gone.

These are solvable engineering problems. The solution looks a lot like the human memory system it should have been from the start.

What You Gain

The architecture below is the how. Here is the what — every capability implemented and measured, not aspirational:

Recall by time. “What did we do last Thursday?” returns everything from that window. Sessions are captured automatically (owner-toggleable), so the answer exists without anyone deciding to save it.
Recall by meaning. Paraphrases work — “the glitch where her eyes sparkled” finds the right memory with zero shared keywords. The optional embedding model is a small, CPU-only upgrade, never a requirement.
A memory that learns. New knowledge supersedes old without erasing it; unused memories fade in ranking while frequently-used ones stay sharp; an explicit “not that one” teaches recall what to stop surfacing — retractable, never deleted.
Private by construction. No account, no cloud, no telemetry — a single local file you own, readable and deletable by you, that outlives any one vendor’s product decisions.
Memory that earns its space. Recall ships a one-line gist first (~300 tokens, hard-capped), in place of the hundreds to thousands of tokens that re-explaining and re-deriving would otherwise cost every session. It also gives time-axis answers that are impossible at any price in a stateless chat.
You stay in charge. Capture policy is yours: only-when-asked, offer, or auto. Never-delete is the default; true deletion happens only with your explicit consent, and it removes the least important memories first.

Two Recall Axes

Human memory has two natural axes. Episodic memory is time-indexed — “on Thursday we designed the memory database.” Semantic memory is subject-indexed — “we prefer local storage over cloud databases.” Current file-based memory is purely semantic and hand-curated. The new architecture adds the missing axis: every memory carries timestamps, and the system answers time-based queries.
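A minimal sketch of the two axes, assuming a single SQLite table in which every memory carries a subject and a creation timestamp. The table name `memory`, its columns, and the helper functions are illustrative assumptions, not FornixDB's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: one row per memory, with a subject key and a timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory (id INTEGER PRIMARY KEY, subject TEXT, gist TEXT, created_at TEXT)"
)
conn.execute("CREATE INDEX memory_time ON memory (created_at)")
conn.execute("CREATE INDEX memory_subject ON memory (subject)")

def remember(subject, gist, when):
    conn.execute(
        "INSERT INTO memory (subject, gist, created_at) VALUES (?, ?, ?)",
        (subject, gist, when.isoformat()),
    )

def recall_by_time(start, end):
    """Episodic axis: everything recorded in the half-open window [start, end)."""
    rows = conn.execute(
        "SELECT gist FROM memory WHERE created_at >= ? AND created_at < ? "
        "ORDER BY created_at",
        (start.isoformat(), end.isoformat()),
    )
    return [r[0] for r in rows]

def recall_by_subject(subject):
    """Semantic axis: everything filed under a subject, newest first."""
    rows = conn.execute(
        "SELECT gist FROM memory WHERE subject = ? ORDER BY created_at DESC",
        (subject,),
    )
    return [r[0] for r in rows]

thursday = datetime(2026, 6, 4)  # 2026-06-04 falls on a Thursday
remember("storage", "Designed the memory database", thursday + timedelta(hours=10))
remember("storage", "Chose local storage over cloud", thursday + timedelta(days=2))

print(recall_by_time(thursday, thursday + timedelta(days=1)))
print(recall_by_subject("storage"))
```

Storing timestamps as ISO 8601 text keeps them sortable with plain string comparison, which is why the time-window query needs nothing beyond an ordinary index.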

The deeper connection: semantic memories should be derived from episodic ones over time, through consolidation. The system observes patterns across episodes and abstracts them into standing facts. A person does not explicitly store “I prefer X” — they notice it across dozens of episodes and it rises to the surface. The AI memory system does the same.

Progressive Disclosure

Every memory entry carries two representations: a gist (the compact summary) and detail (the full content). Recall returns gists inexpensively; detail is fetched only when the conversation drills in. This is both the human-like behavior — “I remember we worked on that; want the details?” — and the mechanism that keeps recall within the context budget. Storage is cheap. Context is not.
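The gist-then-detail flow can be sketched as follows. The `Memory` class, the word-count token estimate, and both function names are assumptions made for illustration; a real store would use its own tokenizer and ranking.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    id: int
    gist: str    # compact summary, cheap to place in context
    detail: str  # full content, fetched only on request

def estimate_tokens(text):
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())

def recall_gists(ranked, token_budget):
    """Return (id, gist) pairs in rank order until the budget is spent."""
    out, used = [], 0
    for m in ranked:
        cost = estimate_tokens(m.gist)
        if used + cost > token_budget:
            break
        out.append((m.id, m.gist))
        used += cost
    return out

def fetch_detail(store, memory_id):
    """Second step: full content for the one memory the conversation drilled into."""
    return store[memory_id].detail

store = {
    1: Memory(1, "Chose SQLite for the store", "Full notes comparing SQLite with a cluster."),
    2: Memory(2, "Added a time axis to recall", "Full notes on the timestamp columns."),
}
gists = recall_gists(list(store.values()), token_budget=6)
print(gists)
print(fetch_detail(store, 1))
```

Only the first gist fits the six-token budget here, so the second memory is simply not surfaced; its detail still exists in storage and costs nothing until asked for.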

Associative Recall and Typed Links

Pure keyword search does not feel like memory. People recall by similarity and association: one memory surfaces another. The store is therefore hybrid — relational structure for integrity, full-text search for precision, vector embeddings for fuzzy associative recall, and a typed link graph between memories.

Links carry semantic weight: refines, supersedes, or relates. When two memories about the same subject conflict, the more recent one supersedes the older — but the older is tombstoned with its timestamp, not deleted. The chain of “we used to do X, then switched to Y on this date, for these reasons” is preserved. That trail is the record of learning.
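A sketch of supersede-with-history under assumed table and column names (`memory`, `link`, `tombstoned_at`); none of these are confirmed to match the real schema. The point it illustrates is that superseding is an update plus a link, never a delete, and that the trail can be walked back with a recursive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    gist TEXT NOT NULL,
    created_at TEXT NOT NULL,
    tombstoned_at TEXT            -- NULL while the memory is current
);
CREATE TABLE link (
    src INTEGER NOT NULL REFERENCES memory(id),
    dst INTEGER NOT NULL REFERENCES memory(id),
    kind TEXT NOT NULL CHECK (kind IN ('refines', 'supersedes', 'relates'))
);
""")

def add(gist, when):
    cur = conn.execute(
        "INSERT INTO memory (gist, created_at) VALUES (?, ?)", (gist, when)
    )
    return cur.lastrowid

def supersede(old_id, gist, when):
    """Add a newer memory, tombstone the old one, and link them. Nothing is deleted."""
    new_id = add(gist, when)
    conn.execute("UPDATE memory SET tombstoned_at = ? WHERE id = ?", (when, old_id))
    conn.execute(
        "INSERT INTO link (src, dst, kind) VALUES (?, ?, 'supersedes')",
        (new_id, old_id),
    )
    return new_id

def history(current_id):
    """Walk the supersedes chain from the current memory back to the oldest."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, depth) AS (
            SELECT ?, 0
            UNION ALL
            SELECT link.dst, chain.depth + 1
            FROM link JOIN chain ON link.src = chain.id
            WHERE link.kind = 'supersedes'
        )
        SELECT memory.gist, memory.tombstoned_at
        FROM chain JOIN memory ON memory.id = chain.id
        ORDER BY chain.depth
    """, (current_id,))
    return rows.fetchall()

old = add("Store memories as Markdown files", "2026-05-01")
new = supersede(old, "Store memories in SQLite", "2026-06-05")
print(history(new))
```

The older row survives with its tombstone date, so "we used to do X, then switched to Y on this date" is answerable from the chain alone.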

The conflict gate has two dimensions: confidence in the resolution, and impact on the owner’s understanding. Self-evident conflicts resolve silently. Genuinely ambiguous conflicts surface for discussion. Clear-to-the-AI-but-consequential conflicts are surfaced and explained — even when the AI is certain, because the owner must understand what the resolution implies.
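The two-dimensional gate can be written as a small decision function. The 0.0 to 1.0 scoring scale, the threshold values, and the names below are assumptions chosen for illustration, not tuned or documented values.

```python
from enum import Enum

class Action(Enum):
    RESOLVE_SILENTLY = "resolve silently"
    DISCUSS = "surface for discussion"
    SURFACE_AND_EXPLAIN = "surface and explain"

def conflict_gate(confidence, impact, confident_at=0.8, consequential_at=0.5):
    """Map the two dimensions (each scored 0.0 to 1.0) to an action.

    The thresholds are illustrative assumptions.
    """
    if confidence < confident_at:
        return Action.DISCUSS              # genuinely ambiguous
    if impact >= consequential_at:
        return Action.SURFACE_AND_EXPLAIN  # clear to the AI, but consequential
    return Action.RESOLVE_SILENTLY         # self-evident and low impact

print(conflict_gate(0.95, 0.1).value)
print(conflict_gate(0.40, 0.9).value)
print(conflict_gate(0.95, 0.9).value)
```

Checking confidence first means an ambiguous conflict is always discussed regardless of impact; high confidence alone is never enough to stay silent when the impact is high.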

Storage Without Clusters

This is not a Big Data problem. Distributed cluster infrastructure solves volume and throughput at terabyte-to-petabyte scale. A single agent’s corpus is text and tool I/O — orders of magnitude below that threshold. The right store is SQLite, with full-text search (FTS5) and a vector extension (sqlite-vec). Everything that sounds like it needs exotic infrastructure maps to: JSON columns for heterogeneity, a vector index for association, a link table for the graph, and recursive CTEs for traversal.
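To make the mapping concrete, here is a sketch using only Python's standard library. It stores heterogeneous metadata as JSON text and does associative lookup with a brute-force cosine scan, which stands in for a real vector index such as sqlite-vec. The schema, the toy three-dimensional embeddings, and the function names are all assumptions.

```python
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory (id INTEGER PRIMARY KEY, gist TEXT, meta TEXT, embedding TEXT)"
)

def add(gist, meta, embedding):
    # meta is a free-form dict stored as JSON text, so entries can differ in shape.
    conn.execute(
        "INSERT INTO memory (gist, meta, embedding) VALUES (?, ?, ?)",
        (gist, json.dumps(meta), json.dumps(embedding)),
    )

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query, k=1):
    """Brute-force scan; a vector index would replace this loop at scale."""
    rows = conn.execute("SELECT gist, meta, embedding FROM memory").fetchall()
    scored = [(cosine(query, json.loads(e)), g, json.loads(m)) for g, m, e in rows]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Toy vectors; a real embedding model would produce these.
add("Eyes sparkle glitch in the avatar", {"kind": "bug", "project": "Elira"}, [0.9, 0.1, 0.0])
add("Prefer local storage over cloud", {"kind": "preference"}, [0.0, 0.2, 0.9])

score, gist, meta = nearest([1.0, 0.0, 0.0])[0]
print(gist, meta)
```

For a single agent's corpus a linear scan like this is often tolerable on its own; the index becomes worthwhile as the hot set grows.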

The corpus is nonetheless substantial at full fidelity — single-digit to low-tens of GB over years of heavy use, not tens of megabytes. Episodic richness, multiple resolution levels, tombstoned history, and vector embeddings accumulate. Scale is managed not by clustering but by tiering, modeled on the human memory hierarchy:

Tier	Contents	Store
Recent episodic (hot)	Full fidelity, full vectors — last weeks of experience	SQLite + vector index
Long-term semantic	Consolidated facts, always live	SQLite (compact)
Cold episodic archive	Old raw detail, rarely queried	Compressed columnar (Parquet-style)

A consolidation pipeline — the “sleep” step — continuously moves experience down the tiers, lossily. Only the cold archive grows without bound, and it is cold precisely because it is almost never queried. Forgetting and human-likeness are the same property: a system that consolidates intelligently is simultaneously human-like and scale-controlled.
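A sketch of the placement half of such a pass, following the three tiers in the table above. It only decides where each memory belongs; the lossy summarization itself is out of scope here. The four-week hot window, the dict shape of a memory, and the function names are assumptions.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(weeks=4)  # assumed length of "last weeks of experience"

def tier_for(kind, created_at, now):
    """Pick a storage tier for one memory."""
    if kind == "semantic":
        return "long-term semantic"      # consolidated facts stay live
    if now - created_at <= HOT_WINDOW:
        return "recent episodic (hot)"   # full fidelity, full vectors
    return "cold episodic archive"       # old raw detail, rarely queried

def sleep_pass(memories, now):
    """Group memory ids by the tier each one belongs in after this pass."""
    placed = {}
    for m in memories:
        tier = tier_for(m["kind"], m["created_at"], now)
        placed.setdefault(tier, []).append(m["id"])
    return placed

now = datetime(2026, 6, 5)
memories = [
    {"id": 1, "kind": "episodic", "created_at": now - timedelta(days=3)},
    {"id": 2, "kind": "episodic", "created_at": now - timedelta(days=200)},
    {"id": 3, "kind": "semantic", "created_at": now - timedelta(days=200)},
]
print(sleep_pass(memories, now))
```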

The retention and consolidation policy is the hard 80% of the problem. Get it right and scale takes care of itself. Get it wrong — “keep everything at full fidelity forever” — and the result is a tape archive, not a memory.

Two Hardware-Derived Dials

The same architecture spans a Raspberry Pi and a data center. Only two dials change.

The memory dial has two inputs. Capacity determines how much is kept: a small disk means a shorter hot window, lower fidelity, more aggressive consolidation, heavier vector quantization. Speed determines how much stays online versus cold: fast NVMe keeps a large hot set live; slow storage pushes more detail into the cold archive. The recall interface is identical regardless of dial setting. The reasoning model never special-cases a low-memory endpoint; it asks for recall and receives the best that endpoint can give. Same mind, different-sized body.

The cognition dial determines where thinking happens. At the robot end, cognition is onboard, forced by latency, connectivity, and autonomy requirements. At the agent end, cognition is central: a large shared model tolerates round-trips. The split between local reflex work (recall ranking, clustering, eviction; System 1) and deliberative work (consolidation, conflict resolution, interpretation; System 2) slides based on available compute and connectivity.
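One way to picture the dial is as a pure function from hardware facts to settings, with the recall interface untouched. Every breakpoint and field name below is a made-up example, not a documented default.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryDial:
    hot_window_days: int   # how long episodes stay at full fidelity
    vector_bits: int       # bits per vector component after quantization
    hot_fraction: float    # share of the budget kept online rather than cold

def derive_dial(disk_gb, fast_storage):
    """Turn capacity and speed into settings. All breakpoints are hypothetical."""
    if disk_gb < 8:
        window, bits = 7, 8      # small disk: short window, heavy quantization
    elif disk_gb < 256:
        window, bits = 30, 16
    else:
        window, bits = 90, 32
    hot_fraction = 0.5 if fast_storage else 0.1  # slow storage pushes detail cold
    return MemoryDial(window, bits, hot_fraction)

print(derive_dial(disk_gb=4, fast_storage=False))     # a small single-board computer
print(derive_dial(disk_gb=2000, fast_storage=True))   # a workstation with NVMe
```

Capacity and speed feed separate fields, mirroring the two inputs: one governs how much is kept, the other how much of it stays hot.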

Local AI for the reflex layer is an option, never a requirement. The algorithmic baseline — clustering and ranking using classical algorithms, no model — works on any hardware. Many endpoints will not have the capacity to run a local model at all, and the system must work fully on those endpoints regardless.

Transparent by Design

Most memory systems are black boxes. This one answers questions about itself, in plain English, through any connected AI: how much disk space it uses (per AI and machine-wide), what its token footprint is, and what is remembered — four levels of drill-down from titles to gist to full detail to the raw original source. Every result also carries why it ranked where it did: staleness, downweighting, and duplicate flags travel with the answer. And the store never claims truth — it preserves where every memory came from (its source, its writer, its supersede history) and surfaces that at recall, so the reasoning model can weigh it. Inspectable all the way down.

Seven Phases

1. Hot spine: SQLite schema, time and subject recall, import of existing memories. Manual checkpointing.
2. Associative recall: Vector embeddings, FTS5, typed link graph.
3. Consolidation and lifecycle: Retention policy, decay, reinforcement, salience-weighted eviction, cold archive, self-tuning under disk pressure.
4. Recall integration: Model-agnostic MCP-style interface; automatic session-start recall; Markdown export for git.
5. Algorithmic reflex layer: Clustering (propose, not dispose), recall ranking, eviction. No model required; baseline for lean hardware.
6. Optional local-model reflex: Model-assisted reflex where hardware allows; escalation split point; eidetic retention mode.
7. Scope federation: Personal memory tier synced across endpoints over endpoint-local tiers.

Phase 3 is the crux. The schema is the easy 20%; the forgetting curve and consolidation trigger are the hard 80%. Phase 1 is a tractable first win. The rest follows from the policy.
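As a sketch of what a forgetting curve with reinforcement might look like, the function below applies exponential decay since last use and lets prior uses stretch the half-life. The formula, the 30-day half-life, and the field names are illustrative assumptions, not the policy FornixDB ships.

```python
import math

def salience(base, days_since_use, use_count, half_life_days=30.0):
    """Exponential decay since last use, slowed by reinforcement.

    Each prior use stretches the half-life, so frequently used memories fade
    more slowly. Formula and constants are illustrative assumptions.
    """
    effective_half_life = half_life_days * (1 + math.log1p(use_count))
    return base * 0.5 ** (days_since_use / effective_half_life)

def eviction_order(memories):
    """Least salient first: the order in which detail would be consolidated away."""
    return sorted(
        memories, key=lambda m: salience(m["base"], m["idle_days"], m["uses"])
    )

memories = [
    {"id": "often-used", "base": 1.0, "idle_days": 60, "uses": 20},
    {"id": "never-used", "base": 1.0, "idle_days": 60, "uses": 0},
    {"id": "fresh", "base": 1.0, "idle_days": 1, "uses": 0},
]
print([m["id"] for m in eviction_order(memories)])
```

With these inputs the never-used memory is first in line, the often-used one follows, and the fresh one is last, which matches the intent that unused memories fade while frequently used ones stay sharp.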

Status, 2026: built and released. Phases 1–6 are implemented and tested: time and subject recall, supersede-with-history, associative recall (vectors + FTS5 + typed links), the consolidation “sleep” pass (now a user-visible dream step that reconciles outdated memories and weaves new connections), the algorithmic no-model reflex layer, and multiple AIs sharing a machine-level memory tier with per-store disk budgets.

On the operating-level ladder described in the roadmap below, proactive recall (L3) is wired into the daily loop. Rhythmic in-thought recall (L4) is default-on and proven: a usefulness scan verifies that mid-thought pulses are referenced downstream more often than the per-turn channel (20% vs 13%). L5, parallel multi-domain activation (“the field”), is now default-on as well: seven domain-scoped recalls fire on one thought and settle by corroboration into a single directed pattern. It was flipped on after live dogfooding (~96% of surfaced pushes scored useful; ~307 ms median per beat; honest degrade-to-L4 when nothing corroborates).

Two orthogonal signals keep every recall landing on the right memories: per-memory usefulness feedback (impressions kept apart from evidence of use) and project-scoped pulses that stop off-context memories from leaking into the ambient stream.

Beyond the memory ladder, the store now also remembers to remember, with prospective reminders that fire by the clock and urgent ones that nag until acknowledged. It also has the beginnings of senses: one-shot hearing and on-device temperature. A single configuration surface (`config` shows every setting at once; `doctor` health-checks the install and suggests defaults) sits over all of it.

Tagged and public through v0.8.7; the federation tier (phase 7) is the planned next step. Proven across three consumers: Claude Code, a local Qwen-72B agent, and a smaller model over the MCP/shim surface.

The Roadmap — Operating Levels

The seven phases above are the build axis — what is engineered. The roadmap runs along a different, orthogonal axis: how tightly, how often, and how much in parallel memory is fused into the act of thinking. It is a climb from “the program must ask” toward memory that activates itself, across many information domains at once, and steers the next thought.

This axis is independent of the two hardware dials. A microcontroller and a workstation can sit on the same rung and simply differ in how much they remember and how coarsely. The ladder measures the maturity of the coupling itself — autonomy first, then rhythm, then parallelism.

Level	How memory couples to cognition	Example endpoint	Status
L0 — Explicit store / retrieve	A passive keyed store. The program must deliberately put and get; exact lookups, no ranking, no automation.	A microcontroller; a hand-wired `put`/`get`.	The floor.
L1 — Associative recall on demand	Still pull-based — the AI must ask — but a query returns relevance-ranked, associative recall (vectors + full-text + a time axis), gist→detail.	An agent that calls recall when it needs context.	Shipped.
L2 — Automatic capture (write-side autonomy)	The write side becomes autonomous: experience is captured and consolidated after each prompt or session with no explicit “store.” The foothold for everything above.	A session whose transcript is auto-captured.	Shipped.
L3 — Proactive recall injection (one pulse per turn)	Memory pushes relevant context into the thinking unasked — once per turn, relevance-gated, additive. The first heartbeat of memory eventing back to the thinker.	A prompt-submit hook surfacing a tagged “possibly-relevant past” block.	Shipped — lived-in. Wired into the daily loop and surfacing each turn; the relevance floor is now per-memory, tuned by the usefulness signal below so the block fires on the right memories, not merely plausible ones.
L4 — Rhythmic in-thought recall (the “metronome”)	Memory is re-activated many times within a single reasoning episode — pulsed as the thought evolves, event-driven cadence, each pulse steering the next step.	A debounced local recall loop re-querying at reasoning checkpoints.	Shipped, default-on — proven. Its usefulness gate resolved on lived-in usage: mid-thought pulses are referenced downstream more often than the per-turn channel (20% vs 13%, scan-verified). A portable cadence controller drives both a local model’s inner tool-loop and the Claude Code tool seam.
L5 — Parallel multi-domain activation (the human-like target)	Many lightweight agents fire simultaneously across domains (episodic, semantic, feedback, by-project, by-person, by-salience), all local, and their returns settle into one pattern that directs the next thought.	A local orchestrator spawning N domain-scoped recall agents per step.	Default-on — the living frontier. “The field”: seven domain-scoped recalls on one shared query embedding, settled by corroboration clustering (rows several domains return, or that link across domains, form the pattern; no corroboration degrades gracefully to L4 — nothing is fabricated). Flipped on after live dogfooding showed no harm (~96% of surfaced pushes scored useful; ~307 ms median per beat; degrade-to-L4 honesty held); the same usefulness gate L4 passed keeps accruing as the revert signal, and `config parallel_recall off` steps back to L4.
L6 — Federated / distributed memory (beyond human-like)	The parallel model extended across endpoints — machines, agents, a household — behind one recall. Reached only after L5: a single mind is not federated, so this sits above the human-likeness climb, not on it.	A cross-machine aggregator; a fleet or household memory.	Far out; encryption-gated.

L0–L3 are about autonomy — memory learning to write itself and recall without being told. L4–L5 are about rhythm and parallelism, and that is where the human-likeness actually lives. Today’s AI thinks at length in a serial stream while memory, at best, is consulted once at the edges of a turn. A human mind, mid-thought, lights up in parallel across many domains at once and the action-state settles on a pattern. L5 is the computational mirror of that: not a single lookup but a field of simultaneous local recalls resolving into a direction.
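The settling step can be sketched as a vote across domains. This covers only the simplest criterion named in the table, rows that several domains return; cross-domain links are not modeled. The function name, the `min_domains` parameter, and the sample ids are assumptions.

```python
from collections import defaultdict

def settle(domain_results, min_domains=2):
    """Keep memories that at least `min_domains` domains returned.

    domain_results maps a domain name to the memory ids that domain recalled.
    Returns ids ordered by how many domains agreed, or None when nothing
    corroborates, which tells the caller to fall back to single-stream recall.
    """
    votes = defaultdict(set)
    for domain, ids in domain_results.items():
        for memory_id in ids:
            votes[memory_id].add(domain)
    pattern = [(len(d), mid) for mid, d in votes.items() if len(d) >= min_domains]
    if not pattern:
        return None
    pattern.sort(key=lambda t: (-t[0], t[1]))  # most corroborated first
    return [mid for _, mid in pattern]

field = {
    "episodic": [11, 42],
    "semantic": [42, 7],
    "by-project": [42, 11],
    "feedback": [3],
}
print(settle(field))
print(settle({"episodic": [1], "semantic": [2]}))
```

Returning `None` instead of a best guess is the important design choice: when no two domains agree, nothing is fabricated and the caller degrades to the simpler path.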

The rungs are cumulative heartbeats of one idea. L2 is the foothold — memory writes itself. L3 is the first heartbeat — memory speaks back, once per turn. L4 (proven) makes it beat repeatedly within a thought. L5 (default-on, where we live today) makes it beat in parallel across domains. L6 then extends that beating mind across machines — deliberately last. At every rung the “memory, not a mind” line holds: memory surfaces and settles structure; judgment and action stay in the reasoning model.

Cross-cutting work that strengthens every rung

The ladder measures the maturity of the coupling. Orthogonal to it — and to the two hardware dials — are signals that make every recall, at every rung, land on the right memories instead of merely plausible ones. These are not rungs, but the rungs lean on them.

Per-memory usefulness feedback. Each memory carries a usefulness signal — an explicit “this helped” mark plus scan-verified downstream use of its pushes — rolled up at session start and fed back into both ranking and the relevance floor. The closing move, directly downstream of L3: every unsolicited surfacing is counted as an impression, kept strictly apart from evidence of use, so the system can tell “this memory keeps getting surfaced but no one ever uses it” from “this memory is used.” That gap nudges a per-memory push floor — proven-useful memories surface a touch more easily, chronically-ignored ones go quiet. Bounded, never hides a memory, reversible.
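A sketch of how the gap between impressions and evidence of use might nudge a per-memory floor. The linear mapping, the bound of 0.1, and the minimum-evidence rule are assumptions for illustration only.

```python
def push_floor(base_floor, impressions, uses, max_shift=0.1, min_impressions=5):
    """Per-memory relevance floor for unsolicited surfacing.

    impressions: times the memory was pushed unasked.
    uses: times a push was later verified as used.
    The shift is bounded by max_shift in either direction, so the floor can
    never rise far enough to hide a memory outright. Constants are hypothetical.
    """
    if impressions < min_impressions:
        return base_floor                        # too little evidence to move the floor
    use_rate = min(1.0, uses / impressions)      # 0.0 (ignored) to 1.0 (always used)
    shift = (0.5 - use_rate) * 2 * max_shift     # +max_shift ignored, -max_shift used
    return base_floor + shift

print(push_floor(0.6, impressions=2, uses=0))    # unchanged: not enough evidence
print(push_floor(0.6, impressions=20, uses=20))  # lower floor: surfaces more easily
print(push_floor(0.6, impressions=20, uses=0))   # higher floor: goes quiet
```

Because the function is recomputed from the two counters, the adjustment is reversible: as a memory starts being used again, its floor drifts back down.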

Project-scoped pulses. The complementary half: a proactive pulse that knows its active context raises the floor for memories that don’t belong to it, so off-context memories stop leaking into the ambient stream on weak matches — while a strongly-relevant one still surfaces and untagged, structural facts are never scoped out. The active context is a pinned project, else the one declared in the prompt, else the working directory — so it works even when every session shares one directory. Push-only; explicit recall stays unscoped.
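The context fallback and the raised floor can be sketched in a few lines. The function names, the floor values, and the convention that an untagged memory has `memory_project=None` are assumptions.

```python
def active_context(pinned, declared_in_prompt, working_dir):
    """First available source wins: pinned project, then prompt, then directory."""
    return pinned or declared_in_prompt or working_dir

def should_push(score, memory_project, context, base_floor=0.5, off_context_penalty=0.3):
    """Decide whether a proactive pulse surfaces a memory.

    Untagged memories (memory_project is None) are never scoped out.
    Floor values are illustrative assumptions.
    """
    floor = base_floor
    if memory_project is not None and memory_project != context:
        floor += off_context_penalty   # off-context: needs a much stronger match
    return score >= floor

context = active_context(None, "elira", "/home/user/work")
print(context)
print(should_push(0.6, "elira", context))   # in context, moderate match
print(should_push(0.6, "other", context))   # off context, weak match
print(should_push(0.9, "other", context))   # off context, strong match
print(should_push(0.6, None, context))      # untagged structural fact
```

This gate applies only to proactive pushes; an explicit recall would bypass `should_push` entirely, in line with recall staying unscoped.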

The Design Conversation: the 2026-06-05 session that produced this architecture, with verbatim owner turns and faithful Claude summaries.
