Whitepaper · Aether AI

Unlimited Context: Virtual Memory for LLM Attention

How any local large language model gets billion-token reach by encoding overflow to a local pool and paging the right slice back — instead of compress-and-forget.

By Brandon Barrante, Aether AI · Published June 2, 2026 · Last updated June 2, 2026 · Open-source, Apache-2.0

What it is, in one paragraph. Unlimited Context is an open-source engine — the Python package aether-context — that gives any local LLM (via Ollama, llama.cpp, or Hugging Face) reach over roughly a billion tokens of context. When the model's window fills, it does not summarize and discard the overflow. It encodes the overflow into a local, memory-mapped vector pool on disk and pages the relevant slice back into the working window exactly when the model needs it — concurrently with generation. It is virtual memory, for an LLM's attention.


The problem: long runs rot in the middle

Every long agentic run dies the same way. The model fills its context window, begins compressing its own history to make room, and silently drops the one detail that mattered three steps ago. Then it drifts — the runaway pull request, the agent that confidently rewrites a function it already wrote, the build that falls apart at hour two.

Bigger windows only delay the failure. A crammed million-token window suffers from "lost in the middle": transformer models reliably use information at the start and end of a long context far better than information buried in the middle, so a stuffed window quietly degrades even when nothing has been dropped [1]. Two forces compound: compaction loss (summarizing throws away specifics) and positional rot (mid-context facts are under-attended).

The fix: encode & recover, not compress & forget

Unlimited Context fixes the overflow, not the window. Instead of blindly summarizing what spills over, it encodes and externalizes it to a local pool, and retrieves the right slice back on demand. Nothing load-bearing is silently lost — it is filed, and recoverable.

Compress & forget ✗  →  Encode & recover ✓
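To make the contrast concrete, here is a minimal sketch of encode-and-recover, not the engine's actual code. It assumes a toy hashing encoder (`toy_encode`) in place of the real static encoder, and the names `ToyPool`, `spill`, and `recover` are hypothetical. The only parts taken from this paper are the 256-dim vector width and the memory-mapped pool on disk.

```python
import os
import tempfile
import zlib

import numpy as np

DIM = 256  # vector width quoted in this paper


def toy_encode(text: str) -> np.ndarray:
    """Hash each word into one of DIM buckets, then L2-normalize.

    A stand-in for the engine's static encoder, not its real algorithm.
    """
    vec = np.zeros(DIM, dtype=np.float32)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = float(np.linalg.norm(vec))
    return vec / norm if norm else vec


class ToyPool:
    """Vectors live in a memory-mapped file; slice text is kept alongside."""

    def __init__(self, path: str, capacity: int) -> None:
        self.vectors = np.memmap(path, dtype=np.float32, mode="w+",
                                 shape=(capacity, DIM))
        self.texts: list[str] = []

    def spill(self, text: str) -> None:
        """Encode one overflow slice and file it instead of discarding it."""
        self.vectors[len(self.texts)] = toy_encode(text)
        self.texts.append(text)

    def recover(self, query: str) -> str:
        """Return the stored slice whose vector best matches the query."""
        scores = self.vectors[: len(self.texts)] @ toy_encode(query)
        return self.texts[int(np.argmax(scores))]


pool_path = os.path.join(tempfile.mkdtemp(), "pool.f32")
pool = ToyPool(pool_path, capacity=1024)
pool.spill("the database password lives in the vault under key db_main")
pool.spill("the frontend uses a twelve column grid layout")
pool.spill("retry failed uploads three times with exponential backoff")

hit = pool.recover("where is the database password stored")
print(hit)
```

The original wording of the slice comes back intact, which is the property a summary cannot guarantee. A real pool would also persist the slice text and metadata on disk; this sketch keeps text in a Python list for brevity.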

How it works: virtual memory for attention

The cleanest way to understand the architecture is to map it onto an operating system's virtual memory:

OS concept | Unlimited Context
RAM (small, fast) | Resident window: what the model sees this turn
Disk (vast, cheap) | Context pool: a memory-mapped 256-dim vector index, ~5 GB, ~1B tokens
Pager | Slice loader: prefetches the next slice from what the model is reasoning about right now, on a background thread
Page replacement | Witnesses (+/−): salient slices harden, stale ones fade, anything relevant again re-hardens
Encode-on-spill | Static encoder: tokenizes overflow and writes 256-dim vectors as it streams
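The page-replacement row can be read as a per-slice salience weight that rises on a positive witness and decays otherwise. The sketch below is one possible interpretation, not the engine's documented rule: `WitnessTable`, `HARDEN`, `FADE`, and `evict_candidate` are hypothetical names, and the update constants are chosen only to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass, field

HARDEN = 0.5  # hypothetical: share of the gap to 1.0 closed by a "+" witness
FADE = 0.5    # hypothetical: decay factor applied by a "-" witness


@dataclass
class WitnessTable:
    weights: dict[str, float] = field(default_factory=dict)

    def end_turn(self, used: set[str]) -> None:
        """Apply one turn of witnesses: "+" for used slices, "-" for the rest."""
        for slice_id in used:
            self.weights.setdefault(slice_id, 0.0)
        for slice_id, weight in self.weights.items():
            if slice_id in used:
                self.weights[slice_id] = weight + HARDEN * (1.0 - weight)
            else:
                self.weights[slice_id] = weight * FADE

    def evict_candidate(self) -> str:
        """The slice with the lowest weight is the first to leave the window."""
        return min(self.weights, key=self.weights.get)


table = WitnessTable()
table.end_turn({"schema", "auth"})   # both slices used: 0.5 each
table.end_turn({"schema"})           # schema hardens, auth fades
table.end_turn({"schema"})
faded_auth = table.weights["auth"]
first_out = table.evict_candidate()

table.end_turn({"auth"})             # auth is relevant again and re-hardens
rehardened_auth = table.weights["auth"]
print(faded_auth, first_out, rehardened_auth)
```

Under these assumed constants, a slice that goes unused for two turns drops to an eighth of full weight and becomes the eviction candidate, yet a single renewed use lifts it back above the slice that had been dominant.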

Because the pager runs concurrently with generation, hidden behind the model's own thinking, a prefetch that lands on the right slice adds no extra wall-clock latency. Retrieval is then effectively free in time; what it costs is disk, and it depends on a high retrieval hit rate, which the loader is engineered to maintain.
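The overlap between paging and generation can be sketched with a single background worker. This is an illustration under stated assumptions rather than the engine's implementation: `load_slice`, `generate_step`, and `run_turn` are hypothetical names, the model call is a placeholder, and the pool is three hand-made 4-dim vectors instead of a 256-dim index on disk.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Tiny stand-in pool: three slices with hand-made 4-dim vectors.
POOL_TEXT = ["auth flow notes", "database schema v2", "css grid decisions"]
POOL_VECS = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]], dtype=np.float32)


def load_slice(focus_vec: np.ndarray) -> str:
    """Pager work: find the pooled slice closest to the current focus."""
    return POOL_TEXT[int(np.argmax(POOL_VECS @ focus_vec))]


def generate_step(prompt: str) -> str:
    """Placeholder for one model decoding step."""
    return prompt.upper()


def run_turn(prompt: str, focus_vec: np.ndarray) -> tuple[str, str]:
    """Start the prefetch, generate while it runs, then collect the slice."""
    with ThreadPoolExecutor(max_workers=1) as pager:
        prefetch = pager.submit(load_slice, focus_vec)  # background thread
        output = generate_step(prompt)                  # runs in the meantime
        next_slice = prefetch.result()                  # blocks only if late
    return output, next_slice


focus = np.array([0.1, 0.9, 0.2, 0.0], dtype=np.float32)
output, next_slice = run_turn("alter the users table", focus)
print(output, "|", next_slice)
```

The design point is the ordering inside `run_turn`: the lookup is submitted before generation starts, so its cost is hidden only when it finishes first. If the lookup is slower than the generation step, or fetches the wrong slice, `prefetch.result()` becomes visible latency.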

The five moving parts

Unlimited Context vs. the alternatives

Long-context approaches make different trade-offs. The comparison below is the fast way to place Unlimited Context against the four common strategies developers reach for.

Approach | Reach | Loses detail? | Cost model | Local / private
Unlimited Context | ~1B+ tokens (per 5 GB) | No: encoded & recoverable | Disk + retrieval (one-time encode) | Yes: fully local
Bigger context window | Up to model limit (e.g. 1M) | No, but rots in the middle [1] | Quadratic-ish compute & $ per token | Depends on model
Summarization / compaction | Unbounded in theory | Yes: specifics discarded | Extra LLM calls per compaction | Depends
Vector RAG (static corpus) | Corpus size | No, but not the model's own working memory | Embedding + store | Yes, if self-hosted
Fine-tuning | Baked into weights | N/A (not per-run memory) | Training compute | Yes, if local

The key distinction: of the approaches compared here, Unlimited Context is the only one that continuously externalizes and recovers the model's own live overflow during a single long run, rather than pre-loading a fixed corpus or throwing detail away.

Benchmarks: measure the drift, don't take our word

The pitch is a delta you can reproduce. The bundled benchmark runs the same base model on the same long, multi-stage build twice — engine on versus off — and reports four numbers: cross-stage contradictions (drift), per-stage correctness, retrieval hit rate, and whether the run finished unattended.

Metric (long scripted build) | Engine OFF | Engine ON
Cross-stage drift (contradictions) | 3 | 0
Per-stage correctness | 0.0 | 1.0
Planted-fact reach | 0 / 4 | 4 / 4
Finished unattended | No | Yes

Run it yourself:

python bench/drift_vs_window.py --model ollama/qwen2.5

The hermetic mock-model mode proves the mechanism in CI; the real-model flag runs it on your own hardware.
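For readers who want a feel for what two of these metrics measure before running the benchmark, here is a rough sketch. It is not the code in bench/drift_vs_window.py: the functions `count_drift` and `planted_fact_reach`, the representation of a stage as a dictionary of decisions, and the toy data are all assumptions made for illustration.

```python
def count_drift(stages: list[dict[str, str]]) -> int:
    """Count decisions that a later stage changes after an earlier stage set them."""
    settled: dict[str, str] = {}
    contradictions = 0
    for decisions in stages:
        for key, value in decisions.items():
            if key in settled and settled[key] != value:
                contradictions += 1
            settled.setdefault(key, value)  # the first decision is the reference
    return contradictions


def planted_fact_reach(planted: list[str], transcript: str) -> tuple[int, int]:
    """Return (facts found in the transcript, facts planted)."""
    found = sum(1 for fact in planted if fact in transcript)
    return found, len(planted)


# Toy data, not benchmark output.
drifting_run = [
    {"db": "postgres", "port": "8080"},
    {"db": "sqlite"},                    # contradicts stage 1
    {"port": "8080", "auth": "jwt"},
    {"auth": "sessions"},                # contradicts stage 3
]
coherent_run = [
    {"db": "postgres", "port": "8080"},
    {"db": "postgres"},
    {"port": "8080", "auth": "jwt"},
    {"auth": "jwt"},
]

drift_bad = count_drift(drifting_run)
drift_good = count_drift(coherent_run)
reach = planted_fact_reach(["TRACKER_KEY", "max_reps=50"],
                           "config reads TRACKER_KEY for the API")
print(drift_bad, drift_good, reach)
```

Substring matching is the crudest possible reach check; a real harness would likely need normalization or a model-graded comparison to catch paraphrased facts.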

The numbers behind the reach

"Billion-token memory" is derived, not a slogan. Each encoded slice is ~2.2 KB (a 256-dim vector plus compressed text and metadata) and represents ~512 tokens. That works out to roughly 455,000 slices per gigabyte → ~233 million tokens of reach per gigabyte, so reach ≈ pool_GB × 233M.

Pool | Encoded reach | Resident index RAM
5 GB (floor) | ~1.16B tokens | ~146 MB
10 GB | ~2.33B tokens | ~291 MB
20 GB | ~4.65B tokens | ~582 MB
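The reach column can be re-derived from the two per-slice figures quoted above. The short check below assumes decimal gigabytes (10^9 bytes), which is the reading that reproduces the roughly 455,000 slices per gigabyte stated in the text; the function names are illustrative.

```python
SLICE_BYTES = 2_200      # ~2.2 KB per encoded slice (from the text)
TOKENS_PER_SLICE = 512   # ~512 tokens per slice (from the text)
BYTES_PER_GB = 10**9     # assumption: decimal gigabytes


def slices_per_gb() -> float:
    """How many encoded slices fit in one gigabyte of pool."""
    return BYTES_PER_GB / SLICE_BYTES


def reach_tokens(pool_gb: float) -> float:
    """Tokens of encoded reach for a pool of the given size."""
    return pool_gb * slices_per_gb() * TOKENS_PER_SLICE


for gb in (5, 10, 20):
    print(f"{gb:>2} GB -> ~{reach_tokens(gb) / 1e9:.2f}B tokens")
```

Because both inputs are approximate, the results are estimates: a slice that compresses worse than ~2.2 KB, or covers fewer than ~512 tokens, lowers reach proportionally.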

RAM stays predictable because vectors are memory-mapped on disk: RAM ≈ 180 MB base + 29 MB per GB of pool + 30 MB per session. A bigger pool buys reach, not concurrent sessions — those are RAM-bound either way.
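The RAM rule of thumb is easy to turn into a sizing helper. The function name `estimated_ram_mb` is hypothetical; the three coefficients come directly from the formula above.

```python
def estimated_ram_mb(pool_gb: float, sessions: int = 1) -> float:
    """RAM estimate in MB: 180 base + 29 per GB of pool + 30 per session."""
    return 180 + 29 * pool_gb + 30 * sessions


for pool_gb, sessions in ((5, 1), (10, 1), (20, 2)):
    total = estimated_ram_mb(pool_gb, sessions)
    print(f"{pool_gb} GB pool, {sessions} session(s): ~{total:.0f} MB")
```

Note that the "Resident index RAM" column in the table appears to correspond to the per-GB term alone (about 29 MB × pool size), so the total footprint for a running session is higher than that column by the base and per-session terms.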

Quickstart

pip install aether-context

# then, in Python:
from aether_context import Session

# Wrap a local Ollama model with a 5 GB on-disk context pool.
s = Session(model="ollama/qwen2.5", pool_gb=5)
s.run("Build me a full-stack weightlifting tracker app.")
# runs long. stays coherent. walk away.

The core install is numpy-only and works offline; the Ollama path uses only the Python standard library. llama.cpp and Hugging Face Transformers are opt-in extras (pip install "aether-context[llamacpp]" / [hf]). No GPU, API key, or account required.

Who it's for

Frequently asked questions

What is Unlimited Context?

An open-source engine (aether-context) that gives any local LLM billion-token reach by encoding window overflow to a local on-disk vector pool and paging the right slice back while the model reasons — virtual memory for attention.

How is it different from RAG?

RAG retrieves from a static external corpus before generation. Unlimited Context continuously encodes the model's own live working memory during a run and pages it back concurrently with generation.

Does it run locally and offline?

Yes — numpy-only core, no API key or account, wrapping Ollama, llama.cpp, or Hugging Face. A 5 GB pool holds ~1.16B tokens of encoded reach on disk.

Does "unlimited" mean an infinite attention window?

No. It means reach, not attention. The model keeps its native window; the engine lets it reach a billion-token pool in slices via fast retrieval. Quality rides on retrieval hit rate.

What does it cost?

It's free and open-source under Apache-2.0. The only cost is local disk for the pool.

About the authors

Aether AI, founded by Brandon Barrante, builds local-first, verifiable AI infrastructure. Unlimited Context is the open engine in that stack; the hosted Aether platform layers verified knowledge and frontier-model routing on top of the same engine. Unlimited Context is released as open source under Apache-2.0 at github.com/AetherAi-labs/Unlimited-Context.


References

  1. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. arXiv:2307.03172.
  2. Aggarwal, P., et al. (2024). GEO: Generative Engine Optimization. Proceedings of KDD 2024.
  3. Aether AI (2026). aether-context — open-source context engine, Apache-2.0. github.com/AetherAi-labs/Unlimited-Context.

© 2026 Aether AI · Brandon Barrante. Unlimited Context and aether-context are released under the Apache-2.0 license. aethersystems.net