Whitepaper · Aether AI

Unlimited Context: Virtual Memory for LLM Attention

How any local large language model gets billion-token reach by encoding overflow to a local pool and paging the right slice back — instead of compress-and-forget.

By Brandon Barrante, Aether AI · Published June 2, 2026 · Last updated June 14, 2026 · Open-source, Apache-2.0

What it is, in one paragraph. Unlimited Context is an open-source engine — the Python package aether-context — that gives any local LLM (via Ollama, llama.cpp, or Hugging Face) reach over roughly a billion tokens of context. When the model's window fills, it does not summarize and discard the overflow. It encodes the overflow into a local, memory-mapped vector pool on disk and pages the relevant slice back into the working window exactly when the model needs it — concurrently with generation. It is virtual memory, for an LLM's attention.

The problem: long runs rot in the middle

Every long agentic run dies the same way. The model fills its context window, begins compressing its own history to make room, and silently drops the one detail that mattered three steps ago. Then it drifts — the runaway pull request, the agent that confidently rewrites a function it already wrote, the build that falls apart at hour two.

Bigger windows only delay the failure. A crammed million-token window suffers from "lost in the middle": transformer models reliably use information at the start and end of a long context far better than information buried in the middle, so a stuffed window quietly degrades even when nothing has been dropped [1]. Two forces compound: compaction loss (summarizing throws away specifics) and positional rot (mid-context facts are under-attended).

The fix: encode & recover, not compress & forget

Unlimited Context fixes the overflow, not the window. Instead of blindly summarizing what spills over, it encodes and externalizes it to a local pool, and retrieves the right slice back on demand. Nothing load-bearing is silently lost — it is filed, and recoverable.

Compress & forget ✗ → Encode & recover ✓
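
As a rough illustration of encode-and-recover, the sketch below files spilled text verbatim next to a retrieval vector and pulls the best match back on demand. Everything in it is an assumption made for illustration: `embed`, `cosine`, `ToyPool`, `spill`, and `recover` are hypothetical names, and the word-count vector is a stand-in for whatever encoder the engine actually uses.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy encoder (an assumption): a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyPool:
    """Keeps each spilled chunk verbatim beside its vector; nothing is summarized."""

    def __init__(self) -> None:
        self.slices: list[tuple[Counter, str]] = []

    def spill(self, text: str) -> None:
        """Encode overflow text and file it in the pool."""
        self.slices.append((embed(text), text))

    def recover(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.slices, key=lambda s: cosine(q, s[0]), reverse=True)
        return [text for _, text in ranked[:k]]


pool = ToyPool()
pool.spill("the database port is 5433 not 5432")
pool.spill("the login button should be green")
pool.spill("use tabs for indentation in the makefile")
recovered = pool.recover("which database port", k=1)
print(recovered)
```

The point of the toy is that the detail ("5433 not 5432") comes back word for word, which a summary would not guarantee.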

How it works: virtual memory for attention

The cleanest way to understand the architecture is to map it onto an operating system's virtual memory:

OS concept	Unlimited Context
RAM (small, fast)	Resident window — what the model sees this turn
Disk (vast, cheap)	Context pool — a memory-mapped vector index on disk, ~5 GB, ~1B tokens
Pager	Slice loader — prefetches the next slice from what the model is reasoning about right now, on a background thread
Page replacement	Retention policy — useful slices stay, stale ones fade, anything relevant again comes back
Encode-on-spill	Encoder — turns overflow into compact retrieval vectors as it streams
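
The "Disk" row rests on memory mapping: the file is addressed like an array, and the operating system loads only the pages that are actually touched. A minimal standard-library sketch follows. The fixed-width float32 record layout, `DIM`, `RECORD`, and `read_vector` are assumptions for illustration, not the engine's real on-disk format.

```python
import mmap
import os
import struct
import tempfile

DIM = 8                              # toy vector width (hypothetical)
RECORD = struct.Struct(f"<{DIM}f")   # one little-endian float32 vector per record


def read_vector(mm: mmap.mmap, index: int) -> tuple[float, ...]:
    """Decode one record straight from the mapping, without reading the whole file."""
    return RECORD.unpack_from(mm, index * RECORD.size)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pool.bin")

    # Write 1,000 toy vectors to disk; vector i is filled with the value i.
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(RECORD.pack(*([float(i)] * DIM)))

    # Map the file read-only and fetch a single vector by index.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            vec = read_vector(mm, 742)
            pool_bytes = len(mm)

print(vec[0], pool_bytes)
```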

Because the pager runs concurrently with generation, hidden behind the model's own thinking, a prefetched slice is already in the warm cache when the model needs it, so reaching the pool adds no extra wall-clock latency on a hit. Retrieval is effectively free in time; what it costs is disk, and it depends on a good retrieval hit rate, which the loader is engineered to keep high.
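
The concurrency idea can be sketched with a background thread that fetches a slice while the caller keeps generating. This is a toy under stated assumptions: `fetch_slice`, `Prefetcher`, and `generate_tokens` are hypothetical names, and the `time.sleep` calls merely imitate pool latency and model thinking time.

```python
import threading
import time


def fetch_slice(topic: str) -> str:
    """Stand-in for a pool lookup; the sleep imitates disk and index latency."""
    time.sleep(0.05)
    return f"slice about {topic}"


class Prefetcher:
    """Loads slices on background threads while the caller keeps working."""

    def __init__(self) -> None:
        self.cache: dict[str, str] = {}
        self._threads: dict[str, threading.Thread] = {}

    def prefetch(self, topic: str) -> None:
        def work() -> None:
            self.cache[topic] = fetch_slice(topic)

        thread = threading.Thread(target=work, daemon=True)
        self._threads[topic] = thread
        thread.start()

    def get(self, topic: str) -> str:
        thread = self._threads.get(topic)
        if thread is not None:
            thread.join()  # returns at once if the prefetch already finished
            return self.cache[topic]
        return fetch_slice(topic)  # miss: the latency is paid inline


def generate_tokens() -> None:
    time.sleep(0.1)  # stand-in for the model producing tokens


pager = Prefetcher()
pager.prefetch("auth schema")   # started before generation
generate_tokens()               # the fetch runs during this call
slice_text = pager.get("auth schema")
print(slice_text)
```

In this toy the fetch finishes while `generate_tokens` is still running, so `get` returns without waiting. A topic that was never prefetched falls through to the inline path and pays the full lookup time, which is why the hit rate matters.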

The five moving parts

Encoder — a fast, local, stateless embedder that turns spilled text into compact retrieval vectors as it streams; no GPU, no network, no model download.
Context pool — a session-namespaced, memory-mapped vector store with a hard size budget. Vectors live on disk; only the index and a hot working set are ever resident in RAM.
Slice loader (pager) — fetches the slices the model is most likely to need next into a warm cache, and tracks its hit rate.
Retention — decides what stays reachable: useful slices stay, stale ones fade, anything relevant again comes back; under budget pressure the least-valuable slices are evicted first.
Session — the lifecycle controller: open a fresh window, stream-encode-and-fade as the model emits, reason over a paged window, then close.
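
One way to picture the retention rule ("least-valuable slices are evicted first") is a score-and-trim pass over the pool. The scoring below is purely an assumption for illustration (hit count decayed by turns since last use); the source does not say how the engine values a slice, and `Slice`, `value`, and `evict_to_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    slice_id: str
    size_bytes: int
    hits: int            # times the slice was paged back in
    last_used_turn: int


def value(s: Slice, current_turn: int, half_life: float = 10.0) -> float:
    """Hypothetical score: hit count, halved for every `half_life` idle turns."""
    age = current_turn - s.last_used_turn
    return (1 + s.hits) * 0.5 ** (age / half_life)


def evict_to_budget(slices: list[Slice], budget_bytes: int, current_turn: int) -> list[Slice]:
    """Drop the lowest-value slices until the pool fits the size budget."""
    kept = sorted(slices, key=lambda s: value(s, current_turn), reverse=True)
    total = sum(s.size_bytes for s in kept)
    while kept and total > budget_bytes:
        total -= kept.pop().size_bytes  # pop() removes the least valuable slice
    return kept


pool = [
    Slice("early-spec", 2200, hits=5, last_used_turn=38),
    Slice("stale-log", 2200, hits=0, last_used_turn=2),
    Slice("api-notes", 2200, hits=1, last_used_turn=30),
]
kept = evict_to_budget(pool, budget_bytes=4400, current_turn=40)
kept_ids = [s.slice_id for s in kept]
print(kept_ids)
```

With a budget of two slices, the unused and long-idle `stale-log` is the one that goes.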

Unlimited Context vs. the alternatives

Long-context approaches make different trade-offs. The comparison below is the fast way to place Unlimited Context against the four common strategies developers reach for.

Approach	Reach	Loses detail?	Cost model	Local / private
Unlimited Context	~1B+ tokens (per 5 GB)	No — encoded & recoverable	Disk + retrieval (one-time encode)	Yes — fully local
Bigger context window	Up to model limit (e.g. 1M)	No, but rots in the middle [1]	Quadratic-ish compute & $ per token	Depends on model
Summarization / compaction	Unbounded in theory	Yes — specifics discarded	Extra LLM calls per compaction	Depends
Vector RAG (static corpus)	Corpus size	No, but not the model's own working memory	Embedding + store	Yes, if self-hosted
Fine-tuning	Baked into weights	N/A (not per-run memory)	Training compute	Yes, if local

The key distinction: of the approaches compared here, Unlimited Context is the only one that continuously externalizes and recovers the model's own live overflow during a single long run, rather than pre-loading a fixed corpus or throwing detail away.

Benchmarks: a real, paid run — not a slogan

The headline benchmark is a real, paid, end-to-end run, committed to the repository. A reasoning model — deepseek-v4-pro, via OpenRouter — was driven through a 40-turn agent session that overflows its context window (a deliberately small 2,000-token window, working 60 real microsoft/vscode GitHub issues), measured with the engine off versus on. One live run, total spend $0.19, June 14, 2026.

Figure: Cumulative cost and recall coherence versus turn, engine off vs on (deepseek-v4-pro, 40-turn window-overflow session). Once the early reads fall out of the window, the baseline forgets and drifts; the engine holds coherence flat at 1.00.

Three results, straight from the run:

The model stops forgetting. Recall coherence of early facts after they fall out of the window: 0.15 → 1.00 (6.7×). The baseline drifts and forgets; the engine holds every early fact, with zero drift.
Failure turns into success on the real work. Tasks completed correctly: 3 / 20 → 20 / 20. The job is only done right with the engine.
Cheaper, not just better. −24% total session cost, −54% in the recall back-half — the engine sends a compact recalled slice instead of dragging the whole transcript into every call.

Metric	Off (baseline)	On (engine)	Change
Recall coherence (early facts still correct)	0.15	1.00	6.7×
Work outcome (tasks done right)	3 / 20	20 / 20	3 → 20
Cost — full session	$0.0711	$0.0542	−24%
Cost — back half (recall phase)	$0.00117/turn	$0.00053/turn	−54%
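
The "Change" column can be rechecked from the table's own figures. The helper names below are hypothetical, and the inputs are the rounded values as printed, so the back-half figure only agrees with the reported −54% to within rounding.

```python
def ratio(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


def pct_change(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100


coherence_gain = ratio(0.15, 1.00)                 # recall coherence, off vs on
session_cost_change = pct_change(0.0711, 0.0542)   # full-session cost
back_half_change = pct_change(0.00117, 0.00053)    # per-turn cost, recall phase

print(f"coherence gain: {coherence_gain:.1f}x")
print(f"session cost:   {session_cost_change:.1f}%")
print(f"back-half cost: {back_half_change:.1f}%")
```

From the rounded per-turn costs the back-half change comes out slightly below −54%, which is consistent with the reported number having been computed from unrounded costs.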

Scope, honestly: this measures the engine (retrieve-on-overflow memory). The 2,000-token window is deliberately tiny to force overflow, so a realistic window should show a smaller gain, though still a real one. The recall results cover N = 20 turns from a single run, supervised under a $25 hard cap. Committed data and the exact reproduce command live in the repository: python -m bench.api_eval --model deepseek/deepseek-v4-pro --repo microsoft/vscode --arms off,on --plot.

A second, hermetic benchmark proves the mechanism offline in CI with no API spend: python bench/drift_vs_window.py --model ollama/qwen2.5 runs the same base model on a long scripted build, engine on vs off, and reports drift, per-stage correctness, retrieval hit rate, and unattended completion.

The numbers behind the reach

"Billion-token memory" is derived, not a slogan. Each encoded slice is ~2.2 KB (a compact vector plus compressed text and metadata) and represents ~512 tokens. That works out to roughly 455,000 slices per gigabyte → ~233 million tokens of reach per gigabyte, so reach ≈ pool_GB × 233M.

Pool	Encoded reach	Resident index RAM
5 GB (floor)	~1.16B tokens	~146 MB
10 GB	~2.33B tokens	~291 MB
20 GB	~4.65B tokens	~582 MB
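
The reach column follows directly from the slice size and tokens per slice quoted above. A small sketch of that arithmetic, assuming decimal gigabytes (10^9 bytes), which is the reading that reproduces the quoted figures; the constant and function names are hypothetical.

```python
SLICE_BYTES = 2_200            # ~2.2 KB per encoded slice (from the text)
TOKENS_PER_SLICE = 512         # ~512 tokens per slice (from the text)
BYTES_PER_GB = 1_000_000_000   # assumption: decimal gigabytes


def slices_per_gb() -> int:
    """Whole slices that fit in one gigabyte of pool."""
    return BYTES_PER_GB // SLICE_BYTES


def reach_tokens(pool_gb: int) -> int:
    """Approximate tokens of encoded reach for a pool of the given size."""
    return pool_gb * slices_per_gb() * TOKENS_PER_SLICE


for gb in (5, 10, 20):
    print(f"{gb:>2} GB -> ~{reach_tokens(gb) / 1e9:.2f}B tokens")
```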

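RAM stays predictable because vectors are memory-mapped on disk: RAM ≈ 180 MB base + 29 MB per GB of pool + 30 MB per session. The "Resident index RAM" column above is the pool-dependent term alone (about 29 MB per GB); the base and per-session terms come on top of it. A bigger pool buys reach, not concurrent sessions; those are RAM-bound either way.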
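
The RAM formula can be turned into a quick estimator for sizing a machine. The function name is hypothetical, and the coefficients are the rounded ones from the formula, so results are approximate.

```python
BASE_MB = 180          # fixed overhead
PER_POOL_GB_MB = 29    # resident index per GB of pool
PER_SESSION_MB = 30    # per active session


def estimated_ram_mb(pool_gb: int, sessions: int = 1) -> int:
    """Approximate resident RAM in MB for a pool size and session count."""
    return BASE_MB + PER_POOL_GB_MB * pool_gb + PER_SESSION_MB * sessions


print(estimated_ram_mb(5))       # 5 GB pool, one session
print(estimated_ram_mb(20, 2))   # 20 GB pool, two sessions
```

Under this formula a 5 GB pool with one session comes to roughly 355 MB in total, most of it the fixed base rather than the index.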

Quickstart

pip install aether-context

from aether_context import Session

s = Session(model="ollama/qwen2.5", pool_gb=5)
s.run("Build me a full-stack weightlifting tracker app.")
# runs long. stays coherent. walk away.

The core install is numpy-only and works offline; the Ollama path uses only the Python standard library. llama.cpp and Hugging Face Transformers are opt-in extras: pip install "aether-context[llamacpp]" or pip install "aether-context[hf]". No GPU, API key, or account required.

Who it's for

Local-LLM developers running Llama, Qwen, Mistral, or Phi who want long, coherent agentic runs without a frontier API bill.
Agent builders whose autonomous loops drift after an hour because the window compacts.
Privacy-first teams who need context that never leaves the machine.

Frequently asked questions

What is Unlimited Context?

An open-source engine (aether-context) that gives any local LLM billion-token reach by encoding window overflow to a local on-disk vector pool and paging the right slice back while the model reasons — virtual memory for attention.

How is it different from RAG?

RAG retrieves from a static external corpus before generation. Unlimited Context continuously encodes the model's own live working memory during a run and pages it back concurrently with generation.

Does it run locally and offline?

Yes — numpy-only core, no API key or account, wrapping Ollama, llama.cpp, or Hugging Face. A 5 GB pool holds ~1.16B tokens of encoded reach on disk.

Does "unlimited" mean an infinite attention window?

No. It means reach, not attention. The model keeps its native window; the engine lets it reach a billion-token pool in slices via fast retrieval. Quality rides on retrieval hit rate.

What does it cost?

It's free and open-source under Apache-2.0. The only cost is local disk for the pool.

About the authors

Aether AI, founded by Brandon Barrante, builds local-first, verifiable AI infrastructure. Unlimited Context is the open engine in that stack; the hosted Aether platform layers verified knowledge and frontier-model routing on top of the same engine. Unlimited Context is released as open source under Apache-2.0 at github.com/DBarr3/Unlimited-Context-LLM.

References

[1] Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. arXiv:2307.03172.
[2] Aggarwal, P., et al. (2024). GEO: Generative Engine Optimization. Proceedings of KDD 2024.
[3] Aether AI (2026). aether-context: open-source context engine, Apache-2.0. github.com/DBarr3/Unlimited-Context-LLM.