AVALON-2B
The first sub-3B language model that knows what it doesn't know.
AVALON-2B is the first language model below 3B parameters to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens.
Built on Qwen 3.5 2B, AVALON introduces a five-token reflection vocabulary ([Retrieval], [No Retrieval], [Relevant], [Utility:5], [Utility:4]) and a 22M-parameter MiniLM router that predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. Under LoRA fine-tuning, saving embed_tokens and lm_head alongside the LoRA adapters lifts Self-RAG token accuracy from 12% (naive LoRA) to 82.5%.
We trained on 201,004 synthetic samples (80% Self-RAG, 20% general instruction) for ~6 hours on 8× A100 80GB. The model beats Qwen 3.5 2B base, Gemma 4 E2B, and SmolLM3 3B on its target benchmarks, runs at 40 tok/s on a MacBook Air M3 and 12 tok/s on an iPhone 15 Pro, and ships under Apache 2.0 with weights, GGUF quants, and the synthetic-data pipeline all public.
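The 80/20 mixture is straightforward to reproduce with a weighted interleave. A minimal sketch using the datasets library; the file names are placeholders standing in for the synthetic-data pipeline's output, not the released corpus:

from datasets import load_dataset, interleave_datasets

# Placeholder file names; substitute the pipeline's actual output splits.
self_rag = load_dataset("json", data_files="self_rag_synthetic.jsonl", split="train")
general = load_dataset("json", data_files="general_instruct.jsonl", split="train")

# 80% Self-RAG reflection samples, 20% general instruction samples.
train_mix = interleave_datasets([self_rag, general], probabilities=[0.8, 0.2], seed=42)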
The router fires in parallel with prompt encoding.
Qwen 3.5 2B: 18 Gated DeltaNet linear-attention layers plus 6 full softmax-attention layers, with a 262K native context window. No RoPE hacks, no position-interpolation tricks.
A 22M-parameter MiniLM router predicts retrieval necessity at the query level with 90.5% accuracy in 5 ms. The embedding call runs in parallel with prompt encoding, so the routing decision adds nothing to the critical path.
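A minimal sketch of that overlap, assuming a MiniLM checkpoint from sentence-transformers and a placeholder linear head; the released router's weights and packaging may differ:

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Hypothetical stand-ins for the learned router head.
router = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ROUTER_W = np.zeros(384)  # placeholder weights for the linear head
ROUTER_B = 0.0            # placeholder bias

def needs_retrieval(query: str) -> bool:
    emb = router.encode(query)  # ~5 ms MiniLM embedding
    return float(emb @ ROUTER_W) + ROUTER_B > 0.0

tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
query = "What's in the news today about lithium prices?"

with ThreadPoolExecutor(max_workers=1) as pool:
    route = pool.submit(needs_retrieval, query)      # router runs off the critical path
    inputs = tokenizer(query, return_tensors="pt")   # prompt encoding proceeds meanwhile
    retrieve = route.result()                        # decision is ready by the time encoding finishes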
We trained with LoRA plus modules_to_save=["embed_tokens", "lm_head"]. The reflection tokens occupy freshly added rows of the embedding and output matrices, which vanilla LoRA leaves frozen, so without saving (and training) those two modules the model cannot actually learn the new special tokens: Self-RAG token accuracy stays around 12%. With them saved, it lifts to 82.5%. That single config change is the difference between a paper and a noise floor.
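A minimal PEFT sketch of the recipe; the base-checkpoint path and LoRA hyperparameters are illustrative, not the paper's exact values:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "path/to/qwen-3.5-2b-base"  # placeholder: point at the Qwen 3.5 2B base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.add_special_tokens({"additional_special_tokens": [
    "[Retrieval]", "[No Retrieval]", "[Relevant]", "[Utility:5]", "[Utility:4]",
]})

model = AutoModelForCausalLM.from_pretrained(BASE)
model.resize_token_embeddings(len(tokenizer))  # new rows for the reflection tokens

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                                                     # illustrative rank
    lora_alpha=32,                                            # illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    # The one line that matters: fully train and save the embedding and
    # output matrices so the new reflection-token rows can actually learn.
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()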
Beats models 1.5–2× our size — and adds a capability nobody has at this scale.
Frontier capability that fits in a phone.
Three lines of Python, or one line of bash.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B")
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
system = """You are AVALON, a self-reflective AI assistant.
Before answering: emit [Retrieval] or [No Retrieval].
End every response with [Utility:X] (1-5)."""
prompt = system + "\n\nUser: What's in the news today about lithium prices?"
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=False))  # keep the reflection tokens visible
# → [Retrieval] ... <answer> ... [Utility:5]
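When the model opens with [Retrieval], fetch context before letting it answer. A minimal orchestration sketch continuing from the quickstart above; search_web and the context format are hypothetical stand-ins for your own retriever and prompt template:

def search_web(query):
    # Hypothetical stub: replace with your actual retriever or search API.
    return ["(retrieved passage 1)", "(retrieved passage 2)"]

text = tokenizer.decode(out[0], skip_special_tokens=False)
if "[Retrieval]" in text:
    context = "\n".join(search_web("lithium prices news"))
    augmented = prompt + "\n\nContext:\n" + context  # assumed context format
    out = model.generate(**tokenizer(augmented, return_tensors="pt"), max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=False))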
# install
ollama pull nuroai/avalon-2b
# run
ollama run nuroai/avalon-2b
# 1.5 GB Q4_K_M quant
# ~40 tok/s on a MacBook Air M3
# ~12 tok/s on iPhone 15 Pro

What this model does not do.
GSM8K math drops 4.5%
Reflection-token training shifts probability mass away from chained arithmetic. We recover most of it with mixed-task fine-tuning, but the regression is real and we report it.
English-only
Training corpus is 100% English. The reflection vocabulary should transfer, but we have not validated multilingual Self-RAG behavior — and we don't claim it.
Static, binary retrieval decision at query start
The router emits a single binary [Retrieval] / [No Retrieval] decision per query. Mid-generation re-routing and graded retrieval intensity are on the roadmap, not in this release.
If AVALON-2B helps your work, please cite.
@misc{ponnada2026avalon,
title = {AVALON-2B: Self-Reflective Retrieval-Augmented Generation
in Sub-3B Language Models},
author = {Ponnada, Akhil and Arvapalli, Naga Sri},
year = {2026},
month = {April},
publisher = {Nuro AI Labs},
howpublished = {\url{https://huggingface.co/nuroai/Avalon-2B}},
note = {Apache 2.0}
}

Read the paper. Then run it.
39 pages, NeurIPS format. Authors: Akhil Ponnada · Naga Sri Arvapalli.