A small, fast, local-first AI assistant built on RWKV-7 with ternary BitLinear quantization. Shipped in three sizes, ~50M to ~150M parameters. Train cheaply on a Google Cloud L4; run locally on Linux, macOS (Apple Silicon), or Windows.
All Rust. No C++ compiler. No cloud calls at inference time. Apache 2.0.
tinybit is a small assistant that is useful because it is optimized, measurable, tool-aware, and honest about its limits. It is meant to be hacked on and inspected.
It is not a ChatGPT clone. A 50M-parameter model is not going to match a large cloud model on knowledge or reasoning, and tinybit doesn't pretend otherwise. It wins by being small, local, fast, controllable, easy to train, and easy to read.
V1.0 is local CLI inference first: chat, eval, train, convert,
download. There is intentionally no HTTP server.
- RWKV-7 architecture: recurrent, O(1) memory at inference (no KV cache); state is a fixed-size matrix per layer regardless of context length.
- Two model families: general (`micro`/`bit`/`qbit`) and coding (`*-coding`). Same architecture, different training data + persona. See MODELS.md.
- BitLinear quantization: ternary weights `{-1, 0, +1}` (BitNet b1.58 STE), packed on export.
- Built-in local tools: calculator, time, todos, notes, calendar (SQLite-backed, user-extensible) via a stable token protocol.
- Stable prompt format: the exact `system:`/`user:`/`assistant:` template used in `tinybit chat` is the one the training data is formatted with.
- Cheap training: a 50M `micro` model is ~1.9 days / ~$32 on a single L4 (measured post-optimization; see TRAINING.md).
- Fused WKV CUDA kernel: the RWKV-7 scan runs as a single fused forward/backward op on CUDA (≈25× the naive candle loop for the scan).
- Real evals: `tinybit eval` reports perplexity and runs generation sanity prompts so quality is measured, not guessed.
# Prerequisites: Rust stable (see rust-toolchain.toml)
git clone <this-repo> && cd tinybit
cargo build --release --workspace
# GPU build — add --features cuda (needs the CUDA 12.8 toolkit; on Windows build
# from a VS dev shell so nvcc finds cl.exe — see CLAUDE.md decision 12). Big win
# for TRAINING and `eval` (batched: ~30x on an A2000); eval/chat/train auto-detect
# the GPU. Interactive `chat` of a tiny model is launch-bound — often FASTER on CPU
# (force CPU on a cuda build with CUDA_VISIBLE_DEVICES=-1).
# cargo build --release --workspace --features cuda
# 1. Get the tokenizer
./target/release/tinybit download --output .
# 2. Verify the training pipeline locally (CPU, a few minutes)
./target/release/tinybit train \
--model-config configs/micro.toml \
--train-config configs/train-micro-l4.toml --smoke-test
# 3. After you have a checkpoint, chat with it
./target/release/tinybit chat \
--config configs/micro.toml \
--model checkpoints/step_0025000.safetensors
# 4. Measure quality
./target/release/tinybit eval \
--config configs/micro.toml \
--model checkpoints/step_0025000.safetensors \
--data data/val.bin

tinybit V1.0 ships no pretrained weights; you train your own. The 50M
`micro` run is the documented target (see TRAINING.md).
Windows: the same commands work in PowerShell. Use `.\target\release\tinybit.exe` and `.\scripts\*.ps1` (e.g. `.\scripts\prepare_data.ps1`, `.\scripts\gcp_launch.ps1`). Every `scripts/*.sh` has a `.ps1` sibling; only the in-VM `scripts/cloud/startup.sh` stays bash (it runs on the Linux training VM). No WSL or Git Bash required.
| Command | What it does |
|---|---|
| `tinybit chat` | Interactive local REPL. Streams tokens, supports tools, slash-commands (`/help`, `/reset`, `/system`, `/save`). |
| `tinybit eval` | Perplexity over a token file + greedy generation sanity prompts (with tok/s). |
| `tinybit train` | Train from a model+train config. `--smoke-test` runs a short sanity loop; `--resume` continues from the latest checkpoint. |
| `tinybit convert` | Export a checkpoint; `--quantize` packs 2D matrices to ternary (smaller on disk). |
| `tinybit download` | Fetch `tokenizer.json` from HuggingFace. |
Run tinybit <command> --help for all flags.
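For readers unfamiliar with the metric `tinybit eval` reports, here is a minimal sketch of the standard perplexity definition: the exponential of the mean per-token negative log-likelihood. The function name and the `f64` slice input are illustrative assumptions, not tinybit's actual eval code.

```rust
/// Perplexity from per-token negative log-likelihoods (natural log).
/// Returns None for an empty slice, where the mean is undefined.
/// Illustrative helper; not taken from the tinybit source.
fn perplexity(token_nll: &[f64]) -> Option<f64> {
    if token_nll.is_empty() {
        return None;
    }
    let mean = token_nll.iter().sum::<f64>() / token_nll.len() as f64;
    Some(mean.exp())
}
```

A model that assigns probability 1/2 to every target token has a per-token loss of ln 2 and therefore a perplexity of 2.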
Three sizes × two families. Architecture is shared; the family is the training-data mix + default persona. Full details in MODELS.md.
| Preset | Params | Layers | d_model | d_ffn | L4-trainable |
|---|---|---|---|---|---|
| micro | ~50M | 16 | 384 | 1344 | ✅ (main target, batch 11) |
| bit | ~100M | 12 | 640 | 2240 | ✅ (train-bit-l4.toml) |
| qbit | ~150M | 13 | 768 | 2688 | ✅ (train-qbit-l4.toml) |
Each has a -coding sibling (configs/<size>-coding.toml) trained on a
code-heavy data mix. All use a 3.5× FFN expansion (d_ffn = 3.5 × d_model).
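The 3.5× rule can be checked against the preset table with integer arithmetic. This is a small sketch; the helper name `d_ffn` is hypothetical and only mirrors the formula stated above.

```rust
/// FFN width for a 3.5x expansion, computed as d_model * 7 / 2 in integers
/// to avoid float rounding. Returns None when d_model is odd, since
/// 3.5 * d_model would not be a whole number.
/// Hypothetical helper for checking the preset table.
fn d_ffn(d_model: u32) -> Option<u32> {
    if d_model % 2 == 0 {
        Some(d_model / 2 * 7)
    } else {
        None
    }
}
```

With the table's `d_model` values of 384, 640, and 768 this gives 1344, 2240, and 2688, matching the `d_ffn` column.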
The model calls tools with a stable token protocol:
<|tool_call|>{"tool":"calculator","args":{"expr":"2^10"}}<|end_tool_call|>
The runtime executes the tool and injects:
<|tool_result|>1024<|end_tool_result|>
| Tool | Description |
|---|---|
| `time` | Current date, time, timezone |
| `calculator` | Math via evalexpr (e.g. `2+2`, `12^2`, `sqrt(144)`) |
| `todos` | Add / list / complete / delete tasks (SQLite) |
| `notes` | Save and full-text search notes (SQLite FTS5) |
| `calendar` | Add / list / delete events (SQLite) |
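The token protocol above can be handled with plain string scanning. The sketch below shows one way a runtime might detect a call and wrap a result; the function names are hypothetical, and only the four marker strings come from this README. JSON parsing of the payload is left out.

```rust
const CALL_OPEN: &str = "<|tool_call|>";
const CALL_CLOSE: &str = "<|end_tool_call|>";

/// Returns the JSON payload of the first complete tool call in `text`.
/// Returns None if there is no call or the closing marker is missing.
fn extract_tool_call(text: &str) -> Option<&str> {
    let start = text.find(CALL_OPEN)? + CALL_OPEN.len();
    let end = text[start..].find(CALL_CLOSE)? + start;
    Some(&text[start..end])
}

/// Wraps a tool's output in the result markers that get injected
/// back into the model's context.
fn format_tool_result(output: &str) -> String {
    format!("<|tool_result|>{}<|end_tool_result|>", output)
}
```

Requiring the closing marker before acting means a call that is still being streamed is not executed early.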
Add your own by implementing the Tool trait in tinybit-tools and registering
it. Note: reliable tool calling depends on instruction/tool fine-tuning — the
plumbing (detect → execute → inject → continue) is complete and tested, but a
base-pretrained model emits tool calls only loosely. See Known limitations.
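As a rough picture of what "implement the trait and register it" could look like, here is a self-contained sketch. The trait shape, the `Registry` type, and the `Shout` tool are all hypothetical; the real `Tool` trait in `tinybit-tools` may have different methods and signatures, so check the crate before copying this.

```rust
use std::collections::HashMap;

/// Hypothetical trait shape; the real `Tool` trait may differ.
trait Tool {
    fn name(&self) -> &str;
    /// `args` is the raw JSON text of the call's "args" field.
    fn call(&self, args: &str) -> Result<String, String>;
}

/// Example tool: returns its arguments in upper case.
struct Shout;

impl Tool for Shout {
    fn name(&self) -> &str {
        "shout"
    }
    fn call(&self, args: &str) -> Result<String, String> {
        Ok(args.to_uppercase())
    }
}

/// Hypothetical registry keyed by tool name.
#[derive(Default)]
struct Registry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl Registry {
    fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    /// Looks up a tool by name and runs it; unknown names are an error.
    fn dispatch(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(tool) => tool.call(args),
            None => Err(format!("unknown tool: {}", name)),
        }
    }
}
```

Returning an error for unknown names gives the runtime something to inject as a tool result when a model emits a tool name that was never registered.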
Full guide: TRAINING.md. In short:
# Prepare data (general or coding mix)
DATA_PROFILE=general TOTAL_TOKENS=1500000000 ./scripts/prepare_data.sh data/
# Train locally or launch an L4 on GCP
DATA_TOKENS=1500000000 TRAIN_CONFIG=configs/train-micro-l4.toml \
./scripts/gcp_launch.sh micro
# Windows / PowerShell equivalents
$env:DATA_PROFILE = "general"; $env:TOTAL_TOKENS = "1500000000"
.\scripts\prepare_data.ps1 data
$env:DATA_TOKENS = "1500000000"; $env:TRAIN_CONFIG = "configs/train-micro-l4.toml"
.\scripts\gcp_launch.ps1 micro

- Optimizer: AdamW by default (validated). Muon is opt-in (`optimizer = "muon"`) for 2D weight matrices and is experimental.
- LR schedule: WSD (warmup → stable → cosine decay).
- Checkpoints: safetensors + JSON meta; the trainer keeps the best-3 and most-recent-3.
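The WSD schedule named in the list above can be sketched as a pure function of the step count. The parameter names, the linear warmup from zero, and the `min_lr` floor are assumptions for illustration; tinybit's trainer config may define the phases differently.

```rust
use std::f64::consts::PI;

/// Learning rate at `step` for a warmup -> stable -> cosine-decay schedule.
/// `warmup`, `stable`, and `decay` are phase lengths in steps.
/// Illustrative sketch; not the tinybit scheduler.
fn wsd_lr(step: u64, warmup: u64, stable: u64, decay: u64, peak: f64, min_lr: f64) -> f64 {
    // Phase 1: linear warmup from 0 to `peak`.
    if warmup > 0 && step < warmup {
        return peak * step as f64 / warmup as f64;
    }
    // Phase 2: hold at `peak`.
    let after_warmup = step - warmup;
    if after_warmup < stable {
        return peak;
    }
    // Phase 3: cosine decay from `peak` to `min_lr`, then stay at `min_lr`.
    let into_decay = after_warmup - stable;
    if decay == 0 || into_decay >= decay {
        return min_lr;
    }
    let progress = into_decay as f64 / decay as f64;
    min_lr + 0.5 * (peak - min_lr) * (1.0 + (PI * progress).cos())
}
```

Halfway through the decay phase the cosine term is zero, so the rate sits at the midpoint between `peak` and `min_lr`.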
# Re-save as safetensors (identity copy)
tinybit convert --input ckpt.safetensors --output model.safetensors
# Pack 2D weight matrices to ternary (smaller on disk; ~5 weights/byte, base-3)
tinybit convert --input ckpt.safetensors --output model.q.safetensors --quantize

`--quantize` packs every 2D weight matrix to ternary {−1, 0, +1} at ~1.6
bits/weight (five values per byte, base-3); the tied embedding/LM-head stays f32.
On the 50M micro checkpoint that is ~3.2× smaller on disk (the f32 embedding
is the floor). Loading dequantizes back to f32 and runs the normal path — a
true ternary-matmul runtime is future work, so it's a storage/format win, not an
inference-speed win.
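The "five values per byte, base-3" figure follows from 3^5 = 243 ≤ 256, which gives 8 / 5 = 1.6 bits per weight. Below is a minimal sketch of such a packing. The digit order (first weight in the lowest digit) and the function names are assumptions; tinybit's on-disk layout may differ.

```rust
/// Packs ternary weights (-1, 0, +1) five to a byte as base-3 digits.
/// Each weight w is stored as the digit w + 1, so a full byte is at most 242.
/// Assumed layout for illustration; not necessarily tinybit's format.
fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    weights
        .chunks(5)
        .map(|chunk| {
            // Fold from the last weight so the first ends up as the lowest digit.
            chunk.iter().rev().fold(0u8, |acc, &w| {
                debug_assert!((-1..=1).contains(&w));
                acc * 3 + (w + 1) as u8
            })
        })
        .collect()
}

/// Unpacks `len` ternary weights from bytes produced by `pack_ternary`.
/// `len` is needed because the final byte may hold fewer than five weights.
fn unpack_ternary(bytes: &[u8], len: usize) -> Vec<i8> {
    let mut out = Vec::with_capacity(len);
    for &byte in bytes {
        let mut b = byte;
        for _ in 0..5 {
            if out.len() == len {
                break;
            }
            out.push((b % 3) as i8 - 1);
            b /= 3;
        }
    }
    out
}
```

Seven weights pack into two bytes, and unpacking with the original length returns them unchanged.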
Caveat — this is post-training quantization. For an f32-trained model (e.g.
micro, ternary_ffn = false) ternarizing the weights is lossy: measured
perplexity rises ~83 → ~590 on the micro checkpoint. Near-lossless ternary needs
quantization-aware training — the bit/qbit configs with ternary_ffn = true,
trained from scratch so the STE learns ternary-friendly weights.
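To see why post-training ternarization loses information, here is a sketch of absmean rounding in the style of BitNet b1.58: scale by the mean absolute weight, round, and clamp to {−1, 0, +1}. Whether tinybit's `--quantize` uses this exact rule is an assumption, and the function name is illustrative.

```rust
/// Absmean ternarization: scale = mean(|w|), q = clamp(round(w / scale), -1, 1).
/// Returns the ternary values and the scale needed to dequantize (q * scale).
/// Illustrative sketch; not taken from the tinybit source.
fn ternarize(weights: &[f32]) -> (Vec<i8>, f32) {
    let n = weights.len().max(1) as f32;
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / n;
    if scale == 0.0 {
        // All-zero (or empty) input: nothing to scale.
        return (vec![0; weights.len()], 0.0);
    }
    let q: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale)
}
```

With weights `[0.9, -1.1, 0.05, 2.0]` the scale is about 1.0125 and the result is `[1, -1, 0, 1]`: the 0.05 becomes 0 and the 2.0 is clipped to the same value as 0.9. A model trained in f32 never learned to tolerate that collapse, which is the gap quantization-aware training closes.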
When to use it (rarely, at these sizes): at 50–150M params the f32 file is
already small and quantizing gives no speed win while costing quality — so for
deployment, don't bother. For throughput (training, batched eval) build
--features cuda and use a GPU (measured ~30× over CPU on an A2000); keep weights
full-precision. Note interactive chat (single-token decode) of a tiny model
is launch-bound and is often faster on CPU — see the speed note below. Reach for
--quantize only to ship a bit/qbit model trained ternary-aware.
Token IDs (B, T)
│ EmbeddingHead.embed()
Hidden (B, T, D)
│ × num_layers: LN → TimeMix (WKV scan) → +res → LN → ChannelMix → +res
│ EmbeddingHead.lm_head() (tied weights, scaled by 1/√d_model)
Logits (B, T, vocab_size)
- TimeMix (RWKV-7 WKV): token-shifted r/k/v/gate; recurrent state `S_t = S_{t-1}·exp(-exp(time_decay)) + kᵀv`; readout `y = r·S`, group-normed, gated. On CUDA this scan is a fused kernel; on CPU it's a sequential loop (numerically equivalent).
- ChannelMix (FFN): `k = SiLU(W_k·x)`, `v = W_v·k²`, `r = σ(W_r·x)`, `out = r·v`.
- BitLinear: RMSNorm + linear; full-precision master weights in training (STE), ternary on export.
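The TimeMix recurrence above can be written as a short sequential loop for a single head. This sketch assumes a per-key-channel decay vector and a readout taken after the state update. It omits token shift, group norm, and gating, and is not the code in `tinybit-core`.

```rust
/// Sequential single-head scan of
///   S_t = S_{t-1} * exp(-exp(time_decay)) + k_t^T v_t,   y_t = r_t * S_t
/// `state` is a row-major (d_k x d_v) matrix, updated in place, so memory
/// use does not grow with sequence length.
/// Illustrative sketch under the assumptions stated in the lead-in.
fn wkv_scan(
    r: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    time_decay: &[f32],
    state: &mut [f32],
) -> Vec<Vec<f32>> {
    let d_k = time_decay.len();
    assert!(d_k > 0 && state.len() % d_k == 0);
    let d_v = state.len() / d_k;
    // Per-channel decay factor in (0, 1): exp(-exp(time_decay)).
    let decay: Vec<f32> = time_decay.iter().map(|w| (-w.exp()).exp()).collect();
    let mut ys = Vec::with_capacity(r.len());
    for t in 0..r.len() {
        let mut y = vec![0.0f32; d_v];
        for i in 0..d_k {
            for j in 0..d_v {
                let s = &mut state[i * d_v + j];
                // Decay the old state, then add the outer product k_t^T v_t.
                *s = *s * decay[i] + k[t][i] * v[t][j];
                // Readout: y_t[j] = sum_i r_t[i] * S_t[i][j].
                y[j] += r[t][i] * *s;
            }
        }
        ys.push(y);
    }
    ys
}
```

Because the state is carried in `state`, running the scan over a sequence in two halves gives the same outputs as one pass, which is what makes single-token decoding possible without a KV cache.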
LayerNorms are a hand-rolled differentiable implementation (candle's fused
LayerNorm has no backward — using it silently froze training before this was
fixed; tests/grad_flow.rs guards against regressing it).
cargo test --workspace

Gate tests (must pass before any real run):
- `test_forward_shapes_nano`: logits shape + finite check
- `test_inference_step_matches_train`: step-by-step matches full-sequence forward
- `smoke_train_nano_100_steps`: initial loss ≤ 2× ln(vocab)
- `grad_flow` (in `tinybit-core`): every parameter receives a finite gradient and the model overfits a fixed batch
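The `2× ln(vocab)` bound in the smoke test has a simple rationale: a model that predicts uniformly over the vocabulary has cross-entropy ln(vocab), so a freshly initialized model should start near that value. A sketch of the check follows; the function names and the extra finiteness test are assumptions, not the test's actual code.

```rust
/// Cross-entropy of a uniform prediction over `vocab_size` tokens.
fn uniform_loss(vocab_size: usize) -> f64 {
    (vocab_size as f64).ln()
}

/// Initial-loss gate: finite and at most twice the uniform-prediction loss.
/// The finiteness check is an addition for illustration.
fn initial_loss_ok(loss: f64, vocab_size: usize) -> bool {
    loss.is_finite() && loss <= 2.0 * uniform_loss(vocab_size)
}
```

For a hypothetical vocabulary of 32,000 tokens the uniform loss is about 10.37, so the gate passes any finite initial loss up to roughly 20.7.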
- Small models are small. Expect coherent short text and simple instruction following, not reliable factuality or multi-step reasoning. `tinybit eval` prints perplexity so you can see where a checkpoint actually lands.
- No pretrained weights ship with V1.0. You train your own.
- Tool calling needs fine-tuning to be reliable. The detect/execute/inject loop is complete and tested; a base-pretrained model still emits tool calls only loosely.
- Quantization is storage-only and usually not worth it at these sizes. The packed file loads back as f32 (no speed gain), and post-hoc ternary on an f32-trained model is lossy. A true ternary-matmul runtime is future work.
- GPU helps throughput, not tiny-model decode (measured). On an A2000, batched eval/training is ~30× faster on GPU, but single-token `chat` is ~3.5× slower than CPU (~21 vs ~75 tok/s) because it is launch-latency bound at 50M. Use GPU for train/eval; CPU is fine (often faster) for interactive chat.
- No GGUF / llama.cpp export. tinybit's custom RWKV-7 + BitLinear architecture isn't a llama.cpp architecture, so a GGUF file couldn't be loaded there. The stub was removed rather than left as a dead option.
- Muon optimizer is experimental — mechanically correct, quality-at-scale unverified. AdamW is the validated default.
crates/
tinybit-core/ — model, config, tokenizer, state, quantize
tinybit-tools/ — tool trait, registry, built-in tools
tinybit-infer/ — inference engine, sampler, session, tool processor
tinybit-train/ — trainer, optimizer, scheduler, loss, checkpoint, data
tinybit-cli/ — CLI (chat, eval, train, convert, download)
tests/ — integration tests (workspace member)
configs/ — model + training TOML configs (6 model variants)
scripts/ — data prep + GCP L4 launcher (Bash *.sh + PowerShell *.ps1)
Apache 2.0 — see LICENSE.