TRANSLATION & TRANSCRIPTION · CHAMGEI/LABS RESEARCH NOTE 001

English in.
Kalenjin out.
& vice versa.

TONY KIPKEMBOI · 2026-05-13 · UPDATED

We trained a 600M NLLB+LoRA adapter for translation across English, Swahili, and Kalenjin, and a Whisper LoRA for Kalenjin speech-to-text. Open weights, open data, open evaluations. This page describes the method, coverage, and limits. For the live demo, go to chamgei.com.

§ 01 — METHOD

The translation stack is a cascade. We started from NLLB-200-distilled-600M — Meta’s open multilingual model that already speaks English and Swahili — then taught it Kalenjin by attaching a LoRA adapter and fine-tuning on the pair we could actually find at scale: Swahili↔Kalenjin.^[01]

The adapter itself is small. Rank 64, alpha 128, six target modules (q_proj / k_proj / v_proj / out_proj / fc1 / fc2), 34.6M trainable parameters, about 5.33% of the base. Kalenjin has no language code in NLLB’s vocabulary, so the KAL→SW direction is inverted at training time: the source slot is filled with luo_Latn and the target with swh_Latn. The model never knows the “wrong” label; it just learns the mapping.^[02]
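The 34.6M figure can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the training code: the model dimensions (hidden size 1024, feed-forward size 4096, 12 encoder and 12 decoder layers) and the approximate base parameter count are assumptions about NLLB-200-distilled-600M that this note does not state, and it assumes the decoder's cross-attention projections carry the same module names as self-attention and are therefore adapted too.

```python
# Assumed NLLB-200-distilled-600M dimensions (not stated in this note).
D_MODEL = 1024
FFN_DIM = 4096
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
ASSUMED_BASE_PARAMS = 615_000_000  # approximate, an assumption

RANK = 64  # from the note


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """A LoRA pair adds A (rank x d_in) and B (d_out x rank) to one linear layer."""
    return rank * (d_in + d_out)


attn_proj = lora_params(D_MODEL, D_MODEL)  # each of q_proj, k_proj, v_proj, out_proj
fc1 = lora_params(D_MODEL, FFN_DIM)
fc2 = lora_params(FFN_DIM, D_MODEL)

# Encoder layer: one attention block (4 projections) plus fc1 and fc2.
encoder_layer = 4 * attn_proj + fc1 + fc2
# Decoder layer: self-attention and cross-attention (8 projections) plus fc1 and fc2.
decoder_layer = 8 * attn_proj + fc1 + fc2

trainable = ENCODER_LAYERS * encoder_layer + DECODER_LAYERS * decoder_layer
share_of_total = trainable / (ASSUMED_BASE_PARAMS + trainable)

print(f"trainable LoRA parameters: {trainable:,}")  # 34,603,008
print(f"share of base + adapter:   {share_of_total:.2%}")
```

Under these assumptions the count lands on 34.6M, and the share comes out near 5.3% when measured against base plus adapter rather than the base alone.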

We warm-started from the Phase 1d adapter and trained on the thinkKenya kln_swa split — 28,101 sentence pairs — for 42 epochs. The peak chrF++ landed at 68.55 at epoch 32; the last ten epochs added zero. We stopped, packed the checkpoint, and shipped.
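The stopping rule described above (a peak at epoch 32, ten further epochs with no gain, stop at 42) behaves like patience-based early stopping. The sketch below shows that rule on a toy curve. The function name, the patience value of 10, and every score except the 68.55 peak are illustrative assumptions, not logged values from the run.

```python
def early_stop(scores, patience):
    """Return (best_epoch, best_score, stop_epoch) for a list of per-epoch scores.

    Epochs are 1-indexed. Training stops at the first epoch that is
    `patience` epochs past the best one without improving on it.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, best_score, epoch
    return best_epoch, best_score, len(scores)


# Toy curve: rises for 31 epochs, peaks at epoch 32, then plateaus slightly lower.
toy_curve = [40.0 + 0.5 * e for e in range(1, 32)] + [68.55] + [68.0] * 18

best_epoch, best_score, stop_epoch = early_stop(toy_curve, patience=10)
print(best_epoch, best_score, stop_epoch)  # 32 68.55 42
```

The checkpoint to ship is the one saved at `best_epoch`, not the one from `stop_epoch`.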

§ 02 — TEACHING WHISPER KALENJIN

Whisper didn’t know
Kalenjin. Now it does.

Zero-shot, Whisper-large-v3-turbo does not speak Kalenjin. On a held-out slice of the eval set it scored a WER around 124% — worse than transcribing silence, because hallucinating fluent English over Kalenjin audio is a more expensive error than emitting nothing at all.

The corpus was the unlock. The Anv-ke/Kalenjin dataset from AfriVoices-KE — gated, ~521 hours, a 57/43 split between Kipsigis and Nandi — landed in our account on 2026-04-22.^[03]

We ran a smoke test first. A 3-epoch LoRA pass over a sliver of the corpus on Modal cost about $1 and dropped WER from 124% to 68.8%. That was the proof of concept: the adapter shape worked, the data shape worked, the budget worked.

The scaled run came next. 20,000 clips on a single A100, total spend roughly $5.50, WER of 60.5% at epoch 2 (epoch 3 overfit cleanly — train loss kept falling, eval crept back up). The shipped T2 checkpoint comes from epoch 2.

There was a bug halfway through. A base64 decoding edge case in the data loader was silently dropping 1.87% of the rows — not erroring, just skipping. Patched, re-ran the affected shard, 100% rescue rate.
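The general lesson is that a loader should count its failures rather than skip them silently. The note does not say what the base64 edge case was, so the sketch below assumes one common cause (missing padding and stray whitespace) purely for illustration. The field name `audio_b64` and both function names are hypothetical.

```python
import base64
import binascii


def decode_audio_field(value: str) -> bytes:
    """Decode a base64 string, tolerating whitespace and missing '=' padding."""
    cleaned = "".join(value.split())
    cleaned += "=" * (-len(cleaned) % 4)  # restore padding to a multiple of 4
    return base64.b64decode(cleaned, validate=True)


def load_rows(rows):
    """Decode every row and report failures explicitly instead of dropping them."""
    decoded, failures = [], []
    for index, row in enumerate(rows):
        try:
            decoded.append(decode_audio_field(row["audio_b64"]))
        except (binascii.Error, ValueError) as exc:
            failures.append((index, str(exc)))
    return decoded, failures


rows = [
    {"audio_b64": "aGVsbG8="},  # well-formed
    {"audio_b64": "aGVsbG8"},   # padding stripped; rescued by the loader
    {"audio_b64": "!!!!"},      # not base64 at all; reported, not skipped
]
decoded, failures = load_rows(rows)
print(f"decoded {len(decoded)} rows, {len(failures)} failures")
```

Comparing `len(failures)` against the row count after each shard would surface a loss like the 1.87% described above before training starts.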

The live result is 26.4% CER, ~3 seconds of warm latency, with a 60-second clip cap on the endpoint. English LibriSpeech regressed about 4 points on the same checkpoint — tracked, acceptable, and the kind of thing T3 should pull back.^[04]

§ 03 — HOW THE CASCADE WORKS

A pivot, a LoRA,
and a router.

Three pieces, deterministically wired. The base handles EN↔SW. The adapter handles SW↔KAL. A router decides whether a request is direct or has to pivot through Swahili.

STEP 01 · NLLB-200 600M · SW ↔ EN base
STEP 02 · LoRA (r=64, KAL) · 28k pairs, 42 epochs (peak ep 32)
STEP 03 · Router · deterministic dispatch · SW pivot for EN↔KAL
OUTPUT · KAL ↔ SW ↔ EN · 6 directions: 4 direct, 2 via SW pivot
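The dispatch logic is small enough to sketch in full. This is an illustration of deterministic routing with a Swahili pivot, not the production router. The language tags, the `route` and `translate` names, and the `translate_hop` callback that stands in for the base model or the adapter are all hypothetical.

```python
LANGS = ("EN", "SW", "KAL")
PIVOT = "SW"

# Directions served by a single model call: the base covers EN<->SW,
# the LoRA adapter covers SW<->KAL.
DIRECT = {("EN", "SW"), ("SW", "EN"), ("SW", "KAL"), ("KAL", "SW")}


def route(src: str, tgt: str):
    """Return the list of (source, target) hops needed for a request."""
    if src not in LANGS or tgt not in LANGS:
        raise ValueError(f"unsupported pair: {src}->{tgt}")
    if src == tgt:
        return []  # identity: nothing to translate
    if (src, tgt) in DIRECT:
        return [(src, tgt)]
    return [(src, PIVOT), (PIVOT, tgt)]  # cascade through Swahili


def translate(text: str, src: str, tgt: str, translate_hop):
    """Apply each hop in order; translate_hop(text, src, tgt) does the model call."""
    for hop_src, hop_tgt in route(src, tgt):
        text = translate_hop(text, hop_src, hop_tgt)
    return text


pairs = [(s, t) for s in LANGS for t in LANGS if s != t]
direct = [p for p in pairs if len(route(*p)) == 1]
cascaded = [p for p in pairs if len(route(*p)) == 2]
print(len(direct), "direct,", len(cascaded), "cascaded")  # 4 direct, 2 cascaded
```

Because the router is a lookup rather than a learned component, the same request always takes the same path, and errors in a cascaded request can be traced to one specific hop.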

§ 04 — COVERAGE

Three languages in, three languages out. Most pairs are direct; two compose through Swahili. The table below is the full routing matrix the system understands today.

Table 1 · Routing matrix · 6 productive directions

SRC \ TGT   EN         SW         KAL
EN          identity   direct     cascaded
SW          direct     identity   direct
KAL         cascaded   direct     identity

Legend: Direct · Cascaded (SW pivot) · Identity

§ 05 — LIMITS

Where it’s still rough.

Embellishment. NLLB occasionally adds politeness particles not in the source (“please,” “sir,” a softening tag), even when the source is flat.
Rare vocabulary. Culture-specific nouns (Kipsigis names of plants, kinship terms, ritual vocabulary) are poorly covered. The corpus is general, not ethnographic.
Dialect bias. The training set leans Kipsigis and Nandi. Tugen and Marakwet speakers will hear themselves under-represented. Sebei, Pökoot, Endorois too.
No tone marks. The orthography flattens tone. Minimal pairs that carry meaning by tone alone can collide on the page, and therefore in the model.

§ 06 — ARTIFACTS

[01] MODEL CARD · huggingface.co/Tonykip/chamgei-kal2sw-nllb600m
[02] DATASET · huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data
[03] PAPER (DRAFT) · chamgei.com/research/kal-sw-001.pdf
[04] TRAINING REPO · github.com/tonykipkemboi/kalenjin-text-lora
[05] EVAL SET · paper_eval_v1 · internal · 250-row sample of thinkKenya kln_swa/test · seed=42

§ 07 — KALENJIN TTS (PREVIEW)

The [ LISTEN ] tab is a first look at giving Kalenjin a synthetic voice. We started from an open-source English voice model (Orpheus-3B) and trained a small Kalenjin layer on top of it. This is the same approach as the translation work in § 01, applied to speech instead of text. The model reads a written sentence and produces an audio waveform end-to-end; there’s no separate “voice” engine bolted on.

The clips in the tab were rendered ahead of time from Kalenjin sentences the model didn’t see during training, paired with the original speaker reading the same sentence. That lets you compare by ear instead of taking our word for it. This is one short training run on a single GPU — a first attempt, not a polished voice.^[07]

Data

We started from the same 521-hour AfriVoices-KE Kalenjin corpus that powers the speech-to-text work in § 02, and kept only the cleanest scripted audio: clips between 3 and 10 seconds, no clipping or background hiss, no English mixed in, no long silences at the edges. After filtering we were left with 66.9 hours of audio (about 42,000 clips) from speakers whose dialect we could confirm. That includes eight contributors whose dialect was missing from the source spreadsheet and had to be identified by a native Kalenjin speaker listening to a sample of their recordings.^[08]
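The filter criteria above can be written as a single predicate over per-clip metadata. The sketch below is illustrative only: the 3 to 10 second window comes from the note, but the `Clip` field names and the numeric thresholds for clipping, hiss, and edge silence are assumptions, since the note does not give them.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DURATION_S = 3.0   # from the note
MAX_DURATION_S = 10.0  # from the note
# The thresholds below are assumed for illustration, not taken from the note.
MAX_PEAK_AMPLITUDE = 0.99     # at or above this, treat the clip as clipped
MAX_NOISE_FLOOR_DB = -50.0    # louder noise floor counts as background hiss
MAX_EDGE_SILENCE_S = 0.5      # longer leading or trailing silence is rejected


@dataclass
class Clip:
    duration_s: float
    peak_amplitude: float
    noise_floor_db: float
    has_english: bool
    edge_silence_s: float
    dialect: Optional[str]  # None when the dialect could not be confirmed


def keep_clip(clip: Clip) -> bool:
    """True when the clip passes every filter described in the text."""
    return (
        MIN_DURATION_S <= clip.duration_s <= MAX_DURATION_S
        and clip.peak_amplitude < MAX_PEAK_AMPLITUDE
        and clip.noise_floor_db <= MAX_NOISE_FLOOR_DB
        and not clip.has_english
        and clip.edge_silence_s <= MAX_EDGE_SILENCE_S
        and clip.dialect is not None
    )


# Consistency check on the reported totals: 66.9 hours over about 42,000 clips.
mean_clip_s = 66.9 * 3600 / 42_000
print(f"mean clip length: {mean_clip_s:.1f} s")  # 5.7 s
```

The mean of roughly 5.7 seconds sits inside the 3 to 10 second window, so the reported hours and clip count are consistent with the duration filter.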

Dialect coverage

The training audio is split roughly evenly between Nandi (52 % of clips) and Kipsigis (48 %). At the speaker level the balance flips slightly toward Kipsigis, because Nandi contributors recorded more clips each on average. Other Kalenjin dialects are not represented yet.
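Clip-level and speaker-level balance can point in opposite directions, and a toy example makes the mechanism concrete. The speaker counts below are invented for illustration and chosen only so that the clip shares match the 52/48 split; they are not the real contributor counts.

```python
from collections import Counter

# Toy data: (speaker_id, dialect, number_of_clips). Not the real roster.
speakers = (
    [(f"n{i}", "Nandi", 13) for i in range(4)]      # 4 speakers, 52 clips
    + [(f"k{i}", "Kipsigis", 8) for i in range(6)]  # 6 speakers, 48 clips
)


def shares(counter: Counter) -> dict:
    """Convert raw counts into fractions of the total."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


clip_counts = Counter()
speaker_counts = Counter()
for _speaker_id, dialect, n_clips in speakers:
    clip_counts[dialect] += n_clips
    speaker_counts[dialect] += 1

clip_share = shares(clip_counts)
speaker_share = shares(speaker_counts)
print("by clip:   ", clip_share)     # Nandi 0.52, Kipsigis 0.48
print("by speaker:", speaker_share)  # Nandi 0.4, Kipsigis 0.6
```

Which view matters depends on the question: clip share describes what the model hears most, while speaker share describes how many distinct voices back each dialect.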

Limits

Only two dialects so far — Nandi and Kipsigis. Tugen, Marakwet, Sabaot, and Sebei speakers won’t recognize themselves in these clips.
The speakers in the training audio skew male (52 %), tertiary-educated (71 %), and young (18–29 is the largest bracket). The voice will lean that way too.
We haven’t yet had Kalenjin speakers grade these clips against the originals. Until that happens, treat the audio as an early sketch.
There is no separate hidden-test benchmark for this model — the dataset authors keep that audio private — so we can’t report a single headline accuracy number yet.

Artifacts

[01] MERGED 16-BIT · huggingface.co/Tonykip/kalenjin-tts-orpheus-v1
[02] LORA ADAPTER · huggingface.co/Tonykip/kalenjin-tts-orpheus-v1-lora
[03] GGUF · Q5_K_M · huggingface.co/Tonykip/kalenjin-tts-orpheus-v1-gguf
[04] BASE MODEL · unsloth/orpheus-3b-0.1-ft · Apache-2.0
[05] DATASET · AfriVoices-KE · Wanzare et al. 2026 · CC BY 4.0
