The corpora behind the models.
Two datasets do the heavy lifting at chamgei/labs: a parallel text corpus for translation, and a transcribed audio corpus for speech. A third source, the AfriVoices-KE paper, frames the broader effort to give Kenyan languages first-class data. Sources, sizes, and the licensing notes that matter are below.
Translation · Kalenjin ↔ Swahili
The text LoRA was trained on the kln_swa split of thinkKenya/kenyan-low-resource-language-data — a community-contributed Hugging Face dataset of aligned Kalenjin and Swahili sentence pairs.[01]
Pairs were curated, normalized, and deduplicated before training. The held-out split was used to track chrF++ across the training run. The score peaked at 68.55 chrF++ at epoch 32 and regressed afterwards, so the published r001 checkpoint is the epoch-32 one, and that is the checkpoint we ship.
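The exact cleaning steps are not documented on this page, so the following is only a minimal sketch of what normalizing and deduplicating sentence pairs can look like. It assumes Unicode NFC normalization, whitespace collapsing, and case-insensitive duplicate detection; the function names `normalize` and `dedupe_pairs` are hypothetical and not part of the chamgei/labs pipeline.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def dedupe_pairs(pairs):
    """Normalize (source, target) pairs, then drop empties and duplicates.

    Assumption: two pairs count as duplicates when both sides match
    after normalization, ignoring case. First-seen order is preserved.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # skip pairs where either side is empty
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # an equivalent pair was already kept
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned
```

Deduplicating on the pair rather than on one side keeps legitimate cases where the same source sentence has more than one valid translation.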
Speech · Kalenjin transcripts
The Whisper transcribe LoRA was trained on Anv-ke/Kalenjin — a transcribed Kalenjin speech corpus covering the two largest spoken dialects.[02]
We fine-tuned Whisper-large-v3-turbo with a LoRA adapter on this corpus. The shipped checkpoint reaches 26.4% CER on held-out Kalenjin speech — usable for short-form transcription today, with longer-form and broader-dialect work on the roadmap.
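For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between the model output and the reference transcript, divided by the number of reference characters. The sketch below computes it at corpus level with a plain Levenshtein distance. Whether the reported 26.4% was aggregated this way, and which text normalization was applied before scoring, is not stated here, so treat both as assumptions; `edit_distance` and `cer` are illustrative names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

On this definition a CER of 26.4% means roughly one character edit for every four reference characters, which is why short-form transcription is the realistic use today.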
Field note · AfriVoices-KE
While our adapters were training, the team behind Tech Innovators Network (THiNK) and collaborators across seven Kenyan and East African institutions released AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages — a foundational corpus for inclusive speech tech across five Kenyan languages.[03]
The dataset spans Dholuo, Kikuyu, Kalenjin, Maasai, and Somali, split roughly 75% unscripted and 25% scripted across eleven domains, including agriculture & food, healthcare, financial transactions, digital government services, news & media, education, and everyday scenarios. The Kalenjin portion covers Nandi and Kipsigis, the same two dialects in the corpus we already train on.
Collection was crowd-sourced through a custom mobile app, drawing on 12,000+ active contributors, with automated SNR validation and native-speaker review. The paper is honest about the operational hard parts (app instability on low-memory phones, rural bandwidth, trust-building around personal data), which is the part most dataset papers skip.
Wanzare, L., Amol, C., Maina, E., Odhiambo, N., Kerubo, H., Misula, L., Oloo, V., Mboya, R., Onkoba, E., Ombui, E., Muguro, J., Maina, C. wa, Kipkebut, A., Otom, A. O., Kang’ethe, I. N., Kanyi, A. W., & Omwenga, B. G. (2026). AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages. arXiv:2604.08448 [cs.CL].
A unified chamgei/labs data card with downloads, schemas, and licensing notes lands here when r002 ships.