
Teaching Gemma-2-2B to Actually Speak Turkish

Emincan Tetik

Around 100 million people speak Turkish. So you'd think the open-source language models everyone's building on would handle it reasonably well. They don't. Ask most of them a question in Turkish and you get back something that's almost right but clearly off: a suffix in the wrong place, a verb that doesn't agree, and every so often the model just gives up and answers half in English.

We got tired of working around that, so we decided to fix it ourselves.

What follows is the story of how we took Google's Gemma-2-2B model and trained it to handle Turkish properly: why it was broken to begin with, what we did about it, and how much it actually improved. There's a technical layer here, but the short version is that you don't need a giant model or a giant budget to make a real difference for a language like Turkish.

Why Turkish Trips Models Up

The core reason is that Turkish is agglutinative. You take a root word and keep stacking suffixes onto it, each one marking something like tense, possession, case, or plurality. Arkadaşlarımızınkilerin (very roughly, "of those belonging to our friends") is one word, and nobody who speaks Turkish would blink at it. If your model learned language mostly from English text, this is exactly the kind of thing it has never really had to deal with.
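
To make the suffix stacking concrete, here is a small Python sketch that builds the word above one piece at a time. The segmentation and the English glosses are a hand-written, illustrative analysis (an assumption for this example, not the output of a morphological analyzer).

```python
# Morpheme-by-morpheme build-up of the example word.
# ASSUMPTION: this segmentation and its glosses are written by hand for
# illustration; they are not produced by any tool used in the project.
MORPHEMES = [
    ("arkadaş", "friend (root)"),
    ("lar", "plural"),
    ("ımız", "our (1st person plural possessive)"),
    ("ın", "genitive case"),
    ("ki", "the one(s) belonging to"),
    ("ler", "plural"),
    ("in", "genitive case"),
]


def build_word(morphemes):
    """Concatenate the pieces; return the final word and every intermediate form."""
    steps = []
    word = ""
    for piece, _gloss in morphemes:
        word += piece
        steps.append(word)
    return word, steps


word, steps = build_word(MORPHEMES)

# Show how each suffix extends the form built so far.
for (piece, gloss), form in zip(MORPHEMES, steps):
    print(f"+{piece:<8} {gloss:<38} -> {form}")
```

One root plus six suffixes yields a single orthographic word, which is why a tokenizer and model shaped mostly by English text have so little practice with forms like this.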

When we ran the base Gemma-2-2B model before doing anything to it, the problems were consistent. It botched verb and noun endings constantly. It would fall back on English word order (Turkish puts the verb at the end; English doesn't). Almost a quarter of the time (23.7%), it switched into English when it should have stayed in Turkish. It lost the thread in longer conversations, and it read idioms literally instead of understanding them.

On ARC-TR, the Turkish version of a standard reasoning benchmark that tests comprehension rather than just grammar, the base model scored 0.188. That number told us what we already felt from poking at it: technically it spoke Turkish, but you wouldn't ship it.

What We Did, Without the Jargon

Training a model from scratch was never on the table. Too expensive, too slow, and unnecessary for what we were trying to do. Instead we used an approach called LoRA (low-rank adaptation), which lets you take an existing model and adapt it without retraining the whole thing. In plain terms: we froze the original model's weights and trained only a small adaptation layer on top that teaches it Turkish. Compared with training from scratch, this is far faster and cheaper, and the original model is preserved as-is.

For the technically curious: we used LoRA at rank 128 with 4-bit quantization, applied across both the attention and MLP layers, trained in BF16 mixed precision on 8×NVIDIA H200 GPUs. Full configs are in the open-source release.
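
For a rough sense of how small that adaptation layer is, the sketch below counts the weights a rank-128 LoRA adapter adds when it is attached to every attention and MLP projection. The layer shapes and module names are assumptions based on the published Gemma-2-2B configuration as commonly loaded through Hugging Face Transformers, not values copied from the release configs, so treat the totals as a ballpark and check them against the actual model.

```python
RANK = 128  # LoRA rank stated in the post

# ASSUMPTION: shapes taken from the public Gemma-2-2B config
# (hidden size 2304, MLP size 9216, 26 layers, 8 query heads and
# 4 key/value heads of dimension 256). Verify against the real config.
HIDDEN = 2304
INTERMEDIATE = 9216
NUM_LAYERS = 26
Q_DIM = 8 * 256
KV_DIM = 4 * 256

# ASSUMPTION: "attention and MLP layers" is read here as these seven
# projections per transformer block; values are (input_dim, output_dim).
TARGETS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """LoRA adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Weights in the frozen projection the adapter sits next to."""
    return d_in * d_out


adapter_total = NUM_LAYERS * sum(
    lora_params(d_in, d_out, RANK) for d_in, d_out in TARGETS.values()
)
frozen_total = NUM_LAYERS * sum(
    full_params(d_in, d_out) for d_in, d_out in TARGETS.values()
)

print(f"Adapter parameters: {adapter_total:,}")
print(f"Frozen weights in the targeted layers: {frozen_total:,}")
print(f"Adapter share: {adapter_total / frozen_total:.1%}")
```

The count ignores embeddings and normalization layers, which LoRA leaves alone in this setup. The point is only that the trainable part is a small fraction of the weights it adapts.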

The Data

We trained on an open Turkish dataset from TÜBİTAK BİLGEM. We used a slice of it (36 of 95 files), which gave us enough variety without dragging training out forever.

The mix is roughly: news (30%), encyclopedic content (25%), literary text (20%), social media and forums (15%), and academic or technical writing (10%). That spread matters because it's what stops the model from sounding like it only ever read one kind of text.

None of it went in raw. We cleaned out HTML and links, filtered anything that wasn't clearly Turkish, removed duplicates, dropped low-quality text, and masked personal information. After all that we were left with about 3.19 million training examples.
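
The cleaning steps above can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline from the release: the regular expressions, the `[EMAIL]` and `[PHONE]` placeholders, the character-based Turkish check, and the `min_chars` and `min_ratio` thresholds are all hypothetical stand-ins for whatever a production pipeline would use (for example, a trained language-identification model).

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")
WS_RE = re.compile(r"\s+")
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")


def clean_text(text):
    """Strip HTML tags and links, mask e-mails and phone numbers, tidy spaces."""
    text = html.unescape(TAG_RE.sub(" ", text))
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return WS_RE.sub(" ", text).strip()


def looks_turkish(text, min_ratio=0.03):
    """Crude, hypothetical check: share of letters that are Turkish-specific."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hits = sum(c in TURKISH_CHARS for c in letters)
    return hits / len(letters) >= min_ratio


def clean_corpus(docs, min_chars=20):
    """Clean each document, drop short or non-Turkish text, remove exact duplicates."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean_text(raw)
        if len(text) < min_chars or not looks_turkish(text):
            continue
        digest = hashlib.sha1(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    "<p>Bugün hava çok güzel, dışarı çıkalım.</p>",
    "Bugün hava çok güzel, dışarı çıkalım.",
    "The weather is nice today, let's go outside.",
    "Bana ayse@example.com adresinden ya da https://example.com/iletisim üzerinden ulaşın.",
    "çok kısa",
]
cleaned = clean_corpus(docs)
for text in cleaned:
    print(text)
```

In this toy run the HTML copy and the plain copy collapse into one example, the English sentence and the too-short fragment are dropped, and the contact details are masked. Real near-duplicate detection and quality filtering would need more than an exact hash and a length cutoff.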

What Changed

Here's the headline comparison on the ARC-TR reasoning benchmark:


| Metric | Base | Fine-tuned | Change |
| --- | --- | --- | --- |
| Accuracy | 0.188 | 0.224 | +19.1% |
| Normalized accuracy | 0.244 | 0.277 | +13.5% |

The benchmark moved, but the more telling improvements showed up in the model's actual output: grammar accuracy went from 67.2% to 92.4% on our Turkish grammar test, the rate of unwanted switching into English fell from 23.7% to 4.2%, Turkish paraphrasing quality more than doubled on standard scoring, and the model stayed coherent over longer passages far more often (71.5% to 87.3%).
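
A note on reading these numbers: the ARC-TR changes are relative improvements over the base score, while the grammar and code-switching figures are easier to read as percentage-point differences. The short sketch below reproduces both kinds of arithmetic from the figures quoted here; the function names are just illustrative.

```python
def relative_change(before, after):
    """Change as a percentage of the starting value."""
    return (after - before) / before * 100


def point_change(before_pct, after_pct):
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


# ARC-TR scores: relative change over the base model.
print(f"Accuracy:            {relative_change(0.188, 0.224):+.1f}%")
print(f"Normalized accuracy: {relative_change(0.244, 0.277):+.1f}%")

# Output-quality figures: differences in percentage points.
print(f"Grammar accuracy:    {point_change(67.2, 92.4):+.1f} points")
print(f"Switch to English:   {point_change(23.7, 4.2):+.1f} points")
print(f"Long-form coherence: {point_change(71.5, 87.3):+.1f} points")
```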

What that adds up to in practice: the model finishes Turkish sentences the way a person would, stops bailing into English, writes summaries that hold together, and handles the kind of suffix gymnastics that used to break it.

What This Says About Smaller Languages

Turkish sits in an awkward middle ground. There's a decent amount of data out there, but nothing close to what English or Chinese have. What our results suggest is that you don't need a giant model or a giant budget to close the gap. A compact model, a sensible approach, and a dataset that's been cleaned up and covers a range of topics will get you a long way.

We think the same recipe would work for plenty of other underserved languages, where the distance between "the model technically works" and "the model sounds native" is still wide.

Where We're Taking It

This is a prototype. It's good, but it's not done.

We only used part of the dataset, so the obvious next step is training on the full thing. We'd also like to do instruction tuning, which is what turns a raw model into something that can follow instructions and hold a proper conversation. On the product side, we want to adapt it specifically for the analytics, customer data, and BI use cases that matter for the B2Metric platform.

Open Source

The training code, configs, and benchmark results are going out publicly. The point is to give the Turkish NLP community something to build on and to leave a reference for anyone tackling a similar low-resource language problem.

If you're working on Turkish NLP, want to extend any of this, or just have questions about how we did it, get in touch. We'd genuinely like to hear from you.

B2Metric is a product analytics and customer data platform. We took this on as part of building AI infrastructure that can actually make sense of Turkish-language data.
