Building a Machine Learning Model to Assess German Writing Proficiency

When I first set out to build an automated language proficiency assessment tool for German, I had ambitious plans. I wanted to evaluate both speaking and writing skills using state-of-the-art models. However, reality quickly set in.

Running a comprehensive speaking assessment would require three different models, one of which demands GPU acceleration. The computational overhead and infrastructure costs made it impractical for my use case. So I made a strategic decision: focus on what’s feasible and do it well.

Writing assessment, on the other hand, is far more practical. It requires less computational power, can run on standard hardware, and still provides tremendous value for language learners. After all, being able to assess someone’s writing proficiency automatically can help millions of German learners understand their current level and track their progress.

The Dataset

I started with a solid foundation: 2,943 German texts, distributed across the CEFR levels as follows:

A1: 57 texts (beginner)
A2: 306 texts (elementary)
B1: 1,284 texts (intermediate)
B2: 1,118 texts (upper intermediate)
C1: 174 texts (advanced)
C2: 4 texts (mastery)

The first thing you’ll notice? Massive class imbalance. B1 and B2 together account for 2,402 of the 2,943 texts (about 82%), while absolute beginners (A1) and near-native speakers (C2) are rare. This mirrors reality, since most language learners cluster around intermediate levels, but it poses a challenge for machine learning: a classifier can reach a high overall accuracy while rarely predicting the rare classes correctly.

Starting with a Baseline: Traditional ML Features

Before jumping to transformer models and neural networks, I believe in establishing a solid baseline. The question was: How well can we classify German texts using traditional NLP features?

Feature Engineering: Capturing Linguistic Complexity

I designed a feature extractor that captures multiple dimensions of text complexity:

1. Length Metrics

Character count, word count, sentence count
These simple metrics are surprisingly informative; beginners write shorter texts with simpler sentences

2. Lexical Sophistication

Average word length
Type-token ratio (vocabulary diversity)
Long word ratio (>7 characters)
Compound word ratio (share of words with ≥10 characters, a rough proxy for compounds; it is so German!)

3. Sentence Complexity

Average sentence length
Standard deviation of sentence lengths (indicates varied syntax)
Commas per sentence (proxy for complex clauses)

4. Discourse Markers

Basic connectors: und, oder, aber, denn
Advanced connectors: dennoch, folglich, allerdings, hingegen

5. TF-IDF Features

500 most important word unigrams and bigrams
Captures vocabulary patterns at each level

This gave me 515 features in total (15 hand-crafted linguistic features plus 500 TF-IDF features), a rich representation of linguistic complexity.
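The hand-crafted part of the feature set can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the extractor actually used here: the function name `extract_features`, the regex-based tokenization, and the naive sentence splitting are all assumptions, and it covers only the twelve features named in the lists above.

```python
import re
import statistics

# Connector lists come from the article; everything else here is an assumption.
BASIC_CONNECTORS = {"und", "oder", "aber", "denn"}
ADVANCED_CONNECTORS = {"dennoch", "folglich", "allerdings", "hingegen"}
WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")  # letter runs, including umlauts and ß


def extract_features(text: str) -> dict[str, float]:
    """Compute simple length, lexical, sentence, and discourse features."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD_RE.findall(text)
    lowered = [w.lower() for w in words]
    n_words = max(len(words), 1)   # guard against division by zero
    n_sents = max(len(sentences), 1)
    sent_lengths = [len(WORD_RE.findall(s)) for s in sentences]

    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": sum(map(len, words)) / n_words,
        "type_token_ratio": len(set(lowered)) / n_words,
        "long_word_ratio": sum(len(w) > 7 for w in words) / n_words,
        "compound_word_ratio": sum(len(w) >= 10 for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sents,
        "sentence_length_std": statistics.pstdev(sent_lengths) if sent_lengths else 0.0,
        "commas_per_sentence": text.count(",") / n_sents,
        "basic_connectors": sum(w in BASIC_CONNECTORS for w in lowered),
        "advanced_connectors": sum(w in ADVANCED_CONNECTORS for w in lowered),
    }


sample = "Ich lerne Deutsch, aber es ist schwer. Dennoch mache ich weiter."
features = extract_features(sample)
```

A real extractor would likely use a proper sentence tokenizer; the regex split above will stumble on abbreviations such as "z. B.".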

Two Models, One Goal

I trained two classifiers with balanced class weights to handle the imbalance:

1. Logistic Regression

Simple, interpretable, fast
Works well when features are carefully engineered
Good for understanding which features matter most

2. Random Forest

Ensemble of 200 decision trees
Handles non-linear relationships
Provides feature importance rankings
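In scikit-learn terms, the two classifiers could be set up along these lines. This is a sketch under assumptions: it feeds only TF-IDF features (the hand-crafted features are left out for brevity), `build_models` is a hypothetical helper, `max_iter=1000` and `random_state=42` are illustrative choices, and the four-text corpus is made up purely to show the fit and predict calls.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_models(random_state: int = 42) -> dict:
    """Return the two baseline classifiers, each behind its own TF-IDF vectorizer."""
    def tfidf() -> TfidfVectorizer:
        # Up to 500 word unigram and bigram features, as described above.
        return TfidfVectorizer(max_features=500, ngram_range=(1, 2))

    return {
        "logistic_regression": make_pipeline(
            tfidf(),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        ),
        "random_forest": make_pipeline(
            tfidf(),
            RandomForestClassifier(
                n_estimators=200,
                class_weight="balanced",
                random_state=random_state,
            ),
        ),
    }


# Tiny made-up corpus, only to demonstrate the API.
texts = [
    "Ich bin Anna.",
    "Ich habe einen Hund.",
    "Dennoch bleibt die Frage offen, ob dies folglich zutrifft.",
    "Allerdings ist die Lage hingegen deutlich komplexer geworden.",
]
labels = ["A1", "A1", "C1", "C1"]

models = build_models()
for model in models.values():
    model.fit(texts, labels)
predictions = {name: list(model.predict(texts)) for name, model in models.items()}
```

Setting `class_weight="balanced"` makes both estimators weight each class inversely to its frequency, which is the imbalance handling mentioned above.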

Making It Production-Ready

A model is only useful if you can actually use it. I built a complete pipeline that:

Automatically saves everything: Both models, all preprocessors, feature names, and class labels get saved to a timestamped directory
Easy loading: One function call loads your trained model and makes it ready for predictions
Batch prediction: Classify hundreds of texts at once for efficiency
Interactive CLI: Test your model with a simple command-line interface
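The save-and-load part of such a pipeline might look like the sketch below. It uses `pickle` from the standard library as a stand-in; the actual serialization format, directory layout, and the helper names `save_artifacts` and `load_artifacts` are assumptions, and plain lists stand in for the trained models and preprocessors.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def save_artifacts(artifacts: dict, base_dir: str = "models") -> Path:
    """Pickle each artifact into a new timestamped directory and return its path."""
    run_dir = Path(base_dir) / datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(run_dir / f"{name}.pkl", "wb") as fh:
            pickle.dump(obj, fh)
    return run_dir


def load_artifacts(run_dir: Path) -> dict:
    """Load every pickled artifact in run_dir, keyed by file name without extension."""
    loaded = {}
    for path in sorted(Path(run_dir).glob("*.pkl")):
        with open(path, "rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded


# Round trip in a temporary directory, with placeholder artifacts.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = save_artifacts(
        {
            "class_labels": ["A1", "A2", "B1", "B2", "C1", "C2"],
            "feature_names": ["word_count", "type_token_ratio"],
        },
        base_dir=tmp,
    )
    restored = load_artifacts(run_dir)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.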

The Results

After training on 2,354 samples and testing on 589, here’s what the baseline models delivered:

Random Forest: The Clear Winner

[Figure: Random Forest Model Accuracy]

Overall Accuracy: 82% with cross-validation at 80.6% (±2.7%)

Let’s break down the performance by level:

Level   Precision   Recall   F1-Score   Support
A1      0.80        0.36     0.50       11
A2      0.84        0.67     0.75       61
B1      0.82        0.88     0.85       257
B2      0.80        0.86     0.83       224
C1      0.95        0.51     0.67       35
C2      0.00        0.00     0.00       1

Key Observations:

Excellent on B1/B2: F1-scores of 0.85 and 0.83; these are the bread and butter levels, and the model handles them very well
High precision on C1: 95% precision means when the model says “C1”, it’s almost always right. However, 51% recall shows it’s conservative; it misses about half of the true C1 texts
A2 performing well: Despite being a minority class, an F1-score of 0.75 is respectable
A1 struggles: Only 36% recall; the model misclassifies most A1 texts as A2 (understandable, as they’re adjacent levels)
C2 impossible: With only 1 test sample (and 4 total in the dataset), the model never learned C2 patterns
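To make the per-level numbers concrete, here is how precision, recall, and F1 follow from raw counts. The A1 counts below are hypothetical: they were chosen because they are consistent with the reported 0.80 / 0.36 / 0.50 and a support of 11, not read from the actual confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical A1 counts: 11 true A1 texts, 4 of them found,
# and 1 text from another level wrongly labeled A1.
p, r, f1 = precision_recall_f1(tp=4, fp=1, fn=7)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.36 f1=0.50

# A class that is never predicted, like C2 here, scores zero on all three.
c2_scores = precision_recall_f1(tp=0, fp=0, fn=1)
```

The example shows why high precision can coexist with low recall: the model rarely says "A1", so it is rarely wrong when it does, yet it misses most of the true A1 texts.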

Logistic Regression: Decent but Less Accurate

[Figure: Logistic Regression Model Accuracy]

Overall Accuracy: 69% with cross-validation at 69.8% (±1.2%)

The logistic regression model performed 13 percentage points worse than Random Forest. However, it showed interesting behavior:

Better recall on A1 (64% vs 36%) and C1 (83% vs 51%)
Lower precision overall, leading to more false positives
More stable cross-validation scores (±1.2% vs ±2.7%)

The confusion matrices reveal the classic pattern: most errors happen between adjacent levels (A1↔A2, B1↔B2, B2↔C1), which mirrors human judgment uncertainty.
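One way to quantify this pattern is an "adjacent accuracy" that counts a prediction as acceptable when it is at most one level off. The sketch below is illustrative only: the metric was not reported for these models, the function name is hypothetical, and the labels are made up.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def exact_and_adjacent_accuracy(y_true, y_pred):
    """Return (exact accuracy, accuracy when off-by-one-level predictions also count)."""
    distances = [abs(RANK[t] - RANK[p]) for t, p in zip(y_true, y_pred)]
    n = len(distances)
    exact = sum(d == 0 for d in distances) / n
    adjacent = sum(d <= 1 for d in distances) / n
    return exact, adjacent


# Made-up labels for illustration only.
y_true = ["A1", "A2", "B1", "B2", "C1"]
y_pred = ["A2", "A2", "B2", "B2", "B1"]
exact, adjacent = exact_and_adjacent_accuracy(y_true, y_pred)
print(f"exact={exact:.2f} adjacent={adjacent:.2f}")
```

A large gap between the two numbers would confirm that most mistakes are near misses rather than gross misclassifications.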

Top 15 Features That Matter Most

The Random Forest revealed what drives CEFR classification:

[Ranked list of the top 15 features: entries missing from the source]

Notice that the 15 hand-crafted linguistic features occupy all of the top 15 positions, ahead of every TF-IDF feature. This validates the feature engineering approach; these hand-crafted features capture real linguistic differences between proficiency levels.

Class Imbalance Handling: The Weight of Evidence

The class weights applied show the challenge:

C2: 122.62× weight (only 4 samples!)
A1: 8.61× weight (57 samples)
C1: 2.82× weight (174 samples)
A2: 1.60× weight (306 samples)
B2: 0.44× weight (1,118 samples)
B1: 0.38× weight (1,284 samples)

Despite massive weighting (C2 gets a weight of roughly 122, about 320 times that of B1), the model still couldn’t learn C2 patterns. This isn’t a model failure; it’s a data reality check. You simply cannot learn from 4 examples.
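The weights listed above match the usual "balanced" heuristic, in which each class gets the weight n_samples / (n_classes × class_count). A short sketch that recomputes them from the dataset counts (the helper name `balanced_class_weights` is hypothetical):

```python
# Class counts from the dataset section.
counts = {"A1": 57, "A2": 306, "B1": 1284, "B2": 1118, "C1": 174, "C2": 4}


def balanced_class_weights(class_counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}


weights = balanced_class_weights(counts)
# Print from the heaviest to the lightest class.
for label, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {weight:.2f}")
```

Because the weight is inversely proportional to the class count, the ratio between two weights is simply the inverse ratio of their counts.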

What’s Next? The Path Forward

[Figure: Testing the model with new data]

This baseline establishes a foundation, but there’s significant room for improvement:

1. Address Class Imbalance

SMOTE (Synthetic Minority Over-sampling Technique) for rare classes
Consider combining adjacent levels: (A1+A2) → A, (C1+C2) → C
Collect more data for underrepresented levels
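As a quick check on the second option, the sketch below merges the adjacent levels and recounts the classes. The `MERGE` mapping follows the (A1+A2) → A and (C1+C2) → C proposal and leaves B1 and B2 untouched, which is an assumption about how the merged label set would look.

```python
from collections import Counter

# Class counts from the dataset section.
counts = {"A1": 57, "A2": 306, "B1": 1284, "B2": 1118, "C1": 174, "C2": 4}

# Assumed merge scheme: collapse the two lowest and the two highest levels.
MERGE = {"A1": "A", "A2": "A", "B1": "B1", "B2": "B2", "C1": "C", "C2": "C"}

merged = Counter()
for level, n in counts.items():
    merged[MERGE[level]] += n

for label in sorted(merged):
    print(f"{label}: {merged[label]}")
```

Merging lifts the smallest class from 4 texts to 178, at the cost of no longer distinguishing A1 from A2 or C1 from C2.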

2. Add German-Specific Features

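Currently, my features are mostly language-agnostic; apart from the connector lists and the long-word heuristic for compounds, nothing in them is specific to German. German has unique characteristics I should exploit: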

Part-of-speech distribution (spaCy’s German models)
Case usage patterns (Nominativ, Akkusativ, Dativ, Genitiv)
Verb complexity (Konjunktiv II, Plusquamperfekt at higher levels)
Dependency parsing depth (nested clauses)
Separable verb usage

3. Try Deep Learning

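The logical next step: fine-tune German BERT (gbert-base or bert-base-german-cased):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and a model with a 6-way classification head (A1 through C2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-german-cased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-german-cased',
    num_labels=6,
)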

This should capture subtle linguistic patterns that handcrafted features miss.

4. Hierarchical Classification

Instead of flat 6-way classification, use a two-stage approach:

Stage 1: Classify into broad categories

Basic (A1-A2)
Intermediate (B1-B2)
Advanced (C1-C2)

Stage 2: Fine-grained classification within each category

This could improve accuracy, especially for confused adjacent levels.
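The routing logic of such a two-stage setup can be sketched independently of the underlying models. Everything below is illustrative: `predict_hierarchical` is a hypothetical helper, and the rule-based "classifiers" are toy stand-ins for trained models, chosen only so the example runs without any training data.

```python
from typing import Callable

Classifier = Callable[[str], str]

# Mapping used to derive stage-1 training labels from the CEFR levels.
BROAD = {
    "A1": "Basic", "A2": "Basic",
    "B1": "Intermediate", "B2": "Intermediate",
    "C1": "Advanced", "C2": "Advanced",
}


def predict_hierarchical(
    text: str, stage1: Classifier, stage2: dict[str, Classifier]
) -> str:
    """Pick a broad category first, then let that category's classifier pick the level."""
    broad = stage1(text)
    return stage2[broad](text)


# Toy stand-ins for trained models (word counts and commas are NOT real decision rules).
def toy_stage1(text: str) -> str:
    n = len(text.split())
    return "Basic" if n < 10 else "Intermediate" if n < 30 else "Advanced"


toy_stage2 = {
    "Basic": lambda t: "A1" if len(t.split()) < 5 else "A2",
    "Intermediate": lambda t: "B1" if t.count(",") < 2 else "B2",
    "Advanced": lambda t: "C1",
}

short_level = predict_hierarchical("Ich bin Anna.", toy_stage1, toy_stage2)
longer_level = predict_hierarchical(
    "Ich wohne seit drei Jahren in Berlin und arbeite dort als Lehrerin",
    toy_stage1,
    toy_stage2,
)
```

One trade-off of this design is that a stage-1 mistake cannot be corrected in stage 2, so the broad classifier needs to be very reliable.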

The Takeaway: Start Simple, Iterate Intelligently

Starting with a baseline taught me several lessons:

[List of four lessons: entries missing from the source]