Building a Machine Learning Model to Assess German Writing Proficiency
When I first set out to build an automated language proficiency assessment tool for German, I had ambitious plans. I wanted to evaluate both speaking and writing skills using state-of-the-art models. However, reality quickly set in.
Running a comprehensive speaking assessment would require three different models, one of which demands GPU acceleration. The computational overhead and infrastructure costs made it impractical for my use case. So I made a strategic decision: focus on what’s feasible and do it well.
Writing assessment, on the other hand, is far more practical. It requires less computational power, can run on standard hardware, and still provides tremendous value for language learners. After all, being able to assess someone’s writing proficiency automatically can help millions of German learners understand their current level and track their progress.
The Dataset
I started with a solid foundation: 2,943 German texts labeled by CEFR level.
- A1: 57 texts (beginner)
- A2: 306 texts (elementary)
- B1: 1,284 texts (intermediate)
- B2: 1,118 texts (upper intermediate)
- C1: 174 texts (advanced)
- C2: 4 texts (mastery)
The first thing you’ll notice? Massive class imbalance. The majority of learner texts fall into B1 and B2 levels, while absolute beginners (A1) and near-native speakers (C2) are rare. This mirrors reality; most language learners cluster around intermediate levels, but it poses a challenge for machine learning.
Starting with a Baseline: Traditional ML Features
Before jumping to transformer models and neural networks, I believe in establishing a solid baseline. The question was: How well can we classify German texts using traditional NLP features?
Feature Engineering: Capturing Linguistic Complexity
I designed a feature extractor that captures multiple dimensions of text complexity:
1. Length Metrics
- Character count, word count, sentence count
- These simple metrics are surprisingly informative; beginners write shorter texts with simpler sentences
2. Lexical Sophistication
- Average word length
- Type-token ratio (vocabulary diversity)
- Long word ratio (>7 characters)
- Compound word ratio (≥10 characters; it is so German!)
3. Sentence Complexity
- Average sentence length
- Standard deviation of sentence lengths (indicates varied syntax)
- Commas per sentence (proxy for complex clauses)
4. Discourse Markers
- Basic connectors: und, oder, aber, denn
- Advanced connectors: dennoch, folglich, allerdings, hingegen
5. TF-IDF Features
- 500 most important word unigrams and bigrams
- Captures vocabulary patterns at each level
This gave me 515 features total, a rich representation of linguistic complexity.
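To make the hand-crafted features concrete, here is a minimal sketch of an extractor that uses only the standard library. It covers the twelve features named in the list above, not the full set of 15. The function name `extract_features`, the dictionary keys, and the naive regex-based sentence splitter are my assumptions for illustration, not the original implementation.

```python
import re
import statistics

BASIC_CONNECTORS = {"und", "oder", "aber", "denn"}
ADVANCED_CONNECTORS = {"dennoch", "folglich", "allerdings", "hingegen"}


def extract_features(text: str) -> dict[str, float]:
    """Compute length, lexical, sentence, and connector features for one text."""
    # Naive sentence split after ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # \w+ keeps umlauts and ß inside a word.
    words = re.findall(r"\w+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    n_sents = max(len(sentences), 1)
    sent_lengths = [len(re.findall(r"\w+", s)) for s in sentences] or [0]

    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "long_word_ratio": sum(len(w) > 7 for w in words) / n_words,
        "compound_word_ratio": sum(len(w) >= 10 for w in words) / n_words,
        "avg_sentence_length": sum(sent_lengths) / n_sents,
        "sentence_length_std": statistics.pstdev(sent_lengths),
        "commas_per_sentence": text.count(",") / n_sents,
        "basic_connectors": sum(w in BASIC_CONNECTORS for w in words),
        "advanced_connectors": sum(w in ADVANCED_CONNECTORS for w in words),
    }


sample = "Ich lerne Deutsch, aber die Grammatik ist schwer. Dennoch mache ich Fortschritte."
features = extract_features(sample)
```

Word length is only a rough proxy for compounds; a real compound splitter would be more accurate, but the length threshold keeps the baseline cheap.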
Two Models, One Goal
I trained two classifiers with balanced class weights to handle the imbalance:
1. Logistic Regression
- Simple, interpretable, fast
- Works well when features are carefully engineered
- Good for understanding which features matter most
2. Random Forest
- Ensemble of 200 decision trees
- Handles non-linear relationships
- Provides feature importance rankings
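A setup like this could be sketched as follows, assuming scikit-learn (the article does not name the library). The six toy sentences and their labels are hypothetical stand-ins for the real corpus, and only the TF-IDF part of the feature set is shown; `max_iter=1000` and `random_state=42` are my choices, not settings taken from the original code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the real data set has 2,943 labeled texts.
texts = [
    "Ich bin Anna. Ich wohne in Berlin.",
    "Das ist mein Hund. Er ist klein.",
    "Am Wochenende habe ich meine Freunde besucht, weil ich Zeit hatte.",
    "Ich möchte mich für den Kurs anmelden, aber ich habe noch Fragen.",
    "Dennoch lässt sich feststellen, dass die Digitalisierung neue Anforderungen stellt.",
    "Allerdings bleibt fraglich, inwiefern diese Maßnahmen langfristig wirken.",
]
labels = ["A1", "A1", "B1", "B1", "C1", "C1"]

# Up to 500 word unigrams and bigrams, as described in the feature list.
vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# class_weight="balanced" reweights classes inversely to their frequency.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
forest = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
log_reg.fit(X, labels)
forest.fit(X, labels)

# Classify an unseen text with both models.
X_new = vectorizer.transform(["Ich habe einen Hund und eine Katze."])
lr_pred = log_reg.predict(X_new)
rf_pred = forest.predict(X_new)
```

In the full pipeline the dense hand-crafted features would be stacked next to the TF-IDF matrix before fitting; that step is left out here to keep the sketch short.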
Making It Production-Ready
A model is only useful if you can actually use it. I built a complete pipeline with:
- Automatic saving: both models, all preprocessors, feature names, and class labels are saved to a timestamped directory
- Easy loading: one function call loads a trained model and makes it ready for predictions
- Batch prediction: classify hundreds of texts at once for efficiency
- Interactive CLI: test the model through a simple command-line interface
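The save-and-load part can be sketched with nothing but the standard library. The helper names `save_artifacts` and `load_artifacts`, the `.pkl` file layout, and the timestamp format are assumptions for illustration; the placeholder lists stand in for the trained models and preprocessors.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def save_artifacts(artifacts: dict, base_dir: Path) -> Path:
    """Pickle each artifact into a new timestamped directory and return its path."""
    run_dir = base_dir / datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(run_dir / f"{name}.pkl", "wb") as fh:
            pickle.dump(obj, fh)
    return run_dir


def load_artifacts(run_dir: Path) -> dict:
    """Load every pickled artifact found in a run directory."""
    loaded = {}
    for path in sorted(run_dir.glob("*.pkl")):
        with open(path, "rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded


# Placeholder objects; in practice these would be models and vectorizers.
artifacts = {
    "class_labels": ["A1", "A2", "B1", "B2", "C1", "C2"],
    "feature_names": ["word_count", "type_token_ratio"],
}
with tempfile.TemporaryDirectory() as tmp:
    run_dir = save_artifacts(artifacts, Path(tmp))
    restored = load_artifacts(run_dir)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.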
The Results
After training on 2,354 samples and testing on 589, here’s what the baseline models delivered:
Random Forest: The Clear Winner
Figure: Random Forest Model Accuracy
Overall Accuracy: 82% with cross-validation at 80.6% (±2.7%)
Let’s break down the performance by level:
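| Level | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| A1    | 0.80      | 0.36   | 0.50     | 11      |
| A2    | 0.84      | 0.67   | 0.75     | 61      |
| B1    | 0.82      | 0.88   | 0.85     | 257     |
| B2    | 0.80      | 0.86   | 0.83     | 224     |
| C1    | 0.95      | 0.51   | 0.67     | 35      |
| C2    | 0.00      | 0.00   | 0.00     | 1       |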
Key Observations:
- Excellent on B1/B2: F1-scores of 0.85 and 0.83; these are the bread and butter levels, and the model handles them very well
- High precision on C1: 95% precision means when the model says “C1”, it’s almost always right. However, 51% recall shows it’s conservative; it misses about half of true C1 texts
- A2 performing well: Despite being a minority class, 75% F1-score is respectable
- A1 struggles: Only 36% recall; the model misclassifies most A1 texts as A2 (understandable, as they’re adjacent levels)
- C2 impossible: With only 1 test sample (and 4 total in the dataset), the model never learned C2 patterns
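The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The snippet below is a small check using the rounded values from the table; the helper name `f1_score` is mine.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per level, copied from the Random Forest table.
report = {
    "A1": (0.80, 0.36),
    "A2": (0.84, 0.67),
    "B1": (0.82, 0.88),
    "B2": (0.80, 0.86),
    "C1": (0.95, 0.51),
    "C2": (0.00, 0.00),
}
f1_by_level = {
    level: round(f1_score(p, r), 2) for level, (p, r) in report.items()
}
```

Five of the six values match the table. C1 comes out as 0.66 rather than 0.67, which is what you would expect when the inputs are already rounded to two decimals.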
Logistic Regression: Decent but Less Accurate
Figure: Logistic Regression Model Accuracy
Overall Accuracy: 69% with cross-validation at 69.8% (±1.2%)
The logistic regression model performed 13 percentage points worse than Random Forest. However, it showed interesting behavior:
- Better recall on A1 (64% vs 36%) and C1 (83% vs 51%)
- Lower precision overall, leading to more false positives
- More stable cross-validation scores (±1.2% vs ±2.7%)
The confusion matrices reveal the classic pattern: most errors happen between adjacent levels (A1↔A2, B1↔B2, B2↔C1), which mirrors human judgment uncertainty.
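One way to quantify this pattern is an "adjacent accuracy" that counts a prediction as acceptable when it is at most one level off. The sketch below uses hypothetical labels, not the article's test set, and the function name `adjacent_accuracy` is my own.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def adjacent_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Share of predictions at most one CEFR level away from the true label."""
    hits = sum(abs(RANK[t] - RANK[p]) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical labels for illustration only.
y_true = ["A1", "A2", "B1", "B2", "C1"]
y_pred = ["A2", "A2", "B2", "B2", "A2"]

exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
within_one = adjacent_accuracy(y_true, y_pred)
```

A large gap between exact and within-one accuracy would confirm that the model mostly confuses neighboring levels rather than making wild errors.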
Top 15 Features That Matter Most
The Random Forest revealed what drives CEFR classification:
1. $1
2. $1
3. $1
4. $1
5. $1
6. $1
7. $1
8. $1
9. $1
10. $1
11. $1
12. $1
13. $1
14. $1
Notice that the 15 hand-crafted linguistic features fill the entire top 15, ranking ahead of every TF-IDF feature. This validates the feature engineering approach; these hand-crafted features capture real linguistic differences between proficiency levels.
Class Imbalance Handling: The Weight of Evidence
The class weights applied show the challenge:
- C2: 122.62× weight (only 4 samples!)
- A1: 8.61× weight (57 samples)
- C1: 2.82× weight (174 samples)
- A2: 1.60× weight (306 samples)
- B2: 0.44× weight (1,118 samples)
- B1: 0.38× weight (1,284 samples)
Despite massive weighting (C2’s weight is roughly 320 times B1’s), the model still couldn’t learn C2 patterns. This isn’t a model failure; it’s a data reality check. You simply cannot learn from 4 examples.
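These weights are consistent with the usual "balanced" heuristic, weight = n_samples / (n_classes × class count), applied to the class counts of the full data set. That the original code used exactly this formula (it is the one scikit-learn documents for `class_weight="balanced"`) is my assumption; the arithmetic can be checked in a few lines.

```python
counts = {"A1": 57, "A2": 306, "B1": 1284, "B2": 1118, "C1": 174, "C2": 4}

n_samples = sum(counts.values())  # 2,943 texts in total
n_classes = len(counts)           # 6 CEFR levels

# Balanced heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {level: n_samples / (n_classes * n) for level, n in counts.items()}
rounded = {level: round(w, 2) for level, w in weights.items()}

# Relative importance of a single C2 text compared with a single B1 text;
# algebraically this is just 1284 / 4.
ratio_c2_to_b1 = weights["C2"] / weights["B1"]
```

The ratio between two class weights depends only on the two class counts, which is why a single C2 text ends up counting about 321 times as much as a single B1 text.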
What’s Next? The Path Forward
Figure: Testing the model with new data
This baseline establishes a foundation, but there’s significant room for improvement:
1. Address Class Imbalance
- SMOTE (Synthetic Minority Over-sampling) for rare classes
- Consider combining adjacent levels: (A1+A2) → A, (C1+C2) → C
- Collect more data for underrepresented levels
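To see what merging adjacent levels would do to the class distribution, here is a quick sketch. Keeping B1 and B2 as separate classes is my assumption, since only the A and C merges are mentioned; `MERGE_MAP` and `merge_level` are hypothetical names.

```python
from collections import Counter

# A1/A2 collapse into "A", C1/C2 into "C"; B1 and B2 stay as they are.
MERGE_MAP = {"A1": "A", "A2": "A", "C1": "C", "C2": "C"}


def merge_level(level: str) -> str:
    """Map a CEFR level to its merged class, leaving unmapped levels unchanged."""
    return MERGE_MAP.get(level, level)


counts = {"A1": 57, "A2": 306, "B1": 1284, "B2": 1118, "C1": 174, "C2": 4}

merged = Counter()
for level, n in counts.items():
    merged[merge_level(level)] += n
```

After merging, the smallest class has 178 texts instead of 4, which makes the problem far more learnable at the cost of coarser predictions.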
2. Add German-Specific Features
Currently, my features are mostly language-agnostic; only the connector lists and the long-word heuristic are tied to German. The language has unique characteristics I should exploit:
- Part-of-speech distribution (spaCy’s German models)
- Case usage patterns (Nominativ, Akkusativ, Dativ, Genitiv)
- Verb complexity (Konjunktiv II, Plusquamperfekt at higher levels)
- Dependency parsing depth (nested clauses)
- Separable verb usage
3. Try Deep Learning
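The logical next step: fine-tune German BERT (gbert-base or bert-base-german-cased):

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load the tokenizer and a model with a 6-way classification head (A1 to C2)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-german-cased",
        num_labels=6,
    )
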
This should capture subtle linguistic patterns that handcrafted features miss.
4. Hierarchical Classification
Instead of flat 6-way classification, use a two-stage approach:
Stage 1: Classify into broad categories
- Basic (A1-A2)
- Intermediate (B1-B2)
- Advanced (C1-C2)
Stage 2: Fine-grained classification within each category
This could improve accuracy, especially for confused adjacent levels.
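The control flow of such a two-stage classifier is simple to sketch. Everything below is hypothetical: the toy word-count rules only stand in for trained models so that the example runs, and the names `predict_hierarchical`, `toy_stage1`, and `toy_stage2` are mine.

```python
from typing import Callable

# Which broad band each CEFR level belongs to.
BAND_OF = {
    "A1": "Basic", "A2": "Basic",
    "B1": "Intermediate", "B2": "Intermediate",
    "C1": "Advanced", "C2": "Advanced",
}

Classifier = Callable[[str], str]


def predict_hierarchical(
    text: str, stage1: Classifier, stage2: dict[str, Classifier]
) -> str:
    """Pick a broad band first, then let that band's classifier pick the level."""
    band = stage1(text)
    return stage2[band](text)


def toy_stage1(text: str) -> str:
    """Stand-in for a trained band classifier: decides by word count only."""
    n_words = len(text.split())
    if n_words < 10:
        return "Basic"
    if n_words < 30:
        return "Intermediate"
    return "Advanced"


# Stand-ins for the per-band fine-grained classifiers.
toy_stage2: dict[str, Classifier] = {
    "Basic": lambda t: "A1" if len(t.split()) < 5 else "A2",
    "Intermediate": lambda t: "B1" if len(t.split()) < 20 else "B2",
    "Advanced": lambda t: "C1",
}

level = predict_hierarchical("Ich heiße Anna.", toy_stage1, toy_stage2)
```

One trade-off to keep in mind: a stage 1 mistake cannot be corrected in stage 2, so the band classifier needs to be very reliable for this design to pay off.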
The Takeaway: Start Simple, Iterate Intelligently
Starting with a baseline taught me several lessons:
1. $1
2. $1
3. $1
4. $1