An empirical study investigating whether transformer models truly learn structural rules in formal languages or simply memorize statistical surface patterns within the training distribution.
Can transformers learn the actual generative structure of a language, or do they only approximate patterns seen during training?
Transformer models routinely achieve impressive performance on sequence
modeling tasks, yet performance metrics alone cannot determine whether
these systems genuinely understand the rules governing a structure or
merely approximate patterns within a dataset.
This project isolates that distinction through deterministic finite
automata, where the true generative rules are fully known. By creating
a controlled experimental environment, we can directly observe where
transformers succeed, where they fail, and whether those failures reveal
deeper architectural limitations.
Prior work often focuses on benchmark accuracy or theoretical learnability in isolation. Few studies combine behavioral evaluation with structural probing to test whether models genuinely learn rules.
All datasets were generated using deterministic finite automata, ensuring every sequence has a precisely computable structure. This allows evaluation beyond simple accuracy metrics and enables direct probing of structural understanding.
Regular languages generated through DFA transitions over constrained alphabets.
GPT-style transformer architectures evaluated under identical training conditions.
Structural failure was measured through targeted probing rather than benchmark accuracy alone.
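As a rough illustration of the data setup described above, the sketch below defines a small DFA and samples accepted strings at two length ranges. The automaton (even number of "a" symbols), the alphabet, the length ranges, and all names are assumptions chosen for illustration; they are not the automata or settings used in the study.

```python
import random

# Hypothetical example DFA over {"a", "b"}: accepts strings containing an
# even number of "a" symbols. Not the automaton used in the study.
ALPHABET = ["a", "b"]
START_STATE = 0
ACCEPTING = {0}
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 1,
}

def run_dfa(sequence):
    """Return the state reached after consuming every symbol of `sequence`."""
    state = START_STATE
    for symbol in sequence:
        state = TRANSITIONS[(state, symbol)]
    return state

def accepts(sequence):
    """True if the DFA ends in an accepting state."""
    return run_dfa(sequence) in ACCEPTING

def sample_accepted(length, rng, max_tries=10_000):
    """Rejection-sample a string of exactly `length` symbols that the DFA accepts."""
    for _ in range(max_tries):
        candidate = [rng.choice(ALPHABET) for _ in range(length)]
        if accepts(candidate):
            return candidate
    raise RuntimeError("no accepted string found; this length may be infeasible")

rng = random.Random(0)
# Assumed length ranges: short strings for training, longer ones held out.
train = [sample_accepted(rng.randint(1, 20), rng) for _ in range(100)]
ood = [sample_accepted(rng.randint(40, 60), rng) for _ in range(100)]
```

Because membership is decided by `run_dfa`, every sampled string has a known state trajectory, which is what makes probing beyond accuracy possible in this kind of setup.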
Experimental Results
While transformers achieved strong in-distribution validation performance, their performance collapsed on longer out-of-distribution (OOD) sequences, revealing reliance on statistical locality rather than structural rule learning.
Validation Dynamics
Both classification and next-token models demonstrate smooth validation convergence during training. However, later experiments reveal that this apparent success does not transfer outside the original training distribution.
Validation accuracy exceeds 90%, suggesting the models fit the training distribution well.
A dramatic perplexity spike appears immediately beyond the training boundary, indicating failure to generalize algorithmically.
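One way to quantify a spike like this is to compute perplexity separately on either side of the training length boundary. The sketch below assumes per-token natural-log probabilities have already been extracted from a model; the function names, the `max_train_len` parameter, and the numbers in the usage lines are illustrative placeholders, not measurements from these experiments.

```python
import math
from collections import defaultdict

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def perplexity_by_length(scored_sequences, max_train_len):
    """Split sequences at the training boundary and pool tokens per split.

    `scored_sequences` is a list of lists; each inner list holds the
    log-probability the model assigned to each target token of one sequence.
    """
    pooled = defaultdict(list)
    for log_probs in scored_sequences:
        split = "in_dist" if len(log_probs) <= max_train_len else "ood"
        pooled[split].extend(log_probs)
    return {split: perplexity(lp) for split, lp in pooled.items()}

# Placeholder scores: a short sequence predicted at p=0.5 per token and a
# long one at p=0.25 per token. These are made-up values for illustration.
scored = [[math.log(0.5)] * 4, [math.log(0.25)] * 8]
result = perplexity_by_length(scored, max_train_len=5)
print(result)  # in_dist is close to 2.0, ood is close to 4.0
```

Pooling tokens per split weights long sequences more heavily than short ones; averaging per-sequence perplexities instead is an equally reasonable choice and would give different numbers.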
Key Observation
Scaling Analysis
Increasing parameter count alone does not guarantee stronger structural generalization. Training on longer contexts improves OOD performance more reliably than scaling parameters alone.
End-of-sequence prediction deteriorates sharply outside the training range.
The model exhibits severe uncertainty once sequence lengths exceed those seen during training.
Symbol-level prediction remains relatively stable, suggesting dependence on local statistical cues.
OOD perplexity spikes indicate that the model does not consistently recover underlying DFA structure.
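The contrast between end-of-sequence and symbol-level behavior can be made concrete by scoring the two token types separately and comparing against the DFA's ground truth, since ending a string is legal only when the current state is accepting. The sketch below is a minimal version under stated assumptions: the `"<eos>"` token name, the helper names, the example automaton, and the log-probabilities are all hypothetical and not taken from the study.

```python
def eos_legal_after_each_prefix(sequence, transitions, start, accepting):
    """For each prefix (including the empty one), report whether ending is legal."""
    state = start
    legal = [state in accepting]
    for symbol in sequence:
        state = transitions[(state, symbol)]
        legal.append(state in accepting)
    return legal

def mean_nll_by_token_type(tokens, log_probs, eos_token="<eos>"):
    """Average negative log-likelihood separately for EOS and symbol targets."""
    buckets = {"eos": [], "symbol": []}
    for token, lp in zip(tokens, log_probs):
        buckets["eos" if token == eos_token else "symbol"].append(-lp)
    return {kind: sum(v) / len(v) for kind, v in buckets.items() if v}

# Hypothetical DFA: accepts strings with an even number of "a" symbols.
transitions = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
legal = eos_legal_after_each_prefix("aba", transitions, start=0, accepting={0})
# legal == [True, False, False, True]

# Made-up per-token log-probabilities for one scored sequence.
scores = mean_nll_by_token_type(["a", "b", "<eos>"], [-0.1, -0.3, -2.0])
```

A large gap between the `eos` and `symbol` averages on long sequences would match the pattern described above, where termination depends on tracking state while symbol prediction can lean on local cues.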
Models performed well on sequences similar to training data, but degraded sharply on longer unseen sequences. The failures consistently concentrated around sequence termination, suggesting the absence of true structural understanding.
Transformers achieved stable perplexity and strong next-token prediction within the training distribution.
Performance degraded sharply on longer unseen sequences, indicating failure to extrapolate structural rules.
End-of-sequence prediction emerged as the primary breakdown point, revealing weak state-sensitive reasoning.
Larger models improved in-distribution metrics but did not resolve structural generalization failures.
Our findings suggest that transformers trained through next-token prediction do not reliably learn the structural rules underlying a DFA-generated language. Instead, they often rely on local statistical correlations that remain effective only within the original training distribution.
The DFA framework creates a controlled environment for studying structural learning and opens several directions for future work surrounding interpretability, inductive bias, and symbolic reasoning.