A wide-ranging conversation between a linguist and a computer scientist exploring the intersection of large language models, linguistics, education, and society. The discussion covers technical aspects of how LLMs work, their limitations, implications for academic research and education, and broader societal impacts.
In this candid and technical discussion, Jenia (a linguist working on Ukrainian refugee language technologies) and Mario Zechner (a computer scientist and developer) dissect the reality of large language models beyond the hype and doom. Their conversation moves from technical explanations of transformer architectures to pressing questions about academic integrity, the future of education, and societal transformation.
Key themes include: the unsuitability of LLMs as proxies for human linguistic research, the fundamental difference between Chomskyan formal grammars and how LLMs actually process language, the educational crisis created by widespread LLM access, the practical realities of using AI for programming, and concerns about language homogenization and the erosion of minority languages and dialects.
Throughout, both speakers maintain a pragmatic stance—neither techno-utopian nor apocalyptic—while acknowledging genuine concerns about academic fraud, the loss of learning through friction, and the uncertain future for both humanities education and programming as professions.
Jenia describes a practice she has seen in academic research: using LLMs to simulate test subjects, particularly for hard-to-reach demographic groups. Mario is unequivocal in his assessment: using such simulated subjects in a final research paper amounts to scientific fraud.
"In lieu of actual human subjects, I guess you could do some pre-hypothesis testing and possibly method testing, but you shouldn't use those for your final paper. That just sounds like scientific fraud." — Mario
The core problem is unknowable bias. When asking an LLM to respond as a specific demographic, researchers have no insight into what sampling biases exist in the training data, making any statistical analysis fundamentally unsound.
The conversation includes a live experiment with Austrian Styrian dialect. Mario demonstrates how Claude handles (or fails to handle) a request to complete a Styrian dialect poem. The model recognizes the form but refuses to complete it, citing authenticity concerns, which reveals both its capabilities and its limitations.
"All it sees is stuff like this, right? This is accessible to it through its training set. And this is already useful. But this doesn't cover the breadth and also not the history of a local dialect." — Mario
For mainstream language analysis, LLMs show some competence. For dialectal variation, regional speech patterns, and minority languages, the lack of training data makes them unreliable research tools.
Jenia asks about generating synthetic training data for languages with limited corpora. Mario explains that while data augmentation is standard practice in machine learning, its effectiveness for truly low-resource languages remains questionable. The synthetic data cannot venture far beyond the existing examples, leaving vast portions of the language's possibility space unexplored.
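A minimal sketch of what simple text augmentation can look like, loosely modeled on "easy data augmentation" techniques (the `augment` function and its parameters are illustrative, not anything discussed in the episode). Note how every variant stays in the immediate neighborhood of the seed sentence, which is exactly the limitation Mario points out:

```python
import random

def augment(sentence: str, n_variants: int = 3, p_swap: float = 0.1) -> list[str]:
    """Generate noisy variants of a sentence by randomly swapping
    adjacent words. Each variant stays close to the original seed,
    so vast portions of the language's possibility space remain
    unexplored no matter how many variants are generated."""
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        w = words[:]
        for i in range(len(w) - 1):
            if random.random() < p_swap:
                w[i], w[i + 1] = w[i + 1], w[i]
        variants.append(" ".join(w))
    return variants

print(augment("the hungry boy buys a loaf of bread"))
```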
Mario, coming from a computer science background steeped in formal grammars, explains how LLMs have upended assumptions about language processing. Programming languages use context-free grammars—strict, deterministic, tree-based structures. LLMs don't work this way at all.
"LLMs have kind of proven, in my opinion, that this is also not how they process language, because this is a strict, formal, deterministic thing. We know that LLMs are not very good at forming encodings of deterministic systems inside of their model parameters." — Mario
Instead, LLMs operate on fuzzy pattern matching that bears striking resemblance to construction grammars in linguistics—learning schemas and constructions from examples rather than following explicit rules.
Mario provides a detailed walkthrough of how transformers work, using the German sentence "Der hungrige Bub kauft ein Brot" (The hungry boy buys a loaf of bread) as an example. The process: the text is split into tokens, each token is mapped to a vector encoding its meaning, attention layers then mix information between related words, and the model finally outputs a probability distribution over the next token.
"This attention layer here actually learns to identify the relationships between words for all kinds of languages and can then based on that say, okay, given 'Bub', is there a word before it that is an adjective? If yes, then add that meaning to 'Bub'." — Mario
A fascinating insight emerges: LLMs trained on multiple languages simultaneously develop an internal representation that strips away surface linguistic form, operating in a language-independent conceptual space. This isn't Chomsky's universal grammar, but something more fundamental—a transformation from grammatical structures to pure semantic concepts and back again.
"Somewhere in this whole pipeline where we push through the words and then get the vectors, we strip away the language part, so to speak, and then reassemble it out here." — Mario
Jenia shares experiences from her work with Ukrainian refugees in Austria, highlighting both the utility and the limitations of machine translation. Google Translate supports Ukrainian (support that gained prominence during the 2022 invasion), but quirks remain, such as routing Russian-to-German translations through English as a pivot language, which loses grammatical features like formal address.
"She still used in Russian the polite form. Can I come to you polite? And because it went over English, it still said 'du'. Can I come to you impolite?" — Jenia
Yet despite imperfections, machine translation enables communication. Mario notes that his company has used Google Translate for business communications with Asian customers for a decade—successfully—because the effort to communicate matters more than perfect grammar.
The question arises: should Ukrainian models be initialized from Russian? Jenia provides nuanced insight—grammatically they're similar, but the lexicon diverges toward Polish. The risk is "bleed over," where Russified Ukrainian becomes normalized, potentially accelerating language shift. This has political dimensions given the history of language suppression during Soviet times.
Yet the alternative, excluding Ukrainian entirely, may be worse. The pragmatic approach seems to be accepting some bias and documenting it, rather than insisting on perfect purity at the cost of accessibility.
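A minimal sketch of the transfer-learning setup under discussion: start from a model pretrained on a related high-resource language and continue pretraining on the low-resource one. The checkpoint is an existing Russian model chosen purely for illustration; the episode doesn't endorse a specific model.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# An existing Russian checkpoint, used here only as an example
# starting point for cross-lingual transfer.
base = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Continued pretraining on a Ukrainian corpus (masked-language-
# modeling objective) would start from these weights. The "bleed
# over" risk Jenia raises lives in exactly this step: lexical
# patterns inherited from the Russian base can survive fine-tuning
# and surface as Russified Ukrainian in the model's output.
```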
Jenia references research framing the current moment as a "third wave" of English linguistic imperialism—after colonialism and mass media, now LLMs trained predominantly on English text. Since models are inevitably influenced by their highest-frequency training data, English patterns, idioms, and structures likely bleed into other languages in model outputs.
"What colonialism didn't do in the first place, now with new media it's like the second wave in terms of English influence. And now we can say it's a third wave with the [LLMs]." — Jenia
Both speakers share experiences discovering LLM-generated academic work. Jenia describes reviewing a paper that shifted from paragraphs to bullet points—a telltale LLM formatting quirk. Another colleague received a student essay citing WhatsApp research from the 1990s (impossible, as WhatsApp launched in 2009).
"I just thought it was a young academic like me who maybe was coming from a different background because English isn't their first language. I was just like, why would it go to bullet points if they submit an early draft? And now I put in my time into trying to review and giving constructive criticism. I feel betrayed in a way." — Jenia
The fundamental problem: as LLM-generated text becomes ubiquitous in newspapers, websites, and published material, students growing up reading this text will naturally write in similar patterns. Detection becomes impossible.
Mario reveals he stopped doing homework in 8th grade, relying solely on paying attention in class, and still graduated summa cum laude. This prompts reflection: if LLMs can complete all homework assignments, what's the point of assigning them?
"Kids would be fucking stupid if they didn't use an LLM to write all their homework. I would have used an LLM to write all my fucking homework." — Mario
Universities are issuing guidance requiring students to disclose LLM use in disclosure forms or statements, but Jenia finds this unhelpful: "I'm not a hundred percent sure how to grade it." If institutional policy accepts LLM assistance, how are educators supposed to distinguish student work from machine output?
Jenia articulates what humanities education fundamentally provides: not domain knowledge, but soft skills—gathering information, synthesizing it, and presenting it coherently. Most linguistics students won't become linguists, but they'll carry these analytical and communication skills into other careers.
"In the humanities, the main thing we teach is the soft skills. Most of my students will not go on to become linguists necessarily, but they will hopefully still be able to use this idea of gathering a lot of information, synthesizing it and presenting it in a good way." — Jenia
If LLMs do this work, what remains? The conversation draws parallels to how Google killed rote memorization—the educational system adapted, focusing on information synthesis rather than recall. But LLMs threaten synthesis itself.
Both speakers converge on a crucial insight: suffering, or at least friction, is integral to learning. For Mario, manually writing code isn't just output—it's thinking, exploring the problem space, understanding interconnections. For Jenia, writing is thinking.
"I believe that suffering is part of learning. Maybe not suffering, but friction. Friction is the thing that makes me grow. If everything is super smooth sailing, I have no incentive to learn a thing, to grow." — Mario
Removing this friction by delegating to LLMs means students never develop deep understanding. They can't evaluate whether solutions are correct, optimal, or even coherent, because they haven't internalized the problem domain.
The speakers acknowledge they sound like "two old Luddites" but push back against pure generational pessimism. Young people motivated by genuine interest will still put in the work. The real concern isn't laziness—it's systemic. When students work jobs to afford university, when professors are too overworked to learn new assessment methods, when institutional guidance is vague, the path of least resistance becomes delegation to machines.
Mario demonstrates his newspaper scraper—entirely LLM-generated code. But he's not passively accepting outputs. He provides extremely detailed instructions, specifies exactly which files to modify, anticipates what the LLM might break, and constrains it heavily.
"My way of using LLMs is basically putting them on a really tight leash. Giving them very detailed instructions plus context. I give them files, documentation and all the stuff they need to be very surgical and have a better understanding of what they might break." — Mario
LLMs can generate impressive demos—3D scenes, small self-contained programs. But they fail at complex, interconnected codebases where changes in one module ripple through the system. They can't hold enough context in their "tiny little heads" to reason about these dependencies.
Can companies replace junior programmers with LLM-augmented seniors? Short-term, perhaps. But this creates a pipeline problem: where do new seniors come from if juniors aren't hired and trained?
For non-programmers, especially in academia, LLMs open new possibilities. Mario taught his partner Steph (also a linguist) to use a coding agent. In three hours spread over three nights, she can now instruct an LLM to process her linguistic data, transform it, and generate visualizations—without understanding the code itself.
"She can judge the outputs of those programs that the LLM generates, but she doesn't have to understand the programs themselves. Given an input, she can see the output of the program and she can say this is correct or not. That's all she needs." — Mario
The educational implication: teach problem decomposition and output evaluation, not code syntax. Students need to learn how to structure problems, communicate them to LLMs, and validate results. The intermediate steps—writing loops, managing memory—can be delegated.
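For a sense of scale, this is the kind of small, self-contained script an LLM might produce for a task like Steph's (the input file and column names are hypothetical): read annotated data, count categories, plot. The user validates it by checking the chart against data she knows, never by reading the code.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("interviews.csv")  # e.g. one annotated utterance per row

# Count how often each construction appears per speaker group,
# then render a stacked bar chart.
counts = df.groupby(["speaker_group", "construction"]).size().unstack(fill_value=0)
counts.plot(kind="bar", stacked=True)
plt.ylabel("Utterances")
plt.title("Constructions by speaker group")
plt.tight_layout()
plt.savefig("constructions.png")
```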
Mario shows the scaling curve that AI labs bet on: more data and compute yield exponentially increasing capability. But progress has plateaued; the curve that looked exponential is flattening into a sigmoid. Transformers aren't the endpoint.
"We are now at the stage where we realized that the exponential is, like usual in this universe with thermodynamics, a sigmoid. We're kind of plateauing out and they're not getting a lot better, not significantly better like we had before." — Mario
The critical missing piece: sample efficiency. Children learn constructions from 10-20 examples. LLMs need billions. True intelligence—whatever that means—will require architectures that learn from fewer examples, which in turn need less compute. This creates an opening for academic research to become relevant again.
Mario is firm: LLMs aren't "it." They're brute force pattern matching at massive scale, not sophisticated reasoning. They might become one component of a more intelligent system—analogous to how human brains have separate regions for vision, language, motor control—but they aren't the complete picture.
"LLMs are a brute force model. They are not sophisticated at all. I mean, obviously they are sophisticated relatively, but conceptually they are not. They rely merely on the fact that you have a lot of parameters to tune and a lot of training data to tune them with." — Mario
The conversation touches on algo-speak (changing language to evade content filters), the erosion of translation as a profession, and concerns about older generations falling for LLM-generated misinformation. Jenia references a framework of "talking through the machine" (translation, text-to-speech) versus "talking to the machine" (chatbots, assistants).
Neither speaker offers utopian or dystopian predictions. Instead: pragmatic acknowledgment that the technology won't disappear, coupled with genuine uncertainty about long-term effects on language, education, and work.
"I think for me, you know, LLMs is something that we talk about all the time from a linguistics perspective and also just as academics in general, because everybody's trying to figure out how much we can do, how much is bullshit that they're pushing on us, how much we should be accepting it and how much not. And so I think it's really important from our perspective to actually hear the facts instead of hearing, you know, either the hype or the doomerism." — Jenia
"Generally, I personally would not use an LLM as a proxy for anything apart from developing the methods that would then apply to human language text or speech or interviewees or whatever." — Mario
"This attention layer here is capable of learning grammatical patterns. No, it doesn't learn grammatical patterns. It learns how to identify relationships between input parts." — Mario
"It's not Chomsky's universal grammar, but it's also something general, something... The underlying shape of this layer is the same for any language. It doesn't matter." — Mario
"Why would I expect the student to put in any effort if I'm not putting in the effort? I mean, it's kind of unfair to tell them you have to write it, but then I won't read it." — Jenia
"At the end of the day, given that the technology is not going to go away and that students have ample access to it for free, we kind of fucked. So it comes back to self motivation." — Mario
"I'm not optimistic that in our current environment, there are enough people with leverage that will make sure that we kind of keep our educational systems working. I don't see this happening." — Mario
"Coding, when I physically code, it's a way for me to explore a problem space. I draft, I scribble, I try things out. This work, which might be boring on the surface, is a way for me to sort and order my thinking for a problem. And I think for writing, it's the same thing. Not having that means that the way I understand the problem will be much more superficial." — Mario
"If you are really interested in something, you're probably still putting in the work." — Mario
"This is a societal solution or a communicative solution to a technical problem. I mean, that's what language is for, right? Overcoming society." — Mario, discussing Ukrainian refugees adding "we are Ukrainian family" to machine-translated messages
"I find that your opposite end of a communication channel is most often actually kind of enticed by you trying to be conformal to the language or making the effort and being funny while you totally fuck up using a translation software." — Mario
"I don't like to make predictions. I do like to make the prediction that LLMs aren't it. They might be part of the thing that eventually becomes really, really intelligent, but more like we have different parts of our brain responsible for vision and linguistics and all of that stuff. And I think LLMs might be one of those parts of a bigger system that might be more intelligent, but they aren't it." — Mario
"It's really hard to estimate the societal effects of a new technology." — Mario