Categories: Dialectology, Computational Linguistics, Historical Linguistics

Quantifying a Dialect’s ‘Distance’

You say ‘tomahto’, I say ‘tomayto.’ He says ‘y’all’, she says ‘you guys.’ We intuitively understand that speakers of the same language can sound vastly different. But how different is different? When does a thick dialect start to sound like a whole new language? Is there a tipping point where American English and British English could, in theory, become mutually unintelligible?

This isn’t just a philosophical question. It’s a central puzzle in historical linguistics, and a practical one for sociolinguists. While the line between “dialect” and “language” is famously blurry (as the old joke goes, “A language is a dialect with an army and a navy”), linguists have developed fascinating mathematical tools to bring some objectivity to the matter. They can actually quantify the ‘distance’ between two language varieties, measuring just how much they’ve drifted apart.

Beyond “It Just Sounds Different”: The Concept of Lexical Distance

The primary way linguists measure this divergence is through a concept called lexical similarity or its inverse, lexical distance. At its core, this is a measure of how much vocabulary two languages or dialects share.

Think of it like a genetic family tree. You share more DNA with your sibling than you do with a first cousin, and far more than with a distant, fourth cousin. Languages work similarly. Dialects of a single language, like siblings, share a huge amount of their ‘lexical DNA.’ Closely related languages, like Spanish and Portuguese, are first cousins—distinct, but with an obvious and extensive shared heritage. Languages like English and German are more like second or third cousins; the family resemblance is there, but you have to look more closely.

Lexical distance tries to put a number on that family resemblance. A lexical similarity of 100% would mean the two varieties have identical vocabularies for a given set of concepts. A similarity of 0% would mean they share no words at all. By calculating this percentage, we can get a snapshot of how far two branches of a language family have grown apart.
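As a rough illustration of the percentage described above, the sketch below counts how many concepts in a small sample have related words in both varieties. The function name `lexical_similarity` and the six hand-picked English and German pairs are assumptions made for this example, not a standard dataset, and the related/unrelated judgments are supplied by hand.

```python
def lexical_similarity(judgments):
    """Return the percentage of concepts whose words are judged related.

    `judgments` maps a concept to True (related words) or False (unrelated).
    """
    if not judgments:
        raise ValueError("need at least one concept to compare")
    shared = sum(1 for related in judgments.values() if related)
    return 100.0 * shared / len(judgments)


# Toy sample: English vs German, judged by hand for illustration only.
sample = {
    "water": True,   # water / Wasser
    "hand": True,    # hand / Hand
    "sun": True,     # sun / Sonne
    "eat": True,     # eat / essen
    "big": False,    # big / groß
    "small": False,  # small / klein
}

print(f"{lexical_similarity(sample):.1f}%")
```

Four of the six concepts in this toy sample match, so the function reports roughly 67%. Six words are far too few to say anything about the two languages; a real comparison would use a full concept list.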

The Dialectologist’s Toolkit: A Core Vocabulary

So, how do you actually compare vocabularies? You can’t just open a dictionary and start counting. For one, which words do you choose? Modern languages are full of loanwords for technology (internet, computer), food (sushi, pizza), and culture that can skew the results. Two languages might both use the word ‘taxi’, but that doesn’t say anything about their ancestral relationship.

To solve this, linguists use a standardized toolkit. The most famous is the Swadesh list, developed by linguist Morris Swadesh in the 1950s. It’s a curated list of about 100-200 basic, universal concepts that are thought to be common to almost every human culture and environment. These include:

Pronouns (I, you, we)
Body parts (head, hand, eye)
Natural phenomena (sun, moon, water, stone)
Basic actions (eat, drink, sleep, die)
Simple adjectives (big, small, long)

The idea is that these core words are much less likely to be borrowed from other languages and are more stable over time. By comparing the words for these specific concepts in two languages, you get a much cleaner signal of their historical relationship. For example, comparing the English ‘water’ and German ‘Wasser’ is more informative than comparing the English ‘government’ and German ‘Regierung’, both of which trace back to borrowings from French and Latin rather than to inherited Germanic vocabulary.
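The filtering step can be sketched in a few lines. Everything named here is hypothetical: `CORE_CONCEPTS` is a tiny stand-in for a real Swadesh list, and the two word lists are toy dictionaries keyed by concept. The point is only that loanwords and cultural vocabulary such as ‘taxi’ never reach the comparison.

```python
# Hypothetical core-concept list; a real Swadesh list has roughly 100-200 entries.
CORE_CONCEPTS = {"I", "you", "hand", "eye", "sun", "water", "eat", "big"}


def core_pairs(words_a, words_b, core=CORE_CONCEPTS):
    """Pair up the words two varieties use for each shared core concept."""
    shared = sorted(set(words_a) & set(words_b) & set(core))
    return {concept: (words_a[concept], words_b[concept]) for concept in shared}


# Toy word lists keyed by concept (illustrative only).
english = {"water": "water", "hand": "hand", "taxi": "taxi", "government": "government"}
german = {"water": "Wasser", "hand": "Hand", "taxi": "Taxi", "government": "Regierung"}

print(core_pairs(english, german))
# {'hand': ('hand', 'Hand'), 'water': ('water', 'Wasser')}
```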

Getting Technical: The Levenshtein Distance

Once you have your two lists of core words (e.g., the Swadesh list for Castilian Spanish and one for Argentinian Spanish), you need to compare them. But what if the words aren’t identical, just similar? Is the Spanish noche (‘night’) totally different from the Italian notte? Clearly not. They are cognates—words that have evolved from the same ancestral root (in this case, the Latin noctem).

This is where a powerful algorithm called the Levenshtein distance comes in. It’s a way of calculating the “edit distance” between two strings of text. In simple terms, it counts the minimum number of single-character edits required to change one word into the other. The allowed edits are:

Insertion: Adding a character.
Deletion: Removing a character.
Substitution: Replacing one character with another.

Let’s make this real. How different are the English ‘night’ and its German cognate ‘Nacht’?

night → naght (substitute ‘a’ for ‘i’) – 1 edit
naght → nacht (substitute ‘c’ for ‘g’) – 2 edits in total
Ignoring capitalization, the Levenshtein distance between ‘night’ and ‘Nacht’ is 2.

Now consider the Italian notte and Portuguese noite:

notte → noite (substitute ‘i’ for the first ‘t’) – 1 edit
The Levenshtein distance is just 1, consistent with the closer relationship between these two Romance languages.
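For readers who want to check these counts, here is a minimal sketch of the standard dynamic-programming approach to Levenshtein distance, with every edit costing 1. The function name `levenshtein` is chosen for this example; tools used in practice may weight edits differently or work on phonetic transcriptions rather than spellings.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the part of `a` processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Lowercase first so the capital N of German 'Nacht' is not counted as an edit.
print(levenshtein("night", "Nacht".lower()))  # 2
print(levenshtein("notte", "noite"))          # 1
```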

Linguists automate this process, running the Levenshtein algorithm on every cognate pair from the Swadesh lists. By normalizing each edit count by word length and then averaging the results, they can generate an overall lexical distance score. A lower average score means the dialects or languages are lexically closer.
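The averaging step might look like the sketch below. Dividing by the length of the longer word is one common normalization, assumed here because the text does not say which is used; the helper names and the three-pair sample are likewise illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word's length, giving a value in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_lexical_distance(pairs):
    """Average the normalized distance over a list of (word_a, word_b) pairs."""
    if not pairs:
        raise ValueError("need at least one word pair")
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# Three English/German pairs, for illustration only.
pairs = [("night", "Nacht"), ("water", "Wasser"), ("hand", "Hand")]
print(round(mean_lexical_distance(pairs), 3))
```

Normalizing matters because two edits in a five-letter word signal more divergence than two edits in a twelve-letter one.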

From Word Lists to Percentages

These raw scores are then often converted into a more intuitive percentage of lexical similarity. The reference work Ethnologue, for instance, publishes lexical similarity figures for many language pairs, giving us a quantitative look at language relationships:

Spanish and Portuguese: ~89% lexical similarity. Very high, confirming they are “sister” languages.
French and Italian: Also ~89%. They diverged from Vulgar Latin in parallel ways.
German and Dutch: ~81% lexical similarity. Close cousins, but with more significant differences.
English and German: ~60%. The familial link is clear, but centuries of separate development (and heavy French influence on English) have created a large gap.

These percentages help explain why a native Spanish speaker can often get the gist of written Portuguese, but a native English speaker would struggle to read a German newspaper without significant study.
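One simple way to turn an average normalized distance into a similarity percentage is to subtract it from 1 and scale to 100. This conversion is an assumption made for illustration and is not a description of how the figures quoted above were produced; the input value 0.25 is hypothetical.

```python
def similarity_percent(mean_distance):
    """Map a mean normalized distance in [0, 1] to a similarity percentage."""
    if not 0.0 <= mean_distance <= 1.0:
        raise ValueError("mean_distance must lie between 0 and 1")
    return 100.0 * (1.0 - mean_distance)


# Hypothetical mean distance of 0.25 between two varieties.
print(similarity_percent(0.25))  # 75.0
```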

What Lexical Distance Doesn’t Tell Us

While this method is incredibly powerful, it’s not the whole story. Lexical distance is just one dimension of language. It primarily measures vocabulary divergence, but it can miss other crucial factors:

Phonology (Sound): Two dialects might use identical words, but pronounce them so differently that they become mutually unintelligible. The various dialects of Chinese, for example, share a writing system but can be phonologically as different as French and Spanish.
Syntax (Grammar): English and German may share 60% of their core vocabulary, but their grammar (word order, case system, etc.) is vastly different, creating a huge barrier to comprehension.
Sociopolitical Factors: Ultimately, people decide what counts as a language. Serbian and Croatian are nearly identical in core vocabulary and mutually intelligible, but are considered separate languages for historical and political reasons. Conversely, the varieties of Arabic spoken from Morocco to Iraq differ enough that speakers from opposite ends of that range often struggle to understand one another, yet they are commonly referred to as dialects of a single “Arabic” language for cultural and religious reasons.

So, the next time you hear a debate about whether Scots is a language or a dialect of English, you’ll know that linguists can actually approach the question with data. By comparing core vocabularies and calculating their Levenshtein distance, they can quantify the lexical gap that has formed over centuries. While the final answer will always be intertwined with culture, identity, and politics, these fascinating tools give us a concrete measure of the beautiful, messy, and ever-drifting nature of human language.

Published by

LingoDigest

Tags: dialect, swadesh list, edit distance, lexical similarity, comparative method, cognates, computational linguistics, corpus linguistics

3 months ago
