How can you be so sure that Shakespeare wrote Shakespeare? Since the mid-nineteenth century, this question has fueled fiery debates, conspiracy theories, and a cottage industry of skeptics championing alternative candidates. But what if there were a way to lift a writer’s linguistic fingerprint directly from the page: a signature, invisible to the naked eye, that could link an author to their work with measurable confidence? There is, and it’s called stylometry.
Stylometry is the statistical analysis of literary style, a fascinating intersection of linguistics and data science. It’s the ultimate forger’s bane and a historian’s secret weapon. It operates on a simple but powerful premise: we all have unique and largely subconscious writing habits that, when quantified, create a distinct and measurable profile.
The Telltale Heart of a Text: It’s Not the Words You Think
When we think of a writer’s “style”, we often think of the conscious choices they make. Do they use flowery, polysyllabic words like H.P. Lovecraft, or spare, punchy prose like Ernest Hemingway? Do they favor complex metaphors or straightforward descriptions? While these elements are part of the picture, they are also the easiest to imitate. A clever forger can mimic a writer’s vocabulary or sentence structure for a few paragraphs or even a chapter.
But stylometry’s secret formula lies in the words we don’t even notice: the small, topic-independent “function words”. These are the grammatical nuts and bolts of language. Think of:
- Articles: a, an, the
- Prepositions: of, to, in, for, on, with, at
- Conjunctions: and, but, or, that, while
- Pronouns: he, she, it, they, his, her
These words are the unsung heroes of our sentences. A writer’s preference for “on” over “upon”, or “while” over “whilst”, is dictated less by the topic at hand than by deep-seated, largely unconscious habit. Whether you are composing a love sonnet or a grocery list, the rate at which you use the word “the” or start a clause with “but” tends to stay remarkably consistent. It’s this consistency that forms the bedrock of your linguistic fingerprint.
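To make the idea concrete, here is a minimal sketch of how such a profile might be counted. The word list is just the handful of examples above, and the names `FUNCTION_WORDS`, `tokenize`, and `function_word_profile` are invented for illustration; real studies use far longer lists and far longer texts.

```python
import re
from collections import Counter

# Illustrative subset only: the example words listed above.
FUNCTION_WORDS = [
    "a", "an", "the",
    "of", "to", "in", "for", "on", "with", "at",
    "and", "but", "or", "that", "while",
    "he", "she", "it", "they", "his", "her",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return each function word's rate per 1,000 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return {word: 0.0 for word in vocabulary}
    counts = Counter(tokens)
    return {word: 1000 * counts[word] / len(tokens) for word in vocabulary}

sample = "The cat sat on the mat, and the dog lay in the hall."
profile = function_word_profile(sample)
print(round(profile["the"], 1))  # "the" is 4 of 13 tokens
```

Rates per 1,000 tokens, rather than raw counts, are what let a short letter be compared with a long novel.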
Sophisticated stylometric analysis doesn’t just count one or two words. It simultaneously analyzes the frequencies of the 100, 200, or even 500 most frequent words in a text, with function words dominating the top of that list. It also looks at other subtle markers:
- Punctuation Patterns: How often does an author use semicolons versus colons? What is their comma-to-period ratio?
- Sentence Length: What is the average sentence length, and what is the standard deviation? Does the author prefer long, winding sentences or short, sharp ones?
- N-grams: These are contiguous sequences of items, such as characters or words. A word-level “3-gram” analysis might look at the frequency of all three-word phrases (e.g., “in the house”, “as a matter”, “for the most”).
When you combine these dozens or hundreds of variables, you create a high-dimensional statistical profile that is incredibly difficult to fake deliberately.
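The markers in that list can each be measured in a few lines. The sketch below is one simple way to do it, assuming a naive sentence splitter (break on periods, exclamation marks, and question marks) and word-level 3-grams; the function names are invented for illustration, and a serious study would use a proper tokenizer.

```python
import re
import statistics
from collections import Counter

def sentence_lengths(text):
    """Split naively on . ! ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]

def punctuation_rates(text, marks=",;:."):
    """Occurrences of each punctuation mark per 1,000 characters."""
    total = max(len(text), 1)
    return {mark: 1000 * text.count(mark) / total for mark in marks}

def word_ngrams(text, n=3):
    """Count contiguous n-word sequences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def style_profile(text):
    """Bundle the individual measurements into one profile."""
    lengths = sentence_lengths(text)
    return {
        "mean_sentence_length": statistics.mean(lengths) if lengths else 0.0,
        "sentence_length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        "punctuation": punctuation_rates(text),
        "top_trigrams": word_ngrams(text, 3).most_common(3),
    }

sample = "It was late. In the house, the lights were off; in the house, nobody stirred."
print(style_profile(sample))
```

Each measurement is crude on its own. The discriminating power comes from stacking many of them into a single profile.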
Stylometry in the Dock: Famous Cases
Stylometry isn’t just a theoretical toy for linguists; it has been used to solve real-world literary mysteries and expose high-profile secrets.
The Federalist Papers
One of the earliest and most famous triumphs of stylometry involved The Federalist Papers, a series of 85 essays written in 1787–88 to promote the ratification of the U.S. Constitution. They were published anonymously under the pseudonym “Publius”, and while authorship was later claimed by Alexander Hamilton, James Madison, and John Jay, 12 of the essays were disputed between Hamilton and Madison.
In the 1960s, statisticians Frederick Mosteller and David Wallace performed a groundbreaking analysis. They set aside the lofty political arguments and instead counted the frequency of function words. They discovered, for example, that Madison was fond of the word “whilst”, while Hamilton almost exclusively used “while”. Madison also used “upon” far less frequently than Hamilton. By comparing the disputed papers to the known writings of both men, they concluded that the statistical odds strongly favored James Madison as the author of all 12 disputed papers.
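The marker-word idea can be sketched in a few lines. This is not Mosteller and Wallace’s actual method, which was a much more careful statistical model; it is a simplified stand-in that picks the candidate whose marker-word rates are nearest by Manhattan distance. The marker list, the function names, and the sample sentences are all invented placeholders, not quotations from any real author.

```python
import re

# Hypothetical marker words, chosen to echo the examples in the text.
MARKERS = ["while", "whilst", "upon", "on"]

def marker_rates(text, markers=MARKERS):
    """Rate of each marker word per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return [1000 * tokens.count(marker) / total for marker in markers]

def closest_author(disputed, known_texts):
    """Return the candidate whose marker rates are nearest to the disputed text."""
    target = marker_rates(disputed)

    def distance(author):
        rates = marker_rates(known_texts[author])
        return sum(abs(t - r) for t, r in zip(target, rates))

    return min(known_texts, key=distance)

# Invented placeholder sentences, not real writing samples.
known = {
    "author_a": "We rely upon reason while we act upon duty, and upon law while we wait.",
    "author_b": "We rely on reason whilst we act on duty, and on law whilst we wait.",
}
disputed = "They insist on order whilst they depend on trust."
print(closest_author(disputed, known))  # author_b
```

With samples this tiny the result proves nothing, of course. The method only becomes meaningful with thousands of words per author and many more markers.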
Unmasking Robert Galbraith
Fast forward to 2013. A debut crime novel called The Cuckoo’s Calling by an unknown author named Robert Galbraith was released to critical acclaim but modest sales. When a journalist received an anonymous tip that Galbraith was a pseudonym for J.K. Rowling, researchers Patrick Juola and Peter Millican put stylometry to the test.
They compared the text of The Cuckoo’s Calling to Rowling’s The Casual Vacancy and the works of several other authors suggested as possibilities. The results pointed in one direction. The linguistic fingerprint of The Cuckoo’s Calling, measured through its patterns of function words, sentence length, and common word pairings, matched Rowling more closely than any other candidate. That is strong evidence rather than proof, but Rowling soon confirmed her authorship. The “secret” was out, and sales skyrocketed.
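A comparison against several candidates is often framed as a distance score. The sketch below uses Burrows’s Delta, a standard stylometric measure, purely as an illustration; it is not presented as the procedure Juola or Millican followed. The candidate names and sentences are invented placeholders, and the version here uses the population standard deviation and skips words that do not vary between candidates.

```python
import re
import statistics
from collections import Counter

def relative_freqs(text):
    """Map each word to its share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return {word: count / total for word, count in Counter(tokens).items()}

def burrows_delta(disputed, candidates, n_words=30):
    """Return {author: delta}; a lower delta means a closer stylistic match."""
    profiles = {name: relative_freqs(text) for name, text in candidates.items()}
    target = relative_freqs(disputed)

    # Vocabulary: the most frequent words across all candidate texts.
    pooled = Counter()
    for text in candidates.values():
        pooled.update(re.findall(r"[a-z']+", text.lower()))
    vocabulary = [word for word, _ in pooled.most_common(n_words)]

    totals = {name: 0.0 for name in candidates}
    used = 0
    for word in vocabulary:
        values = [profiles[name].get(word, 0.0) for name in candidates]
        mean, stdev = statistics.mean(values), statistics.pstdev(values)
        if stdev == 0:
            continue  # this word does not separate the candidates
        used += 1
        z_target = (target.get(word, 0.0) - mean) / stdev
        for name in candidates:
            z_candidate = (profiles[name].get(word, 0.0) - mean) / stdev
            totals[name] += abs(z_target - z_candidate)
    return {name: total / max(used, 1) for name, total in totals.items()}

# Invented placeholder sentences, not real writing samples.
candidates = {
    "author_x": "The king spoke of the war and the court heard of the peace.",
    "author_y": "A king spoke about a war but a court heard about a peace.",
}
disputed = "A queen wrote about a storm but a city read about a flood."
scores = burrows_delta(disputed, candidates)
print(min(scores, key=scores.get))  # author_y
```

Note that the disputed sentence shares no nouns or verbs with either candidate. The match comes entirely from the small words, which is the point.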
The Bard and His Collaborators
And what about Shakespeare? Stylometry has been a powerful tool in Shakespeare studies. While it consistently supports the view that one man named William Shakespeare was the primary author of the plays attributed to him, it has also revealed something more nuanced: he was a collaborator.
Analyses have identified the distinctive stylistic signatures of other playwrights of the era within the Shakespearean canon. For example, it’s now widely accepted that George Peele likely wrote the first act of Titus Andronicus and that Thomas Middleton had a hand in Timon of Athens. Late plays such as Henry VIII and The Two Noble Kinsmen show clear evidence of collaboration with John Fletcher. Far from diminishing Shakespeare, this places him firmly in the collaborative theatrical world of Elizabethan and Jacobean England.
The Forger’s Unwinnable Game
Stylometry is the forger’s bane because it changes the nature of the game. A forger focuses on mimicry—copying the visible, conscious elements of a style. But stylometry reveals the invisible, unconscious patterns. To defeat it, a forger wouldn’t just need to write a story in another person’s voice; they would have to maintain the precise statistical frequency of hundreds of function words over tens of thousands of words.
It’s a nearly impossible task, akin to forging a signature while also replicating the signer’s heart rate and breathing patterns during the act of writing. Our most common, throwaway words, the ones we write without a second thought, are precisely the ones that give us away. They are the secret formula of our identity, written in plain sight for anyone with the right tools to read.