How Can Data be Used to Prove Authorship?


Last week, I wrote about how data could be used to analyze the lyrics of the musical Hamilton. This week, I’m revisiting the theme of using data and algorithms to glean new insights about written works. I read a 2013 article in which the author, Patrick Juola, explains how he used computational linguistics to suggest that JK Rowling was the real author of The Cuckoo’s Calling. The article was published on Language Log, a blog that analyzes linguistics in pop culture and media.

Background

The Cuckoo’s Calling by “Robert Galbraith” was published in April 2013 without much attention from the public. A month or two after its publication, the UK newspaper The Sunday Times received an anonymous tip that the novel was actually written by famous author JK Rowling. The newspaper thought the tip was worth investigating, so it approached the article’s author, Patrick Juola, a computer science professor and expert in text analysis, to look into the matter.

Juola also gives some interesting background on the history of text analysis. He explains that “language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices”. These choices come from geographic dialects, the setting in which the language is being produced, and “free variation”, or seemingly random personal word choice. Juola writes that free variation is usually pretty consistent for any given individual, so analyzing multiple texts written by one person can reveal the trends in that person’s free variation. He also explains that the idea of measuring free variation goes back to the 19th century, when mathematician Augustus de Morgan proposed using average word lengths to settle disputes over authorship.

The Method

To determine if JK Rowling was the true author of The Cuckoo’s Calling, Juola broke the text of the book into 1,000-word chunks. For comparison, he used samples of JK Rowling’s previous novel The Casual Vacancy and samples of three books whose authors were identified as strong candidates. Juola ran four different tests, each focusing on a different linguistic variable. The first variable was word length. Six of the 11 Cuckoo chunks were closest to JK Rowling’s previous work. The next variable was the 100 most common words in each text. Four of the chunks were most similar to Rowling, while the others were more similar to other authors. The next variable was “4-grams”, or how often the authors used consecutive sets of four letters in their words. This test showed a preference for one of the other authors, and a secondary preference for Rowling. The final variable was the frequency of word pairs, on which 9 of the chunks showed a preference for Rowling.
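To make the method a bit more concrete, here is a rough Python sketch of the chunk-and-compare idea described above. This is not Juola’s actual software: the function names, the simple tokenizer, and the distance measure (half the summed absolute difference between relative frequencies) are all my own assumptions, chosen only for illustration.

```python
import re
from collections import Counter


def words(text):
    """Lowercase word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def chunks(tokens, size=1000):
    """Consecutive chunks of `size` words; a short leftover tail is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


# Four feature extractors, one per variable mentioned in the article.
def word_lengths(tokens):
    return Counter(len(w) for w in tokens)


def top_100_words(tokens):
    # Assumption: the 100 most common words are picked per text.
    return Counter(dict(Counter(tokens).most_common(100)))


def char_4grams(tokens):
    # Runs of four consecutive letters inside each word.
    return Counter(w[i:i + 4] for w in tokens for i in range(len(w) - 3))


def word_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))


def distance(a, b):
    """0.0 for identical relative-frequency profiles, 1.0 for disjoint ones."""
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    keys = set(a) | set(b)
    return sum(abs(a[k] / total_a - b[k] / total_b) for k in keys) / 2


def vote(unknown_text, samples, feature, size=1000):
    """Count, per candidate author, how many chunks are closest to that author."""
    profiles = {name: feature(words(text)) for name, text in samples.items()}
    tally = Counter()
    for chunk in chunks(words(unknown_text), size):
        chunk_profile = feature(chunk)
        best = min(profiles, key=lambda name: distance(chunk_profile, profiles[name]))
        tally[best] += 1
    return tally
```

Calling `vote(cuckoo_text, {"Rowling": ..., "Author B": ...}, word_lengths)` with real text (the variable and author names here are placeholders) would give a per-author tally of "closest" chunks for one variable, which is the kind of count reported for each of the four tests.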

Results

Since Rowling was either the first or second choice in each category, Juola concluded that, of the four authors analyzed, Rowling was the strongest candidate to be the author. He writes that this result doesn’t prove that Rowling was actually the author. All it suggests is that the author of The Cuckoo’s Calling has a style that’s similar to Rowling’s. However, since Juola’s report showed some evidence that Rowling could be the author, The Sunday Times used the report to approach the publisher about the authorship of The Cuckoo’s Calling. The next day, JK Rowling confirmed that she was the author of the book.

Juola emphasizes that his report was not “proof” that Rowling wrote the book; it just suggested her as a possible author. He writes that if the author had not been able to confirm authorship, “this is the kind of thing that could and would be argued about in the journals for decades”. But he goes on to write that running more experiments with more texts and different variables would strengthen the claim.

Discussion

I thought this article was fascinating. It made me want to do an analysis of my own free variation when writing! I thought one of the strengths of the article was the author’s admission that his analysis gave suggestions, not proof, of authorship. It’s important to keep in mind in our increasingly data-driven world that algorithms often provide suggestions instead of solid proof. Humans are needed to interpret results and decide what to do with them.

I had a few questions about the article. I would have appreciated more information about how the three other texts the author compared The Cuckoo’s Calling to were chosen. Did they have a similar writing style? Or similar plots or genres? Who chose those three books, and why did they choose them? I also wondered why they chose only three other books, since it seems to me the test would have been more valid if the chunks were compared to a greater number of possible authors.
