I think we have all experienced a time in our lives when we have had a song stuck in our head. In my experience, it is usually a song I am not particularly fond of, but one catchy enough that my mind keeps returning to it.
This weekend, I found a wonderful data visualization from The Pudding, a weekly journal of visual essays, that asks the question: Are pop lyrics getting more repetitive? This blog post explores the process, results, and limitations of the analysis by its author, Colin Morris.
First off, how do you measure repetitiveness in song lyrics? Can you quantify it by dividing the number of unique words by the total number of words in the song? The short answer is no—it is a little more complicated. Here is an example to show us why.
According to the percent uniqueness metric, both of the above choruses would be equally repetitive: each is 52 words long and uses the same 23-word vocabulary. Obviously, this is not true, which makes this a poor method for measuring song repetitiveness. In reality, the chorus on the left is much more repetitive because it not only repeats words, but also arranges them in a predictable order.
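To see why word counts alone miss the point, here is a minimal sketch of the percent uniqueness metric. The function name `unique_word_ratio` and the two toy "choruses" are my own illustrations, not Morris's code or real lyrics. Both strings use the same words the same number of times, so the metric cannot tell the looping one from the scrambled one.

```python
def unique_word_ratio(lyrics: str) -> float:
    """Unique words divided by total words (hypothetical helper)."""
    words = lyrics.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Toy examples: same vocabulary and word counts, different order.
ordered = "la la la la da da da da"
shuffled = "da la da da la la da la"

print(unique_word_ratio(ordered))
print(unique_word_ratio(shuffled))
```

The two printed values are identical because the metric ignores word order entirely, which is exactly the weakness described above.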
Thus, Morris turns to the Lempel-Ziv algorithm to measure repetitiveness. The Lempel-Ziv algorithm is a lossless compression algorithm (think of a zip file) that “works by exploiting repeated sequences.” In short, the algorithm finds duplicated lines of lyrics and repeated character sequences inside words (like the ills in bills and thrills) and replaces them with markers that point back to an earlier occurrence. In the case of Sia’s Cheap Thrills, the chorus is reduced from 247 characters to 133 characters when using the algorithm, a total reduction of 46.2 percent.
In contrast, Morris’s original composition has a reduction of only 22.9 percent. Because the two choruses score the same on percent uniqueness but very differently here, the algorithm is a more effective way to measure song repetitiveness.
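For anyone who wants to try this at home, here is a rough sketch using Python's built-in `zlib` module. This is a stand-in, not Morris's method: zlib uses DEFLATE, which combines an LZ77-style step with Huffman coding, so its numbers will not match the ones in the essay. The helper names and the sample strings are my own, and the strings are made up rather than real lyrics.

```python
import random
import zlib


def percent_reduction(text: str) -> float:
    """Percent by which zlib shrinks the UTF-8 bytes of `text`."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    packed = zlib.compress(raw, 9)  # 9 = strongest compression level
    return 100 * (len(raw) - len(packed)) / len(raw)


def reduction_from_sizes(original: int, compressed: int) -> float:
    """Percent reduction given sizes before and after compression."""
    return 100 * (original - compressed) / original


# Made-up inputs: a looping hook versus seeded random letters of equal length.
repetitive = "na na na hey hey " * 30
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

# Check the arithmetic behind the 247 -> 133 character example.
print(round(reduction_from_sizes(247, 133), 1))
print(percent_reduction(repetitive) > percent_reduction(varied))
```

One caveat: zlib adds a few bytes of header and checksum, so a very short text can come out larger than it went in and show a negative reduction. Whole songs are long enough that this overhead matters much less.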
(If anyone is curious, here is a link to a list of several full songs run through the algorithm. The website shows you a visual representation of how the algorithm compresses the songs and arrives at their percent reduction size.)
After running 15,000 songs through this algorithm, the distribution looks like this:
To answer the question posed at the beginning of this blog, Morris compares the average percent reduction size by year of songs from 1960 to 2015. When he plots the average of all songs in any given year, there is a positive trend on the graph (blue line). In other words, the percent reduction has gone up in recent songs. The same is true when looking at the top 10 songs for any given year—although there is much more variability year to year, the overall trend is still positive (yellow line). So yes, according to this data, pop songs are becoming more repetitive.
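The yearly comparison can be sketched in a few lines. Everything here is hypothetical: the `(year, percent_reduction)` pairs are invented numbers, not values from Morris's database, and the least-squares slope is just one simple way to check whether a trend is positive.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data: (release year, percent reduction) per song.
songs = [
    (1960, 40.0), (1960, 50.0),
    (1961, 45.0), (1961, 50.0),
    (1962, 50.0), (1962, 50.0),
]

# Group the scores by year, then average each group.
by_year = defaultdict(list)
for year, score in songs:
    by_year[year].append(score)
yearly_mean = {year: mean(scores) for year, scores in sorted(by_year.items())}


def slope(points: dict) -> float:
    """Least-squares slope of the values against their keys."""
    xs, ys = list(points), list(points.values())
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator


print(yearly_mean)
print(slope(yearly_mean) > 0)  # True means the yearly average is rising
```

A positive slope here only describes the sample it was computed from, which is why the question of where the songs came from matters so much.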
I can think of several limitations for this analysis. One, I do not know where the data (songs) in Morris’s database came from. They could be the top 100 songs per year from the Billboard charts or something completely different. Two, the sample of pop songs studied does not appear to be a random sample. Morris makes no mention of randomly selecting n songs from each year, but on the flip side, it would be impossible to collect all pop songs produced in the United States in any given year. Three, as Morris himself writes, “it’s easier to find lyrics for recent songs,” which could mean new songs are overrepresented in the data set. In the end, I would be careful about making any general claims from this analysis, although it is an entertaining project.
FUN FACT: Rihanna was the most repetitive artist in the database.