The Making of a Data Visualization

I was inspired by the blustery weather outside to write a weather-related blog post today. In searching for interesting weather data, I stumbled upon this beautiful infographic created by Nicholas Rougeux, which can be found on his blog.

huge-2014

The poster shows weather data from 2014 in 50 major US cities. Each city is represented by a circle. Each circle is composed of 365 days’ worth of weather data from the Quality Controlled Local Climatological Data, with each day represented by a circle on a line. Each circle on a line shows five measurements for the day: highest temperature, lowest temperature, range of temperatures, wind direction, and wind speed. Check out the image below to learn more about how to read each data point:

key

Here’s a close-up of some of the cities on the infographic:

closeups

This infographic first stood out to me because of how pretty it is to look at. Each city is a unique explosion of color that makes the weather for the year in that city look exciting. I also appreciated how informative the image is. For example, by quickly looking at the images above you can learn that Las Vegas had the hottest temperatures of the three cities and tends to have wind from the southwest. Chicago had the most varied wind scores, and Fargo had the strongest winds. Both Chicago and Fargo had a mix of warm and cold temperatures during the year. The image is an interesting and fun way to compare and contrast weather in different places across the US.

I was interested in how Rougeux created this image, so I checked out his page on the making of the infographic. First, Rougeux talks about how he was inspired to create an image about weather because it’s something that everyone, everywhere experiences every day. Rougeux wrote about how it was challenging to create a design where each data point remained relatively equal. He didn’t want the days with warmer temperatures to visually overpower the colder days due to brighter colors or larger sizes. His solution to this problem was having the size of the temperature circles reflect the range of temperatures on that day. This would give the warm and cool days equal visual presence. I thought it was interesting how the choices Rougeux made about how to present the information visually changed both the aesthetic appeal of the image and the way viewers perceive the information.

I also thought Rougeux’s rough drafts of the image were fascinating. I highly suggest checking out his site to see them all. One I found interesting was the 8th version Rougeux tried for visualizing the Chicago data. The triangles pointing upwards are the highs for each day, and the triangles pointing downwards are the lows. The pairs are plotted horizontally from largest to smallest range of temperature, and smaller triangles are placed in front of larger ones. I think this image does a good job of showing the range, but it makes it hard to see individual data points, and it makes it seem that Chicago was hotter than it really was since the cooler colors tend to blend into the middle.

weather-v8.png

I think Rougeux’s work is a great example of how data, technology, and art can work together. If I had just looked at the weather data presented in a table or list, I probably wouldn’t have been very interested. However, seeing the infographic made me want to look closer to analyze weather trends, or maybe even hang the image up on my wall as a poster. I also appreciated that Rougeux showed some of his drafts. It helped me understand that data visualizations can emphasize different aspects of the data depending on how it is presented.


SmartBoards and SmartIDs? The Internet of Things in Schools

iot

It’s 8am and 8-year-old Anna is ready for school. As she gets on her school bus, she swipes her student ID, which contains an RFID chip. A text is sent to her parents and teacher, telling them that she is safely on the bus and on her way to school. After arriving at school, Anna taps her ID at her classroom door to sign in. Her teacher, freed from the burden of taking daily attendance, has already begun to help her students log into their Chromebooks and begin their individualized daily math practice. After math, Anna’s teacher pulls up a kids’ news website on the interactive whiteboard and calls different students to come to the board and underline key ideas. Anna sits near the back and can’t see the board very well. She quickly loses interest and begins chatting to her neighbor. Her teacher sees she’s off task and deducts a point from Anna on their classroom management system. Anna sighs, knowing that her parents log in and check how many points she earned or lost each day. The system begins to identify a trend: that she is often off task during whole group lessons. This information will travel with her to the next grade, and her teacher next year will have access to it before they meet Anna in person.

At lunch, Anna chooses chocolate milk instead of low fat. Her lunch choices are sent to her teacher’s classroom management system (to help identify trends between food choices and behavior) and to her parents (who monitor if she makes healthy choices). After she eats, she heads out to play with the other kids. Her RFID-enabled ID sends information to the teacher about who Anna plays with at lunch, and the teacher uses the information to plan effective student groupings for the next lesson and to be aware of possible bullying situations. When Anna returns to the classroom, her teacher has all the students put on headbands that will track their brain activity and engagement in their reading lesson. Everyone is excited about the new technology they are trying out. To end the day, Anna’s class goes to PE, where they are graded on data from heart rate monitors. Anna then packs up and taps her ID card on the door of the bus to let her parents know she’s on her way home.

The story I just told may sound futuristic, but the technology and tracking systems I described are already being used in schools across America. They are described in the articles I read this week: How Will the Internet of Things Impact Education? from EDtech magazine, A day in the life of a data mined kid from Marketplace Podcast, and Connecting the Classroom with the Internet of Things from EdSurge. Each of these articles had a different slant, but all agreed that the “Internet of Things” (IOT) has a lot of potential to both help and hurt public education.

The Potential:

The IOT has the potential to free up time for teachers and create more time for meaningful instruction. The EdSurge article writes that of the approximately 1025 hours kids spend in school, over 308 hours are lost to non-instructional tasks such as classroom management or taking attendance. With RFID ID cards or wristbands and computerized classroom management systems, technology takes care of these tasks or makes them more efficient. Kids stay safer too, since their whereabouts are always known.

IOT can also help teachers identify trends that can lead to better instruction. Data from computerized classroom management systems can show patterns of when and where students act out or get off task, and help teachers fix the problem. Using headbands that monitor students’ brain activity literally gives teachers an opportunity to see what’s going on in their students’ heads, and to help the students who may need assistance but are too shy to ask for it. Technology can also identify connections between the types of food kids eat at lunch and their academic performance later in the day. This information can be passed from teacher to teacher, making it easier for teachers to create meaningful lessons for students from the first day of school. IOT has a lot of potential to optimize teaching and learning.

The Problems:

One of the largest problems with using IOT in education is the risk that comes with collecting a tremendous amount of personal data on every student. There’s a chance that sensitive data could end up in the wrong hands. Things get very messy when a child’s disciplinary data, health data, social data, economic data, and even brain data are all tied together with the student’s ID.

All this data creates a picture of the student that might not be completely accurate. Think of Anna in the story. If Anna’s teacher didn’t realize that Anna was off task because she couldn’t see the board during reading, she might think that Anna is a poor reader and attach a label in her data that will travel with her till graduation. Having data from previous teachers on individual students gives teachers less incentive to get to know their students and their needs for themselves. Because teachers input a lot of the data, they might be creating or upholding biases about their students without realizing it. Of course, you don’t need smart devices to have biases and make decisions based on them. However, the IOT and smart devices scale up the impact of these biases.

There’s also the issue of the digital divide. Kids from wealthy families are more likely to have access to technology at home. This gives them an advantage when it comes to testing or working with technology in the classroom. Their data may report that they are better at math and reading than disadvantaged kids, but really they might just be better at using the computers and tablets used to assess math and reading.

IOT has a lot of potential to better classrooms, but it’s important that teachers, parents, and school districts all consider the risks involved with collecting large amounts of data on every student.

 

How Can Data be Used to Prove Authorship?

mystery-book

Last week, I wrote about how data could be used to analyze the lyrics of the musical Hamilton. I’m revisiting that theme of using data and algorithms to glean new insights about written works. This week, I read this article written in 2013 in which the author, Patrick Juola, explains how he used computational linguistics to suggest that JK Rowling was the real author of The Cuckoo’s Calling. The article was published on Language Log, a blog that analyzes linguistics in pop culture and media.

Background

The Cuckoo’s Calling by “Robert Galbraith” was published in April 2013 without much acknowledgement from the public. A month or two after its publication, the UK newspaper The Sunday Times received an anonymous tip that the novel was actually written by famous author JK Rowling. The newspaper thought the tip was worth investigating so they approached the article author Patrick Juola, a computer science professor and expert in text analysis, to look into the matter.

Juola also gives some interesting background on the history of text analysis. He explains that “language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices”. These choices come from geographic dialects, the setting in which the language is being produced, and “free variation” or seemingly random personal word choice. Juola writes that free variation is usually pretty consistent for any given individual, and analyzing multiple texts written by an individual can reveal the trends in the free variation of their writing. Juola also explains that the idea that free variation can be measured goes back to the 19th century, when mathematician Augustus de Morgan proposed using average word lengths to settle disputes over authorship.

The Method

To determine if JK Rowling was the true author of The Cuckoo’s Calling, Juola broke the text of the book into 1,000 word chunks. He then used samples of JK Rowling’s previous novel The Casual Vacancy and samples of three books identified as strong candidates for possible authors to compare the text to. Juola ran four different tests, each focusing on a different linguistic variable. The first variable was word length. Six of the 11 Cuckoo chunks were closest to JK Rowling’s previous work. The next variable was the 100 most common words in each text. Four of the chunks were most similar to Rowling, while the others were more similar to other authors. The next variable was “4-grams”, or how often the authors used consecutive sets of four letters in their words. This test showed a preference for one of the other authors, and a secondary preference for Rowling. The final variable was the frequency of word pairs, for which 9 of the chunks showed a preference for Rowling.
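To get a feel for what these four tests measure, here is a rough sketch in Python. It is not Juola's actual software or scoring method, which the article doesn't spell out. Every function name here is my own, the cosine similarity used to compare profiles is an assumption on my part, and the chunk size of 1,000 words is the only detail taken from his description.

```python
from collections import Counter
import math
import re


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def chunks(tokens, size=1000):
    """Split a list of tokens into consecutive chunks of `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def word_lengths(tokens):
    """Variable 1: how often each word length occurs."""
    return Counter(len(t) for t in tokens)


def top_words(tokens, n=100):
    """Variable 2: counts of the n most common words in the text."""
    return Counter(dict(Counter(tokens).most_common(n)))


def char_4grams(tokens):
    """Variable 3: counts of every run of four consecutive characters."""
    text = " ".join(tokens)
    return Counter(text[i:i + 4] for i in range(len(text) - 3))


def word_pairs(tokens):
    """Variable 4: counts of adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def cosine(a, b):
    """Similarity between two count profiles (assumed measure, 0 to 1)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_author(chunk, samples, feature):
    """Return the candidate whose sample text best matches the chunk.

    `samples` maps a candidate's name to the tokens of their known writing.
    """
    profile = feature(chunk)
    return max(samples, key=lambda name: cosine(profile, feature(samples[name])))
```

With real texts loaded into `samples`, you would call `closest_author(chunk, samples, word_lengths)` for each chunk and each of the four feature functions, then count how often each candidate comes out on top, which is roughly the kind of tally the article reports.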

Results

Since Rowling was either the first or second choice in each category, Juola concluded that Rowling was the strongest candidate out of the four authors analyzed to be the author. He writes that this result doesn’t prove that Rowling was actually the author. All it suggests is that the author of The Cuckoo’s Calling has a style that’s similar to Rowling’s. However, since Juola’s report showed some evidence that Rowling could be the author, The Sunday Times used the report to approach the publisher about the authorship of The Cuckoo’s Calling. The next day, JK Rowling confirmed that she was the author of the book.

Juola emphasizes that his report was not “proof” that Rowling wrote the book; it just suggested her as a possible author. He writes that if the author had not confirmed authorship, “this is the kind of thing that could and would be argued about in the journals for decades”. But he goes on to write that running more experiments with more texts and different variables would strengthen the claim.

Discussion

I thought this article was fascinating. It made me want to do an analysis of my own free variation when writing! I thought one of the strengths of the article was the author’s admission that his analysis gave suggestions, not proof, of authorship. It’s important to keep in mind in our increasingly data-driven world that often algorithms provide suggestions instead of solid proof. Humans are needed to interpret results and decide what to do with them.

I had a few questions about the article. I would have appreciated more information about how the three other texts the author compared The Cuckoo’s Calling to were chosen. Did they have a similar writing style? Or similar plots or genres? Who chose those three books, and why did they choose them? I also wondered why they chose only three other books, since it seems to me the test would have been more valid if the chunks were compared to a greater number of possible authors.

Using Data to Analyze Hamilton

Screenshot (1)

I checked out the website Stephanie recommended in her blog post presentation (The Pudding), and stumbled upon an article that uses an interactive visualization to analyze the musical Hamilton. An Interactive Visualization of Every Line in Hamilton is the author Shirley Wu’s attempt to visualize the text of Lin-Manuel Miranda’s Hamilton. Wu writes that she was interested in visualizing the relationships between various characters and the themes associated with them. To do this, she went through every single line of Hamilton and recorded who sang the line, who the line was sung about, and what themes were expressed by lines repeated in multiple songs. She then created a visual tool that filters through the codes she attached to each line or group of lines.
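To picture what "attaching codes to each line" might look like as data, here is a small Python sketch. It is only my guess at a possible structure, not Wu's actual data format or code. The class name, field names, and filter functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CodedLine:
    """One line (or group of lines) with the codes attached to it."""
    song: str                                 # which song the line is in
    singer: str                               # who sings the line
    about: str                                # who the line is sung about
    themes: set = field(default_factory=set)  # recurring themes, if any


def between(lines, a, b):
    """Keep lines where a sings about b, or b sings about a."""
    return [line for line in lines if {line.singer, line.about} == {a, b}]


def with_theme(lines, theme):
    """Keep lines that were coded with the given theme."""
    return [line for line in lines if theme in line.themes]
```

A filter like `between(lines, "Hamilton", "Angelica")` would keep only the lines the two characters direct at each other, and chaining it with `with_theme` would narrow the result to one recurring theme.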

Wu visualizes the data in a few different ways. First, if you scroll down a little on the web page, you’ll find an image of forty-something clusters of colored dots. Each cluster represents a song, and each colored dot is a set of lines within the song, color coded by speaker. Just looking at this visualization can tell you a lot about the musical. For example, the colors that come up the most are purple and teal – the colors that represent lines spoken by Aaron Burr and Alexander Hamilton. These are the two main characters, so it makes sense that they have the most lines.

Next, Wu filters the results by relationships. Her next visualization filters lyrics by who is speaking and who is being spoken about. The page is automatically set to show lines where Alexander Hamilton and Angelica Schuyler speak to or about each other, but you can add characters to the visualization to further explore relationships. As you scroll down, Wu adds themes and recurring phrases to her visualization of the relationship between Angelica, Alexander, and Eliza. I thought this page was particularly interesting. I liked the idea of filtering the text of the musical by speaker and theme. It’s a cool way to visually see the importance or progression of relationships without having to read the entire text of the musical.

At the end of the article, Wu gives the reader a chance to try exploring relationships and themes between any of the main characters of the musical. This is where I started thinking about the limitations of the article. I realized that although I loved how Wu walked through the nuances of the relationships she highlighted earlier, she didn’t write enough about how to use her visualization for me to feel comfortable using it myself. I also noticed that oftentimes when I selected two characters, the visualization would show songs in which both characters spoke but didn’t interact with each other. This made it look like the characters had a more significant relationship than they really did. Another big limitation in this article is potential bias and validity. Can we trust that Wu interpreted the themes and relationships that she coded the way the author intended? I think the only way to eliminate bias would be to confirm the codes with the creator.

Despite some limitations, I thought the idea of using data to analyze a work of art was fascinating. The article’s visualization tool made it easy to see how themes and relationships developed across the musical, and provided insights that would be difficult to come to just reading the text. I don’t think reading the data alone could replace reading (or watching or listening to, in this case) a text itself, but it could help readers quickly and clearly analyze it. I would love to see data visualizations of other books or plays or musicals I enjoy and see what I could learn from them.

Can Education Fix Economic Inequality?

American income inequality is one of those issues that seems to come up every year around election season. Every candidate and party has their own ideas and solutions to the problem, but everyone seems to agree that strengthening our education system and getting more people through college will help fix the problem of income inequality. It makes sense, doesn’t it? If more disadvantaged people had the opportunity to get an education and graduate from college, they’d get higher-paying jobs and start to close the gap. However, a New York Times article summarizing recent research findings begs to differ.

The article summarizes a paper written by researchers at the Hamilton Project (which is a politically moderate economic research group). The paper argues that although more education would help the middle and lower class find greater economic success, it wouldn’t be able to change the greater system of inequality in America. The researchers reached this conclusion by running a simulation that assumed 10 percent of non-college educated men in America suddenly received a college diploma and the pay raise that usually comes with it (I was curious why the simulation only gave college degrees to men. I looked at the actual paper, and found they explained that they used only men because low-income men have the largest drops in employment and earnings, and the lowest college graduation likelihood). After granting the college diplomas in the simulated world, the average lower to middle class income increased by 9 percent.

Then things get a little complicated. I don’t know much about economics, so it was hard for me to interpret the results. Basically, although more education increased the income of low- or middle-income men in the simulation, those in the top ten percent of earners also enjoyed higher incomes due to inflation. The Gini ratio, which is a measure of economic inequality, went from .57 in the actual 2013 data to .55 in the simulated data. Since a lower Gini ratio means less inequality, that means higher education attainment narrowed the inequality gap only slightly. The article ends by arguing that a stronger education system and more college graduates would help those in the middle and lower class by raising their average income. However, improving education won’t close the income gap and fix systemic inequality.
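Since the Gini ratio was new to me, I sketched out one common way to compute it: take the average absolute difference between every pair of incomes and divide by twice the average income. A result of 0 means everyone has the same income, and values closer to 1 mean income is concentrated in fewer hands. This is just an illustration with toy numbers I made up, not the researchers' method or data.

```python
def gini(incomes):
    """Gini ratio from the absolute differences between all pairs of incomes."""
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero income")
    # Sum |a - b| over every ordered pair, then normalize by 2 * n^2 * mean,
    # which is the same as 2 * n * total.
    diff_sum = sum(abs(a - b) for a in incomes for b in incomes)
    return diff_sum / (2 * n * total)


# Toy incomes, invented purely for illustration.
print(gini([1, 1, 1, 1]))  # everyone equal -> 0.0
print(gini([0, 0, 0, 1]))  # one person has everything -> 0.75
```

Seen this way, a move from .57 to .55 is a small step toward the equal end of the scale.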

I thought this study was interesting, but I worried about how it might be interpreted. Although increasing education levels in the middle and lower classes might not fix our broken system, it could have a great impact on the lives of the individuals who increase their education. Attaining higher education can provide an economic boost that could easily change a family’s life for the better. Besides that, I like to think that there is value beyond economic gains in a college education, like learning to interact with the world on a deeper level and making connections with mentors and peers. The economic gains due to higher education attainment may not change the inequality gap on their own. But I wonder if the motivation, self worth, and social capital lower income individuals could gain through higher education could change things. Now that would be an interesting simulation.

English Language Learners in US Schools

When I tell people that I’m minoring in Teaching English as a Second Language (TESL), the first question they usually ask is where I’m planning to travel to teach English. Although I’ve definitely thought about teaching English abroad at some point, the reason I have my TESL minor is so I can better teach children who already live in the US and are learning English. Since teaching English to people already in the US seems to be a confusing concept for some, I looked up the National Center for Education Statistics page on English Language Learners. I think the data they present makes a strong argument for why US teachers need a background in TESL.

First, what is an English Language Learner (ELL)? ELLs are students who speak a language other than English at home and participate in some kind of language program within their school to help them reach English proficiency. These students are labeled as everything from “Limited English Proficiency Students” to “Emergent Bilinguals” depending on the school or district, but they all fall under the term ELL.

According to the most recent data on the National Center for Education Statistics webpage (which is a few years old, since it’s from 2015), 9.4 percent of US students, or 4.5 million, are ELLs. With class sizes of around 25, that means if all the ELLs in the US were distributed evenly, every classroom would have 2 or 3 students still in the process of learning the language of instruction.
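The back-of-the-envelope math behind that “2 or 3” is easy to check. The 9.4 percent share and the class size of 25 come from the paragraph above; the function name is just mine.

```python
def ells_per_class(ell_share, class_size):
    """Expected number of ELLs in a class if ELLs were spread out evenly."""
    return ell_share * class_size


# 9.4 percent of a 25-student class is a bit over 2 students.
print(round(ells_per_class(0.094, 25), 2))  # 2.35
```

Since you can’t have 2.35 students, an even spread would put 2 ELLs in most classrooms and 3 in some.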

Of course, ELLs are not distributed evenly across the US. Due to a variety of reasons, some states have a lot more than others. Check it out on the graph below:

Screenshot (17)

This image shows the percentage of K-12 public school students who have been identified as ELLs in each state. The darker the color, the higher the percentage of ELLs in that state. The state with the highest percentage of ELLs is California, with 22.4 percent. The state with the lowest percentage is West Virginia, at 1.0 percent. Every other state falls somewhere in between. Utah, the state where I’ll probably start teaching, has 6.3 percent. That’s below the national average, but still means there are a lot of students in need of teachers who are trained in TESL.

Reporting that Utah has 6.3 percent ELLs doesn’t mean that every school in Utah has 6 ELLs for every 100 students enrolled at the school. The percent will vary based on the city and neighborhood the school is in. The article reports that cities tend to have a higher percent of ELLs than suburbs or rural areas. So it makes sense that many of my neighbors in my White suburban neighborhood questioned why I would need a TESL minor if I’m planning to stay in Utah. They didn’t see the need for it in their neighborhood school, and assumed other schools in Utah would have similar demographics.

The report I read left me with some questions. Since the data was collected in 2015, I wondered how it might have changed in the last two years. I also was interested in the information in one of the footnotes, which reads “Data do not include students who were formerly identified as ELLs but later obtained English language proficiency.” In my TESL classes, we’ve talked about how common it is for students to test out of ELL programs, or be marked as having achieved English language proficiency, too early. Although their conversational English may be proficient, their academic language may be far behind that of their peers. I wondered how many of those students there are in the US, and what the average percent of ELLs would be if they were counted.

 

The Statistics of Hogwarts Houses

Disclaimer: This post requires some working knowledge of the characteristics of the four Hogwarts houses. If you are not familiar with the Hogwarts houses, you might want to quickly skim over the traits associated with each house here.

762fe0dc-5770-4614-81a6-02bc8ec19c25.jpg
Warner Bros.

Although we never got our Hogwarts letters, almost all my friends can tell you what Hogwarts house they belong in. I’m a proud Ravenclaw, and could explain to you in detail why that house best matches my personality. In June 2017, the Harry Potter book series celebrated its 20th anniversary. To celebrate, Time magazine teamed up with psychologists to create a scientific house sorting quiz. In September, they published what they discovered.

The Method:

As many Harry Potter fans know, JK Rowling created an official sorting quiz that can be found on Pottermore. However, Time wanted to create a quiz that was based on real scientific theory. They describe their process in detail here, but I’ll summarize it. To create the quiz, Cambridge psychologists collected data on the personalities of major characters in the Harry Potter series through close reads of the book, and then determined “significant differences in the personality traits of characters from different Houses.” They then assembled questions from real personality inventories (such as the Big Five inventory) and created a survey that matched survey takers to the house that best matched their personality. The quiz used a generalized linear model to measure how closely the survey responses match each of the four houses. The survey makers then recruited hundreds of Harry Potter fans to take the quiz from the perspective of a major character in the book to test its validity. After the survey was completed, it was published on the Time website, and site visitors were invited to take it. The article reports that over 1 million people took the survey, and about 600,000 people opted to provide basic demographic info and allow their survey responses to be studied. About 100,000 of these respondents were from outside the US and not included in the analysis presented in the article. (If you’re interested, you can take the quiz yourself about halfway down this page on the TIME website.)

The Results:

So we have data on what Hogwarts house 500,000 Americans were sorted into based on the scientific house sorting quiz. Let’s take a look at some of the claims the article makes from the data. I’ll add my own interpretation as well.

1) Most Americans are either Ravenclaws or Hufflepuffs

    Screenshot (13)

Here we have a simple two-variable bar graph. The four houses appear on the x-axis, and each bar rises to represent the percentages of respondents who got that result along the y-axis. Ravenclaw was the most popular result, with just over 45% of respondents.

Article’s Interpretation: The article called these results “not surprising”, since they say that Gryffindor and Slytherin contain characters with the most extreme personality traits in the book. The Ravenclaw and Hufflepuff characters and traits are more moderate. Since a typical person wouldn’t have an extreme personality, it makes sense that most people would be Ravenclaws or Hufflepuffs.

My Interpretation: I also would call these results not surprising, but for a very different reason: sampling bias. The survey was posted and advertised on Time magazine’s website. Time is one of the top news and current event sites in America. The kind of people who frequently visit this type of site would describe themselves as curious and intelligent, the main two traits of Ravenclaws. Besides bias from the site’s readership, respondents self-selected to both take the quiz and make their results available to study. The kind of people who would volunteer to do that might be described as curious about their Hogwarts house and eager to contribute to the website’s wealth of knowledge. Again, that sounds like something Ravenclaws would be most likely to do. Speaking of opting in to the study, Gryffindors and Slytherins are not very trusting in the books. I think there’s a good chance many people who were sorted into the Gryffindor or Slytherin category decided not to make their data available to the study. So I don’t think this graph accurately represents the way all Americans would be sorted.

 

2) Younger People are more likely to be Slytherin

Screenshot (14)

This graph compares three variables: Hogwarts house, age, and percent of total. Slytherin, the green line, is the only house that significantly varied across age of respondents (as seen in the downward slope of the green line).

Article’s Interpretation: The article interpreted the variance in the way the Slytherins were distributed across the ages as a “function of maturity”. Basically, people tend to be more selfish (a Slytherin trait) when they are younger and get less selfish as they age.

My Interpretation: Honestly, this graph baffled me at first. I talked it over with a friend, and her idea was that younger people have to take on competitive and ambitious traits (aka Slytherin traits) in order to get a job and succeed in life. Older people usually already have a job and a place in the world, so they don’t have to be as ambitious or competitive. I thought that was an interesting idea, but couldn’t find any data to back it up. I also wondered if the graph accounted for Slytherin having the fewest people sorted into it, and if the graph above would look different if the sample included a more equal number of respondents from each house.

3) Regional Distribution:

Screenshot (15)
Sorry, you can’t actually tap a state to see how it matches in this blog post. If you really want to do that you can go to the original article.

The color of each state in the image above shows the most common house for people living in each state. A striped state indicates that two houses were equally most common in that state.   

Article’s Interpretation: The article didn’t give much interpretation. It simply pointed out that the Midwest and Mountain West tended to be Hufflepuff, the Southwest was mostly Gryffindor, the South was Slytherin, and New England was mostly Ravenclaw.

My Interpretation: Again, I wondered how this image accounted for there being more Ravenclaws and Hufflepuffs than Gryffindors and Slytherins in the data. The results do seem to match up logically with the locations, though. After all, many of the most educated states in America are in New England, which was mostly Ravenclaw. And we all know how friendly and hard-working Utah considers itself to be (we have a beehive on our flag to represent hard work and industry), so it makes sense that we are home to mostly Hufflepuffs.

The article had some other fascinating graphs and claims that I don’t have time to go into. If you’re interested, you can check them out through the link at the beginning of this blog post.

Why should we care about all this?

Hogwarts and Harry Potter may be pretend, but the franchise’s status as a cultural icon is very real. For those who grew up with the Harry Potter craze, the Hogwarts house you identify with can define you and explain your personality to others. For example, many millennials list their house on their social media or dating profiles as a way to succinctly explain their personality and values (and show their devotion to Harry Potter). Before writing this blog post I discussed the article with a friend and we talked about how people in our generation can make snap decisions on a person’s character based on their Hogwarts house. We agreed that because many people care about Hogwarts houses as an indicator of personality and character, it’s good to know if houses have statistical trends that justify or explain the judgments we make based on them.