
Investigating Happiness

E. Finkenauer, J. Horne and M. Visintini | April 2022

Happiness data has historically been measured through self-reporting. The most comprehensive data source for self-reported happiness is the World Happiness Report (WHR), an annual report on human wellbeing around the world. The WHR draws most of its statistics from a Gallup poll of 1,000 people in each country. It publishes annual country rankings, and has popularised the idea of ‘Gross National Happiness’.

We wondered: do people tell the truth about happiness when they are being surveyed?

To find out, we compared happiness survey results with social expression data from Twitter.

Positive and Negative Affect

In its 2021 edition, the World Happiness Report states that positive emotions are almost three times more frequent than negative emotions. Does this correlate with how people speak about happiness? Language research also shows that most languages, including English, have more positive words than negative words. What does Twitter data show?

Starting from a dataset of 10,000 commonly used words by Hedonometer, we processed them down to 268 words denoting positive and negative affect, using semantic tagging. Twitter data shows that during the 2019-2021 period ‘Love’ is the single most used word from our list. ‘Hate’ is the most used negative word. Together, these words account for over 50% of usage in the 10 most used words.

WHR 2022 states that "based on the annual data from the Gallup World Poll, there was no overall change in positive affect, but there was a roughly 10% increase in the number of people who said they were worried or sad the previous day". However, our frequency analysis of Twitter communication shows that words relating to positive affect are consistently used more than words relating to negative affect. Positive expressions always outweigh negative ones, and the ratio increased over time. The ratio of positive to negative expressions peaks around holiday times, with increases in all positive words, so the peaks cannot be attributed solely to the use of ‘happy’ as a congratulatory word (e.g., ‘happy Christmas’).

Single Word Usage

If we look at absolute word usage in the scatter plot to the right, at first glance it might seem that positive words (orange) are used more than negative words (blue). However, outliers such as "love" and "lol" skew the distribution of positive words. Removing them shows a more balanced distribution.

Comparing the WHR data on positive/negative affect with our keyword frequency analysis reveals differences. In WHR negative affect, worry and sadness have shown statistically significant increases for the global sample of countries. Our word frequency analysis shows a positive trendline for 'worry', and a negative one for 'sadness'.

 

In WHR positive affect, 'laughter' and 'enjoyment' were mostly unchanged. In our word frequency analysis, there’s a positive trendline for both ‘enjoy’ and ‘laugh’.
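As a rough illustration of what a "positive" or "negative" trendline means here, the sketch below fits an ordinary least-squares line to a series of frequencies and reports its slope. The function name and the sample values are hypothetical, and the source does not state which fitting method was actually used, so treat this as one plausible reading rather than the method behind the charts.

```python
def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...).

    A slope above zero corresponds to a positive trendline,
    a slope below zero to a negative one.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trendline")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Made-up monthly frequencies, for illustration only.
rising = [1.0, 2.0, 3.0, 4.0]
falling = [3.0, 2.0, 1.0]
print(trend_slope(rising) > 0)   # rising series: positive trendline
print(trend_slope(falling) < 0)  # falling series: negative trendline
```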

Happiness and Society

Key events in 2019-2021 in the Western world included the Covid-19 pandemic, the death of George Floyd, and Brexit. Looking month by month at the words with the biggest change in usage over the three-year period, these events seem to be reflected in what people were Tweeting. The text color shows whether a word is positive (orange) or negative (blue), while the arrow color shows whether there was an increase in usage (green) or a decrease in usage (red).

Data & Tools

01

World Happiness Report

The World Happiness Report is a yearly report of world happiness based on global surveys and metrics (more information here). Our research draws on two scores measured by the WHR: positive affect and negative affect. These scores are calculated as the average of the responses to six questions:

  • Positive affect: Did you smile/laugh/experience enjoyment yesterday?

  • Negative affect: Were you worried/sad/angry yesterday?
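A minimal sketch of how such averages could be computed, assuming each question is answered yes (1) or no (0) and each score is the mean across questions and respondents. The item names, the function, and the sample answers are hypothetical and are not taken from the WHR's own methodology.

```python
from statistics import mean

# Hypothetical item names mirroring the six questions listed above.
POSITIVE_ITEMS = ("smile", "laugh", "enjoyment")
NEGATIVE_ITEMS = ("worried", "sad", "angry")


def affect_scores(responses):
    """Return (positive_affect, negative_affect) for a list of respondents.

    Each respondent is a dict mapping an item name to 1 (yes) or 0 (no).
    """
    def score(items):
        # Average each respondent's answers, then average across respondents.
        return mean(mean(r[item] for item in items) for r in responses)

    return score(POSITIVE_ITEMS), score(NEGATIVE_ITEMS)


# Made-up answers from two respondents, for illustration only.
sample = [
    {"smile": 1, "laugh": 1, "enjoyment": 1, "worried": 0, "sad": 0, "angry": 0},
    {"smile": 1, "laugh": 0, "enjoyment": 1, "worried": 1, "sad": 0, "angry": 0},
]
positive, negative = affect_scores(sample)
print(round(positive, 3), round(negative, 3))
```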

02

Hedonometer.org

Hedonometer.org measures real-time happiness based on Twitter data, using a dataset of about 10,000 commonly used words and a random 10% sample of global Tweets. Each of the 10,000 words is assigned a happiness score measured on a scale from 1 to 9. Our research focuses on the English language. We modified the scale to run from -4 to +4, for visualization purposes.
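Moving a 1 to 9 scale onto a -4 to +4 scale amounts to subtracting the midpoint, 5, from every score. The sketch below shows that shift; the function name and the placeholder words and scores are illustrative and are not values from the Hedonometer dataset.

```python
SCALE_MIDPOINT = 5  # midpoint of the original 1-9 scale


def recenter(score):
    """Shift a 1-9 happiness score onto the -4 to +4 scale."""
    if not 1 <= score <= 9:
        raise ValueError(f"score {score} is outside the 1-9 scale")
    return score - SCALE_MIDPOINT


# Placeholder words and scores, for illustration only.
scores = {"word_a": 8.5, "word_b": 5.0, "word_c": 2.0}
recentered = {word: recenter(value) for word, value in scores.items()}
print(recentered)
```

After the shift, a neutral word sits at 0, so the sign of the score alone says whether a word leans positive or negative.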

03

UCREL Semantic Analysis System

Lancaster University's semantic tagger is a framework for semantic text analysis. The tagset is available here. We used the semantic tagger to tag the Hedonometer dataset, and then we focused on the tags of our positive and negative affect keywords. You can see our selected tags here.

04

Storywrangler

Storywrangler, which powers the Hedonometer, offers an API to extract the frequency of specific words from a random 10% sample of global Tweets. We queried frequencies for all the words identified from the Hedonometer dataset, filtered by WHR keywords and Semantic Tag Main Category (see table above for numbers, and section below for words).

05

Final Dataset

Before finalizing the word list, we manually reviewed the 268 words identified through the processing described above and removed ambiguous words. The final word list featured 70 positive affect words and 98 negative affect words, for a total of 168 words. The words are represented in the word clouds below. Word size represents the happiness/sadness score (the bigger the word, the higher the happiness score or the lower the sadness score). View the final word set here.

Effects of the Covid-19 Pandemic

The Covid-19 pandemic was undoubtedly the most impactful event of the three-year period under analysis.

 

As part of our background questions, we started comparing data across years to identify potential seasonal patterns. While no clear seasonal pattern was detected, we were able to make some interesting observations about the effect of the pandemic on happiness data.

The charts below show the ratio of positive to negative expressions. The radial field (line) length represents the ratio of positive to negative expressions for each word in our dataset, measured daily. The colour spectrum (blue – orange) is based on this value, across the three-year period of our dataset, so years can be compared to each other. In 2019 the ratio of positive to negative word usage varies day by day, whereas 2020 and 2021 see continuous periods with less variation.
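One way to compute a daily ratio of this kind is sketched below, assuming the ratio for a day is the summed frequency of all positive words divided by the summed frequency of all negative words. That aggregation, the function name, and the sample counts are assumptions made for illustration; the source does not spell out the exact calculation.

```python
def daily_ratio(counts, positive_words, negative_words):
    """Map each day to its ratio of positive to negative word usage.

    counts maps a day label to a dict of word -> frequency.
    Days with no negative usage get None rather than a division error.
    """
    ratios = {}
    for day, word_counts in counts.items():
        pos = sum(word_counts.get(w, 0) for w in positive_words)
        neg = sum(word_counts.get(w, 0) for w in negative_words)
        ratios[day] = pos / neg if neg else None
    return ratios


# Made-up frequencies, for illustration only.
sample_counts = {
    "2020-03-23": {"love": 300, "happy": 100, "hate": 150, "sad": 50},
    "2020-03-24": {"love": 250, "happy": 50, "hate": 200, "sad": 100},
}
ratios = daily_ratio(sample_counts, {"love", "happy"}, {"hate", "sad"})
print(ratios)
```

A value above 1 means positive words were used more than negative words on that day.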

The consistently positive ratio in the second half of 2021 is striking, as is the comparatively negative ratio from May 2020 to the latter part of that year. Interestingly, the beginning of the first lockdown in the US and UK (late March 2020) shows blips of positive days amongst more negative days. The most obviously negative period coincides with the death of George Floyd on 25 May 2020. There is also a low point from the last few days of December 2020 through early January 2021.

The centre value of the colour spectrum is calculated excluding 2 January and 16 February, as these dates represent significant spikes in our data, attributed to the frequency of ‘happy’ greetings marking New Year and Valentine’s Day. If these dates were not excluded, almost all datapoints would fall below the centre value, and would therefore be coloured blue. The segment of white space was added to make the start and end of the time series easy to find.
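The effect of excluding the spike dates can be sketched as follows, assuming the centre value is the midpoint between the smallest and largest remaining ratios. The source does not say how the centre is defined, so that definition, the function name, and the sample ratios are all assumptions.

```python
from datetime import date

# (month, day) pairs left out of the centre calculation, as described above.
EXCLUDED_DAYS = {(1, 2), (2, 16)}


def centre_value(ratios, excluded_days):
    """Midpoint of the ratio range, ignoring the excluded calendar days."""
    kept = [
        value
        for day, value in ratios.items()
        if (day.month, day.day) not in excluded_days
    ]
    if not kept:
        raise ValueError("no datapoints left after exclusion")
    return (min(kept) + max(kept)) / 2


# Made-up daily ratios, for illustration only.
sample_ratios = {
    date(2020, 1, 1): 2.0,
    date(2020, 1, 2): 9.0,   # spike day
    date(2020, 1, 3): 3.0,
    date(2020, 2, 16): 8.0,  # spike day
}
adjusted = centre_value(sample_ratios, EXCLUDED_DAYS)
unadjusted = centre_value(sample_ratios, set())
print(adjusted, unadjusted)
```

In this toy data the unadjusted centre sits above every ordinary day, which is the situation the exclusion is meant to avoid.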
