anci Temporal Topic Plotter

Quantitative data visualisation of term usage in the geoengineering mailing list

This tool is a result of the anci hackathon, which took place on 10–11.9 at the Potsdam University of Applied Sciences. It visualises the cumulative frequency of the nouns that members used on the public geoengineering mailing list between 26.10.2017 and 9.9.2019.

Intention

The visualisation is one graphical result of the quantitative analysis of the dataset. It should help to spot patterns in how certain topics are discussed over time. We use the frequency of nouns as an indicator of trends.

Data collection

We exported the 1823 emails from Apple Mail in the mbox format and converted them to a JSON file with the open-source script mbox-to-json by Chandler McWilliams.
In the next step, we wrote a Ruby script to extract the relevant mail content: the sender’s address, the date, the subject and the text. The goal was to keep only the newly written text and to exclude old quoted emails, signatures and unsubscribe information. This was difficult, and the result is most likely neither exhaustive nor free of errors, for two reasons. First, most emails were written in HTML, which every mail program formats differently and often not properly. For example, signatures do not have a specific class, and quotes are not always wrapped in blockquote tags. Second, people compose mails differently: some quote inline, others write above or below the previous text.
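The Ruby script itself is not reproduced here. As a rough illustration of the kind of heuristics such a cleanup relies on, here is a minimal Python sketch. It assumes the body has already been converted to plain text. The function name `extract_new_text`, the reply-header pattern and the footer markers are all assumptions made for this example, not the rules the original script used.

```python
import re

# Assumed markers; real mails vary widely, so these patterns are illustrative only.
REPLY_HEADER = re.compile(r"^On .+ wrote:$")
FOOTER_MARKERS = ("You received this message because", "To unsubscribe")


def extract_new_text(body: str) -> str:
    """Keep the lines of a plain-text mail body that look newly written."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        # A signature delimiter ("-- ") or a list footer ends the useful text.
        if stripped == "--" or stripped.startswith(FOOTER_MARKERS):
            break
        # Skip quoted lines and the "On ... wrote:" line that introduces them.
        if stripped.startswith(">") or REPLY_HEADER.match(stripped):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

A sketch like this handles inline and bottom quoting as long as quoted lines are prefixed with `>`. It fails in exactly the cases described above, for example when an HTML mail quotes the previous message without any marker.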
During this step, we also extracted all links and images from the mails. We exported these two lists and the list of mails as CSV files.
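Collecting links and images from an HTML body can be sketched with Python's standard `html.parser` module. This is only an illustration under the assumption that links appear as `a` elements with an `href` and images as `img` elements with a `src`. The class `LinkImageCollector` and the function `collect_links_and_images` are hypothetical names, not part of the original Ruby script.

```python
from html.parser import HTMLParser


class LinkImageCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def collect_links_and_images(html: str):
    """Return (links, images) found in one HTML mail body."""
    collector = LinkImageCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.images
```

Anchors without an `href` are ignored, so only real references end up in the link list.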
In the next step, we used the open-source script engtagger by Yoichiro Hasebe to extract the nouns. This, again, is a source of error, as the list of extracted nouns already shows some false results. The script is also unable to treat »technology« and »technologies« as the same word. This is addressed later in the interface. In this step, we also created a list of the most used words and a list of word frequencies by month.
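The two frequency lists can be derived in a single pass over the mails. The following Python sketch assumes that the nouns have already been extracted and that each mail is available as a pair of sending date and noun list. The function name `count_by_month` and this input shape are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def count_by_month(mails):
    """Return (overall counts, per-month counts) for (date, nouns) pairs."""
    total = Counter()
    by_month = defaultdict(Counter)
    for sent_on, nouns in mails:
        month = sent_on.strftime("%Y-%m")  # e.g. "2017-10"
        for noun in nouns:
            word = noun.lower()  # "Climate" and "climate" count as one word
            total[word] += 1
            by_month[month][word] += 1
    return total, by_month


# Made-up sample data, only to show the call.
sample = [
    (date(2017, 10, 26), ["Climate", "carbon"]),
    (date(2017, 11, 2), ["climate"]),
]
total, by_month = count_by_month(sample)
```

`total.most_common(50)` then yields the most used words. Lowercasing merges capitalisation variants but, like the tagger, does not merge singular and plural forms.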

Data analysis

By manually analysing the list of authors, we found that the dataset is biased: of the 1822 mails in total, 727 (about 40%) were written by a single author.
The analysis of the links gives a good overview of the referenced material. The most frequently linked websites include tandfonline.com, nature.com, scholar.google.com, wiley.com and springer.com.
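A ranking of linked websites can be produced by reducing every link to its host name and counting. The sketch below uses Python's standard library. The function name `count_domains` and the choice to drop a leading `www.` are assumptions for this example, not a description of how the analysis was actually done.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(links):
    """Count how often each host name occurs in a list of URLs."""
    counts = Counter()
    for link in links:
        host = urlparse(link).hostname
        if host is None:  # not an absolute URL, e.g. "nature.com/x"
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts
```

Subdomains such as scholar.google.com stay separate from their parent domain, which matches how the websites are listed above.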

Data visualisation

In the next step, we built a microsite with Vue (Nuxt) and D3. In the interactive visualisation, the user can select one or more nouns, which are then plotted over time. The line chart shows the cumulative frequency of each word. The user can also combine multiple nouns into one line so that »technologies« and »technology« are counted as one. Besides the visualisation and the input field, the interface holds buttons with suggestions and the 50 most used nouns.
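The data behind one line of the chart is a running total of monthly counts, summed over all nouns combined into that line. The microsite itself is written with Vue and D3. The Python sketch below only illustrates the calculation. It assumes monthly counts keyed by `YYYY-MM` strings and an input syntax in which commas separate lines and `+` combines nouns. The function names `parse_query` and `cumulative_series` are hypothetical.

```python
from itertools import accumulate


def parse_query(query):
    """Split 'a + b, c' into [['a', 'b'], ['c']] (assumed input syntax)."""
    groups = []
    for part in query.split(","):
        terms = [term.strip().lower() for term in part.split("+") if term.strip()]
        if terms:
            groups.append(terms)
    return groups


def cumulative_series(monthly_counts, terms):
    """Return (month, running total) pairs for the combined terms."""
    months = sorted(monthly_counts)  # "YYYY-MM" strings sort chronologically
    per_month = [
        sum(monthly_counts[month].get(term, 0) for term in terms)
        for month in months
    ]
    return list(zip(months, accumulate(per_month)))
```

Each group returned by `parse_query` would be passed to `cumulative_series` to obtain one line. Months in which none of the terms occur keep the previous total, so the line stays flat instead of dropping.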

Try

CDR, SRM
Sea, Earth
Funding
Russia, USA, Europe
Technologies + Technology, Models + model
technology + technologies, nature + earth + planet + sea

Search terms

Most used words

climate
research
carbon
change
emissions
governance
srm
engineering
warming
world
technologies
energy
risks
ocean
technology
policy
university
dioxide
time
https
atmosphere
science
temperature
effects
radiation
water
%
impacts
people
ice
cdr
removal
years
scientists
report
paper
system
aerosol
air
management
earth
deployment
way
sea
year
project
gas
mitigation
use
model
greenhouse
countries
work
effect
development
scale
it’s
one
assessment
cost
surface
capture
andrew
level
risk
environment
models
future
cloud
aerosols
process
part
paris
changes
study
group
idea
temperatures
information
example
agreement
injection
geoengineering
al
ipcc
i
problem
stratosphere
issues
space
fossil
response
costs
harvard
need
point
efforts
approach
land
planet
arctic