Categories
Capture Data

Mining Twitter and Covid-19 datasets

The only topic these days: the coronavirus, Covid-19, the pandemic, SARS, the crisis, disease, the enemy, the survival… We all are under the same global situation and we all are concerned by the many impacts and consequences that this event is having and will have in our lives.

This pandemic can be approached from infinite perspectives, and we think that digital humanities can also contribute. We are especially interested in the digital narratives that inform the outbreak. Which are the outbreaks narratives? Certainly, they are not unique or monolithic.

Social distancing brings to the frontline social media, some of which are open for mining and retrieving what people are saying. The most clear example is Twitter that has an API to recover tweets, containing texts and social interactions. Many scholars and projects are already mining data about Covid-19 and providing tweet datasets to be downloaded and explored. Here you have a list of these datasets:

  • Covid-19 Twitter chatter dataset for scientific use” (Panacea Lab) is an online Data set, stored in GitHub and distributed under a DOI with Zenodo (the number version is updated almost every week). They gather data since January 27th and they capture all languages, but -as they explain- the higher prevalence are:  English, Spanish, and French. They deliver the datasets in two different forms: one dataset contains all mentions and retweets, while the other is a clean version containing only the tweets. They also perform NLP tasks and provide the top 1,000 frequent words and top concurrencies. They complement their dataset by building general statistics. This corpus -as it is established by Twitter Terms of service– consists of a list of tweets identifiers, which need to be hydrated. Check also their e-prints posted on arXiv “A large-scale COVID-19 Twitter chatter dataset for open scientific research — an international collaboration“.
  • COVID-19-TweetIDs (E.Chen, K. Lerman, and E. Ferrara) is another ongoing collection of tweets associated with the pandemic. They commenced gathering data on January 28th. In their particular case, besides harvesting hashtags, they use Twitter’s streaming API to track specified user accounts and specific keywords. They have structured their GitHub repository by month, day, and hour. Each month folder contains a .txt file per day and hour. These .txt files also consist of the Tweet IDs and thus need to be hydrated. Check also their e-prints posted on arXiv “COVID-19: The First Public Coronavirus Twitter Dataset.”
  • Coronavirus Tweet Ids” (D. Kerchner, L. Wrubel) contains the tweet ids of 155,206,805 tweets related to Coronavirus. Their starting date was between March 3, 2020 and they keep releasing a new version every two weeks approximately. To build the collections they use Social Feed Manager.
  • Corona Virus (COVID-19) Tweets Dataset” (R. Lamsal) provides a CSV dataset with the tweet ids. This initiative monitors the real-time Twitter feed by tracking only English “en”, and the words “corona”, “covid”, “covid-19”, “coronavirus” and the variants of “sars-cov-2”. Simultaneously, they have sentiment.live, a site that visualizes sentiment analysis of the Twitter feed.

There are many other catalogs, projects, and repositories that gather Twitter collections. We recommend also to have a look here:

and to check the stonishing Covid-19 Dashboard to track the number of cases worldwide.

In these momentum of data, our project Digital Narratives of Covid-19 would like to create as well a Twitter dataset conceived under this criteria :

  • By language: English, Spanish
  • By region: South Florida, Miami
  • By date: January 27th –
  • By hashtags (covid, covid-19, coronavirus, etc.)

We are fairly new in these techniques so bear with us while we post tutorials on how we are doing, and join us!!