We are happy to finally launch the interface to download a collection of tweets related to the Covid-19 pandemic. You can choose a date range, an area (Mexico, Argentina, Colombia, Perú, Ecuador, Spain, the Miami area), and a language (English or Spanish, a choice available only for the Miami area).
The texts are processed by removing accents, punctuation, and mentions of users (@users) to protect privacy, and by replacing all links with “URL.” Emojis are transliterated into a UTF-8 charset and transformed into emoji labels. We also decided to unify all different spellings of Covid-19 under a unique form; all other features, including hashtags, are always preserved.
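As an illustration, here is a minimal Python sketch of this kind of normalization. It is an approximation of the steps described above, not our production pipeline; the emoji step uses the third-party emoji package as a stand-in for our transliteration:

```python
import re
import unicodedata

import emoji  # third-party package; stands in for our emoji transliteration step


def normalize_tweet(text: str) -> str:
    """Approximate the cleaning steps described above."""
    # Replace links with the placeholder "URL"
    text = re.sub(r"https?://\S+", "URL", text)
    # Remove user mentions to protect privacy
    text = re.sub(r"@\w+", "", text)
    # Turn emojis into textual labels, e.g. 😷 -> :face_with_medical_mask:
    text = emoji.demojize(text)
    # Strip accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Unify the different spellings of Covid-19 under a unique form
    text = re.sub(r"covid[\s_-]?19", "covid19", text, flags=re.IGNORECASE)
    # Drop remaining punctuation, keeping hashtags and emoji labels intact
    text = re.sub(r"[^\w\s#:]", "", text)
    return text
```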
But there’s more! We have implemented a simple API so you can select your collection with no need to access the interface.
The API entry point is here: https://covid.dh.miami.edu/get/ and it serves to deliver the .txt files that you want.
Queries take three main parameters: language, geolocation, and date. The query string always starts with a “?”, parameters are separated by “&”, and they are abbreviated as follows:
lang = es or en
geo = fl, ar, es, co, pe, ec, mx, all
date = year-month-day, month-year-month (a whole month), all, or a range from-year-month-day-to-year-month-day
Here are some examples:
Tweets in English, from Florida, on April 24th: https://covid.dh.miami.edu/get/?lang=en&geo=fl&date=2020-04-24
Tweets in Spanish, from Florida, on April 24th: https://covid.dh.miami.edu/get/?lang=es&geo=fl&date=2020-04-24
Tweets in Spanish, from Colombia, on May 17th: https://covid.dh.miami.edu/get/?lang=es&geo=co&date=2020-05-17
All tweets in Spanish from Florida: https://covid.dh.miami.edu/get/?lang=es&geo=fl&date=all
Tweets from Argentina from April 24th to 28th: https://covid.dh.miami.edu/get/?lang=es&geo=ar&date=from-2020-04-24-to-2020-04-28
All tweets from Spain during April: https://covid.dh.miami.edu/get/?lang=es&geo=es&date=month-2020-04
Please, have fun! 😉
Remember: if the file has not already been generated in the database, it will take a few minutes to generate.
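If you want to script the download, here is a minimal sketch with Python’s requests library (the function name and the output file name are our own hypothetical choices, not part of the API):

```python
import requests

BASE_URL = "https://covid.dh.miami.edu/get/"


def fetch_dataset(lang: str, geo: str, date: str, out_path: str) -> None:
    """Download one dataset .txt file through the API described above."""
    params = {"lang": lang, "geo": geo, "date": date}
    # Generous timeout: a file that is not yet in the database
    # can take a few minutes to be generated.
    response = requests.get(BASE_URL, params=params, timeout=600)
    response.raise_for_status()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response.text)


# Tweets in Spanish, from Florida, on April 24th:
fetch_dataset("es", "fl", "2020-04-24", "dhcovid_2020-04-24_es_fl.txt")
```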
At the end of April we started to familiarize ourselves with the Twitter API and to ask how we could capture the public conversations happening on this social network.
We quickly understood we needed to focus on a plan and method for organizing our corpus, accomplishing our objectives, and dividing the different tasks among our team members.
Datasets in English are very numerous (see the post “Mining Twitter and Covid-19 datasets” from April 23rd, 2020). In order to start with a more defined corpus, we decided to focus on Spanish datasets, both in general and per area. We also wanted to give special treatment to the South Florida area and approach it from a bilingual perspective, due to its linguistic diversity, especially in English and Spanish. With this in mind, part of the team analyzes public conversations in English and Spanish, focusing on the area of South Florida and Miami, while the CONICET team is in charge of exploring data in Spanish, namely from Argentina.
To enlarge our dataset, we have also decided to harvest all tweets in Spanish and to create specific datasets for other parts of Latin America (Mexico, Colombia, Perú, Ecuador) and Spain. To keep our corpus organized, we built a relational database that collects all the information connected to these tweets and automatically ingests hundreds of thousands of tweets a day.
We have different queries running, which correspond to the datasets in our ‘twitter-corpus‘ folder on GitHub. In short, there are three main types of queries (a sketch of such a query follows this list):
A general query for Spanish, harvesting all tweets that contain these hashtags and keywords: covid, coronavirus, pandemia, cuarentena, confinamiento, quedateencasa, desescalada, distanciamiento social.
A specific query for English in Miami and South Florida. The hashtags and keywords harvested are: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing.
Specific queries with the same keywords and hashtags for Spanish in Argentina, Mexico, Colombia, Perú, Ecuador, and Spain, using tweet geolocation when possible and/or user profile information.
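To give an idea of what such a query looks like in practice, here is a minimal sketch using the tweepy 3.x streaming API with the Spanish keyword list above. The credentials are placeholders and the listener is simplified; this is not our production harvester:

```python
import tweepy

# Placeholder credentials; replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

KEYWORDS = [
    "covid", "coronavirus", "pandemia", "cuarentena",
    "confinamiento", "quedateencasa", "desescalada",
    "distanciamiento social",
]


class CovidListener(tweepy.StreamListener):
    def on_status(self, status):
        # Skip retweets; the corpus keeps only original tweets.
        if not hasattr(status, "retweeted_status"):
            print(status.id_str)  # in practice: insert into the relational database


stream = tweepy.Stream(auth=auth, listener=CovidListener())
# 'languages' restricts the stream to Spanish; filtering by country
# happens downstream, from tweet geolocation and user metadata.
stream.filter(track=KEYWORDS, languages=["es"])
```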
Folders are organized by day (YEAR-MONTH-DAY). In every folder there are 9 different plain-text files named with “dhcovid”, followed by the date (YEAR-MONTH-DAY), language (“en” for English, “es” for Spanish), and region abbreviation (“fl”, “ar”, “mx”, “co”, “pe”, “ec”, “es”); a sketch that enumerates these names appears after the list:
dhcovid_YEAR-MONTH-DAY_es_fl.txt: Dataset containing tweets in Spanish geolocated in South Florida. The geolocation is tracked by tweet coordinates, by place, or by user information.
dhcovid_YEAR-MONTH-DAY_en_fl.txt: This file contains only tweets in English that refer to the area of Miami and South Florida. The reason behind this choice is that there are multiple projects harvesting English data, and our project is particularly interested in this area because of our home institution (University of Miami) and because we aim to study public conversations from a bilingual (EN/ES) point of view.
dhcovid_YEAR-MONTH-DAY_es_ar.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Argentina.
dhcovid_YEAR-MONTH-DAY_es_mx.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Mexico.
dhcovid_YEAR-MONTH-DAY_es_co.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Colombia.
dhcovid_YEAR-MONTH-DAY_es_pe.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Perú.
dhcovid_YEAR-MONTH-DAY_es_ec.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Ecuador.
dhcovid_YEAR-MONTH-DAY_es_es.txt: Dataset containing tweets geolocated (by georeferences, by place, or by user) in Spain.
dhcovid_YEAR-MONTH-DAY_es.txt: This dataset contains all tweets in Spanish, regardless of their geolocation.
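For scripted access, here is a small sketch that builds the nine expected file names for a given day. The helper itself is hypothetical; the naming scheme is the one described above:

```python
from datetime import date

# Regions with a Spanish dataset, as listed above
REGIONS = ["fl", "ar", "mx", "co", "pe", "ec", "es"]


def filenames_for(day: date) -> list[str]:
    """Build the nine expected file names for one day's folder."""
    stamp = day.isoformat()  # YEAR-MONTH-DAY
    names = [f"dhcovid_{stamp}_es_{region}.txt" for region in REGIONS]
    names.append(f"dhcovid_{stamp}_en_fl.txt")  # English, South Florida
    names.append(f"dhcovid_{stamp}_es.txt")     # all Spanish, no geolocation
    return names


print(filenames_for(date(2020, 4, 24)))
```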
As of today, May 23rd, we have a total of:
Spanish from South Florida (es_fl): 6,440 tweets
English from South Florida (en_fl): 22,618 tweets
Spanish from Argentina (es_ar): 64,398 tweets
Spanish from Mexico (es_mx): 402,804 tweets
Spanish from Colombia (es_co): 164,613 tweets
Spanish from Peru (es_pe): 55,008 tweets
Spanish from Ecuador (es_ec): 49,374 tweets
Spanish from Spain (es_es): 188,503 tweets
Spanish (es): 2,311,482 tweets
We do not include retweets, only original tweets.
The corpus consists of lists of tweet IDs. To obtain the original tweets, you can use a tool such as the Twitter “Hydrator,” which takes the IDs and downloads all their metadata for you into a CSV file.
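If you prefer to script the hydration, the twarc library (v1) offers the same functionality. A sketch, assuming you have your own Twitter API credentials and that ids.txt stands for one of our dataset files:

```python
import csv

from twarc import Twarc  # twarc v1; pip install twarc

# Placeholder credentials; replace with your own Twitter API keys.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

with open("ids.txt") as ids, open("hydrated.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "created_at", "full_text"])
    # Twarc.hydrate takes an iterable of tweet IDs and yields full tweet objects.
    for tweet in t.hydrate(line.strip() for line in ids):
        writer.writerow([tweet["id_str"], tweet["created_at"], tweet["full_text"]])
```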
We started collecting our dataset on April 24th, 2020. For prior dates (January to April 24th), we hope to use the PanaceaLab dataset, since it is one of the few that collects data in all languages; we expect to achieve this in the next couple of months.
The only topic these days: the coronavirus, Covid-19, the pandemic, SARS, the crisis, the disease, the enemy, survival… We are all in the same global situation, and we are all concerned by the many impacts and consequences this event is having, and will have, on our lives.
This pandemic can be approached from infinite perspectives, and we think that digital humanities can also contribute. We are especially interested in the digital narratives that inform the outbreak. What are the outbreak narratives? Certainly, they are neither unique nor monolithic.
Social distancing brings social media to the frontline, some of which are open for mining and retrieving what people are saying. The clearest example is Twitter, which has an API to retrieve tweets, containing both texts and social interactions. Many scholars and projects are already mining data about Covid-19 and providing tweet datasets to be downloaded and explored. Here is a list of these datasets:
“Covid-19 Twitter chatter dataset for scientific use” (Panacea Lab) is an online dataset, stored on GitHub and distributed under a DOI with Zenodo (the version number is updated almost every week). They have gathered data since January 27th and capture all languages, but, as they explain, the most prevalent are English, Spanish, and French. They deliver the datasets in two different forms: one dataset contains all mentions and retweets, while the other is a clean version containing only the tweets. They also perform NLP tasks and provide the top 1,000 frequent words and top co-occurrences, and they complement their dataset with general statistics. This corpus, as required by Twitter’s Terms of Service, consists of a list of tweet identifiers, which need to be hydrated. Check also their e-print posted on arXiv, “A large-scale COVID-19 Twitter chatter dataset for open scientific research — an international collaboration.”
COVID-19-TweetIDs (E. Chen, K. Lerman, and E. Ferrara) is another ongoing collection of tweets associated with the pandemic. They began gathering data on January 28th. In their case, besides harvesting hashtags, they use Twitter’s streaming API to track specified user accounts and specific keywords. They have structured their GitHub repository by month, day, and hour: each month folder contains one .txt file per day and hour. These .txt files also consist of tweet IDs and thus need to be hydrated. Check also their e-print posted on arXiv, “COVID-19: The First Public Coronavirus Twitter Dataset.”
“Coronavirus Tweet Ids” (D. Kerchner, L. Wrubel) contains the tweet IDs of 155,206,805 tweets related to the coronavirus. Their starting date was March 3, 2020, and they release a new version approximately every two weeks. To build the collections they use Social Feed Manager.
“Corona Virus (COVID-19) Tweets Dataset” (R. Lamsal) provides a CSV dataset with the tweet IDs. This initiative monitors the real-time Twitter feed, tracking only English (“en”) tweets containing the words “corona”, “covid”, “covid-19”, “coronavirus”, and variants of “sars-cov-2”. They also run sentiment.live, a site that visualizes sentiment analysis of the Twitter feed.
There are many other catalogs, projects, and repositories that gather Twitter collections. We also recommend having a look here: