Categories
Content Analysis Data Recognition Interpretation

Reflections on quantified data: #ScholarStrike in the context of COVID-19

Although the COVID-19 pandemic created a truly shared global context for the first time in years, it soon began to coexist with the local reality of each country. Twitter, as expected, was no stranger to this, and certain hashtags soon began to appear that account for this “localization” process of the pandemic (for example, in Argentina, #coronacrisis, in reference to the financial collapse as a result of a long lockdown and a weak economy inherited from the previous government). However, other hashtags less representative of the public health situation soon began to become resignified, and even to emerge, within this context. For the United States, this was the case for #BlackLivesMatter and #ScholarStrike.

In this post we seek to look into the particularities of the latter, following the analysis that we proposed in our previous post (“What can academic journals tell us about COVID-19 and Education?”), that is, to use quantitative analysis platforms (in the previous post we used AVOBMAT) developed by third parties to perform a text mining analysis, while evaluating the functionalities and limitations of the tool. The case of #ScholarStrike seemed ideal to analyze with a “tailor-made” tool, since it is a hashtag that had a strong presence for a limited time (prior to the initiative, during it and a few days after).

For those unaware of the news from the U.S., Scholar Strike was an action and teach-in at universities that sought to recognize and raise awareness of the increasing number of deaths of African Americans and other minorities due to the excessive use of violence and force by the American police. For two days, September 8 and 9, professors, university staff, students and even administrators walked away from their regular duties and classes to participate in classes (in some cases open) on racial injustice, police surveillance and racism in the United States. Canadian universities held their own Scholar Strike on September 9 and 10. On the official Scholar Strike site it is possible to find more information on the actions, as well as on their YouTube channel, where different scholars posted examples of teach-ins and other resources. The official site also includes a list of textual and audiovisual resources that could be used in the classes, as well as information on the media coverage of the Scholar Strike. Scholar Strike Canada also created an official website, which includes details of the programmed activities, resources, and links to the organizations that supported the initiative.

 Our goal was to perform a text mining analysis on this hashtag, while also looking for terminological coincidences with others directly related, such as #BlackLivesMatter, and with some more connected to the COVID-19 crisis.

To do this, we used two commercial Twitter text mining platforms: Brand24 and Audiense. Brand24’s official site (https://brand24.com/) describes the platform as a “web and social media monitoring tool with powerful analytics”. The tool looks for keywords provided by the user and analyzes them on different levels. It is mostly oriented towards brand analysis and the use of the data in digital marketing. Audiense (https://audiense.com/), as described on its official page, “provides detailed insights about any audience to drive your social marketing strategy with actionable and enriched real-time data to deliver genuine business results”. It is worth stressing, as is clear from the official descriptions, that both tools have been developed for business use, although they can of course be adapted to any type of research on social media.

Working with these platforms is almost the opposite of what we have been doing in this project. When we interact with our own database, we first filter and curate the data and then analyze it with different tools and methods (term frequency, topic modeling). Here, by contrast, the filters we can give the platform are few (we can choose the social media platform and set the date range). It is the platform itself that produces a series of daily results, which it also interprets automatically in the form of percentages, visualizations and infographics.

We used the 7-day trial versions of Brand24 and Audiense. Broadly speaking, Brand24 is clearly superior to Audiense. We ran the same searches on both, and the first thing we noticed was that Audiense’s results were heavily biased: all the tweets it collected for the #ScholarStrike hashtag were negative, and all came from Trump supporters or the president himself.

Figure 1. Audiense report on #ScholarStrike.

Brand24, on the other hand, returned the data in a more neutral way. As we already described, once the platform finishes performing the search, it automatically sends an email to the project admin, and the user can choose to download a report. The data can be reviewed on the ‘Mentions’ tab, which is meant to let the user work on the data: direct and boolean search, tagging, advanced filtering, deleting irrelevant mentions, and sentiment, which can be either machine assessed or changed manually, like so:

Figure 2. Mentions tab in Brand24.

Now, let’s take a closer look at the narrative that this platform offers us for the search on #ScholarStrike.

We did the first hashtag search on September 13, and Brand24 ran a retrospective search over the previous 30 days (Aug 14, 2020 to Sept 13, 2020). Twenty-four hours after we set up the search, it allowed us to download a report and an infographic. In the first report, we can see that, generally, the sentiment about the strike was positive (44 positive mentions against 21 negative):

Figure 3. Summary of #ScholarStrike mentions on social media from Brand24.

Clearly, since #ScholarStrike was an action that lasted just a couple of days, the mentions only occur in that period, but it is remarkable how they grew on the third day after it started:

Figure 4. Graph of the volume of #ScholarStrike mentions on social media throughout the month of September.

Then, the platform gives us a visualization of the most salient terms of all social media.

Figure 5. Set of most salient terms in social media within the context of #ScholarStrike exchange.

Justifiably, professor and teaching are key terms, since the action occurred in that field. But, as we said at the beginning of the post, the intertwining with the Black Lives Matter movement is visible in terms such as racial, issues, September, police, injustice, black. It is interesting, although expected given its political use, that of the two most popular social network platforms, Facebook and Twitter, it is the latter that stands out. Another notable term is Butler. What is interesting here is that, out of context, Butler could be associated with the philosopher and theorist Judith Butler (widely cited for her thesis on the performativity of gender), who has also intervened actively in the Black Lives Matter movement through her publications in different media outlets and on social media, as shown in these publications: https://opinionator.blogs.nytimes.com/2015/01/12/whats-wrong-with-all-lives-matter/ and https://iai.tv/articles/speaking-the-change-we-seek-judith-butler-performative-self-auid-1580 . However, this term actually refers to Anthea Butler, professor of Religious Studies and Africana Studies at the University of Pennsylvania, who was one of the organizers of the Scholar Strike: https://www.insightintodiversity.com/professors-lead-a-nationwide-scholar-strike-for-racial-justice/

Next, the platform shows us the most active and the most recent users in terms of their activity on Twitter:

Figure 6. Most popular users and recent mentions in Twitter.

It is difficult to know if the tool is measuring the most popular users by number of Tweets or by retweets. From what can be seen below, it seems that the calculation is made from the mentions and these are the ones that weight the degree of influence of a user on Twitter (figs 7 and 8).

However, something that struck us is the user ISASaxonists, a group of medievalists specialized in Anglo-Saxon medieval literature (fig 6).

Figure 7. Most active public profiles on Twitter related to #ScholarStrike.

Figure 8. Most influential public profiles on Twitter.

Lastly, the platform shows the most used hashtags (and related to each other):

Figure 9. Most mentioned hashtags on Twitter, from the #ScholarStrike search.

#ScholarStrike, #BlackLivesMatter and #Covid are expected hashtags. Once again, the interesting thing here is the #medievaltwitter hashtag, in 13th place, which, although the platform does not make it explicit, must be related, for example, to the user ISASaxonists. If this is the case, it would be interesting to consider whether both the #medievaltwitter hashtag and the tweets of the user ISASaxonists are related to the accusations made in 2019 against the International Society of Anglo-Saxonists for its inability to account for issues of racism, sexism, diversity and inclusion within Anglo-Saxon studies. Part of this discussion was published in academic journals in the U.S. during September 2019: https://www.insidehighered.com/news/2019/09/20/anglo-saxon-studies-group-says-it-will-change-its-name-amid-bigger-complaints-about

Overall, exploring the context of #ScholarStrike with the Brand24 platform allowed us to confirm some previous assumptions (its relationship with hashtags such as BLM and Covid). It also brought to light hashtags that a non-academic user would be less likely to expect, such as #medievaltwitter, and others that appeared only faintly at the beginning but gained impact in the following weeks, in the midst of the electoral race, such as #bidenharris2020.

Gimena del Rio/Marisol Fila

Categories
Uncategorized

Access our Twitter Collection

We are happy to finally launch the interface to download a collection of tweets related to the Covid-19 pandemic. You can choose a date range, an area (Mexico, Argentina, Colombia, Perú, Ecuador, Spain, Miami area), and a language (only for the Miami area, in English and Spanish).

https://covid.dh.miami.edu/get/

The texts are processed by removing accents, punctuation, and mentions of users (@users) to protect privacy, and by replacing all links with “URL.” Emojis are transliterated into a UTF-8 charset and transformed into emojilabels. We also decided to unify all the different spellings of Covid-19 under a single form. All other characteristics, including hashtags, are always preserved.
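The cleaning steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's actual pipeline: the function names, the regular expressions, and the list of Covid-19 spellings are hypothetical, and emoji transliteration is left out because the post does not say which tool performs it.

```python
import re
import unicodedata

# Hypothetical patterns; the project's real rules are not given in this post.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumed spellings to unify; words inside hashtags are skipped via (?<!#).
COVID_RE = re.compile(r"(?<!#)\b(?:covid[\s_-]?19|covid)\b", re.IGNORECASE)


def strip_accents(text):
    """Drop combining marks (note: this also turns 'ñ' into 'n')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def strip_punctuation(text):
    """Remove Unicode punctuation, keeping '#' and '_' so hashtags survive."""
    return "".join(
        ch for ch in text
        if ch in "#_" or not unicodedata.category(ch).startswith("P")
    )


def clean_tweet(text):
    text = URL_RE.sub("URL", text)       # replace links with the token URL
    text = MENTION_RE.sub("", text)      # remove @user mentions
    text = COVID_RE.sub("covid19", text)  # unify spellings under one form
    text = strip_accents(text)
    text = strip_punctuation(text)
    return " ".join(text.split())        # collapse leftover whitespace


raw = "@juan ¡Atención! Nuevos casos de COVID-19 en Perú: https://t.co/abc #QuédateEnCasa"
print(clean_tweet(raw))
```

The order matters in this sketch: links and mentions are handled before punctuation is stripped, since otherwise the "@" and "://" markers that identify them would already be gone.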

But there’s more! We have implemented a simple API that lets you select your collection without going through the interface.

The API entrance point is also here: https://covid.dh.miami.edu/get/ and it serves to deliver the .txt files that you want.

There are three main variables for queries: language, geolocalization, and date. A query always starts with a “?”, and the variables are separated by an ‘&’. They are abbreviated as follows:

  • lang = es or en
  • geo = fl, ar, es, co, pe, ec, mx, all
  • date = year-month-day, month-{year}-{month}, year-{year}, all, or a range from-{year}-{month}-{day}-to-{year}-{month}-{day}

Here are some examples:

  • Tweets in English, from Florida, on April 24th:
    https://covid.dh.miami.edu/get/?lang=en&geo=fl&date=2020-04-24
  • Tweets in Spanish, from Florida, on April 24th:
    https://covid.dh.miami.edu/get/?lang=es&geo=fl&date=2020-04-24
  • Tweets in Spanish, from Colombia, on May 17th:
    https://covid.dh.miami.edu/get/?lang=es&geo=co&date=2020-05-17
  • All tweets in Spanish from Florida:
    https://covid.dh.miami.edu/get/?lang=es&geo=fl&date=all
  • Tweets from Argentina from April 24th to 28th:
    https://covid.dh.miami.edu/get/?lang=es&geo=ar&date=from-2020-04-24-to-2020-04-28
  • All tweets from Spain during April:
    https://covid.dh.miami.edu/get/?lang=es&geo=es&date=month-2020-04
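If you prefer to build these query URLs programmatically, a minimal sketch with the Python standard library could look like the following. The helper names (build_query_url, date_range) are hypothetical and not part of the API itself; the accepted values are taken from the list above, and the sketch only assembles the URL without downloading anything.

```python
from urllib.parse import urlencode

BASE_URL = "https://covid.dh.miami.edu/get/"
LANGS = {"es", "en"}
GEOS = {"fl", "ar", "es", "co", "pe", "ec", "mx", "all"}


def date_range(start, end):
    """Build a range value from two year-month-day strings."""
    return f"from-{start}-to-{end}"


def build_query_url(lang, geo, date):
    """Assemble a query URL; the date string is passed through unchecked."""
    if lang not in LANGS:
        raise ValueError(f"unknown lang: {lang!r}")
    if geo not in GEOS:
        raise ValueError(f"unknown geo: {geo!r}")
    return BASE_URL + "?" + urlencode({"lang": lang, "geo": geo, "date": date})


print(build_query_url("en", "fl", "2020-04-24"))
print(build_query_url("es", "ar", date_range("2020-04-24", "2020-04-28")))
```

Validating lang and geo up front is a design choice of this sketch; how the server responds to unknown values is not documented here.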

Please, have fun! 😉

Remember: if the file is not generated already in the database, it will take some minutes to be generated.

Categories
Analysis

Frequency Analysis for South Florida (April – June)

This post compares the top 30 most frequent words and the top 20 hashtags in our Twitter English and Spanish corpora of South Florida from April 25th to June 25th, 2020. We divided it into 2 four-week periods to analyze broad trends and themes in the discourse.

For our corpus criteria as well as for the keywords used to harvest our corpus, please refer to our blog post “A Twitter Dataset for Digital Narratives”. As for our corpus, check our GitHub repo for the ID datasets to recover tweets collections.

The project uses Coveet, a frequency analysis tool in Python developed by Jerry Bonnell, a PhD student in Computer Science at the University of Miami, that retrieves basic statistics (most frequent words, bigrams, trigrams, top users, hashtags, etc.). Coveet allows 1) customized stopword removal, 2) top words retrieval by date, location, and language, 3) mining unique top words by location and date, 4) collocation analysis, and 5) visualization.

We have prepared a version of this post with a Jupyter notebook in our GitHub repo that is available to be run via Binder.

As far as the number of tweets is concerned, these are the totals in the South Florida area by period. As we can see, tweets in English are much more frequent than tweets in Spanish:

                                25 April – 25 May    25 May – 25 June
Tweets in Florida in Spanish    6,695                4,957
Tweets in Florida in English    23,548               18,867

Top 30 words in South Florida from April 25th to May 25th: English vs Spanish

We will start this post by querying the tweets for the first four weeks, from April 25 to May 25, in South Florida in both English and Spanish. We have prepared the dataset for interpretation by removing stop words, that is, the most common words in a language, which appear so frequently that they bear little significance. Removing stop words makes it easier to focus on the substantive discussions and themes in the corpus. There is no standard list of stop words for any language. A few examples of stop words in English include “I,” “is,” and “and.” We have established our own lists of stop words for English and Spanish in our GitHub repo.

The process used for all periods is as follows. After the query is done [3], our coveet.py script, with the help of the pandas package, processes all commands in a csv file, which can be read via Excel and is downloadable and portable [4]. We then run a function that tidies the csv data by removing the stopwords [5]. Afterward, we organize the resulting data by showing data, text, and hashtag, and separate strings of text into individual words (so a normal string such as “have a great day” is converted into “have” “a” “great” “day”) [6]. Finally, we create the top ngrams and visualizations for each section [7].
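To make the tokenize, filter, and count steps concrete, here is a minimal standard-library sketch. It is not coveet.py and does not use pandas; the function names and the tiny stopword list are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Tiny illustrative list; the project's real lists live in its GitHub repo.
STOPWORDS = {"i", "is", "and", "a", "the", "have", "of", "in"}


def tokenize(text):
    """Split a cleaned tweet into lowercase words."""
    return text.lower().split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n=1, k=30):
    """Count n-grams over all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = remove_stopwords(tokenize(text))
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


tweets = ["new cases in miami", "new cases and new deaths", "have a great day"]
print(top_ngrams(tweets, n=1, k=2))  # top words
print(top_ngrams(tweets, n=2, k=1))  # top bigrams
```

Note that in this sketch stopwords are removed before the n-grams are built, so bigrams can span a removed word; whether coveet.py does the same is not stated in this post.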

Produced with coveet.py written by Jerry Bonnell.

Top 20 hashtags South Florida from April 25th to May 25th: English vs Spanish

We have run the same process from April 25th to May 25th but recovering the 20 most used hashtags by language in South Florida.

Bar chart of 20 most frequent hashtags produced by coveet.py written by Jerry Bonnell.

English and Spanish discourses in South Florida both discuss daily new cases, infected patients, deaths, and testing during this global crisis.

Comparing and contrasting the top words and hashtags points us to some interesting areas for further investigation.

  1. The Spanish discourse seems more global. “eeuu,” “Cuba,” “Venezuela,” and “pais” suggest that the Spanish corpus discussed the pandemic on a national and international scale. “Miami,” a local term, on the other hand, is unique to the English corpus, whose top words don’t include any country names. Here are a few important questions to investigate:
    • Were foreign countries mentioned because of South Florida’s large population of residents of Latin American descent, Cuban and Venezuelan in particular?
    • Did these Twitter users want to compare the situation in the US to those of other countries?
    • Why is such an international focus more prominent in the Spanish corpus than in the English corpus?
  2. Public health measures are more prominent in the Spanish corpus. “Cuarentena” and “vacuna” show that discussions of quarantine policies and vaccines carry significant weight in Spanish-language tweets, while neither appears among the top words of the English corpus. How shall we explain this distinction?
  3. The English corpus seems to be more “interactive.” “Help,” “need,” “support,” and “please” suggest a call for action from another individual, and are unique to the English corpus. These words imply that the English tweets have a stronger intention to interact with readers and influence others’ behaviors. With further concordance analysis, here are a few questions that come to mind:
    • To whom are these actions directed? Government agencies, the audience at large, hospitals, etc.?
    • Are these demands for others, i.e. “people need to wear masks,” or calls for assistance for oneself, i.e. “my family needs support due to unemployment”?
    • In which topical areas did these words mostly appear: economic, medical, political, personal, etc.?
  4. “Business” and “work” are unique to the English corpus. Does it indicate more discussions about the economic effects of the pandemic?
  5. Since “gobierno” is unique to the Spanish corpus, how prominent are government-related discussions in Spanish tweets? Does the English corpus discuss government at all? How do they differ?

Top 30 words in South Florida from May 25th to June 25th: English vs Spanish

The same process now can be repeated for a second period, from May 25th to June 25th, and here are the top 30 words most used, after cleaning stopwords:

Produced with coveet.py written by Jerry Bonnell.

Top 20 hashtags South Florida from May 25th to June 25th: English vs Spanish

Also, here is a list of the most used hashtags in this period:

Produced with coveet.py written by Jerry Bonnell

Let’s first look at the common top words. With cases/casos and new/nuevo rising to top two, we can speculate more discussions about the increasing number of cases in South Florida after late May. “Miami” shows up in both English and Spanish corpora, indicating more attention paid to this area by Spanish-language Twitter users.

The list of unique words further reveals some patterns and research questions.

  1. The Spanish corpus remains more “global.” Venezuela and Cuba are again hot topics among the Spanish-speaking population, but this time Venezuela appears first, Brazil has been added to the list, and China has disappeared. As the situation in these countries worsens, they naturally rise to higher positions in the public conversation.
  2. “Masks” is (finally) a top word in the English corpus. This aligns with various states’ mandatory mask policies, calls for responsible protesting during the Black Lives Matter movement, and reflects an improved public awareness of responsible preventive measures.

Top unique hashtags

It is also interesting to check the corpus looking only for unique hashtags that appear in English or in Spanish.
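One simple way to think about "unique" hashtags is as a set difference between the two corpora. The sketch below assumes that a hashtag counts as unique to one language if it never occurs in the other corpus; coveet.py may define uniqueness differently (for example, relative to the other corpus's top list only), and the function names here are hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def hashtag_counts(tweets):
    """Count case-folded hashtags across a list of tweet texts."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet))
    return counts


def unique_top_hashtags(own, other, k=20):
    """Top-k hashtags of `own` that never occur in `other`."""
    return [(tag, n) for tag, n in own.most_common() if tag not in other][:k]


english = hashtag_counts(["#covid19 #miami stay safe", "#miami open", "#covid19"])
spanish = hashtag_counts(["#covid19 #cuba", "#cuba #venezuela", "#cuba"])
print(unique_top_hashtags(english, spanish))
print(unique_top_hashtags(spanish, english))
```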

These are the top hashtags that appear in the first period of time (April 25th to May 25th):

Produced with coveet.py written by Jerry Bonnell

This bar plot, in turn, shows the top unique hashtags from May 25th to June 25th:

Produced with coveet.py written by Jerry Bonnell

We hope, with this blog post, to have shown some more techniques on how to explore our Twitter corpus. Stay tuned for more functionalities and more thoughts about what people are saying about Covid-19 on social media.

Categories
Cleanup Methods Visualization

Outbreak Topics: Topic modeling of COVID-19

In this post, we will present another way to explore our dataset of tweets on Covid-19. We intend to detect emerging topics of interest for our study of the social narratives about the pandemic. For this, we will perform unsupervised machine learning using different Python libraries.

In this case, we work with data in Spanish, but the same processing can be applied to English data; only a few parameters will diverge.

“Cleaning is usually 80% of a data scientist’s time”

Working with big data usually involves spending most of the time cleaning and organizing the data (source). This case is no exception! Cleaning is crucial for text processing, not only because it reduces a text and makes it easier for a machine to read, but also because it can significantly improve the quality of the results.

Our first step is to filter stopwords and emojis. We used generic lists available in standard libraries (NLTK, emoji). Based on the results, we refresh the list of stopwords to eliminate noise (e.g., “retwitt”) or words that are too obvious (e.g., “covid19”).

Another important step of preprocessing is part of speech detection and lemmatization. We chose to use Stanza (from Stanford NLP) because it yields better results for Spanish lemmatization. Spacy could be a better choice for English, since it obtains good results with reduced morphology languages such as English. The processing time is also notably faster than with Stanza.

After preprocessing, we can tackle the detection of the main topics of our Covid-19 corpus with machine learning using Gensim, a Python library for topic modeling. We will perform unsupervised learning because we don’t know the content or quantity of our topics in advance. Then, we will train our models with LDA for 3 to 30 topics.

Topic coherence of the models measured with c_npmi, c_uci and u_mass for all the tweets in Spanish on April 25th

The topic coherence plots generated for the Spanish-language tweets from April 25th show that the conversation in our Covid-19 corpus was very focused, since the coherence drastically falls when the topic number grows. Given the opposite case (when topic coherence scores better for more topics), it is important to find an agreement between the results of the coherence scores and the number of topics interpretable by humans. It is hard to imagine a human analysis dealing with over a dozen topics for the same corpus.
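To give an intuition for what an NPMI-style coherence score rewards, here is a simplified standard-library sketch. It is not Gensim's implementation: Gensim's c_npmi estimates co-occurrence differently, whereas this sketch treats each tweet as one document and simply averages the normalized pointwise mutual information over all pairs of a topic's top words. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """Normalized PMI of two words over documents given as sets of tokens."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 0.0   # both words in every document; treated as 0 here (assumption)
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        raise ValueError("need at least two top words")
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


docs = [
    {"casos", "muertes", "nuevos"},
    {"casos", "muertes"},
    {"gobierno", "presidente"},
    {"gobierno", "presidente", "pais"},
]
print(topic_coherence(["casos", "muertes"], docs))   # words that co-occur
print(topic_coherence(["casos", "gobierno"], docs))  # words that never co-occur
```

A topic whose top words tend to appear in the same tweets scores close to 1, while a topic that mixes words from unrelated conversations scores close to -1, which is why the score alone cannot replace a human reading of the topics.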

Visualizing the results with graphics is very helpful for the analysis. A popular library for visualizing topic models is pyLDAvis, which shows the most frequent words of each topic in an interactive graphic.

7 topics model for all tweets in Spanish of covid19 on April 25th 2020

On the previous graphic, which shows seven topics for tweets from April 25th in all the locations of our sample (Argentina, Colombia, Ecuador, España, Florida, México, and Perú), we can observe the problem pointed out earlier: it becomes difficult for a human to understand the criteria used to group certain words under a certain topic. As the topic number increases, the topics are less interpretable, even if they have high coherence scores.

This problem is probably due to the size of our sample: aside from Covid-19, Twitter users of each Spanish-speaking location discuss different, largely unrelated subjects. We will compare the results for Argentina and Colombia to find out.

Topic coherence of the models measured with c_npmi, c_uci and u_mass for Argentina, best results for 3 and 5 topics
Topic coherence of the models measured with c_npmi, c_uci and u_mass for Colombia, best results for 3 and 7 topics

But first, a word on another type of visualization that we found very useful for topic modeling, Circle Pack. This type of graphic uses colors to represent different topics and spheres whose size is relative to word frequency. Let’s compare the Circle Pack for April 25th in Argentina and Colombia for three topics, given that both countries received high scores for this topic number.

Topics of covid19 tweets in Colombia on April 25th 2020

On the graphic for Colombia, the red circles represent a topic relating the pandemic to politics. It includes words such as “government,” “president,” and “country.” The blue topic addresses public health issues and includes “vaccine,” “virus,” and “test.” The green topic seems more related to daily statistics of case numbers, deaths, and infected people.

Topics of covid19 tweets in Argentina on April 25th 2020

In order to interpret the Circle Pack for Argentina, it is essential to dive into the latest news of that particular day. Doing so reveals a controversy surrounding a baby named Ciro Covid, who was born on April 24th in Santa Fe. The question “Who would name her/his baby Ciro Covid?” that flooded Argentinian Twitter the next day is not only clearly represented in the green topic but also invaded the tweets of daily reports of case and death numbers (red topic). On a remarkably smaller scale, we can observe another trending topic in Argentina that day: the controversy over inmates being released on probation as a preventative measure, represented in blue.

Once again, we confirm that the humanist approach of understanding data and its contexts is critical for assigning a meaning to the results of automatic processing.

For more details about the processing performed for topic modeling, download the notebook available on our project Github repository.

Categories
Theorizing visualization

What can academic journals tell us about COVID-19 and Education?

The Covid situation has put new terms into our everyday vocabulary, terms such as pandemic or infodemic. This last one, according to Wiktionary can be defined as:

Blend of information +‎ epidemic

Noun

infodemic (plural infodemics)

  1. (informal) An excessive amount of information concerning a problem such that the solution is made more difficult.
  2. (informal) A wide and rapid spread of misinformation.

One good way of surviving an infodemic is to analyze data. AVOBMAT (Analysis and Visualization of Bibliographic Metadata and Texts, https://avobmat.hu/) is a text mining research tool that was primarily designed for digital humanities research. It is a powerful digital toolkit for analysing and visualizing bibliographic metadata and texts. AVOBMAT added a COVID-19 dataset to its new text mining research tool. This is a resource of over 138,000 scholarly articles (sadly, only in English), including over 69,000 with full text, regarding COVID-19, SARS-CoV-2, and related coronaviruses. We thought that before delving into the sea of Twitter to see what is happening in relation to the pandemic and Education (Higher Ed, Remote Teaching, etc.), we should build a framework that could support and inform our hypothesis. We used AVOBMAT to explore what scientific journals published between 2019 and 2020 regarding these topics.

First, we did a General Lucene query: we set up a period (2019 and 2020) and chose some general words such as “syllabus”, “education” and “Coronavirus” (not only COVID-19, but all the Coronavirus diseases). The search showed us 298 articles (of course, all of them in English): http://dighum.bibl.u-szeged.hu/avobmat-covid/home

Then, we chose to see what this general search could tell us in a closer approach, though still distant. We chose the WordCloud visualization option, and this was the result:

WordCloud in AVOBMAT

Something we generally expected, and then had confirmed by the cloud, are the references to cities and countries (Wuhan, Hubei, China, Vellingiri) and references to specific months (December, February, March). As the situation in the US was not critical until April, we discovered the presence of the East. However, it is curious that other countries such as Italy, Spain, and the United Kingdom, all of which were in a concerning situation through early 2020, were missing. We could explain these results by arguing that academic writing and publishing were slow to respond to this new context, and maybe also that there was not much interest in the topics we were looking at (syllabus, education, coronavirus). However, the explanation lies in the coronaviruses themselves, specifically SARS-CoV (2002-2003) and MERS-CoV (2012-present). All of the other coronaviruses mainly attacked countries from the East and not the West. This explains the appearance of some of the cities that we mentioned before. Actually, it wasn’t until March 2020 that some journals, such as Inside Higher Ed and The Chronicle of Higher Ed, started publishing articles that talked about Covid-19 and Higher Ed in the US. Earlier publications from 2020 or even January and February 2020 were talking about new challenges in Higher Ed in China, South Korea or Europe (Italy, Spain, UK) (see for instance the search we did for Inside Higher Ed journal).

All in all, it is really interesting that in this cloud education is related with medicine (healthcare, pharmacists, emergency, quarantine, transmission) and, obviously with face, mask…and Google. Of course, it is not only bodily medicine referred to here, but also terms such as psychiatrist, mental, etc.

Keyword in Context (KWIC) in AVOBMAT. Education.

Finally, if we do a very close reading and analyze the metadata given in the general search, we find that in most of the articles the term education is related to the variables that the researchers used to study the disease. For instance, this is a passage in “A County-level Dataset for Informing the United States’ Response to COVID-19” by Benjamin D. Killeen et al. (2020), in which the authors state that they have used “300 variables that summarize population estimates, demographics, ethnicity, housing, education, employment and income, climate, transit scores, and healthcare system-related metrics” (https://arxiv.org/pdf/2004.00756.pdf). In other cases, the term education is closely tied to a ministry: in the case of Iran, the work of the Ministry of Health and Medical Education is much cited (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085938/).

Journals visualization in AVOBMAT

Therefore, it’s not easy to understand what this cloud is telling us. 

If we do a similar Lucene query but replacing Coronavirus with Covid-19, plus education and syllabus, we find 458 articles that show us these words:

WordCloud in AVOBMAT

Of course, places (Hubei, Wuhan, China) and months (January, February, March) are still there. Terms related to mental illnesses are there too (psychiatrist, mental), but quarantine now has a synonym which has been widely used in anglophone countries: lockdown. We also have words similar to Google (for example, Internet), and newcomers such as WhatsApp and others related to our new life, such as online, distance and telemedicine.

But what about education, as teaching and learning? We further detailed our search using terms such as teaching, universities, learning, students and COVID-19. As a result, we got 199 articles in which these were the most used words:

WordCloud in AVOBMAT

Gathering versus lockdown, moodle, moocs, distance, gym, gave us a very realistic picture of the education scenario these days. Even the metadata visualization tells us that these topics are approached from the Medical Sciences, and it gives us a detailed picture of our global COVID-19 situation.

Journal visualization in AVOBMAT

As we suspected, most of the articles published about COVID-19 and the different approaches to topics related to education, Higher Education, etc. are related to studies in the Medical Sciences. On the one hand, as expected, this is a dominant discipline in a pandemic context, but it also shows how Medical Sciences have improved the slow timing of academic writing. Of course, we are not accounting for all the publications on these topics, as many harvesting services from other latitudes are not included as part of the AVOBMAT service. Nevertheless, it gives us the big picture we need to move on, in our next post, to what the tweets are saying on these topics. More distant and close reading coming soon!

Marisol Fila and Gimena del Rio Riande

Categories
Content Analysis Visualization

Analyzing a Twitter Corpus with Voyant (I)

The first step of working with data is to get to know your corpus. Our project, for instance, is most concerned with the linguistic and humanistic contexts in the Twitter discourses generated by the Covid-19 pandemic. Some starting “get-to-know-you” questions we are interested in about our corpus include the trend of daily corpus length, most frequently used words, term co-occurrence, and corpus comparisons by time, locations, and languages.

The large size of the data makes manual reading nearly impossible. Machine learning, thankfully, assists humanists in understanding key characteristics of the corpus and, in turn, developing analytical questions for research. Employing digital methods in the humanities, however, does not equate to replacing human reading with software. The computer can make otherwise time-consuming, or unimaginable, tasks feasible by showing relationships and patterns in big data. Digital humanists then apply critical analysis and expertise in the humanities to attempt to make sense of these patterns and their broader implications. In other words, machines provide a new method to observe crucial information about large-scale texts that manual reading alone cannot accomplish or detect. The results machines generate are just the beginning of every DH project, not its output. Human analysis and humanities knowledge remain at the core of DH scholarship.

Voyant is one of the tools we use to capture a snapshot of our corpus. It is web-based software for large-scale text analysis, with functions for comparing corpora, counting word frequencies, analyzing co-occurrence, interpreting key topics, etc. It does not require installation and is compatible with most machines. Here is a tutorial, or rather an experiment, on working with Voyant to conduct initial textual explorations of our corpus, which is updated on a daily basis and available at: https://github.com/dh-miami/narratives_covid19/tree/master/twitter-corpus (check our previous post on hydrating TweetSets).

For this tutorial, we selected the English corpus in Florida on April 28, 2020, the day total cases in the U.S. reached the one million mark. Voyant reads plain text (.txt) files, either pasted into the dialogue box or uploaded as a file. Here are the initial results we got after uploading the hydrated corpus.

Dashboard displaying all patterns observed

Beginning with the summary, we know that on April 28 our corpus consists of 21,878 words, of which 4,955 are unique. Vocabulary density is calculated by dividing the number of unique words by the total number of words. The closer the value is to 1, the denser and more diverse the corpus is. With a density of 0.226, the April 28 corpus is not very diverse. Once we run tests on our entire collection, we will be able to tell whether this density is the norm throughout the corpus or a significant finding.

Summary of the April 28 English corpus in Florida
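The density calculation described above is simple enough to reproduce outside Voyant. The following is a minimal sketch in Python; the function name and the regular-expression tokenizer are our own assumptions, and Voyant's tokenizer may split words differently, so results on the same text may not match exactly.

```python
import re


def vocabulary_density(text: str) -> float:
    """Return the number of unique words divided by the total number of words."""
    # Naive lowercase tokenization (an assumption; Voyant may tokenize differently).
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Reproducing the figure from the summary: 4,955 unique words out of 21,878.
print(round(4955 / 21878, 3))
```

Applied to the counts reported in the summary, the division gives the same 0.226 that Voyant displays.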

We can also see that empty words, such as “user” and “url,” which appear in every Twitter document and do not hold any significance, are distorting the list of most frequent words as well as the cirrus. We can remove these terms by clicking “define options for this tool” in the top-right corner of the cirrus box and editing the stop word list. Voyant can also automatically detect and remove a default list of stop words. To keep a clear record of your results, it is best to keep a list of the words you remove. Here is the new cirrus graph after removing “user” and “url.”

Cirrus visualization with top 45 most frequent terms

The top 5 most frequent words in the corpus are “covid19” (844 counts), “coronavirus” (77 counts), “pandemic” (77 counts), “people” (57 counts), and “help” (51 counts). Since our entire collection of tweets is about the Covid-19 pandemic, words including “covid19,” “coronavirus,” and “pandemic” are likely to appear in most daily corpora. To get a closer look at what the corpus on April 28 looks like, we removed these consistent thematic words and generated a new cirrus graph.

Top 45 most frequent words excluding “covid19,” “coronavirus,” and “pandemic”

The new top 5 most frequent words are “people” (57 counts), “help” (51 counts), “new” (45 counts), “just” (44 counts), and “testing” (44 counts). Based on these words, we can speculate that topics related to new cases and testing took up a significant portion of the April 28 data. As next steps, we will keep track of the daily most frequent words, explore other Voyant features, and analyze the larger trend.
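For readers who want to keep a reproducible record of the stop words they remove, the same frequency count can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name, the tokenizer, and the sample sentence are hypothetical, and the counts will not necessarily match Voyant's.

```python
import re
from collections import Counter


def top_terms(text, stop_words, n=5):
    """Return the n most frequent words in text, skipping the given stop words."""
    # Naive lowercase tokenization (an assumption; Voyant may tokenize differently).
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Hypothetical sample text, not taken from the corpus.
sample = "user covid19 help people url covid19 testing people user people"
removed = {"user", "url", "covid19"}  # the recorded list of removed words
print(top_terms(sample, removed, n=2))
```

Keeping the removed words in a named set, as above, doubles as the written record recommended earlier.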

Categories
Cleanup Methods

How to “hydrate” a TweetSet?

Twitter public discourse is one of our project’s primary research concerns. Twitter’s rich data has also drawn more and more researchers from various disciplines and fields to explore different aspects of society. This blog post serves as a tutorial on using DocNow Hydrator to “hydrate” tweets. Our project, as we explained, is offering a series of datasets on Covid-19 that can be downloaded from our GitHub repo.

Due to Twitter’s Developer terms and research ethics, most TweetSets we can acquire from Twitter’s Application Programming Interface (API) and third-party databases are dehydrated tweets. In other words, instead of collecting tweet contents, geolocations, time, images, and other information attached to tweets, what researchers initially receive is a plain text file consisting of a list of unique tweet IDs. These IDs allow us to retrieve all tweet metadata, including the text, and they need to be “hydrated” to recover the metadata and become meaningful research sources. The large size of tweets’ correlated data is another reason why datasets offer only dehydrated IDs: a file containing only a series of numbers (IDs) is much more manageable than, for example, a csv file with thousands of tweets and their metadata.

A sample of dehydrated Twitter IDs
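Before uploading a dehydrated file to a hydration tool, it can be useful to check that it really contains one numeric ID per line. Below is a minimal sketch in Python; the function name and the one-ID-per-line layout are our assumptions, based on the description above.

```python
from pathlib import Path


def load_tweet_ids(path):
    """Read a dehydrated file and return the tweet IDs it contains as strings."""
    ids = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Keep only lines made up entirely of digits; skip blanks and stray text.
        if line.isdigit():
            ids.append(line)
    return ids
```

IDs are kept as strings rather than integers so that they are not altered if the list is later opened in spreadsheet software.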

DocNow Hydrator is a commonly used open-source tool for hydrating tweet IDs and can be downloaded for free from GitHub. You need to link your Twitter account in “Settings” before using Hydrator.

Hydrator “Settings” page to link Twitter account

Once your Hydrator is set up, you can upload your tweet IDs file to Hydrator. In our case, we use the Covid-19 dataset from our Digital Narratives project’s GitHub repo, which we update on a daily basis:

Hydrator “Add” tab to upload Tweet ID files

If your file has been processed correctly, Hydrator will display your file path and compute the total number of tweet IDs detected. In “Title” you can rename your hydrated file, while the rest of the boxes can be ignored. Then click “Add Dataset.”

After uploading a tweet ID file

Click “Start” to hydrate the tweet IDs.

The newly generated dataset “COVID0401” is now available under the “Datasets” tab.

A new window will pop up and ask you to locate and name your hydrated tweet IDs file. Hydrator will generate a .json file by default. Saving your document as a .csv file makes it more easily accessible in Excel and other file readers.

Saving the hydrated document in .csv format and selecting the correct location to store

Hydrator will then begin the hydration process. Completion time depends on the number of tweet IDs.

The progress bar will be filled with green when hydration is completed.

The completed .csv file now displays all the correlated information of the original tweet IDs.

Due to privacy concerns, we are not displaying the specific contents of the hydrated file
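If you have kept the default .json output and later need a .csv, the conversion can also be scripted. The sketch below rests on assumptions that should be checked against your own file: that the file holds one JSON tweet object per line, and that the objects carry the fields `id_str`, `created_at`, and `full_text`. The function name is ours, and missing fields are simply left blank.

```python
import csv
import json

# Assumed field names; inspect one line of your own file and adjust as needed.
FIELDS = ["id_str", "created_at", "full_text"]


def jsonl_to_csv(jsonl_path, csv_path, fields=FIELDS):
    """Copy selected fields from a line-delimited JSON file into a CSV file."""
    rows = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            # Fields absent from a tweet are written as empty cells.
            writer.writerow({f: tweet.get(f, "") for f in fields})
            rows += 1
    return rows
```

The function returns the number of tweets written, which can be compared with the total that Hydrator reported for the dataset.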

Researchers can then analyze geolocations, images, emojis, tweet discourse, hashtags, time, and other correlated information and metadata for various purposes. If you use our dataset, please keep us updated and feel free to share your valuable feedback and suggestions with us. Stay tuned and thank you for keeping up with our project.

Categories
Data Gathering

A Twitter Dataset for Digital Narratives

At the end of April we started familiarizing ourselves with the Twitter API and asking how to capture the public conversations happening on this social media network.

We quickly understood we needed to focus on a plan and method for organizing our corpus, accomplishing our objectives, and dividing the different tasks among our team members.

Datasets in English are very numerous (see the post “Mining Twitter and Covid-19 datasets” from April 23rd, 2020). In order to start with a more defined corpus, we decided to focus on Spanish datasets, in general and per area. We also wanted to give special treatment to the South Florida area and approach it from a bilingual perspective, due to its linguistic diversity, especially in English and Spanish. With this in mind, one part of the team analyzes public conversations in English and Spanish and focuses on the area of South Florida and Miami, while the CONICET team is in charge of exploring data in Spanish, namely from Argentina.

To enlarge our dataset, we have also decided to harvest all tweets in Spanish and to create specific datasets for other parts of Latin America (Mexico, Colombia, Perú, Ecuador) and for Spain. To organize our corpus, we built a relational database that collects all the information connected to these specific tweets and automatically ingests hundreds of thousands of tweets a day.

We have different queries running, which correspond to the datasets in our “twitter-corpus” folder in GitHub. In short, there are three main types of queries:

  1. General query for Spanish harvesting all tweets which contain these hashtags and keywords: covid, coronavirus, pandemia, cuarentana, confinamiento, quedateencasa, desescalada, distanciamiento social.
  2. Specific query for English in Miami and South Florida. The hashtags and keywords harvested are: covid, coronavirus, pandemic, quarantine, stayathome, outbreak, lockdown, socialdistancing.
  3. Specific queries with the same keywords and hashtags for Spanish in Argentina, Mexico, Colombia, Perú, Ecuador, Spain, using the tweet geolocalization when possible and/or the user information.

Folders are organized by day (YEAR-MONTH-DAY). In every folder there are 9 different plain text files named with “dhcovid”, followed by date (YEAR-MONTH-DAY), language (“en” for English, and “es” for Spanish), and region abbreviation (“fl”, “ar”, “mx”, “co”, “pe”, “ec”, “es”):

  1. dhcovid_YEAR-MONTH-DAY_es_fl.txt: Dataset containing tweets geolocalized in South Florida. The geo-localization is tracked by tweet coordinates, by place, or by user information.
  2. dhcovid_YEAR-MONTH-DAY_en_fl.txt: This file contains only tweets in English that refer to the area of Miami and South Florida. The reason behind this choice is that there are multiple projects harvesting English data, and our project is particularly interested in this area because of our home institution (University of Miami) and because we aim to study public conversations from a bilingual (EN/ES) point of view.
  3. dhcovid_YEAR-MONTH-DAY_es_ar.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Argentina.
  4. dhcovid_YEAR-MONTH-DAY_es_mx.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Mexico.
  5. dhcovid_YEAR-MONTH-DAY_es_co.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Colombia.
  6. dhcovid_YEAR-MONTH-DAY_es_pe.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Perú.
  7. dhcovid_YEAR-MONTH-DAY_es_ec.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Ecuador.
  8. dhcovid_YEAR-MONTH-DAY_es_es.txt: Dataset containing tweets geolocalized (by georeferences, by place, or by user) in Spain.
  9. dhcovid_YEAR-MONTH-DAY_es.txt: This dataset contains all tweets in Spanish, regardless of their geolocation.
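Because the naming scheme is regular, the nine file names for any given day can be generated rather than typed by hand. The following is a minimal sketch in Python; the function name is ours, and we assume that YEAR-MONTH-DAY is written with zero-padded month and day (for example, 2020-04-28).

```python
from datetime import date

# (language, region) pairs for the eight regional files, in the order listed above.
REGIONS = [
    ("es", "fl"), ("en", "fl"), ("es", "ar"), ("es", "mx"),
    ("es", "co"), ("es", "pe"), ("es", "ec"), ("es", "es"),
]


def daily_filenames(day: date) -> list:
    """Return the nine file names expected in the folder for the given day."""
    stamp = day.isoformat()  # assumed format: YYYY-MM-DD
    names = [f"dhcovid_{stamp}_{lang}_{region}.txt" for lang, region in REGIONS]
    # The ninth file holds all tweets in Spanish and has no region suffix.
    names.append(f"dhcovid_{stamp}_es.txt")
    return names


for name in daily_filenames(date(2020, 4, 28)):
    print(name)
```

A list like this can be used to check that a downloaded daily folder is complete.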

As of today, May 23rd, we have a total of:

  1. Spanish from South Florida (es_fl): 6,440 tweets
  2. English from South Florida (en_fl): 22,618 tweets
  3. Spanish from Argentina (es_ar): 64,398 tweets
  4. Spanish from Mexico (es_mx): 402,804 tweets
  5. Spanish from Colombia (es_co): 164,613 tweets
  6. Spanish from Peru (es_pe): 55,008 tweets
  7. Spanish from Ecuador (es_ec): 49,374 tweets
  8. Spanish from Spain (es_es): 188,503 tweets
  9. Spanish (es): 2,311,482 tweets

We do not include retweets, only original tweets.

The corpus consists of a list of tweet IDs. To obtain the original tweets, you can use the Twitter “Hydrator,” which takes the IDs and downloads all the metadata for you in a csv file.

Fig. 1. Screenshot of a list of tweets ids.

We started collecting our dataset on April 24th, 2020. For prior dates (January – April 24th), we hope to use the PanaceaLab dataset, since it is one of the few that collects data in all languages, and we expect to achieve this in the next couple of months.

word cloud

We have released a first version of our dataset through Zenodo: Susanna Allés Torrent, Gimena del Rio Riande, Nidia Hernández, Romina De León, Jerry Bonnell, & Dieyun Song. (2020). Digital Narratives of Covid-19: a Twitter Dataset (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3824950

Categories
Content Analysis Curricula Visualization

COVID-19 and Higher Ed. A Look From the Digital Humanities

2020 opened with the news of a new disease. In a couple of weeks it became a global pandemic, and we have all been concerned with this topic since then. Higher education is not exempt from it, and in the last few months we have seen how discussions on the pandemic have reached the syllabi.

From Humanities to Sciences, all disciplines are having discussions on causes, local and global consequences, history, politics… all about COVID-19. Aligned with the spirit of our project, we believe that Digital Humanities can help us to grasp what, how, and where these topics are discussed in Higher Ed.

Over the next few months, we will be posting some analysis and visualizations of the way syllabi are reacting to the global pandemic, and from which perspectives. Since we are relying on sources that have been made publicly available, our initial corpus will be composed of syllabi from the US, but we aim to open it up to Latin America as new material comes up. Stay tuned!

Undergraduate Course Syllabi | National Communication Association
Categories
Capture Data

Mining Twitter and Covid-19 datasets

The only topic these days: the coronavirus, Covid-19, the pandemic, SARS, the crisis, disease, the enemy, the survival… We all are under the same global situation and we all are concerned by the many impacts and consequences that this event is having and will have in our lives.

This pandemic can be approached from infinite perspectives, and we think that digital humanities can also contribute. We are especially interested in the digital narratives that inform the outbreak. What are the outbreak’s narratives? Certainly, they are not unique or monolithic.

Social distancing brings social media to the front line, and some of these platforms are open for mining and retrieving what people are saying. The clearest example is Twitter, which has an API to recover tweets, containing texts and social interactions. Many scholars and projects are already mining data about Covid-19 and providing tweet datasets to be downloaded and explored. Here is a list of these datasets:

  • “Covid-19 Twitter chatter dataset for scientific use” (Panacea Lab) is an online dataset, stored in GitHub and distributed under a DOI with Zenodo (the version number is updated almost every week). They have been gathering data since January 27th and they capture all languages, but, as they explain, the most prevalent are English, Spanish, and French. They deliver the datasets in two different forms: one dataset contains all mentions and retweets, while the other is a clean version containing only the tweets. They also perform NLP tasks and provide the top 1,000 frequent words and top co-occurrences. They complement their dataset by building general statistics. This corpus, as established by Twitter’s Terms of Service, consists of a list of tweet identifiers, which need to be hydrated. Check also their e-print posted on arXiv, “A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration”.
  • COVID-19-TweetIDs (E. Chen, K. Lerman, and E. Ferrara) is another ongoing collection of tweets associated with the pandemic. They commenced gathering data on January 28th. In their particular case, besides harvesting hashtags, they use Twitter’s streaming API to track specified user accounts and specific keywords. They have structured their GitHub repository by month, day, and hour. Each month folder contains a .txt file per day and hour. These .txt files also consist of tweet IDs and thus need to be hydrated. Check also their e-print posted on arXiv, “COVID-19: The First Public Coronavirus Twitter Dataset.”
  • “Coronavirus Tweet Ids” (D. Kerchner, L. Wrubel) contains the tweet ids of 155,206,805 tweets related to Coronavirus. Their starting date was March 3, 2020, and they keep releasing a new version approximately every two weeks. To build the collections they use Social Feed Manager.
  • “Corona Virus (COVID-19) Tweets Dataset” (R. Lamsal) provides a CSV dataset with the tweet ids. This initiative monitors the real-time Twitter feed by tracking only English “en”, and the words “corona”, “covid”, “covid-19”, “coronavirus” and the variants of “sars-cov-2”. Simultaneously, they have sentiment.live, a site that visualizes sentiment analysis of the Twitter feed.

There are many other catalogs, projects, and repositories that gather Twitter collections. We also recommend having a look here:

and to check the astonishing Covid-19 Dashboard to track the number of cases worldwide.

In this momentum of data, our project Digital Narratives of Covid-19 would also like to create a Twitter dataset conceived under these criteria:

  • By language: English, Spanish
  • By region: South Florida, Miami
  • By date: January 27th –
  • By hashtags (covid, covid-19, coronavirus, etc.)

We are fairly new to these techniques, so bear with us while we post tutorials on how we are doing it, and join us!