An Analysis of COVID-19 Tweets

Shivam Sharma
12 min read · May 23, 2021

This article aims to give a brief description of the work done for the paper by Shivam Sharma and Cody Buntain titled “An Evaluation of Twitter Datasets from Non-Pandemic Crises Applied to Regional COVID-19 Contexts”, in the Proceedings of the 18th ISCRAM Conference, under the Social Media for Disaster Response and Resilience track.

Visits to popular social media websites.

COVID-19 has affected all of our lives in a major way. With people staying indoors and quarantining to keep safe from the virus, many turn to social media to connect with society, be it to seek advice, ask for aid, or show their support, just as in other crisis events. As the figure above shows, all major social media websites saw an increase in average monthly visits compared to 2019. However, a pandemic differs from other crisis events in that it lasts much longer and affects far more people. In this work, we aim to analyze whether the information shared on social media during the COVID pandemic has some underlying similarity with information shared during general, non-pandemic crisis events.

The major takeaway from this presentation and from the work presented in our paper is that:

There is some consistency between the information shared during pandemic-type events and during non-pandemic events.

To provide evidence for this, we propose the following hypothesis:

Information signals from non-pandemic events will aid in the improvement of information-type classification of COVID-only data.

In this article, as in our paper, we test this hypothesis by conducting a cross-validation experiment. We split this article into five sections to better explain our approach to testing the above-mentioned hypothesis:

  1. Dataset Selection
  2. Cross Validation Experiment
  3. Model Description
  4. Results and Conclusion
  5. Future Work

Dataset Selection

As discussed earlier, there are various social media platforms through which people connect with society for different purposes related to this pandemic. From these, we wanted to select a dataset that is structured and well labeled.

Incident Streams (TREC-IS) is a track at the TREC conference that has been working in the field of crisis informatics since 2018. It provides a standardized Twitter dataset, with the latest edition containing tweets regarding COVID-19 as well.

Number of tweets in each TREC-IS edition

TREC-IS releases two editions of test data each year, which are then included in the training data for the next edition. For example, the test data released in the 2019-A edition is labeled and included in the training data for the 2019-B edition. For the experiment described in our paper, we use the labeled data from the 2018-A, 2018-B, 2019-A, 2019-B, 2020-A, and 2020-A COVID editions.

There are two main tasks to be accomplished using the data provided by TREC-IS:

  1. Information Type Classification: This task involves classifying the given tweets into 25 different Information Type Labels. This is a multi-label classification task, which means one tweet can belong to more than one Information Label.
  2. Priority Score Prediction: This task involves predicting a priority score for each tweet between 0 and 1, with 0 being the lowest priority and 1 the highest priority.

In our paper, we use the 25 different Information Type Labels for our cross-validation test.

Priority Score Distribution

The labeled dataset can be divided into 4 major classes based on their priority scores:

  1. Critical : priority score > 0.75
  2. High : 0.5 < priority score <= 0.75
  3. Medium : 0.25 < priority score <= 0.5
  4. Low : priority score <= 0.25
Labeled tweets distribution based on priority labels

As evident from this table, there is a heavy imbalance in the labeled dataset, with more than 80% of the tweets being labeled as “Low” and “Medium”.
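
The thresholds listed above translate into a simple bucketing rule. Below is a minimal sketch of that mapping; the function name is hypothetical.

```python
# A minimal sketch of the priority bucketing implied by the thresholds above;
# the function name is hypothetical.
def priority_class(score: float) -> str:
    """Map a TREC-IS priority score in [0, 1] to one of the four classes."""
    if score > 0.75:
        return "Critical"
    elif score > 0.5:
        return "High"
    elif score > 0.25:
        return "Medium"
    return "Low"

print(priority_class(0.8))  # Critical
print(priority_class(0.3))  # Medium
```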

High Level Information Type Labels

High-level information type label distribution

The 25 different Information Type Labels can be further clustered into 4 different High-Level Information Type Labels, as shown in this table. As evident from the table, there is a heavy imbalance in the labeled dataset, with high-level labels like “Report” and “Other” occurring far more frequently than “Request” and “CallToAction”.

1. Request

This table shows the 3 different Information Labels in the “Request” High-Level Information Label.

Information label distribution in “Request” high-level information label

2. CallToAction

This table shows the 3 different Information Labels in the “CallToAction” High-Level Information Label.

Information label distribution in “CallToAction” high-level information label

3. Report

This table shows the 14 different Information Labels in the “Report” High-Level Information Label.

Information label distribution in “Report” high-level information label

4. Other

This table shows the 5 different Information Labels in the “Other” High-Level Information Label.

Information label distribution in “Other” high-level information label

As evident from the above tables, there is a heavy imbalance in the individual information type labels, with labels like “Irrelevant”, “Sentiment”, and “News” occurring far more frequently than labels like “GoodsServices”, “InformationWanted”, and “MovePeople”. Later in this article, we describe a simple synonym-based text augmentation method to counter this imbalance in the labeled dataset.

Cross-Validation Experiment

Our hypothesis suggests that models trained on COVID as well as general non-pandemic data will outperform models trained only on COVID data, when tested on a held-out set of COVID-only data.

We aim to test this by comparing two models trained on two different datasets: one containing only COVID tweets, and the other containing COVID as well as general non-pandemic tweets, which we call the Full dataset. We evaluate both models on a held-out COVID-only dataset.

COVID-only data cross-validation test

This image describes the cross-validation experiment for the COVID-only dataset. The training data contains only the labeled COVID tweets provided by TREC-IS. Using this dataset, a 10-fold cross-validation experiment was conducted.

Full data cross-validation test

This image describes the cross-validation experiment for the Full dataset. As mentioned above, the Full dataset contains labeled tweets from COVID as well as general non-pandemic events. Similar to the cross-validation experiment for COVID-only tweets, we carry out a 10-fold cross-validation experiment for the Full dataset as well.

A point to note: since we want to compare the two models (one trained on the COVID-only dataset and one on the Full dataset), the held-out fold in each of the 10 folds is the same for both experiments and contains only labeled COVID tweets.
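
To make the comparison concrete, here is a rough sketch of this shared held-out design, under the assumption that the COVID tweets are split into 10 folds and the non-pandemic tweets are only ever added to the training side; the loaders and train/evaluate helpers are hypothetical.

```python
# A rough sketch of the shared held-out design: the COVID tweets are split into
# 10 folds, each fold serves as the test set for BOTH experiments, and the Full
# experiment simply adds the non-pandemic tweets to the training side.
# The loaders and train/evaluate helpers below are hypothetical.
from sklearn.model_selection import KFold

covid_tweets = load_covid_tweets()          # hypothetical loader: labeled COVID tweets
non_pandemic_tweets = load_other_tweets()   # hypothetical loader: 2018-2020 non-pandemic tweets

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(covid_tweets):
    held_out = [covid_tweets[i] for i in test_idx]        # same held-out set for both models
    covid_train = [covid_tweets[i] for i in train_idx]    # COVID-only training data
    full_train = covid_train + non_pandemic_tweets        # Full training data

    evaluate(train_model(covid_train), held_out)          # hypothetical helpers
    evaluate(train_model(full_train), held_out)
```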

Metrics used to compare performance of the two models trained on the two datasets.

As shown in the table above, we use 4 different metrics to compare the performance of models trained on the two different datasets, COVID-only and Full. Under our hypothesis, we would expect average precision, average recall, and consequently average F1 score to be significantly higher for a system trained on the Full dataset. We do not expect average accuracy to be a good metric for comparing the two models because of the heavy imbalance in the labeled dataset, which causes some labels to be predicted far more frequently than others.
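
Assuming the per-fold metrics are the standard multi-label scores, one way to compute them with scikit-learn is sketched below; the macro averaging choice is an assumption, not something stated in the paper.

```python
# A sketch of the four comparison metrics for one fold, assuming y_true and
# y_pred are multi-hot arrays of shape (n_tweets, 25). Macro averaging is an
# assumption; accuracy_score here is the exact-match (subset) accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])  # toy example with 3 labels instead of 25
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(accuracy, precision, recall, f1)
```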

Model Description

For this experiment, we use only the tweet text for training our models. We start by removing the numbers and punctuation from the tweet text. We then replace URLs, mentions and hashtags with their “URLHERE”, “MENTIONHERE” and “HASHTAGHERE” tags, respectively.

Examples for tag replacement in tweet text
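
A minimal sketch of this cleaning step using regular expressions is shown below. The exact patterns are assumptions; note that the sketch applies the tag replacement before stripping punctuation so that the URL, mention, and hashtag markers are still intact when matched.

```python
# A minimal sketch of the tweet cleaning step; the regex patterns are assumptions.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "URLHERE", text)   # replace URLs
    text = re.sub(r"@\w+", "MENTIONHERE", text)       # replace mentions
    text = re.sub(r"#\w+", "HASHTAGHERE", text)       # replace hashtags
    text = re.sub(r"[0-9]", "", text)                 # remove numbers
    text = re.sub(r"[^\w\s]", "", text)               # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Need 20 volunteers at @RedCross shelter! Info: https://example.org #flood"))
# -> "Need volunteers at MENTIONHERE shelter Info URLHERE HASHTAGHERE"
```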

For this experiment we use a Neural Language Model (NLM) based pipeline. This pipeline is one of the systems we submitted in previous TREC-IS editions. The pipeline uses a simple synonym-based augmentation method to counter the imbalance in the training data, and is thus referred to as the Synonym-Augmentation Pipeline.

Synonym-Augmentation Pipeline

As discussed above, there is heavy imbalance in the labeled dataset. To counter this imbalance, we propose a simple synonym-replacement method to augment the tweet text. The augmentation is based on the idea of replacing specific words in a tweet with their synonyms, so as to “generate” new training tweets that do not deviate from the original meaning but still differ in wording from the original tweet. Since the tweets are related to disasters and calamities, we decided to replace the verbs with their synonyms, as those are the words that carry the most weight in such sentences.

As discussed above, tweets labeled as “Critical” and “High” priority are far less frequent than tweets labeled as “Medium” and “Low” priority. This pipeline aims to counter this by augmenting the “Critical” priority tweets with newly “generated” tweets.

The synonym replacement method can be divided into two major sections:

  1. Parts of Speech (POS) tagging
  2. Synonym replacement

Parts of Speech (POS) Tagging

POS tagging using TextBlob python library

We use the TextBlob Python library to extract POS tags for each word in the cleaned, pre-processed “Critical” priority tweet text. We then collect the verbs from the tweet text using these tags; these are the words that will be replaced with their synonyms to “generate” new tweet text.
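
Below is a minimal sketch of this POS-tagging step with TextBlob; the example tweet is made up.

```python
# A minimal sketch of POS tagging with TextBlob; the example tweet is hypothetical.
# TextBlob needs its corpora downloaded once: python -m textblob.download_corpora
from textblob import TextBlob

tweet = "volunteers deliver supplies to families stranded by the flood"

# TextBlob returns (word, Penn Treebank tag) pairs; verb tags start with "VB".
verbs = [word for word, tag in TextBlob(tweet).tags if tag.startswith("VB")]
print(verbs)  # e.g. ['deliver', 'stranded']
```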

Synonym Replacement

Using WordNet corpus to get synonyms for verbs

We use the WordNet corpus to get synonyms for the list of verbs found using TextBlob. We replace the original verbs in the tweet text with their corresponding synonyms one at a time, meaning that every “generated” tweet has exactly one word replaced relative to the original tweet.

Synonym replacement method results, with the first tweet text being the original

This image shows an example of the results for one tweet (the first tweet in the image) after “generating” new tweet text using the synonym-replacement method. As evident from the image, only one word is replaced in the original tweet text. A point to note: after some preliminary tests, we decided to replace a single verb with its top four synonyms, since going beyond that sometimes changed the meaning of the text. This approach gave us more than four thousand new tweets to be used in training with “Critical” priority labels.
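
A sketch of the replacement step with NLTK's WordNet interface is shown below; the four-synonym cap follows the description above, while the helper name and de-duplication details are assumptions.

```python
# A sketch of the synonym-replacement step using NLTK's WordNet corpus.
# The cap of four synonyms per verb follows the article; the helper name and
# de-duplication details are assumptions.
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def generate_variants(tweet, verbs, max_synonyms=4):
    """Yield new tweets, each with exactly one verb replaced by a synonym."""
    for verb in verbs:
        synonyms = []
        for synset in wordnet.synsets(verb, pos=wordnet.VERB):
            for lemma in synset.lemmas():
                name = lemma.name().replace("_", " ")
                if name.lower() != verb.lower() and name not in synonyms:
                    synonyms.append(name)
        for synonym in synonyms[:max_synonyms]:
            yield tweet.replace(verb, synonym, 1)  # replace a single occurrence

tweet = "volunteers deliver supplies to families stranded by the flood"
for variant in generate_variants(tweet, ["deliver"]):
    print(variant)
```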

Neural Language Model (NLM)

The new training data, including the original tweet texts as well as the “generated” ones, is then passed through a Neural Language Model (NLM), which predicts the information labels it considers relevant to the provided tweet text. For example, in the above image, the model evaluates the information labels “Advice” and “InformationWanted” to be related to the provided tweet text.

Transformer models and corresponding model pre-trained weights

The NLM models used in our experiments are transformer models. This table lists the 9 different transformer models used to test our hypothesis and their corresponding pre-trained weights. We load these pre-trained weights and fine-tune each model for 2 epochs on each fold, using the training data obtained from the synonym-replacement method.
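
As a rough illustration, the sketch below shows how one such transformer could be fine-tuned for multi-label information-type classification with the Hugging Face transformers library. The checkpoint name, label indices, and optimizer settings are assumptions; only the 25 labels and the 2 epochs follow the text above.

```python
# A sketch of fine-tuning one transformer for multi-label classification with
# Hugging Face transformers. The checkpoint, label indices, and optimizer
# settings are assumptions; the 25 labels and 2 epochs follow the article.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 25
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss internally
)

# Toy training data: cleaned tweet texts and multi-hot label vectors.
texts = ["send volunteers to the shelter", "river levels rising near downtown"]
labels = torch.zeros(len(texts), NUM_LABELS)
labels[0, 3] = 1.0  # hypothetical label indices
labels[1, 7] = 1.0

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):  # 2 epochs per fold, as described above
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        out.loss.backward()
        optimizer.step()
```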

Synonym-Augmentation Pipeline

Thus we use the new training data obtained through the synonym-replacement method to train the NLM models. The above image outlines the complete Synonym-Augmentation Pipeline.

Results and Conclusion

Using the Synonym-Augmentation Pipeline, we compare the performance of the 9 different transformer (NLM) models on two different datasets: the COVID-only dataset and the Full dataset, which contains COVID as well as general non-pandemic data.

Comparison of average accuracy across two datasets

This table compares the average accuracy over the 10-fold cross-validation for the 9 models trained on the COVID-only and Full datasets. The mean and median accuracy of models trained on the COVID-only dataset are higher than those of models trained on the Full dataset. This was expected due to the imbalance in the labeled dataset, and it suggests that even the synonym-augmentation method was not able to counter the imbalance completely.

Comparison of average precision across two datasets

This table compares the average precision over the 10-fold cross-validation for the 9 models trained on the COVID-only and Full datasets. As evident from the table, models trained on the Full dataset show an improvement over models trained on the COVID-only dataset. This improvement is significant, as evident from the T-statistic and P-value, which were calculated by comparing the average scores of the 9 different NLM models on the two datasets.
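
The summary here does not spell out the exact test; one natural reading is a paired t-test over the 9 per-model averages, sketched below with placeholder numbers.

```python
# A sketch of comparing the 9 per-model average scores from the two datasets.
# The choice of a paired t-test and the numbers below are assumptions for
# illustration only, not results from the paper.
from scipy import stats

covid_only_scores = [0.41, 0.38, 0.44, 0.40, 0.39, 0.42, 0.37, 0.43, 0.40]  # placeholder
full_scores       = [0.45, 0.42, 0.47, 0.44, 0.43, 0.46, 0.41, 0.47, 0.44]  # placeholder

t_stat, p_value = stats.ttest_rel(full_scores, covid_only_scores)
print(f"T-statistic: {t_stat:.3f}, P-value: {p_value:.4f}")
```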

Comparison of average recall across two datasets

This table compares the average recall over the 10-fold cross-validation for the 9 models trained on the COVID-only and Full datasets. As evident from the table, models trained on the Full dataset show an improvement over models trained on the COVID-only dataset. This improvement is significant, as evident from the T-statistic and P-value, calculated in the same way as for average precision.

Comparison of average F1 across two datasets

This table compares the average F1 score over the 10-fold cross-validation for the 9 models trained on the COVID-only and Full datasets. As evident from the table, models trained on the Full dataset show an improvement over models trained on the COVID-only dataset. This improvement is significant, as evident from the T-statistic and P-value, again calculated by comparing the average scores of the 9 NLM models on the two datasets.

Thus, we see significant improvements in average precision, average recall, and average F1 score when training on the Full dataset compared to training on the COVID-only dataset. These results give evidence that information shared during non-pandemic events and pandemic events has some underlying similarity, supporting our hypothesis.

Future Work

The work presented in this article and in our paper is a work in progress. We aim to build on the work described above by exploring the following areas to get a better idea of the importance of information shared during different crisis events.

Analysis of Individual Information Labels

We aim to study the improvements shown by models trained on the Full dataset through a more focused analysis of the 25 individual information type labels, looking at the improvements, if any, for each label. Through this we can find out which information labels show major improvement and, by comparing the label distributions in the two datasets, understand the reason behind it.

Performance Analysis of Different Events

We aim to analyze whether the information from natural disasters like wildfires or floods should be given the same importance as the information from man-made disasters like bombings or shootings when predicting on a natural disaster.

Performance Analysis on Non-Pandemic Data

Our cross-validation test gives strong evidence that including general non-pandemic events improves classification of pandemic event data. But this raises the question of whether the reverse is also true: does including pandemic event data improve our model’s ability to classify general non-pandemic event data? Based on the evidence from our cross-validation test, we expect the hypothesis to hold in the reverse direction as well, but we aim to verify this by conducting a cross-validation experiment similar to the one described above.

Analysis of Priority for Pandemic Type Events

The nature of pandemic-type events raises a question about the priority scoring of tweets. In general crises like wildfires and bombings, we have a clear idea of which tweets can be classified as high priority, such as those asking for aid or volunteers. But since pandemics are long-lasting events that affect people at a much slower rate, it is hard to say which types of tweets are of high importance, by which we mean tweets or posts that should be seen by emergency responders. It is correspondingly hard to define a set of tweet types that needs to be seen by emergency responders during a pandemic-type event. We aim to study the information types that can be classified as high priority during such an event.


Shivam Sharma

Data Scientist working in the field of NLP, NLG and NLU