We present two works by our group on building generalizable cross-disease, cross-lingual frameworks for detecting, predicting, and providing information about epidemics using social media. The crux of our framework is building robust Event Extraction (EE) models for the social media and epidemiological domains. Both works are summarized below.
Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by developing a framework to extract and analyze epidemic-related events from social media posts. To this end, we curate an epidemic event ontology comprising seven disease-agnostic event types and construct a Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic. Experimentation reveals how ED models trained on COVID-based SPEED can effectively detect epidemic events for three unseen epidemics of Monkeypox, Zika, and Dengue; while models trained on existing ED datasets fail miserably. Furthermore, we show that reporting sharp increases in the extracted events by our framework can provide warnings 4-9 weeks earlier than the WHO epidemic declaration for Monkeypox. This utility of our framework lays the foundations for better preparedness against emerging epidemics.
Event Detection simply involves identifying semantic events in natural language text. Here's an example of detecting various epidemic-related events like Symptom, Infect, and Death.
Here are all the epidemic-related events that are prevalently discussed on social media, along with some examples. The core principle during our dataset construction is keeping the ontology pandemic-related yet disease-independent. Each event type is carefully designed so that it generalizes to any potential epidemic, and during annotation the chosen trigger words are kept as general as possible so that they are not defined exclusively in the COVID context.
We collect multi-disease data for SPEED and provide the statistics of the SPEED dataset on the left below. We benchmark existing epidemiological works against EE models trained on our SPEED data, as shown in the figure on the right below.
SPEED models perform much better in the zero-shot disease transfer scenario compared to other baselines. More importantly, the performance of our zero-shot models is on par with models trained on limited target epidemic data, highlighting the strong utility of our model.
To evaluate the practical validity of our framework, we aggregate the epidemic-related events predicted by our SPEED framework over time. Any sharp increases in the events are reported as epidemic warnings. We conduct this study for the Monkeypox epidemic of 2022 (based on models trained on COVID-19 data from 2020) and show the warnings with the number of cases in the figure below.
Our framework can provide warnings 4-9 weeks before the WHO warning declaring Monkeypox as a global health concern - highlighting the practical utility of our work.
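The aggregation-and-warning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the detection rule used in the paper: the rule itself (flag a week whose event count is at least a multiple of the trailing average), the parameters `window`, `ratio`, and `min_count`, and the sample dates are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_counts(event_dates):
    """Aggregate event dates into counts keyed by the Monday of each week.

    Weeks with no events are simply absent from the result.
    """
    counts = Counter()
    for d in event_dates:
        week_start = d - timedelta(days=d.weekday())
        counts[week_start] += 1
    return dict(sorted(counts.items()))


def sharp_increases(counts, window=3, ratio=2.0, min_count=5):
    """Return the weeks whose count is at least `ratio` times the trailing mean.

    `window`, `ratio`, and `min_count` are hypothetical parameters chosen
    for illustration; the source does not specify the detection rule.
    """
    weeks = list(counts)
    warnings = []
    for i in range(window, len(weeks)):
        history = [counts[w] for w in weeks[i - window:i]]
        baseline = sum(history) / window
        current = counts[weeks[i]]
        if current >= min_count and current >= ratio * max(baseline, 1.0):
            warnings.append(weeks[i])
    return warnings


# Hypothetical dates of extracted events: a quiet month followed by a surge.
start = date(2022, 4, 4)  # a Monday
events_per_week = [2, 3, 2, 3, 12, 14]
event_dates = [
    start + timedelta(weeks=week, days=i % 7)
    for week, n in enumerate(events_per_week)
    for i in range(n)
]

counts = weekly_counts(event_dates)
warning_weeks = sharp_increases(counts)
for week in warning_weeks:
    print("warning for week starting", week.isoformat())
```

A ratio against a trailing mean is one of the simplest ways to define a "sharp increase"; other change-point or anomaly detection methods could be substituted without changing the aggregation step.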
Here are exemplar events extracted by our framework from actual tweets.
Another way to use our framework is to generate event-based disease profiles (based on the proportion of different events extracted) from public sentiment, which can provide a high-level overview of what people are discussing or concerned about regarding the epidemic. We provide the disease profiles of various diseases developed through our framework below.
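A disease profile of this kind is essentially a normalized histogram over event types. The sketch below shows one way to compute it, assuming the extracted events are available as a flat list of event-type labels; the function name `disease_profile` and the sample predictions are hypothetical and do not come from the released code.

```python
from collections import Counter


def disease_profile(event_types):
    """Return the proportion of each event type among the extracted events."""
    counts = Counter(event_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the profile from most to least frequent event type.
    return {etype: n / total for etype, n in counts.most_common()}


# Hypothetical event types predicted for posts about one disease.
extracted = ["Infect"] * 5 + ["Symptom"] * 3 + ["Death"] * 2
profile = disease_profile(extracted)

for etype, share in profile.items():
    print(f"{etype}: {share:.0%}")
```

Because each profile sums to one, profiles of different diseases can be compared directly even when the number of posts per disease differs widely.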
Social media is often the first place where communities discuss the latest societal trends. Prior works have utilized this platform to extract epidemic-related information (e.g. infections, preventive measures) to provide early warnings for epidemic prediction. However, these works only focused on English posts, while epidemics can occur anywhere in the world, and early discussions are often in the local, non-English languages. In this work, we introduce the first multilingual Event Extraction (EE) framework SPEED++ for extracting epidemic event information for a wide range of diseases and languages. To this end, we extend a previous epidemic ontology with 20 argument roles; and curate our multilingual EE dataset SPEED++ comprising 5.1K tweets in four languages for four diseases. Annotating data in every language is infeasible; thus we develop zero-shot cross-lingual cross-disease models (i.e., training only on English COVID data) utilizing multilingual pre-training and show their efficacy in extracting epidemic-related events for 65 diverse languages across different diseases. Experiments demonstrate that our framework can provide epidemic warnings for COVID-19 in its earliest stages in Dec 2019 (3 weeks before global discussions) from Chinese Weibo posts without any training in Chinese. Furthermore, we exploit our framework's argument extraction capabilities to aggregate community epidemic discussions like symptoms and cure measures, aiding misinformation detection and public attention monitoring. Overall, we lay a strong foundation for multilingual epidemic preparedness.
Event Extraction extends Event Detection (ED) by not only identifying the event triggers but also corresponding arguments (event-related information) from natural language text. Here's an example of detecting various epidemic-related events like Infect and Control.
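To make the difference between ED and EE concrete, the sketch below represents an EE output as event mentions that carry both a trigger and a list of role-labelled arguments. The class names, the example post, and the role names (`infected`, `disease`, `subject`, `duration`) are hypothetical illustrations; they are not taken from the SPEED++ ontology or data format.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    role: str  # what the span contributes to the event (hypothetical role names)
    text: str  # the span of the post that fills the role


@dataclass
class EventMention:
    event_type: str  # e.g. "Infect" or "Control"
    trigger: str     # the word or phrase that evokes the event
    arguments: list = field(default_factory=list)


# Hypothetical post with two events; ED alone would stop at type and trigger.
post = "My neighbor tested positive for COVID, so the school closed for a week."
events = [
    EventMention(
        "Infect",
        "tested positive",
        [Argument("infected", "My neighbor"), Argument("disease", "COVID")],
    ),
    EventMention(
        "Control",
        "closed",
        [Argument("subject", "the school"), Argument("duration", "a week")],
    ),
]


def spans_are_grounded(text, mentions):
    """Check that every trigger and argument is a literal span of the post."""
    return all(
        m.trigger in text and all(a.text in text for a in m.arguments)
        for m in mentions
    )


print(spans_are_grounded(post, events))
```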
Since our work focuses on multilinguality, we also provide an example in Hindi below.
We improve the existing SPEED ontology by supplementing each event with corresponding arguments. We provide this enriched ontology below.
We train models in a zero-shot cross-lingual cross-disease setup. To evaluate the models, we annotate some EE data in three other languages. Below, we provide the multilingual statistics of our SPEED++ dataset. Starting from the left, we have: (a) number of sentences per language, (b) average length of each sentence, (c) number of event mentions, and (d) number of supporting arguments.
To benchmark models in the zero-shot cross-lingual cross-disease setup, we consider the following data splits.
We train cross-lingual models using TagPrime and synthetic data generation using CLaP on our SPEED++ data. We benchmark our model against various works and show the performance below.
To assess the practical utility of our work for tracking global epidemic trends, we plot the number of events extracted per language against the number of infections in each country below, using posts all written on a single day (May 28, 2020). We show a strong correlation of 0.73 across 65 languages and 117 countries, highlighting the practicality of our work for global epidemic tracking.
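A correlation between extracted events and reported infections can be computed along the following lines. This is a hedged sketch: the source does not state which correlation coefficient was used or how per-language counts were mapped to countries, so the Pearson coefficient and the pre-aligned per-country counts below are assumptions, and all numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-country counts for a single day (placeholder country codes).
events_per_country = {"A": 120, "B": 45, "C": 300, "D": 10}
infections_per_country = {"A": 2000, "B": 900, "C": 5200, "D": 150}

# Only countries present in both tables can be compared.
countries = sorted(set(events_per_country) & set(infections_per_country))
r = pearson(
    [events_per_country[c] for c in countries],
    [infections_per_country[c] for c in countries],
)
print(f"correlation across {len(countries)} countries: {r:.2f}")
```

Since post volumes and infection counts typically span several orders of magnitude, a rank-based coefficient or a log transform may be a more robust choice in practice.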
We also show a geographical correlation for European countries as shown below. The blue circles indicate the number of extracted events using our framework.
To further demonstrate the strength of our framework's multilingual capabilities, we apply the SPEED++ framework to Chinese Weibo posts in a zero-shot way (no training on Chinese) to provide epidemic warnings for COVID-19, as shown below.
The epidemic warnings, indicated by the sharp increases in aggregated events, highlight the significance of our framework, which could provide warnings as early as Dec 30, three weeks before global infection tracking even began.
We further provide some qualitative posts and extracted events by our model below.
Finally, we develop an information aggregation system utilizing the argument extraction capability of our framework. Specifically, we aggregate and cluster extracted arguments across social media for each disease, argument, and language. We demonstrate some of the top relevant ones below.
Manual inspection shows the strong argument extraction capability of our framework. Such information aggregation can be utilized for better epidemic preparedness through monitoring shifts in public attention as well as misinformation detection.
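The grouping step of such an aggregation system can be sketched as below. Note the simplification: the source says arguments are aggregated and clustered, but does not describe the clustering method, so this sketch substitutes plain string normalization and counting for real clustering. The record layout, the role name `symptom`, and the sample arguments are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants are merged."""
    return " ".join(text.lower().split())


def top_arguments(records, k=2):
    """Return the k most frequent arguments per (disease, role, language).

    `records` is an iterable of (disease, role, language, argument_text).
    Counting normalized strings is a crude stand-in for clustering.
    """
    groups = defaultdict(Counter)
    for disease, role, language, text in records:
        groups[(disease, role, language)][normalize(text)] += 1
    return {key: counter.most_common(k) for key, counter in groups.items()}


# Hypothetical extracted arguments.
records = [
    ("COVID-19", "symptom", "en", "Fever"),
    ("COVID-19", "symptom", "en", "fever "),
    ("COVID-19", "symptom", "en", "dry  cough"),
    ("COVID-19", "symptom", "es", "fiebre"),
]
top = top_arguments(records)
for key, ranked in top.items():
    print(key, ranked)
```

Exact-string counting misses paraphrases and spelling variants, which is where an embedding-based or edit-distance clustering step would replace `normalize`.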
We further provide some qualitative posts for multilingual arguments extracted by our framework below.
If you find our work inspirational or useful for your research, you can cite our works as below.
@misc{parekh2024eventdetectionsocialmedia,
title={Event Detection from Social Media for Epidemic Prediction},
author={Tanmay Parekh and Anh Mac and Jiarui Yu and Yuxuan Dong and Syed Shahriar and Bonnie Liu and Eric Yang and Kuan-Hao Huang and Wei Wang and Nanyun Peng and Kai-Wei Chang},
year={2024},
eprint={2404.01679},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2404.01679},
}
@misc{parekh2024speedmultilingualeventextraction,
title={SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness},
author={Tanmay Parekh and Jeffrey Kwan and Jiarui Yu and Sparsh Johri and Hyosang Ahn and Sreya Muppalla and Kai-Wei Chang and Wei Wang and Nanyun Peng},
year={2024},
eprint={2410.18393},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.18393},
}