Salemi, H., Senarath, Y., & Purohit, H. (2023). A Comparative Study of Pre-trained Language Models to Filter Informative Code-mixed Data on Social Media during Disasters. In Jaziar Radianti, Ioannis Dokas, Nicolas Lalone, & Deepak Khazanchi (Eds.), Proceedings of the 20th International ISCRAM Conference (pp. 920–932). Omaha, USA: University of Nebraska at Omaha.
Abstract: Social media can inform response agencies during disasters to help affected people. However, filtering informative messages from social media content is challenging due to ungrammatical text, out-of-vocabulary words, and similar issues that limit the contextual interpretation of messages. Further, there has been limited exploration of the challenge of code-mixing (using words from another language within a text written in one language) in user-generated content during disasters. Hence, we propose a new code-mixed dataset of tweets related to the 2017 Iran-Iraq Earthquake, annotated based on their informativeness characteristics. Additionally, we have evaluated the performance of state-of-the-art pre-trained language models, mBERT, RoBERTa, and XLM-R, on the proposed dataset. The results show that mBERT (with an F1 score of 72%) outperforms the other models in classifying informative code-mixed messages. Moreover, we analyzed patterns in how users employ code-mixing, which can help future work in developing these models.