Salemi, H., Senarath, Y., & Purohit, H. (2023). A Comparative Study of Pre-trained Language Models to Filter Informative Code-mixed Data on Social Media during Disasters. In Jaziar Radianti, Ioannis Dokas, Nicolas Lalone, & Deepak Khazanchi (Eds.), Proceedings of the 20th International ISCRAM Conference (pp. 920–932). Omaha, USA: University of Nebraska at Omaha.
Abstract: Social media can inform response agencies during disasters to help affected people. However, filtering informative messages from social media content is challenging due to ungrammatical text, out-of-vocabulary words, and similar issues that limit the contextual interpretation of messages. Further, there has been limited exploration of the challenge of code-mixing (using words from another language within a text written in one language) in user-generated content during disasters. Hence, we propose a new code-mixed dataset of tweets related to the 2017 Iran-Iraq Earthquake, annotated based on their informativeness characteristics. Additionally, we have evaluated the performance of state-of-the-art pre-trained language models, mBERT, RoBERTa, and XLM-R, on the proposed dataset. The results show that mBERT (with an F1 score of 72%) outperforms the other models in classifying informative code-mixed messages. Moreover, we analyzed patterns in how users employ code-mixing, which can help future work in developing these models.