Volume 19, No. 4, 2022
Adapting Machine Learning And Deep Learning Approach Towards Language Identification And Sentiment Analysis Of Code-Mixed Urdu-English And Hindi-English Social Media Text
Gazi Imtiyaz Ahmad, Dr. Sruchi Talwani, Dr. Jimmy Singla
Abstract
Large amounts of textual data are produced on social networking sites in the form of posts, comments, reviews, and other user-generated content. This data can provide insights into public opinions, sentiments, and trends, and can be analyzed using Natural Language Processing techniques. Much of it is influenced by regional-language text written in Romanized (Latin script) form; the Latin alphabet is used because it allows text to be typed easily and processed by computers. Moreover, in multilingual societies people often express their opinions and sentiments on social networking platforms and microblogging sites in a code-mixed fashion, that is, by mixing two or more languages or language varieties within a single stretch of discourse. Code-mixing can be due to a variety of factors, such as the multilingual nature of many online communities, the desire to reach a wider audience, or the influence of language trends and memes. However, such informal, code-mixed textual data is under-resourced in terms of labelled datasets and language models, which makes it difficult to apply Natural Language Processing algorithms to it. In this work we present machine learning and deep learning approaches for word-level Language Identification and Sentiment classification of Urdu-English and Hindi-English “code-mixed” text. For the deep learning models we use character-level and word-level features for embeddings, and for the machine learning models we use a feature-based approach. The paper also describes the development of a “code-mixed” Urdu-English dataset from social media, annotated for Sentiment classification and word-level Language Identification.
Pages: 728-741
Keywords: Code-mixed, Sentiment Analysis, LSTM, Natural Language Processing, Machine Learning, Deep Learning.