Predicting Political Party Affiliation from Indian Parliament Questions Using Text Classification with BERT
1. Introduction
Is it possible to predict which political party a Member of Parliament belongs to from the questions they ask, using text classification techniques? We used the TCPD-IPD dataset published by the Trivedi Centre for Political Data at Ashoka University, India, to conduct our analysis. This dataset is a repository of all the questions asked in the lower house (Lok Sabha) of the Indian Parliament by the elected members of the different constituencies to the ruling government of the time, from 1999 to 2019. These questions reflect the concerns of the people, since they elect the Members of Parliament (MPs) to represent their interests in the Indian Parliament.
Subsequently, we conducted a detailed literature review on this topic. The review brought out several insights from the existing body of work in this area and helped refine our goals, ensuring that we carry out original research with this dataset. We aim to use text classification techniques covered in the course to determine whether it is possible to predict which party's members ask which questions. We hypothesize that members of ideologically different political parties are likely to ask different kinds of questions.
2. Related Work
To assess the potential contribution to new knowledge, we position our research within the existing scientific literature on political polarization and online communities. Our research aims to contribute new insights into the political landscape with a focus on determining whether it is possible to predict which questions are asked by members of which political party.
Blei and Lafferty introduced a framework for modeling the evolution of topics in text over time using Bayesian inference (Blei and Lafferty, 2006). This offers a potential component for our approach: analyzing how question topics shift over time can reveal differences between parties that can then be fed into classification models.
Phand and Phand conducted sentiment analysis on Twitter data and showed that the extracted sentiment is useful to parties such as businesses, politicians, and brands (Phand and Phand, 2017). Their work centered on customer sentiment, but the same approach can be applied to political sentiment in our data and used in classification models.
Sen et al. investigated how key stakeholders in Indian democracy, including parliament, engage in the discourse on economic policies (Sen et al., 2019). Their findings uncovered tendencies in which topics parliament prioritizes relative to other groups. We use a similar methodology to analyze the different parties in our Indian political data, in the hope of finding party-specific tendencies.
Mikolov et al. introduced extensions to the Skip-gram model that enhance the speed and quality of the learned vectors by accounting for word order and idiomatic phrases (Mikolov et al., 2013). This means that nuanced trends in text can be captured; applied to political data, such representations may pick up subtle differences between parties that help our classification models.
Greene and Cross analyzed European political data using dynamic topic modeling (Greene and Cross, 2017). Their method captures nuances in the text and tracks how topics change over time. Our research is similar but applied to Indian political data, and we can use dynamic topic models to surface trends and subtle patterns in our data.
Garrido-Merchan et al. examined the rise of BERT models and compared them to traditional models (Garrido-Merchan et al., 2023). They endorse BERT as the strongest model and analyze why. When picking classification models, BERT therefore belongs near the top of our list, alongside the traditional models we also explore.
Gao et al. explored stacking complex neural networks on top of BERT and found that these additions do not improve BERT's performance on sentiment analysis (Gao et al., 2019). We will implement BERT models for our classifier, but we also want to test the traditional methods mentioned in this paper.
Adhikari et al. applied BERT to document classification tasks with high effectiveness, but found it computationally heavy; by distilling into bidirectional LSTMs they obtained similar results at much lower cost (Adhikari et al., 2020). If BERT models are too heavy to run on the hardware we have access to, we can follow their approach and use LSTMs to get similar results for less computational power.
Yu et al. addressed the problem that BERT models lack domain-specific knowledge. Their improved model, which incorporates domain knowledge, outperformed vanilla BERT (Yu et al., 2019). Since our dataset is highly specific to politics, we believe domain-specific knowledge could substantially help our model's accuracy and speed.
Huang et al. explored adding ensemble methods to BERT (Huang et al., 2020). They applied boosting techniques to the model and then fine-tuned it. Because their model is a multiclass classifier, we also have the option of doing multiclass classification instead of only binary classification.
Rohit and Singh performed analysis on Indian parliament data; their paper classifies whether a speaker supports or opposes a motion, as well as the intention behind a quote (Rohit and Singh, 2018). We will build on this work by adapting their ideas toward a classification model that determines which party a speaker belongs to.
Bhogale investigated parliamentary questions, specifically questions regarding women and Indian Muslims. She argues that questions in parliament are extremely informative about current issues and help us better understand the political landscape, in her case with respect to women and Indian Muslims (Bhogale, 2020). We build on this by investigating the question data in our dataset more thoroughly, in the hope of finding meaningful features.
3. Data
3.1 Initial Dataset
The dataset was downloaded from the repository website (Bhogale, 2019; Trivedi Centre for Political Data, Ashoka University; https://qh.lokdhaba.ashoka.edu.in/browse-data) and unpacked to a .tsv file of about 608 MB, which was used for exploratory data analysis and feature engineering. This careful exploration and feature engineering is crucial for generating better results with simpler models.
Shape of the initial dataset: (298293, 15), i.e., 298,293 questions described by 15 features.
The 15 initial features are explained below (Bhogale, 2019; Trivedi Centre for Political Data, Ashoka University):
- 'id': unique ID for a question asked in the Lok Sabha
- 'date': date on which the question was raised in the Lok Sabha
- 'ls_number': number of the Lok Sabha in which the question was asked, from 13 to 16
- 'ministry': the ministry in the government that must prepare the reply to the question
- 'question_type': whether the question is starred or unstarred
- 'question_text': the complete text of the question asked
- 'answer_text': the complete text of the government's answer to the question
- 'member': names of the people asking the question; multiple people can ask the same question
- 'party': political affiliation of the people asking the question
- 'state': state in which the MP's constituency is located
- 'constituency': name of the Lok Sabha constituency the MP represents
- 'constituency_type': category of the constituency (GEN, SC, or ST)
- 'gender': gender of the MP
- 'subject': subject of the question
- 'link': URL of the page where the details of the question are stored
4. Method
4.1 Exploration
We used `df.nunique()` to identify the number of unique values in each field. This provides a quick assessment of which fields hold categorical data and which hold non-categorical values, which is crucial for the analysis that follows.
4.2 Data Cleaning
Using `df.isna().sum()`, the number of null values in each field was determined: question_text (42), answer_text (120), and subject (2). Since these counts were very small compared to the dataset size, all affected rows were dropped using `df = df.dropna()`.
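A minimal sketch of these exploration and cleaning steps in pandas; the file name is an assumption:

```python
import pandas as pd

# load the TCPD-IPD export (tab-separated)
df = pd.read_csv("TCPD_IPD.tsv", sep="\t")

print(df.nunique())     # unique values per field; low counts suggest categorical data
print(df.isna().sum())  # null counts per field

df = df.dropna()        # drop the few rows with missing values
```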
4.3 Cleaning and Feature Engineering of "party" field
Some parliament questions are asked by more than one person, but in the raw data these askers are clubbed together in a single row. It is essential to understand how many such questions there are, so we will create a new feature storing the number of people asking the same question. In such rows, the names under the "member" field are separated by ",", and the corresponding party names are separated the same way. We will also convert the comma-separated string of member names into a list of names, and repeat the same process for the following fields: "party", "state", "constituency", "constituency_type", and "gender".
We get the following new fields: "qn_askers_num" (number of people asking the question), "member_list", "party_list", "state_list", "constituency_type_list", "constituency_list", and "gender_list". While exploring the "party_list" field, we observed that multiple people from the same party may ask the same question, so applying len() to the list of parties can overcount the number of parties due to duplication. To avoid this, we create a new field, "party_set", that converts the list of party names into a set of unique party names, as sketched below.
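A minimal sketch of these steps in pandas, assuming the column names listed above:

```python
# split each comma-separated string into a list of values
for col in ["member", "party", "state", "constituency",
            "constituency_type", "gender"]:
    df[col + "_list"] = df[col].str.split(",")

# number of people asking the question
df["qn_askers_num"] = df["member_list"].str.len()

# deduplicate: several askers may belong to the same party
df["party_set"] = df["party_list"].apply(
    lambda parties: {p.strip() for p in parties})
```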
4.4 Creating duplicate rows for questions with multiple members of parliament
Since the goal of this project is to classify which party asks particular types of questions, we need to make a separate data point for every MP asking the same question, even when several of them belong to the same party.
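A minimal sketch of this duplication step, assuming the list columns created above (pandas 1.3+ can explode several aligned columns at once):

```python
# one row per asking MP; the list columns are element-wise aligned
list_cols = ["member_list", "party_list", "state_list", "constituency_list",
             "constituency_type_list", "gender_list"]
df = df.explode(list_cols, ignore_index=True)
```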
4.5 Cleaning and feature engineering of the "question_text" field
We noted that the structure of the "question_text" field was not uniform. Some questions start with a phrase like "Will the Minister of HOUSING AND URBAN AFFAIRS be pleased to state:"; such questions need to be identified and trimmed to only the relevant portion to ensure uniformity. We used regex functions to clean this field.
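A minimal sketch of this cleaning step; the exact pattern is an assumption generalized from the example phrase above:

```python
import re

# strip the ministerial preamble from the start of each question
preamble = r"^Will the Minister of .+? be pleased to state:?\s*"
df["question_text"] = df["question_text"].str.replace(
    preamble, "", regex=True, flags=re.IGNORECASE)
```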
Since questions typically contain sub-questions, we will create a new field, "subquestion_num", which stores the number of sub-questions present in each main question.
Questions whose sub-questions do not start with markers like (a) or (b) would be counted as 0. We identified edge cases where sub-questions were represented in other formats and converted them to the uniform format using regex. Questions that still contained no sub-question markers had "(a)" prepended to the start of the text so that they return 1 as the number of sub-questions.
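A minimal sketch of the counting logic (the edge-case format conversions are omitted):

```python
# prepend "(a)" to questions with no sub-question markers, then count markers
marker = r"\([a-z]\)"
no_marker = ~df["question_text"].str.contains(marker, regex=True)
df.loc[no_marker, "question_text"] = "(a) " + df.loc[no_marker, "question_text"]
df["subquestion_num"] = df["question_text"].str.count(marker)
```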
4.6 Cleaning and feature engineering of the "answer_text" field
The "'answer_text" field contents were not uniform and many of them started with text describing the minister related to the particular Ministry. These were redundant information and not providing any value and had to be cleaned. We used regex for identifying the different types of such starting phrases for different answers and cleaned them to bring uniformity.
We also created the feature "ans_char_length" for the raw character length of the answer, computed by counting characters with len(), and the feature "ans_word_length" for the raw word length, computed with len() after splitting the text on single spaces.
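A minimal sketch of these two length features:

```python
# raw answer lengths: character count, and word count via single-space split
df["ans_char_length"] = df["answer_text"].str.len()
df["ans_word_length"] = df["answer_text"].str.split(" ").str.len()
```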
4.7 Shape of the final dataset after EDA and feature engineering
The final dataset had the following dimensions: (481194, 29), i.e., 481,194 rows and 29 features.
4.8 Data Binarization
We used the text of questions asked by members of the Indian parliament to predict their political party affiliation, specifically between the Bharatiya Janata Party (BJP) and the Indian National Congress (INC). This process involves several steps:
The script starts by converting the ‘party_text’ column into binary labels using `LabelBinarizer`. This is a preprocessing step to transform categorical party affiliation data into a format suitable for model training, where ‘BJP’ and ‘INC’ are transformed into binary labels.
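A minimal sketch of this step; 'party_text' is the column name used in our script:

```python
from sklearn.preprocessing import LabelBinarizer

# map 'BJP'/'INC' in 'party_text' to 0/1
lb = LabelBinarizer()
df["label"] = lb.fit_transform(df["party_text"]).ravel()
```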
4.9 Data Sampling & Splitting
There are 91,047 BJP and 62,318 INC questions. We randomly selected 62,318 questions for each party so that the dataset is balanced between the two classes. The dataset is then split into training and testing sets using `train_test_split` with an 80-20 ratio, ensuring that both sets are representative of the overall dataset.
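A minimal sketch of the balancing and splitting, assuming the binary 'label' column from above; the random seed is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# downsample BJP to the INC count so both classes have 62,318 questions
bjp = df[df["party_text"] == "BJP"].sample(n=62318, random_state=42)
inc = df[df["party_text"] == "INC"]
balanced = pd.concat([bjp, inc])

# stratified 80-20 split keeps the class balance in both sets
train_df, test_df = train_test_split(
    balanced, test_size=0.2, random_state=42, stratify=balanced["label"])
```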
5. Analysis
5.1 Baseline Model
5.1.1 Text Representation
For the text representation, the `TaggedDocument` class from the `gensim` library is used. Each document (a member's question) is tagged with its corresponding binary label. This is essential for training the Doc2Vec model, which will understand the context of words in relation to the document label.
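A minimal sketch of the tagging step, reusing the train/test split from above; the tokenization helper is an assumption:

```python
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

# one TaggedDocument per question, tagged with its binary party label
def tag_documents(frame):
    return [TaggedDocument(words=simple_preprocess(text), tags=[label])
            for text, label in zip(frame["question_text"], frame["label"])]

train_tagged = tag_documents(train_df)
test_tagged = tag_documents(test_df)
```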
5.1.2 Model Training
A Doc2Vec model (`model_dbow`) is instantiated and trained on the tagged documents. Doc2Vec is an unsupervised algorithm to generate vectors for sentences/paragraphs/documents, which captures the essence of the text data. The model is trained for multiple epochs with shuffling each time to ensure robust learning.
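A sketch of this training loop; the DBOW setting (dm=0) matches the model name, while the vector size, epoch count, and other hyperparameters are assumptions:

```python
import random
from gensim.models.doc2vec import Doc2Vec

# DBOW (distributed bag of words) variant of Doc2Vec
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=2, workers=4)
model_dbow.build_vocab(train_tagged)

for epoch in range(30):
    random.shuffle(train_tagged)  # reshuffle each pass for robust learning
    model_dbow.train(train_tagged, total_examples=len(train_tagged), epochs=1)
```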
5.1.3 Feature Extraction
Once the model is trained, document embeddings are created for each text sample. The `vec_for_learning` function generates feature vectors (regressors) for the logistic regression model and also provides the target labels from the tagged documents.
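The body of `vec_for_learning` is not shown here; a sketch of what such a helper typically looks like:

```python
def vec_for_learning(model, tagged_docs):
    """Infer one embedding per document; return (labels, feature vectors)."""
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words))
                                for doc in tagged_docs])
    return targets, regressors
```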
5.1.4 Logistic Regression
With the embeddings as features, a logistic regression model is fitted on the training data. Logistic regression is a statistical method for predicting binary classes, which in this case are the political parties.
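A minimal sketch of fitting and scoring the classifier on the Doc2Vec embeddings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
```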
5.2 BERT Method
5.2.1 Approach
After getting poor results from our traditional model, we decided to implement a BERT model for classification. We first ran this model over all of our data; it took an extremely long time and did not produce great results, only somewhat comparable to the baseline. We then considered how to improve this. Since each Lok Sabha has temporal differences, we decided to analyze each Lok Sabha separately, covering questions asked in the 13th to 16th Lok Sabhas, and took care to ensure that classes were equally represented. Due to limited computational power, we ran each model for only four epochs.
5.2.2 TextClassificationDataset Class
The TextClassificationDataset class, inheriting from PyTorch's Dataset, serves as a custom class for organizing and preprocessing text data for BERT model training and evaluation. In PyTorch, the Dataset class is fundamental for data management: it dictates how data items are loaded and formatted, ensuring compatibility with the model's requirements. Specifically, TextClassificationDataset is tailored to text data, performing tokenization and aligning the tokenized texts with their respective labels. This setup enables efficient and structured data handling, which is vital for training and evaluating NLP models.
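A minimal sketch of such a class, assuming a Hugging Face transformers tokenizer; the maximum sequence length is an assumption:

```python
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer

class TextClassificationDataset(Dataset):
    """Tokenizes question texts and pairs them with party labels."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True,
                             padding="max_length", max_length=self.max_length,
                             return_tensors="pt")
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "label": torch.tensor(self.labels[idx], dtype=torch.long)}

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_dataset = TextClassificationDataset(
    train_df["question_text"].tolist(), train_df["label"].tolist(), tokenizer)
```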
5.2.3 Training
In our model training and evaluation process, we define a BERTClassifier class, a subclass of PyTorch's nn.Module. This class combines a pre-trained BERT model with a dropout layer for regularization and a linear layer for classification, tailoring the architecture to our task. Training iterates over the data with forward and backward passes to update the model's weights. For evaluation, we switch to a validation set and calculate key metrics: accuracy, precision, recall, and F1 score, along with a confusion matrix.
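A sketch of this architecture and one training epoch, building on the dataset sketch above; the base model name, dropout rate, learning rate, and batch size are assumptions rather than values from our experiments:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import BertModel

class BERTClassifier(nn.Module):
    """Pre-trained BERT with a dropout layer and a linear classification head."""

    def __init__(self, model_name="bert-base-uncased", num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(self.dropout(out.pooler_output))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

model.train()
for batch in train_loader:  # one epoch: forward pass, loss, backward pass, update
    optimizer.zero_grad()
    logits = model(batch["input_ids"].to(device),
                   batch["attention_mask"].to(device))
    loss = criterion(logits, batch["label"].to(device))
    loss.backward()
    optimizer.step()
```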
6. Results
6.1 Baseline Model Results
These are the results from our baseline model. The accuracy and F1 score are very low, which pushed us to use a BERT model.
- Accuracy: 0.52
- F1 Score: 0.52
6.2 BERT Model Results
Implementing BERT greatly increases the accuracy, from around 52% at baseline to between 72% and 76% across the Lok Sabhas, with the F1 score improving similarly from around 0.52 to 0.72-0.76. Roughly three out of four questions are now classified correctly, but that is still not as high as we would like.
| Lok Sabha | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| 13th (BJP Govt) | 0.7641 | 0.7676 | 0.7641 | 0.7633 |
| 14th (INC Govt) | 0.7452 | 0.7460 | 0.7452 | 0.7450 |
| 15th (INC Govt) | 0.7197 | 0.7211 | 0.7197 | 0.7191 |
| 16th (BJP Govt) | 0.7370 | 0.7425 | 0.7370 | 0.7362 |
7. Conclusion
In conclusion, this study demonstrates the potential of text classification techniques in predicting political party affiliations in the Indian Parliament. Our research indicates that while traditional models provided a baseline, advanced models like BERT significantly improved accuracy. However, challenges such as computational limitations and the complexity of political language indicate room for further improvement. This project highlights the intersection of data science and political analysis, suggesting a path forward for more nuanced and effective tools in political discourse analysis.
7.1 Next Steps
In our analysis we faced computational limitations on Google Colab, which is why we ran each model for only four epochs. Classifying around 76% of questions correctly is still not as high as we would like; more computational power would let us train for more epochs and potentially improve the results.
While we were able to classify more than 70% of cases correctly, we should conduct further analysis to understand whether the model is truly identifying the party, or merely whether the MP belongs to the party in government or in opposition. If the analysis indicates that it classifies based on party, we can conclude that there is a significant difference in the way members of the BJP and the INC ask questions. If it indicates that the model only predicts whether the MP is from the governing party or the opposition, then we cannot conclude that there is a significant difference in how members of these two parties ask questions.
References
Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, and Jimmy Lin. 2020. Exploring the limits of simple learners in knowledge distillation for document classification with DocBERT. In Proceedings of the 5th Workshop on Representation Learning for NLP.
Saloni Bhogale. 2019a. TCPD-IPD: TCPD Indian Parliament codebook (question hour). Trivedi Centre for Political Data, Ashoka University.
Saloni Bhogale. 2019b. TCPD-IPD: TCPD Indian Parliament dataset (question hour) 1.0. Trivedi Centre for Political Data, Ashoka University.
Saloni Bhogale. 2020. What can Question Hour tell us about representation in the Indian Parliament?
David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06).
Zhengjie Gao, Ao Feng, Xinyu Song, and Xi Wu. 2019. Target-dependent sentiment classification with BERT. IEEE Access, 7:154290–154299.
Eduardo C. Garrido-Merchan, Roberto Gozalo-Brizuela, and Santiago Gonzalez-Carvajal. 2023. Comparing BERT against traditional machine learning models in text classification. Journal of Computational and Cognitive Engineering.
Derek Greene and James P. Cross. 2017. Exploring the political agenda of the European Parliament using a dynamic topic modeling approach. Political Analysis, 25(1):77–94.
Tongwen Huang, Qingyun She, and Junlin Zhang. 2020. BoostingBERT: Integrating multi-class boosting into BERT for NLP tasks.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality.
Shital Anil Phand and Jeevan Anil Phand. 2017. Twitter sentiment classification using Stanford NLP. In 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM).
Sakala Venkata Krishna Rohit and Navjyoti Singh. 2018. Analysis of speeches in Indian parliamentary debates.
Anirban Sen, Debanjan Ghatak, Kapil Kumar, Gurjeet Khanuja, Deepak Bansal, Mehak Gupta, Kumari Rekha, Saloni Bhogale, Priyamvada Trivedi, and Aaditeshwar Seth. 2019. Studying the discourse on economic policies in India using mass media, social media, and the parliamentary question hour data. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies.
Shanshan Yu, Jindian Su, and Da Luo. 2019. Improving BERT based text classification with auxiliary sentence and domain knowledge. IEEE Access, 7:176600–176612.