Reducing Bias in Large Language Models
Large language models (LLMs) are trained on massive corpora of internet text, documents, forums, and more. Much of this data is littered with inconsistencies, inaccuracies, and discriminatory bias. Unless these problems are accounted for, LLMs are prone to undesirable behavior, including hallucinations, toxicity, and biased stereotypes. Our mission is to eliminate the extremes of stereotyping bias on both ends of the spectrum. We want to reduce the impact of both positive and negative bias by training neutral models that don't perpetuate racial and professional stereotypes.
To combat the impact of bias on LLM results, we trained a bias detection model on the StereoSet dataset hosted on Hugging Face. Because of the class imbalance among the racial, gender, religious, and professional stereotypes in the dataset, we chose to focus specifically on racial and professional biases. We trained a BERT model that detects racial/professional bias with 94% out-of-sample accuracy.
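The data preparation described above can be sketched in plain Python. This is a minimal illustration and not the project's actual pipeline: the record layout (`bias_type`, `sentence`, `label`), the placeholder sentences, and all helper names are assumptions made for the example, and the BERT training step itself is omitted.

```python
import random
from collections import Counter

# Hypothetical, flattened StereoSet-style records. The field names and the
# placeholder sentences are assumptions for this sketch, not real dataset rows.
SAMPLE_RECORDS = [
    {"bias_type": "race", "sentence": "placeholder sentence A", "label": 1},
    {"bias_type": "race", "sentence": "placeholder sentence B", "label": 0},
    {"bias_type": "profession", "sentence": "placeholder sentence C", "label": 1},
    {"bias_type": "profession", "sentence": "placeholder sentence D", "label": 0},
    {"bias_type": "gender", "sentence": "placeholder sentence E", "label": 1},
    {"bias_type": "religion", "sentence": "placeholder sentence F", "label": 0},
]

# The two categories the project chose to focus on.
TARGET_TYPES = {"race", "profession"}


def count_by_bias_type(records):
    """Count records per bias category to make class imbalance visible."""
    return Counter(r["bias_type"] for r in records)


def filter_by_bias_type(records, keep=TARGET_TYPES):
    """Keep only records whose bias category is in `keep`."""
    return [r for r in records if r["bias_type"] in keep]


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the held-out labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

In a real run, the training split would be passed to a BERT sequence classifier and `accuracy` would be computed on the classifier's predictions for the held-out split, which is what "out-of-sample" refers to.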
After training the bias classification model, we prompted Llama, our LLM of choice, in ways that revealed biased outputs. Using our BERT model, we calculated a bias score for each of Llama's responses and fed those scores back into Llama as finetuning data. After finetuning, Llama displayed significantly less bias in the racial/professional categories that we targeted.
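The scoring loop above can be sketched as follows. The text does not specify exactly how the scores were fed back, so this example assumes one simple option: keep only responses whose bias score falls at or below a threshold and use them as finetuning examples. The function names, the `max_bias` threshold, and the toy generator and scorer are all hypothetical stand-ins for Llama and the BERT classifier.

```python
from typing import Callable, Dict, List


def build_finetuning_set(
    prompts: List[str],
    generate: Callable[[str], str],
    score_bias: Callable[[str], float],
    max_bias: float = 0.5,
) -> List[Dict[str, object]]:
    """Generate a response per prompt, score it, and keep low-bias examples.

    `generate` stands in for the LLM and `score_bias` for the bias classifier.
    Filtering by threshold is an assumption of this sketch, not a confirmed
    description of the project's method.
    """
    examples = []
    for prompt in prompts:
        response = generate(prompt)
        score = score_bias(response)
        if score <= max_bias:
            examples.append(
                {"prompt": prompt, "response": response, "bias_score": score}
            )
    return examples


def toy_generate(prompt: str) -> str:
    """Toy stand-in for an LLM call: echoes the prompt in a fixed template."""
    return f"response to: {prompt}"


def toy_score_bias(text: str) -> float:
    """Toy stand-in for a classifier: high score if a marker word is present."""
    return 0.9 if "FLAGGED" in text else 0.1
```

For example, `build_finetuning_set(["neutral question", "FLAGGED question"], toy_generate, toy_score_bias)` keeps only the first prompt's response. Other designs are possible, such as keeping every response and using the score as a weight or reward signal.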
There is more research to be done on this approach to finetuning LLMs. Because of the cost and time required, we did not train our Llama model from scratch, nor did we finetune it as extensively as possible. Those with more resources might see success with either approach. While we succeeded in reducing bias measurements in the finetuned Llama model, the new model appeared to fare worse at "question-answering" tasks. Future research might look for ways to finetune the model without any loss in performance.
Overall, we showed that BERT can be an excellent choice for bias classification tasks on natural language. We had limited success using a BERT model for reinforcement learning, and we believe that our approach could bear fruit with further time and research.