Drug AI: https://www.ddw-online.com/exscientia-accelerates-covid-19-drug-discovery-using-ai-12373-202107/
MIDS Capstone Project Summer 2023

DD.ai

Problem & Motivation*

Over 2 million people in the US are hospitalized for severe adverse drug effects, many of which are preventable. Drug-drug interactions (DDIs), which occur when multiple drugs are taken together, account for approximately 20% of these adverse drug effects. Pharmacists track patients’ prescribed medications to ensure no interactions occur, but they can only check drug combinations that have already been tested. The drug development market is large and is expected to grow to $133 billion by 2032. DDI testing is expensive and slow, typically taking 4-5 years. With a growing market and limited resources, it is not feasible to test all drug combinations.

We introduce our solution: DD.ai, a platform for drug-drug interaction prediction. The mission of this project is to bring value to drug development companies by optimizing their drug-drug interaction testing process. By predicting potential interactions between a new drug and existing drugs, we help companies inform their testing strategies, shorten approval time, and reduce the cost of testing.

Data Source & Data Science Approach

The model uses two features: a chemical structure in Simplified Molecular-Input Line-Entry System (SMILES) format and a text summary of the drug’s action pathway, sourced from DrugBank and the National Library of Medicine. These features are free-form text, so we use NLP techniques to preprocess the data.

For the SMILES structures, we use a Morgan fingerprint embedding, which converts the chemical structure into sparse binary vectors. For the action pathway feature, we first use Google’s Pegasus PubMed model to summarize the text and bring it down to a standard token length. We then pass the summary into Bio_ClinicalBERT, a BERT model pre-trained on biomedical and clinical text.
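
To make the idea of a sparse binary vector concrete, here is a minimal sketch of a hashed, fixed-length fingerprint. It is a simplified stand-in and not the Morgan algorithm itself: real Morgan fingerprints hash circular atom neighborhoods of the molecular graph (typically computed with a cheminformatics library), whereas this toy version hashes raw SMILES substrings. The function name `toy_fingerprint` and the parameters `n_bits` and `max_len` are hypothetical choices made for illustration.

```python
import hashlib


def toy_fingerprint(smiles: str, n_bits: int = 2048, max_len: int = 3) -> list[int]:
    """Fold hashed SMILES substrings into a fixed-length binary vector.

    Simplified stand-in: a real Morgan fingerprint hashes circular atom
    neighborhoods of the molecular graph, not raw character substrings.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # Use a stable hash so the same fragment always maps to the same bit.
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


# Aspirin as an example input; most of the 2048 positions stay at zero.
aspirin = toy_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
print(len(aspirin), sum(aspirin))
```

The output has the same form as the vectors described above: a fixed length, mostly zeros, with a small number of bits set by the fragments present in the molecule.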

The features are fed through a network with two branches, each with three hidden layers, followed by a final classification layer that produces the prediction. Because we predict a binary interaction between two drugs, each feature combines the values for drug 1 and drug 2.
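
The architecture described above can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not the project's implementation: it assumes one branch per feature type, that the two drugs' values are combined by concatenation, ReLU hidden activations, and a sigmoid output. All layer sizes and names (`predict_interaction`, `init_branch`, `FP_BITS`, `TEXT_DIM`, `HIDDEN`) are hypothetical, and the weights are random and untrained.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_branch(rng, n_in, hidden_sizes):
    """Stack one layer per entry in hidden_sizes."""
    layers, size = [], n_in
    for n_out in hidden_sizes:
        layers.append(init_layer(rng, size, n_out))
        size = n_out
    return layers


def run_branch(layers, inputs):
    for weights, biases in layers:
        inputs = dense(inputs, weights, biases, relu)
    return inputs


def predict_interaction(model, struct_1, struct_2, text_1, text_2):
    # Assumption: pair features are the concatenation of drug 1 and drug 2.
    struct_out = run_branch(model["structure"], struct_1 + struct_2)
    text_out = run_branch(model["text"], text_1 + text_2)
    weights, biases = model["head"]
    # Final classification layer over both branch outputs.
    return dense(struct_out + text_out, weights, biases, sigmoid)[0]


rng = random.Random(0)
FP_BITS, TEXT_DIM, HIDDEN = 16, 8, [12, 8, 4]  # toy sizes, not the project's
model = {
    "structure": init_branch(rng, 2 * FP_BITS, HIDDEN),  # three hidden layers
    "text": init_branch(rng, 2 * TEXT_DIM, HIDDEN),      # three hidden layers
    "head": init_layer(rng, 2 * HIDDEN[-1], 1),
}
drug_1_fp = [rng.randint(0, 1) for _ in range(FP_BITS)]
drug_2_fp = [rng.randint(0, 1) for _ in range(FP_BITS)]
drug_1_text = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
drug_2_text = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]

probability = predict_interaction(model, drug_1_fp, drug_2_fp, drug_1_text, drug_2_text)
print(probability)
```

Keeping the structure and text inputs in separate branches lets each branch learn a representation suited to its input type before the classification layer merges them.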

Evaluation

To evaluate the model, we focus on the F2 score and the area under the precision-recall curve (AUPRC). We use the F2 score because missing an interaction between drugs could harm patients, so we prioritize recall and penalize false negatives more heavily. We also consider the AUPRC to capture the trade-off between precision and recall during model evaluation. Our final model achieves an F2 score and an AUPRC of 0.83 on our test set.
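
For reference, the two metrics can be computed as in the sketch below. The F-beta score is the standard formula (1 + β²)·P·R / (β²·P + R), with β = 2 weighting recall more heavily than precision. For AUPRC, the sketch uses average precision, which is one common estimator; whether the project used this estimator or another is an assumption. The labels and scores are made-up toy data, not project results.

```python
def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels; beta > 1 favors recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy data: 4 interacting pairs, 2 non-interacting pairs.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1]              # precision 0.6, recall 0.75
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3]  # model confidence per pair

f1 = f_beta(y_true, y_pred, beta=1.0)
f2 = f_beta(y_true, y_pred, beta=2.0)
ap = average_precision(y_true, scores)
print(round(f1, 3), round(f2, 3), round(ap, 3))  # 0.667 0.714 0.95
```

In this toy case recall exceeds precision, so F2 is higher than F1; a model that missed more interactions would see its F2 fall faster than its F1.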

Key Learnings & Impact

Although SMILES strings are free-form text, they need to be processed differently from natural language, so standard NLP techniques do not apply to them directly. We first embed them into vectors before passing them into our neural network. Figuring out how to serve a large model with batch prediction for our users was also a great learning experience.

Acknowledgements

A huge thank you to Professors Joyce Shen and Cornelia Ilin for their invaluable guidance and feedback. We also want to thank Professor James Winegar for enabling our production pipeline, and Roop Raich, Tim Roth, Dr. Han Dang, PharmD, Mai Do, RPh, MBA, and Dr. Tuan La, PharmD, for helping us understand the space and giving us user feedback. Finally, we would like to thank our peers in Section 7: we’ve all come a long way, and we are very proud of the cool projects!

*Sources for the numbers are on the project website and slides.

Last updated: August 9, 2023