MIDS Capstone Project Fall 2024

ChemiMap

Team members

Problem & Motivation

"Reading through hundreds of scientific papers is so fun and efficient!" said no one, ever.

A researcher in the biochemical field typically spends hundreds of hours sifting through vast amounts of scientific papers in search of information about a chemical, a disease, and their relationship. Exploring complex chemical-disease relationships is tough: although tools and databases exist, researchers still have to find and read the underlying articles to get the information they need, which makes these relationships hard to visualize efficiently. The material is fragmented and time-consuming to work through, and researchers often explore multiple hypotheses at once, which makes cross-referencing large volumes of literature even harder. In the time spent reading, important connections get lost, and the process is slow and overwhelming.

Our Solution

Introducing ChemiMap, a chemical-disease relationship explorer that lets users quickly find relationships and relevant papers across vast amounts of literature. ChemiMap uses state-of-the-art Named Entity Recognition (NER) and Relation Extraction (RE) models to find chemicals and diseases in papers and analyze the most salient relationships between them. This information is presented in an interactive knowledge graph alongside annotated paper abstracts that highlight the entities, making research quicker and more efficient. With a user base of 1,000 researchers, we estimate total annual savings of $4 million.

Data Source and Data Science Approach

The BioCreative V Chemical Disease Relation (CDR) dataset is a text corpus of human annotations of chemicals, diseases, and interactions in 1,500 PubMed articles. This dataset is used to train our two core models: the ConNER NER model and the SSGU-CD RE model.

The ConNER model is a Transformer-based model that uses a fully connected layer to make tentative label classifications, followed by a BiLSTM refinement step that learns the internal structure of the BIO labels. Using the BIO (Beginning, Inside, Outside) scheme, the model determines whether each word or token in the text is part of an entity.
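
To make the BIO scheme concrete, here is a minimal sketch of how per-token BIO tags can be grouped back into entity spans. The function name `decode_bio`, the tag names `Chemical` and `Disease`, and the example sentence are illustrative assumptions, not ConNER's actual output format.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up example sentence with hand-written tags.
tokens = ["Aspirin", "can", "cause", "gastric", "ulcers", "."]
tags = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
print(decode_bio(tokens, tags))
# [('Aspirin', 'Chemical'), ('gastric ulcers', 'Disease')]
```

In this sketch, an I- tag that does not follow a matching B- tag is dropped rather than repaired; a refinement step such as the BiLSTM described above is one way to reduce such inconsistent label sequences.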

The SSGU-CD RE model is a BERT-based model that uses a Graph Convolutional Network (GCN) to analyze connections between mentions of chemicals and diseases, followed by a U-Net (a U-shaped network) that captures context across longer text. It identifies the relationships between candidate chemical-disease pairs.
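
As a rough illustration of what a graph convolution does, the sketch below implements one generic GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. This is the standard textbook layer, not SSGU-CD's actual architecture; the function names and the tiny two-node graph are assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """One GCN step: each node mixes its own features with its neighbors'."""
    n = len(adjacency)
    # Add self-loops so every node keeps part of its own representation.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization keeps feature scales comparable across nodes.
    norm = [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    transformed = matmul(matmul(norm, features), weights)
    # ReLU non-linearity.
    return [[max(0.0, value) for value in row] for row in transformed]


# Two mention nodes joined by one edge, each with a single feature.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0]]
print(gcn_layer(adjacency, features, weights))
# [[2.0], [2.0]]
```

After one step, both nodes hold the average of their own and their neighbor's feature, which is the sense in which a GCN lets linked mentions share information.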

The modeling steps are, in order: the NER model, coreference resolution (MeSH normalization), and the RE model.

At a high level, our pipeline takes text as input and goes through the following steps to output chemicals, diseases, and relationships:

A batch of text containing the paper title and abstract is input.
The text then goes through pre-processing to put it into a format compatible with the ConNER model. This includes tokenizing the text and mapping words to token indices for precise entity alignment.
The ConNER model performs the named entity recognition.
The outputs of the ConNER model then go through MeSH ID normalization. This step is necessary because the same entity can be referred to in many ways. For example, "hypertension" and "high blood pressure" should resolve to the same disease. To solve this coreference problem, we created a dictionary that maps entities to normalized MeSH IDs (unique medical identifiers for the entities) using several sources.
After the MeSH normalization step, the intermediate output goes through the SSGU-CD relation extraction model.
The final output contains the title, abstract, a list of chemicals and diseases identified by the NER, and the salient relationships identified by the RE.
The final output is stored in a MongoDB database, which holds all of the relationships.
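
The normalization and output steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the lookup table `MESH_LOOKUP`, the helper names, and the field names of the output record are hypothetical, and the real dictionary built from several sources would be far larger. D006973 is the MeSH descriptor for Hypertension.

```python
import re

# Hypothetical lookup table: lowercased surface form -> MeSH ID.
MESH_LOOKUP = {
    "hypertension": "D006973",
    "high blood pressure": "D006973",
}


def normalize_mention(mention):
    """Map an entity mention to a MeSH ID, or None if it is not in the table."""
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return MESH_LOOKUP.get(key)


def build_record(title, abstract, mentions, relations):
    """Assemble one output record; mentions are (text, entity_type) pairs."""
    chemicals, diseases = {}, {}
    for text, entity_type in mentions:
        mesh_id = normalize_mention(text)
        if mesh_id is None:
            continue  # unmapped mentions are simply skipped in this sketch
        bucket = chemicals if entity_type == "Chemical" else diseases
        bucket.setdefault(mesh_id, []).append(text)
    return {
        "title": title,
        "abstract": abstract,
        "chemicals": chemicals,
        "diseases": diseases,
        "relations": relations,
    }


record = build_record(
    "Example title",
    "Example abstract.",
    [("hypertension", "Disease"), ("High  blood pressure", "Disease")],
    [],
)
print(record["diseases"])
# {'D006973': ['hypertension', 'High  blood pressure']}
```

Both surface forms end up under the same MeSH ID, so the relation extraction step sees one disease rather than two. A dictionary-shaped record like this maps naturally onto a document in a store such as MongoDB.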

Evaluation

The results for each model are as follows:

NER
- F1: 91.5%
- Recall: 92.2%
- Precision: 90.8%
MeSH Normalization (Coreference Solution)
- Accuracy: Chemicals 80%, Diseases 63%
RE
- F1: 45.4%
- Recall: 41.3%
- Precision: 50.2%

The NER results are quite good and in line with the best-performing NER models. MeSH normalization identifies the same chemicals and diseases with decent accuracy in the majority of cases, but the many ways of describing the same medical term, especially for diseases, make this step difficult. The RE model has the lowest scores, as relation extraction remains particularly challenging due to factors such as complex vocabulary, diverse entities, cross-sentence dependencies, and data imbalance. Notably, however, the highest-performing RE model for a similar but non-medical dataset achieved an F1 score of only 58%, so our results are reasonable for the task.
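
For readers less familiar with the metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall values reported above; the function name `f1_score` is our own and not taken from the project code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(90.8, 92.2), 1))  # NER: 91.5
print(round(f1_score(50.2, 41.3), 1))  # RE: 45.3
```

The NER value matches the reported 91.5%. The RE value comes out at about 45.3% rather than the reported 45.4%, a gap that is consistent with the inputs having been rounded to one decimal place.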

Key Learnings and Impact

Finding chemical-disease relationships is a difficult task, but our team built a solution that uses NER and RE models to identify key relationships across vast amounts of scientific literature with notable accuracy. With ChemiMap, we expect to drive impact in the research industry by greatly reducing the time spent finding and reading through articles. With a user base of 1,000 researchers, healthcare professionals, pharmaceutical experts, and others, we estimate about $4 million in annual cost savings.

ChemiMap is a prototype that can expand into a business model built on user uploads, and we expect its effectiveness to increase over time. It could also be extended to other fields, making ChemiMap a growing database of expert knowledge.

Acknowledgement

We thank our two mentors, Korin Reid and Puya Vahabi, who helped us throughout our entire journey. They provided us with great feedback and guided us through difficult decisions and challenges. We also extend our thanks to those who reviewed our application, especially the PhD candidate at Oxford University who wishes to remain anonymous, for providing a detailed interview and feedback on ChemiMap. We also thank Michelle Sinani for giving us a detailed ethics review of ChemiMap, and all of our classmates who supported our journey. We are grateful to the scientific community for its continuous contributions to the academic literature.

References

Jeong, M., & Kang, J. (2022). Enhancing Label Consistency on Document-Level Named Entity Recognition. arXiv:2210.12949. https://arxiv.org/abs/2210.12949

Nie, P., Ning, J., Lin, M., Yang, Z., & Wang, L. (2024). SSGU-CD: A combined semantic and structural information graph U-shaped network for document-level chemical-disease interaction extraction. Journal of Biomedical Informatics, 157, 104719. https://doi.org/10.1016/j.jbi.2024.104719

Course

Data Science 210. Capstone, Fall 2024

More Information

Go to ChemiMap

ChemiMap Application GitHub Repo

ChemiMap Workbooks GitHub Repo

Video

Last updated: December 9, 2024