MIDS Capstone Project Summer 2024

Genomics Adapters

Problem and Motivation

In genomics research, there are numerous specialized tools designed to accomplish various tasks. However, the sheer volume of these tasks, often numbering in the hundreds, combined with the steep learning curve for each tool, poses a significant challenge. Researchers frequently find themselves spending more time learning how to use these tools rather than focusing on analyzing their data. This diversion of effort can slow down progress and reduce efficiency in research.

Our project is motivated by the need to streamline this process. The ultimate goal is to develop a chat-like interface where researchers can input prompts related to raw genomics data. This tool will perform the necessary actions, eliminating the need for researchers to master multiple complex tools. By simplifying the process, we hope to enable researchers to dedicate more time to critical analysis and interpretation of their results, ultimately accelerating advancements in genomics research. For the MVP, our goal is to highlight the feasibility of this grand vision by engineering a trained model designed to accomplish some sample genomics tasks that typically requires other software. 

Data Source and Approach

The model was trained and evaluated on the GUE (Genome Understanding Evaluation) benchmark, a collection of 28 distinct datasets covering 10 different genomic classification tasks. The benchmark comes directly from the DNABERT-2 paper and includes sequence data from humans, mice, yeast, and fungi. GUE is largely balanced across all tasks and uses fixed sequence lengths for each task, ranging from 70 to 1000 base pairs. We primarily used four binary classification tasks (transcription factor prediction in mice, transcription factor prediction in humans, promoter site prediction, and epigenetic mark prediction) and two multi-class classification tasks (splice site prediction and COVID variant prediction).

Our model combines a state-of-the-art DNA encoder, DNABERT-2, with an LLM encoder-decoder: it takes in sequence data and questions and outputs sensible, English-language responses. To connect the two models, it leverages a Querying Transformer (Q-Former), a lightweight transformer trained to attend to the encoder outputs, find the most salient information in them, and project it into vectors usable by the LLM.
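The Q-Former's core operation, cross-attention from a fixed set of learned query tokens to the encoder outputs, can be sketched as follows. This is a minimal illustration, not the project's implementation; all dimensions and weight initializations here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
seq_len, d_enc = 128, 768    # length and width of DNA-encoder outputs
n_query, d_q = 32, 768       # number and width of learned query tokens
d_llm = 2048                 # LLM embedding size

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_step(enc_out, queries, W_k, W_v, W_proj):
    """One cross-attention step: queries attend over encoder outputs,
    and the attended summary is projected into the LLM embedding space."""
    K = enc_out @ W_k                              # (seq_len, d_q)
    V = enc_out @ W_v                              # (seq_len, d_q)
    attn = softmax(queries @ K.T / np.sqrt(d_q))   # (n_query, seq_len)
    summary = attn @ V                             # (n_query, d_q)
    return summary @ W_proj                        # (n_query, d_llm)

enc_out = rng.normal(size=(seq_len, d_enc))
queries = rng.normal(size=(n_query, d_q))
W_k = rng.normal(size=(d_enc, d_q)) * 0.02
W_v = rng.normal(size=(d_enc, d_q)) * 0.02
W_proj = rng.normal(size=(d_q, d_llm)) * 0.02

llm_tokens = qformer_step(enc_out, queries, W_k, W_v, W_proj)
print(llm_tokens.shape)  # (32, 2048)
```

Because the number of query tokens is fixed, the LLM always receives a constant-size "soft prompt" regardless of how long the input DNA sequence is.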

Evaluation

We evaluated our models with the goal of approaching the performance that a fine-tuned DNABERT-2 encoder achieves on various genomics tasks. Evaluation used tasks from the GUE benchmark: four binary classification tasks (transcription factor prediction in mice, transcription factor prediction in humans, promoter site prediction, and epigenetic mark prediction) and two multi-class classification tasks (splice site prediction and COVID variant prediction).

In order to use common metrics within an LLM setting, we take the model's prediction for each input to be the class whose label has the lowest perplexity among all possible classes. These predictions are then used to calculate the Matthews correlation coefficient (MCC) for binary classification tasks and F1 for multi-class classification tasks, allowing direct comparison against results from the standalone DNABERT-2 model.
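The scoring scheme can be sketched in a few lines. The per-token log-probabilities below are invented toy numbers; in practice they would come from the LLM's output distribution over each candidate label's tokens.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a candidate label: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def predict(class_logprobs):
    """Pick the class whose verbalized label the LLM finds least surprising."""
    return min(class_logprobs, key=lambda c: perplexity(class_logprobs[c]))

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for binary classification."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: log-probs of each label's tokens for one input.
scores = {"positive": [-0.2, -0.4], "negative": [-1.1, -0.9]}
print(predict(scores))  # positive
```

MCC is attractive here because it stays at 0 for a model that ignores the input entirely, which is exactly the failure mode we needed to detect.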

We first measured the performance of the standalone, out-of-the-box LLM. We additionally trained the LLM with DNABERT-2 connected by only a linear projection between the encoder and decoder. The linear-projection model was trained on one task at a time and evaluated only on that task. In both cases, the model was unable to infer any information from the DNA sequence, emphasizing the need for more sophisticated techniques to transfer DNABERT-2's knowledge of genomic syntax to the language model.
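For contrast with the Q-Former, the linear-projection baseline maps each encoder hidden state independently into the LLM embedding space, with the projection matrix as the only trained parameters. A minimal sketch, with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_enc, d_llm = 128, 768, 2048  # illustrative sizes only

W = rng.normal(size=(d_enc, d_llm)) * 0.02  # the only trainable weights
b = np.zeros(d_llm)

enc_out = rng.normal(size=(seq_len, d_enc))   # frozen DNA-encoder outputs
soft_prompt = enc_out @ W + b                 # one LLM "token" per position
print(soft_prompt.shape)  # (128, 2048)
```

Unlike the Q-Former, this produces one soft token per encoder position and has no mechanism to select salient positions, which is consistent with its failure to transfer any genomic signal.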

We then evaluated a Q-Former placed between DNABERT-2 and the LLM, training only the Q-Former. We also evaluated DNABERT-2 connected to the LLM with a linear projection while applying parameter-efficient fine-tuning (LoRA) to DNABERT-2. These models were again trained and evaluated on one task at a time. Both were able to learn the various tasks, particularly the binary ones. The LoRA-trained model performed slightly better, but took longer to train.
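LoRA makes fine-tuning parameter-efficient by freezing each pretrained weight matrix and learning only a low-rank update alongside it. A numpy sketch of the idea, with made-up dimensions and rank:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r = 768, 768, 8   # hypothetical layer sizes; rank r << d

W = rng.normal(size=(d_in, d_out))      # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.02   # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus low-rank update; because B starts at zero, the
    # adapted layer initially reproduces the pretrained layer exactly.
    return x @ W + x @ A @ B

x = rng.normal(size=(4, d_in))
full, lora = d_in * d_out, r * (d_in + d_out)
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.1f}%)")
# → trainable params: 12288 vs 589824 (2.1%)
```

The trade-off we observed follows from this design: gradients must still flow through the frozen encoder, so training is slower than updating a Q-Former alone, but adapting the encoder itself yields slightly better task performance.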

Lastly, we trained the Q-Former model and the LoRA model on multiple binary tasks at once, evaluating on all the tasks each model was trained on. We found that this significantly increased the LoRA model's performance, while erasing the Q-Former model's syntactic knowledge of the genome. The LoRA model achieved near-encoder-level metrics on epigenetic mark prediction, promoter prediction, and transcription factor prediction.

Key Learning & Impacts
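The multi-task setup can be illustrated with a simple round-robin mixer that interleaves examples from every binary task into one training stream. Task names and data below are invented toy values; the real training loop operates on tokenized prompts.

```python
# Toy multi-task data: each task contributes (sequence, label) pairs.
tasks = {
    "tf_mouse": [("seq1", 0), ("seq2", 1)],
    "promoter": [("seq3", 1), ("seq4", 0)],
    "epi_mark": [("seq5", 1)],
}

def round_robin(task_data):
    """Yield one example from each task in turn until all are exhausted."""
    iters = [iter(v) for v in task_data.values()]
    while iters:
        alive = []
        for it in iters:
            try:
                yield next(it)
                alive.append(it)
            except StopIteration:
                pass
        iters = alive

stream = list(round_robin(tasks))
print(len(stream))  # 5
```

Mixing tasks this way forces the single model to share representations across tasks, which is where the LoRA model's performance gains came from.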

As a whole, we find that we can bootstrap and transfer knowledge from DNA encoders to LLMs by adapting multi-modal vision architectures to genomics tasks. Not only does this allow a single model to perform many genomics tasks, but training one model on multiple genomics tasks at once yields near-encoder-level performance on several of them.

In the future, we look toward training on more data so that our LLMs can better leverage the DNA encoder. We also seek to experiment with different adapter models, with the end goal of improving metrics on multi-class prediction, and to train on varied generated prompts that capture the essence of each task's original prompt, allowing better instruction generalization.

 

Results by task (MCC for binary tasks, F1 for multi-class tasks; "binary" rows were trained on all binary tasks jointly):

Model                 | TF pred. (mouse) | Epi mark pred. | Promoter pred. | TF pred. (human) | Splice site pred. | COVID variant pred.
DNABERT-2 fine-tuned  | 56.76            | 80.17          | 86.77          | 71.99            | 84.99             | 71.02
Q-Former              | 0                | 63.70          | 44.41          | 42.84            | 0                 | 2.38
Q-Former (binary)     | 0                | 0              | 0              | 0                | NA                | NA
LoRA                  | 0                | 71.60          | 25.30          | 52.14            | 74.06             | 2.38
LoRA (binary)         | 22.70            | 70.96          | 85.03          | 38.38            | NA                | NA

Acknowledgements

Huge thanks to Artemis Panagopoulou, the original author of X-InstructBLIP, for serving as a resource as we bootstrapped our code. Additional thanks to two of our interview candidates, who provided insight into gaps in the bio-research space.

Last updated: August 8, 2024