MIDS Capstone Project Fall 2024

AllTrials: Personalized Clinical Trial Matching

Problem & Motivation

Clinical trials are research studies conducted to determine the effectiveness of medical treatments on specific medical conditions. They play an important part in the advancement of medicine; however, access to them is inequitable. Patients in rural areas often miss out on direct trial recruitment from research organizations and on trial outreach through their physicians. Furthermore, online trial information can be difficult to navigate on currently available platforms, and finding relevant trials requires advanced technical and medical skills. Limited trial awareness in rural areas can reduce diversity in clinical trials, because these populations are then not accounted for in the development of medical treatments. This, in turn, can limit the generalizability of medicines. Our solution, AllTrials.info, aims to make the clinical trial discovery process more accessible for patients through personalized trial matching using large language models (LLMs).

AllTrials.info is a decision support tool that makes it easier for patients to find trials for which they may be eligible. This tool is not intended to replace medical advice, but to help patients discover trial options they can discuss with their healthcare providers. With AllTrials.info, users can enter their medical information and find studies related to their specific conditions through vector search and LLM-based relevance scoring. Users no longer need to sift through dozens of clinical trials they are ineligible for just to find the few they do qualify for. Instead, we surface likely-eligible trials first, potentially giving users access to treatments they would not have otherwise received. As an online service, our aim is to help people overcome geographic limitations to clinical trial discovery and thereby address inequitable access to clinical trials, a critical issue in clinical research.

Data Science Approach

The AllTrials.info web application is deployed with Docker on AWS EC2. Cloud-based infrastructure is intended to support scalability and availability so that trial discovery remains responsive for users. The website is built on Streamlit, a Python framework for creating interactive apps. The system interacts with:

  • Qdrant Cloud: A cloud-hosted scalable vector database for storing sparse (keyword) and dense (semantic) embeddings. It is used for vector similarity search and keyword-based filtering to retrieve relevant trials based on patient data. 
  • GPT-4o mini: A lightweight OpenAI LLM for understanding and scoring clinical trial relevance based on patient data and trial eligibility requirements.
  • ClinicalTrials.gov: An online database and website housing clinical trial information from which trial eligibility data is sourced. 

Below are architecture diagrams illustrating the workflow.

This architecture combines vector search with LLM-based scoring to streamline clinical trial matching, with the aim of both precision and usability.

As shown in the diagrams above, the first trial matching step is a query against the Qdrant database. As previously mentioned, patient data is compared against both dense and sparse embeddings of trial data. Dense embeddings return trials semantically aligned with the patient query. Sparse embeddings return trials matched on specific attributes. Using both produces a comprehensive set of trials that combines broad relevance (dense embeddings) with precise criteria matching (sparse embeddings). Both embeddings are described further below.
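The source does not state how the dense and sparse result lists are merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one widely used way to combine two rankings. The function name, the trial IDs, and the constant k=60 are assumptions made for this example and are not taken from the AllTrials.info code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of trial IDs into one fused ranking.

    Each list contributes 1 / (k + rank) to a trial's score, so a trial
    that ranks well in both lists rises to the top. k=60 is a common
    default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, trial_id in enumerate(ranking, start=1):
            scores[trial_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda trial_id: (-scores[trial_id], trial_id))


# Hypothetical results from the two searches (IDs are made up).
dense_hits = ["trial_a", "trial_b", "trial_c"]
sparse_hits = ["trial_b", "trial_d", "trial_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Trials that appear in both lists accumulate two contributions, which is why they tend to outrank trials found by only one search.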

Dense Embeddings

  • Definition: Dense embeddings are continuous vector representations created from high-dimensional data, designed to capture semantic similarities.
  • Purpose:
    • Match clinical trials based on general relevance to the input query (e.g., condition similarity, patient attributes).
    • Ensure trials that align broadly with the user’s query are identified.
  • Usage: Dense vectors are computed for clinical trial metadata, such as trial summary, and compared with the user query to identify semantically similar trials.
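To make the dense comparison concrete, here is a minimal sketch that ranks trials by cosine similarity to a query vector. Cosine similarity is assumed as the distance measure; the source does not say which one the Qdrant collection uses. The three-dimensional vectors and trial IDs are invented for readability, whereas real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (values are made up for illustration).
query_vec = [0.9, 0.1, 0.3]
trial_vecs = {
    "trial_a": [0.8, 0.2, 0.4],  # points in nearly the same direction
    "trial_b": [0.1, 0.9, 0.2],  # points in a different direction
}

# Rank trials from most to least similar to the query.
ranked = sorted(
    trial_vecs,
    key=lambda trial_id: cosine_similarity(query_vec, trial_vecs[trial_id]),
    reverse=True,
)
print(ranked)
```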

Sparse Embeddings

  • Definition: Sparse embeddings are high-dimensional vectors in which most entries are zero; the nonzero entries correspond to specific, distinctive features, such as exact terms or rare conditions.
  • Purpose:
    • Capture fine-grained details that may not appear in dense embeddings.
    • Address situations where specific attributes (e.g., a rare disease or exact age range) must be prioritized.
  • Usage: Sparse vectors are generated from structured trial data, such as age, and matched against the query to retrieve trials with exact attribute matches.
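A sparse vector can be pictured as a mapping from terms to weights in which every absent term is an implicit zero. The sketch below scores trials by the dot product over shared terms, so only exact term overlap contributes. The terms, weights, and trial IDs are hypothetical, and the source does not specify how AllTrials.info computes its sparse weights.

```python
def sparse_dot(query, doc):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict so the cost follows the shorter vector.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


# Hypothetical sparse vectors (terms and weights are made up).
query = {"glaucoma": 1.0, "adult": 0.5}
trials = {
    "trial_a": {"glaucoma": 2.0, "adult": 1.0, "laser": 0.7},
    "trial_b": {"cataract": 2.0, "adult": 1.0},
}

scores = {trial_id: sparse_dot(query, vec) for trial_id, vec in trials.items()}
print(scores)
```

Here trial_a scores higher because it shares the condition term with the query, while trial_b overlaps only on the age-related term.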

Data Source

Our system was developed as a demonstration of a trial matching solution and does not automatically pull active trials from ClinicalTrials.gov. Instead, the Qdrant vector database was populated with trials suitable for evaluating the system, because medical professionals had labeled each of them for suitability against synthetic patients. We sourced this labeled data from two sources. The first was publicly available patient-trial matching test data provided by the Special Interest Group on Information Retrieval (SIGIR) in 2016 and the Text REtrieval Conference (TREC) in 2021 and 2022. The second was a set of ophthalmology-related trials evaluated by a Subject Matter Expert (SME), Kevin Lee, M.D., against two synthetic test patients he created for our project. In total, we included 2,915 trials, each labeled eligible, ineligible, or irrelevant, corresponding to 12 patients with 12 different conditions.

Evaluation

Because we provide the LLM with specific trials and patient data to compare when it scores trials and writes an explanation of trial applicability, our system uses Retrieval-Augmented Generation (RAG). We therefore evaluated it using Retrieval-Augmented Generation Assessment Scores (RAGAS). The RAGAS metrics we used were as follows:

  • Answer Correctness - compared and evaluated the factual accuracy of the generated response with the ground truth answer
  • Context Recall - measured how many of the relevant chunks were retrieved into the context
  • Context Precision - measured the proportion of relevant chunks in context
  • Faithfulness - measured the factual consistency of the generated response against the given context
  • Noise Sensitivity - measured how often a system made errors by providing incorrect responses when utilizing either relevant or irrelevant chunks in context
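As a rough intuition for the two context metrics, the sketch below computes set-based precision and recall over chunk IDs. This is a deliberate simplification and not the RAGAS implementation, which relies on LLM judgments of relevance, so its scores would differ. The function names and chunk IDs are assumptions for this example.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunks that are relevant (simplified)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Share of relevant chunks that were retrieved (simplified)."""
    if not relevant:
        return 0.0
    retrieved_set = set(retrieved)
    found = sum(1 for chunk in relevant if chunk in retrieved_set)
    return found / len(relevant)


# Hypothetical chunk IDs: four were retrieved, three are truly relevant.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c5"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(precision, recall)
```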

Key Learnings & Impact

We learned the importance of transparency in results produced by LLMs, which is why we decided that the final output to the user would include an explanation of why the trial might be relevant to the patient. This way, patients would have a better understanding of the trial matching result and could truly use it as a decision support tool. Having an explanation would also help patients discuss enrolling in the matched trials with their physician before volunteering for a study.

AllTrials.info empowers patients by breaking down barriers to clinical trials, making trial discovery simpler and more inclusive. This can help patients access innovative treatments faster and can support the process of bringing life-saving therapies to market. In summary, the impact of AllTrials.info is as follows:

  • Improved Access: Patients can easily discover trials that meet their specific conditions.
  • Reduced Costs: By streamlining the process, patients and researchers save time and money.
  • Accelerated Treatments: Faster trial matching leads to quicker patient enrollment and treatment initiation.
  • Increased Diversity: Enabling patients in various geographic locations to find trials could help diversify clinical trials and ensure that new treatments are safe and effective across varied populations.

Acknowledgements

We would like to thank our instructors, Korin Reid and Ramesh Sarukkai, for their guidance. We would also like to thank Kevin Lee, M.D. for creating synthetic patients and identifying clinical trials for which those patients are eligible.

Last updated: December 10, 2024