MIDS Capstone Project Spring 2024

CareFirst

Problem & Motivation

People seeking information on their mild to serious health issues are often caught between two extremes. They are either overloaded with information from scouring various medical websites or journals with mixed or ambiguous answers, or they are under-observed and under-treated by busy hospitals and urgent care clinics. Built as a conversational AI chatbot, CareFirst provides a singular, medically-sourced solution that can guide users on what the medical issue may be, how it may be treated, and where they should go for medical attention.

Our goal is always the same: to help you find the best possible solution to your first-aid needs. That's why we don't rely on AI conversation alone: we encourage feedback from our users and from our own verified medical professionals. See transparent feedback on CareFirst on the Model page of our website.

Data Science Approach and Impact

CareFirst is a Retrieval Augmented Generation (RAG) app with health information backed by Red Cross guidelines: a trusted medical source that we parse through so you don't have to!
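To illustrate the idea, here is a minimal sketch of the retrieval step in a RAG pipeline, assuming the guidelines have already been split into text chunks. The chunk texts and the simple bag-of-words scoring are illustrative stand-ins, not CareFirst's actual retriever.

```python
# Minimal RAG retrieval sketch: rank guideline chunks by similarity to the
# user's question, then hand the best chunk to the LLM as grounding context.
# Chunk texts and scoring are illustrative assumptions.
import math
import re
from collections import Counter

GUIDELINE_CHUNKS = [
    "For a minor burn, cool the area under running water for ten minutes.",
    "For a nosebleed, lean forward and pinch the soft part of the nose.",
    "For a sprain, rest the joint and apply a cold compress.",
]

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts, ignoring punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k guideline chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(GUIDELINE_CHUNKS,
                    key=lambda c: cosine_similarity(q, bag_of_words(c)),
                    reverse=True)
    return ranked[:k]

# The retrieved chunk is then placed in the LLM prompt as grounding context.
print(retrieve("What should I do for a burn?"))
```

A production retriever would use dense embeddings and a vector store rather than word counts, but the control flow is the same.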

Our impact as an AI application is guided by three key principles:

  • Trust - Unlike asking ChatGPT, CareFirst draws on trusted, transparent sources of verified medical advice when generating a response.
  • Communication - The AI proactively asks the user customized follow-up questions, driven by a knowledge graph of the medical scenarios that must be mapped to the user’s question before we can respond.
  • Safety - Guardrails powered by NeMo Guardrails and GPT-3.5 detect potential emergency situations.

This capability is also extensible to audio! Using open-source automatic speech recognition and text-to-speech models, we demonstrated how our AI application could support differentiated user experiences. See the video on the side panel for a demonstration of CareFirst with audio generation.

Evaluation

Our AI application is evaluated across three measures: technical metrics, feedback from Subject Matter Experts (SMEs), and user feedback.

Metrics

We experimented with three different Large Language Models while developing the CareFirst AI application, comparing each against the baseline answers GPT-3.5 provides, over a large sample of first-aid intents and their associated reference answers.

  • CareFirst, implemented with GPT-3.5, has the highest semantic similarity with the reference answer (Sentence-BERT: 0.7)
  • CareFirst, implemented with GPT-3.5, shares the longest sequences of words with the reference answer (ROUGE-L: 0.45)
  • 77% of CareFirst’s answers are retrieved from the same source as the reference answer.
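For readers unfamiliar with the ROUGE-L score cited above, it is the F-measure of the longest common subsequence (LCS) of words between a candidate answer and the reference answer. The pure-Python sketch below illustrates the computation; it is not the exact scoring library we used, and the example sentences are made up.

```python
# ROUGE-L sketch: F-measure over the longest common subsequence of words.
# Pure-Python illustration with made-up example sentences.
def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)  # F-measure

print(rouge_l("apply a cold compress to the sprain",
              "apply a cold compress and rest the sprain"))  # 0.8
```

A score of 0.45, as reported above, means that a little under half of the words (weighted between precision and recall) appear in the same order in both answers.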

Our baseline GPT-3.5 answers were the furthest from the reference answers, while the CareFirst implementations using Gemma-7b-it and Mistral-7B-Instruct-v0.2 both outperformed the baseline.

Subject Matter Experts

Four board-certified physicians practicing general surgery, trauma surgery, and internal medicine, along with one licensed Emergency Medical Technician (EMT), reviewed CareFirst responses.

Overall, CareFirst was rated 7.2 out of 10 for trustworthiness, reflecting the correctness of the information provided in its responses, and 6.7 out of 10 for comprehensiveness, reflecting the completeness of that information.

Feedback received included:

  • CareFirst appropriately asks for additional symptoms as needed
  • CareFirst accurately provides guidelines for when and where to seek further care
  • CareFirst can provide more specialized care with further triaging and user profiling

Key Learnings

We learned a great deal through this product development, including the following key learnings:

  1. Domain expertise is crucial to ensuring your LLM application works effectively for its task. This comes from understanding the strengths and limitations of your data and working with subject matter experts.
  2. Evaluating an LLM application takes an array of measures to be confident it is working effectively, with continuous evaluation as you iterate through development.
  3. Deploying an end-to-end experience, including a web application, backend infrastructure, and an LLM application, takes a lot of time and effort to get right.

Our roadmap from here includes:

  1. Broadening usability by deploying the audio integration
  2. Introducing the capability to add and maintain source documents in line with SME advice
  3. Continuous improvement based on learnings from the embedded user feedback

Acknowledgements

Thank you to Professors Mark Butler, James Winegar, Korin Reid, and Fred Nugen for their advice and support.

Thank you to the SMEs, friends, and family who took part in user testing.

Last updated: April 17, 2024