VitalStory: Better Conversations, Better Decisions, Better Health.
Problem & Motivation
Millions of patients struggle to recall important details during medical visits. Between symptom timelines, treatment changes, and new concerns, vital information often slips through the cracks. The result? Missed diagnoses, eroded doctor-patient trust, and increased patient costs, especially for those with less access to continuous care or caregiver support.
We realized that improving these conversations doesn't start in the clinic—it starts days, weeks or even months before the visit. VitalStory was born from a simple idea: what if patients could capture relevant health updates in real time and receive AI-guided follow-ups, so they walk into every appointment better prepared?
Chronic Care Market
This problem is even more acute for those managing chronic conditions or undergoing treatment transitions. Patients with chronic conditions like diabetes, long COVID, cancer, and autoimmune disorders often face complex, multi-symptom journeys. These individuals stand to benefit most from better tools to log, track, and communicate what’s happening between visits.
Our Solution
Today's solutions are fragmented. On one side are competitors that focus solely on patients: they try to capture every detail, but often require tedious tracking, miss important context, and offer no way to help patients communicate that context in terms meaningful to their physicians. On the other side are tools built for doctors, which only work as well as the story they are told, and that story depends on the patient. At VitalStory we fix this communication breakdown by guiding patients to capture meaningful context in the moment and helping them communicate their story when it matters.
We built VitalStory to work for real people managing complex health journeys—not just clinicians or tech-savvy users. The AI-Guided Health Logs turn scattered symptoms into structured narratives, while the AI Appointment Companion turns those logs into useful, on-demand insights before and during visits.
Both features are designed to feel natural, helpful, and lightweight—no dashboards, forms, or clinical jargon. Just better conversations.
AI-Guided Health Logs – Users capture health updates in real time and receive intelligent follow-up questions to ensure they’re logging the right details, even when they’re not sure what matters most.
AI Appointment Companion – When it’s time for a checkup, users can ask natural language questions like “When did I start feeling worse?” or “What changed after I took the medication?” and get personalized summaries based on their past entries.
Together, these features help patients walk into appointments with clarity and confidence—making every visit more effective.
Data Science Approach
Here’s how it works under the hood. The frontend lets users register and sign in, captures user input, and formats prompts and conversation history. When a message is submitted, the frontend sends the prompt to a Med42 LLM endpoint hosted on AWS SageMaker. This endpoint serves a fine-tuned Med42 model optimized for medical question answering; we chose SageMaker for its scalability and managed model hosting.
The gateway endpoint acts as a secure bridge: API Gateway forwards the request to a Lambda function, which invokes the SageMaker model endpoint and returns the response back through API Gateway to the frontend. Decoupling with API Gateway and Lambda gives us flexibility; in the future, we can log requests, add caching with Redis, or do light preprocessing in Lambda. Chat history is maintained in session state, and responses are displayed dynamically. Overall, the architecture is modular, cost-effective, and cloud-native, leveraging AWS services for model inference while keeping the UX simple and responsive.
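The request flow above can be sketched as a minimal Lambda handler. The payload shape, endpoint name, and generation parameters here are illustrative assumptions, not our production contract; the `invoke` argument is injected so the function can be exercised without a live SageMaker endpoint.

```python
import json


def build_payload(prompt, history):
    """Format the user prompt and chat history for the model endpoint.
    The payload shape is illustrative; the real Med42 endpoint contract may differ."""
    messages = [{"role": r, "content": c} for r, c in history]
    messages.append({"role": "user", "content": prompt})
    return {"inputs": messages, "parameters": {"max_new_tokens": 512, "temperature": 0.2}}


def lambda_handler(event, context, invoke=None):
    """API Gateway -> Lambda bridge. `invoke` is injected for local testing;
    in production it would wrap boto3's sagemaker-runtime invoke_endpoint call."""
    body = json.loads(event["body"])
    payload = build_payload(body["prompt"], body.get("history", []))
    if invoke is None:
        import boto3  # deferred import so local tests don't need AWS credentials

        runtime = boto3.client("sagemaker-runtime")
        resp = runtime.invoke_endpoint(
            EndpointName="med42-endpoint",  # assumed endpoint name
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        answer = json.loads(resp["Body"].read())
    else:
        answer = invoke(payload)
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```

Keeping payload formatting in a separate function is what makes the light preprocessing mentioned above easy to slot in later.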
We made significant enhancements to the Med42 core model:
- Tokenization & Embeddings: Med42 uses a tokenizer optimized for medical vocabulary, so complex terms like “uveitis” or “metformin” are recognized as single tokens instead of being split. This improves both embedding accuracy and language understanding, which is critical in clinical scenarios.
- Bias & Hallucination Mitigation: We implemented guardrails and prompt engineering to reduce hallucinations and address biases, especially around gender, age, and racial disparities in clinical contexts.
- Security & Compliance: We prioritized data security with encryption at rest and in transit, and ensured compliance with HIPAA standards when handling user data or protected health information.
- Dataset Curation & Fine-Tuning: We curated domain-specific datasets (PubMed, MedQA, PubMedQA, MedMCQA), focusing on patient symptom logs, clinical summaries, and common Q&A, then fine-tuned Med42 to improve contextual reasoning and medical accuracy in user interactions.
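The tokenization point can be illustrated with a toy greedy longest-match tokenizer. This is purely illustrative: Med42's real tokenizer is a trained subword model, and the two vocabularies below are invented to show why keeping a drug name whole matters.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a toy vocabulary.
    Unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    text = text.lower()
    while i < len(text):
        # Try the longest remaining substring first, shrinking until a vocab hit.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# A general-purpose vocabulary splits the drug name into fragments ...
general_vocab = {"met", "for", "min", " "}
# ... while a medically adapted vocabulary keeps it as one token.
medical_vocab = {"metformin", " "}

print(greedy_tokenize("metformin", general_vocab))  # ['met', 'for', 'min']
print(greedy_tokenize("metformin", medical_vocab))  # ['metformin']
```

A single-token representation gives the embedding layer one coherent vector for the concept instead of three unrelated fragments.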
Evaluation
In our app, we had two main features to test: the AI-Guided Health Logs and the AI Appointment Companion. Below is the framework we used for evaluation.
1 | Constructing the gold standard dataset
Since we had two very different model outputs for our features, that meant we had to create two separate gold standard datasets: one focused on follow-up questions, and the other on a fictional patient history with Q&A tasks. To ensure these were clinically meaningful, we partnered with physicians while building these datasets.
2 | Selecting key evaluation metrics
We then evaluated performance using a composite score. We weighted BERTScore (F1) most heavily to capture semantic similarity, but also included ROUGE-L and BLEU to catch structure and phrasing. That mattered because when we used BERTScore alone, we saw high scores but sometimes found odd or unexpected outputs when we audited them.
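The blend reduces to a weighted sum of the three metrics. The sketch below assumes the per-metric values are precomputed (e.g. with the bert-score, rouge-score, and sacreBLEU packages); the specific weights are illustrative, since we only state that BERTScore carried the most weight.

```python
def composite_score(bertscore_f1, rouge_l, bleu, weights=(0.6, 0.25, 0.15)):
    """Blend semantic (BERTScore F1) and surface-level (ROUGE-L, BLEU) metrics
    into one score. All inputs are assumed to be in [0, 1]; the default
    weights are illustrative, with BERTScore weighted most heavily."""
    w_bert, w_rouge, w_bleu = weights
    return w_bert * bertscore_f1 + w_rouge * rouge_l + w_bleu * bleu


# Example: semantically close but lexically different candidate.
print(composite_score(bertscore_f1=0.9, rouge_l=0.5, bleu=0.3))
```

Because ROUGE-L and BLEU still contribute, a candidate that is semantically plausible but structurally odd gets pulled down, which is exactly what the audits flagged.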
3 | Applying custom evaluation logic to the health-logs model
For our AI-Guided Health Logs model, we faced a challenge: Our model generates just 3 questions, while the doctor-approved gold set has up to 10 — unranked, unordered, but all valid follow-ups.
Generating 10 model questions to match the gold set wouldn’t reflect real-world use, so we needed a fairer evaluation.
We used pairwise matching: for each model question, we kept the gold question with the highest composite score. This was computationally heavy, but it let us avoid manually reviewing every possible match for each log and scored things in a clear-cut way, which will be critical as we scale evaluation further.
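The one-to-many matching above can be sketched as follows. The `jaccard` similarity is a deliberately simple stand-in for the real blended metric so the sketch stays self-contained.

```python
def best_match_score(model_questions, gold_questions, score):
    """For each of the model's questions, take its best score against any gold
    question, then average. `score(a, b)` stands in for the blended composite;
    gold questions are unranked, so any of them is a valid match."""
    per_question = [max(score(m, g) for g in gold_questions) for m in model_questions]
    return sum(per_question) / len(per_question)


def jaccard(a, b):
    """Toy word-overlap similarity used here in place of the real composite."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


model_qs = ["When did the pain start?", "How severe is the pain?"]
gold_qs = [
    "When did the pain start?",
    "How severe is the pain?",
    "Does anything make it better or worse?",
]
print(best_match_score(model_qs, gold_qs, jaccard))  # 1.0
```

Because each model question is matched independently against the whole gold set, generating only three questions is not penalized for failing to cover all ten.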
4 | Benchmarking model variants
Using the one-to-many matching approach, we tested LLaMA 3–8B and Mistral–7B as baselines; both did well, but we saw the best results from the fine-tuned Med42–8B model. We set the temperature to 0.2 and used few-shot system prompting.
We experimented with other hyperparameters and quantization, but temperature and prompt engineering had the biggest impact. Higher temperatures generally hurt performance, and few-shot prompting gave us the most consistent, relevant outputs.
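A few-shot setup along these lines can be sketched as below. The example exchange, prompt wording, and parameter values other than the 0.2 temperature are hypothetical stand-ins for our actual prompt.

```python
# Illustrative few-shot examples; our real prompt uses clinician-reviewed exchanges.
FEW_SHOT_EXAMPLES = [
    (
        "I've had a headache since Tuesday.",
        "1. Where is the headache located?\n"
        "2. How severe is it on a scale of 1-10?\n"
        "3. Have you taken anything for it?",
    ),
]


def build_prompt(user_log):
    """Assemble a few-shot prompt asking for exactly three follow-up questions."""
    header = (
        "You are a health-logging assistant. Given a patient's update, "
        "ask exactly three clarifying follow-up questions."
    )
    shots = "\n\n".join(f"Update: {u}\nQuestions:\n{q}" for u, q in FEW_SHOT_EXAMPLES)
    return f"{header}\n\n{shots}\n\nUpdate: {user_log}\nQuestions:"


# Low temperature, reflecting the finding that higher values hurt performance here.
GENERATION_PARAMS = {"temperature": 0.2, "max_new_tokens": 256}

print(build_prompt("My knee has been aching after walks."))
```

Keeping the examples in a list makes it cheap to swap in new clinician-approved shots as the gold set grows.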
For the AI companion, we tested how well the model could extract answers from a year of fictional patient logs — from summaries and symptom extraction to timeline questions like “When did the patient first report feeling tired?”
Med42–8B again performed best with a few-shot system prompt. Interestingly, it did slightly better at a temperature of 0.5, likely because these responses are more open-ended and benefit from a bit of creativity.
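Timeline questions like the one above have a deterministic ground truth, which makes them easy to sanity-check. The keyword scan below is an illustrative baseline for building such checks, not part of our evaluation pipeline; the log entries are invented.

```python
from datetime import date


def first_mention(logs, keyword):
    """Scan dated log entries in chronological order and return the date of the
    first entry mentioning the keyword, or None if it never appears.
    `logs` is a list of (date, text) pairs."""
    for day, text in sorted(logs):
        if keyword.lower() in text.lower():
            return day
    return None


logs = [
    (date(2024, 3, 2), "Felt fine after the morning walk."),
    (date(2024, 3, 9), "Very tired all afternoon."),
    (date(2024, 4, 1), "Still tired, and slept poorly."),
]
print(first_mention(logs, "tired"))  # 2024-03-09
```

Comparing the model's answer to "When did the patient first report feeling tired?" against this kind of deterministic lookup catches timeline errors that fuzzy text metrics can miss.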
There’s still a lot of exciting work to be done here — and plenty of opportunities to build new features and push the testing further as the product evolves.
Key Learnings
Below are some key learnings from the development of VitalStory.
1. Regulatory Compliance and Ethical Considerations:
- HIPAA Compliance is Non-Negotiable:
- Strict adherence to healthcare regulations like HIPAA is essential for data privacy and security.
- Ethical AI Implementation is Crucial for Trust:
- Ensuring the AI chatbot complements, rather than replaces, doctors is vital for building trust with patients and healthcare professionals.
- It is vital to remember the ethical implications of using AI in the medical field.
2. Technical Development and Architecture:
- Define Clear Objectives Early:
- Establishing a well-defined objective at the outset is essential for informed architectural decisions and aligned development efforts.
- Strategic LLM Selection and Fine-Tuning:
- Selecting the right LLM and planning for domain-specific fine-tuning are critical for optimal performance.
- Proactive Cloud Infrastructure Planning:
- Anticipating and securing necessary cloud infrastructure resources is essential for scalability and avoiding deployment delays.
- Implement Robust Logging from the Start:
- Early implementation of robust logging is vital for debugging, monitoring, and continuous improvement.
- Standardize Prompt and Output Formatting:
- Consistent prompt and output formatting within Lambda functions is crucial for smooth integration with applications.
- Robust Technical Architecture:
- A modular page system with persistent state management preserves user context and enables seamless transitions throughout the healthcare documentation process.
- Thoughtful Accessible Interaction Design:
- Flexible dual input methods (voice and text) accommodate diverse user preferences and accessibility needs when logging health symptoms.
Acknowledgements
This project wouldn’t have been possible without the support, feedback, and contributions from so many thoughtful individuals. From user interviews to model evaluations, each step of the process was shaped by those who generously shared their time, expertise, and encouragement.
Friends, family, and colleagues who participated in user interviews and beta testing—your feedback helped shape the product from early ideas to a working prototype.
Clinicians who reviewed and advised on dataset development, ensuring our outputs remained clinically relevant and grounded in real-world care.
The teams behind the open-source LLaMA, Mistral, and Med42 clinical language models, whose work made accessible, high-quality medical AI possible.
Mentors and peers who provided guidance, encouragement, and thoughtful critique throughout the journey.