Juniper: Privacy Interface for Large Language Model Interaction
Summary
Our project, Juniper, is a proxy interface that lets individuals and organizations unlock the potential of Large Language Models (LLMs) while keeping user and organizational data private. As an initial proof of concept, it is applied to medical diagnosis, a domain where large quantities of private data are present. Individuals and companies should be able to use the power of open-source LLMs without fearing leaks of sensitive data.
Mission Statement
At our core, we are committed to enabling individuals and organizations to unlock the potential of Large Language Models while upholding privacy.
Background
Privacy breaches and data exposure are significant concerns with the use of LLMs. Users might unintentionally share sensitive information in their LLM prompts, which can be accessed by LLM providers and potentially utilized elsewhere. Organizations are subject to strict regulations such as GDPR, which dictate the handling of personal data. Furthermore, traditional data anonymization methods, while intended to protect privacy, can sometimes compromise the effectiveness of downstream tasks.
MVP
Our MVP focuses on three core objectives. First, private data in prompts is redacted or replaced before the prompts reach an external LLM provider such as OpenAI, which protects privacy and supports regulatory compliance. Second, the integrity of the diagnostic process is maintained: the original and treated prompts should lead to consistent diagnoses, which builds confidence in the system's accuracy. Third, users keep control: they can review and modify treated prompts before they are sent, which builds trust. Together, these objectives aim to deliver a user-centric solution for medical diagnosis that upholds both privacy and accuracy.
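To make the first objective concrete, the following is a minimal sketch of placeholder-based redaction, not Juniper's actual implementation. The regular expressions, the placeholder format, and the function names `redact` and `restore` are all hypothetical and chosen for illustration; a real system would more likely use a named-entity recognition model than hand-written patterns.

```python
import re

# Hypothetical patterns for illustration only; they cover a few obvious cases.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"),
}


def redact(prompt):
    """Replace private values with placeholders.

    Returns the treated prompt and a placeholder -> original mapping
    that stays on the user's side of the proxy.
    """
    mapping = {}   # placeholder -> original value
    reverse = {}   # original value -> placeholder (reused for repeats)
    counters = {}  # per-label running count
    treated = prompt
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            if value not in reverse:
                counters[label] = counters.get(label, 0) + 1
                placeholder = f"[{label}_{counters[label]}]"
                reverse[value] = placeholder
                mapping[placeholder] = value
            return reverse[value]
        treated = pattern.sub(substitute, treated)
    return treated, mapping


def restore(text, mapping):
    """Put the original values back into a treated prompt or model response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = ("Dr. Jane Smith (jane.smith@example.com, 555-123-4567) "
          "reports fever and a persistent cough.")
treated, mapping = redact(prompt)
print(treated)
```

In this sketch the symptom text passes through unchanged, so the treated prompt still carries the information needed for diagnosis. A user could inspect or edit `treated` before it is sent, and `restore` can map placeholders in the response back to the original values locally.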
Data Sources
We used multiple datasets to address the problem of privacy-preserving computation with LLMs.
- Symptom_to_diagnosis - https://huggingface.co/datasets/gretelai/symptom_to_diagnosis
- Names by gender - https://archive.ics.uci.edu/dataset/591/gender+by+name
- Race and ethnicity data for first, middle, and surnames - https://www.nature.com/articles/s41597-023-02202-2
- Data for: Demographic aspects of first names - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DV…
- Disease Symptoms and Patient Profile Dataset - https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient…