Technical Enablement of Safe and Distributed Data Collection
Problem & Motivation
In today’s digital economy, data is one of the most valuable assets, yet the individuals who generate it have little to no control over how it is used or who profits from it. Data sellers, particularly gig workers, students, and underrepresented communities, are routinely disenfranchised: they generate rich behavioral data every day, but third-party platforms extract that value without consent or compensation.
At the same time, data buyers like AI/ML companies, consultants, and market researchers struggle to access high-quality, ethically sourced, and context-rich datasets. The current landscape is dominated by opaque brokers and static data portals with little transparency or flexibility.
We’re building MIDS, an equitable framework and application that supports data ingestion and collection from both ends: sellers can contribute, control, and monetize their data, while buyers can efficiently search, filter, and collect what they need.
Data Source & Data Science Approach
We use synthetic data, built with Python Faker and FEC data, to simulate user-contributed datasets, including:
Voter records
Activity data (like Strava logs)
Demographics
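As a rough illustration of how simulated seller records of this kind can be produced, the sketch below generates voter-style and activity-style rows. It is not the project's generator: it uses only Python's standard `random` module as a stand-in for Python Faker, and every field name, value pool, and function name (`make_voter_record`, `make_activity_record`, `make_dataset`) is a hypothetical choice made for the example.

```python
import random

# Hypothetical value pools; the project itself used Python Faker and FEC data.
FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor"]
STATES = ["CA", "NY", "TX", "WA"]
PARTIES = ["DEM", "REP", "IND"]


def make_voter_record(rng):
    """Return one synthetic voter-style record (illustrative fields only)."""
    return {
        "name": rng.choice(FIRST_NAMES),
        "state": rng.choice(STATES),
        "party": rng.choice(PARTIES),
        "birth_year": rng.randint(1940, 2005),
    }


def make_activity_record(rng):
    """Return one synthetic activity entry, loosely shaped like a fitness log."""
    distance_km = round(rng.uniform(1.0, 20.0), 2)
    return {
        "activity": rng.choice(["run", "ride", "walk"]),
        "distance_km": distance_km,
        # Assumed pace of 4 to 8 minutes per kilometre.
        "duration_min": round(distance_km * rng.uniform(4.0, 8.0), 1),
    }


def make_dataset(n, seed=0):
    """Build n records of each kind from a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return {
        "voters": [make_voter_record(rng) for _ in range(n)],
        "activities": [make_activity_record(rng) for _ in range(n)],
    }


dataset = make_dataset(5, seed=42)
```

Seeding the generator keeps the simulated datasets reproducible, which helps when the same records feed repeated search and assembly tests.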
From a technical standpoint, MIDS integrates:
Schema-based ingestion to normalize incoming datasets
Automated metadata generation
Query-to-schema matching using LLMs and RAG pipelines
Vector search (via Faiss + OpenAI embeddings)
A Slice & Pack system to dynamically create custom dataset packages based on buyer demand
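The components listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the MIDS implementation: the schema, its aliases, and all function names are hypothetical, and a toy bag-of-words cosine similarity stands in for the OpenAI embeddings and Faiss index so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter

# Hypothetical target schema: field -> (accepted source aliases, type caster).
SCHEMA = {
    "state": (("state", "st", "region"), str),
    "age": (("age", "years"), int),
    "distance_km": (("distance_km", "dist", "km"), float),
}


def normalize(record, schema=SCHEMA):
    """Map a raw record onto the schema, casting values and dropping unknown keys."""
    lowered = {key.lower(): value for key, value in record.items()}
    out = {}
    for field, (aliases, cast) in schema.items():
        for alias in aliases:
            if alias in lowered:
                out[field] = cast(lowered[alias])
                break
    return out


def build_metadata(name, rows):
    """Summarize a normalized dataset: row count plus the fields actually present."""
    fields = sorted({field for row in rows for field in row})
    description = f"{name} dataset with fields " + ", ".join(fields)
    return {"name": name, "rows": len(rows), "fields": fields,
            "description": description}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, catalog):
    """Return the catalog entry whose description is most similar to the query."""
    q = embed(query)
    return max(catalog, key=lambda meta: cosine(q, embed(meta["description"])))


def slice_and_pack(rows, fields, predicate=lambda row: True):
    """Keep rows that pass the predicate, projected onto the requested fields."""
    return [{f: row[f] for f in fields if f in row}
            for row in rows if predicate(row)]


# Usage: ingest two raw sources, describe them, match a query, build a package.
raw_activity = [{"ST": "CA", "Dist": "5.2"}, {"st": "NY", "km": "12.0"}]
raw_people = [{"State": "CA", "Age": "34"}, {"region": "TX", "years": "51"}]
activity = [normalize(r) for r in raw_activity]
people = [normalize(r) for r in raw_people]
catalog = [build_metadata("activity", activity),
           build_metadata("demographics", people)]
best = search("distance_km by state", catalog)
package = slice_and_pack(activity, ["state", "distance_km"],
                         lambda row: row["distance_km"] > 10)
```

In a production setting the `embed` and `search` functions would presumably be replaced by calls to an embedding model and a vector index, while `normalize`, `build_metadata`, and `slice_and_pack` show only the general shape of each step.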
Evaluation
Evaluated query response consistency using ROUGE convergence scores
Tested buyer search functionality against schema match accuracy
Evaluated dataset assembly pipelines using end-to-end query testing
Conducted interviews with both buyers and potential data sellers to understand friction points
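One plausible way to score response consistency with ROUGE is sketched below. The exact definition of the "ROUGE convergence score" used in the evaluation is not given here, so this is an assumption: the sketch computes ROUGE-1 F1 (unigram overlap) between every pair of responses returned for the same query and averages the result, with values near 1.0 indicating that repeated runs converge on the same answer. The function names `rouge1_f1` and `consistency` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 (ROUGE-1) between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def consistency(responses):
    """Mean pairwise ROUGE-1 F1 across repeated responses to one query."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical responses to the same buyer query.
responses = [
    "voter records by state",
    "voter records by state",
    "activity logs by state",
]
score = consistency(responses)
```

Averaging over pairs rather than comparing against a single reference avoids having to pick one response as the ground truth.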
Key Learnings & Impact
Sellers feel excluded from the value chain: they’re data producers but not data participants.
Buyers want faster, cleaner access to data but spend excessive time parsing and cleansing it and verifying its sources.
MIDS bridges this gap by creating a two-sided marketplace: buyers gain control, customization, and clarity; sellers gain agency, transparency, and economic opportunity.
An equitable data economy is not only possible; it’s essential for responsible innovation in AI and analytics.
Acknowledgements
Project by:
Teyana Ildefonso – Market Research & Dataset Generation
Michael Edward Kalish – LLM Engineering & Architecture
Snehal Desai – ML Evaluation, Back-End Systems
Drew Piispanen – Front-End Developer & Data Workflow Design
Thanks to our mentors, peers, and the UC Berkeley MIDS faculty for their invaluable feedback.