MIDS Capstone Project Spring 2025

Technical Enablement of Safe and Distributed Data Collection

Team members

Problem & Motivation

In today’s digital economy, data is one of the most valuable assets, yet the individuals who generate it have little to no control over how it is used or who profits from it. Data sellers, particularly gig workers, students, and underrepresented communities, are routinely disenfranchised. They generate rich behavioral data every day, but third-party platforms extract that value without consent or compensation.

At the same time, data buyers like AI/ML companies, consultants, and market researchers struggle to access high-quality, ethically sourced, and context-rich datasets. The current landscape is dominated by opaque brokers and static data portals with little transparency or flexibility.

We’re building MIDS, an equitable framework and application that supports data ingestion and collection from both ends: sellers can contribute, control, and monetize their data, while buyers can efficiently search, filter, and collect what they need.

Data Source & Data Science Approach

We use synthetic data, built with Python Faker and FEC data, to simulate user-contributed datasets, including:

Voter records
Activity data (like Strava logs)
Demographics
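As a rough illustration of how simulated seller records like these can be produced, the sketch below generates activity-log rows with a fixed seed. It is not the project's generator: it uses the standard library's random module as a dependency-free stand-in for Python Faker, and every field name and value pool here is a hypothetical placeholder.

```python
import random

# Hypothetical value pools; the project's real schemas are not given in the source.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dara"]
ACTIVITIES = ["run", "ride", "swim"]
AGE_BANDS = ["18-24", "25-34", "35-44", "45+"]


def make_activity_record(rng: random.Random, seller_id: int) -> dict:
    """Build one fake activity-log entry (loosely Strava-like)."""
    return {
        "seller_id": seller_id,
        "name": rng.choice(FIRST_NAMES),
        "age_band": rng.choice(AGE_BANDS),
        "activity": rng.choice(ACTIVITIES),
        "distance_km": round(rng.uniform(1.0, 40.0), 1),
        "duration_min": rng.randint(10, 180),
    }


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n records; the same seed always yields the same dataset."""
    rng = random.Random(seed)
    return [make_activity_record(rng, seller_id=i) for i in range(n)]


records = make_dataset(5)
```

Seeding the generator keeps test datasets reproducible, which matters when the same synthetic data feeds both ingestion and search evaluation.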

From a technical standpoint, MIDS integrates:

Schema-based ingestion to normalize incoming datasets
Automated metadata generation
Query-to-schema matching using LLMs and RAG pipelines
Vector search (via Faiss + OpenAI embeddings)
A Slice & Pack system to dynamically create custom dataset packages based on buyer demand
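Two of the components above, query-to-schema matching and Slice & Pack, can be sketched in a few lines. This is a toy under stated assumptions, not the MIDS implementation: bag-of-words counts with cosine similarity stand in for OpenAI embeddings and a Faiss index, and the schema catalog, the function names, and the reading of Slice & Pack as "filter rows, then project fields" are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical catalog: schema name -> short description (e.g. generated metadata).
SCHEMAS = {
    "voter_records": "voter registration party precinct election turnout",
    "activity_logs": "running cycling activity distance duration pace workout",
    "demographics": "age income education household region population",
}


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_schemas(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank schemas by similarity to the buyer's query and return the top k."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc))) for name, desc in SCHEMAS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def slice_and_pack(rows: list[dict], fields: list[str], predicate) -> list[dict]:
    """Keep rows that pass the predicate, projected onto the requested fields."""
    return [{f: row[f] for f in fields} for row in rows if predicate(row)]


best = match_schemas("cycling distance and duration by age")

rows = [
    {"name": "Ana", "activity": "ride", "distance_km": 20.0},
    {"name": "Ben", "activity": "run", "distance_km": 5.0},
]
package = slice_and_pack(rows, ["activity", "distance_km"], lambda r: r["distance_km"] > 10)
```

In a production setting the linear scan in match_schemas would be replaced by a nearest-neighbor index, which is the role Faiss plays in the stack listed above.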

Evaluation

Evaluated query response consistency using ROUGE convergence scores
Tested buyer search functionality against schema match accuracy
Evaluated dataset assembly pipelines using end-to-end query testing
Conducted interviews with both buyers and potential data sellers to understand friction points
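The source does not define "ROUGE convergence scores", so the following is only one plausible reading: ask the same query several times and average the pairwise ROUGE-1 F1 between the responses, so that a score near 1 means the responses agree. The function names are hypothetical, and the ROUGE-1 here is a simplified whitespace-token version without stemming.

```python
from collections import Counter
from itertools import combinations


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def consistency(responses: list[str]) -> float:
    """Mean pairwise ROUGE-1 F1 across repeated responses to one query."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)


score = consistency([
    "activity logs match the query",
    "activity logs match the query",
    "demographics match the query",
])
```

Because ROUGE-1 F1 is symmetric in its two arguments, each unordered pair of responses only needs to be scored once.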

Key Learnings & Impact

Sellers feel excluded from the value chain: they are data producers but not data participants.
Buyers want faster, cleaner access to data but spend excessive time parsing and cleansing it and verifying sources.
MIDS bridges this gap by creating a two-sided marketplace: buyers gain control, customization, and clarity; sellers gain agency, transparency, and economic opportunity.
An equitable data economy is not only possible; it is essential for responsible innovation in AI and analytics.

Acknowledgements

Project by:

Teyana Ildefonso – Market Research & Dataset Generation
Michael Edward Kalish – LLM Engineering & Architecture
Snehal Desai – ML Evaluation, Back-End Systems
Drew Piispanen – Front-End Developer & Data Workflow Design

Thanks to our mentors, peers, and the UC Berkeley MIDS faculty for their invaluable feedback.

Course

Data Science 210: Capstone, Spring 2025

Class Project Gallery

More Information

Production Website

GitHub Repository

210-final-presentation-2-team-mids_0.pdf

Last updated: March 24, 2025