MIDS Capstone Project Spring 2025

Technical Enablement of Safe and Distributed Data Collection

Problem & Motivation

In today’s digital economy, data is one of the most valuable assets, yet the individuals who generate it have little to no control over how it is used or who profits from it. Data sellers, particularly gig workers, students, and underrepresented communities, are routinely disenfranchised. They generate rich behavioral data every day, but third-party platforms extract that value without consent or compensation.

At the same time, data buyers like AI/ML companies, consultants, and market researchers struggle to access high-quality, ethically sourced, and context-rich datasets. The current landscape is dominated by opaque brokers and static data portals with little transparency or flexibility.

We’re building MIDS, an equitable framework and application that supports data ingestion and collection from both ends: sellers can contribute, control, and monetize their data, while buyers can efficiently search, filter, and collect what they need.

Data Source & Data Science Approach

We simulate user-contributed datasets with synthetic data (built using Python Faker and FEC data), including:

  • Voter records

  • Activity data (like Strava logs)

  • Demographics
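The simulated sources listed above can be sketched roughly as follows. This is a minimal illustration that uses only Python's standard random module as a stand-in for Python Faker, so it runs without extra installs. All field names and value pools are hypothetical and are not the project's actual schemas.

```python
import random

# Hypothetical value pools; a real run would draw from Faker providers or FEC files.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dara"]
STATES = ["CA", "NY", "TX", "WA"]
ACTIVITIES = ["run", "ride", "swim"]


def make_seller_record(rng: random.Random, seller_id: int) -> dict:
    """Build one fake record covering the three categories listed above."""
    return {
        "seller_id": seller_id,
        "voter": {
            "first_name": rng.choice(FIRST_NAMES),
            "state": rng.choice(STATES),
            "registered": rng.random() < 0.8,
        },
        "activity": {
            "type": rng.choice(ACTIVITIES),
            "distance_km": round(rng.uniform(1.0, 40.0), 1),
        },
        "demographics": {"age": rng.randint(18, 90)},
    }


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n records; the seed keeps the simulated data reproducible."""
    rng = random.Random(seed)
    return [make_seller_record(rng, i) for i in range(n)]


records = make_dataset(5)
```

Seeding the generator means the same synthetic dataset can be regenerated for repeated tests, which is useful when downstream steps are compared across runs.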

From a technical standpoint, MIDS integrates:

  • Schema-based ingestion to normalize incoming datasets

  • Automated metadata generation

  • Query-to-schema matching using LLMs and RAG pipelines

  • Vector search (via Faiss + OpenAI embeddings)

  • A Slice & Pack system to dynamically create custom dataset packages based on buyer demand
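As a rough sketch of how the data path in the list above might fit together, the example below normalizes raw records against a schema, generates simple metadata, and assembles a buyer-specific package. The schema, the alias lists, and the function names normalize, describe, and slice_and_pack are all hypothetical; the LLM/RAG matching and the Faiss search are not shown here.

```python
from collections import Counter

# Hypothetical target schema: canonical field -> (accepted aliases, type to coerce to).
SCHEMA = {
    "state": (("state", "st", "State"), str),
    "age": (("age", "Age", "age_years"), int),
    "distance_km": (("distance_km", "km", "distance"), float),
}


def normalize(raw: dict) -> dict:
    """Map a raw record onto SCHEMA, coercing types; missing fields become None."""
    out = {}
    for field, (aliases, cast) in SCHEMA.items():
        value = next((raw[a] for a in aliases if a in raw), None)
        out[field] = cast(value) if value is not None else None
    return out


def describe(rows: list, fields: list) -> dict:
    """Generate simple metadata: row count and per-field fill rate."""
    n = len(rows)
    filled = Counter(f for r in rows for f in fields if r.get(f) is not None)
    return {
        "rows": n,
        "fill_rate": {f: (filled[f] / n if n else 0.0) for f in fields},
    }


def slice_and_pack(rows: list, fields: list, predicate) -> dict:
    """Keep rows passing the buyer's filter, project to the requested fields,
    and attach metadata describing the resulting package."""
    selected = [{f: r[f] for f in fields} for r in rows if predicate(r)]
    return {
        "fields": list(fields),
        "rows": selected,
        "metadata": describe(selected, fields),
    }


# Three raw records with inconsistent field names and types.
raw_records = [
    {"st": "CA", "Age": "34", "km": "5.2"},
    {"State": "NY", "age": 51},
    {"state": "CA", "age_years": 29, "distance": 12},
]
rows = [normalize(r) for r in raw_records]
catalog_metadata = describe(rows, list(SCHEMA))
package = slice_and_pack(rows, ["age", "distance_km"], lambda r: r["state"] == "CA")
```

Computing metadata on the packed subset, rather than reusing the catalog-level metadata, lets a buyer see the fill rates of exactly the rows they would receive.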

Evaluation

  • Evaluated query response consistency using ROUGE convergence scores

  • Tested buyer search functionality against schema match accuracy

  • Evaluated dataset assembly pipelines using end-to-end query testing

  • Conducted interviews with both buyers and potential data sellers to understand friction points
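The first two checks above could be approximated as in the sketch below. It assumes one plausible reading of "ROUGE convergence": the mean pairwise ROUGE-1 F1 between repeated responses to the same query. It also swaps in a toy bag-of-words cosine matcher in place of the Faiss + OpenAI embedding search so the example runs offline. The schema descriptions, labeled queries, and function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokens(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (ROUGE-1) between two strings."""
    c, r = tokens(candidate), tokens(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def consistency(responses: list) -> float:
    """Mean pairwise ROUGE-1 F1 across repeated responses to one query."""
    pairs = list(combinations(responses, 2))
    if not pairs:  # fewer than two responses: nothing to compare
        return 1.0
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_schema(query: str, schemas: dict) -> str:
    """Return the schema whose description is most similar to the query."""
    q = tokens(query)
    return max(schemas, key=lambda name: cosine(q, tokens(schemas[name])))


def match_accuracy(labeled_queries: list, schemas: dict) -> float:
    """Fraction of queries routed to their expected schema."""
    hits = sum(match_schema(q, schemas) == want for q, want in labeled_queries)
    return hits / len(labeled_queries)


# Hypothetical schema descriptions and hand-labeled buyer queries.
schemas = {
    "voter": "voter registration records state party",
    "activity": "activity logs run ride distance pace",
    "demographics": "age income household education",
}
labeled = [
    ("voter records by state", "voter"),
    ("ride distance logs", "activity"),
    ("age and income", "demographics"),
]
accuracy = match_accuracy(labeled, schemas)
stability = consistency(["voter records by state", "voter records by state"])
```

A consistency score near 1 indicates that repeated runs of the same query produce nearly identical wording, while the accuracy figure depends entirely on how the labeled query set is chosen.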

Key Learnings & Impact

  • Sellers feel excluded from the value chain: they’re data producers but not data participants.

  • Buyers want faster, cleaner access to data but spend excessive time parsing, cleansing, and verifying sources.

  • MIDS bridges this gap by creating a two-sided marketplace: buyers gain control, customization, and clarity; sellers gain agency, transparency, and economic opportunity.

  • An equitable data economy is not only possible; it’s essential for responsible innovation in AI and analytics.

Acknowledgements

Project by:

  • Teyana Ildefonso – Market Research & Dataset Generation

  • Michael Edward Kalish – LLM Engineering & Architecture

  • Snehal Desai – ML Evaluation, Back-End Systems

  • Drew Piispanen – Front-End Developer & Data Workflow Design

Thanks to our mentors, peers, and the UC Berkeley MIDS faculty for their invaluable feedback.

Last updated: March 24, 2025