MIDS Capstone Project Fall 2023

Orchestrate

Problem & Motivation

Music production by independent artists is at its highest level ever, yet many barriers to entry remain for aspiring musicians hoping to create songs or find interesting tracks to play along with. Manual music production requires learning music theory, developing technical proficiency, and investing in costly instruments. Even in the digital music space, steep learning curves confront those hoping to use the leading Digital Audio Workstations (DAWs) in their creative process.

At Orchestrate, our mission is to use machine learning and generative AI to break down those barriers and enable creative, interesting music creation from day one. Whether you are a professional or amateur musician looking for an accompaniment, a digital music creator looking for inspiration and easier production, or simply a music lover on the hunt for new and interesting songs to listen to, Orchestrate exists to make your vision a reality.

Orchestrate makes use of innovative NLP and autoregressive models to let users generate tracks in seconds from simple natural language prompts, putting the power of music creation directly in their hands.

Data Source & Data Science Approach

The Data

This project builds upon the foundation laid by the ComMU dataset and the modeling work carried out by Pozalabs. The dataset contains over 11,000 MIDI files written by professional composers, each annotated with twelve metadata attributes covering features such as instrument, key, time signature, and chord progression. To expand the dataset and tailor it to our specific use case, we drew upon the MetaMIDI database. While significantly larger and more balanced, the MetaMIDI database is built from files scraped from various web resources, so a wide range of preprocessing steps was required to integrate the MetaMIDI files into the ComMU dataset. In total, we added approximately 22,000 MetaMIDI files to the original ComMU dataset, roughly tripling the size of the final dataset.
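
As an illustration of the preprocessing involved, the sketch below extracts basic metadata from a MIDI file with the pretty_midi library and applies a simple compatibility screen; the specific attributes and thresholds are simplified assumptions, not our exact pipeline.

```python
# Minimal sketch of MIDI metadata extraction using pretty_midi.
# The attributes and filtering criteria are illustrative assumptions,
# not the exact preprocessing used for MetaMIDI integration.
import pretty_midi

def extract_metadata(path):
    pm = pretty_midi.PrettyMIDI(path)
    return {
        "num_instruments": len(pm.instruments),
        "programs": [inst.program for inst in pm.instruments if not inst.is_drum],
        "time_signatures": [(ts.numerator, ts.denominator)
                            for ts in pm.time_signature_changes],
        "key_signatures": [ks.key_number for ks in pm.key_signature_changes],
        "tempo_estimate": pm.estimate_tempo(),
        "duration_seconds": pm.get_end_time(),
    }

if __name__ == "__main__":
    meta = extract_metadata("example.mid")  # hypothetical file path
    # Keep only single-instrument, 4/4 files as a simple compatibility screen.
    if len(meta["programs"]) == 1 and (4, 4) in meta["time_signatures"]:
        print("candidate for ComMU-style integration:", meta)
```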

For the NLP modeling, we used ChatGPT to generate a database of sample prompts that users might provide, drawn from the features in our existing music database. This allowed an existing BERT language model to be fine-tuned to our particular use case.
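
A minimal sketch of this prompt-generation step is shown below, assuming the OpenAI Python client (v1 interface); the instruction wording and metadata fields are illustrative, not our exact setup.

```python
# Sketch: generating synthetic user prompts from music metadata with ChatGPT.
# Assumes the OpenAI Python client (v1 API); the metadata fields and instruction
# wording are illustrative, not the exact generation setup used for Orchestrate.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prompt_for(metadata: dict) -> str:
    instruction = (
        "Write one short, natural-sounding request a user might type to ask "
        f"for a track with these attributes: {metadata}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
        temperature=0.9,
    )
    return response.choices[0].message.content.strip()

example = {"instrument": "acoustic_piano", "key": "C major", "bpm": 90}
print(prompt_for(example))
```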

Data Science Approach

The resulting datasets described above were used to fine-tune the Transformer-XL architecture described in the paper "ComMU: Dataset for Combinatorial Music Generation." Since we were building upon an existing model, we performed several experiments to ensure that our model converged. The experiments centered on computational resources, expanding the base encodings, and hyperparameter tuning.

To align with the requirement of generating MIDI metadata from a text prompt for the music model, BERT was chosen for fine-tuning due to its encoder architecture, which is effective for text classification. The model selection and evaluation process involved exploring both multitask and single-task models. While the multitask model yielded mediocre performance, further fine-tuning was hindered by time and resource constraints. On the other hand, BERT single-task models exhibited strong performance but required parallelization in production to improve inference speed.
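
A minimal sketch of one such single-task classifier, fine-tuned with Hugging Face Transformers, is shown below; the file names, column names, label count, and hyperparameters are illustrative assumptions rather than our production configuration.

```python
# Sketch: fine-tuning one BERT single-task classifier (e.g., instrument prediction).
# The CSV files, "prompt"/"label" columns, label count, and hyperparameters are
# illustrative assumptions, not the production Orchestrate configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=8)

# Hypothetical files: each row holds a generated text prompt and its task label.
data = load_dataset("csv", data_files={"train": "prompts_train.csv",
                                       "test": "prompts_test.csv"})

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True, padding="max_length",
                     max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-single-task", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
print(trainer.evaluate())
```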

For tasks where BERT faced challenges, ChatGPT was enlisted. Its generative capability proved effective for tasks characterized by severe class imbalance or a multitude of labels. Consequently, the Orchestrate NLP process leverages the strengths of both BERT and ChatGPT, depending on each model's proficiency in specific tasks. In essence, the Orchestrate NLP component processes a text input through BERT for classification tasks and ChatGPT for generative tasks. The resulting output contains metadata used by the music model to generate music.
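
Conceptually, the component acts as a dispatcher over the eight tasks. The sketch below illustrates that flow; the task split and the helper functions (classify_with_bert, generate_with_chatgpt) are hypothetical stand-ins, not the actual Orchestrate code.

```python
# Conceptual sketch of the Orchestrate NLP dispatcher. The task split and the
# helper functions (classify_with_bert, generate_with_chatgpt) are hypothetical
# stand-ins for the fine-tuned BERT classifiers and ChatGPT calls described above.
CLASSIFICATION_TASKS = ["instrument", "genre", "key", "time_signature"]  # assumed split
GENERATIVE_TASKS = ["chord_progression"]                                 # assumed split

def build_music_metadata(user_prompt, classify_with_bert, generate_with_chatgpt):
    metadata = {}
    for task in CLASSIFICATION_TASKS:
        metadata[task] = classify_with_bert(task, user_prompt)     # BERT: classification
    for task in GENERATIVE_TASKS:
        metadata[task] = generate_with_chatgpt(task, user_prompt)  # ChatGPT: generation
    return metadata  # consumed by the music generation model
```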

Evaluation

NLP Model

The Orchestrate NLP component integrates BERT single-task models and ChatGPT (gpt-3.5-turbo as of December 2023) to perform eight tasks that predict metadata for the music model. For a quantitative evaluation, the macro average was calculated across tasks for accuracy, F1-score, precision, and recall using a dedicated test dataset. This dataset includes labels for each task and corresponding text prompts, providing a robust basis for evaluation.
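
For illustration, the sketch below computes macro-averaged metrics for a single task with scikit-learn; the toy labels stand in for the held-out test prompts and per-task labels.

```python
# Sketch: macro-averaged evaluation of one NLP task with scikit-learn.
# The toy labels below stand in for the held-out test prompts and per-task labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["piano", "guitar", "piano", "drums"]
y_pred = ["piano", "piano", "piano", "drums"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```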

In addition to quantitative assessments, we conducted a qualitative evaluation of the Orchestrate NLP component. This involved gathering insights from users and focusing on their feedback regarding model inputs and outputs. Our qualitative approach ensures a thorough understanding of both the quantitative performance metrics and the real-world user experience with the model.

Music Generation Model

The ComMU music generation model is trained with a negative log-likelihood objective, i.e., the log-probability of the next event in a sequence given all preceding events. We reserved 10% of the overall dataset for validation and measured the validation negative log-likelihood every 1,000 training iterations to track convergence over time.
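
As a point of reference, the sketch below computes a per-token negative log-likelihood with PyTorch cross-entropy, which is equivalent to the loss described above; the vocabulary size and tensor shapes are illustrative placeholders.

```python
# Sketch: token-level negative log-likelihood for next-event prediction.
# The logits would come from the autoregressive model (Transformer-XL in ComMU);
# the vocabulary size and tensor shapes here are illustrative only.
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 700, 4, 128
logits = torch.randn(batch, seq_len, vocab_size)           # model outputs
targets = torch.randint(0, vocab_size, (batch, seq_len))   # next-event ids

nll = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print("validation NLL:", nll.item())
```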

After training, we used Fréchet Audio Distance (FAD) as an objective metric. FAD was computed per instrument category to evaluate how close each generated song is to a reference repository of songs for that instrument. As a subjective metric, we reached out to a number of users, had them try Orchestrate, and collected their feedback.
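
FAD fits a multivariate Gaussian to embeddings of reference and generated audio and measures the Fréchet distance between them. The sketch below computes that distance from precomputed embeddings (e.g., VGGish features); the embedding model and data here are assumptions, not a statement of our exact tooling.

```python
# Sketch: Fréchet Audio Distance between two sets of audio embeddings
# (e.g., VGGish features of a reference instrument set vs. generated songs).
# The embedding extraction step is omitted and assumed to happen elsewhere.
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random embeddings standing in for real features.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(200, 128)), rng.normal(size=(200, 128))))
```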

Key Learnings & Impacts

NLP Model

The primary challenges encountered during development of the NLP component revolved around computational resources, data availability, and data imbalance. Despite attempts to pre-train a BERT model on music-domain data using parameter-efficient fine-tuning (PEFT with LoRA), convergence proved difficult due to a lack of masked language model (MLM) music data and computational limitations. Another undertaking involved training a BERT multitask model on eight tasks, which yielded mediocre performance, and the exploration of different parameters for the multitask model was cut short by a shortage of compute resources.
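
The attempted domain pre-training looked roughly like the sketch below, which attaches LoRA adapters to a BERT masked-language model with the peft library; the rank, alpha, and target modules shown are assumptions rather than our exact configuration.

```python
# Sketch: parameter-efficient (LoRA) adaptation of BERT for masked-language
# modeling on music-domain text. The rank, alpha, and target modules are assumed
# values for illustration; the music MLM corpus itself is not shown.
from transformers import AutoModelForMaskedLM
from peft import LoraConfig, get_peft_model

base = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["query", "value"])
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```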

At the start of the NLP modeling process, the absence of text prompt data posed a challenge for the team. The initial plan to manually generate prompts for thousands of music metadata data points was deemed impractical due to time constraints. To address this, ChatGPT was employed to automatically generate prompts based on provided labels. Upon generating the prompt dataset, it became apparent that several tasks exhibited varying degrees of class imbalance. To mitigate this, downsampling was applied to the data, minimizing imbalances. However, this approach also resulted in a reduction of the number of classes available for the model to learn.
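
A minimal sketch of the downsampling step is shown below, assuming the generated prompts live in a pandas DataFrame; the column names and minimum-count cutoff are illustrative.

```python
# Sketch: downsampling a prompt dataset to reduce class imbalance for one task.
# The "label"/"prompt" column names and the minimum-count cutoff are assumptions.
import pandas as pd

def downsample(df, label_col="label", min_count=50, seed=42):
    counts = df[label_col].value_counts()
    kept = counts[counts >= min_count].index      # drop very rare classes entirely
    df = df[df[label_col].isin(kept)]
    floor = df[label_col].value_counts().min()    # cap every class at the smallest size
    return (df.groupby(label_col)
              .sample(n=floor, random_state=seed)
              .reset_index(drop=True))
```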

Additionally, the BERT model struggled to predict the chords a user wants, as there were thousands of possible labels and combinations with very few samples for each. As an alternative, we complemented BERT training and inference with ChatGPT's text generation capabilities, which proved effective in handling both class imbalance and an abundance of labels. While we faced and overcame many obstacles in developing the Orchestrate NLP component, its impact is clear: it empowers users to craft creative prompts for generating the music they want. Looking ahead, there are opportunities to enrich the NLP training datasets with more diverse prompts, genres, and instrument types to improve model generalizability.

Music Generation Model

For the music generator component, our team aimed to build upon the existing ComMU model by supplying additional data and enhancing the model's ability to learn and generate songs across different genres, instruments, and key signatures. However, we faced several challenges along the way. First, AWS Elastic Compute Cloud (EC2) instances do not offer the same consumer-grade Nvidia RTX 3090 GPUs specified in the ComMU article. To minimize training cost, we opted for a similar configuration with server-grade Nvidia A10 GPUs. In terms of raw power, the server-grade GPUs are estimated to be roughly 2x more powerful than the consumer-grade ones. This increase in performance showed up in one of our experiments, where training the same model specified by the ComMU authors converged to its best validation negative log-likelihood at approximately 3,000 iterations, 2x faster than the reported 6,000 iterations.

Second, the additional MetaMIDI data we obtained expanded the original ComMU base encoding by 25%, which resulted in a non-convergence problem. To overcome this challenge, we opted to filter out music files that contain rare metadata events or fall outside the original ComMU encoding ranges.
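
The filtering step can be pictured as a simple range check over each file's extracted metadata; the fields and allowed ranges in the sketch below are illustrative placeholders, not the actual ComMU encoding bounds.

```python
# Sketch: screening MetaMIDI files against ComMU-style encoding ranges.
# The fields and allowed ranges below are illustrative placeholders, not the
# actual ComMU encoding bounds.
ALLOWED_BPM = range(40, 200)
ALLOWED_VELOCITY = range(1, 128)
ALLOWED_PITCH = range(21, 109)   # piano range as a stand-in

def in_encoding_range(meta):
    return (meta["bpm"] in ALLOWED_BPM
            and all(v in ALLOWED_VELOCITY for v in meta["velocities"])
            and all(p in ALLOWED_PITCH for p in meta["pitches"]))

candidates = []   # (path, metadata) pairs produced by the extraction step
filtered = [(path, meta) for path, meta in candidates if in_encoding_range(meta)]
```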

Finally, after adding the filtered MetaMIDI data, we noticed that the model took longer to converge to a low validation negative log-likelihood. We ran two parallel experiments: one with a larger number of iterations, and another in which we reset the learning rate every 20,000 iterations. Resetting the learning rate every 20,000 iterations allowed the model to converge approximately 15,000 iterations faster than simply training for a longer period. In the end, we sufficiently addressed the non-convergence problem by first filtering the MetaMIDI dataset to more closely resemble ComMU and then applying these training techniques, yielding a model that can take in keywords and generate songs that reflect them. This music generation model, together with the aforementioned NLP model, enables us to deliver Orchestrate to our end users.
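
The learning-rate reset can be approximated with a warm-restart schedule; the sketch below uses PyTorch's CosineAnnealingWarmRestarts every 20,000 iterations as a stand-in, since the exact scheduler and model in the ComMU training code are not reproduced here.

```python
# Sketch: periodically resetting the learning rate during long fine-tuning runs.
# CosineAnnealingWarmRestarts stands in for a manual reset every 20,000 iterations;
# the tiny placeholder model and dummy loss are for illustration only.
import torch

model = torch.nn.Linear(512, 512)                  # placeholder for Transformer-XL
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20_000)                         # learning rate restarts every 20k steps

for step in range(60_000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # follows the cosine warm-restart cycle
```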

Impact

Music production by independent artists is at its highest level ever and has been growing consistently year over year. This do-it-yourself segment is currently the fastest growing segment of the music industry, meaning there is an increasing appetite among musicians to produce music independently, using approaches beyond legacy methods. We believe Orchestrate can help foster this growth, both by supporting existing musicians in incorporating instruments and styles that were previously difficult to include in their compositions and by inspiring aspiring musicians with limited musical knowledge to participate in the music creation process.

Opportunities for future growth

We have identified several key additions and enhancements that we believe would help grow and refine Orchestrate. One key enhancement would be voice prompting, allowing users to simply speak to the tool to request music instead of typing; this would expand accessibility and convenience for our user base. Additionally, incorporating multi-instrument accompaniments would greatly improve the user experience by letting users build a full band to jam with. Finally, continual improvements on the data side will be necessary to ensure the model keeps improving while incorporating additional instruments and music styles.

Acknowledgments

We would like to close by thanking our course professors, Puya Vahabi and Daniel Aranki, who both provided invaluable assistance and feedback throughout the project.

We would also like to thank the following groups and individuals who played a key role in bringing Orchestrate to fruition:

  • James York-Winegar | UC Berkeley MIDS 255 Course Lead

  • Prabhu Narsina | 210 Teaching Assistant

  • Dennis Brown | Audio Engineer

  • My Young | Lead Singer of JupeJupe

  • Jay Prakash | MIDS Student

  • Pozalabs Research Team (ComMU)

  • MetaMIDI Team

Last updated: December 13, 2023