1. Introduction and research problem analysis
We aim to transcribe raw audio into a more structured symbolic representation such as MIDI. In the technical literature this task is called Automatic Music Transcription (AMT). Converting audio recordings of music into a symbolic form makes many tasks in music information retrieval easier to accomplish, yet it is a hard task even for humans.
There are several factors that make AMT difficult:
- Polyphonic music contains a mixture of multiple simultaneous sources.
- Overlapping sound events often exhibit harmonic relations with each other.
- The timing of musical voices is governed by the regular metrical structure of the music.
- The annotation of ground-truth transcriptions for polyphonic music is very time consuming and requires high expertise.
The last factor is especially limiting, because it restricts the amount of data available for training models.
To limit the scope of the research, we will mainly concern ourselves with transcribing polyphonic piano recordings. A standard piano can play 88 pitches in many combinations, and that complexity makes it a prime research subject for polyphonic AMT.
State-of-the-art machine learning architectures for this task take two approaches. The first is similar to speech recognition and consists of two parts: an acoustic model and a music language model. The acoustic model estimates the probability of a pitch being present in a frame of audio. The music language model is analogous to the language models used in natural language processing and predicts the probability of a pitch being played given the context of the previous pitches and common conventions of composing music. The second approach also combines two networks: one network detects note onsets and its output is used to inform a second network, which focuses on detecting note lengths. The task has also been approached with other techniques, such as Non-negative Matrix Factorization, but we are going to focus on neural networks.
Acoustic models usually use RNNs because they can capture both long-term and short-term temporal patterns in music. CNNs, however, provide additional benefits: music is created not only by changing the pitch of notes but also by changing the temporal "distance" between them, and CNNs are very good at preserving and recognising such spatial features. In the second approach, both networks consist of convolutional layers that process the input and RNN layers that do the inference, connected so that the output of the onset network informs the RNN part of the second network. A minimal sketch of such a CNN+RNN acoustic model is given below.
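To make the CNN+RNN idea concrete, here is a minimal sketch of a frame-level acoustic model in Keras. The input shape, layer sizes, and constants (N_FRAMES, N_BINS) are illustrative assumptions, not the exact architecture from [1] or [4].

```python
# Minimal sketch of a CNN + RNN acoustic model for frame-level pitch detection.
# Shapes and hyperparameters are illustrative assumptions, not taken from [1] or [4].
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES = 625   # assumed number of spectrogram frames per training segment
N_BINS = 229     # assumed number of frequency bins (e.g. mel bands)
N_PITCHES = 88   # piano pitches A0..C8

inputs = layers.Input(shape=(N_FRAMES, N_BINS, 1))

# CNN part: learns local spectro-temporal patterns (harmonic stacks, onsets).
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((1, 2))(x)   # pool only along frequency, keep time resolution
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((1, 2))(x)

# Flatten the frequency axis so each time step becomes one feature vector.
x = layers.Reshape((N_FRAMES, -1))(x)

# RNN part: models short- and long-term temporal dependencies between frames.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Per-frame multi-label output: probability of each of the 88 pitches being active.
outputs = layers.Dense(N_PITCHES, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

The sigmoid output with binary cross-entropy reflects the multi-label nature of the problem: several pitches can sound in the same frame.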
The input
A music file that has to be preprocessed in some way (usually through frequency-domain transformations) into a form processable by a CNN; a sketch of such preprocessing is shown below.
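As an illustration of the preprocessing meant here, the following sketch uses librosa to turn a .wav file into a log-scaled mel spectrogram. The sample rate, hop length, number of mel bands, and the file path are our assumptions, chosen only for illustration.

```python
# Sketch of input preprocessing: raw audio -> log-scaled mel spectrogram frames.
# Parameter values (sample rate, hop length, number of mel bands) are assumptions.
import librosa
import numpy as np

def audio_to_frames(path, sr=16000, hop_length=512, n_mels=229):
    """Load an audio file and return a (num_frames, n_mels) matrix of log-mel features."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # transpose so that time is the first axis

# Example usage (file name is hypothetical):
# frames = audio_to_frames("MAPS_MUS-alb_se2_AkPnBcht.wav")
# frames.shape -> (num_frames, 229)
```

With sr=16000 and hop_length=512 the resulting frame rate is 31.25 frames per second, which is the rate assumed in the post-processing sketch further down.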
The output
A series of probability vectors representing the sounds present in each frame, ready to be transformed into a MIDI file; a sketch of this post-processing step is shown below.
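A minimal sketch of that last step, assuming a (num_frames, 88) matrix of frame-level probabilities. The threshold, frame rate, velocity, and the use of pretty_midi are our assumptions, not part of the referenced papers.

```python
# Sketch of post-processing: frame-level pitch probabilities -> MIDI file.
# Threshold, frame rate, and velocity are assumptions chosen for illustration.
import numpy as np
import pretty_midi

def probabilities_to_midi(probs, frame_rate=31.25, threshold=0.5, out_path="output.mid"):
    """probs: (num_frames, 88) array of per-frame pitch probabilities."""
    active = probs >= threshold                     # binarize frame activations
    midi = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)       # acoustic grand piano
    for pitch in range(88):
        onset = None
        for frame in range(active.shape[0]):
            if active[frame, pitch] and onset is None:
                onset = frame                       # note starts
            elif not active[frame, pitch] and onset is not None:
                piano.notes.append(pretty_midi.Note(
                    velocity=80,
                    pitch=pitch + 21,               # MIDI note 21 = A0, lowest piano key
                    start=onset / frame_rate,
                    end=frame / frame_rate))
                onset = None                        # note ended
        if onset is not None:                       # note still active at the end
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=pitch + 21,
                start=onset / frame_rate, end=active.shape[0] / frame_rate))
    midi.instruments.append(piano)
    midi.write(out_path)
```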
List of publications/articles/blog posts used for problem
[:)] How to Read a Paper: https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf
[1] An End-to-End Neural Network for Polyphonic Piano Music Transcription
[2] Music Transcription by Deep Learning with Data and "Artificial Semantic" Augmentation
[3] Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription
[4] Onsets and Frames: Dual-Objective Piano Transcription (page of the newer solution, which seems simpler and more effective)
[5] https://github.com/ybayle/awesome-deep-learning-music
[6] https://drive.google.com/file/d/0B1OooSxEtl0FcTBiOGdvSTBmWnc/view
List of repositories/code examples that will be used for implementation
[1] https://github.com/IraKorshunova/folk-rnn
[2] https://github.com/BShakhovsky/PolyphonicPianoTranscription
[3] https://github.com/wgxli/piano-transcription
Framework selected for implementation
TensorFlow with Keras - most likely, since we have the most experience with it (Keras above all). PyTorch - only if some component turns out to be easy to implement with it.
Dataset to be used for experiments
- MAPS: used in [1]; "31 GB of CD-quality recordings in .wav format"; "The ground truth is provided for all sounds, in MIDI and text formats. The audio was generated from the ground truth in order to ensure the accuracy of the annotation."
- MAESTRO: "172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms", "based on recordings from the International Piano-e-Competition".
- MusicNet: "a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition"; we will probably have to filter this dataset to keep only the recordings with piano labels.
Notes: a nice description of four datasets is available; [2] describes lossless & lossy augmentation of spectrograms. Since all of these datasets provide aligned MIDI annotations, a sketch of turning them into frame-level labels is given below.
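Because each dataset pairs audio with aligned MIDI ground truth, the annotations can be rasterised into per-frame pitch labels matching the model output. The sketch below uses pretty_midi; the file name is hypothetical, and the frame rate and 88-key pitch range match the assumptions used in the earlier sketches.

```python
# Sketch of turning a ground-truth MIDI annotation into per-frame pitch labels.
# File path and frame rate are assumptions; the pitch range covers the 88 piano keys.
import numpy as np
import pretty_midi

def midi_to_labels(path, frame_rate=31.25):
    """Return a (num_frames, 88) binary matrix: 1 where a pitch sounds in a frame."""
    midi = pretty_midi.PrettyMIDI(path)
    roll = midi.get_piano_roll(fs=frame_rate)       # shape (128, num_frames), velocity values
    roll = roll[21:109, :]                          # keep MIDI pitches 21..108 (A0..C8)
    return (roll.T > 0).astype(np.float32)          # binarize and put time on the first axis

# Example usage (file name is hypothetical):
# labels = midi_to_labels("MAPS_MUS-alb_se2_AkPnBcht.mid")
# labels.shape -> (num_frames, 88)
```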