Add readme

Kacper Donat 2020-05-23 22:36:07 +02:00
parent 2649c6eb3f
commit f3d9222dae
3 changed files with 74 additions and 52 deletions


@@ -1,5 +1,5 @@
 image: Dockerfile
 	docker build . --tag transcription:latest
-raport.pdf: README.md
+raport.pdf: raport.md
 	pandoc -f markdown-implicit_figures -V geometry:margin=1in $^ -o $@


@@ -1,56 +1,22 @@
# AGU - project
Base project: [](https://github.com/tensorflow/magenta/tree/master/magenta/models/onsets_frames_transcription)
## Requirements
- docker
- OR a configured magenta environment - [](https://github.com/tensorflow/magenta)
## Running in Docker
To use the Docker image it first has to be built. The easiest way is to use the included Makefile:
```
$ make image # runs docker build . --tag transcription:latest
```
This builds the `transcription:latest` image. Now we can use it:
```
$ docker run -v "$(pwd):/root/experiment" -it transcription:latest
```
This should put us inside a container with the environment configured and the checkpoint downloaded. The directory we are in should be mounted in the image as `~/experiment`.

raport.md Normal file

@@ -0,0 +1,56 @@
# 1. Introduction and research problem analysis
We aim to transcribe raw audio into some more structured representation such as MIDI. In the technical literature this is called _Automatic Music Transcription_ (AMT). Converting audio recordings of music into a symbolic form makes many tasks in music information retrieval easier to accomplish. It is a hard task even for humans.
There are several factors that make AMT difficult:
1. Polyphonic music contains a mixture of multiple simultaneous sources.
2. Overlapping sound events often exhibit harmonic relations with each other.
3. The timing of musical voices is governed by the regular metrical structure of the music.
4. The annotation of ground-truth transcriptions for polyphonic music is very time-consuming and requires high expertise.
Especially the last factor is limiting, as it restricts the number of datasets available for training models.
To limit the scope of the research, we are going to concern ourselves mainly with transcribing polyphonic piano recordings. A standard piano can play 88 pitches in many combinations, and that complexity makes it a prime research subject for polyphonic AMT.
State-of-the-art machine learning architectures used for this task take two approaches. The first, similar to speech recognition, consists of two parts: an _acoustic model_ and a _music language model_. The acoustic model estimates the probability of a pitch being present in a frame of audio. The music language model is analogous to the language models used in natural language processing and predicts the probability of a pitch being played given the context of previous pitches and common conventions of composing music.
The second approach also combines two networks: one network detects note onsets, and its output is used to inform a second network, which focuses on detecting note lengths. The task has also been tackled with other techniques such as Non-Negative Matrix Factorization, but we are going to focus on neural networks.
Acoustic models usually use RNNs because they can capture long-term and short-term temporal patterns in music. CNNs, however, also provide benefits: music is built not only by changing the pitch of notes but also by changing the temporal "distance" between them, and CNNs are very good at preserving and recognising such spatial features. In the second approach, both networks consist of convolutional layers that process the input and RNNs that do the inference, connected so that the output of the first network can inform the RNN part of the second one.
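As a rough illustration of the acoustic-model idea, below is a minimal Keras sketch of a frame-wise CNN + RNN model. This is our own sketch, not the architecture of any cited paper; the input shape, layer sizes and training settings are assumptions.
```
# Minimal sketch of a CNN + RNN acoustic model for frame-wise pitch estimation.
# All shapes and hyperparameters below are illustrative assumptions.
from tensorflow.keras import layers, models

N_FRAMES = 625    # frames per training excerpt (assumption)
N_BINS = 229      # spectrogram bins per frame (assumption)
N_PITCHES = 88    # piano keys

def build_acoustic_model():
    spec = layers.Input(shape=(N_FRAMES, N_BINS, 1))
    # convolutional part: local time-frequency patterns
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(spec)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    # flatten the frequency axis, keep the time axis for the recurrent part
    x = layers.Reshape((N_FRAMES, -1))(x)
    # recurrent part: short- and long-term temporal context
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # 88 independent per-frame pitch probabilities
    out = layers.Dense(N_PITCHES, activation="sigmoid")(x)
    model = models.Model(spec, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```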
### The input
A music file that has to be preprocessed in some way (usually through frequency-domain transformations) into a form processable by a CNN.
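For instance, a log-scaled mel spectrogram is one common choice of frequency-domain representation. A sketch assuming the librosa library, with placeholder parameter values we have not tuned:
```
# Sketch: audio file -> log-mel spectrogram frames for a CNN.
# librosa and the parameter values are assumptions, not final choices.
import librosa
import numpy as np

def audio_to_model_input(path, sr=16000, hop_length=512, n_mels=229):
    y, sr = librosa.load(path, sr=sr)            # mono waveform, resampled
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)           # compress dynamic range
    # transpose to (n_frames, n_mels) and add a channel axis for the CNN
    return log_mel.T[..., np.newaxis]
```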
### The output
A series of probability vectors representing the sounds present in each frame, ready to be transformed into a MIDI file.
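A rough sketch of this last step, assuming a fixed decision threshold and the pretty_midi library (both are our assumptions, not part of the referenced models):
```
# Sketch: frame-wise pitch probabilities -> MIDI file.
# Threshold, frame duration and velocity are illustrative assumptions.
import pretty_midi

def probabilities_to_midi(probs, frame_time=0.032, threshold=0.5, path="out.mid"):
    """probs: array of shape (n_frames, 88) with values in [0, 1]."""
    active = probs >= threshold
    piano = pretty_midi.Instrument(program=0)     # acoustic grand piano
    for key in range(88):
        onset = None
        for frame in range(active.shape[0]):
            if active[frame, key] and onset is None:
                onset = frame                     # note starts
            elif not active[frame, key] and onset is not None:
                piano.notes.append(pretty_midi.Note(
                    velocity=80, pitch=key + 21,  # MIDI note 21 = A0
                    start=onset * frame_time, end=frame * frame_time))
                onset = None                      # note ends
        if onset is not None:                     # note still active at the end
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=key + 21,
                start=onset * frame_time, end=active.shape[0] * frame_time))
    midi = pretty_midi.PrettyMIDI()
    midi.instruments.append(piano)
    midi.write(path)
```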
## List of publications/articles/blog posts used for problem analysis
[:)] https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf
[1] [An End-to-End Neural Network for Polyphonic Piano Music Transcription](https://arxiv.org/pdf/1508.01774.pdf)
[2] [Music Transcription by Deep Learning with Data and “Artificial Semantic” Augmentation understanding](https://arxiv.org/pdf/1712.03228.pdf)
[3] [Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf)
[4] [Onsets and Frames: Dual-Objective Piano Transcription](https://storage.googleapis.com/magentadata/papers/onsets-frames/index.html) <- page of a newer approach that seems simpler and more effective
[5] [https://github.com/ybayle/awesome-deep-learning-music](https://github.com/ybayle/awesome-deep-learning-music)
[6] https://drive.google.com/file/d/0B1OooSxEtl0FcTBiOGdvSTBmWnc/view
## List of repositories/code examples that will be used for implementation
[1] https://github.com/IraKorshunova/folk-rnn
[2] https://github.com/BShakhovsky/PolyphonicPianoTranscription
[3] https://github.com/wgxli/piano-transcription
## Framework selected for implementation
TensorFlow with Keras - probably? Hard to say, but we have the most experience with it.
Keras über alles.
PyTorch - only if it looks like something can be done with it with low effort.
## Datasets to be used for experiments
[MAPS](http://www.tsi.telecom-paristech.fr/aao/en/2010/07/08/maps-database-a-piano-database-for-multipitch-estimation-and-automatic-transcription-of-music/): used in [1], "31 GB of CD-quality recordings in .wav format", "The ground truth is provided for all sounds, in MIDI and text formats. The audio was generated from the ground truth in order to ensure the accuracy of the annotation."
[MAESTRO](https://magenta.tensorflow.org/maestro-wave2midi2wave): "172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.", "based on recordings from the International Piano-e-Competition"
[MusicNet](https://homes.cs.washington.edu/~thickstn/musicnet.html): "a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition", we probably have to filter this dataset in order to acquire music recordings with piano labels only.
[A nice description of four datasets](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf).
[2] describes lossless & lossy augmentation of spectrograms.
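As a loose illustration of what audio-level augmentation could look like before the spectrograms are computed, here is a small sketch; the transforms and parameters are our assumptions and do not reproduce what [2] does.
```
# Sketch: simple augmentations applied to the waveform before the spectrogram.
# These are generic examples, not the exact transforms from [2].
import librosa
import numpy as np

def augment_waveform(y, sr):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # labels must be transposed too
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # labels must be re-timed too
    noisy = y + 0.005 * np.random.randn(len(y))                 # additive noise
    return [shifted, stretched, noisy]
```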
# 2. Methodology: dataset, tools, experiments
# 3. Experiment results and discussion