Add readme

Kacper Donat 2020-05-23 22:36:07 +02:00
parent 2649c6eb3f
commit f3d9222dae
3 changed files with 74 additions and 52 deletions


@@ -1,5 +1,5 @@
 image: Dockerfile
 	docker build . --tag transcription:latest
-raport.pdf: README.md
+raport.pdf: raport.md
 	pandoc -f markdown-implicit_figures -V geometry:margin=1in $^ -o $@


@@ -1,56 +1,22 @@
# AGU - project
Base project: [](https://github.com/tensorflow/magenta/tree/master/magenta/models/onsets_frames_transcription)
## Requirements
- docker
- OR a configured magenta environment - [](https://github.com/tensorflow/magenta)
## Running in Docker
To use the Docker image it first has to be built. The easiest way is to use the included Makefile:
```
$ make image # runs docker build . --tag transcription:latest
```
This builds the `transcription:latest` image. Now we can use it:
```
$ docker run -v "$(pwd):/root/experiment" -it transcription:latest
```
This should put us inside a container with the environment configured and the checkpoint downloaded. The directory we are in should be mounted in the image as `~/experiment`.

raport.md Normal file

@@ -0,0 +1,56 @@
# 1. Introduction and research problem analysis
We aim to transcribe raw audio into some more structured representation such as MIDI. In the technical literature this is called _Automatic Music Transcription_ (AMT). Converting audio recordings of music into a symbolic form makes many tasks in music information retrieval easier to accomplish. It is a hard task even for humans.
There are several factors that make AMT difficult:
1. Polyphonic music contains a mixture of multiple simultaneous sources.
2. Overlapping sound events often exhibit harmonic relations with each other.
3. The timing of musical voices is governed by the regular metrical structure of the music.
4. The annotation of ground-truth transcriptions for polyphonic music is very time-consuming and requires high expertise.
Especially the last factor is limiting, as it restricts the number of datasets available for training models.
To limit the scope of the research, we are going to concern ourselves mainly with transcribing polyphonic piano recordings. A standard piano can play 88 pitches in many combinations, and that complexity makes it a prime research subject for polyphonic AMT.
State-of-the-art machine learning architectures used for this task take two approaches. The first, similar to speech recognition, consists of two parts: an _acoustic model_ and a _music language model_. The acoustic model estimates the probability of a pitch being present in a frame of audio. The music language model is analogous to the language models used in natural language processing and predicts the probability of a pitch being played given the context of previous pitches and common conventions of composing music.
The second approach also combines two networks: one network detects note onsets, and its output is used to inform a second network, which focuses on detecting note lengths. The task has also been tackled with other techniques such as Non-Negative Matrix Factorization, but we are going to focus on neural networks.
Acoustic models usually use RNNs because they can capture long-term and short-term temporal patterns in music. CNNs, however, also provide benefits: music is built not only by changing the pitch of notes but also by changing the temporal "distance" between them, and CNNs are very good at preserving and recognising such spatial features. In the second approach, both networks consist of convolutional layers that process the input and RNNs that do the inference, connected so that the output of the first network can inform the RNN part of the second one.
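As a rough illustration of the acoustic-model idea, below is a minimal Keras sketch of a frame-wise CNN + RNN model. This is our own sketch, not the architecture of any cited paper; the input shape, layer sizes and training settings are assumptions.
```
# Minimal sketch of a CNN + RNN acoustic model for frame-wise pitch estimation.
# All shapes and hyperparameters below are illustrative assumptions.
from tensorflow.keras import layers, models

N_FRAMES = 625    # frames per training excerpt (assumption)
N_BINS = 229      # spectrogram bins per frame (assumption)
N_PITCHES = 88    # piano keys

def build_acoustic_model():
    spec = layers.Input(shape=(N_FRAMES, N_BINS, 1))
    # convolutional part: local time-frequency patterns
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(spec)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    # flatten the frequency axis, keep the time axis for the recurrent part
    x = layers.Reshape((N_FRAMES, -1))(x)
    # recurrent part: short- and long-term temporal context
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # 88 independent per-frame pitch probabilities
    out = layers.Dense(N_PITCHES, activation="sigmoid")(x)
    model = models.Model(spec, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```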
### The input
A music file that has to be preprocessed in some way (usually through frequency-domain transformations) into a form processable by a CNN.
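For instance, a log-scaled mel spectrogram is one common choice of frequency-domain representation. A sketch assuming the librosa library, with placeholder parameter values we have not tuned:
```
# Sketch: audio file -> log-mel spectrogram frames for a CNN.
# librosa and the parameter values are assumptions, not final choices.
import librosa
import numpy as np

def audio_to_model_input(path, sr=16000, hop_length=512, n_mels=229):
    y, sr = librosa.load(path, sr=sr)            # mono waveform, resampled
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)           # compress dynamic range
    # transpose to (n_frames, n_mels) and add a channel axis for the CNN
    return log_mel.T[..., np.newaxis]
```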
### The output
A series of probability vectors representing the sounds present in each frame, ready to be transformed into a MIDI file.
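A rough sketch of this last step, assuming a fixed decision threshold and the pretty_midi library (both are our assumptions, not part of the referenced models):
```
# Sketch: frame-wise pitch probabilities -> MIDI file.
# Threshold, frame duration and velocity are illustrative assumptions.
import pretty_midi

def probabilities_to_midi(probs, frame_time=0.032, threshold=0.5, path="out.mid"):
    """probs: array of shape (n_frames, 88) with values in [0, 1]."""
    active = probs >= threshold
    piano = pretty_midi.Instrument(program=0)     # acoustic grand piano
    for key in range(88):
        onset = None
        for frame in range(active.shape[0]):
            if active[frame, key] and onset is None:
                onset = frame                     # note starts
            elif not active[frame, key] and onset is not None:
                piano.notes.append(pretty_midi.Note(
                    velocity=80, pitch=key + 21,  # MIDI note 21 = A0
                    start=onset * frame_time, end=frame * frame_time))
                onset = None                      # note ends
        if onset is not None:                     # note still active at the end
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=key + 21,
                start=onset * frame_time, end=active.shape[0] * frame_time))
    midi = pretty_midi.PrettyMIDI()
    midi.instruments.append(piano)
    midi.write(path)
```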
## List of publications/articles/blog posts used for problem analysis
[:)] https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf
[1] [An End-to-End Neural Network for Polyphonic Piano Music Transcription](https://arxiv.org/pdf/1508.01774.pdf)
[2] [Music Transcription by Deep Learning with Data and “Artificial Semantic” Augmentation understanding](https://arxiv.org/pdf/1712.03228.pdf)
[3] [Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf)
[4] [Onsets and Frames: Dual-Objective Piano Transcription](https://storage.googleapis.com/magentadata/papers/onsets-frames/index.html) <- page of a newer approach that seems simpler and more effective
[5] [https://github.com/ybayle/awesome-deep-learning-music](https://github.com/ybayle/awesome-deep-learning-music)
[6] https://drive.google.com/file/d/0B1OooSxEtl0FcTBiOGdvSTBmWnc/view
## List of repositories/code examples that will be used for implementation
[1] https://github.com/IraKorshunova/folk-rnn
[2] https://github.com/BShakhovsky/PolyphonicPianoTranscription
[3] https://github.com/wgxli/piano-transcription
## Framework selected for implementation
TensorFlow with Keras - probably? Hard to say, but we have the most experience with it.
Keras über alles.
PyTorch - only if it looks like something can be done with it with low effort.
## Datasets to be used for experiments
[MAPS](http://www.tsi.telecom-paristech.fr/aao/en/2010/07/08/maps-database-a-piano-database-for-multipitch-estimation-and-automatic-transcription-of-music/): used in [1], "31 GB of CD-quality recordings in .wav format", "The ground truth is provided for all sounds, in MIDI and text formats. The audio was generated from the ground truth in order to ensure the accuracy of the annotation."
[MAESTRO](https://magenta.tensorflow.org/maestro-wave2midi2wave): "172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.", "based on recordings from the International Piano-e-Competition"
[MusicNet](https://homes.cs.washington.edu/~thickstn/musicnet.html): "a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition", we probably have to filter this dataset in order to acquire music recordings with piano labels only.
[A nice description of four datasets](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf).
[2] describes lossless & lossy augmentation of spectrograms.
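As a loose illustration of what audio-level augmentation could look like before the spectrograms are computed, here is a small sketch; the transforms and parameters are our assumptions and do not reproduce what [2] does.
```
# Sketch: simple augmentations applied to the waveform before the spectrogram.
# These are generic examples, not the exact transforms from [2].
import librosa
import numpy as np

def augment_waveform(y, sr):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # labels must be transposed too
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # labels must be re-timed too
    noisy = y + 0.005 * np.random.randn(len(y))                 # additive noise
    return [shifted, stretched, noisy]
```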
# 2. Methodology: dataset, tools, experiments
# 3. Experiment results and discussion