Initial commit

Kacper Donat 2020-05-23 16:18:57 +02:00
commit 6a6e99f8d1
3 changed files with 199 additions and 0 deletions

.gitignore

@@ -0,0 +1,141 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# Generated docs
*.pdf

Makefile

@@ -0,0 +1,2 @@
raport.pdf: README.md
pandoc -f markdown-implicit_figures -V geometry:margin=1in $^ -o $@

README.md

@@ -0,0 +1,56 @@
# 1. Introduction and research problem analysis
We aim to transcribe raw audio into a more structured representation such as MIDI. In the technical literature this task is called _Automatic Music Transcription_ (AMT). Converting audio recordings of music into a symbolic form makes many tasks in music information retrieval easier to accomplish. It is a hard task even for humans.
There are several factors that make AMT difficult:
1. Polyphonic music contains a mixture of multiple simultaneous sources.
2. Overlapping sound events often exhibit harmonic relations with each other.
3. The timing of musical voices is governed by the regular metrical structure of the music.
4. The annotation of ground-truth transcriptions for polyphonic music is very time-consuming and requires high expertise.
Especially the last factor limits the number of datasets available for training models.
To limit the scope of the research, we are going to concern ourselves mainly with transcribing polyphonic piano recordings. A standard piano can play 88 pitches in many combinations, and that complexity makes it a prime research subject for polyphonic AMT.
State-of-the-art machine learning architectures for this task take one of two approaches. The first is similar to speech recognition and consists of two parts: an _acoustic model_ and a _music language model_. The acoustic model estimates the probability of a pitch being present in a frame of audio. The music language model is analogous to the language models used in natural language processing and predicts the probability of a pitch being played given the context of previous pitches and common composition conventions.
The second approach also combines two networks: one network detects note onsets, and its output is used to inform a second network, which focuses on detecting note lengths. The task has also been tackled with other techniques, such as Non-Negative Matrix Factorization, but we are going to focus on neural networks.
Acoustic models usually use RNNs because they can capture long-term and short-term temporal patterns in music. CNNs provide benefits as well, because music is shaped not only by which pitches are played but also by the temporal "distance" between them, and CNNs are very good at preserving and recognising such spatial features. In the second approach, both networks consist of convolutional layers that process the input and RNNs that do the inference, connected so that the output of the first network can inform the RNN part of the second one.
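To make the first approach more concrete, below is a minimal sketch of such a frame-wise acoustic model in Keras (the framework we lean towards, see below); the input shape, layer sizes and number of layers are our own assumptions for illustration, not values taken from the cited papers.

```python
# Minimal sketch of a CNN + RNN acoustic model for frame-wise pitch detection.
# All shapes and layer sizes are illustrative assumptions, not from the cited papers.
from tensorflow.keras import layers, models

N_FRAMES = 625    # spectrogram frames per training excerpt (assumed)
N_BINS = 229      # frequency bins of the input spectrogram (assumed)
N_PITCHES = 88    # piano keys

inputs = layers.Input(shape=(N_FRAMES, N_BINS, 1))
# Convolutional layers capture local time-frequency ("spatial") patterns.
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
# Collapse the frequency axis so every frame becomes one feature vector.
x = layers.Reshape((N_FRAMES, -1))(x)
# A bidirectional RNN models temporal context across frames.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# One sigmoid per pitch: probability that the pitch sounds in a given frame.
outputs = layers.TimeDistributed(layers.Dense(N_PITCHES, activation="sigmoid"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```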
### The input
A music file that has to be preprocessed in some way (usually through frequency-domain transformations) into a form processable by a CNN.
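For example, a common frequency-domain preprocessing step is a log-scaled mel spectrogram. A minimal sketch using librosa follows; the library choice and all parameter values are our assumptions, not something fixed by the cited works.

```python
# Sketch: turn an audio file into a log-mel spectrogram usable as CNN input.
# librosa and all parameter values below are illustrative assumptions.
import librosa
import numpy as np

def audio_to_logmel(path, sr=16000, n_fft=2048, hop_length=512, n_mels=229):
    audio, _ = librosa.load(path, sr=sr)           # load and resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    logmel = librosa.power_to_db(mel, ref=np.max)  # log compression tames the dynamic range
    return logmel.T                                # shape: (frames, mel bins)

# Usage (hypothetical file name):
# features = audio_to_logmel("recording.wav")
```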
### The output
A series of probability vectors representing the sounds present in each frame, ready to be transformed into a MIDI file.
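A hedged sketch of this last step, turning frame-wise pitch probabilities into MIDI notes with pretty_midi; the threshold, the frame rate (matching the hop length assumed above) and the library choice are our assumptions.

```python
# Sketch: convert a (frames x 88) matrix of pitch probabilities into a MIDI file.
# Threshold, frame rate and velocity are illustrative assumptions.
import pretty_midi

def probabilities_to_midi(probs, frame_rate=31.25, threshold=0.5, out_path="output.mid"):
    midi = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)   # acoustic grand piano
    active = probs >= threshold                 # boolean piano roll
    n_frames, n_pitches = active.shape
    for pitch in range(n_pitches):
        start = None
        for frame in range(n_frames):
            if active[frame, pitch] and start is None:
                start = frame / frame_rate      # note onset
            elif not active[frame, pitch] and start is not None:
                piano.notes.append(pretty_midi.Note(
                    velocity=80, pitch=pitch + 21,   # MIDI 21 = A0, lowest piano key
                    start=start, end=frame / frame_rate))
                start = None
        if start is not None:                   # note still sounding at the very end
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=pitch + 21,
                start=start, end=n_frames / frame_rate))
    midi.instruments.append(piano)
    midi.write(out_path)
```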
## List of publications/articles/blog posts used for problem analysis
[:)] https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf
[1] [An End-to-End Neural Network for Polyphonic Piano Music Transcription](https://arxiv.org/pdf/1508.01774.pdf)
[2] [Music Transcription by Deep Learning with Data and “Artificial Semantic” Augmentation](https://arxiv.org/pdf/1712.03228.pdf)
[3] [Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf)
[4] [Onsets and Frames: Dual-Objective Piano Transcription](https://storage.googleapis.com/magentadata/papers/onsets-frames/index.html) <- page of a newer approach that seems simpler and more effective
[5] [https://github.com/ybayle/awesome-deep-learning-music](https://github.com/ybayle/awesome-deep-learning-music)
[6] https://drive.google.com/file/d/0B1OooSxEtl0FcTBiOGdvSTBmWnc/view
## List of repositories/code examples that will be used for implementation
[1] https://github.com/IraKorshunova/folk-rnn
[2] https://github.com/BShakhovsky/PolyphonicPianoTranscription
[3] https://github.com/wgxli/piano-transcription
## Framework selected for implementation
TensorFlow with Keras - probably? Hard to say, but we have the most experience with it.
Keras above all.
PyTorch - if it looks like something can be done with it with low effort.
## Dataset to be used for experiments
[MAPS](http://www.tsi.telecom-paristech.fr/aao/en/2010/07/08/maps-database-a-piano-database-for-multipitch-estimation-and-automatic-transcription-of-music/): used in [1], "31 GB of CD-quality recordings in .wav format", "The ground truth is provided for all sounds, in MIDI and text formats. The audio was generated from the ground truth in order to ensure the accuracy of the annotation."
[MAESTRO](https://magenta.tensorflow.org/maestro-wave2midi2wave): "172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.", "based on recordings from the International Piano-e-Competition"
[MusicNet](https://homes.cs.washington.edu/~thickstn/musicnet.html): "a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition"; we will probably have to filter this dataset to obtain recordings with piano labels only (see the sketch below).
[A nice description of four datasets](https://arxiv.org/ftp/arxiv/papers/1206/1206.6392.pdf).
[2] describes lossless & lossy augmentation of spectrograms.
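A minimal sketch of the MusicNet filtering step mentioned above, assuming the published metadata CSV has an `ensemble` column and that solo piano pieces are labelled "Solo Piano" (both are assumptions about the dataset layout); pandas is our choice here.

```python
# Sketch: keep only solo-piano recordings from the MusicNet metadata.
# The CSV name, the "ensemble" column and the "Solo Piano" label are assumptions.
import pandas as pd

metadata = pd.read_csv("musicnet_metadata.csv")
piano_only = metadata[metadata["ensemble"] == "Solo Piano"]
print(f"{len(piano_only)} of {len(metadata)} recordings are solo piano pieces")
# The remaining ids can then be used to pick the matching .wav and label files.
```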
# 2. Methodology: dataset, tools, experiments
# 3. Experiment results and discussion