# NISQA: Speech Quality and Naturalness Assessment

*+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.*

**Speech Quality Prediction:**
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g. telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions *Noisiness*, *Coloration*, *Discontinuity*, and *Loudness* to give more insight into the cause of the quality degradation.

**TTS Naturalness Prediction:**
The NISQA-TTS model weights can be used to estimate the *Naturalness* of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).

**Training/Finetuning:**
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the model on new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).

**Speech Quality Datasets:**
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.

## Table of Contents
- [Installation](#installation)
- [Using NISQA](#using-nisqa)
  - [Prediction](#prediction)
  - [Training](#training)
    - [Finetuning / Transfer Learning](#finetuning--transfer-learning)
    - [Training a new model](#training-a-new-model)
  - [Evaluation](#evaluation)
- [NISQA Corpus](#nisqa-corpus)
- [Paper and License](#paper-and-license)

For more information about the deep learning model structure, the training datasets used, and the training options, see the [NISQA paper](https://arxiv.org/abs/2104.09494) and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/).


## Installation

To install the requirements, install [Anaconda](https://www.anaconda.com/products/individual) and then run:

```
conda env create -f env.yml
```

This will create a new environment named "nisqa". Activate this environment to continue:

```
conda activate nisqa
```


## Using NISQA

We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.

There are three different sets of model weights available; the appropriate weights should be loaded depending on the domain:

| Model                 | Prediction Output                                               | Domain             | Filename           |
| --------------------- | --------------------------------------------------------------- | ------------------ | ------------------ |
| NISQA (v2.0)          | Overall Quality, Noisiness, Coloration, Discontinuity, Loudness | Transmitted Speech | nisqa.tar          |
| NISQA (v2.0) mos only | Overall Quality only (for finetuning/transfer learning)         | Transmitted Speech | nisqa_mos_only.tar |
| NISQA-TTS (v1.0)      | Naturalness                                                     | Synthesized Speech | nisqa_tts.tar      |

### Prediction

There are three modes available for predicting the quality of speech via command-line arguments:
* Predict a single file
* Predict all files in a folder
* Predict all files in a CSV table

**Important:** Select "*nisqa.tar*" to predict the quality of a transmitted speech sample and "*nisqa_tts.tar*" to predict the Naturalness of a synthesized speech sample.

To predict the quality of a single .wav file use:

```
python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results
```
To predict the quality of all .wav files in a folder use:
```
python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

To predict the quality of all .wav files listed in a CSV table use:
```
python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

The results are printed to the console and saved to a CSV file in a given folder (optional with `--output_dir`). To speed up the prediction, the number of workers and the batch size of the PyTorch Dataloader can be increased (optional with `--num_workers` and `--bs`). In the case of stereo files, `--ms_channel` can be used to select the audio channel.
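
The saved results can then be post-processed like any table, for example with pandas. The sketch below writes a small stand-in table first, since the exact output filename and column names (assumed here to be `NISQA_results.csv` with a `mos_pred` column) may differ; check the header of the CSV produced by `run_predict.py` on your system:

```python
import pandas as pd

# Stand-in for the CSV written by run_predict.py; the real filename
# (assumed: NISQA_results.csv) and column names may differ.
pd.DataFrame({
    "deg": ["a.wav", "b.wav", "c.wav"],
    "mos_pred": [4.1, 2.7, 3.4],
}).to_csv("NISQA_results.csv", index=False)

results = pd.read_csv("NISQA_results.csv")

# Flag samples whose predicted overall quality falls below MOS 3.0.
low_quality = results[results["mos_pred"] < 3.0]
print(low_quality["deg"].tolist())
```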

### Training

#### Finetuning / Transfer Learning

To use the model weights to finetune the model on a new dataset, only a CSV file with the filenames and labels is needed. The training configuration is controlled via a YAML file, and training can be started as follows:

```
python run_train.py --yaml config/finetune_nisqa.yaml
```

- If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only two arguments need to be updated in the YAML file and you are ready to go: the `data_dir` to the extracted NISQA_Corpus folder and the `output_dir`, where the results should be stored.

- If you use your own dataset or want to load the NISQA-TTS model, some further updates are needed.

  Your CSV file needs to contain at least three columns with the following names:

  - `db` with the individual dataset name for each file
  - `filepath_deg` with the filepath to the degraded WAV file, either absolute or relative to `data_dir` (CSV column name can be changed in the YAML)
  - `mos` with the target labels (CSV column name can be changed in the YAML)

  The `finetune_nisqa.yaml` needs to be updated as follows:

  - `data_dir` path to the main folder that contains the CSV file and the datasets
  - `output_dir` path to the output folder for saved model weights and results
  - `pretrained_model` filename of the pretrained model, either `nisqa_mos_only.tar` for natural speech or `nisqa_tts.tar` for synthesized speech
  - `csv_file` name of the CSV with filepaths and target labels
  - `csv_deg` CSV column name that contains the filepaths (e.g. `filepath_deg`)
  - `csv_mos_train` and `csv_mos_val` CSV column names of the target values (e.g. `mos`)
  - `csv_db_train` and `csv_db_val` names of the datasets you want to use for training and validation. Dataset names must appear in the `db` column.
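
For a custom dataset, a CSV with the three required columns can be generated with a few lines of pandas (the dataset name, file paths, and labels below are made-up placeholders):

```python
import pandas as pd

# Hypothetical files and labels; replace with your own dataset.
df = pd.DataFrame({
    "db": ["my_dataset", "my_dataset"],          # dataset name per file
    "filepath_deg": ["my_dataset/file_001.wav",  # relative to data_dir
                     "my_dataset/file_002.wav"],
    "mos": [3.8, 2.1],                           # subjective target labels
})
df.to_csv("my_training_files.csv", index=False)
```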

See the comments in the YAML configuration file and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more advanced training options. A good starting point is to use the NISQA Corpus to get the training started with the standard configuration.
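
Putting the entries above together, a minimal `finetune_nisqa.yaml` for a custom dataset might look like the fragment below. All paths and dataset names are placeholders, and the exact value syntax (e.g. whether dataset names are given as a list) should be checked against the shipped configuration file, whose remaining options should be kept:

```yaml
data_dir: /path/to/main/folder
output_dir: /path/to/output/folder
pretrained_model: weights/nisqa_mos_only.tar
csv_file: my_training_files.csv
csv_deg: filepath_deg
csv_mos_train: mos
csv_mos_val: mos
csv_db_train: [my_train_dataset]
csv_db_val: [my_val_dataset]
```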

#### Training a new model

NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows:

1. *Framewise model:* CNN or Feedforward network
2. *Time-Dependency model:* Self-Attention or LSTM
3. *Pooling:* Average-, Max-, Attention-, or Last-Step-Pooling
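
The three stages can be illustrated with a minimal PyTorch sketch. This is not the actual NISQA implementation; all layer sizes and the input dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class MiniQualityModel(nn.Module):
    """Toy framewise -> time-dependency -> pooling model (illustrative only)."""

    def __init__(self, n_feats=48, dim=64):
        super().__init__()
        # 1. Framewise model: feedforward network applied to each frame
        self.framewise = nn.Sequential(nn.Linear(n_feats, dim), nn.ReLU())
        # 2. Time-dependency model: a single self-attention encoder layer
        self.time_dep = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        # 3. Pooling: attention weights over time steps, then a quality head
        self.att = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):            # x: (batch, time, n_feats)
        x = self.framewise(x)        # (batch, time, dim)
        x = self.time_dep(x)         # (batch, time, dim)
        w = torch.softmax(self.att(x), dim=1)  # attention-pooling weights
        pooled = (w * x).sum(dim=1)  # (batch, dim)
        return self.head(pooled)     # (batch, 1) predicted quality

model = MiniQualityModel()
mos = model(torch.randn(2, 100, 48))  # 2 samples, 100 frames each
print(mos.shape)  # torch.Size([2, 1])
```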

The framewise and time-dependency stages can be skipped, for example, to train an LSTM model without a CNN that uses the last time step for prediction. A second time-dependency stage can also be added, for example, for an LSTM-Self-Attention structure. The model structure can be easily controlled via the YAML configuration file. Training with the standard NISQA model configuration can be started on the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) as follows:

```
python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml
```

If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir` needs to be updated to the unzipped NISQA_Corpus folder and the `output_dir` in the YAML file. Otherwise, see the previous [finetuning section](#finetuning--transfer-learning) for updating the YAML file if a custom dataset is used.

It is also possible to train other combinations of neural networks; for example, to train a model with an LSTM instead of Self-Attention, the `train_nisqa_cnn_lstm_avg.yaml` example configuration file is provided.

To train a **double-ended** model for full-reference speech quality prediction, the `train_nisqa_double_ended.yaml` configuration file can be used as an example. See the comments in the YAML files and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more details on different possible model structures and advanced training options.

### Evaluation

Trained models can be evaluated on a given dataset as follows (this can also be used as a conformance test of the model installation):

```
python run_evaluate.py
```

Before running, the options and paths inside the Python script `run_evaluate.py` should be updated. If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir` and `output_dir` paths need to be adjusted. Besides Pearson's correlation and RMSE, an RMSE after first-order polynomial mapping is also calculated. If a CSV file with per-condition labels is provided, the script also outputs per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. When run on the NISQA Corpus, the script should return the same results as in the NISQA paper.
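
The first-order polynomial mapping fits a linear function from the predictions onto the subjective scores before computing the RMSE, which removes a constant offset and scaling from the error. A small NumPy sketch of the idea (illustrative, not the exact evaluation code; the sample values are made up):

```python
import numpy as np

def rmse(y, y_hat):
    """Plain root-mean-square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mapped_rmse(y, y_hat):
    """RMSE after mapping predictions with a least-squares first-order polynomial."""
    slope, intercept = np.polyfit(y_hat, y, deg=1)
    return rmse(y, slope * y_hat + intercept)

mos = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # subjective labels
mos_pred = np.array([1.2, 1.9, 3.3, 3.8, 4.9])  # model predictions
print(round(rmse(mos, mos_pred), 3), round(mapped_rmse(mos, mos_pred), 3))
```

The mapped RMSE is never larger than the plain RMSE, since the identity mapping is one of the candidates in the least-squares fit.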

## NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.

For the download link and more details on the datasets and the source speech samples used, see the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

## Paper and License

- If you use the **NISQA model** or the **NISQA Corpus** for your research, please cite the following paper:
  [G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech 2021, 2021.](https://www.isca-speech.org/archive/pdfs/interspeech_2021/mittag21_interspeech.pdf)
- Please cite the following paper if you use the **NISQA-TTS** model for Naturalness prediction of synthesized speech:
  [G. Mittag and S. Möller, “Deep Learning Based Assessment of Synthetic Speech Naturalness,” in Proc. Interspeech 2020, 2020.](https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2382.html)
- Please cite the following paper if you use the **double-ended NISQA model**:
  [G. Mittag and S. Möller, “Full-reference speech quality estimation with attentional Siamese neural networks,” in Proc. ICASSP 2020, 2020.](https://ieeexplore.ieee.org/document/9053951)
- The older NISQA (v0.42) model version is described in the following paper:
  [G. Mittag and S. Möller, “Non-intrusive speech quality assessment for super-wideband speech communication networks,” in Proc. ICASSP 2019, 2019.](https://ieeexplore.ieee.org/document/8683770)

The NISQA code is licensed under the [MIT License](LICENSE).

The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](weights/LICENSE_model_weights).

The NISQA Corpus is provided under the original terms of the source speech and noise samples used. More information can be found in the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

Copyright © 2021 Gabriel Mittag
www.qu.tu-berlin.de
