# NISQA: Speech Quality and Naturalness Assessment

*+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.*

**Speech Quality Prediction:**
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g. telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions *Noisiness*, *Coloration*, *Discontinuity*, and *Loudness* to give more insight into the cause of the quality degradation.

**TTS Naturalness Prediction:**
The NISQA-TTS model weights can be used to estimate the *Naturalness* of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).

**Training/Finetuning:**
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the model on new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).

**Speech Quality Datasets:**
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.

## Table of Contents
- [Installation](#installation)
- [Using NISQA](#using-nisqa)
  - [Prediction](#prediction)
  - [Training](#training)
    - [Finetuning / Transfer Learning](#finetuning--transfer-learning)
    - [Training a new model](#training-a-new-model)
  - [Evaluation](#evaluation)
- [NISQA Corpus](#nisqa-corpus)
- [Paper and License](#paper-and-license)

For more information about the deep learning model structure, the training datasets used, and the training options, see the [NISQA paper](https://arxiv.org/abs/2104.09494) and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/).


## Installation

To install the requirements, install [Anaconda](https://www.anaconda.com/products/individual) and then run:

```
conda env create -f env.yml
```

This will create a new environment named "nisqa". Activate this environment to continue:

```
conda activate nisqa
```


## Using NISQA

We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.

There are three different sets of model weights available; the appropriate weights should be loaded depending on the domain:

| Model                 | Prediction Output                                               | Domain             | Filename           |
| --------------------- | --------------------------------------------------------------- | ------------------ | ------------------ |
| NISQA (v2.0)          | Overall Quality, Noisiness, Coloration, Discontinuity, Loudness | Transmitted Speech | nisqa.tar          |
| NISQA (v2.0) mos only | Overall Quality only (for finetuning/transfer learning)         | Transmitted Speech | nisqa_mos_only.tar |
| NISQA-TTS (v1.0)      | Naturalness                                                     | Synthesized Speech | nisqa_tts.tar      |

### Prediction

There are three modes available for predicting the quality of speech via command-line arguments:
* Predict a single file
* Predict all files in a folder
* Predict all files in a CSV table

**Important:** Select "*nisqa.tar*" to predict the quality of a transmitted speech sample and "*nisqa_tts.tar*" to predict the Naturalness of a synthesized speech sample.

To predict the quality of a single .wav file use:

```
python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results
```
To predict the quality of all .wav files in a folder use:
```
python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

To predict the quality of all .wav files listed in a CSV table use:
```
python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

The results are printed to the console and saved to a CSV file in a given folder (optional with `--output_dir`). To speed up the prediction, the number of workers and the batch size of the PyTorch Dataloader can be increased (optional with `--num_workers` and `--bs`). In the case of stereo files, `--ms_channel` can be used to select the audio channel.
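
The saved results can then be post-processed like any table, for example with pandas. The sketch below writes a small stand-in table first, since the exact output filename and column names (assumed here to be `NISQA_results.csv` with a `mos_pred` column) may differ; check the header of the CSV produced by `run_predict.py` on your system:

```python
import pandas as pd

# Stand-in for the CSV written by run_predict.py; the real filename
# (assumed: NISQA_results.csv) and column names may differ.
pd.DataFrame({
    "deg": ["a.wav", "b.wav", "c.wav"],
    "mos_pred": [4.1, 2.7, 3.4],
}).to_csv("NISQA_results.csv", index=False)

results = pd.read_csv("NISQA_results.csv")

# Flag samples whose predicted overall quality falls below MOS 3.0.
low_quality = results[results["mos_pred"] < 3.0]
print(low_quality["deg"].tolist())
```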

### Training

#### Finetuning / Transfer Learning

To use the model weights to finetune the model on a new dataset, only a CSV file with the filenames and labels is needed. The training configuration is controlled via a YAML file, and training can be started as follows:

```
python run_train.py --yaml config/finetune_nisqa.yaml
```

- If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only two arguments need to be updated in the YAML file and you are ready to go: the `data_dir` to the extracted NISQA_Corpus folder and the `output_dir`, where the results should be stored.

- If you use your own dataset or want to load the NISQA-TTS model, some further updates are needed.

  Your CSV file needs to contain at least three columns with the following names:

  - `db` with the individual dataset name for each file
  - `filepath_deg` with the filepath to the degraded WAV file, either absolute or relative to `data_dir` (CSV column name can be changed in the YAML)
  - `mos` with the target labels (CSV column name can be changed in the YAML)

  The `finetune_nisqa.yaml` needs to be updated as follows:

  - `data_dir` path to the main folder that contains the CSV file and the datasets
  - `output_dir` path to the output folder for saved model weights and results
  - `pretrained_model` filename of the pretrained model, either `nisqa_mos_only.tar` for natural speech or `nisqa_tts.tar` for synthesized speech
  - `csv_file` name of the CSV with filepaths and target labels
  - `csv_deg` CSV column name that contains the filepaths (e.g. `filepath_deg`)
  - `csv_mos_train` and `csv_mos_val` CSV column names of the target values (e.g. `mos`)
  - `csv_db_train` and `csv_db_val` names of the datasets you want to use for training and validation. Dataset names must appear in the `db` column.
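
For a custom dataset, a CSV with the three required columns can be generated with a few lines of pandas (the dataset name, file paths, and labels below are made-up placeholders):

```python
import pandas as pd

# Hypothetical files and labels; replace with your own dataset.
df = pd.DataFrame({
    "db": ["my_dataset", "my_dataset"],          # dataset name per file
    "filepath_deg": ["my_dataset/file_001.wav",  # relative to data_dir
                     "my_dataset/file_002.wav"],
    "mos": [3.8, 2.1],                           # subjective target labels
})
df.to_csv("my_training_files.csv", index=False)
```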

See the comments in the YAML configuration file and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more advanced training options. A good starting point is to use the NISQA Corpus to get the training started with the standard configuration.
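
Putting the entries above together, a minimal `finetune_nisqa.yaml` for a custom dataset might look like the fragment below. All paths and dataset names are placeholders, and the exact value syntax (e.g. whether dataset names are given as a list) should be checked against the shipped configuration file, whose remaining options should be kept:

```yaml
data_dir: /path/to/main/folder
output_dir: /path/to/output/folder
pretrained_model: weights/nisqa_mos_only.tar
csv_file: my_training_files.csv
csv_deg: filepath_deg
csv_mos_train: mos
csv_mos_val: mos
csv_db_train: [my_train_dataset]
csv_db_val: [my_val_dataset]
```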

#### Training a new model

NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows:

1. *Framewise model:* CNN or Feedforward network
2. *Time-Dependency model:* Self-Attention or LSTM
3. *Pooling:* Average-, Max-, Attention-, or Last-Step-Pooling
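
The three stages can be illustrated with a minimal PyTorch sketch. This is not the actual NISQA implementation; all layer sizes and the input dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class MiniQualityModel(nn.Module):
    """Toy framewise -> time-dependency -> pooling model (illustrative only)."""

    def __init__(self, n_feats=48, dim=64):
        super().__init__()
        # 1. Framewise model: feedforward network applied to each frame
        self.framewise = nn.Sequential(nn.Linear(n_feats, dim), nn.ReLU())
        # 2. Time-dependency model: a single self-attention encoder layer
        self.time_dep = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        # 3. Pooling: attention weights over time steps, then a quality head
        self.att = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):            # x: (batch, time, n_feats)
        x = self.framewise(x)        # (batch, time, dim)
        x = self.time_dep(x)         # (batch, time, dim)
        w = torch.softmax(self.att(x), dim=1)  # attention-pooling weights
        pooled = (w * x).sum(dim=1)  # (batch, dim)
        return self.head(pooled)     # (batch, 1) predicted quality

model = MiniQualityModel()
mos = model(torch.randn(2, 100, 48))  # 2 samples, 100 frames each
print(mos.shape)  # torch.Size([2, 1])
```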

The framewise and time-dependency stages can be skipped, for example, to train an LSTM model without a CNN that uses the last time step for prediction. A second time-dependency stage can also be added, for example, for an LSTM-Self-Attention structure. The model structure can be easily controlled via the YAML configuration file. Training with the standard NISQA model configuration can be started on the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) as follows:

```
python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml
```

If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir` needs to be updated to the unzipped NISQA_Corpus folder and the `output_dir` in the YAML file. Otherwise, see the previous [finetuning section](#finetuning--transfer-learning) for updating the YAML file if a custom dataset is used.

It is also possible to train other combinations of neural networks; for example, to train a model with an LSTM instead of Self-Attention, the `train_nisqa_cnn_lstm_avg.yaml` example configuration file is provided.

To train a **double-ended** model for full-reference speech quality prediction, the `train_nisqa_double_ended.yaml` configuration file can be used as an example. See the comments in the YAML files and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more details on different possible model structures and advanced training options.

### Evaluation

Trained models can be evaluated on a given dataset as follows (this can also be used as a conformance test of the model installation):

```
python run_evaluate.py
```

Before running, the options and paths inside the Python script `run_evaluate.py` should be updated. If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir` and `output_dir` paths need to be adjusted. Besides Pearson's correlation and RMSE, an RMSE after first-order polynomial mapping is also calculated. If a CSV file with per-condition labels is provided, the script also outputs per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. When run on the NISQA Corpus, the script should return the same results as in the NISQA paper.
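
The first-order polynomial mapping fits a linear function from the predictions onto the subjective scores before computing the RMSE, which removes a constant offset and scaling from the error. A small NumPy sketch of the idea (illustrative, not the exact evaluation code; the sample values are made up):

```python
import numpy as np

def rmse(y, y_hat):
    """Plain root-mean-square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mapped_rmse(y, y_hat):
    """RMSE after mapping predictions with a least-squares first-order polynomial."""
    slope, intercept = np.polyfit(y_hat, y, deg=1)
    return rmse(y, slope * y_hat + intercept)

mos = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # subjective labels
mos_pred = np.array([1.2, 1.9, 3.3, 3.8, 4.9])  # model predictions
print(round(rmse(mos, mos_pred), 3), round(mapped_rmse(mos, mos_pred), 3))
```

The mapped RMSE is never larger than the plain RMSE, since the identity mapping is one of the candidates in the least-squares fit.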

## NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.

For the download link and more details on the datasets and the source speech samples used, see the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

## Paper and License

- If you use the **NISQA model** or the **NISQA Corpus** for your research, please cite the following paper:
  [G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech 2021, 2021.](https://www.isca-speech.org/archive/pdfs/interspeech_2021/mittag21_interspeech.pdf)
- Please cite the following paper if you use the **NISQA-TTS** model for Naturalness prediction of synthesized speech:
  [G. Mittag and S. Möller, “Deep Learning Based Assessment of Synthetic Speech Naturalness,” in Proc. Interspeech 2020, 2020.](https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2382.html)
- Please cite the following paper if you use the **double-ended NISQA model**:
  [G. Mittag and S. Möller, “Full-reference speech quality estimation with attentional Siamese neural networks,” in Proc. ICASSP 2020, 2020.](https://ieeexplore.ieee.org/document/9053951)
- The older NISQA (v0.42) model version is described in the following paper:
  [G. Mittag and S. Möller, “Non-intrusive speech quality assessment for super-wideband speech communication networks,” in Proc. ICASSP 2019, 2019.](https://ieeexplore.ieee.org/document/8683770)

The NISQA code is licensed under the [MIT License](LICENSE).

The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](weights/LICENSE_model_weights).

The NISQA Corpus is provided under the original terms of the source speech and noise samples used. More information can be found in the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

Copyright © 2021 Gabriel Mittag
www.qu.tu-berlin.de
