
Commit 2a24a3b

Hadley-Zhang committed
feat: add descriptive statistics and charts
Added new descriptive tools for Speech Quality Prediction:

- Mean and Standard Deviation calculations of the predictions
- BarCharts and LineCharts of the predictions

This change is especially helpful for speech quality testing, e.g. the repeated averaging of results for the same test sample in practical audio testing scenarios. Also, the output name of the CSV file is added, just like gabrielmittag#30.
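For illustration, a minimal pandas/matplotlib sketch of the kind of per-sample statistics and charts this commit describes (not the committed implementation; the results file name and the `sample` and `mos_pred` columns are hypothetical):

```
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical predictions file; repeated predictions of the same test
# sample appear as multiple rows.
df = pd.read_csv("NISQA_results.csv")

# Mean and standard deviation of the predicted MOS per test sample,
# e.g. for repeated averaging of results in practical audio testing.
stats = df.groupby("sample")["mos_pred"].agg(["mean", "std"])
print(stats)

# Bar chart of the per-sample means, with the standard deviation as error bars.
stats["mean"].plot(kind="bar", yerr=stats["std"], capsize=3)
plt.ylabel("Predicted MOS")
plt.tight_layout()
plt.savefig("mos_bar_chart.png")
```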

18 files changed: 4977 additions & 0 deletions

.gitignore

Lines changed: 145 additions & 0 deletions
# Created by https://www.toptal.com/developers/gitignore/api/python
# Edit at https://www.toptal.com/developers/gitignore?templates=python

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# End of https://www.toptal.com/developers/gitignore/api/python

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2021 Gabriel Mittag, Quality and Usability Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 172 additions & 0 deletions
# NISQA: Speech Quality and Naturalness Assessment

*+++ News: The NISQA model has recently been updated to NISQA v2.0. The new version offers multidimensional predictions with higher accuracy and allows for training and finetuning the model.*

**Speech Quality Prediction:**
NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g. telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions *Noisiness*, *Coloration*, *Discontinuity*, and *Loudness* to give more insight into the cause of the quality degradation.

**TTS Naturalness Prediction:**
The NISQA-TTS model weights can be used to estimate the *Naturalness* of synthetic speech generated by a Voice Conversion or Text-To-Speech system (Siri, Alexa, etc.).

**Training/Finetuning:**
NISQA can be used to train new single-ended or double-ended speech quality prediction models with different deep learning architectures, such as CNN or DFF -> Self-Attention or LSTM -> Attention-Pooling or Max-Pooling. The provided model weights can also be used to finetune the trained model towards new data or for transfer learning to a different regression task (e.g. quality estimation of enhanced speech, speaker similarity estimation, or emotion recognition).

**Speech Quality Datasets:**
We provide a large corpus of more than 14,000 speech samples with subjective speech quality and speech quality dimension labels.

## Table of Contents
- [Installation](#installation)
- [Using NISQA](#using-nisqa)
- [Prediction](#prediction)
- [Training](#training)
- [Finetuning / Transfer Learning](#finetuning--transfer-learning)
- [Training a new model](#training-a-new-model)
- [Evaluation](#evaluation)
- [NISQA Corpus](#nisqa-corpus)
- [Paper and License](#paper-and-license)

For more information about the deep learning model structure, the training datasets used, and the training options, see the [NISQA paper](https://arxiv.org/abs/2104.09494) and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/).


## Installation

To install the requirements, install [Anaconda](https://www.anaconda.com/products/individual) and then run:

```
conda env create -f env.yml
```

This will create a new environment with the name "nisqa". Activate this environment to continue:

```
conda activate nisqa
```


## Using NISQA

We provide examples for using NISQA to predict the quality of speech samples, to train a new speech quality model, and to evaluate the performance of a trained speech quality model.

Three different model weights are available; the appropriate weights should be loaded depending on the domain:

| Model | Prediction Output | Domain | Filename |
| --------------------- | --------------------------------------------------------------- | ------------------ | ------------------ |
| NISQA (v2.0) | Overall Quality, Noisiness, Coloration, Discontinuity, Loudness | Transmitted Speech | nisqa.tar |
| NISQA (v2.0) mos only | Overall Quality only (for finetuning/transfer learning) | Transmitted Speech | nisqa_mos_only.tar |
| NISQA-TTS (v1.0) | Naturalness | Synthesized Speech | nisqa_tts.tar |

### Prediction

Three modes are available to predict the quality of speech via command line arguments:
* Predict a single file
* Predict all files in a folder
* Predict all files in a CSV table

**Important:** Select "*nisqa.tar*" to predict the quality of a transmitted speech sample and "*nisqa_tts.tar*" to predict the Naturalness of a synthesized speech sample.

To predict the quality of a single .wav file use:

```
python run_predict.py --mode predict_file --pretrained_model weights/nisqa.tar --deg /path/to/wav/file.wav --output_dir /path/to/dir/with/results
```
To predict the quality of all .wav files in a folder use:
```
python run_predict.py --mode predict_dir --pretrained_model weights/nisqa.tar --data_dir /path/to/folder/with/wavs --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

To predict the quality of all .wav files listed in a CSV table use:
```
python run_predict.py --mode predict_csv --pretrained_model weights/nisqa.tar --csv_file files.csv --csv_deg column_name_of_filepaths --num_workers 0 --bs 10 --output_dir /path/to/dir/with/results
```

The results will be printed to the console and saved to a CSV file in a given folder (optionally set with --output_dir). To speed up the prediction, the number of workers and the batch size of the PyTorch Dataloader can be increased (with --num_workers and --bs). For stereo files, --ms_channel can be used to select the audio channel.

### Training

#### Finetuning / Transfer Learning

To use the model weights to finetune the model on a new dataset, only a CSV file with the filenames and labels is needed. The training configuration is controlled via a YAML file, and training can be started as follows:

```
python run_train.py --yaml config/finetune_nisqa.yaml
```

- If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only two arguments need to be updated in the YAML file and you are ready to go: the `data_dir`, pointing to the extracted NISQA_Corpus folder, and the `output_dir`, where the results should be stored.

- If you use your own dataset or want to load the NISQA-TTS model, some other updates are needed.

Your CSV file needs to contain at least three columns with the following names (a minimal example is sketched after this list):

- `db` with the individual dataset names for each file
- `filepath_deg` with the filepath to the degraded WAV file, either absolute or relative to the `data_dir` (CSV column name can be changed in YAML)
- `mos` with the target labels (CSV column name can be changed in YAML)
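For illustration, a minimal CSV of this shape might look as follows (the dataset names and file paths here are hypothetical):

```
db,filepath_deg,mos
MY_TRAIN_SET,MY_TRAIN_SET/file_001.wav,3.4
MY_TRAIN_SET,MY_TRAIN_SET/file_002.wav,4.1
MY_VAL_SET,MY_VAL_SET/file_101.wav,2.8
```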
The `finetune_nisqa.yaml` needs to be updated as follows:

- `data_dir` path to the main folder, which contains the CSV file and the datasets
- `output_dir` path to the output folder with saved model weights and results
- `pretrained_model` filename of the pretrained model, either `nisqa_mos_only.tar` for natural speech or `nisqa_tts.tar` for synthesized speech
- `csv_file` name of the CSV with filepaths and target labels
- `csv_deg` CSV column name that contains the filepaths (e.g. `filepath_deg`)
- `csv_mos_train` and `csv_mos_val` CSV column names of the target value (e.g. `mos`)
- `csv_db_train` and `csv_db_val` names of the datasets you want to use for training and validation. Dataset names must be in the `db` column.

See the comments in the YAML configuration file and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more advanced training options. A good starting point is to use the NISQA Corpus to get the training started with the standard configuration.

#### Training a new model

NISQA can also be used as a framework to train new speech quality models with different deep learning architectures. The general model structure is as follows:

1. *Framewise model:* CNN or Feedforward network
2. *Time-Dependency model:* Self-Attention or LSTM
3. *Pooling:* Average, Max, Attention, or Last-Step Pooling

The framewise and time-dependency models can be skipped, for example to train an LSTM model without a CNN that uses the last time step for prediction. Also, a second time-dependency stage can be added, for example for an LSTM-Self-Attention structure. The model structure can be easily controlled via the YAML configuration file. Training with the standard NISQA model configuration can be started on the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) as follows:

```
python run_train.py --yaml config/train_nisqa_cnn_sa_ap.yaml
```

If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir`, pointing to the unzipped NISQA_Corpus folder, and the `output_dir` need to be updated in the YAML file. Otherwise, see the previous [finetuning section](#finetuning-transfer-learning) for updating the YAML file if a custom dataset is used.

It is also possible to train other combinations of neural networks; for example, to train a model with an LSTM instead of Self-Attention, the `train_nisqa_cnn_lstm_avg.yaml` example configuration file is provided.

To train a **double-ended** model for full-reference speech quality prediction, the `train_nisqa_double_ended.yaml` configuration file can be used as an example. See the comments in the YAML files and the [Wiki](https://github.com/gabrielmittag/NISQA/wiki/) (not yet added) for more details on different possible model structures and advanced training options.

### Evaluation

Trained models can be evaluated on a given dataset as follows (this can also be used as a conformance test of the model installation):

```
python run_evaluate.py
```

Before running, the options and paths inside the Python script `run_evaluate.py` should be updated. If the [NISQA Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) is used, only the `data_dir` and `output_dir` paths need to be adjusted. Besides Pearson's correlation and RMSE, an RMSE after first-order polynomial mapping is also calculated (a sketch of these metrics is shown below). If a CSV file with per-condition labels is provided, the script will also output per-condition results and RMSE*. Optionally, correlation diagrams can be plotted. The script should return the same results as in the NISQA paper when it is run on the NISQA Corpus.
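For illustration, a minimal NumPy sketch of these three metrics, assuming they are computed in the usual way (this is not the repository's actual evaluation code):

```
import numpy as np

def eval_metrics(y, y_hat):
    # Pearson's correlation between subjective labels y and predictions y_hat
    r_p = np.corrcoef(y, y_hat)[0, 1]

    # RMSE of the raw predictions
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))

    # First-order polynomial mapping: fit a line that maps the predictions
    # onto the labels, removing constant offset and scaling errors before
    # the RMSE is computed again.
    b = np.polyfit(y_hat, y, 1)
    rmse_map = np.sqrt(np.mean((y - np.polyval(b, y_hat)) ** 2))

    return r_p, rmse, rmse_map
```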
## NISQA Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions.

For the download link and more details on the datasets and the source speech samples used, see the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

## Paper and License

- If you use the **NISQA model** or the **NISQA Corpus** for your research, please cite the following paper:
[G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech 2021, 2021.](https://www.isca-speech.org/archive/pdfs/interspeech_2021/mittag21_interspeech.pdf)
- Please cite the following paper if you use the **NISQA-TTS** model for Naturalness prediction of synthesized speech:
[G. Mittag and S. Möller, “Deep Learning Based Assessment of Synthetic Speech Naturalness,” in Proc. Interspeech 2020, 2020.](https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2382.html)
- Please cite the following paper if you use the **double-ended NISQA model**:
[G. Mittag and S. Möller, “Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks,” in Proc. ICASSP 2020, 2020.](https://ieeexplore.ieee.org/document/9053951)
- The older NISQA (v0.42) model version is described in the following paper:
[G. Mittag and S. Möller, “Non-Intrusive Speech Quality Assessment for Super-Wideband Speech Communication Networks,” in Proc. ICASSP 2019, 2019.](https://ieeexplore.ieee.org/document/8683770)

The NISQA code is licensed under the [MIT License](LICENSE).

The model weights (nisqa.tar, nisqa_mos_only.tar, nisqa_tts.tar) are provided under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](weights/LICENSE_model_weights).

The NISQA Corpus is provided under the original terms of the source speech and noise samples used. More information can be found in the [NISQA Corpus Wiki](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus).

Copyright © 2021 Gabriel Mittag
www.qu.tu-berlin.de

config/finetune_nisqa.yaml

Lines changed: 53 additions & 0 deletions
# Config example for transfer-learning or finetuning of NISQA or NISQA-TTS:

# Runname and paths
name: training_run_name # name of current training run
data_dir: C:/Users/Name/Downloads/NISQA_Corpus # main input dir with dataset samples and csv files
output_dir: C:/Users/Name/Downloads/trained_models # output dir, a new subfolder for the current run will be created with yaml, results csv, and stored model weights
pretrained_model: weights/nisqa_mos_only.tar # absolute path to pretrained model | path to pretrained model relative to current folder

# Dataset options
csv_file: NISQA_corpus_file.csv # csv-file with MOS labels and filepaths of all datasets, must be placed in 'data_dir', must contain columns 'mos', 'noi', 'dis', 'col', 'loud' with overall and dimension quality ratings
csv_con: null # csv-file with per-condition MOS used for evaluation (optional)
csv_deg: filepath_deg # csv column name of filepath to degraded speech sample, path must be relative to 'data_dir'
csv_mos_train: mos # csv column name of target training value (usually MOS)
csv_mos_val: mos # csv column name of target validation value (usually MOS)
csv_db_train: # dataset names of training sets, the dataset names must be in the 'db' column of the csv file
    - NISQA_TRAIN_SIM
    - NISQA_TRAIN_LIVE
csv_db_val: # dataset names of validation sets, the dataset names must be in the 'db' column of the csv file
    - NISQA_VAL_SIM
    - NISQA_VAL_LIVE

# Training options
tr_epochs: 500 # number of max training epochs
tr_early_stop: 20 # stop training if neither validation RMSE nor correlation 'r_p' improves for 'tr_early_stop' epochs
tr_bs: 40 # training dataset mini-batch size (should be increased to 100-200 if enough GPU memory is available)
tr_bs_val: 40 # validation dataset mini-batch size (should be increased to 100-200 if enough GPU memory is available)
tr_lr: 0.001 # learning rate of ADAM optimiser
tr_lr_patience: 15 # learning rate patience, decrease learning rate if loss does not improve for 'tr_lr_patience' epochs
tr_num_workers: 4 # number of workers to be used by the PyTorch Dataloader (may cause problems on Windows machines -> set to 0)
tr_parallel: True # use PyTorch DataParallel for training on multiple GPUs
tr_ds_to_memory: False # load dataset into CPU RAM before starting training (increases speed on some systems, 'tr_num_workers' should be set to 0 or 1)
tr_ds_to_memory_workers: 0 # number of workers used for loading data into CPU RAM (experimental)
tr_device: null # train on 'cpu' or 'cuda', if null 'cuda' is used if available
tr_checkpoint: every_epoch # 'every_epoch' stores model weights at each training epoch | 'best_only' stores only the weights with best validation correlation | 'null' only stores results but no model weights
tr_verbose: 2 # '0' only basic results after each epoch | '1' more detailed results and bias loss information | '2' adds progression bar
ms_max_segments: 1300 # if samples of different duration are used they will be padded. one segment corresponds to 40ms -> 0.04*1300=52sec max sample duration. increase if you apply the model to longer samples
ms_channel: null # audio channel in case of stereo file (0->left, 1->right). if null, mono mix is used

# Bias loss options (optional)
tr_bias_mapping: null # set to 'first_order' if bias loss should be applied, otherwise 'null'
tr_bias_min_r: null # minimum correlation threshold to be reached before estimating bias (e.g. 0.7), set to 'null' if no bias loss should be applied
tr_bias_anchor_db: null # name of anchor dataset (optional)