* [DAPS Dataset](https://archive.org/details/daps_dataset) - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (about 14 minutes of data per speaker).
* [Deep Clustering Dataset](https://www.merl.com/demos/deep-clustering) - Training deep discriminative embeddings to solve the cocktail party problem.
* [DEMoS](https://zenodo.org/record/2544829) - 9,365 emotional and 332 neutral samples produced by 68 native speakers (23 female, 45 male); 6 primary emotions (anger, sadness, happiness, fear, surprise, disgust) plus the secondary emotion guilt.
* [DES](http://kom.aau.dk/~tb/speech/Emotions/) - 4 speakers (2 males and 2 females); 5 emotions: neutral, surprise, happiness, sadness and anger.
* [DIPCO](https://arxiv.org/abs/1909.13447) - Dinner Party Corpus - participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes.
* [EEKK](https://metashare.ut.ee/repository/download/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/) - 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger and neutral.
* [EmoFilm](https://zenodo.org/record/1326428) - 1,115 audio instances of sentences extracted from various films.
* [EmoSynth](https://zenodo.org/record/3727593) - 144 audio files labelled by 40 listeners; emotion (no speech) defined in terms of valence and arousal.
* [Emotional Voices Database](https://github.com/numediart/EmoV-DB) - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
* [Emotional Voice dataset - Nature](https://www.nature.com/articles/s41562-019-0533-6) - 2,519 speech samples produced by 100 actors from 5 cultures. Using large-scale statistical inference methods, the authors find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures.
* [EmotionTTS](https://github.com/emotiontts/emotiontts_open_db) - Recordings and their associated transcriptions by a diverse group of speakers; 4 emotions: general, joy, anger, and sadness.
* [Emov-DB](https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg!mYwUnI4K) - Recordings for 4 speakers (2 males and 2 females); the emotional styles are neutral, sleepiness, anger, disgust and amused.
* [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html) - 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness.
* [eNTERFACE05](http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip) - Videos by 42 subjects coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust.
* [Free Spoken Digit Dataset](https://github.com/Jakobovski/free-spoken-digit-dataset) - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations (see the loading sketch after this list).
* [Flickr Audio Caption](https://groups.csail.mit.edu/sls/downloads/flickraudio/) - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
* [GEMEP corpus](https://www.unige.ch/cisa/gemep) - 10 actors portraying 10 states; 12 emotions: amusement, anxiety, cold anger (irritation), despair, hot anger (rage), fear (panic), interest, joy (elation), pleasure (sensory), pride, relief, and sadness. Plus 5 additional emotions: admiration, contempt, disgust, surprise, and tenderness.
* [IEMOCAP](https://sail.usc.edu/iemocap/iemocap_release.htm) - 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration and neutral.
* [ISOLET Data Set](https://data.world/uci/isolet) - This 38.7 GB dataset helps predict which letter-name was spoken, a simple classification task.
* [JL corpus](https://www.kaggle.com/tli725/jl-corpus) - 2,400 recordings of 240 sentences by 4 actors (2 males and 2 females); 5 primary emotions: angry, sad, neutral, happy, excited; 5 secondary emotions: anxious, apologetic, pensive, worried, enthusiastic.
* [Keio-ESD](http://research.nii.ac.jp/src/en/Keio-ESD.html) - A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joyful, disgusting, downgrading, funny, worried, gentle, relief, indignation, shameful, etc.
* [LEGO Corpus](https://www.ultes.eu/ressources/lego-spoken-dialogue-corpus/) - 347 dialogs with 9,083 system-user exchanges; emotions classified as garbage, non-angry, slightly angry and very angry.
* [Libriadapt](https://github.com/akhilmathurs/libriadapt) - Primarily designed to facilitate domain adaptation research for ASR models; contains three types of domain shift in the data.
* [Libri-CSS](https://github.com/chenzhuo1011/libri_css) - Derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
* [LibriMix](https://github.com/JorisCos/LibriMix) - An open-source dataset for source separation in noisy environments, derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset, complements it, and enables cross-dataset experiments.
* [The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)](https://zenodo.org/record/1188976#.XrC7a5NKjOR) - 7,356 files (total size: 24.8 GB) from 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions.
* [sample_voice_data](https://github.com/jim-schwoebel/sample_voice_data) - 52 audio files per class (males and females) for testing purposes.
* [SAVEE Dataset](http://kahlan.eps.surrey.ac.uk/savee/) - 4 male actors in 7 different emotions, 480 British English utterances in total.
* [SEMAINE](https://semaine-db.eu/) - 95 dyadic conversations from 21 subjects, each conversing with a partner playing one of four emotional characters; 5 FeelTrace annotations: activation, valence, dominance, power, intensity.
* [SER Datasets](https://github.com/SuperKogito/SER-datasets) - A collection of datasets for emotion recognition/detection in speech.
* [SEWA](https://db.sewaproject.eu/) - More than 2,000 minutes of audio-visual data of 398 people (201 male and 197 female) from 6 cultures; emotions are characterized using valence and arousal.
* [ShEMO](https://github.com/mansourehk/ShEMO) - 3,000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native Persian speakers; 6 emotions: anger, fear, happiness, sadness, neutral and surprise.
* [SparseLibriMix](https://github.com/popcornell/SparseLibriMix) - An open-source dataset for source separation in noisy environments with a variable overlap ratio. Due to insufficient noise material, this is a test-set-only version.
* [Spoken Wikipedia Corpora](https://nats.gitlab.io/swc/) - 38 GB in size, available in formats both with and without audio.
* [Tatoeba](https://tatoeba.org/eng/downloads) - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
* [Ted-LIUM](https://www.openslr.org/51/) - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
* [TESS](https://tspace.library.utoronto.ca/handle/1807/24487) - 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
* [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts/) - German-language dataset, 22,668 recorded phrases, 23 hours of audio, average phrase length 52 characters.
* [TIMIT dataset](https://catalog.ldc.upenn.edu/LDC93S1) - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (paid license).
* [URDU-Dataset](https://github.com/siddiquelatif/urdu-dataset) - 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
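
Most of the corpora above are distributed as plain audio files, so a few lines of Python are enough to start exploring them. The sketch below is illustrative rather than dataset-specific: it assumes `librosa` and `numpy` are installed, and the file path is a placeholder in the style of the Free Spoken Digit Dataset; point it at a clip from whichever corpus you download.

```python
# Minimal sketch: load one clip and compute log-mel features, a common
# front end for speech emotion recognition experiments.
# The path is a placeholder; replace it with any WAV from the datasets above.
import librosa
import numpy as np

wav_path = "recordings/0_jackson_0.wav"  # example path (Free Spoken Digit style)

# sr=None keeps the file's native sampling rate instead of resampling
audio, sr = librosa.load(wav_path, sr=None)

# 40-band log-mel spectrogram with a 25 ms window and 10 ms hop
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
    n_mels=40,
)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(f"{len(audio) / sr:.2f} s at {sr} Hz -> log-mel shape {log_mel.shape}")
```

The same loading step works for most of the emotion corpora listed here; only the labels (usually encoded in file names or accompanying metadata) differ from dataset to dataset.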