Comparison of speaker dependent and speaker independent emotion recognition

Jan Rybka; Artur Janicki

International Journal of Applied Mathematics and Computer Science (2013)

  • Volume: 23, Issue: 4, pages 797-808
  • ISSN: 1641-876X

Abstract

This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory contains a review of emotion inventories used in various studies of emotion recognition as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study the EMO-DB speech corpus and three selected classifiers, the k-Nearest Neighbor (k-NN), the Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs turned out to provide the best classification accuracy of 75.44% in the speaker dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker dependent and speaker independent configurations were analyzed and compared. Emotion recognition in speaker dependent conditions usually yielded higher accuracy results than a similar but speaker independent configuration. The improvement was especially well observed if the base recognition ratio of a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.
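To illustrate the difference between the speaker dependent and speaker independent configurations compared in the paper, the sketch below contrasts a random train/test split (speaker dependent: the test speaker's utterances may also appear in training) with a leave-one-speaker-out split (speaker independent) for an SVM classifier on EMO-DB. The feature set (MFCC statistics via librosa), the SVM parameters, and the corpus path are illustrative assumptions, not the authors' exact parametrization or setup; only the EMO-DB file-naming convention (speaker id in the first two characters, emotion letter at position six) is taken from the corpus documentation.

# Sketch: speaker dependent vs. speaker independent evaluation on EMO-DB.
# Assumptions (not from the paper): MFCC-statistics features via librosa,
# an RBF-kernel SVM from scikit-learn, and a hypothetical corpus location.
import glob
import os

import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split

EMOTIONS = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
            "F": "happiness", "T": "sadness", "N": "neutral"}

def features(path):
    """Mean and standard deviation of 13 MFCCs per utterance (illustrative)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

X, y, speakers = [], [], []
for path in glob.glob("emodb/wav/*.wav"):          # hypothetical corpus location
    name = os.path.basename(path)
    X.append(features(path))
    y.append(EMOTIONS[name[5]])                    # emotion letter in the file name
    speakers.append(name[:2])                      # speaker id in the file name
X, y, speakers = np.array(X), np.array(y), np.array(speakers)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))

# Speaker dependent: utterances from every speaker may appear in both
# the training and the test sets (random stratified split).
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
acc_dep = clf.fit(Xtr, ytr).score(Xte, yte)

# Speaker independent: leave-one-speaker-out cross-validation, so the
# test speaker is never seen during training.
acc_indep = cross_val_score(clf, X, y, groups=speakers, cv=LeaveOneGroupOut()).mean()

print(f"speaker dependent:   {acc_dep:.3f}")
print(f"speaker independent: {acc_indep:.3f}")

With such a protocol, the speaker dependent split would typically score higher than the leave-one-speaker-out split, which is the effect the paper quantifies on EMO-DB.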

How to cite


Jan Rybka and Artur Janicki. "Comparison of speaker dependent and speaker independent emotion recognition." International Journal of Applied Mathematics and Computer Science 23.4 (2013): 797-808. <http://eudml.org/doc/262324>.

@article{JanRybka2013,
abstract = {This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory contains a review of emotion inventories used in various studies of emotion recognition as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study the EMO-DB speech corpus and three selected classifiers, the k-Nearest Neighbor (k-NN), the Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs turned out to provide the best classification accuracy of 75.44% in the speaker dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker dependent and speaker independent configurations were analyzed and compared. Emotion recognition in speaker dependent conditions usually yielded higher accuracy results than a similar but speaker independent configuration. The improvement was especially well observed if the base recognition ratio of a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.},
author = {Jan Rybka and Artur Janicki},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {speech processing; emotion recognition; EMO-DB; support vector machines; artificial neural networks},
language = {eng},
number = {4},
pages = {797-808},
title = {Comparison of speaker dependent and speaker independent emotion recognition},
url = {http://eudml.org/doc/262324},
volume = {23},
year = {2013},
}

TY - JOUR
AU - Jan Rybka
AU - Artur Janicki
TI - Comparison of speaker dependent and speaker independent emotion recognition
JO - International Journal of Applied Mathematics and Computer Science
PY - 2013
VL - 23
IS - 4
SP - 797
EP - 808
AB - This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory contains a review of emotion inventories used in various studies of emotion recognition as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study the EMO-DB speech corpus and three selected classifiers, the k-Nearest Neighbor (k-NN), the Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs turned out to provide the best classification accuracy of 75.44% in the speaker dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker dependent and speaker independent configurations were analyzed and compared. Emotion recognition in speaker dependent conditions usually yielded higher accuracy results than a similar but speaker independent configuration. The improvement was especially well observed if the base recognition ratio of a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.
LA - eng
KW - speech processing; emotion recognition; EMO-DB; support vector machines; artificial neural networks
UR - http://eudml.org/doc/262324
ER -

