R. Madana Mohana, A. Rama Mohan Reddy



SLID: Hybrid Learning Model and Acoustic Approach to Spoken Language Identification using Machine Learning



Spoken Language Identification (SLId) is the process of identifying the language of an utterance from an anonymous speaker, irrespective of gender, pronunciation, and accent. In this paper we present an acoustics-based learning model for spoken language identification. The investigation uses Mel Frequency Cepstral Coefficients (MFCC), an acoustic feature that represents the short-term power spectrum of sound. The proposed system combines a Gaussian Mixture Model (GMM) with Support Vector Machines (SVM) to handle the multi-class classification problem. The model aims to detect five languages: English, Japanese, French, Hindi, and Telugu. A speech corpus was built from speech samples obtained from a wide range of online podcasts and audio books. The corpus comprises utterances of a uniform duration of 10 seconds. Preliminary results indicate an overall accuracy of 96%, while a more comprehensive and rigorous test indicates an overall accuracy of 80%. The proposed acoustic model, combined with these learning techniques, thus proves to be a viable approach to language identification.
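As a rough illustration of the "Mel" part of the MFCC features mentioned above, the sketch below converts between hertz and the mel scale and spaces filter centre frequencies evenly in mel. The abstract does not state which mel formula, filter count, or frequency range the authors used, so the common 2595·log10(1 + f/700) formula, the 26 filters, and the 0 to 8000 Hz range are all assumptions made for the example, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595*log10 variant; assumed)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Return filter centre frequencies (Hz) spaced evenly on the mel scale.

    The band edges low_hz and high_hz are excluded, so the centres lie
    strictly between them.
    """
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


if __name__ == "__main__":
    # Assumed settings for illustration only: 26 filters over 0-8000 Hz.
    centers = mel_center_frequencies(num_filters=26, low_hz=0.0, high_hz=8000.0)
    for index, center in enumerate(centers[:5], start=1):
        print(f"filter {index}: {center:.1f} Hz")
```

Because the spacing is uniform in mel rather than in hertz, the centres crowd together at low frequencies and spread out at high ones, which is the property that makes the resulting cepstral coefficients approximate human pitch perception.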


MFCC, Language Identification, SVM, GMM, LongRun technique.



Cite this paper

R. Madana Mohana, A. Rama Mohan Reddy. (2017) SLID: Hybrid Learning Model and Acoustic Approach to Spoken Language Identification using Machine Learning. International Journal of Signal Processing, 2, 183-195


Copyright © 2017 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0