EMOTION DETECTION FROM SPEECH

Deepak Kumar, Roll No.: 150108008, Branch: EEE
Harsh Sinha, Roll No.: 150108014, Branch: EEE
Kapil Kumar, Roll No.: 150108016, Branch: EEE
Vishal Kumar Sinha, Roll No.: 150108042, Branch: EEE

Abstract
The ability to understand the emotions in speech is what separates a human from a machine. Emotion is one of the biggest parts of speech and truly conveys its meaning. Thus, with the increasing mechanisation of the modern world, human-machine interaction is one of the most sought-after research areas in today's scientific community. Our project aims at classifying speech based on its emotion by extracting different features.
1. Introduction
Emotion classification is one of the most challenging tasks in the speech signal processing domain. The problem of speaker or speech recognition is relatively easier when compared with recognizing emotion from speech. The sound signal is one of the main media of communication, and it can be processed to recognize the speaker, the speech, or even the emotion. The basic principle behind emotion recognition lies in analysing the acoustic differences that occur when uttering the same thing in different emotional states. The growing use of machines and the need for man-machine interaction as natural as human interaction are what have motivated researchers to work on this problem. The whole idea behind the project is to develop a system through which a machine can understand the emotion in a human's speech. Emotion detection is a tool that will define the course of a new age of interdependence between humans and machines.
1.1 Introduction to Problem
The project aims at classifying basic human emotions such as sad, happy, neutral and angry in speech. The essence of the project lies in selecting the speech features that are responsible for various human emotions.
1.2 Figure
Figure: A robot hugging a human.
1.3 Literature Review
We consulted various research papers for the selection of features and for the subsequent steps. One of them is "Emotion Detection from Speech" (CS229, Stanford) and another is "Survey on speech emotion recognition: Features, classification schemes and databases" from the University of Cairo. The research papers we consulted are listed in the References.

As gathered from reading various papers on this topic, there are several features specifically suited to this task, such as LFPC and LPCC, along with common features like MFCC. There has been a lot of success in building machines with emotion detection, but research is still ongoing, and researchers are betting on achieving man-machine interaction via speech very soon.
1.4 Proposed Approach
We will start by collecting sample data from various sources. Then we plan to identify the particular features that are important and necessary for emotion detection (found after reading various research papers). We will then extract features from scratch, including short-term features like MFCC and ZCR. We will check the accuracy of our classifier for different combinations of features and select the combination with the best accuracy. After that, we will train the classifiers (KNN and SVM) on our processed data and measure the accuracy.
1.5 Report Organization
    The report is organized as follows:

  • Title and Group Information [0]
  • Abstract [0.1]
  • Introduction [1]
  • Proposed Approach [2]
    • Data Collection [2.1]
    • Preprocessing and Feature Selection [2.2]
    • Classification [2.3]
  • Experiments and Results [3]
  • Conclusions [4]
  • References [5]
2. Proposed Approach
The approach is divided into three main steps:
  1. Data Collection
  2. Preprocessing and Feature selection
  3. Classification
2.1 Data Collection
We have collected data from various online sources, recorded by semi-professional actors [dataset], and we have also recorded some data ourselves, but the major part of the training data is the online one.
2.2 Preprocessing and Feature selection
We selected datasets that were noise-free and of high quality, so our audio clips did not need much pre-processing. We therefore went on to extract features from our .wav files. We extracted a total of 34 features via code written in Python. The features are: ZCR, energy, entropy of energy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, MFCCs, chroma vectors and chroma deviation. After that, we normalized these features and produced a processed .csv file. We then tested combinations of these features, observed their effect on classifier accuracy, and chose the combination with the best accuracy. The code written for this extraction is here; the code for classification by KNN is here.
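As an illustration, here is a minimal sketch of this kind of per-frame short-term feature extraction, assuming the librosa library and non-overlapping 0.5 s frames (our actual extraction code is linked above; this computes only a subset of the 34 features):

```python
# Sketch of per-frame short-term feature extraction (librosa is an assumption;
# the report's own extraction code is linked above).
import librosa
import numpy as np
import pandas as pd

def extract_features(path, frame_s=0.5):
    y, sr = librosa.load(path, sr=None)           # keep the file's native sample rate
    n_fft = int(frame_s * sr)                     # one analysis window per 0.5 s frame
    hop = n_fft                                   # non-overlapping frames

    zcr      = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    energy   = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)

    feats = np.vstack([zcr, energy, centroid, rolloff, mfcc, chroma]).T  # frames x features
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)    # mean normalization
    return feats

frames = extract_features("sample.wav")           # "sample.wav" is a placeholder path
pd.DataFrame(frames).to_csv("features.csv", index=False)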

Feature Combinations and Outputs

Feature Combination                                    Frame Size (s)  Classifier  Accuracy
All extracted features                                 0.5             SVM         0.46
Removing spectral centroid and spread                  0.5             SVM         0.48
Removing some of the MFCCs                             0.5             SVM         0.32
Removing chroma vectors*                               0.5             SVM         0.76
Removing chroma vectors along with chroma deviation*   0.5             SVM         0.82
Removing chroma vectors along with chroma deviation**  0.5             SVM         1.00

* Dataset split 80:20 into train and test.
New test data evaluated with the same trained classifier.
** Whole test and train data shuffled before splitting for this run.
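To make the combination search above concrete, here is a minimal sketch that drops one feature group at a time and reports the resulting SVM accuracy. The file name, the label column, and the column positions of each feature group are assumptions for illustration:

```python
# Sketch of the feature-combination search: drop one feature group at a time
# and compare SVM accuracy. Column indices per group are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv("features.csv")                   # assumed processed CSV
X, y = data.drop(columns=["label"]), data["label"]   # assumes a 'label' column

groups = {                                           # hypothetical column ranges
    "spectral centroid and spread": list(range(3, 5)),
    "chroma vectors": list(range(21, 33)),
    "chroma deviation": [33],
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, cols in groups.items():
    keep = [c for c in range(X.shape[1]) if c not in cols]
    acc = SVC().fit(X_tr.iloc[:, keep], y_tr).score(X_te.iloc[:, keep], y_te)
    print(f"without {name}: accuracy = {acc:.2f}")
```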
2.3 Classification
After reading the papers, we learned that SVMs and KNNs are among the best algorithms for this classification task.

SVM

In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

KNN

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor. In k-NN regression, the output is the property value for the object, computed as the average of the values of its k nearest neighbors.
We first trained our data with KNN and did not get very satisfactory accuracy. We then trained our data with SVM and got satisfactory accuracy. First we trained and tested on our dataset with an 80:20 split; then we kept the same trained classifier and tested it on new test data. After that, we shuffled the whole train and test data and obtained the final classification.
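A minimal sketch of this training and evaluation procedure with scikit-learn (the file name, label column, and hyperparameters are assumptions; our actual code is linked in Section 2.2):

```python
# Sketch of the evaluation procedure described above: shuffle, split 80:20,
# train a KNN baseline, then an SVM on the same split, and compare accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

data = pd.read_csv("features.csv")                   # assumed processed CSV
X, y = data.drop(columns=["label"]), data["label"]

# shuffle=True mixes the whole dataset before splitting, as in the final run
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          shuffle=True, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

print("KNN accuracy:", knn.score(X_te, y_te))
print("SVM accuracy:", svm.score(X_te, y_te))
```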
3. Experiments & Results
3.1 Dataset Description
We used two datasets.
The Toronto Emotional Speech Set (TESS) was used. Target words were spoken in the carrier phrase "Say the word _____" by an actress aged 26, and recordings were made of the set portraying each of three emotions (anger, happiness and sadness). The actress speaks English as her first language and has musical training.
Authors: Kate Dupuis, M. Kathleen Pichora-Fuller, University of Toronto, Psychology Department, 2010. Dataset1
Another dataset we used was the Fullon Emotional Speech Synthesis collection. Dataset2
We also recorded some of our own samples. Here
3.2 Discussion and Results
While starting the project, we read many research papers and discussed the various features being used and the research currently going on. We first trained the data using SVM and KNN. The data was in CSV format with 34 features. We then wrote code for testing various combinations of features and found that removing chroma deviation and some other chroma features gave a better result. The first version of the code reported accuracy per segment of a clip. Our second version classified a whole clip, and this gave accuracy close to 90%. The idea is that a clip contains multiple segments, each classified with its own label; for each clip we take the label with the maximum count across its segments, and hence obtain better accuracy.
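The clip-level majority vote can be sketched as follows (the trained classifier and the per-segment feature rows come from the earlier steps; the names here are illustrative):

```python
# Sketch of the clip-level decision: classify every segment of a clip,
# then assign the clip the most frequent segment label (majority vote).
from collections import Counter

def classify_clip(clf, segment_features):
    """segment_features: 2-D array with one row of features per segment."""
    segment_labels = clf.predict(segment_features)
    return Counter(segment_labels).most_common(1)[0][0]

# Usage (illustrative): clip_label = classify_clip(svm, frames)
```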
4. Conclusions
4.1 Summary
The aim of our project was to classify emotions from speech. We started with data collection from various sources: we used the online datasets mentioned above and also collected some samples ourselves. After data collection, we moved on to pre-processing and feature selection. In pre-processing we used mean normalization and also tried scaling. In feature selection, we wrote code to extract 34 short-term features, including MFCC, ZCR and others, and tried out different feature subsets by applying them to our classifier. Then we trained our classifiers, first KNN and then SVM. We got better results with SVM, so we stuck with it. Finally, we tested our data on the chosen classifier.
4.2 Future Extensions
We extracted fairly basic features in this project due to time constraints. We are therefore thinking of extracting some more features, such as LFPC and formants, and then using them.

This project performed single-label classification, where we were finding a single emotion in one sound clip. We have thought of extending it further and building a multi-label classification system, where a clip is classified into the different emotions it contains. We have already done some work on this and built a multi-label KNN. This is what we have worked on.
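A minimal sketch of such a multi-label KNN, assuming scikit-learn and toy data: the target y is a binary indicator matrix with one column per emotion, which KNeighborsClassifier accepts directly, so a clip can carry several labels at once:

```python
# Sketch of a multi-label KNN: y is a binary indicator matrix with one
# column per emotion (sad, happy, neutral, angry). The data here is a toy
# stand-in for the real feature matrix.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 34)                    # 34 features per clip (toy data)
y = np.random.randint(0, 2, size=(100, 4))     # columns: sad, happy, neutral, angry

mlknn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(mlknn.predict(X[:3]))                    # one 0/1 row per clip, one column per emotion
```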
5. References
  • Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases". Link
  • Shah, Hewlett, "Emotion Detection from Speech". Link
  • "Feature Extraction for Speech Recognition". Link
  • Dimitrios Ververidis, Constantine Kotropoulos, "Emotional speech recognition: Resources, features, and methods". Link
  • Assel Davletcharova, Sherin Sugathan, Bibia Abraham, Alex Pappachen James, "Detection and Analysis of Emotion From Speech Signals". Link
  • F. Yu, E. Chang, Y. Xu, H. Shum, "Emotion detection from speech to enrich multimedia content". Link