CSC 401/2511 Assignment 3
TAs: Amanjit Kainth (amanjitsk@cs.toronto.[suffix]) and Shuja Khalid (skhalid@cs.toronto.[suffix])

Copyright © 2020, University of Toronto. All rights reserved.
1 Introduction
This assignment introduces you to Gaussian mixture modelling, and two basic tasks in speech technology: speaker identification, in which we try to determine who is talking, and speech recognition, in which we try to determine what was said.
The assignment is divided into two sections. In the first, you will experiment with speaker identification by training a mixture of Gaussians on the acoustic characteristics of each speaker, and then identifying speakers using these models. In the second section, you will evaluate two speech recognition engines.
The data come from the CSC Deceptive Speech corpus, which was developed by Columbia University, SRI International, and the University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on features extracted from the corpus.
Data are in /u/cs401/A3/data/; each sub-folder represents speech from one speaker and contains raw audio, pre-computed MFCCs, and orthographic transcripts. Further file descriptions are in Appendix A.
2 Speaker Identification
Speaker identification is the task of correctly identifying speaker $s_c$ from among $S$ possible speakers $s_{i=1..S}$ given an input speech sequence $X$, consisting of a succession of $d$-dimensional real vectors. In the interests of efficiency, $d = 13$ in this assignment. Each vector represents a small 25 ms unit of speech called a frame. Speakers are identified by training data that are ascribed to them. This is a discrete classification task (choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.
Gaussian Mixture Models
Gaussian mixture models are often used to generalize models from sparse data. They can tightly constrain large-dimensional data by using a small number of components but can, with many more components, model arbitrary density distributions. Sometimes, they are simply used because the domain being modelled appears to have multiple modes.
Given $M$ components, GMMs are modelled by a collection of parameters, $\theta = \{\omega_{m=1..M}, \mu_{m=1..M}, \Sigma_{m=1..M}\}$, where $\omega_m$ is the probability that an observation is generated by the $m^{th}$ component. These are subject to the constraint that $\sum_m \omega_m = 1$ and $0 \le \omega_m \le 1$. Each component is a multivariate Gaussian distribution, which is characterized by that component's mean, $\mu_m$, and covariance matrix, $\Sigma_m$. For reasons
of computational efficiency, we will reintroduce some independence assumptions by assuming that every component's covariance matrix is diagonal, i.e.:

$$\Sigma_m = \begin{pmatrix} \Sigma_m[1] & 0 & \cdots & 0 \\ 0 & \Sigma_m[2] & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_m[d] \end{pmatrix}$$

for some vector $\vec{\Sigma}_m$. Therefore, only $d$ parameters are necessary to characterize a component's (co)variance.
2.1 Utility functions [10 marks]
First, we implement three utility functions in /u/cs401/A3/code/a3_gmm.py. To begin, implement log_b_m_x, which computes the log observation probability of $\vec{x}_t$ for the $m^{th}$ mixture component, i.e., the log of:
$$b_m(\vec{x}_t) = \frac{\exp\left[-\frac{1}{2}\sum_{n=1}^{d}\frac{(x_t[n]-\mu_m[n])^2}{\Sigma_m[n]}\right]}{(2\pi)^{d/2}\sqrt{\prod_{n=1}^{d}\Sigma_m[n]}} \qquad (1)$$
Next, implement log_p_m_x, which is the log probability of $m$ given $\vec{x}_t$ using model $\theta$, i.e., the log of:
$$p(m\,|\,\vec{x}_t;\theta) = \frac{\omega_m b_m(\vec{x}_t)}{\sum_{k=1}^{M}\omega_k b_k(\vec{x}_t)} \qquad (2)$$
Finally, implement logLik, which is the log likelihood of a set of data X, i.e.:
$$\log P(X;\theta_s) = \sum_{t=1}^{T}\log p(\vec{x}_t;\theta_s) \qquad (3)$$
where
$$p(\vec{x}_t;\theta) = \sum_{m=1}^{M}\omega_m b_m(\vec{x}_t) \qquad (4)$$
and $b_m$ is defined in Equation 1. For efficiency, we just pass $\theta$ and the precomputed $b_m(\vec{x}_t)$ values to this function.
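As a sketch of one way to implement this, assuming the precomputed values arrive as the $M \times T$ array of $\log b_m(\vec{x}_t)$ recommended in Appendix B:

```python
import numpy as np
from scipy.special import logsumexp

def logLik(log_Bs, myTheta):
    """Log of Equation 3, given log_Bs, an M x T array whose (m, t) entry
    is log b_m(x_t), and mixture weights myTheta.omega of shape (M,)."""
    # Equation 4 in the log domain: log p(x_t) = logsumexp_m(log w_m + log b_m)
    log_ps = logsumexp(np.log(myTheta.omega)[:, None] + log_Bs, axis=0)
    return np.sum(log_ps)   # Equation 3: sum over frames t
```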
2.2 Training Gaussian mixture models [5 marks]
Now we train an $M$-component GMM for each of the speakers in the data set. Specifically, for each speaker $s$, train the parameters $\theta_s = \{\omega_{m=1..M}, \mu_{m=1..M}, \Sigma_{m=1..M}\}$ according to the method described in Appendix B. In all cases, assume that covariance matrices $\Sigma_m$ are diagonal. Start with $M = 8$; you'll be asked to experiment with that in Section 2.4. Complete the function train in /u/cs401/A3/code/a3_gmm.py.
2.3 Classification with Gaussian mixture models [5 marks]
Now we test each of the test sequences we've already set aside for you in the main function. I.e., we check if the actual speaker is also the most likely speaker, $\hat{s}$:
$$\hat{s} = \operatorname*{argmax}_{s=1,\ldots,S} \log P(X;\theta_s) \qquad (5)$$
Complete the function test in /u/cs401/A3/code/a3_gmm.py. Run through a train-test cycle, and save the output that this function writes to stdout, using the $k = 5$ top alternatives, to the file gmmLiks.txt.
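A hedged sketch of the decision rule in Equation 5 follows; the helper compute_log_Bs (which would build the $M \times T$ array for one model) and the attribute models[s].name are hypothetical:

```python
import numpy as np

def test(mfcc, correctID, models, k=5):
    """Sketch: score one test sequence against every speaker model, print the
    top-k alternatives as 'name loglik' pairs, and return 1 if the argmax
    of Equation 5 matches the actual speaker."""
    log_liks = [logLik(compute_log_Bs(mfcc, theta), theta) for theta in models]
    ranked = np.argsort(log_liks)[::-1]          # most likely model first
    for s in ranked[:k]:
        print(models[s].name, log_liks[s])
    return int(ranked[0] == correctID)
```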
2.4 Experiments and discussion [10 marks]
Experiment with the settings of $M$ and maxIter (or $\epsilon$, if you wish). For example, what happens to classification accuracy as the number of components decreases? What about when the number of possible speakers, $S$, decreases? You will be marked on the detail with which you empirically answer these questions and whether you can devise one or more additional valid experiments of this type.
Additionally, your report should include short hypothetical answers to the following questions:
• How might you improve the classification accuracy of the Gaussian mixtures, without adding more training data?
• When would your classifier decide that a given test utterance comes from none of the trained speaker models, and how would your classifier come to this decision?
• Can you think of some alternative methods for doing speaker identification that don't use Gaussian mixtures?
Put your experimental analysis and answers to these questions in the file gmmDiscussion.txt.
3 Speech Recognition [10 marks]
Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input speech sequence $X$. To simplify your lives, we have run two popular ASR engines on our data: the open-source and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.
We want to see which of Kaldi and Google is the more accurate on our data. For each speaker in our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans), transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of Google); see Appendix A.
Complete the file at /u/cs401/A3/code/a3_levenshtein.py. Specifically, in the Levenshtein function, accept lists of words r (reference) and h (hypothesis), and return a 4-item list containing the floating-point WER, the number of substitutions, the number of insertions, and the number of deletions, where
$$WER = \frac{numSubstitutions + numInsertions + numDeletions}{numReferenceWords}$$
Assume that the cost of a substitution is 0 if the words are identical and 1 otherwise. The costs of insertion and deletion are both 1.
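For illustration, here is a minimal dynamic-programming sketch under exactly these costs; the backtrace that splits the total distance into substitution, insertion, and deletion counts is one reasonable choice among several:

```python
import numpy as np

def Levenshtein(r, h):
    """Word-level edit distance between reference r and hypothesis h (lists
    of words); returns [WER, nSubstitutions, nInsertions, nDeletions]."""
    n, m = len(r), len(h)
    dist = np.zeros((n + 1, m + 1), dtype=int)
    dist[:, 0] = np.arange(n + 1)          # delete every reference word
    dist[0, :] = np.arange(m + 1)          # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1, j - 1] + (0 if r[i - 1] == h[j - 1] else 1)
            dist[i, j] = min(sub,
                             dist[i, j - 1] + 1,    # insertion
                             dist[i - 1, j] + 1)    # deletion
    # Backtrace to attribute the distance to each error type.
    i, j, nSub, nIns, nDel = n, m, 0, 0, 0
    while i > 0 or j > 0:
        diag = dist[i - 1, j - 1] + (0 if r[i - 1] == h[j - 1] else 1)
        if i > 0 and j > 0 and dist[i, j] == diag:
            nSub += 0 if r[i - 1] == h[j - 1] else 1
            i, j = i - 1, j - 1
        elif j > 0 and dist[i, j] == dist[i, j - 1] + 1:
            nIns += 1
            j -= 1
        else:
            nDel += 1
            i -= 1
    wer = (nSub + nIns + nDel) / n if n > 0 else float("inf")
    return [wer, nSub, nIns, nDel]
```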
In the main function, iterate through each of the speakers, and through each line $i$ of their transcripts. Preprocess each line by removing all punctuation (other than [ and ]) and setting the text to lowercase. Output the following to stdout:
[SPEAKER] [SYSTEM] [i] [WER] S:[numSubstitutions], I:[numInsertions], D:[numDeletions]
where [SYSTEM] is either ‘Kaldi’ or ‘Google’.
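A sketch of the preprocessing and output loop is below. The regular expression's definition of punctuation, and the handling of the leading [i] [LABEL] fields on each transcript line (see Appendix A), are assumptions you may need to adjust:

```python
import os
import re

dataDir = "/u/cs401/A3/data/"

def preprocess(line):
    """Lowercase a transcript line and drop punctuation other than [ and ]:
    keeps word characters, whitespace, and square brackets (one assumption
    about what counts as punctuation)."""
    return re.sub(r"[^\w\s\[\]]", "", line.lower()).split()

for speaker in sorted(os.listdir(dataDir)):
    with open(os.path.join(dataDir, speaker, "transcripts.txt")) as f:
        refs = [preprocess(l) for l in f]
    for system in ("Kaldi", "Google"):
        with open(os.path.join(dataDir, speaker, f"transcripts.{system}.txt")) as f:
            for i, line in enumerate(f):
                wer, nS, nI, nD = Levenshtein(refs[i], preprocess(line))
                print(f"{speaker} {system} {i} {wer:.3f} S:{nS}, I:{nI}, D:{nD}")
```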
Save this output and put it into asrDiscussion.txt.
On the second-to-last line of asrDiscussion.txt, in free text, summarize your findings by reporting the average and standard deviation of WER for each of Kaldi and Google, separately, over all of these lines. If you want to be fancy, you can compute a statistical test of significance to see if one is better than the other, but you don't need to.
On the last line of asrDiscussion.txt, add a sentence or two describing anything you observe about the types of errors being made by each system, by manually examining the transcript files.
4 Bonus [up to 10 marks]
We will give up to 10 bonus marks for innovative work going substantially beyond the minimal requirements. These marks can make up for marks lost in other sections of the assignment, but your overall mark for this assignment cannot exceed 100%. You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the TA before embarking on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas:
Voice banking
We are running a large study of 'normative' data in which people from the general population donate their speech (and language) data so that we can learn subtle differences in pathological populations. If you go to https://www.cs.toronto.edu/talk2me/, you can obtain 5 bonus points if you complete 10 sessions, each at least 1 day apart. There is no limit on your age or first language. Currently, only Chrome and the most recent Firefox browsers are supported. Create a new username for this assignment, and indicate your username in your submission so that we can validate your submitted data.
Dimensionality reduction
Principal components analysis (PCA) is a method that converts some multivariate representation of data into a lower-dimensional representation by transforming the original data according to mutually orthogonal principal components.
Implement an algorithm that discovers a $d \times d'$ matrix $W$ that transforms a $d$-dimensional vector, $\vec{x}$, into a $d'$-dimensional vector $\vec{y}$ through a linear transformation based on PCA, where $d' < d$. Repeat speaker identification using data that have been transformed by PCA and report on what you observe, e.g., for different values of $d'$. Submit all code and materials necessary to repeat your experiments.
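One possible SVD-based sketch of such a transformation (the function name and return convention here are illustrative, not prescribed):

```python
import numpy as np

def pca_transform(X, d_prime):
    """Project the rows of X (a T x d data matrix) onto the top d' principal
    components, returning the T x d' transformed data and the d x d'
    projection matrix W."""
    X_centered = X - X.mean(axis=0)            # PCA assumes centred data
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:d_prime].T                         # shape (d, d_prime)
    return X_centered @ W, W
```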
ASR with sequence-to-sequence models
Try to do better than Kaldi or Google by implementing:
Chiu C-C, Sainath TN, Wu Y, et al. (2017) State-of-the-art Speech Recognition With Sequence-to-Sequence Models. http://arxiv.org/abs/1712.01769.
Consider using open-source end-to-end ASR using TensorFlow, e.g., deepSpeech.
Truth-and-lie detection
Each of the utterances has been labelled as either truthful or deceitful (see Appendix A). Train and test models to tell these utterances apart using the provided data. E.g.,
• Train a GMM for each of the Truth and Lie categories, using your code from Section 2 (see the sketch after this list).
• Try recurrent neural networks that read one MFCC vector at a time.
• Extract engineered features, such as those extracted in Assignment 1, from the text transcripts and classify using discriminative models in scikit-learn. Are words more discriminative than the audio?
• Consider how errors in ASR transcripts affect those extracted features, and therefore overall system accuracy, in a manner not dissimilar to Zhou L, Fraser KC, Rudzicz F. (2016) Speech recognition in Alzheimer's disease and in its assessment. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH.
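A hedged sketch of the first idea, using scikit-learn's GaussianMixture as a stand-in for your Section 2 code; the placeholder arrays stand for MFCC frames pooled over each class's training utterances:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_truth = rng.normal(size=(1000, 13))   # placeholder for pooled truthful MFCC frames
X_lie = rng.normal(size=(1000, 13))     # placeholder for pooled deceitful MFCC frames

gmm_truth = GaussianMixture(n_components=8, covariance_type="diag").fit(X_truth)
gmm_lie = GaussianMixture(n_components=8, covariance_type="diag").fit(X_lie)

def classify(utterance_mfcc):
    """Label an utterance (T x 13 MFCC array) by whichever class GMM assigns
    it the higher total frame log-likelihood."""
    ll_truth = gmm_truth.score_samples(utterance_mfcc).sum()
    ll_lie = gmm_lie.score_samples(utterance_mfcc).sum()
    return "T" if ll_truth > ll_lie else "L"
```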
5 General specification
We may test your code on different training and testing data in addition to those specified above. Where possible, do not hardwire directory names into your code. As part of grading your assignment, the grader may run your programs using test harness Python scripts. It is therefore important that your code precisely meets the specifications and formatting requirements, including arguments and file names.
If your code uses a file or helper script, it must read it either from the directory in which that code is being executed (i.e., probably 'pwd'), or it must read it from a subdirectory of /u/cs401 whose absolute path is completely specified in the code. Do not hardwire the absolute address of files in your home directory; the grader does not have access to that directory.
All your programs must contain adequate internal documentation to be clear to the graders. External documentation is not required.
This assignment is in Python 3.
5.1 Submission requirements
This assignment is submitted electronically. Submit your assignment on MarkUs. You should submit:
1. All your code for a3_gmm.py and a3_levenshtein.py (including helper files, if any).
2. The output file gmmLiks.txt.
3. Your discussion files gmmDiscussion.txt and asrDiscussion.txt.
4. The ID file available from the course website.
Do not tar or compress your files, and do not place your files in subdirectories.
6 Using your own computer
If you want to do some or all of this assignment on your laptop or other computer, you will have to do the extra work of downloading and installing the requisite software and data. You take on the risk that your computer might not be adequate for the task. You are strongly advised to upload regular backups of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately continue working on the assignment at teach.cs. When you have completed the assignment, you should try your programs out on teach.cs to make sure that they run correctly there. A submission that does not work on teach.cs will get zero marks.
A Appendix: Details on CSC data set
Each utterance is represented by the following file types:
*.wav The original speech waveform sampled at 16 kHz.
*.mfcc.npy The Mel-frequency cepstral coefficients obtained from an analysis of the waveform, in numPy format. Each row represents a 25 ms frame of speech and consists of 13 floating point values.
*.txt Label and orthographic transcription of each utterance, for each of Kaldi and Google ASR, and the human gold standard.
Participants were told that they were participating in a communication experiment which sought to identify people who fit the profile of top entrepreneurs in America. To this end, participants performed tasks and answered questions in six areas; they were later told that they had received low scores in some of those areas and did not fit the profile. The subjects then participated in an interview where they were told to convince the interviewer that they had actually achieved high scores in all areas and that they did indeed fit the profile. The interviewer's task was to determine how he thought the subjects had actually performed, and he was allowed to ask them any questions other than those that were part of the subjects' tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true or contained any false information by pressing one of two pedals hidden from the interviewer under a table.
Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled to 16 kHz. Interviews were orthographically transcribed by hand using the NIST EARS transcription guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected for alignment, and labels for global lies were annotated during transcription based on the known scores of the subjects versus their reported scores.
MFCCs were obtained using the python_speech_features module using default parameters, i.e., 25 ms windows, 13 cepstral coefficients, and 512 fast Fourier transform coefficients.
Each transcript file has the same format, where the $i^{th}$ line is:
[i] [LABEL] [TRANSCRIPT]
where i corresponds to i.wav and i.mfcc.npy, [LABEL] is the Global Lie label, and [TRANSCRIPT] is the actual transcript orthography. In [LABEL], the Global Lie valence and the version of the pre-interview task for the utterance appear before the colon (e.g., T/H), and the section name appears after the colon (e.g., INTERACTIVE).
Global Lie valence is indicated as: T == Truth; LU == Lie Up (subject claims better performance than was actually achieved); and LD == Lie Down (subject claims worse performance). Task version is indicated as: H == Hard; and E == Easy. So, for example, T/H:INTERACTIVE indicates that the subject is telling the truth based on having performed the hard version of the Interactive task.
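For instance, a small sketch of parsing one such line under the format above (the whitespace field separators and the bare integer index are assumptions; real lines may need more care):

```python
def parse_transcript_line(line):
    """Split an 'i LABEL TRANSCRIPT' line into its three fields, then split
    the label into its valence/task part (before the colon) and its section
    name (after the colon)."""
    i, label, transcript = line.strip().split(maxsplit=2)
    valence_task, section = label.split(":")   # e.g., 'T/H' and 'INTERACTIVE'
    return int(i), valence_task, section, transcript
```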
B Appendix: Training Gaussian mixture models
Input: MFCC data $X$, number of components $M$, threshold $\epsilon$, and maxIter
begin
  Initialize $\theta$ ;
  i := 0 ;
  prev_L := −∞ ; improvement := ∞ ;
  while i <= maxIter and improvement >= $\epsilon$ do
    ComputeIntermediateResults ;
    L := ComputeLikelihood ($X$, $\theta$) ;
    $\theta$ := UpdateParameters ($\theta$, $X$, L) ;
    improvement := L − prev_L ;
    prev_L := L ;
    i := i + 1 ;
  end
end
Algorithm 1: GMM training algorithm.
For ComputeIntermediateResults, it is strongly recommended that you create two $M \times T$ numPy arrays: one to store each value from Equation 1 and the other to store each value from Equation 2. In fact, we've set up the function logLik to encourage you to do this, to avoid redundant computations. You will use these values in both ComputeLikelihood and UpdateParameters, where the latter is accomplished thus:
$$\hat{\omega}_m = \frac{\sum_{t=1}^{T} p(m\,|\,\vec{x}_t;\theta)}{T}, \qquad
\hat{\vec{\mu}}_m = \frac{\sum_{t=1}^{T} p(m\,|\,\vec{x}_t;\theta)\,\vec{x}_t}{\sum_{t=1}^{T} p(m\,|\,\vec{x}_t;\theta)}, \qquad
\hat{\vec{\Sigma}}_m = \frac{\sum_{t=1}^{T} p(m\,|\,\vec{x}_t;\theta)\,\vec{x}_t^{\,2}}{\sum_{t=1}^{T} p(m\,|\,\vec{x}_t;\theta)} - \hat{\vec{\mu}}_m^{\,2} \qquad (6)$$
In the third equation, the square of a vector on the right-hand side is defined as the component-wise square of each dimension in the vector. Note that you don't need to break up Algorithm 1 into separate functions as implied; it is only written that way above to emphasize the sequence of steps.
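Putting Algorithm 1 and Equation 6 together, here is a compact, runnable sketch for diagonal-covariance GMMs. The function name, the initialization scheme, and the (omega, mu, Sigma) return convention are all illustrative assumptions; your actual train must follow the provided starter code in a3_gmm.py:

```python
import numpy as np
from scipy.special import logsumexp

def train_sketch(X, M=8, epsilon=0.0, maxIter=20, seed=0):
    """EM for a diagonal-covariance GMM on X (T x d), following Algorithm 1.
    Returns (omega, mu, Sigma); initialization is one arbitrary choice."""
    T, d = X.shape
    rng = np.random.default_rng(seed)
    omega = np.full(M, 1.0 / M)                # uniform mixture weights
    mu = X[rng.choice(T, M, replace=False)]    # M random frames as initial means
    Sigma = np.ones((M, d))                    # unit diagonal covariances
    i, prev_L, improvement = 0, -np.inf, np.inf
    while i <= maxIter and improvement >= epsilon:
        # ComputeIntermediateResults: M x T arrays for Equations 1 and 2
        log_Bs = (-0.5 * (((X[None] - mu[:, None]) ** 2) / Sigma[:, None]).sum(-1)
                  - 0.5 * d * np.log(2 * np.pi)
                  - 0.5 * np.log(Sigma).sum(-1)[:, None])
        log_joint = np.log(omega)[:, None] + log_Bs
        L = logsumexp(log_joint, axis=0).sum()               # Equation 3
        P = np.exp(log_joint - logsumexp(log_joint, axis=0)) # Equation 2, M x T
        # UpdateParameters: Equation 6
        denom = P.sum(axis=1, keepdims=True)                 # sum_t p(m | x_t)
        omega = P.sum(axis=1) / T
        mu = (P @ X) / denom
        Sigma = (P @ X ** 2) / denom - mu ** 2
        improvement, prev_L, i = L - prev_L, L, i + 1
    return omega, mu, Sigma
```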