Written (30 pts)
Softmax (5pts) Prove that softmax is invariant to a constant offset in the input, that is, for any input vector x and any constant c,
softmax(x) = softmax(x + c)
where x + c means adding the constant c to every dimension of x. Remember that
softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
Note: In practice, we make use of this property and choose c = maxi xi when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of x).
Sigmoid (5pts) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only \sigma(x), but not x, is present). Assume that the input x is a scalar for this question. Recall that the sigmoid function is:
\sigma(x) = \frac{1}{1 + e^{-x}}
Word2vec
(5pts) Assume you are given a predicted word vector v_c corresponding to the center word c for skip-gram, and the word prediction is made with the softmax function
\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}
where o is the expected word, w denotes the w-th word, and u_w (w = 1, ..., W) are the "output" word vectors for all words in the vocabulary. The cross entropy function is defined as:
J_{CE}(o, v_c, U) = CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)
where the gold vector y is a one-hot vector, the softmax prediction vector \hat{y} is a probability distribution over the output space, and U = [u_1, u_2, ..., u_W] is the matrix of all the output vectors. Assuming the cross entropy cost is applied to this prediction, derive the gradients with respect to v_c.
(5pts) As in the previous part, derive gradients for the "output" word vectors u_w (including u_o).
(5pts) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector v_c. Assume that K negative samples (words) are drawn, denoted 1, ..., K. For simplicity of notation, assume o \notin \{1, ..., K\}. Again, for a given word o, use u_o to denote its output vector. The negative sampling loss function in this case is:
J_{neg-sample}(o, v_c, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))
iv. (5pts) Derive gradients for all of the word vectors for skip-gram given the previous parts and given a set of context words [word_{c-m}, ..., word_c, ..., word_{c+m}], where m is the context size. Denote the "input" and "output" word vectors for word k as v_k and u_k respectively.
Hint: feel free to use F(o, v_c) (where o is the expected word) as a placeholder for the J_{CE}(o, v_c, ...) or J_{neg-sample}(o, v_c, ...) cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form \frac{\partial F(o, v_c)}{\partial ...}. Recall that for skip-gram, the cost for a context centered around c is:
\sum_{-m \le j \le m,\, j \ne 0} F(w_{c+j}, v_c)
Coding (70 points)
(5pts) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in 1.(a). Write your implementation in utils.py.
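One minimal sketch of this row-wise, numerically stable softmax, assuming NumPy; the exact function name and signature expected in utils.py may differ:

import numpy as np

def softmax(x):
    # Shift each row by its maximum before exponentiating (the offset
    # invariance proved in the written part) so that np.exp never overflows.
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)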
(5pts) Implement the sigmoid function in word2vec.py and test your code.
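A minimal sketch, assuming NumPy; word2vec.py may expect a different name or an extra gradient helper, so treat this only as the core computation:

import numpy as np

def sigmoid(x):
    # Elementwise logistic function 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # Gradient written in terms of the function value s = sigmoid(x),
    # matching the form asked for in the written part.
    return s * (1.0 - s)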
(45pts) Implement the word2vec models with stochastic gradient descent (SGD).
(15pts) Write a helper function normalizeRows in utils.py to normalize the rows of a matrix, which will be used in word2vec.py. In the same file, fill in the implementation for the softmax and negative sampling cost and gradient functions. Then, fill in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python word2vec.py.
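A possible sketch of normalizeRows alone, assuming NumPy and that each row should be scaled to unit L2 norm; the softmax and negative sampling cost and gradient functions are the graded part and are not sketched here:

import numpy as np

def normalizeRows(x):
    # Divide each row of the matrix by its L2 norm so every row has unit length.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms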
(15pts) Complete the implementation for your SGD optimizer in sgd.py. Test your implementation by running python sgd.py.
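A bare-bones SGD loop illustrating the update rule, under the assumption that the callback f returns a (cost, gradient) pair; the real sgd.py will likely also need parameter saving, learning-rate annealing, and similar bookkeeping:

def sgd(f, x0, step, iterations):
    # f maps parameters x to a (cost, gradient) pair; x0 is the initial
    # parameter array and step is the learning rate.
    x = x0
    for _ in range(iterations):
        cost, grad = f(x)
        x = x - step * grad  # plain gradient-descent update
    return x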
(15pts) Implement the k-nearest neighbors algorithm, which will be used for analysis. The algorithm receives a vector, a matrix, and an integer k, and returns k indices of the matrix's rows that are closest to the vector. Use the cosine similarity as a distance metric (https://en.wikipedia.org/wiki/Cosine_similarity). Implement the algorithm in knn.py. Print out 10 examples: each example is one word and its neighbors.
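A cosine-similarity kNN sketch, assuming NumPy; the exact signature expected in knn.py is an assumption:

import numpy as np

def knn(vector, matrix, k):
    # Cosine similarity between the query vector and every row of the matrix.
    sims = matrix @ vector / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector) + 1e-12)
    # Return the indices of the k most similar rows, most similar first.
    return np.argsort(-sims)[:k]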
(15pts) Load some real data and train your own word vectors.
Use the StanfordSentimentTreeBank data to train word vectors. Process the dataset and use the sgd function and word2vec to generate word vectors. Visualize a few word examples. There is no additional code to write for this part; just run python run.py.
Note: The training process may take a long time depending on the efficiency of your implementation. Plan accordingly! When the script finishes, a visualization for your word vectors will appear. It will also be saved as word vectors.png in your project directory. In addition, the script should print the nearest neighbors of a few words (using the knn function you implemented in 2(g)). Include the plot and the nearest-neighbor lists in your homework write-up, and briefly explain those results.
Submission Instructions You shall submit a zip file named Assignment2 LastName FirstName.zip which contains:
a pdf (or jpg) file containing all your solutions for the Written part
a pdf (or jpg) file containing the word vector plot (vector.png) and a brief report of your knn results
python files (sgd.py, word2vec.py, run.py, knn.py, utils.py)