Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions for submitting to Gradescope carefully! Failure to follow these instructions may result in parts of your assignment not being graded. We will not entertain regrading requests for failure to follow instructions.
◦ For Section 1: Multiple Choice Questions, it is mandatory to use the LaTeX template provided on the class webpage (https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/assets/ps0.zip). For every question, there is only one correct answer. To mark the correct answer, change \choice to \CorrectChoice
◦ For Section 2: Proofs, each problem/sub-problem is on its own page. This section has 5 total problems/sub-problems, so you should have 5 pages corresponding to this section. Your answer to each sub-problem should fit in its corresponding page.
◦ For Section 2, LaTeX'd solutions are strongly encouraged (solution template available at https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/assets/ps0.zip), but scanned handwritten copies are acceptable. If you scan handwritten copies, please make sure to append them to the PDF generated by LaTeX for Section 1.
2. Hard copies are not accepted.
3. We generally encourage you to collaborate with other students. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. Please list the students you collaborated with.
Exception: PS0 is meant to serve as a background preparation test. You must NOT collaborate on PS0.
• Multiple Choice Questions
1. (1 point) (true/false) We are machine learners with a slight gambling problem (very different from gamblers with a machine learning problem!). Our friend, Bob, is proposing the following payout on the roll of a die:

payout = \begin{cases} +\$1/4 & x \neq 1 \\ -\$1 & x = 1 \end{cases}    (1)

where x ∈ {1, 2, 3, 4, 5, 6} is the outcome of the roll, (+) means payout to us and (−) means payout to Bob. Is this a good bet, i.e., are we expected to make money?

◦ True
◦ False
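A quick numerical sanity check can sit alongside the analytic expectation here; the sketch below simply enumerates the six equally likely outcomes of the payout reconstructed in Eq. (1) and cross-checks with a small simulation (the helper names are ours, not part of the assignment).

```python
import random

# Payout from Eq. (1): +$0.25 when the roll is not 1, -$1.00 when it is 1.
def payout(x):
    return 0.25 if x != 1 else -1.0

# Exact expectation by enumerating the six equally likely outcomes.
exact = sum(payout(x) for x in range(1, 7)) / 6
print(f"exact expected payout:  {exact:+.4f}")

# Monte Carlo cross-check.
n = 100_000
mc = sum(payout(random.randint(1, 6)) for _ in range(n)) / n
print(f"simulated expectation:  {mc:+.4f}")
```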
2. (1 point) X is a continuous random variable with the probability density function:

p(x) = \begin{cases} 4x & 0 \le x \le 1/2 \\ -4x + 4 & 1/2 \le x \le 1 \end{cases}    (2)

Which of the following statements are true about the corresponding cumulative distribution function (CDF) C(x)?
[Hint: Recall that the CDF is defined as C(x) = Pr(X ≤ x).]

◦ C(x) = 2x^2 for 0 ≤ x ≤ 1/2
◦ C(x) = −2x^2 + 4x − 3/2 for 1/2 ≤ x ≤ 1
◦ All of the above
◦ None of the above
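If you want to double-check a candidate CDF, one low-effort sketch (assuming the piecewise density reconstructed in Eq. (2); the function names are ours) integrates p(x) numerically and prints C(x) at a few points for comparison with whatever closed form you derive.

```python
import numpy as np

def p(x):
    # Piecewise density from Eq. (2).
    return np.where(x <= 0.5, 4.0 * x, -4.0 * x + 4.0)

def cdf_numeric(x, n=100_001):
    # Trapezoidal integration of p on [0, x].
    t = np.linspace(0.0, x, n)
    y = p(t)
    return float(np.sum((y[:-1] + y[1:]) * (t[1] - t[0]) / 2.0))

for x in (0.25, 0.5, 0.75, 1.0):
    print(f"C({x:.2f}) ~= {cdf_numeric(x):.4f}")
```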
3. (2 points) A random variable x drawn from the standard normal distribution has the probability density

p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}    (3)

Evaluate the following integral:

\int_{-\infty}^{\infty} p(x)(ax^2 + bx + c)\,dx    (4)

[Hint: We are not sadistic (okay, we’re a little sadistic, but not for this question). This is not a calculus question.]

◦ a + b + c
◦ c
◦ a + c
◦ b + c
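Whichever closed form you pick, a Monte Carlo estimate under the standard normal is an easy way to sanity-check it; the coefficient values below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 2.0, -3.0, 0.5              # arbitrary test coefficients
x = rng.standard_normal(1_000_000)    # samples from p(x), the standard normal

# The sample mean of a*x^2 + b*x + c approximates the integral in Eq. (4).
estimate = float(np.mean(a * x**2 + b * x + c))
print(f"Monte Carlo estimate of Eq. (4): {estimate:.4f}")
```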
4. (2 points) Consider the following function of x = (x_1, x_2, x_3, x_4, x_5, x_6):
f(x) = \log\left( \frac{5\,\max\{x_1, x_2\}\; x_4\; \sigma(x_5 + x_6) + 2}{x_3} \right)    (5)
where σ is the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}    (6)
Compute the gradient ∇_x f(·) and evaluate it at x̂ = (5, −1, 6, 12, 7, 5).
◦ (0, 0.157, 0.065, 0.131, 0.846, 0.846)^T
◦ (0, 0.157, 0.065, 0.131, 0.314, 0.314)^T
◦ (0, 0.031, 0.013, 0.026, 0.062, 0.062)^T
◦ (0, 0.468, 0.195, 0.390, 0.937, 0.937)^T
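Once you commit to a reading of Eq. (5), a finite-difference gradient check is a standard way to validate a hand-derived gradient before matching it to a choice above; the body of f below is only a placeholder to be replaced with your own implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x):
    # Placeholder: substitute your implementation of Eq. (5) here.
    return float(np.sum(sigmoid(x)))

def numerical_grad(func, x, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (func(x + e) - func(x - e)) / (2.0 * eps)
    return g

x_hat = np.array([5.0, -1.0, 6.0, 12.0, 7.0, 5.0])   # evaluation point from the problem
print(numerical_grad(f, x_hat))   # compare against your analytic gradient
```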
5. (2 points) Which of the following functions are convex?
◦ ‖x‖_{1/2}
◦ min_i a_i^T x for x ∈ R^n
◦ log(1 + exp(w^T x_i)) for w ∈ R^d
◦ All of the above
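A non-proof sanity check for questions like this is to probe Jensen's inequality at random point pairs; the sketch below does this for a generic function g of your choosing (the l1-norm example at the end is just an illustration, not one of the listed choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_convex(g, dim=5, trials=10_000, tol=1e-9):
    # Search for a violation of g(l*u + (1-l)*v) <= l*g(u) + (1-l)*g(v).
    for _ in range(trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        lam = rng.uniform()
        if g(lam * u + (1 - lam) * v) > lam * g(u) + (1 - lam) * g(v) + tol:
            return False   # explicit counterexample found: not convex
    return True            # no violation found (suggestive, not a proof)

print(looks_convex(lambda x: float(np.sum(np.abs(x)))))   # e.g. the l1 norm
```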
6. (2 points) Suppose you want to predict an unknown value Y ∈ R, but you are only given a sequence of noisy observations x_1, ..., x_n of Y with i.i.d. noise (x_i = Y + ε_i). If we assume the noise is i.i.d. Gaussian (ε_i ∼ N(0, σ^2)), the maximum likelihood estimate ŷ for Y can be given by:

A: ŷ = argmin_y Σ_{i=1}^{n} (y − x_i)^2
B: ŷ = argmin_y Σ_{i=1}^{n} |y − x_i|
C: ŷ = (1/n) Σ_{i=1}^{n} x_i

◦ Both A & C
◦ Both B & C
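To build intuition here, a small simulation (the true value, noise level, and sample size below are arbitrary) compares the sample mean against grid-search minimizers of the squared and absolute losses.

```python
import numpy as np

rng = np.random.default_rng(0)
Y, sigma, n = 3.0, 1.0, 1_000
x = Y + sigma * rng.standard_normal(n)         # noisy observations x_i = Y + eps_i

grid = np.linspace(x.min(), x.max(), 2_001)    # candidate values of y
sq_loss = ((grid[:, None] - x) ** 2).sum(axis=1)
abs_loss = np.abs(grid[:, None] - x).sum(axis=1)

print("sample mean:          ", x.mean())
print("argmin squared loss:  ", grid[np.argmin(sq_loss)])
print("argmin absolute loss: ", grid[np.argmin(abs_loss)])
```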
• Proofs
7. (3 points) Prove that
\log_e x \le x - 1, \quad \forall x > 0    (7)

with equality if and only if x = 1.
[Hint: Consider the derivative of log(x) − (x − 1) and think about concavity/convexity and second derivatives.]
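Before writing the proof, it can be reassuring to look at the inequality numerically; this is purely a sanity check on a grid, not a substitute for the argument.

```python
import numpy as np

x = np.linspace(1e-6, 10.0, 1_000_000)
gap = (x - 1.0) - np.log(x)   # Eq. (7) claims this is >= 0, with equality only at x = 1
print("min gap:", gap.min(), "attained near x =", x[np.argmin(gap)])
```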
8. (6 points) Consider two discrete probability distributions p and q over k outcomes:
\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} q_i = 1    (8a)

p_i > 0, \quad q_i > 0, \quad \forall i \in \{1, \ldots, k\}    (8b)
The Kullback-Leibler (KL) divergence (also known as the relative entropy) between these distributions is given by:
KL(p, q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}    (9)
It is common to refer to KL(p, q) as a measure of distance (even though it is not a proper metric). Many algorithms in machine learning are based on minimizing the KL divergence between two probability distributions. In this question, we will show why this might be a sensible thing to do.
[Hint: This question doesn’t require you to know anything more than the definition of KL(p, q) and the identity in Q7.]
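For intuition, the definition in Eq. (9) is cheap to evaluate directly; the sketch below computes it for two arbitrary example distributions (chosen by us, not part of the assignment).

```python
import numpy as np

def kl(p, q):
    # KL(p, q) = sum_i p_i * log(p_i / q_i), Eq. (9); assumes strictly positive entries.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # arbitrary example distributions over k = 3 outcomes
q = np.array([0.1, 0.2, 0.7])
print("KL(p, q) =", kl(p, q))
print("KL(q, p) =", kl(q, p))
```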
(a) Using the results from Q7, show that KL(p, q) is always non-negative.
(b) When is KL(p, q) = 0?
(c) Provide a counterexample to show that the KL divergence is not a symmetric function of its arguments: KL(p, q) ≠ KL(q, p)
9. (6 points) In this question, you will prove that cross-entropy loss for a softmax classifier is convex in the model parameters, thus gradient descent is guaranteed to find the optimal parameters. Formally, consider a single training example (x, y). Simplifying the notation slightly from the implementation writeup, let

z = Wx + b,    (10)

p_j = \frac{e^{z_j}}{\sum_k e^{z_k}},    (11)

L(W) = -\log(p_y)    (12)
Prove that L(·) is convex in W.
[Hint: One way of solving this problem is “brute force” with first principles and Hessians. There are more elegant solutions.]
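As in Q5, a numerical probe along random directions in W-space can catch algebra mistakes before (or after) you write the proof; the dimensions and data below are arbitrary placeholders, and the midpoint test only checks convexity along each sampled segment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 4, 3                       # arbitrary input dimension and number of classes
x = rng.normal(size=d)            # a single training example
y = 1                             # its label
b = rng.normal(size=c)

def loss(W):
    # Cross-entropy of a softmax classifier, Eqs. (10)-(12).
    z = W @ x + b
    z = z - z.max()               # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])

# If L is convex in W, the midpoint of any segment cannot exceed the endpoint average.
for _ in range(5):
    W0, D = rng.normal(size=(c, d)), rng.normal(size=(c, d))
    print(loss(W0 + 0.5 * D) <= 0.5 * (loss(W0) + loss(W0 + D)) + 1e-9)
```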