Submission Instructions
It is recommended that you complete these exercises in Python 3 and submit your solutions as a Jupyter notebook.
You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.
Please upload all files (code, README, written answers, etc.) to Blackboard in a single zip file named {firstname} {lastname} DS5230/DS4220 HW 2.zip.
Exercise 1: Bayesian Linear Regression
In this assignment, you will derive some identities to express ridge regression as a maximum a posteriori (MAP) estimate relative to an appropriate prior. You will also derive confidence intervals on weights in Bayesian linear regression, as well as confidence intervals on predictions. Please complete this part of the exercise as a PDF file (preferably typed up in LaTeX).
a. As discussed in class, ridge regression (i.e. L2-regularized linear regression) minimizes the loss function
\[
\mathcal{L}(w) = \|y - Xw\|^2 + \lambda \|w\|^2 = \sum_{n=1}^{N} (y_n - x_n w)^2 + \lambda\, w^\top w.
\]
The closed-form solution for the weights w that minimizes this loss is
\[
\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y.
\]
To demonstrate that this solution is indeed an extremum of the loss function, show that it satisfies
\[
\nabla_w \mathcal{L}(w)\,\big|_{w=\hat{w}} = 2 X^\top (X \hat{w} - y) + 2 \lambda\, \hat{w} = 0.
\]
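If you would like to sanity-check part a numerically before writing it up, the short sketch below (not part of the required written answer) builds synthetic data, computes the closed-form ridge solution, and evaluates the gradient at that point; the data sizes and the value of λ are arbitrary choices.

```python
import numpy as np

# Synthetic data (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
lam = 2.0  # arbitrary regularization strength lambda

# Closed-form ridge solution: w_hat = (X^T X + lambda I)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Gradient of the loss at w_hat: 2 X^T (X w_hat - y) + 2 lambda w_hat
grad = 2 * X.T @ (X @ w_hat - y) + 2 * lam * w_hat
print(np.allclose(grad, 0.0, atol=1e-8))  # should print True
```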
b. Show that ridge regression (i.e. L2-regularized linear regression) is equivalent to a MAP estimation problem in which a prior is placed on the regression weights,
\[
\hat{w} = \operatorname*{argmax}_{w}\; p(w \mid y),
\]
\[
p(y \mid w) = \mathcal{N}(y;\, Xw,\, \sigma^2 I),
\]
\[
p(w) = \mathcal{N}(w;\, 0,\, s^2 I).
\]
Express the regularization parameter λ in terms of the constants σ and s (note: this problem should be easy if you have reviewed the slides).
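An optional numerical check for part b is to minimize the negative log posterior directly and compare the minimizer to the ridge solution. In the sketch below, the values of σ and s are arbitrary, and the grid of λ values is a placeholder: add the λ you derive and confirm it matches to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.5 * rng.normal(size=N)
sigma, s = 0.5, 2.0  # arbitrary likelihood and prior scales

def neg_log_posterior(w):
    # -log p(y|w) - log p(w), dropping constants that do not depend on w
    return np.sum((y - X @ w) ** 2) / (2 * sigma**2) + np.sum(w**2) / (2 * s**2)

w_map = minimize(neg_log_posterior, np.zeros(D)).x

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Add the lambda you derived (in terms of sigma and s) to this list;
# for that value the gap to w_map should be near zero.
for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, np.max(np.abs(ridge(lam) - w_map)))
```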
c. We will now consider the more general case where we place a multivariate Gaussian prior with parameters m_0 and S_0 on the weights,
\[
w \sim \mathcal{N}(m_0,\, S_0).
\]
Derive the marginal likelihood of y, which is defined as
\[
p(y \mid X, m_0, S_0) = \int \mathrm{d}w\; p(y \mid w, X)\, p(w \mid m_0, S_0),
\]
by calculating the mean μ and covariance Σ such that
\[
y \mid X, m_0, S_0 \sim \mathcal{N}(\mu,\, \Sigma).
\]
To do so, make use of the fact that for f = Xw we can write
\[
\mathbb{E}[f] = \mathbb{E}[Xw] = X\,\mathbb{E}[w],
\qquad
\mathrm{Cov}[f] = \mathrm{Cov}[Xw] = X\,\mathrm{Cov}[w]\,X^\top.
\]
You will additionally need to make use of the fact that the covariance of the sum of two uncorrelated variables (which we will here conveniently denote f and ε) is
\[
\mathrm{Cov}[f + \epsilon] = \mathrm{Cov}[f] + \mathrm{Cov}[\epsilon].
\]
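If you want to verify the mean and covariance you derive in part c, one option is a Monte Carlo check: sample weights from the prior, simulate y, and compare the empirical moments of y to your formulas. The sketch below uses small, arbitrary values for m_0, S_0, and σ.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma = 4, 2, 0.3          # keep N small so the covariance is easy to inspect
X = rng.normal(size=(N, D))
m0 = np.array([1.0, -0.5])
A = rng.normal(size=(D, D))
S0 = A @ A.T + 0.1 * np.eye(D)   # arbitrary positive-definite prior covariance

# Sample w ~ N(m0, S0), then y = X w + eps with eps ~ N(0, sigma^2 I)
S = 200_000
W = rng.multivariate_normal(m0, S0, size=S)    # shape (S, D)
Y = W @ X.T + sigma * rng.normal(size=(S, N))  # shape (S, N)

print("empirical mean:\n", Y.mean(axis=0))
print("empirical covariance:\n", np.cov(Y, rowvar=False))
# Compare these against the mu and Sigma you derive analytically.
```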
d. Using the result from the previous exercise, derive the predictive distribution at a new test point x_*, which is defined as
\[
p(y_* \mid y, X, x_*) = \int \mathrm{d}w\; p(y_* \mid w, x_*)\, p(w \mid y, X).
\]
Calculate the values f_* and σ_*^2 such that
\[
y_* \mid y, X, x_* \sim \mathcal{N}(f_*,\, \sigma_*^2).
\]
To do so, use the standard identity on Gaussians (which we have already seen in class), which states that when
\[
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} a \\ b \end{pmatrix},
\begin{pmatrix} A & C \\ C^\top & B \end{pmatrix}
\right),
\]
then
\[
\alpha \mid \beta \sim \mathcal{N}\!\left( a + C B^{-1} (\beta - b),\; A - C B^{-1} C^\top \right).
\]
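This conditioning identity can also be illustrated numerically: for jointly Gaussian scalars, the conditional mean of α given β is the linear regression of α on β with slope C B^{-1}, and the conditional variance is the residual variance A - C B^{-1} C^⊤. The optional sketch below checks this with arbitrarily chosen parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 1.0, -2.0          # means of alpha and beta
A, B, C = 2.0, 1.5, 0.8   # Var(alpha), Var(beta), Cov(alpha, beta)

joint_mean = np.array([a, b])
joint_cov = np.array([[A, C], [C, B]])
samples = rng.multivariate_normal(joint_mean, joint_cov, size=500_000)
alpha, beta = samples[:, 0], samples[:, 1]

# The empirical regression of alpha on beta approximates the conditional mean.
slope, intercept = np.polyfit(beta, alpha, 1)
residual_var = np.var(alpha - (slope * beta + intercept))

print("slope:     ", slope, " vs C/B =", C / B)
print("intercept: ", intercept, " vs a - (C/B) b =", a - (C / B) * b)
print("resid var: ", residual_var, " vs A - C^2/B =", A - C**2 / B)
```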
To check your work, substitute m_0 = 0 and S_0 = s^2 I into the result of the previous exercise. Express the solution for f_* = E[f(x_*)] as a function of x_*. What is this result equal to?
Exercise 2: Can PCA uncover academic communities?
In this problem, you will apply principal component analysis (PCA) to a real-world bibliographic dataset from Aminer (https://aminer.org/). The dataset (publications.txt) has been downloaded and is attached to this assignment. We will see if we can use latent semantic analysis (PCA on text) to discover structure in paper titles.
You may implement the following exercise using any language/system you choose. We recommend that you use the Jupyter Docker instance with Spark (http://spark.apache.org/). The pyspark.ml.feature namespace implements many of the prerequisite operations, which are documented here:
http://spark.apache.org/docs/latest/ml-features.html
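The snippet below sketches one way to load the titles into a Spark DataFrame. It assumes the Aminer text format in which each title sits on a line beginning with '#*'; inspect publications.txt and adjust the parsing if your copy is laid out differently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hw2-pca").getOrCreate()

# Assumes titles are on lines starting with '#*' (Aminer text format);
# adjust the prefix/parsing to match the actual layout of publications.txt.
lines = spark.sparkContext.textFile("publications.txt")
titles = (lines.filter(lambda l: l.startswith("#*"))
               .map(lambda l: (l[2:].strip(),)))
titles_df = spark.createDataFrame(titles, ["title"])
print(titles_df.count())
```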
In the first part of the exercise, we will consider the dataset as a whole.
a. Transform titles to word-count vectors. Truncate your (sparse) vectors to the 1000 most frequent words and perform PCA with 50 components on the counts (a starting-point sketch for parts a-e is given after part e below).
b. Plot the eigenvalues of the principal components. Calculate how many components are needed to explain 50% of the total variance.
c. Identify which words are important in each of the principal components. To do so, take the sum of squares of each of the component vectors to check how they are normalized. For each component, then print out the words for which the absolute value of the component is larger than 0.20 of the norm.
d. Make a scatter plot of some reasonably sized sample (1k-10k titles) in the first two principal components. Explain the structure (or lack thereof) you see based on your results from items b-c.
e. Run a preprocessing step to remove stop words (a list of stop words is provided which is identical to the list used in Spark). Rerun steps b-d and evaluate whether this has improved your representation.
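The sketch below is a possible starting point for parts a-e using pyspark.ml.feature. It assumes the `titles_df` DataFrame from the loading snippet above and Spark's built-in English stop-word list; the column names and sampling fraction are placeholders you are free to change.

```python
import numpy as np
import matplotlib.pyplot as plt
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, PCA

# a. Tokenize titles and build 1000-dimensional word-count vectors.
tokens = Tokenizer(inputCol="title", outputCol="words").transform(titles_df)
# e. Stop-word removal (Spark's built-in English list); run parts b-d with and
#    without this stage to compare.
clean = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="counts",
                           vocabSize=1000).fit(clean)
counts = cv_model.transform(clean)

# b. PCA with 50 components; explainedVariance is proportional to the eigenvalues.
pca_model = PCA(k=50, inputCol="counts", outputCol="pca").fit(counts)
ev = pca_model.explainedVariance.toArray()
plt.plot(ev, marker="o")
plt.xlabel("component"); plt.ylabel("explained variance fraction")
# If the cumulative sum never reaches 0.5, more than 50 components are needed.
print("components for 50% variance:", int(np.searchsorted(np.cumsum(ev), 0.5)) + 1)

# c. Words whose weight exceeds 0.20 of each component's norm.
pc = pca_model.pc.toArray()              # shape (vocab_size, k)
vocab = np.array(cv_model.vocabulary)
for j in range(pc.shape[1]):
    comp = pc[:, j]
    print(j, vocab[np.abs(comp) > 0.20 * np.linalg.norm(comp)])

# d. Scatter plot of a sample of titles in the first two components.
projected = pca_model.transform(counts)
sample = projected.select("pca").sample(fraction=0.01, seed=0).collect()
xy = np.array([row.pca.toArray()[:2] for row in sample])
plt.figure(); plt.scatter(xy[:, 0], xy[:, 1], s=3)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
```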
f. One of the issues with text is that the variance in how often a word appears is often correlated with the base frequency with which a word appears. To account for this, the word counts are often reweighted using the term frequency-inverse document frequency (TF-IDF) scheme:
\[
\mathrm{tf\text{-}idf}(t, d, D) := \mathrm{tf}(t, d)\cdot \mathrm{idf}(t, D)
\]
Here tf(t, d) is a measure of how often a term t (i.e. a vocabulary word) occurs in document d (i.e. a title). We will simply use the word counts calculated in part a. of this exercise:
\[
\mathrm{tf}(t, d) = |\{\, t' \in d : t' = t \,\}|
\]
The inverse document frequency idf(t, D) is used to discount terms t that occur commonly across all documents D. Here we will use:
\[
\mathrm{idf}(t, D) := \log \frac{|D| + 1}{|\{\, d \in D : t \in d \,\}| + 1}
\]
Note that idf(t, D) = 0 for words that occur in all documents.
Calculate TF-IDF features for all titles and rerun the operations in parts b-d of this exercise. How have your results changed?
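For the TF-IDF features, Spark's IDF estimator can be applied on top of the count vectors from the previous sketch; it uses the same log((|D|+1)/(df(t)+1)) discounting described above. The column names below are placeholders.

```python
from pyspark.ml.feature import IDF, PCA

# Reweight the word-count vectors (the `counts` DataFrame from the previous
# sketch) using the IDF weighting described above.
idf_model = IDF(inputCol="counts", outputCol="tfidf").fit(counts)
tfidf = idf_model.transform(counts)

# Rerun parts b-d on the reweighted features.
pca_tfidf = PCA(k=50, inputCol="tfidf", outputCol="pca_tfidf").fit(tfidf)
```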
In the second part of this exercise, we will look at subsets of the data. To construct these subsets, first construct two lists of titles:
- Titles for machine learning papers published at 'NIPS'
- Titles for database papers published at 'VLDB'
Merge the two sets of titles. Construct both word count vectors and TF-IDF features. Repeat steps b-d and compare word count results to TF-IDF results.
Now make a scatter plot of these two principal components, showing the titles from each subset in different colors. Again compare word counts and TF-IDF. Did PCA succeed in uncovering the differences between the communities?
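One possible way to produce the colored scatter plot is sketched below. It assumes you have already collected the two title lists into Python lists `nips_titles` and `vldb_titles` and that `spark` is your active SparkSession; swap the IDF stage in or out to compare TF-IDF against raw counts.

```python
import numpy as np
import matplotlib.pyplot as plt
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, PCA

# Assumes nips_titles and vldb_titles are Python lists of title strings
# built in the previous step.
labeled = spark.createDataFrame(
    [(t, "NIPS") for t in nips_titles] + [(t, "VLDB") for t in vldb_titles],
    ["title", "venue"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="title", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="counts", vocabSize=1000),
    IDF(inputCol="counts", outputCol="tfidf"),   # drop this stage for raw counts
    PCA(k=2, inputCol="tfidf", outputCol="pca"),
])
projected = pipeline.fit(labeled).transform(labeled)

rows = projected.select("pca", "venue").collect()
xy = np.array([r.pca.toArray() for r in rows])
colors = ["tab:blue" if r.venue == "NIPS" else "tab:orange" for r in rows]
plt.scatter(xy[:, 0], xy[:, 1], s=5, c=colors)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```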