1. In this problem you will implement transfer learning (TL) based on importance weighting, and compare it with supervised learning (SL). You will work with data from the TL_data folder. There are 5 files: they contain source training and test data, labeled and unlabeled target training data, and target test data. The target data has a covariate shift with respect to the source data. In all items below, you should use the AdaBoostClassifier from sklearn with default parameters.
a) Let’s start by estimating the classifier’s performance on this data with a regular SL problem. Train a classifier on the source training data. Report its accuracy on the source test data.
For parts (b)-(d) below, you will apply standard SL techniques to the TL problem (three different approaches); a code sketch covering parts (a)-(d) follows part (d).
b) Use the classifier trained in item (a) to predict labels of the target test data. Report the accuracy.
c) Train a classifier only on the labeled target training data. Report its accuracy on the target test data.
d) Train a classifier on the union of source training and labeled target training data. Report its accuracy on the target test data.
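The following is a minimal sketch of parts (a)-(d), assuming the TL_data files load as arrays with features in every column but the last and the label in the last; the file names and the load_xy helper are hypothetical placeholders, so adjust them to the actual data format.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

def load_xy(path):
    # Hypothetical loader: assumes comma-separated files with the
    # label in the last column; adjust to the actual TL_data format.
    data = np.loadtxt(path, delimiter=',')
    return data[:, :-1], data[:, -1]

Xs_tr, ys_tr = load_xy('TL_data/source_train.csv')          # source training
Xs_te, ys_te = load_xy('TL_data/source_test.csv')           # source test
Xt_tr, yt_tr = load_xy('TL_data/target_train_labeled.csv')  # labeled target training
Xt_te, yt_te = load_xy('TL_data/target_test.csv')           # target test

# (a) SL baseline: train on source, test on source.
clf = AdaBoostClassifier()  # default parameters, per the problem statement
clf.fit(Xs_tr, ys_tr)
print('(a)', accuracy_score(ys_te, clf.predict(Xs_te)))

# (b) Same source-trained classifier applied to the target test data.
print('(b)', accuracy_score(yt_te, clf.predict(Xt_te)))

# (c) Train only on the labeled target training data.
clf_t = AdaBoostClassifier().fit(Xt_tr, yt_tr)
print('(c)', accuracy_score(yt_te, clf_t.predict(Xt_te)))

# (d) Train on the union of source and labeled target training data.
X_union = np.vstack([Xs_tr, Xt_tr])
y_union = np.concatenate([ys_tr, yt_tr])
clf_u = AdaBoostClassifier().fit(X_union, y_union)
print('(d)', accuracy_score(yt_te, clf_u.predict(Xt_te)))
```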
e) Compare the results of (a)-(d). Explain any differences (and any lack of differences) in accuracy.
For parts (f)-(g), you will use TL techniques on the TL problem.
f) Let’s assume source and target domain features follow multivariate normal distributions with different parameters. Estimate the mean and covariance matrix of each domain. Hint: you can use sklearn’s GaussianMixture class. You can proceed in two ways.
i. Estimate the two means and two covariance matrices simultaneously. This can be done by letting the Gaussian mixture model estimator know there are 2 components in the mixture and providing the entire training data (source + labeled target + unlabeled target) to it.
ii. Estimate each mean and covariance matrix individually. This can be done by training two separate Gaussian mixture model estimators, each with one component density. One estimator will receive only the source training data and the other estimator will receive all the target training data.
Which method is likely to yield better results, i.e., means and covariance matrices closer to the true values? Justify your answer.
Provide the values for the mean and covariance matrix of each domain.
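A minimal sketch of both approaches, reusing the arrays loaded in the sketch after part (d); the unlabeled-target file name is again a hypothetical placeholder. Both uses of sklearn's GaussianMixture are standard.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Unlabeled target features; file name and format are placeholders.
Xt_unl = np.loadtxt('TL_data/target_train_unlabeled.csv', delimiter=',')

# Approach (i): one 2-component mixture fit on all training features.
X_all = np.vstack([Xs_tr, Xt_tr, Xt_unl])
gmm_both = GaussianMixture(n_components=2).fit(X_all)
# gmm_both.means_ and gmm_both.covariances_ hold both estimates, but
# which component corresponds to which domain must then be inferred.

# Approach (ii): one single-component estimator per domain.
gmm_src = GaussianMixture(n_components=1).fit(Xs_tr)
gmm_tgt = GaussianMixture(n_components=1).fit(np.vstack([Xt_tr, Xt_unl]))
print('source mean:', gmm_src.means_[0])
print('source cov:\n', gmm_src.covariances_[0])
print('target mean:', gmm_tgt.means_[0])
print('target cov:\n', gmm_tgt.covariances_[0])
```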
p. 1 of 5
g) Now that you have the parameters of each domain, you can compute the weight of each data sample as $w_i = p_T(x_i)/p_S(x_i)$, where $p_T$ and $p_S$ are the target and source densities. Train a classifier on the union of source training and labeled target data using these weights (a sketch follows). Report the accuracy on the target test data.
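A sketch of the weighting step, assuming the fitted single-domain estimators from the previous sketch; scipy.stats.multivariate_normal evaluates the two Gaussian densities, and AdaBoostClassifier accepts the weights through fit's sample_weight argument.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Gaussian densities for each domain, built from the estimated parameters.
p_src = multivariate_normal(mean=gmm_src.means_[0], cov=gmm_src.covariances_[0])
p_tgt = multivariate_normal(mean=gmm_tgt.means_[0], cov=gmm_tgt.covariances_[0])

# Importance weight w_i = p_target(x_i) / p_source(x_i) for every training sample.
X_union = np.vstack([Xs_tr, Xt_tr])
y_union = np.concatenate([ys_tr, yt_tr])
w = p_tgt.pdf(X_union) / p_src.pdf(X_union)

clf_w = AdaBoostClassifier()
clf_w.fit(X_union, y_union, sample_weight=w)
print('(g)', accuracy_score(yt_te, clf_w.predict(Xt_te)))
```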
h) Compare results of (g) with results of (b)-(d). Explain any differences and any lack of difference.
2. EM for semi-supervised learning. Consider a 2-class semi-supervised learning problem in which there are $l$ labeled samples and $u = 1$ unlabeled sample. (For example, think of being given $l$ labeled samples, and then acquiring unlabeled samples one at a time.) There is 1 feature, and each class is modeled as a Gaussian:
$$p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \sigma_c^2), \quad c = 1, 2$$
In the parts below, you will use EM to estimate the means $\mu_1$ and $\mu_2$. You may assume the priors and variances are given constants. Generally, the subscripts $h$ and $i$ will indicate unlabeled and labeled samples, respectively.
In this problem, parts (a)-(d) are to be done by hand. Part (e) can be done by hand or computer; part (f) is best done by computer.
a) Consider the $t$-th iteration of EM. Derive the E step in terms of given quantities: that is, starting from

$$p(H \mid D, \theta^{(t)}) = p(y_h = c_h \mid x_h, \theta^{(t)}) = \gamma_{hc_h}^{(t)}, \quad c_h = 1, 2$$

show that:

$$\gamma_{hc_h}^{(t)} = \frac{\pi_{c_h}}{\alpha_h^{(t)} \sqrt{2\pi\sigma_{c_h}^2}} \exp\left[-\frac{\left(x_h - \mu_{c_h}^{(t)}\right)^2}{2\sigma_{c_h}^2}\right]$$

in which $\pi_{c_h} \triangleq p(y_h = c_h \mid \theta^{(t)}) = p(y_h = c_h)$. Also, find $\alpha_h^{(t)}$.
h
In parts (b)-(d), you will derive the M step formulas, also for the $t$-th iteration of EM.

b) First, show that
$$p(D, H \mid \theta) = p(x_h \mid y_h = c_h, \theta)\,\pi_{c_h} \prod_{i=1}^{l} p(x_i \mid y_i = c_i, \theta)\,\pi_{c_i}$$

in which $\pi_{c_i} = p(y_i = c_i \mid \theta) = p(y_i = c_i)$, and similarly for $\pi_{c_h}$.
c) Take $\ln p(D, H \mid \theta)$ from your result of (b), plug in for the normal densities, and drop any additive terms that are constants of $\theta$. Then plug in to the M equation:

$$\theta^{(t+1)} = \arg\max_{\theta} E_{H \mid D, \theta^{(t)}}\left\{\ln p(D, H \mid \theta)\right\} = \arg\max_{\theta} \sum_{c_h=1}^{2} \gamma_{hc_h}^{(t)} \ln p(D, H \mid \theta)$$

and simplify to get:
$$\theta^{(t+1)} = \arg\max_{\theta} \left[\, \sum_{c_h=1}^{2} \gamma_{hc_h}^{(t)} \left(-\frac{(x_h - \mu_{c_h})^2}{\sigma_{c_h}^2}\right) + \sum_{i=1}^{l} \left(-\frac{(x_i - \mu_{c_i})^2}{\sigma_{c_i}^2}\right) \right]$$

in which a constant multiplicative factor of $\frac{1}{2}$ has been dropped. (Hint: you may find it useful to use $\gamma_{h1}^{(t)} + \gamma_{h2}^{(t)} = 1$.)
d) Re-write your result of part (c) to express it in terms of $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$. (Hint: you might find it useful to use the indicator function.) Then solve for $\theta^{(t+1)} = \left[\mu_1^{(t+1)}, \mu_2^{(t+1)}\right]^T$. (Hint: find the argmax by taking $\frac{\partial}{\partial \mu_1}$ and setting it equal to 0; similarly for $\mu_2$.) Let $l_1$ = number of labeled samples with label $c_i = 1$, and $l_2$ = number of labeled samples with label $c_i = 2$. (Note that $\gamma_{hc_h}^{(t)}$ is constant with respect to $\mu_1$ and $\mu_2$ because it uses the (constant) estimates $\mu_1^{(t)}$ and $\mu_2^{(t)}$ from the E step.)
e) Given: $\pi_1 = \pi_2 = 0.5$, $\sigma_1^2 = \sigma_2^2 = 1$; data as follows: labeled data $\{(x_i, y_i)\}_{i=1}^{l} = \{(1,1),\,(2,1),\,(4,2)\}$; unlabeled sample $x_h = 3$.
Suppose the values for $\theta$ at the beginning of the $t$-th iteration of EM are: $\mu_1^{(t)} = 1.5$, $\mu_2^{(t)} = 4.0$.

(i) Calculate the responsibilities $\gamma_{h1}^{(t)}$ and $\gamma_{h2}^{(t)}$ from the E step (using part (a));
(ii) Calculate the new mean estimates $\mu_1^{(t+1)}$ and $\mu_2^{(t+1)}$ from the M step (using your part (d) result).
Tip: While not required for part (e), you may find it useful to do the calculations by computer, so that your code can be used for part (f) also.
f) Run more iterations (by computer), until $\mu_1^{(t+1)}$ and $\mu_2^{(t+1)}$ converge (until they change only a small amount from one iteration to the next; choose a suitable threshold). Plot $\mu_1^{(t+1)}$ and $\mu_2^{(t+1)}$ vs. $t$, as well as $\gamma_{h1}^{(t)}$ and $\gamma_{h2}^{(t)}$ vs. $t$. (You are not required to compute $p(D \mid \theta^{(t)})$ in this problem.) Give your final values for $\mu_1^{(t+1)}$, $\mu_2^{(t+1)}$, $\gamma_{h1}^{(t)}$, and $\gamma_{h2}^{(t)}$.
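Parts (e)-(f) invite computer calculation, so here is a minimal EM sketch under the given setup; the M-step update used below is the responsibility-weighted mean, which is what the part (d) derivation should produce, and the variable names and the 1e-6 convergence threshold are choices made here, not given.

```python
import numpy as np
import matplotlib.pyplot as plt

# Given data and constants from part (e).
x_l = np.array([1.0, 2.0, 4.0])   # labeled features
y_l = np.array([1, 1, 2])         # labels
x_h = 3.0                         # unlabeled sample
pi = np.array([0.5, 0.5])         # class priors (given)
var = np.array([1.0, 1.0])        # class variances (given)

mu = np.array([1.5, 4.0])         # initial means, per part (e)
hist_mu, hist_gamma = [mu.copy()], []

for t in range(200):
    # E step: responsibilities of the unlabeled sample (part (a) formula);
    # the normalization plays the role of alpha_h.
    lik = pi * np.exp(-(x_h - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = lik / lik.sum()
    hist_gamma.append(gamma.copy())

    # M step: responsibility-weighted means (part (d) result).
    new_mu = np.array([
        (gamma[c] * x_h + x_l[y_l == c + 1].sum()) / (gamma[c] + (y_l == c + 1).sum())
        for c in range(2)
    ])
    hist_mu.append(new_mu.copy())
    converged = np.abs(new_mu - mu).max() < 1e-6  # chosen threshold
    mu = new_mu
    if converged:
        break

print('final means:', mu, 'final responsibilities:', gamma)

# Plots for part (f): means vs. t and responsibilities vs. t.
hist_mu, hist_gamma = np.array(hist_mu), np.array(hist_gamma)
plt.plot(hist_mu[:, 0], label='mu1'); plt.plot(hist_mu[:, 1], label='mu2')
plt.plot(hist_gamma[:, 0], label='gamma_h1'); plt.plot(hist_gamma[:, 1], label='gamma_h2')
plt.xlabel('t'); plt.legend(); plt.show()
```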
3. In this problem you will explore semi-supervised learning using S3VM, and compare it to supervised learning. Throughout this problem, use the qns3vm code available on the course page under Week 12, with parameters kernel_type='Linear' and lam=1.0 (cf. Discussion 12 for more information).
Note: if you get a “PendingDeprecationWarning” that halts the execution of the code, add the line:
“warnings.filterwarnings('ignore', category=PendingDeprecationWarning)”
at the start of your code. The SVM parameters should also be set to kernel=’linear’ and C=1.0.
Use the data inside the SSL_data folder. Load the data files named ssl_train_data and test_data. In each of them, the first 10 columns are the features, i.e., $X_{train}$ and $X_{test}$, and the last column represents the true label, i.e., $y_{train}$ and $y_{test}$. There is a total of 200 training samples, and the classes are $y \in \{0, 1\}$. Note that the qns3vm code expects classes $\{-1, 1\}$, so adjust accordingly.
(a) To get an estimate of the best-case scenario, let’s start with a dataset that is entirely labeled. Train an SVM classifier on the entire train data and compute its accuracy on the test data.
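A minimal sketch for part (a); the loader assumes the files parse as comma-separated text with the label in the last column, which is an assumption to adjust to the actual SSL_data format.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hypothetical loader: adjust paths/parsing to the actual file format.
train = np.loadtxt('SSL_data/ssl_train_data.csv', delimiter=',')
test = np.loadtxt('SSL_data/test_data.csv', delimiter=',')
X_train, y_train = train[:, :10], train[:, 10]
X_test, y_test = test[:, :10], test[:, 10]

# (a) Fully labeled baseline: linear SVM on all 200 training samples.
svm = SVC(kernel='linear', C=1.0).fit(X_train, y_train)
print('(a)', accuracy_score(y_test, svm.predict(X_test)))
```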
(b) Now let's assume a scenario where only a few samples of the training data are labeled. Select only the first $2n$ samples of the training set (you can note that the dataset was built in a way that the first $2n$ samples always contain samples of each class), train an SVM classifier only on those first $2n$ samples, and report the accuracy on the test data for $n = 1, \dots, 10$. Note that the test set does not change size.
(c) Next, let's repeat the scenario from (b), but make use of the unlabeled data. Train an S3VM model on the entire training data ($2n$ labeled samples and $200 - 2n$ unlabeled samples), and report the accuracy on the test data, for $n = 1, \dots, 10$.
(d) Plot your results of (b) and (c) on a single plot, showing accuracy (percent correct classification on the test set) vs. $l = 2n$, the number of labeled samples.
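Continuing from the previous sketch, a possible implementation of parts (b)-(d); the QN_S3VM interface below (constructor taking labeled features, labels, unlabeled features, and a random generator, plus train() and getPredictions()) is assumed from the course's qns3vm code, so check it against Discussion 12.

```python
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from qns3vm import QN_S3VM  # course-provided code; interface assumed

# Map labels {0, 1} -> {-1, 1}, as the qns3vm code expects.
y_train_pm = np.where(y_train == 0, -1, 1)
y_test_pm = np.where(y_test == 0, -1, 1)

acc_svm, acc_s3vm = [], []
for n in range(1, 11):
    m = 2 * n  # number of labeled samples

    # (b) SVM trained on the first 2n labeled samples only.
    svm = SVC(kernel='linear', C=1.0).fit(X_train[:m], y_train_pm[:m])
    acc_svm.append(np.mean(svm.predict(X_test) == y_test_pm))

    # (c) S3VM on 2n labeled + (200 - 2n) unlabeled samples.
    # Assumed constructor signature: labeled X, labels, unlabeled X,
    # a random generator, then the keyword parameters from the problem.
    model = QN_S3VM(X_train[:m].tolist(), y_train_pm[:m].tolist(),
                    X_train[m:].tolist(), random.Random(),
                    lam=1.0, kernel_type='Linear')
    model.train()
    preds = np.array(model.getPredictions(X_test.tolist()))
    acc_s3vm.append(np.mean(preds == y_test_pm))

# (d) Accuracy vs. number of labeled samples l = 2n.
l_vals = 2 * np.arange(1, 11)
plt.plot(l_vals, acc_svm, marker='o', label='SVM (labeled only)')
plt.plot(l_vals, acc_s3vm, marker='s', label='S3VM (labeled + unlabeled)')
plt.xlabel('l = 2n labeled samples'); plt.ylabel('test accuracy')
plt.legend(); plt.show()
```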
(e) Interpret your result of (d).
a. In what ways, if any, are they what you expected? Explain why you expected them to be so.
b. In what ways, if any, are they different from what you expected? Explain what you expected that is different, and hypothesize why the difference arises.