$23.99
1 Introduction
d
We have n-many, d-dimensional vectors, {x1 , x2 , . . . , xn }, where ∀i, xi ∈ R , and d is very large. We wish to find n-many, k-dimensional vectors,
{y1 , y2 , . . . , yn } where ∀i, yi ∈ Rk , and k d is comparatively smaller than d. Theorem 2.11 of the text as the following form –
c 2
Theorem 1.1 (Johnson-Lindenstrauss Lemma) For any ∈ (0, 1), and any integer n, let k ≥ 3 ln n, with c as defined in Theorem 2.9. For any set of n- many vectors {x1 , x2 , . . . , xn } where xi ∈ Rd , the random projection f : Rd →
k
R defined in section 2.7 of the text has the property that for all pairs xi and
2n
xj , with probability at least 1 − 3 ,
√ √
(1 − )
kkxi − xj k ≤ kf (xi ) − f (xj )k ≤ (1 + )
kkxi − xj k
(Note, I have replaced the text’s | • | with my k • k to denote length of the vector-argument).
2 Discussion
For this programming assignment we will take a slightly different form/version of the JL-Lemma. As before, we would like a mapping f : Rd → Rk (where k d) such that the distance between any two d-dimensional points x1 , x2 ∈ Rd is more-or-less preserved when we look at f (x1 ), f (x2 ) ∈ Rk . We are going to rewrite the requirement of theorem 1.1 as equation 1 shown below, where for some (small) ∈ R, and ∀x1 , x2 ∈ Rd , we want
(1 − )kx1 − x2 k ≤ kf (x1 ) − f (x2 )k ≤ (1 + )kx1 − x2 k (1) As suggested by the text, you make k-many calls to a routine that generates
d-dimensional Gaussian RVs with zero-mean and an identity-covariance matrix. Let us suppose this process gets us the vectors {u1 , u2 , . . . , uk }, where each
<
ui ∈ d . You re-scale each ui ∈ <d
according to
r d ui
ui ←
k × ku1 k
This is needed to preserve the expected-norm (Why?1 ). You stack the k-many,
k
members of {uT }
on top of each other to get the (k × d) matrix A. For each
i i=1
xi ∈ <n ,
you define f (xi ) = Axi (∈ Rk ).
There is a version of Johnson-Lindenstrauss Lemma that says that if
1
k = O
2
n
ln( ) , δ
then ∀x1 , x2 ∈ Rn equation 1 will be satisfied with probability at least (1 − δ). I leave it as an exercise to you to show that theorem 1.1 and the above version are theoretically similar. As noted in class, the neat thing about the JL-Lemma is that this bound on k does not depend on d (the dimension we want to reduce/scale-down). There are ways of gaining small efficiencies when it comes to producing the random map f (•) alluded to above. But that is for you to discover after you have done some experimentation.
2
If you spend enough time experimenting with this, you will see that while the (ln n)-term is nice, the 1 -term can be a killer. For example, if = 0.01 (i.e.
1% error), we will need k ≈ ln n × 104 . If this is to have any practical-value/use we should have a d ln n × 104 . If = 0.1 (i.e. 10% error), we will need k ≈ ln n × 100. On the flip-side, keep in mind that the error-bounds in the formal theorem is an upper bound (i.e. it is the worst-case) – you may find that the JL-Lemma in practice might not be as bad as the worst-case. It is for you to experiment and find out – as a part of this programming exercise.
3 Data for Verification
i=1
The Compass site should contain a data file JL Data1 that contains data for a (10000 × 1000) matrix that represents the set {xi }1000 , where each xi ∈ R10,000 . We want to represent each xi ∈ R10,000 with a yi ∈ Rk , where k < 10, 000. As you can gather, n = 1, 000 for this data set. The plan is to have you try a bunch of different values for k (say, k ∈ {200, 300, 400, 500}) and see if the bounds in the statement of the JL-Lemma holds for this data-set.
Thrombin Data
This is not necessary for the programming assignment. I am just providing this to you to work with a bigger dataset. I know practically nothing about drug- discovery. All I know is that there is a small role for the JL-Lemma in reducing the dimension for a standard search that is part of the process. My superficial understanding is that every time a new drug (i.e. chemical compound) is to be considered, you have to check if there are others in your list that have a similar “performance” – this involves checking the nearest neighbor of the new compound in the list. I am not even sure about all this. That said, we have a “real-world” data set that we can for testing the JL-Lemma.
1 A very careful reading of my notes, the lectures, and the text, should tell you why!
You can try the training-data for Prediction of Molecular Bioactivity for Drug Design at this link. Each row of this CSV file represents a chemical- compound (i.e. a component of a drug). The first-item in each row is either an “A” or an “I” (representing Active /Inactive ). This is followed by 139,531 many comma-seperated 0/1 values representing location on the Thrombin molecule (i.e. d = 139, 531). There are 1909-many chemical compounds (i.e. n = 1909). The Compass website contains some Python-Code I wrote to process this data set to output a file that can be read the code you would have written for this programming assignment. You can explore if each 139,531-dimensional binary vector can be represented by a shorter k-dimensional vectors that meet all the requirements of the JL-Lemma. Do not be surprised if you see some weird/in- teresting results ,,.
4 Requirements
What I need from you:
1. *.cpp and *.h that verifies the JL-Lemma by taking a bunch of values at the command-line.
(a) First Command Line Variable - d, (b) Second Command Line Variable - k, (c) Third Command Line Variable - n, (d) Fourth Command Line Variable - , (e) Fifth Command Line Variable - δ,
(f ) Sixth Command Line Variable - input-filename (that contains the relevant data), and
(g) Seventh Command Line Variable - number of trials, which is the num- ber of statistical-trials where the JL-Lemma is experimentally/sta- tistically verified.
I have provided a hint on Compass for anyone that needs it. To verify the JL- Lemma, you will implement what is shown below in pseudo-code.
1: no of hits = 0;
2: for int i = 1; i ≤ no of trials; i++ do
3: pick a pair of vectors x1 , x2 ∈ Rd ;
4: compute y1 = Ax1 and y2 = Ax2 ;
5: if (1 − )kx1 − x2 k ≤ ky1 − y2 k ≤ (1 + )kx1 − x2 k then
6: no of hits++;
7: end if
9: Check if
8: end for
no of hits
no of trials
≥ (1 − δ)
I am looking for an output that is along the lines of what is shown in figures 1
and 2. I want you to try a bunch of values for k, and see if the specifications
of the JL-Lemma is verified for all of them. A single page statement of your experiments will suffice.
Figure 1: A sample output.
2: A sample output on Thrombin-Data (run on my iMac).
5