Assignment #2: Experimental Verification of the Johnson-Lindenstrauss LemmA Solution

Starting from:

~~$29.99~~

$23.99

Home

1    Introduction

d

We have n-many,   d-dimensional   vectors,   {x1 , x2 , . . . , xn }, where ∀i, xi   ∈ R , and d is very large.    We wish to find n-many,   k-dimensional   vectors,

{y1 , y2 , . . . , yn } where ∀i, yi ∈ Rk , and k       d is comparatively smaller than d. Theorem 2.11 of the text as the following form –

c 2

Theorem 1.1 (Johnson-Lindenstrauss Lemma) For any     ∈ (0, 1), and any integer n, let k ≥ 3   ln n, with c as defined in Theorem 2.9. For any set of n- many vectors {x1 , x2 , . . . , xn } where xi ∈ Rd , the random projection f : Rd →

k

R defined in section 2.7 of the text has the property that for all pairs xi and

2n

xj , with probability at least 1 − 3 ,

√                                                                             √

(1 − )

kkxi − xj k ≤ kf (xi ) − f (xj )k ≤ (1 + )

kkxi − xj k

(Note, I have replaced the text’s | • | with my k • k to denote length of the vector-argument).

2    Discussion

For this programming assignment we will take a slightly different form/version of the JL-Lemma. As before, we would like a mapping f : Rd → Rk (where k       d) such that the distance between any two d-dimensional points x1 , x2 ∈ Rd is more-or-less preserved when we look at f (x1 ), f (x2 ) ∈ Rk . We are going to rewrite the requirement of theorem 1.1 as equation 1 shown below, where for some (small)    ∈ R, and ∀x1 , x2 ∈ Rd , we want

(1 − )kx1 − x2 k ≤ kf (x1 ) − f (x2 )k ≤ (1 + )kx1 − x2 k                 (1) As suggested by the text, you make k-many calls to a routine that generates

d-dimensional Gaussian RVs with zero-mean and an identity-covariance matrix. Let us suppose this process gets us the vectors {u1 , u2 , . . . , uk }, where each

<

ui ∈    d . You re-scale each ui ∈ <d

according to

r d       ui

ui ←

k × ku1 k

This is needed to preserve the expected-norm (Why?1 ). You stack the k-many,

k

members of {uT }

on top of each other to get the (k × d) matrix A. For each

i   i=1

xi ∈ <n ,

you define f (xi ) = Axi (∈ Rk ).

There is a version of Johnson-Lindenstrauss Lemma that says that if

1

k = O

2

n

ln(   )   , δ

then ∀x1 , x2   ∈ Rn equation 1 will be satisfied with probability at least (1 − δ).   I leave it as an exercise to you to show that theorem 1.1 and the above version are theoretically similar.   As noted in class, the neat thing about the JL-Lemma is that this bound on k does not depend on d (the dimension we want to reduce/scale-down). There are ways of gaining small efficiencies when it comes to producing the random map f (•) alluded to above. But that is for you to discover after you have done some experimentation.

2

If you spend enough time experimenting with this, you will see that while the (ln n)-term is nice, the   1 -term can be a killer. For example, if   = 0.01 (i.e.

1% error), we will need k ≈ ln n × 104 . If this is to have any practical-value/use we should have a d       ln n × 104 .   If    = 0.1 (i.e.   10% error), we will need k ≈ ln n × 100.   On the flip-side, keep in mind that the error-bounds in the formal theorem is an upper bound (i.e. it is the worst-case) – you may find that the JL-Lemma in practice might not be as bad as the worst-case. It is for you to experiment and find out – as a part of this programming exercise.

3    Data for Verification

i=1

The Compass site should contain a data file JL Data1 that contains data for a (10000 × 1000) matrix that represents the set {xi }1000 , where each xi ∈ R10,000 . We want to represent each xi ∈ R10,000 with a yi ∈ Rk , where k < 10, 000. As you can gather, n = 1, 000 for this data set. The plan is to have you try a bunch of different values for k (say, k ∈ {200, 300, 400, 500}) and see if the bounds in the statement of the JL-Lemma holds for this data-set.

Thrombin Data

This is not necessary for the programming assignment. I am just providing this to you to work with a bigger dataset. I know practically nothing about drug- discovery. All I know is that there is a small role for the JL-Lemma in reducing the dimension for a standard search that is part of the process. My superficial understanding is that every time a new drug (i.e.   chemical compound) is to be considered, you have to check if there are others in your list that have a similar “performance” – this involves checking the nearest neighbor of the new compound in the list. I am not even sure about all this. That said, we have a “real-world” data set that we can for testing the JL-Lemma.

1 A very careful reading of my notes, the lectures, and the text, should tell you why!

You can try the training-data for Prediction of Molecular Bioactivity for Drug Design at this link.   Each row of this CSV file represents a chemical- compound (i.e. a component of a drug). The first-item in each row is either an “A” or an “I” (representing Active /Inactive ). This is followed by 139,531 many comma-seperated 0/1 values representing location on the Thrombin molecule (i.e. d = 139, 531). There are 1909-many chemical compounds (i.e. n = 1909). The Compass website contains some Python-Code I wrote to process this data set to output a file that can be read the code you would have written for this programming assignment. You can explore if each 139,531-dimensional binary vector can be represented by a shorter k-dimensional vectors that meet all the requirements of the JL-Lemma. Do not be surprised if you see some weird/in- teresting results ,,.

4    Requirements

What I need from you:

1. *.cpp and *.h that verifies the JL-Lemma by taking a bunch of values at the command-line.

(a) First Command Line Variable - d, (b) Second Command Line Variable - k, (c) Third Command Line Variable - n, (d) Fourth Command Line Variable - , (e) Fifth Command Line Variable - δ,

(f ) Sixth Command   Line Variable - input-filename (that contains the relevant data), and

(g) Seventh Command Line Variable - number of trials, which is the num- ber of statistical-trials where the JL-Lemma is experimentally/sta- tistically verified.

I have provided a hint on Compass for anyone that needs it. To verify the JL- Lemma, you will implement what is shown below in pseudo-code.

1: no of hits = 0;

2: for int i = 1; i ≤ no of trials; i++ do

3:       pick a pair of vectors x1 , x2 ∈ Rd ;

4:       compute y1 = Ax1 and y2 = Ax2 ;

5:       if   (1 − )kx1 − x2 k ≤ ky1 − y2 k ≤ (1 + )kx1 − x2 k then

6:            no of hits++;

7:       end if

9: Check if

8: end for

no of hits

no of trials

≥ (1 − δ)

I am looking for an output that is along the lines of what is shown in figures 1

and 2. I want you to try a bunch of values for k, and see if the specifications

of the JL-Lemma is verified for all of them.   A single page statement of your experiments will suffice.

Figure 1: A sample output.

2: A sample output on Thrombin-Data (run on my iMac).

5