Assignment #2: Experimental Verification of the Johnson-Lindenstrauss Lemma

1    Introduction

 



We have $n$-many, $d$-dimensional vectors, $\{x_1, x_2, \ldots, x_n\}$, where $\forall i,\ x_i \in \mathbb{R}^d$, and $d$ is very large. We wish to find $n$-many, $k$-dimensional vectors $\{y_1, y_2, \ldots, y_n\}$, where $\forall i,\ y_i \in \mathbb{R}^k$, and $k \ll d$ is comparatively smaller than $d$. Theorem 2.11 of the text has the following form:

 



Theorem 1.1 (Johnson-Lindenstrauss Lemma). For any $\epsilon \in (0, 1)$, and any integer $n$, let $k \ge \frac{3}{c\epsilon^2} \ln n$, with $c$ as defined in Theorem 2.9. For any set of $n$-many vectors $\{x_1, x_2, \ldots, x_n\}$ where $x_i \in \mathbb{R}^d$, the random projection $f : \mathbb{R}^d \to \mathbb{R}^k$ defined in Section 2.7 of the text has the property that for all pairs $x_i$ and $x_j$, with probability at least $1 - \frac{3}{2n}$,

$$(1 - \epsilon)\sqrt{k}\,\|x_i - x_j\| \le \|f(x_i) - f(x_j)\| \le (1 + \epsilon)\sqrt{k}\,\|x_i - x_j\|$$

 

(Note: I have replaced the text's $|\cdot|$ with my $\|\cdot\|$ to denote the length of the vector-argument.)

 

 

2    Discussion

 

For this programming assignment we will take a slightly different form/version of the JL-Lemma. As before, we would like a mapping $f : \mathbb{R}^d \to \mathbb{R}^k$ (where $k \ll d$) such that the distance between any two $d$-dimensional points $x_1, x_2 \in \mathbb{R}^d$ is more-or-less preserved when we look at $f(x_1), f(x_2) \in \mathbb{R}^k$. We are going to rewrite the requirement of Theorem 1.1 as Equation 1 shown below, where for some (small) $\epsilon \in \mathbb{R}$, and $\forall x_1, x_2 \in \mathbb{R}^d$, we want

$$(1 - \epsilon)\|x_1 - x_2\| \le \|f(x_1) - f(x_2)\| \le (1 + \epsilon)\|x_1 - x_2\| \qquad (1)$$

As suggested by the text, you make $k$-many calls to a routine that generates $d$-dimensional Gaussian RVs with zero mean and an identity covariance matrix. Let us suppose this process gets us the vectors $\{u_1, u_2, \ldots, u_k\}$, where each $u_i \in \mathbb{R}^d$. You re-scale each $u_i \in \mathbb{R}^d$ according to

$$u_i \leftarrow \sqrt{\frac{d}{k}} \times \frac{u_i}{\|u_i\|}$$

This is needed to preserve the expected norm (Why?$^1$). You stack the $k$-many members of $\{u_i^T\}_{i=1}^{k}$ on top of each other to get the $(k \times d)$ matrix $A$. For each $x_i \in \mathbb{R}^d$, you define $f(x_i) = Ax_i\ (\in \mathbb{R}^k)$.
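Below is a minimal sketch of this construction (my own illustration, not a required implementation): it draws the $k$-many Gaussian vectors with std::normal_distribution, re-scales each one as above, and stacks them into $A$. The row-of-vectors matrix representation and all names here are my choices.

#include <cmath>
#include <random>
#include <vector>

// Build the (k x d) matrix A: each row is a d-dimensional zero-mean,
// identity-covariance Gaussian vector u_i, re-scaled to
// sqrt(d/k) * u_i / ||u_i|| so that expected norms are preserved.
std::vector<std::vector<double>> build_projection_matrix(int k, int d,
                                                         std::mt19937 &gen) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<std::vector<double>> A(k, std::vector<double>(d));
    const double target = std::sqrt(static_cast<double>(d) / k);
    for (int i = 0; i < k; ++i) {
        double norm_sq = 0.0;
        for (int j = 0; j < d; ++j) {
            A[i][j] = gauss(gen);
            norm_sq += A[i][j] * A[i][j];
        }
        const double factor = target / std::sqrt(norm_sq);
        for (int j = 0; j < d; ++j)
            A[i][j] *= factor;          // u_i <- sqrt(d/k) * u_i / ||u_i||
    }
    return A;
}

// f(x) = A x, mapping a point in R^d down to R^k.
std::vector<double> project(const std::vector<std::vector<double>> &A,
                            const std::vector<double> &x) {
    std::vector<double> y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}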

There is a version of the Johnson-Lindenstrauss Lemma that says that if

$$k = O\!\left(\frac{1}{\epsilon^2} \ln\left(\frac{n}{\delta}\right)\right)$$

then $\forall x_1, x_2 \in \mathbb{R}^d$, Equation 1 will be satisfied with probability at least $(1 - \delta)$. I leave it as an exercise for you to show that Theorem 1.1 and the above version are theoretically similar. As noted in class, the neat thing about the JL-Lemma is that this bound on $k$ does not depend on $d$ (the dimension we want to reduce/scale-down). There are ways of gaining small efficiencies when it comes to producing the random map $f(\cdot)$ alluded to above. But that is for you to discover after you have done some experimentation.



If you spend enough time experimenting with this, you will see that while the $(\ln n)$-term is nice, the $\frac{1}{\epsilon^2}$-term can be a killer. For example, if $\epsilon = 0.01$ (i.e. 1% error), we will need $k \approx \ln n \times 10^4$. If this is to have any practical value/use we should have a $d \gg \ln n \times 10^4$. If $\epsilon = 0.1$ (i.e. 10% error), we will need $k \approx \ln n \times 100$. On the flip-side, keep in mind that the error-bounds in the formal theorem are upper bounds (i.e. they describe the worst-case): you may find that the JL-Lemma in practice is not as bad as the worst-case. It is for you to experiment and find out, as a part of this programming exercise.
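To make these numbers concrete (a worked example of my own that ignores the constant hidden inside the $O(\cdot)$ and the $\ln(1/\delta)$ contribution): for the verification data set of Section 3 we have $n = 1{,}000$, so $\ln n \approx 6.9$, and $\epsilon = 0.01$ asks for

$$k \approx \frac{\ln n}{\epsilon^2} = \frac{6.9}{(0.01)^2} \approx 69{,}000 \gg d = 10{,}000,$$

i.e. the "reduced" dimension would exceed the original one. With $\epsilon = 0.1$ the same arithmetic gives $k \approx 690$, a genuine reduction.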

 

 

3    Data for Verification

 



The Compass site should contain a data file JL Data1 that contains data for a $(10000 \times 1000)$ matrix that represents the set $\{x_i\}_{i=1}^{1000}$, where each $x_i \in \mathbb{R}^{10{,}000}$. We want to represent each $x_i \in \mathbb{R}^{10{,}000}$ with a $y_i \in \mathbb{R}^k$, where $k < 10{,}000$. As you can gather, $n = 1{,}000$ for this data set. The plan is to have you try a bunch of different values for $k$ (say, $k \in \{200, 300, 400, 500\}$) and see if the bounds in the statement of the JL-Lemma hold for this data set.

 

Thrombin Data

 

This is not necessary for the programming assignment; I am just providing it so you can work with a bigger dataset. I know practically nothing about drug-discovery. All I know is that there is a small role for the JL-Lemma in reducing the dimension for a standard search that is part of the process. My superficial understanding is that every time a new drug (i.e. chemical compound) is to be considered, you have to check if there are others in your list that have a similar “performance”; this involves checking the nearest neighbor of the new compound in the list. I am not even sure about all this. That said, we have a “real-world” data set that we can use for testing the JL-Lemma.

 

1 A very careful reading of my notes, the lectures, and the text should tell you why!

 

 

 

 

 

You can try the training-data for Prediction of Molecular Bioactivity for Drug Design at this link. Each row of this CSV file represents a chemical-compound (i.e. a component of a drug). The first item in each row is either an “A” or an “I” (representing Active/Inactive). This is followed by 139,531-many comma-separated 0/1 values representing locations on the Thrombin molecule (i.e. $d = 139{,}531$). There are 1909-many chemical compounds (i.e. $n = 1909$). The Compass website contains some Python code I wrote to process this data set and output a file that can be read by the code you will have written for this programming assignment. You can explore whether each 139,531-dimensional binary vector can be represented by a shorter $k$-dimensional vector that meets all the requirements of the JL-Lemma. Do not be surprised if you see some weird/interesting results!

 

 

4    Requirements

 

What  I need from you:

 

1.  *.cpp and *.h files that verify the JL-Lemma by taking a bunch of values at the command-line (see the argument-reading sketch after this list):

 

(a) First Command Line Variable - $d$,
(b) Second Command Line Variable - $k$,
(c) Third Command Line Variable - $n$,
(d) Fourth Command Line Variable - $\epsilon$,
(e) Fifth Command Line Variable - $\delta$,
(f) Sixth Command Line Variable - input-filename (that contains the relevant data), and
(g) Seventh Command Line Variable - number of trials, which is the number of statistical-trials where the JL-Lemma is experimentally/statistically verified.
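A minimal sketch of reading these seven values in the order listed above (the executable name jl_verify, the variable names, and the lack of input validation are my own choices, not part of the assignment's specification):

#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
    if (argc != 8) {
        std::cerr << "usage: " << argv[0]
                  << " d k n epsilon delta input-filename no-of-trials\n";
        return 1;
    }
    const int         d            = std::atoi(argv[1]);
    const int         k            = std::atoi(argv[2]);
    const int         n            = std::atoi(argv[3]);
    const double      epsilon      = std::atof(argv[4]);
    const double      delta        = std::atof(argv[5]);
    const std::string input_file   = argv[6];
    const int         no_of_trials = std::atoi(argv[7]);
    // ... read the n d-dimensional vectors from input_file, build A,
    //     and run the verification loop shown in the pseudocode below ...
    return 0;
}

A hypothetical invocation for the Section 3 data set would then look like ./jl_verify 10000 400 1000 0.1 0.05 <data-file> 500 (the $\epsilon$, $\delta$, and trial-count values here are illustrative only).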

 

I have provided  a hint on Compass  for anyone  that needs it.  To verify the  JL- Lemma, you will implement what is shown below in pseudo-code.

1:  no of hits = 0;
2:  for int i = 1; i ≤ no of trials; i++ do
3:      pick a pair of vectors $x_1, x_2 \in \mathbb{R}^d$;
4:      compute $y_1 = Ax_1$ and $y_2 = Ax_2$;
5:      if $(1 - \epsilon)\|x_1 - x_2\| \le \|y_1 - y_2\| \le (1 + \epsilon)\|x_1 - x_2\|$ then
6:          no of hits++;
7:      end if
8:  end for
9:  Check if $\frac{\text{no of hits}}{\text{no of trials}} \ge (1 - \delta)$
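A direct transcription of this pseudocode into C++ might look like the following sketch (it reuses the project(...) fragment shown in Section 2, assumes the data is held as $n$-many $d$-dimensional vectors, and picks the pair uniformly at random, which is one reasonable reading of line 3):

#include <cmath>
#include <random>
#include <vector>

// Defined in the Section 2 sketch: y = A x.
std::vector<double> project(const std::vector<std::vector<double>> &A,
                            const std::vector<double> &x);

double dist(const std::vector<double> &a, const std::vector<double> &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);                                  // ||a - b||
}

// Runs the trials and checks no_of_hits / no_of_trials >= (1 - delta).
bool verify_jl(const std::vector<std::vector<double>> &data,  // n vectors in R^d
               const std::vector<std::vector<double>> &A,     // (k x d) matrix
               double epsilon, double delta, int no_of_trials,
               std::mt19937 &gen) {
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    int no_of_hits = 0;
    for (int i = 1; i <= no_of_trials; ++i) {
        const std::vector<double> &x1 = data[pick(gen)];  // line 3: pick a pair
        const std::vector<double> &x2 = data[pick(gen)];
        const std::vector<double> y1 = project(A, x1);    // line 4: y1 = A x1
        const std::vector<double> y2 = project(A, x2);    //         y2 = A x2
        const double dx = dist(x1, x2), dy = dist(y1, y2);
        if ((1.0 - epsilon) * dx <= dy && dy <= (1.0 + epsilon) * dx)
            ++no_of_hits;                                 // lines 5-6
    }
    return static_cast<double>(no_of_hits) / no_of_trials >= 1.0 - delta;  // line 9
}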

I am looking for an output that is along the lines of what is shown in Figures 1 and 2. I want you to try a bunch of values for $k$, and see if the specifications of the JL-Lemma are verified for all of them. A single-page statement of your experiments will suffice.

 

 

 

Figure  1: A sample output.

 

Figure 2: A sample output on Thrombin-Data (run on my iMac).


 
