$24
Instructions: Submit a single PDF le containing your solutions, plots, and analyses. Make sure to thoroughly explain your process and results for each problem. Also include your documented code and a link to a public repository with your code (such as GitHub/GitLab). Make sure to list all collaborators and references.
Learning Conjunctions in the SQ Model: In this problem, you will see a simple example of a machine learning algorithm that works in the statistical query (SQ) model, and will evaluate how it performs under both local and centralized DP.
The set-up is that we are given a data set D = ((x1; y1); : : : ; (xn; yn)) where xi 2 f0; 1gd and yi 2 f0; 1g. We want a (local or centralized) di erentially private algorithm M(D) that outputs
^ f g ^
a subset S 1; : : : ; d of the variables such that the conjunction of the x-variables in S predicts the y variable well.
Speci cally, following Valiant's PAC model, we will measure utility as follows. Suppose that D consists of n iid draws from an (unknown) distribution P on f0; 1gd f0; 1g. Furthermore, suppose that there is an (unknown) set S f1; : : : ; dg such that
3
^
Pr 4y = x[j]5 = 1:
(x;y) P
j2S
That is, y can be perfectly predicted by some conjunction.1 (Note that here and below, the notation (x; y) refers to a single labelled example drawn from P, not a dataset of n such values. We will use D = ((x1; y1); : : : ; (xn; yn)) for the dataset.)
^
Then the goal of M is to output S that minimizes the expected classi cation error:
n ^
2
(x;y)
P
2
6
j^S^
33
D P ;S M(D)
4
4
55
:
E
Pr
y =
x[j]
2
(This is an instantiation of the \risk minimization" problem described in the April 8 lecture, though you will not need anything from that lecture for this homework problem.)
The SQ algorithm is based on estimating the following quantity for each variable j = 1; : : : ; d:
Pr [x[j] = 0 y = 1]:
pj = (x;y)
P
^
As shown in section, we have:
For each j 2 S, we have pj = 0.
1This is known as the \realizable" case of learning, as opposed to the \agnostic" case, where no conjunction classi es perfectly, but the goal is to nd one that does as well as possible.
1
ii
^
^
If S
S, then the false positive rate of S is 0.
iii
The false negative rate of S^ is at most Pj2S^ pj:
Based on these observations, an SQ algorithm M for learning conjunctions works as follows:
1
Using the dataset D, obtain an estimate p^j of pj for j = 1; : : : ; d.
2
^
Output S = fj 2 [d] : p^j tg for an appropriate (small) threshold t.
You should do the following:
Describe and implement centralized and local DP versions of the above SQ algorithm, dividing the privacy budget equally among each of the d estimates p^j. Keep the threshold t as a free parameter that you can choose.
For both the centralized and local DP algorithms you give in Part 1a, analytically estimate
^ j j
Pr[S + S] as a function of t, n, ", and S . Feel free to use a normal approximation for the binomial distribution, and to describe your estimate in terms of the CDF of a standard normal distribution. Use this as a guide to determine a setting of t to ensure that
^
Pr[S + S] :1 as a function of n, d, and " in each of the two settings, using the fact that jSj d.
An interest group targets a particular online political advertisement to users of a social media platform based on a conjunction of demographic characteristics. The users who are targeted with the advertisement are recorded. You get di erentially private access (at = 1) to the user data and attempt to learn what demographic combination was being focused on.
In the dataset CaPUMS5Full.csv are a number of potential demographic characteristics that might have been used for targeting: (sex, married, black, asian, collegedegree, employed, militaryservice, uscitizen, disability, englishability). The vari-able targeted records those individuals who received the ad. Use your DP algorithms to learn a classi er to predict targeted=1 (which we have arti cially created) from these vari-ables. Demonstrate whether your centralized and local algorithms would succeed. Also, use a bootstrap to compare the false positive rates and false negative rates of the two methods as a function of n.
If useful, also included in this dataset is the variable blackfemale that you can use to test if your implementation correctly picks up black and female as the constituent parts. There is also a test dataset, hw4testdata.csv, that contains x1 x10, and y=x1^x2^x3.
2