HW 2: Differential Privacy Foundations Solution

Starting from:

~~$30~~

$24

Home

Instructions: Submit a single PDF le containing your solutions, plots, analyses, and documented code. Also include a link to a public repository with your code (such as GitHub/GitLab). Make sure to list all collaborators and references.

1. Mechanisms: Consider the
following mechanisms M that takes a dataset x
2
[0; 1]n and returns

n

an estimate of the mean x = (P
i=1 xi)=n.

i M(x) = [x + Z]10, for Z Lap(2=n), where for real numbers y and a b, [y]ba denotes the \clamping" function:

8

a if y < a

<

[y]ba = y if a y b :

:b if y b

M(x) = x + [Z]1 1, for Z Lap(2=n).

(

1 w.p. x

M(x) = :

0 w.p. 1 x:

iv M(x) = Y where Y has probability density function fY given as follows:

fY (y) =
8 01 e njy xj=10dy
if y 2 [0; 1]:

<
e njy xj=10

R
2

:

0
if y = [0; 1]:

(This is an instantiation of a continuous version of the exponential mechanism.)

Which of the above mechanisms meet the de nition of ( ; 0)-di erential privacy for a nite value of , and what is the smallest value of (possibly as a function of n) for which they do?

For those that do not meet the de nition, calculate the smallest value of (again possibly as a function of n) for which they satisfy ( ; ) di erential privacy for a nite value of .

Describe how you would modify the algorithms to have tunable privacy parameters (and in case of mechanisms that require it) and tunable data domain [a; b] (rather than [0; 1]).

Which of these algorithms do you consider to be \best" for releasing a mean and why? (There is not a single \right" answer for this problem.)

1

Evaluating DP Algorithms with Synthetic Data: Consider a dataset x 2 Nn drawn from a Poisson process, which has probability distribution Pr[xi = k] = 10ke 10=k! for natural numbers k (where we consider k = 0 to be a natural number and de ne 0! = 1).

Write a data generating process (DGP) function that generates a dataset x 2 Nn according the above Poisson process.

Pick one of your di erentially private mechanisms from question 1 (generalized to allow for

arbitrary choices of and data range [a; b] as parameters) that releases an estimate of x. Implement this mechanism as a function which is given a vector of values x 2 [a; b]n and an and makes a di erentially private release. You can assume the sample size n is public knowledge. To apply your mechanism to unbounded data x 2 Rn, you will have to clamp x to a chosen range [a; b]. For simplicity, we will x a = 0 and only consider the e ect of varying b.

Recall the discussion on clamping from class; if the range is large, the sensitivity increases, so noise increases and utility drops. However, if you clip the values too aggressively the answer will be biased, and again utility will drop. For n = 200 and = :5, plot the root

mean squared error as a function of the upper bound b. Identify the approximate optimal value b of b for this data distribution.

Suppose we have an actual (not synthetic) dataset x 2 Nn for which we want to release a di erentially private mean, and we don't know the underlying distribution of x. Again,

we need to select the parameter b and want to do so in a way that minimizes the error. A natural idea is to use a (nonparametric) bootstrap1 to generate many datasets that are \similar" to x in place of the data-generating process above, and optimize the choice of b as above. Once we nd an optimal value b , we then do our di erentially private release on the dataset x itself.

Explain why this approach is not safe in general and may violate di erential privacy.

Propose some alternative methods for determining a good upper bound b for a given sen-sitive dataset x, while continuing to satisfying the de nition of di erential privacy.

Regression: Consider a dataset where each of its n rows is a pair of real numbers (xi; yi), where xi is drawn from a Poisson distribution as in Problem 2, and yi is then a noisy linear function of xi, speci cally:

yi = xi + + i;i N(0; 2)
(1)

for unknown parameters ; ; .

One simple way to estimate the parameters and without privacy is by a simple linear regression, using the following estimators:

^ =
Sxy
=

in=1(xi x)(yi y)

(2)

Sxx
P
in=1(xi

x)2

^ = y ^x;
P

(3)

(a) In class and section, we saw an algorithm and implementation to produce a di erentially

^ 2

private version of . Augment this implementation to also produce a di erentially private

In a nonparametric bootstrap, we generate new datasets by sampling with replacement from x itself.

See /examples/wk3 dp foundations/laplaceRegressionRelease.ipynb and laplaceRegressionRelease.r for the regression implementations shown in section, building on the code from class.

2
^

version of ^, so that the overall method for computing both ^ and is -DP, for an input parameter . If your algorithms use several basic di erentially private algorithms as subroutines, divide the overall privacy budget of among them evenly. You can clamp both the xi's and yi's using reasonable values of your choosing (perhaps following your choice from Problem 2 for x). Describe the reasoning behind your choice in your write up.

Evaluate the performance of your algorithm using a Monte Carlo simulation with synthetic data as in Problem 2 Parts (a){(c), for the parameters = = = = 1 and n = 1000. Measure utility by the mean-squared residuals:
1 n

yi ^xi ^
2
n i=1
:

X

(The non-private method for simple linear regression described above is chosen to minimize this quantity, hence the term \least-squares regression".) Plot and compare the distri-butions of mean-squared residuals you get with your di erentially private simple linear regression and with a non-private simple linear regression.

Now, run experiments to see if there is a di erent partition of your privacy budget in your algorithm that yields better utility. Use a grid search3 to explore di erent partitions of and see if you nd one that is convincingly better (in terms of mean-squared residuals) than an equal partition under the given data distribution. Show and explain your results.

DP vs. Reconstruction Attacks: Suppose M : f0; 1gn ! Y is an ( ; )-DP mechanism and
A : Y ! f0; 1gn is an adversary that is trying to reconstruct the sensitive bits in the dataset

x 2 f0; 1gn from the output M(x). Suppose the dataset is a random variable X = (X1; : : : ; Xn) consisting of n iid draws from a Bernoulli(p) distribution, for a known value of p. Prove that the expected fraction of bits that the adversary successfully reconstructs is not much larger than

the trivial bound of maxfp; 1 pg (which can be achieved by guessing the all-zeroes or all-ones dataset). Speci cally:

E [#fi 2 [n] : A(M(X))i = Xig=n] e maxfp; 1 pg + :

(Hint: write the quantity inside the expectation as an average of indicator random variables, and for each i, consider running M on the dataset X(i) where we replace the i'th row of X with the xed value 0.)

Final Project Next Step: You will have received comments on your initial project sketch

from homework 1. Using these comments, and rereading the \Final Project Guidelines" (http: //seas.harvard.edu/~salil/cs208/spring19/project-guidelines.pdf) document on the course website, re ne your topic ideas and, if you wish, seek out 1-2 collaborators on Piazza. Your group should submit a revised project proposal (or two) approximately a half-page long. (All group members should submit the same proposal.) Include at least one potential dataset you might use for experiments (unless you're doing a purely theoretical project), three citations to related works that either describe the use case (these could be popular press articles), the privacy risks, and/or methods you might build on. Also ask concrete questions where you could use additional pointers or guidance from us.

Grid search is a term from optimization and machine learning that refers to an exhaustive search through the hyperparameter space discretized into a grid (to make the search nite).

3