HW 3: The Centralized Curator Model

Instructions: Submit a single PDF file containing your solutions, plots, and analyses. Make sure to thoroughly explain your process and results for each problem. Also include your documented code and a link to a public repository with your code (such as GitHub/GitLab). Make sure to list all collaborators and references.




1. Tails, Trimming, and Winsorization: In all of the parts below, the dataset is $x \in \{0, 1, \dots, D\}^n$. In all of the implementation parts, you should write code that takes as input $D \in \mathbb{N}$, $n \in \mathbb{N}$, $x \in \{0, 1, \dots, D\}^n$, and $\varepsilon > 0$.



(a) Prove that the following algorithm for estimating a trimmed mean is $\varepsilon$-DP and implement it in code:

\[
M(x) = \frac{1}{0.9n}\left(\sum_{P_{.05} \le x_i \le P_{.95}} x_i\right) + \mathrm{Lap}\!\left(\frac{D}{\varepsilon n}\right),
\]

where $P_{.05}$ and $P_{.95}$ are the 5th and 95th percentiles of the dataset. That is, we are applying the Laplace mechanism after removing the bottom and top 5% of the dataset. (Hint: Think about Lipschitz constants.)
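For the implementation, a minimal sketch in Python (the function name and the use of numpy's default percentile convention are our own choices; match whatever percentile convention your proof assumes):

```python
import numpy as np

def trimmed_mean_dp(x, D, epsilon, rng=None):
    """eps-DP trimmed mean: drop points outside [P.05, P.95], then add
    Laplace noise with scale D/(eps*n), as in the formula above."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    lo, hi = np.percentile(x, 5), np.percentile(x, 95)  # non-private percentiles
    trimmed_sum = x[(x >= lo) & (x <= hi)].sum()
    return trimmed_sum / (0.9 * n) + rng.laplace(scale=D / (epsilon * n))
```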




(b) Prove that for large enough $n$, the analogous algorithm for the Winsorized mean is not $\varepsilon$-DP:

\[
M(x) = \frac{1}{n}\sum_{i=1}^{n} [x_i]_{P_{.05}}^{P_{.95}} + \mathrm{Lap}\!\left(\frac{D}{\varepsilon n}\right),
\]

where $[x]_a^b$ is defined as in Problem Set 2. In Winsorization, we clamp points rather than dropping them. (In class on 3/11, we incorrectly referred to dropping points as Winsorization.) Again, it may be useful to first think in terms of Lipschitz constants.
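As an informal illustration of why this fails (our own example, not part of the problem statement): if slightly more than 5% of the points sit at $D$ and the rest at 0, moving a single point across the 95th-percentile boundary can swing $P_{.95}$ between $D$ and 0, changing the clamped sum by far more than the $D$ that the $\mathrm{Lap}(D/(\varepsilon n))$ noise is calibrated to hide. A quick numeric check, taking the $t$-th percentile to be the $\lceil tn/100 \rceil$-th order statistic (one convention):

```python
import numpy as np

def percentile(x, t):
    # t-th percentile as the ceil(t*n/100)-th order statistic (one convention)
    xs = np.sort(x)
    return xs[int(np.ceil(t * len(x) / 100)) - 1]

def winsorized_sum(x):
    lo, hi = percentile(x, 5), percentile(x, 95)
    return np.clip(x, lo, hi).sum()

D, n = 1000, 100
x1 = np.array([0] * 94 + [D] * 6)  # P.95 = D: nothing gets clamped
x2 = np.array([0] * 95 + [D] * 5)  # neighbor: P.95 = 0, everything clamps to 0
print(winsorized_sum(x1) - winsorized_sum(x2))  # 6*D = 6000, much larger than D
```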




(c) In class, we saw how to use the exponential mechanism to compute an estimate of the median, $P_{.5}$.



Describe and implement a version of the exponential mechanism that releases an estimate of the $t$-th percentile $P_t$ of a dataset $x \in \{0, \dots, D\}^n$ for any desired $t \in [0, 100]$. (A direct implementation of the exponential mechanism would require explicitly calculating weights for each of the $D + 1$ possible outputs, which can be too slow for large values of $D$ such as in the parts below. One way to solve this is to bin the elements into fixed, coarser intervals.




Alternatively, you can sample more quickly from the output distribution of the exponential mechanism by noting that if you sort the elements of the dataset $x_{i_1} \le x_{i_2} \le \cdots \le x_{i_n}$, then all elements of each interval between $x_{i_j}$ and $x_{i_{j+1}}$ have the same weight, so you can sample by choosing an interval with probability proportional to the sum of weights within it and then sampling uniformly from that interval. Feel free to use either solution below.)
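A sketch of the second (interval-based) approach in Python. The utility function $u(y) = -\left|\#\{i : x_i \le y\} - tn/100\right|$, which has sensitivity 1, is one common choice here, not the only one; the naming is our own:

```python
import numpy as np

def dp_percentile(x, t, epsilon, D, rng=None):
    """eps-DP release of (approximately) the t-th percentile of x over {0,...,D}
    via the exponential mechanism, sampled interval-by-interval as suggested above."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    xs = np.sort(np.clip(x, 0, D))
    # Interval j = [edges[j], edges[j+1]) for j = 0..n; together they cover {0,...,D}.
    edges = np.concatenate(([0], xs, [D + 1])).astype(np.int64)
    # Every y in interval j has exactly j datapoints at or below it.
    utilities = -np.abs(np.arange(n + 1) - t * n / 100)
    widths = edges[1:] - edges[:-1]  # width 0 where datapoints repeat
    with np.errstate(divide="ignore"):
        logw = (epsilon / 2) * utilities + np.log(widths)
    logw -= logw.max()               # stabilize before exponentiating
    probs = np.exp(logw)
    probs /= probs.sum()
    j = rng.choice(n + 1, p=probs)
    return int(rng.integers(edges[j], edges[j + 1]))  # uniform within the interval
```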




(d) Implement the following $\varepsilon$-DP algorithm for estimating a trimmed mean of a dataset: use your algorithm from Part 1c to get $(\varepsilon/3)$-DP estimates $\hat{P}_{.05}$ and $\hat{P}_{.95}$ of the 5th and 95th percentiles, drop all datapoints that lie outside the range $[\hat{P}_{.05}, \hat{P}_{.95}]$, and then use the Laplace mechanism to compute an $(\varepsilon/3)$-DP mean of the trimmed data. That is, your code should compute




\[
M(x) = \frac{1}{0.9n}\left(\sum_{i \,:\, \hat{P}_{.05} \le x_i \le \hat{P}_{.95}} x_i\right) + \mathrm{Lap}\!\left(\frac{3(\hat{P}_{.95} - \hat{P}_{.05})}{0.9\,\varepsilon n}\right).
\]
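A sketch combining the pieces, reusing the hypothetical dp_percentile from our Part 1c sketch above:

```python
import numpy as np

def trimmed_mean_dp_percentiles(x, D, epsilon, rng=None):
    """eps-DP trimmed mean: eps/3 for each percentile estimate, eps/3 for the mean."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    p05 = dp_percentile(x, 5, epsilon / 3, D, rng)
    p95 = dp_percentile(x, 95, epsilon / 3, D, rng)
    p05, p95 = min(p05, p95), max(p05, p95)  # guard against crossed estimates
    trimmed_sum = x[(x >= p05) & (x <= p95)].sum()
    scale = 3 * (p95 - p05) / (0.9 * epsilon * n)
    return trimmed_sum / (0.9 * n) + rng.laplace(scale=scale)
```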

(e) Determine whether or not the following analogue for a Winsorized mean is $\varepsilon$-DP: use Part 1c to get $(\varepsilon/3)$-DP estimates $\hat{P}_{.05}$ and $\hat{P}_{.95}$ of the 5th and 95th percentiles, and output

\[
M(x) = \frac{1}{n}\sum_{i=1}^{n} [x_i]_{\hat{P}_{.05}}^{\hat{P}_{.95}} + \mathrm{Lap}\!\left(\frac{3(\hat{P}_{.95} - \hat{P}_{.05})}{\varepsilon n}\right).
\]


















You do not need to formally prove your answer, but you should at least provide an informal explanation.




(f) The dataset MaPUMS5full.csv provides the 5% PUMS Census file for Massachusetts. For $\varepsilon = 1$ and $D = 1{,}000{,}000$, compare the RMSE between DP means and the actual means for each PUMA in Massachusetts,¹ for DP means calculated using (i) the ordinary Laplace mechanism for a mean (remembering to clamp your data to the range!) and (ii) the algorithm from Part 1d. Also show box-and-whisker plots of the DP released means for each PUMA by these algorithms, noting the true means. You should probably order these by mean income, or perhaps skew of income, or anything you think reveals an interesting pattern. Give an intuitive explanation of the kinds of datasets on which algorithm (i) is likely to perform better than algorithm (ii) and vice-versa. Describe any modifications you would propose to increase the utility (at the same level of privacy preservation) for data similar to this income example.




2. Composition: Suppose you have a global privacy budget of $\varepsilon = 1$ (and are willing to tolerate $\delta = 10^{-9}$) and you want to release $k$ count queries (i.e. sums of Boolean predicates²) using the Laplace mechanism with an individual privacy loss of $\varepsilon_0$. By basic composition, you can set $\varepsilon_0 = \varepsilon/k$. Using the advanced composition theorem, you can set $\varepsilon_0 = \varepsilon/\sqrt{2k \ln(1/\delta)}$. We have provided you with code from PSI for the "optimal" composition theorem for differential privacy that calculates the largest value of $\varepsilon_0$ that ensures global $(\varepsilon, \delta)$-DP as a function of $\varepsilon$, $\delta$, and $k$.³ For each of these choices, plot (on the same graph) the standard deviation of the Laplace noise added to each query as a function of $k$, and find the smallest values of $k$ where the advanced and optimal composition theorems strictly improve upon the basic composition theorem.




3. Synthetic Data: Expanding the template from class, and using again MaPUMS5full.csv, create a DP three-way histogram⁴ release of income, education and age. You do not need to graph this histogram, just compute the release for each binned combination of the variables. From this, you should be able to generate synthetic data of these three variables.






¹ You can assume that the N in each PUMA is public information.

² A Boolean predicate is a function that returns a 0 or a 1. An example of a count query might be the sum of bits for all college students.

³ See the function update_parameters in /examples/wk5_centralized/psiExamples.r or psiExamples.ipynb.

⁴ That is, a histogram representation counting the occurrences of all possible combinations of the three binned variables.















Run a linear regression as a post-process on your synthetic data, predicting income from education and age⁵ using the equation:

\[
\mathrm{Income}_i = \beta_0 + \beta_1 \mathrm{Education}_i + \beta_2 \mathrm{Age}_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2) \tag{1}
\]

Let $\beta = \{\beta_0, \beta_1, \beta_2\}$ be the coefficients in the full sensitive data, while $\tilde{\beta}$ is the DP release we generate. The mean-squared error of a DP release of $\beta$ can be decomposed into the contributions of bias and variance as:

\[
\mathrm{MSE}(\tilde{\beta}) = \mathrm{bias}(\tilde{\beta})^2 + \mathrm{var}(\tilde{\beta}) = \left(\beta - \mathrm{E}[\tilde{\beta}]\right)^2 + \mathrm{E}\!\left[\left(\tilde{\beta} - \mathrm{E}[\tilde{\beta}]\right)^2\right] \tag{2}
\]

For this calculation, we are taking the (sensitive) regression coefficients on the entire dataset as the true values of $\beta$. Show the contributions to MSE of the bias and variance of the DP-regression coefficients.⁶




As a baseline to decide if these squared bias and variance terms are large, we can compute the MSE due simply to sampling, by bootstrapping with replacement new datasets on which we compute new (sensitive) regression estimates $\hat{\beta}$ and compute $\mathrm{MSE}(\hat{\beta})$. How do the bias and variance terms due to creating DP releases compare to this numerical estimate of the error introduced by sampling?
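A sketch of the bootstrap baseline (assuming a helper fit_coefs(df) that returns the three regression coefficients; that helper is hypothetical, not provided code):

```python
import numpy as np

def bootstrap_mse(df, fit_coefs, reps=200, rng=None):
    """MSE of (sensitive) regression coefficients refit on bootstrap resamples,
    relative to the full-data coefficients: a no-privacy sampling baseline."""
    rng = np.random.default_rng() if rng is None else rng
    beta = fit_coefs(df)  # hypothetical helper returning (b0, b1, b2)
    boots = np.array([fit_coefs(df.sample(frac=1, replace=True, random_state=rng))
                      for _ in range(reps)])
    return np.mean((boots - beta) ** 2, axis=0)  # per-coefficient MSE
```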




BONUS: Using your developed understanding of differential privacy, and the described use case in the Gaboardi et al. PSI paper, reexamine the deployed instance of the PSI budgeting tool, available at http://psiprivacy.org. Provide any feedback that you think would make the interface easier for the intended non-expert "data owner" user to budget a DP release, or would otherwise improve the system. (Note: Insightful, considered feedback will receive 1/2 point bonus, and feedback that strikes us as revelatory or as a particularly intriguing idea will receive 1 point bonus and a note of thanks in a future paper draft.)



Final Project: By April 9, submit a couple of pages giving a detailed description of what your final project will look like. You should be able to clearly state your research questions, briefly articulate how your project relates to what has been done in the past, describe the approach you are taking, give your timeline for completing various aspects of the project, and discuss your fallback plan in case you don't obtain the results that you're hoping to obtain.






















































⁵ You will likely find that log(income) has a more linear relationship with your other two variables, so feel free to shift from income to log(income) if you prefer. However, you will need to decide how to treat zero values in income; one option is to clip the lower bound of income to some small positive value.

⁶ To numerically compute the expectations, simply repeat your simulation many times and average.




