Practical Differential Privacy w/ Apache Beam

#apachebeam #privacy #googlecloud #go

Differential privacy is one of the most durable techniques for protecting user privacy. In a previous post, we explored how to build an Apache Beam pipeline that extracts and counts ngrams from HackerNews comments. Today, we'll take the same pipeline and upgrade it with some differential privacy goodness using Privacy-on-Beam from Google's differential privacy library.

Step 1: Identifying the User

Counterintuitively, adding differential privacy to a Beam pipeline requires you to have an ID for each user. You don't need to know who the user is; you just need a stable identifier for them in your dataset. Good starting choices include an autoincrementing user_id field or a username/email address. The ID you pick should map as closely as possible to the entity whose privacy you are trying to protect.

💁Tip: To err on the safe side, consider hashing or encrypting this ID to prevent yourself from accidentally logging it or debugging with it.
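
One way to follow this tip is to swap the raw ID for a keyed hash before it enters the pipeline. The sketch below is a minimal illustration using only Go's standard library. The pseudonymize helper and the hard-coded key are assumptions made up for this example; neither is part of Privacy-on-Beam. The property that matters is that the same user always maps to the same output, so contributions can still be grouped per user.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonymize is a hypothetical helper (not part of Beam or Privacy-on-Beam).
// It returns a stable, hex-encoded HMAC-SHA256 of the user ID.
func pseudonymize(key []byte, userID string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(userID)) // hash.Hash.Write never returns an error
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// Assumption: in a real pipeline the key would come from a secret store.
	key := []byte("example-key-do-not-reuse")
	fmt.Println(pseudonymize(key, "alice"))
	fmt.Println(pseudonymize(key, "alice") == pseudonymize(key, "bob")) // false
}
```

I've used a keyed hash rather than a bare SHA-256 because usernames are easy to guess and re-hash. That's my choice for the sketch, not something the library requires.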

Since we're using HackerNews comments, the author field is a pretty good choice. We'll start with some changes to grab the author for each comment and propagate that user along through the ngram extraction.

// CommentRow models 1 row of HackerNews comments.
type CommentRow struct {
    Author string `bigquery:"author"`
    Text   string `bigquery:"text"`
}

// AuthorNgram represents an ngram and its author.
type AuthorNgram struct {
    Author string
    Ngram  string
}

const query = `SELECT author, text
FROM ` + "`bigquery-public-data.hacker_news.comments`" + `
WHERE time_ts BETWEEN '2013-01-01' AND '2014-01-01'
AND author IS NOT NULL AND text IS NOT NULL
LIMIT 1000
`

func main() {
    // ...

    authorNgrams := beam.ParDo(s, func(row CommentRow, emit func(AuthorNgram)) {
        for _, gram := range ngram(row.Text, 1, 2, 3) {
            emit(AuthorNgram{Author: row.Author, Ngram: gram})
        }
    }, rows)

    // ...
}
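
The ngram helper called above comes from the previous post and isn't shown here. As a rough stand-in, it might look something like the sketch below. Splitting on whitespace and lowercasing are assumptions for illustration, not necessarily what the original helper does.

```go
package main

import (
	"fmt"
	"strings"
)

// ngram is a hypothetical stand-in for the helper from the previous post.
// It returns every run of n consecutive words for each requested size n.
func ngram(text string, sizes ...int) []string {
	words := strings.Fields(strings.ToLower(text))
	var out []string
	for _, n := range sizes {
		if n <= 0 {
			continue // ignore nonsensical sizes
		}
		for i := 0; i+n <= len(words); i++ {
			out = append(out, strings.Join(words[i:i+n], " "))
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", ngram("Great idea indeed", 1, 2))
	// ["great" "idea" "indeed" "great idea" "idea indeed"]
}
```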

Step 2: Set Up Privacy Budget

In differential privacy-land, epsilon and delta are the main ways of controlling how much can be learned about any specific user. Bigger numbers mean less noise and less privacy; smaller numbers mean more noise and more privacy. For our pipeline, we'll pick sample values of epsilon = 4 and delta = 0.0001.
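
For reference, here is the textbook definition behind those two numbers. A randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' that differ only in one user's data and for any set of outputs S, the inequality below holds. This is the standard formulation, not anything specific to Privacy-on-Beam.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```

With ε = 4 the multiplicative factor is e^4 ≈ 54.6, and δ = 10^-4 is the small additive slack on top of it.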

Why 4 and 10^-4? I don't know. Apple uses ε = 4 for some features according to its Differential Privacy Overview. I'd like to write a post on how to pick these numbers once I learn more.

// Configure differential privacy parameters.
epsilon := float64(4)   // ε = 4
delta := math.Pow10(-4) // δ = 1e-4
spec := pbeam.NewPrivacySpec(epsilon, delta)
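
To get a feel for what epsilon does, here is a minimal sketch of the Laplace mechanism, a classic way to make a count differentially private. It assumes each user can change the count by at most 1 (sensitivity 1) and ignores delta entirely. The names laplaceScale, sampleLaplace and averageNoise are mine, chosen for illustration. Privacy-on-Beam's real noise generation is more careful than this, and naive floating-point sampling like this shouldn't be relied on for actual privacy guarantees.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// laplaceScale returns the Laplace noise scale b = sensitivity / epsilon.
func laplaceScale(sensitivity, epsilon float64) float64 {
	return sensitivity / epsilon
}

// sampleLaplace draws one sample from a zero-centered Laplace distribution
// with the given scale: an exponential magnitude with a random sign.
func sampleLaplace(rng *rand.Rand, scale float64) float64 {
	magnitude := rng.ExpFloat64() * scale
	if rng.Intn(2) == 0 {
		return -magnitude
	}
	return magnitude
}

// averageNoise estimates the mean absolute noise over n samples.
func averageNoise(seed int64, scale float64, n int) float64 {
	rng := rand.New(rand.NewSource(seed)) // fixed seed so runs are repeatable
	total := 0.0
	for i := 0; i < n; i++ {
		total += math.Abs(sampleLaplace(rng, scale))
	}
	return total / float64(n)
}

func main() {
	for _, epsilon := range []float64{0.5, 1, 4} {
		scale := laplaceScale(1, epsilon)
		fmt.Printf("epsilon=%.1f scale=%.2f average |noise|=%.3f\n",
			epsilon, scale, averageNoise(42, scale, 100000))
	}
}
```

The average absolute noise should land close to the scale itself, so going from ε = 0.5 to ε = 4 shrinks the typical error on a count by roughly 8x, at the cost of a weaker privacy guarantee.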

Step 3: Make Private Data

Apache Beam pipelines use a PCollection as the primary container for data. Privacy-on-Beam introduces a new container, the PrivatePCollection, which acts like a PCollection but knows how to preserve privacy along the way.

Using the PrivacySpec from Step 2 and the PCollection<AuthorNgram> from Step 1, we can build a PrivatePCollection by letting the library know which field has our user ID, in this case, a reference to the Author field of the AuthorNgram struct. Passing the string name of a struct field feels a bit weird, but whatever.

pgrams := pbeam.MakePrivateFromStruct(s, authorNgrams, spec, "Author")

Step 4: Do Stats

In our previous pipeline, counting the ngrams was as simple as stats.Count(s, ngrams). Now that we have a PrivatePCollection, there's a bit more work involved.

First, we need to simplify the data to just the ngram, converting our PrivatePCollection<AuthorNgram> to a PrivatePCollection<string>. Behind the scenes, the PrivatePCollection will keep track of the author. We need to call the ParDo function from the privacy-on-beam package for this transform, not the usual beam one. It works the same way.

ngrams := pbeam.ParDo(s, func(row AuthorNgram, emit func(string)) {
    emit(row.Ngram)
}, pgrams)

With our PrivatePCollection of ngrams, we're now ready to count them. Privacy-on-Beam implements its own stats functions, which is where all the real magic happens.

When calling pbeam.Count, we'll also need to pick two more privacy parameters controlling the count behavior:

How many partitions (aka ngrams) a user can contribute to (MaxPartitionsContributed).
How many times a user can contribute to one partition, i.e. use the same ngram (MaxValue).

To keep it simple, let's say that a user can contribute to up to 700 different ngrams and can contribute 2 times to each ngram. In practice, this means that if User A makes 5 comments saying "great idea", only 2 of them will be counted. If User B writes enough comments to contribute 701 unique ngrams, 1 of them will be randomly dropped. These bounds cap how much any single user can change the output, which limits the influence of outliers and reduces the amount of noise needed in the results.

counts := pbeam.Count(s, ngrams, pbeam.CountParams{
    MaxPartitionsContributed: 700,
    MaxValue: 2,
})
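
To make those two bounds concrete, here is a small standalone sketch of contribution bounding in plain Go. It is not how Privacy-on-Beam implements it: the boundContributions helper is hypothetical, and it keeps the first contributions it sees instead of dropping the excess at random.

```go
package main

import "fmt"

// AuthorNgram mirrors the struct from Step 1.
type AuthorNgram struct {
	Author string
	Ngram  string
}

// boundContributions is a hypothetical helper. It returns per-ngram counts
// after limiting every author to maxPartitions distinct ngrams and
// maxPerPartition contributions to each of them (first come, first kept).
func boundContributions(rows []AuthorNgram, maxPartitions, maxPerPartition int) map[string]int {
	perAuthor := make(map[string]map[string]int) // author -> ngram -> kept so far
	counts := make(map[string]int)
	for _, row := range rows {
		kept := perAuthor[row.Author]
		if kept == nil {
			kept = make(map[string]int)
			perAuthor[row.Author] = kept
		}
		n, seen := kept[row.Ngram]
		if !seen && len(kept) >= maxPartitions {
			continue // author already used all of their partitions
		}
		if n >= maxPerPartition {
			continue // author already hit the cap for this ngram
		}
		kept[row.Ngram] = n + 1
		counts[row.Ngram]++
	}
	return counts
}

func main() {
	var rows []AuthorNgram
	for i := 0; i < 5; i++ {
		rows = append(rows, AuthorNgram{Author: "userA", Ngram: "great idea"})
	}
	rows = append(rows, AuthorNgram{Author: "userB", Ngram: "great idea"})
	counts := boundContributions(rows, 700, 2)
	fmt.Println(counts["great idea"]) // 3: two from userA, one from userB
}
```

With limits of 700 and 2, one author can touch at most 700 counts and move each by at most 2, so at most 700 × 2 = 1400 in total. That worst case is what the added noise has to cover, which is why tighter bounds mean less noise.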

Fin

With that, your upgrade is complete! The returned counts can be used much like the old pipeline's output: it's a PCollection<string, int64> that you can write to a text file, upload to BigQuery, further manipulate, etc. Unlike the first pipeline though, the ngram counts here are differentially private... the output gives away only a small, bounded amount of information about who wrote the HackerNews comments that contributed to them.

You can find the end-to-end code on GitHub, and a diff showing all the changes.

⚠️Warning: In my experience, running differentially private pipelines generally takes longer and requires more compute resources. Expect longer runtimes and, if you run this on Dataflow, more instances.

Top comments (1)

Brian Michalski • Dec 3 '20 • Edited

If anyone knows of good resources discussing how to pick epsilon and delta I'd love pointers, let me know!