The Right to be Forgotten gives people the right to have their information removed from the internet. On the surface this is a great thing and I'm a big supporter of it. However, when it comes to Open Source projects, things can get messy very quickly.
Many people realize that interacting with an Open Source project means that basically everything is open to the world, not just the source code. However, some people don't realize this, and hopefully this post will help explain it a bit better.
When it comes to the Right to be Forgotten, we only have a few places where we hold user data as part of running Pidgin. These are our mailing list archives (the lists themselves have been replaced by Discourse), our issue tracker, and our old developer WIKI.
Open Source mailing lists have historically been archived by the project itself and by other mailing list archives out on the internet. This is done for a number of reasons, but at the core the idea is that they contain a lot of discussion and answers to support questions, and having them available is acceptable.
However, people have requested us to forget them in our mailing list archives. There are a few options for doing this: randomizing their Personally Identifiable Information (PII), or just deleting any emails that mention them.
The former is a ton of work, and at the time of this writing I'm not aware of a tool that does it. The latter is of course destructive and, depending on how active the requestor was on the mailing list, can have a huge impact on the value of the archives. But deleting everything from the user is generally the course that's taken, because there's also a possibility they posted PII in the body of their messages, which the former wouldn't detect unless it was the email address or whatever.
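As a rough illustration of what the first option involves, here is a minimal Python sketch. It is not an existing tool, and it assumes the requestor's identifiers (email address, display name) are known up front. The function names and the `forgotten-…` placeholder format are made up for this example. It also shows the limitation: anything not in the list of identifiers stays in the text.

```python
import hashlib
import re


def placeholder(identifier: str, salt: str) -> str:
    """Derive a stable stand-in for one identifier (hypothetical format)."""
    # Salted hash, lowercased first, so the same identifier always maps
    # to the same placeholder regardless of how it was capitalized.
    digest = hashlib.sha256((salt + identifier.lower()).encode("utf-8")).hexdigest()
    return f"forgotten-{digest[:12]}"


def pseudonymize(text: str, identifiers: list[str], salt: str) -> str:
    """Replace every known identifier in text, ignoring case."""
    # Longest first, so a full email address is replaced before any
    # shorter identifier that happens to be contained in it.
    for identifier in sorted(identifiers, key=len, reverse=True):
        replacement = placeholder(identifier, salt)
        pattern = re.compile(re.escape(identifier), re.IGNORECASE)
        text = pattern.sub(lambda _match: replacement, text)
    return text
```

Calling `pseudonymize(message_text, ["jane.doe@example.com", "Jane Doe"], salt="some-secret")` on each archived message would swap both identifiers for placeholders, but a phone number or street address in the body would pass through untouched.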
But there's a bigger issue here. Remember those other archives out there on the internet? The requestor has to send a request to every one of them. As far as I know there's no index of mailing list archives, which means deleting one or two doesn't really matter in the grand scheme of things.
That said, there's another sleeping dragon here. Many people just archive mailing lists in their email accounts. With a few scripts they can extract all the emails from their account into a format they can mirror on a website, adding yet another site that the requestor has to find and ask to be forgotten by.
So basically, if you think you might ever want to be forgotten, please stay off of the mailing lists :) I know this is unrealistic, but people need to be made aware of it.
That said, we have replaced our mailing lists with a Discourse instance at discourse.imfreedom.org. Discourse has an option for users to anonymize themselves which effectively gives them all the power to be forgotten without actually getting the project involved.
When they anonymize themselves all of their posts stick around but their username and stuff won't have any reference to them. But like email above, if the user put PII in a post it will remain unless it is manually removed.
So next up is the issue tracker. Previously we were using Edgewall Trac for issues and documentation via its WIKI functionality. We made this read-only years ago because it is abandonware now and wasn't scaling to the needs of the project.
When we did this, we redirected all of the URLs relating to tickets/issues to our new issue tracker. Our new issue tracker is JetBrains YouTrack and similar to Discourse, users can anonymize themselves. Again, this is just account details and does nothing about the contents of posts.
More importantly, when we imported all of the issues into YouTrack we only copied a few developer accounts. We did not copy any user accounts, so anywhere a user would be mentioned we instead used a user named Ghost. This was specifically done to respect people's privacy by not copying their data from one service to another.
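The idea can be sketched in a few lines of Python. This is not the actual import script: the field names (`reporter`, `comments`, `author`) and the `KEPT_ACCOUNTS` set are hypothetical, and only the Ghost user name comes from the migration itself.

```python
import copy

GHOST = "Ghost"
# Hypothetical allow-list of the developer accounts that were carried over.
KEPT_ACCOUNTS = {"dev-alice", "dev-bob"}


def attribute(username: str) -> str:
    """Return the username if it was migrated, otherwise the shared Ghost user."""
    return username if username in KEPT_ACCOUNTS else GHOST


def scrub_issue(issue: dict) -> dict:
    """Return a copy of an exported issue with non-migrated users replaced."""
    scrubbed = copy.deepcopy(issue)
    scrubbed["reporter"] = attribute(issue["reporter"])
    for comment in scrubbed.get("comments", []):
        comment["author"] = attribute(comment["author"])
    return scrubbed
```

Note that a mapping like this only changes who an issue or comment is attributed to. The text of the issue is left alone, so any PII typed into a description or comment would still be there.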
Semi-recently, we even went through and crawled our Trac instance and saved it all to static HTML. What you see today on developer.pidgin.im is the result of that crawl. This means we were able to shut Trac down completely, and since Trac and its backing database are gone, users can't delete their accounts.
However, there is still the WIKI to worry about, and that's actually where the motivation for this whole post came from. But before we get to that, let's discuss what information is stored in the WIKI.
We are still serving the read-only versions of the WIKI on developer.pidgin.im because we haven't yet migrated them to the new site. We haven't migrated them because we determined that pidgin.im isn't the right place for all this information and that we should create a new developer.pidgin.im but we haven't gotten around to doing it yet.
So the WIKI can still mention user accounts, their edit history, and all of the edits they made. Now generally speaking the list of contributors here is much smaller than the one creating issues, but it's not zero. Luckily for us so far, no one that has edited the WIKI has requested to be forgotten.
So back to the impetus of this post.
We received a right to be forgotten email request to security (at) pidgin.im. It turns out that email address was not configured as we intended, but our provider figured that out and forwarded the email on to me.
The requestor was aware of our move to YouTrack but apparently did no investigation into whether any of their data was on there. I can't fault them too much here, as I looked and apparently I never documented anything about the migration to YouTrack.
So knowing that all of the issues had already been anonymized and that we're redirecting the links from developer.pidgin.im to YouTrack, I knew we were in the clear there. That just left the wiki.
So I have a copy of the entire static dump we created of developer.pidgin.im. It is 30 GB of uncompressed text across 1,142,949 files. It contains everything except the tickets that we redirect away from, because I crawled the live site directly, which had the redirects in place.
grim@spectre:~/d/trac-static$ du -sh
30G .
grim@spectre:~/d/trac-static$ find -type f | wc -l
1142949
So to check the WIKI, I have to grep through all of it. Even on my beefy machine, this takes a fair amount of time as you can see from the output below.
grim@spectre:~/d/trac-static$ time grep -ri RequestorsUserName * | grep -v 'developer.pidgin.im/query' | grep -v "developer.pidgin.im/timeline" | cut -d: -f1 | sort | uniq -c
real 2m21.859s
user 0m38.984s
sys 0m29.105s
I'm ignoring the query and timeline paths because they're related to tickets and weren't redirected until after I created the dump.
At any rate, you can see that we did not find any hits for the requestor.
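If this check has to be repeated for future requests, it could also be written as a small script. The following is a rough Python equivalent of the pipeline above, offered as a sketch rather than what was actually run. The function name `count_hits` is made up, and it assumes the dump root contains a `developer.pidgin.im/` directory, which is inferred from the filter patterns. It also skips files by path instead of filtering output lines, which is a simplification.

```python
import os
from collections import Counter

# Path prefixes (relative to the dump root) to skip, mirroring the grep -v filters.
SKIPPED_PREFIXES = ("developer.pidgin.im/query", "developer.pidgin.im/timeline")


def count_hits(root: str, needle: str) -> Counter:
    """Count lines containing needle (case-insensitive) per file under root."""
    # bytes.lower() only folds ASCII letters, which is enough for a plain username.
    needle_bytes = needle.lower().encode("utf-8")
    hits = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            relative = os.path.relpath(path, root).replace(os.sep, "/")
            if relative.startswith(SKIPPED_PREFIXES):
                continue
            # Read as bytes so files with odd encodings don't raise.
            with open(path, "rb") as handle:
                for line in handle:
                    if needle_bytes in line.lower():
                        hits[relative] += 1
    return hits
```

Running `count_hits(".", "RequestorsUserName")` from the root of the dump returns a mapping of file path to number of matching lines. An empty result corresponds to the empty grep output above.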
So while the requestor was able to help us find our misconfigured email address, everything else was left to us to determine if any of their data was even visible publicly.
Now I completely understand the desire to be forgotten, but Pidgin, like most Open Source projects, is run entirely by volunteers. When you force them to do work that you haven't done any of yourself, it takes them away from working on the project, and that's not good for anyone.