Don’t release anonymized datasets

There is no such thing as an anonymized dataset. Anybody propagating this idea, even tacitly, is doing a disservice to the informed debate on privacy. Here’s a round-up of some recent cases.

Re:publica

Just today Berlin visualization outfit Open Data City published a visualization of the devices that were connected to their access points during the Re:publica conference earlier this month. The visualization is a neat display of the ebb and flow of people in the various rooms during the event.

It is also a good attempt to change the discourse about data protection in Germany. The discourse tends to be locked in the full stop stance where absolutely ‘nothing is allowed’ without a ton of waivers. Because of that hassle, a lot of things which could be useful are not implemented. A more relaxed approach with case-by-case decisions would be better. In the case of Re:publica there does not seem to be any harm in making this visualization or in releasing the data (find it here on Fusion Tables where I uploaded it).

What I find to be a disservice to the general debate is the use of ‘pseudonymized’ data, where the device ids have been salted and hashed. The identifying characteristics have been removed, but the ids are still linked across sessions, making it possible to link identities with devices and figure out exactly who was where and when during the conference.
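To make the linkage problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the salt, the device ids, the rooms and the use of SHA-256 (I don't know which hash was actually used). It only shows that a salted hash hides the raw id while leaving one stable pseudonym per device.

```python
import hashlib
from collections import defaultdict

# Assumption: a single secret salt is reused for the whole dataset.
SALT = b"hypothetical-secret-salt"

def pseudonymize(device_id: str) -> str:
    """Replace a device id with a truncated salted hash (hypothetical scheme)."""
    return hashlib.sha256(SALT + device_id.encode()).hexdigest()[:12]

# Made-up raw access point log: (device id, room, time slot).
raw_log = [
    ("aa:bb:cc:00:00:01", "Stage 1", "10:00"),
    ("aa:bb:cc:00:00:02", "Stage 1", "10:00"),
    ("aa:bb:cc:00:00:01", "Workshop A", "11:00"),
    ("aa:bb:cc:00:00:01", "Stage 2", "14:00"),
]

# The "anonymized" release: same rows, ids replaced by pseudonyms.
released = [(pseudonymize(d), room, slot) for d, room, slot in raw_log]

# Because the pseudonym is stable, every row of one device groups together.
trails = defaultdict(list)
for pseudonym, room, slot in released:
    trails[pseudonym].append((slot, room))

# Somebody who saw one person in Workshop A at 11:00 can pick out that
# pseudonym and read off the rest of that person's day.
known = [p for p, visits in trails.items() if ("11:00", "Workshop A") in visits]
full_trail = trails[known[0]]
print(full_trail)
```

A single outside observation is enough to attach a name to a pseudonym, after which the whole trail comes for free. Dropping the id altogether, or using a fresh pseudonym per session, would avoid that.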

To state again: at a professional conference such as Re:publica there would in all likelihood be no harm done if the entire dataset were de-anonymized. The harm done is the pretense that processing a dataset in this way and then releasing it with the interlinkage across sessions is a good idea.

Which brings me to my next point.

Equens

Yesterday Equens, the Dutch company that processes all payment card transactions, announced a plan to sell these transactions to stores. Transactions would be anonymized but still linked to a single card. This would make it trivial for anybody with a comprehensive secondary dataset (let’s say Albert Heijn or Vodafone) to figure out which real person belongs to which anonymized card. That last fact was not reported in any of the media coverage of this announcement, which is also terrible.
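A rough sketch of how such a match could work, in Python. The card tokens, names, places and dates are all invented, and the matching rule (a person's known visits must all appear in a card's trail) is just one simple way to do it, not a description of anything Equens or a retailer actually does.

```python
# Hypothetical "anonymized" transactions: (card token, place, date).
transactions = [
    ("card_7f3a", "Utrecht", "2013-05-20"),
    ("card_7f3a", "Amsterdam", "2013-05-22"),
    ("card_7f3a", "Utrecht", "2013-05-25"),
    ("card_91bc", "Utrecht", "2013-05-20"),
    ("card_91bc", "Rotterdam", "2013-05-23"),
]

# Hypothetical secondary dataset with real names, e.g. loyalty card swipes.
loyalty = [
    ("Alice", "Utrecht", "2013-05-20"),
    ("Alice", "Amsterdam", "2013-05-22"),
    ("Alice", "Utrecht", "2013-05-25"),
    ("Bob", "Utrecht", "2013-05-20"),
    ("Bob", "Rotterdam", "2013-05-23"),
]

def footprints(rows):
    """Group rows into one set of (place, date) pairs per key."""
    result = {}
    for key, place, date in rows:
        result.setdefault(key, set()).add((place, date))
    return result

def reidentify(transactions, secondary):
    """Map each card to a person when exactly one person's visits fit its trail."""
    card_prints = footprints(transactions)
    person_prints = footprints(secondary)
    matches = {}
    for card, card_print in card_prints.items():
        candidates = [
            person for person, person_print in person_prints.items()
            if person_print <= card_print  # all known visits occur on this card
        ]
        if len(candidates) == 1:
            matches[card] = candidates[0]
    return matches

matches = reidentify(transactions, loyalty)
print(matches)
```

Note that both made-up people share one visit (Utrecht on the 20th). A single transaction is ambiguous, but the trail that the per-card link preserves is not.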

After a predictable uproar this plan was iced, but they will keep on testing the waters until they can implement something like this.

Today Foursquare released all real-time checkin data but with suitable anonymization. They publish only the location, a datetime and the gender of the person checking in. That is how this should be done.

License plates

Being in the business of opening data, we at Hack de Overheid had a similar incident: a dataset of license plates was released in which the plates had been MD5’ed without a salt. This made it trivial to find out whether a given license plate was in that dataset.

This was quickly fixed. Again this is not a plea against opening data —which is still a good idea in most cases— but a plea for thinking about the things you do.
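For those who want to see why unsalted hashing of license plates does so little, here is a small Python sketch. The plates and the toy plate format (two digits, two letters from a four-letter alphabet, two digits) are assumptions to keep the example fast. Real plate formats are larger but still small enough to enumerate.

```python
import hashlib
from itertools import product
from string import digits

def md5_hex(plate: str) -> str:
    """Unsalted MD5 of a plate, as in the incident described above."""
    return hashlib.md5(plate.encode()).hexdigest()

# Hypothetical released dataset: only the hashes are published.
released = {md5_hex(p) for p in ["12-AB-34", "56-CD-78"]}

def is_in_dataset(plate: str) -> bool:
    """Membership test: hash the plate you are curious about and look it up."""
    return md5_hex(plate) in released

def brute_force(released_hashes, letters="ABCD"):
    """Enumerate the whole toy plate space and recover every released plate."""
    found = {}
    pairs_of_digits = ["".join(p) for p in product(digits, repeat=2)]
    pairs_of_letters = ["".join(p) for p in product(letters, repeat=2)]
    for d1, mid, d2 in product(pairs_of_digits, pairs_of_letters, pairs_of_digits):
        plate = f"{d1}-{mid}-{d2}"
        digest = md5_hex(plate)
        if digest in released_hashes:
            found[digest] = plate
    return found

recovered = brute_force(released)
print(is_in_dataset("12-AB-34"), sorted(recovered.values()))
```

The membership test needs no attack at all, and because the space of possible plates is tiny, the hashes can simply be reversed by trying every plate. A secret salt stops outsiders from doing this, but as the Re:publica case shows it does not stop linkage.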

AOL search data

The arch-example of poorly anonymized search data is of course still the AOL search data leak from back in 2006. That case has been extensively documented, but not extensively learned from.

Memory online is frightfully short as is the nature of the medium but it becomes annoying if we want to make progress on something. Maybe it would be better altogether to lose the illusion that progress on anything can be made online.

For the privacy debate it would be good to keep in mind that the increasingly advanced statistical inference available means that almost all anonymization is going to fail. The only way around this is to not store data unless you have to or to accept the consequences when you do.

Who owns the future?

In Conversation: Jaron Lanier and James Bridle On Who Owns the Future? from The School of Life on Vimeo.

I have just watched the above conversation between Jaron Lanier and James Bridle in Conway Hall organized by the School of Life. The event was to mark the occasion of Lanier’s new book “Who Owns The Future?” (Guardian review) and the conversation focused on some interesting ideas from it. I will probably not read the book itself, but I think the things said in the video above can be taken by themselves and though they are provocative they do not motivate me to give Lanier any money.

The main issue is that although Lanier signals some interesting problems (he’s not alone: Om Malik just posted this about Data Darwinism), he makes some terrible comparisons and posits solutions that are wholly unconvincing.

Problems

Lanier’s big idea is that those with the biggest computers on the network (and the largest collection of brains to program those computers) are in danger of becoming the rentiers of big data. They will be able to out-compute everybody else and figure out what Gibson called the ‘order flow’ in his Blue Ant trilogy: the best set of actions given the circumstances.

That is an interesting if not exactly novel idea. It serves as a jumping off point into some outright crazy ideas about intellectual property. Lanier compares the contraction created by the current austerity measures with what is happening in the music industry. This is a ridiculous comparison. Even if it did hold, then whatever is happening is an overdue correction to a situation that was unsustainably overleveraged.

In the same vein he waves around the scarecrow that ‘the economy will shrink’. A notion that will undoubtedly play well with the same audience that is inclined to buy his book. Rhetoric about shrinking economies is almost always a phantom. Economic shrinkage may very well be in our near future and does not necessarily need to be a bad thing.

Lanier’s point that people are forced into an informal economy is valid but it speaks more to the failure of social systems than anything else. The social democratic contract that may be inconceivable for Americans is working quite well in Europe. It may need updating both for changing demographics and the digital age, but I don’t think many people here would trade it for what Lanier is peddling. Like I mentioned in my data tax post, we don’t have the problem of musicians who can’t pay their medical bills.

Solutions

The proposed solutions are even more problematic (though if you’re so inclined you might term them ‘thought provoking’).

Lanier seems overly influenced by the music industry and by the concept of private copyright. I would assert that the music industry with its track record is not something worth emulating. The sky is not falling in the music industry. They are facing a long overdue re-evaluation of their social contract because their carrier of value has lost its excludability. There are still lots of people making music and thriving.

Lanier seems to roughly comprehend how a just society should work: ‘For society to be democratic, income needs to be distributed in a way that is roughly a bell curve.’ At the same time he seems to be confused about how it should be implemented: ‘Socialism needs to be off the table in the information age.’

The bidirectional reference networks that Lanier proposes that preserve the context and provenance of data sound fantastic. There are however real reasons why we are doing the ‘profoundly dumb thing we are doing’ instead. His network sounds awfully similar to the idea of the semantic web, where everything online will work perfectly if only we would do it The Right Way (which we of course never will).

His solution to ‘Become as aware as possible of how you fit in other people’s computation schemes.’ is a good idea. It is the same algorithmic literacy pointed to in work by Kevin Slavin, Douglas Rushkoff and James Bridle himself.

I’m afraid that Lanier’s rhetoric of a ‘more honest accounting’ will play particularly well in Germany where similar words are already being used to take Google to court. Germany passed a Leistungsschutzrecht (ancillary copyright for publishers) because they figured out that large American companies were making outlandish amounts of money based on the work of large German publishing houses.

The conversation about a fair distribution of wealth in a power-law based networked economy is one we need to have. I doubt, though, whether this particular book is a good starting point for such a conversation. Lanier’s cultural foundations point us towards a solution that is at best unrealistic and tries to extrapolate the problematic private notion of copyright to society as a whole.

The data tax I wrote about yesterday is an approach from a more public point of view. That would focus more on personal data and the revenue generated from such a tax would go into government so it would be subject to democratic controls. Ideas that won’t fly well with Lanier’s Silicon Valley crowd, but maybe that’s all the better.

Taxing data is not crazy

There are some interesting similarities between a recent proposal commissioned by the French government and the book out by Jaron Lanier just now “Who Owns The Future?”

Both analyses signal the dominance of corporate actors in a big data world and both suggest new methods of taxation as a potential solution to the problem. An article over at Forbes explains the commission’s proposal by Nicolas Colin and makes a lot of sense.

The French report has been received with predictable knee-jerk responses across the tech world. It is true that governments have not been very good at regulating the internet. But not regulating the internet is not a solution. We could hope for representation that is competent when it comes to the digital world.

The companies that create the internet should not cry foul. They have a track record of evading taxes more than contributing their fair share back to society.

I’ll tackle Lanier’s position in another post. I just watched the conversation he had with James Bridle in Conway Hall and noticed some errors in Lanier’s ideas: they require a fully functional semantic web, they seem overly informed by private copyright practice and complementarily they take a weak government for granted.

How you would enforce such a law is another question entirely, but it cannot go further off the mark than how large companies manage to evade taxes right now. It may in fact be a lot fairer to tax data at the point of collection/use.

If you don’t bother to read the article above, I can sum it up in two key points below:

Data is hazardous waste material and as such its production and storage should be discouraged (the CO2 tax was given as an example in the Forbes article). Cory Doctorow compared personal data breaches to nuclear disasters, because the fallout is so tremendously hard to contain and control. Whoever collects large amounts of personal data treats the privacy damage caused by breaches as an externality. As such the storage of such data should be discouraged with a tax.

Data is capital and should be taxed as all capital is. Storage, mining and arbitrage using data can generate revenue for sophisticated market actors (those that Lanier terms as those with ‘the biggest computer on the network’). Data is a value adding asset that generates wealth and more data for those who already have it. If we don’t want a situation where a small group of people get richer at the expense of everybody else, we should tax it.

So data is both capital and hazardous. We tax many things with either of those properties so we should definitely tax something that has both.

Hosting on Heroku with functioning MX records

It is not completely obvious how to host a website on Heroku while also keeping e-mail delivery working. You would think that this is a very common situation and that it would be well documented, but unfortunately it is not.

We got a DNSimple account because that’s the way that Heroku allows naked domains to function. DNSimple sets up the ALIAS record for you rather easily, but what it doesn’t do is warn you if you have both MX and CNAME records on the same name. What happens is that the CNAME record always takes precedence as a redirect, so your e-mails are then routed to proxy.heroku.com. That is undesirable, and something DNSimple should warn against.

What turns out to be the best solution is to set ALIAS records for both your apex domain and your subdomains (as proposed here). This way you don’t need a CNAME record anymore that can interfere with other settings. Heroku in their documentation advise you to use a CNAME record, so I’m going to ask them if there are any problems with using an ALIAS for all web routing.
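The kind of warning I was missing is easy to sketch. The DNS rule behind it is that a name with a CNAME record should not carry any other record types. The Python below checks a zone for that. The zone contents, `example.com` and `myapp.herokuapp.com` are placeholders, and this is not how DNSimple represents records internally.

```python
from collections import defaultdict

# Hypothetical zone before the fix: (name, record type, value).
zone_before = [
    ("example.com", "CNAME", "proxy.heroku.com"),
    ("example.com", "MX", "mail.example.com"),
    ("www.example.com", "CNAME", "proxy.heroku.com"),
]

# Hypothetical zone after switching all web routing to ALIAS records.
zone_after = [
    ("example.com", "ALIAS", "myapp.herokuapp.com"),
    ("example.com", "MX", "mail.example.com"),
    ("www.example.com", "ALIAS", "myapp.herokuapp.com"),
]

def cname_conflicts(records):
    """Return names that have a CNAME next to other record types."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype)
    return {
        name: sorted(types - {"CNAME"})
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    }

print(cname_conflicts(zone_before))
print(cname_conflicts(zone_after))
```

In the first zone the MX record on `example.com` is shadowed by the CNAME. In the second there is no CNAME left to interfere, which is the setup described above.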

The other option would be to purchase another plan for Zerigo, which seems to be Heroku’s preferred solution for this issue right now. Again this is rather poorly documented, and we would have liked to be informed about it before we chose the DNSimple option.

Update: Heroku replied with the following.

Great question. The ALIAS record, created by DNSimple, is basically a bunch of magic that does a combination of what CNAMEs and A Records do, but does it behind the scenes. You can read more about the ALIAS records here: http://blog.dnsimple.com/zone-apex-naked-domain-alias-that-works/

That said, DNSimple would likely be better equipped to answer a question like this. I don’t see any reason why you couldn’t use ALIAS records in place of CNAMEs. There might be a slight difference in performance between the two, but I’m not certain enough about that to say for sure.

After which I asked the same question over at DNSimple on their blog. That comment is awaiting moderation and an answer but I’ll post that here as soon as it appears.

Watersnake, a simple voting app

My small project during Swhack was to create a Django version of a delegated voting system, partially inspired by Liquid Feedback and the manifold problems that system has. In particular, it is written on such an esoteric stack that it is nearly impossible to get running without root on a Linux machine, and let’s not even discuss the maintenance. What is even worse is that this makes it nearly impossible for outsiders to join the project and contribute to it significantly.

In this interview about Liquid Democracy you can read quite clearly how the technical mandate drives the direction of the project. Something that may not be very desirable if you think of it as a democracy-centric issue and not a technology-centric one.

So I wanted to see how hard it would be to write something similar in vanilla Django. It’s easy to hate on Django, but you can find tons of people who can work on it in just about every major city, the framework and the documentation are mature, and many parts of the framework can be called excellent.

I thought putting something together that at its core implements a delegated voting engine should be doable in an afternoon, and it was. What took the most time was playing around with the settings of the test runner, which I hadn’t really used before. So the watersnake app in this project does majority voting on single proposals with support for delegation. To see it work you have to run the tests, but building this out into a full-fledged (web) app that can be deployed to Heroku with a single command is technically trivial (and also time consuming).
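To give an idea of how small the core of such an engine can be, here is a plain Python sketch of majority voting with delegation. This is not the watersnake code and it leaves out the Django models. The function names, the rule that delegation cycles count as abstentions, and the "more yes than no" majority rule are all my assumptions for this illustration.

```python
from collections import Counter

def resolve(voter, direct_votes, delegations):
    """Follow a voter's delegation chain until somebody voted directly.

    Returns the vote that counts for this voter, or None when the chain
    ends in a non-voter or loops back on itself.
    """
    seen = set()
    current = voter
    while current not in direct_votes:
        if current in seen or current not in delegations:
            return None
        seen.add(current)
        current = delegations[current]
    return direct_votes[current]

def tally(voters, direct_votes, delegations):
    """Count one vote per voter, resolved through delegation."""
    counts = Counter()
    for voter in voters:
        choice = resolve(voter, direct_votes, delegations)
        if choice is not None:
            counts[choice] += 1
    return counts

def passes(counts):
    """Simple majority of the votes cast on a single proposal."""
    return counts["yes"] > counts["no"]

# Made-up example: two direct votes, a delegation chain and a cycle.
voters = ["ann", "bob", "cas", "dee", "eve", "fay"]
direct_votes = {"ann": "yes", "bob": "no"}
delegations = {"cas": "ann", "dee": "cas", "eve": "fay", "fay": "eve"}

counts = tally(voters, direct_votes, delegations)
print(dict(counts), passes(counts))
```

A direct vote always wins over a delegation here, because the chain stops at the first person who voted. That is one possible design choice among several.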

This wasn’t a stretch to implement right now because I’m also doing some other projects which border on collaborative writing/decision making/filtering. As always, technology is neither the problem nor the solution, but certain technical systems grant different socio-technical affordances than others. I will probably not work on this unless there is a clear demand, but I thought it would be useful to debunk the idea that building such a system needs to be difficult or complex.

Week 308

Besides the immense amount of things we did over at Hubbub last week, I also spent a lot of time doing various other things which sort of amazed me to be honest.

Giving this another go with my improved German skills #digiges

Tuesday I went to the Netzpolitische Abend here at c-base where Janneke Slöetjes of Bits of Freedom was one of the speakers. It was great fun catching up with what they’ve been busy with and the activist’s life.

And on Saturday Jan Lehnardt and I organized the first Swhack Berlin, a commemorative hackathon to do the things that we would normally only talk about. A round-up of the things we did is still forthcoming, but everybody is super-busy of course. It was a lot of fun and I was pleasantly surprised by the 10+ people who showed up and got busy. We’ll do another one sometime in the near future.

Swhack Berlin

So this Saturday Jan Lehnardt and I are having a small hackathon here in Berlin in remembrance of Aaron Swartz and to continue, in one small way, doing the work that needs to be done on the internet, in government and especially where those two meet.

We have done a lot of what we used to call ‘civic hacking’ in the past, a phrase that has been used so often by now that I’m slightly sickened when using it. But there is still a lot to be done and both resistance against the movement and co-optation are growing. In Germany, where I live now, things are still in a pre-dormant state. The internet is in a rather sorry state here and people are good at complaining but less so at changing things.

Saturday’s hackathon is meant to focus efforts and do random stuff. The stuff you normally never get around to doing because of the day-to-day business. I have some rather unorthodox ideas to change things but I could use some help. So join us!

29C3: Long live the protocoletariat

I followed the last CCC from a distance reading the Twitter fallout and keeping track of the live streams while getting work done in an empty Berlin. Besides the various controversies playing out, there were some good talks. What I found to be the best of the event was “Long live the protocoletariat” by Eleanor Saitta (@dymaxion) and Smári McCarthy (@smarimc) about a topic that is very near to the things I am thinking about: institutions and networks and all of the opportunities and problems associated with them. The presentation in the first thirty minutes of this video is well worth watching. Pull quotes below are paraphrases.

I have been to CCC once and didn’t feel the need to go again. I have been long disheartened by the odd turn that political consciousness has taken within that particular technological crowd. The combination of information/privacy fundamentalism with a total disdain for normal users is something which is normal in the open source world but not something I can support.

It is refreshing then to hear two people at CCC who pursue an agenda that I think is important in a manner that makes sense and is constructive. Briefly, here are the things from the talk that I found noteworthy.

They treat the various levels of obscurity and dysfunctionality built into Liquid Feedback, but on the whole they do agree that it is a functional system that needs some bug fixing.

Liquid Feedback, which seems to have been sparked by a blog post some years ago, is a good example of the primacy of the developer. Because of limitations in development capacity, whoever builds these things builds the definitive version. It remains definitive until somebody builds a better one (or the problem goes away). We don’t get the option of more consideration, or better design or any of the other things we would want. We get whatever time a volunteer can spare to hack something together that works. This also means that we often end up stuck in local optima, because there already is an implementation that is perceived to be ‘good enough’.

People who have the time to solve problems don’t have problems. Those with real problems are too busy coping with their problems to be able to solve them generically. —Smári McCarthy

“Don’t confuse math problems with human problems.” —Eleanor Saitta

An interesting next step is their demand of more thorough thinking from those aspiring to politics. They warn against an information politics that says: ‘We just want our current way of living without the bad things.’ I agree —and many others with me— that idealism needs a clear and functional vision of an alternative world with an implementation plan to get there.

What then follows is a comparison between institutions and networks. I think it is very interesting to think about the importance of these two and why they have such trouble dealing with each other. What we are doing at Hack de Overheid is one attempt at bridging a network with a bunch of Dutch institutions. We should come up with more translator services and adapter structures to make the two work together.

They then treat the protocolization of institutions. How an institution can be decomposed in process and substance. How the symbolic language that an institution accepts can be codified as an automaton and then be translated into a peer to peer communications protocol. One problem of such a protocol is that it lacks institutional memory and tacit knowledge. Networks consist of nodes that adhere to the protocol (by definition) and are in effect interchangeable which means they don’t have to remember over the whole.
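As a toy illustration of what codifying the language an institution accepts as an automaton could look like, here is a deterministic finite automaton in Python for a made-up permit procedure. The states, the messages and the procedure itself are my invention and do not come from the talk.

```python
# Hypothetical permit procedure: (current state, message) -> next state.
TRANSITIONS = {
    ("start", "submit"): "submitted",
    ("submitted", "review"): "reviewed",
    ("reviewed", "approve"): "granted",
    ("reviewed", "reject"): "denied",
    ("denied", "appeal"): "submitted",
}
ACCEPTING = {"granted"}

def run(messages, state="start"):
    """Feed a sequence of messages to the automaton.

    Returns the final state, or None as soon as a message is not
    allowed in the current state.
    """
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            return None
        state = TRANSITIONS[key]
    return state

def accepts(messages):
    """A sequence is accepted when it ends with a granted permit."""
    return run(messages) in ACCEPTING

print(accepts(["submit", "review", "approve"]))
print(accepts(["submit", "approve"]))
```

Any node that holds this transition table can execute the process, which is the interchangeability mentioned above. What the table does not hold is any record of earlier cases or of why the rules are what they are, so the memory problem stays.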

Memory and knowledge are essential for the proper functioning of all organizations and that functionality needs to be coded in some way into the networked version. I’m reading James C. Scott right now and he talks at length about the high modernist folly of laying down ‘thin and brittle’ structures that do not work. Such structures have not been tested or used enough and lack the pliability and adaptations that are necessary for proper functioning.

Saitta and McCarthy propose to build institutions that only do long-term memory and let the process execution be handled by the network.

They then identify the open problems that still need work:

  1. Mapping the complexity classes and executive processes of institutions
  2. A language for protocolization of executive processes
  3. A decentralized but collectivized and compellable taxation protocol for an anonymous crypto currency
  4. Better tools for network-institution interactions
  5. A concept of network jurisprudence and mercy

The complexity theoretical treatment of social institutions is something that rather tickles my fancy. At university we never got to solve anything but the most theoretical of problems during those courses. I recently found some complexity theoretical treatments of games (“Classic Nintendo Games are (NP-)Hard”) and I look forward to even broader applications.

To stay in the vein of games, the problems stated in 1. and 2. are things that have a lot in common with what we do when we build games. The design of games consists of many similar information theoretical problems. Games may also be good staging grounds if you want to replace the nation state. The first thing that comes to mind to model these interactions is Joris Dormans’s Machinations, a finite state machine modeling tool.

Anyway it looks like there are tons of important and interesting problems still to be solved to which we as game practitioners might be able to contribute as well.

There are philosophical problems that we need to solve but they need to be directed towards the real world. —Eleanor Saitta

After the talk there follows a series of somewhat odd questions. The replies fortunately more than make up for it:

You need to have a sufficiently complete philosophical understanding of why your ideas make sense and how they are coherent and how they encompass [agriculture]. Otherwise your [privacy] arguments are going to fall flat. —Smári McCarthy

Instead we should build alternate structures. We are going to build this thing over here and it’s a much better way to run things. That can sort of infect into the world and obsolete other things. —Eleanor Saitta

That last one should be the golden test of activism: are you just complaining or are you doing something to actually make things better? If not, why not?

SZ: Echoes of chatter

I’m sitting on the train and get passed a link to a piece from Süddeutsche Zeitung about the internet and its sharing culture. This being my more-or-less favorite German newspaper, I dig into it expecting it to yield a solid piece of thought that will cause me to reflect on my online behaviour.

The real result is a lot less positive. It ends on this note:

We no longer have to invent anything, because Google and Facebook teach us that new ideas are easy to come by. It could even be that compliant, docile copyists are now more successful than those who are innovative.

Some old dude quotes selectively and writes about a subjective divide between digital and analog like you would find in the eighties. And it quotes an interview with Geert Lovink from 2007 that superficially treats ‘blogging’.

The piece opines that because of connectivity we will not be able to pay attention to what is important or come up with original thoughts ourselves. But it turns out that the Süddeutsche has fallen prey to that disease itself. Here as almost everywhere, German writing about the internet follows a predictable course that fails to illuminate.

Week 288: settling in and Munich

Coffee station if anybody fancies a cup

Monday I was given a Clever coffee maker and a Hario grinder to be able to make slow coffees at the office. Thanks Kars and Lea for being so attentive. I also made a start moving my books over but more and more having a professional physical library is feeling like a huge dead weight.

I would like to have these books in digital form but I’m sure as hell not going to pay for them all again at ebook markups. No way in hell. Bittorrent seems like a better option.

We’re very proud of Beestende being a game that actually does what it promises and we submitted it to the Dutch Game Awards.

A trailer for a reality show that I participated in about a year ago was released under the title Heetsel. Doing anything for tv or tv-like media feels intensely surreal and judging from the final edit that surreality is conveyed quite well by the delivered product.

I published the video and brief write-up of my NEXT Berlin talk about love and gamification over at Hubbub.

From the 14th floor the Alps are visible

On Wednesday I did random administrative stuff and prepped my visit to Munich the next day.

Munich is relaxed

On Friday I had coffee with Chris Eidhof at the new Barn, which is a stunningly large venue with a roaster and a very large coffee desk. The coffee is the same quality we’re used to, but its policies are a bit more restrictive. I won’t talk about the online tumult caused by this, but I hope they can sort it out quickly and then focus again on what they do best: brewing awesome coffee.

Nice place but it could use a touch of warmth

And finally I had a cup with Mustafa at the Five Elephant. Mustafa is an all-star programmer who has recently moved to Berlin to build a startup. Another too little publicized —soon to be— success story in the local scene.

OMG it's full of kites!