One of the most misunderstood topics in privacy is what it means to
provide “anonymous” access to data. One often hears references to
“hashing” as a way of rendering data anonymous. As it turns out,
hashing is vastly overrated as an “anonymization” technique. In this
post, I’ll talk about what hashing is, and why it often fails to provide
effective anonymity.
What is hashing anyway? What we’re talking about is technically
called a “cryptographic hash function” (or, to super hardcore theory
nerds, a randomly chosen member of a pseudorandom function family–but I
digress). I’ll just call it a “hash” for short. A hash is a
mathematical function: you give it an input value and the function
thinks for a while and then emits an output value; and the same input
always yields the same output. What makes a hash special is that it is
as unpredictable as a mathematical function can be–it is designed so
that there is no rhyme or reason to its behavior, except for the iron
rule that the same input always yields the same output. (In this post
I’ll use a hash called SHA-1.)
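
To make the "same input, same output" rule concrete, here is a minimal Python sketch using the standard library's hashlib module. The helper name sha1_hex and the sample strings are assumptions picked for illustration; nothing here comes from a real data set.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of text as 40 hexadecimal characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# The iron rule: the same input always yields the same output.
first = sha1_hex("hello")
second = sha1_hex("hello")
print(first == second)  # True

# Changing a single character gives an output that looks unrelated.
print(sha1_hex("hello"))
print(sha1_hex("hellp"))
```

Running it should print True followed by two 40-character values with no visible relationship to each other.
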
With that out of the way, let’s consider whether hashing a Social
Security Number renders it “anonymous”. If you hash my SSN, the result
is b0254c86634ff9d0800561732049ce09a2d003e1. (Let’s call this the “b02
value” for short.) That looks nothing like my SSN–but that in itself
does not make the value “anonymous”. The key question is whether a
person who gets the b02 value can figure out what my SSN is.
How might an analyst who has the b02 value try to determine my SSN?
One approach that doesn’t work is to try to run the hash function
backward–or as a mathematician would say, to find its inverse. Many
functions can be run backward. Consider the function that adds 17 to
its input. To run that function backward, you just subtract 17.
The hash has an inverse (of a sort) but nobody knows what it is, and as
far as anyone knows it is not feasible to find the inverse. So a smart
analyst will give up on the invert-the-hash approach.
But there is another trick available to the analyst–and this trick
will work. The analyst simply guesses my SSN–he enumerates all of the
possible nine-digit SSNs and hashes each one. When he hashes my
correct SSN, the result will be equal to the b02 value, so he will know
that he guessed right. You might think it would take a long time to
run through all of the possible SSNs, but computers are very fast–there
are “only” one billion possible SSNs, so your laptop can hash all of
them in less time than it takes you to get a cup of coffee.
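
The guessing attack described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tuned attack tool: it assumes the value was hashed as a plain string of digits with no dashes, the function name guess_by_enumeration is made up for this example, and the demo uses a five-digit space and an invented value so that it finishes almost instantly.

```python
import hashlib
from typing import Optional

def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of text as 40 hexadecimal characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def guess_by_enumeration(target_hash: str, digits: int = 9) -> Optional[str]:
    """Hash every zero-padded number with `digits` digits until one matches."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        if sha1_hex(candidate) == target_hash:
            return candidate
    return None  # no candidate in this space hashes to target_hash

# Demo on a five-digit space (100,000 candidates) instead of one billion.
secret = "04217"               # invented value, not a real identifier
published = sha1_hex(secret)   # what the analyst gets to see
recovered = guess_by_enumeration(published, digits=5)
print(recovered)  # 04217
```

Calling the same function with digits=9 walks the full one billion candidates. Pure Python will be noticeably slower at that than optimized native code, but the search space is exactly the same.
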
A clever analyst would do it even faster. He would hash all of the
possible SSNs in advance, and build an index that allowed him to recover
the SSN from its corresponding hash value in the blink of an eye.
Hashing the SSN would offer no protection at all against an analyst who
had built such an index.
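
A rough sketch of such an index, again in Python and again on a deliberately tiny space: build_index is a hypothetical helper name, and the four-digit demo stands in for the real nine-digit case. A full nine-digit index would not fit comfortably in an in-memory dictionary on most laptops, so in practice it would more likely live in a sorted file or a database, but the idea is the same.

```python
import hashlib
from typing import Dict

def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of text as 40 hexadecimal characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def build_index(digits: int) -> Dict[str, str]:
    """Map the hash of every `digits`-digit string back to that string."""
    index = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        index[sha1_hex(candidate)] = candidate
    return index

# Pay the hashing cost once, up front.
index = build_index(digits=4)

# Afterwards, each recovery is a single dictionary lookup.
seen_hash = sha1_hex("0042")   # invented value for the demo
lookup = index.get(seen_hash)
print(lookup)  # 0042
```
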
It should be clear by this point that hashing an SSN does not render
it anonymous. The same is true for any data field, unless it is much,
much, much harder to guess than an SSN–and bear in mind that in practice
the analyst who is doing the guessing might have access to other
information about the person in question, to help guide his guessing.
Does this mean that hashing always fails, and is never a good way to
scrub data? Almost, but not quite. There are more advanced uses of
hashing that can offer some protection in some settings. But the casual
assumption that hashing is sufficient to anonymize data is risky at
best, and usually wrong.
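
The post does not say which more advanced uses it has in mind. One commonly discussed candidate, offered here purely as an assumption, is a keyed hash (HMAC) whose key is kept secret from the analyst. Without the key, the analyst cannot compute the hash of a guess, so the enumerate-and-compare trick no longer works directly. The sketch below uses Python's standard hmac module; the function name keyed_hash, the choice of SHA-256, and the sample value are all illustrative.

```python
import hashlib
import hmac
import secrets

def keyed_hash(key: bytes, text: str) -> str:
    """Return the HMAC-SHA256 of text under key, as hexadecimal."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

# The key must stay secret; anyone who holds it can guess again.
key = secrets.token_bytes(32)
other_key = secrets.token_bytes(32)

value = "000000000"  # placeholder, not a real SSN
token = keyed_hash(key, value)

# Same key and same input still give the same output...
print(token == keyed_hash(key, value))        # True
# ...but without the right key, hashing a guess yields a different value.
print(token == keyed_hash(other_key, value))  # False
```

This offers only some protection in some settings, as the post says. Whoever holds the key can still run the guessing attack, and because equal inputs still produce equal outputs, records belonging to the same person remain linkable to one another.
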
[In case you’re wondering, the b02 value is not really the hash of my
SSN. It is the hash of the text string “my SSN”. There is no way I
would publish the hash of my actual SSN.]
Source: http://techatftc.wordpress.com/2012/04/22/does-hashing-make-data-anonymous/