meta-roj

This site is currently broken

Friday, December 5, 2003

privacy and data mining

rakesh agrawal stole my stuff!

ok, not really. but i think i can see this person in my future.

i’ve been working on a few things that involve preserving privacy in a large collection of data that can still be analyzed. i ran through a couple of ideas – generally:

hashing, where the data is manipulated permanently before it’s analyzed (but that can destroy relevant information)
black-box queries, where you can ask a question, but you don’t get to see the raw data (but that makes it hard to reproduce results, and so to confirm that the work is valid)
compartmentalization, where only data important to the analysis is made available (but that means multiple analyses might piece together private information)
randomization, where the data is randomized as a set, and statistically relevant results still hold (but this generally needs a big raw data set)

well, i didn’t really come to any conclusions, except that any of these methods might be useful depending on the circumstances. in the particular circumstances i’m thinking about, the randomization approach seems the most useful.
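to make the randomization idea concrete, here is a minimal python sketch. it is a toy illustration under my own assumptions, not anyone's published method: the ages are made-up data, the `randomize` helper and the `spread` parameter are hypothetical names, and uniform zero-mean noise is just one simple choice. each individual value gets blurred, but the average over a big set barely moves.

```python
import random
import statistics

def randomize(values, spread, rng):
    """Add independent zero-mean uniform noise to every value.

    `spread` is a hypothetical tuning knob: larger means more privacy
    per record, but more records are needed for stable aggregates.
    """
    return [v + rng.uniform(-spread, spread) for v in values]

# Seeded generator so the sketch is repeatable.
rng = random.Random(2003)

# Made-up raw data: 10,000 ages (an assumption, purely for illustration).
ages = [rng.randint(18, 90) for _ in range(10_000)]

# What an analyst would actually be handed: the blurred values only.
noisy = randomize(ages, spread=30, rng=rng)

true_mean = statistics.mean(ages)
noisy_mean = statistics.mean(noisy)

# How far individual records moved versus how far the aggregate moved.
typical_shift = statistics.mean(abs(a - n) for a, n in zip(ages, noisy))
mean_shift = abs(true_mean - noisy_mean)

print(f"typical per-record shift: {typical_shift:.2f}")
print(f"shift in the mean:        {mean_shift:.2f}")
```

the per-record shift should come out much larger than the shift in the mean, and that gap only opens up with a lot of records, which is the "big raw data set" cost mentioned above.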

…off into the wild internet i go and amazingly enough, it’s been done.

here is an interview version. this [pdf] is one from acm. (see his page for papers).

the combination of protecting individual privacy and building an enormous database that can be combed (well, raked) for trends and historic comparisons is critical to improving my diet. i’m glad i don’t have to invent this wheel.

so this is all old news to me – why bring it up? rakesh was recently honored by scientific american as one of the top 50 contributors and contributions to science and technology. so he’s going to be a really popular guy now.

i just thought i’d get a number now… save me a place!

posted by roj at 12:28 am