On-line merchants and retailers routinely ask users for personal details, such as age, income or marital status, to better gauge who their customers are for marketing purposes. The trouble with these surveys, according to IBM, is that users routinely lie because of concerns over privacy.
To overcome this, Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant, researchers at IBM's Almaden Research Center, have developed a system called Privacy-Preserving Data Mining, which relies on the notion that one's personal data can be protected by being scrambled or randomised prior to being communicated to Web sites. "Our research institutionalises the notion of fibbing on the Internet and does so to preserve the overall reality behind the data," said Dr. Agrawal.
IBM claims that by applying this technique, a retailer can generate highly accurate data models without ever seeing personal information.
For example, a survey could ask users to enter their annual income, within a range of EUR50,000 to EUR150,000. But before that information is transmitted to the Web merchant, IBM's software would add a random value of between -EUR30,000 and +EUR30,000 to it. The merchant sets this randomisation parameter.
As a result, one user who earns EUR100,000 could transmit a figure of EUR85,000, while another could report the amount as EUR105,000, even though they both earn the same amount in reality. No record is kept of either user's true salary. On a per-user basis, the survey results are useless because each reported figure is deliberately inaccurate. But when enough users are surveyed, IBM's software can apply algorithms that compensate for the data scrambling across the group as a whole.
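To see why scrambled answers can still be useful in aggregate, the idea can be sketched in a few lines of Python. This is an illustration only and not IBM's algorithm, whose details the article does not describe. It assumes the random value is drawn uniformly from the -EUR30,000 to +EUR30,000 range, the true incomes are simulated, and the names `randomise` and `estimate_aggregates` are hypothetical.

```python
import random
import statistics

NOISE = 30_000  # half-width of the randomisation range, taken from the example above


def randomise(true_income, rng):
    """Perturb one answer on the user's side before it is transmitted.

    Assumption: the noise is uniform on [-NOISE, +NOISE].
    """
    return true_income + rng.uniform(-NOISE, NOISE)


def estimate_aggregates(reported):
    """Estimate the mean and standard deviation of the true incomes.

    Only the scrambled values are used. The noise averages out to zero, so
    the mean needs no correction. Independent uniform noise on [-a, a] has
    variance a**2 / 3, which is subtracted from the observed variance.
    """
    mean = statistics.fmean(reported)
    noise_variance = NOISE ** 2 / 3
    true_variance = max(statistics.pvariance(reported) - noise_variance, 0.0)
    return mean, true_variance ** 0.5


# Simulated survey: 100,000 users with incomes between EUR50,000 and EUR150,000.
rng = random.Random(42)
true_incomes = [rng.uniform(50_000, 150_000) for _ in range(100_000)]
reported = [randomise(income, rng) for income in true_incomes]

est_mean, est_sd = estimate_aggregates(reported)
print(f"true mean      {statistics.fmean(true_incomes):12.0f}")
print(f"estimated mean {est_mean:12.0f}")
print(f"true sd        {statistics.pstdev(true_incomes):12.0f}")
print(f"estimated sd   {est_sd:12.0f}")
```

Any single reported figure can be off by as much as EUR30,000, yet with a large sample the estimated mean and spread should land close to the true ones. The sketch recovers only two summary statistics, whereas the article describes building full data models.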
"The beauty of this research is that retailers and other Web businesses are able to extract the valuable demographic information they need without necessarily knowing the underlying personal consumer data," said Harriet P. Pearson, IBM's chief privacy officer.
The new technology comes as retailers are facing increasing pressure to stop collecting information on users. According to a March 2002 survey from the Progress and Freedom Foundation think tank, commercial Internet sites are collecting less information on visitors.
That survey said that among the 100 most popular domains in the US, the proportion collecting personal information fell from 96 percent in May 2000 to 84 percent in December 2001. The proportion of domains using third-party cookies has also declined from 78 percent in May 2000 to 48 percent by the end of last year.
According to Dr. Agrawal, the Privacy-Preserving Data Mining research has a wide range of potential applications. These range from medical research, where disease prediction models could be built from randomised individual medical histories, to e-commerce, where promotions could be targeted accurately using randomised demographics of individual users.