Big data overload: Even Google can get it wrong
Monday, April 30, 2012
As dull and dry as it appears, data management can land leading companies in hot water.
Take Google. The global search-engine-company-turned-verb has been getting heat from regulators about its policies and processes for harvesting emails, passwords and other sensitive personal information from customers in Australia and around the world.
Google was fined $US25,000 ($A23,900) for obstructing the inquiry, even though the company was found not to have broken any laws, and its shiny reputation was tarnished by the bad publicity.
Of course $25,000 isn’t much, you retort.
But what if your data management processes cost your company $5 million every year, or perhaps $20 million? That is the nightmare data storage cost facing leading companies as stored data proliferates into terabytes (one terabyte is 1000 gigabytes).
Medium-sized companies typically store between 10 terabytes and 50 terabytes, large ones struggle under the weight of up to 250 terabytes, and for very large companies the data burden can be a petabyte (1000 terabytes) or more. The volumes are becoming unimaginable.
And then there are the privacy issues. Thousands of companies are risking privacy law breaches, a survey released today by the National Association for Information Destruction has found.
The survey of more than 400 companies found just over half had formal policies.
Inadequate processes can lead to leaks – intentional or accidental – of personal information such as phone numbers, email addresses and even credit card details.
The survey found many companies are unconcerned, but privacy commissioners are worried and are starting a national awareness campaign during Privacy Awareness Week, from April 29 to May 5.
Sydney company Nuix reckons companies can usually cut storage in half by identifying, indexing and removing redundant, outdated or trivial data.
Nuix today released a white paper, Defensible Deletion – Quantifying the Benefits, which identifies the cost of data storage and the risk of data mismanagement, and outlines the costs and risks of different data management strategies.
Not all data is equal and the worst kind is unstructured data, especially those rabbit-rate breeders: emails.
The trick is to find all duplicate emails and identify all the duplicate attachments, which can be in as many as 400 different formats, according to Nuix chief executive Eddie Sheehy.
“We understand the algorithm about how data is stored and we have the software that can understand that the email is here and here is a PDF that has the same content. These are called near duplicates,” he says.
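The idea of matching the same content across files can be illustrated with a minimal Python sketch. It is not Nuix's method, which the article does not describe. It assumes the text has already been extracted from each email or attachment, and it only catches copies that differ in whitespace and letter case; true near-duplicate detection needs more than this. The file names and the functions content_fingerprint and group_duplicates are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash text after collapsing whitespace and case, so trivially
    different copies of the same content share a fingerprint."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(documents: dict[str, str]) -> list[list[str]]:
    """Return groups of document names whose extracted text matches."""
    groups = defaultdict(list)
    for name, text in documents.items():
        groups[content_fingerprint(text)].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical sample: text already extracted from an email and two files.
sample = {
    "email_001.eml": "Quarterly report attached.\nRegards",
    "report_copy.pdf": "quarterly   report attached. regards",
    "memo.docx": "Unrelated memo",
}
print(group_duplicates(sample))
```

In this sample the email and the PDF land in the same group while the memo stands alone.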
Blue chip names are turning to Nuix to dig their way out of the data mound. They include the Australian Taxation Office, the Australian Securities and Investments Commission, the National Australia Bank and the Defence Department in Australia, plus overseas notables the Securities and Exchange Commission, US Homeland Security and Barclays Bank.
Its biggest job, for a Wall Street bank, was indexing 3.1 billion emails – 300 terabytes of data.
“They are telling the world they are saving tens of millions of dollars and we charged them a fraction of that,” says Sheehy, who adds that the typical return on investment takes six months.
After all duplicate documents have been indexed, searched and checked, they can be safely deleted.
Removing outdated data, once indexed and searchable, is a simple task, according to Sheehy.
“We just look for the date in your document retention policy beyond which documents can be deleted. We find all documents beyond that date, search them, index and check them and then, with the all clear, delete them.”
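The date test Sheehy describes might look something like the Python sketch below. It is an illustration under stated assumptions, not Nuix's software: the seven-year retention period, the file names and the function past_retention are all made up, and each document is assumed to carry a single last-modified date. The function only lists candidates; nothing is deleted, since the checking step comes first.

```python
from datetime import date


def past_retention(documents: dict[str, date], today: date,
                   retention_years: int) -> list[str]:
    """Return names of documents last modified before the retention cutoff."""
    try:
        cutoff = today.replace(year=today.year - retention_years)
    except ValueError:
        # today is 29 February and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - retention_years, day=28)
    return sorted(name for name, modified in documents.items()
                  if modified < cutoff)


# Hypothetical sample and a made-up seven-year retention policy.
sample = {
    "contract_2003.pdf": date(2003, 6, 1),
    "invoice_2010.pdf": date(2010, 2, 15),
}
candidates = past_retention(sample, today=date(2012, 4, 30), retention_years=7)
print(candidates)
```

With these sample dates only the 2003 contract falls before the cutoff of April 30, 2005.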
Paradoxically, the “trivial” data is the hard stuff.
“This is the stuff that keeps directors awake at night, it is the private information retained when it should be deleted, it is where one employee is abusing another or it is pornography,” Sheehy says.
“I have seen organisations where 5% of employees were exchanging pornographic files.”
Once the data is indexed and reduced, companies can search it to identify risks and to respond to legal or regulator demands for information. More proactively, they can use it to identify trends.
“For example using a ‘sentiment’ analysis of a group of sales staff in one state you might find that the sales staff in Queensland are more positive this quarter, providing a good leading indicator for management,” Sheehy says.
Nuix has patents pending on the indexing software that forms the nub of the services it provides. The company has doubled staff and revenue each year for the past five years to reach an unconfirmed annual revenue of $30 million.