Swimming in the data lake

We’re awash with data and businesses have to figure out how not to drown in it.

Last week Yahoo! closed down its directory pages, ending one of the defining services of the 1990s internet and showing how much the internet has changed since the first dot-com boom.

The Yahoo! Directory was a victim of a fundamental change in how we manage data, as Google showed it wasn’t necessary to tag and label every piece of information before it could be used.

Yahoo!’s Directory was a classic case of applying old methods to new technologies – in this case carrying out a librarian’s function of cataloguing and categorising every web page.

One problem with that way of saving information is you need to know part of the answer before you can start searching; you need to have some idea of what category your query comes under or the name of the business or person you’re looking for.

That pain was exploited by the Yellow Pages, where licensees around the world harvested a healthy cash flow from businesses forced to list under a dozen different categories to make sure prospective customers found them.

With the arrival of Google, that way of structuring information came to an end as Sergey Brin and Larry Page’s smart algorithm showed it wasn’t necessary to pigeonhole information into highly structured databases.

Rather than being structured, data is now becoming ‘unstructured’ and instead of employing an army of clerks to categorise information it’s now the job of computers to analyse that raw information and pick out what we need for our businesses and lives.

As information pours into companies from increasingly diverse sources, a flood that’s becoming so great it’s being referred to as the ‘data lake’, it’s become clear the battle to structure data is lost.

At the Splunk Conference in Las Vegas this week, the term ‘data lake’ is being used a lot as the company explains its technology for analysing business information.

Splunk, along with services like IBM’s Watson, is one of the companies capitalising on businesses’ need to manage unstructured data by giving customers the tools to analyse their information without having first to shoehorn it into a database.

“Thanks to Google we got to look at data a different way,” says Splunk’s chief executive and chairman Godfrey Sullivan. “You don’t have to know the question before you start the search.”

It’s always dangerous applying simple labels to computing technologies but some terms, like ‘cloud computing’, don’t do a bad job of describing the principles involved and so it is with the ‘data lake’.

Rather than a nice, orderly world where everything can be pigeonholed, we now have a fluid environment where it wouldn’t be possible to label everything even if we wanted to. A lake is a good description of the mass of data pouring into our lives.

The web was an early example of having to manage that data lake and Google showed how it could be done. Now it’s the turn of other companies to apply the principles to business.

Google fatally damaged both Yahoo! and the Yellow Pages; other companies that are stuck in the age of structured data are going to find the future equally dismal. Don’t drown in that data lake.

Paul travelled to Las Vegas as a guest of Splunk.

Paul Wallbank is the publisher of Networked Globe. His personal blog, Decoding The New Economy, charts how our society is changing in the connected century.

