June 8, 2014

Where is my Data Lake?

Growing interest in extracting value from the data stored in enterprise systems, and from the freely available data on the World Wide Web, has led the IT industry to coin a new term: Data Lake. The term Data Lake, or Business Data Lake, has been aggressively adopted by major big data software vendors such as Pivotal and Hortonworks, and it often dominates their marketing and sales pitches. Other vendors, such as Cloudera, call this concept a Data Hub, with minor changes in scope.

Gartner industry analyst Merv Adrian has named this concept the Data Reservoir (Dam), because stored water flows out of a reservoir under control. In his view, ‘Lake’ is not a good term because a lake does not have a controlled outflow of water.

There is general consensus that the enterprise needs a centralised data store to create new business insights. However, debate continues about the scope, governance and structure required to build and maintain such a store. In this blog, I will call the concept Data Lake, despite the limitations of the metaphor.

Let’s examine the common data management issues that the Data Lake concept is supposed to address.

The enterprise IT landscape has evolved over time through the implementation of commercial off-the-shelf (COTS) solutions and homegrown systems. Previous efforts to integrate data from these disparate systems have often resulted in stovepipe solutions designed to meet the needs of individual departments. This approach gave each department the flexibility to model the data for its own consumption. Viewed from the enterprise level, however, it is sub-optimal: multiple databases keep copies of the same data, so the data takes up more storage space. Flexibility in data modelling at the departmental level comes at the cost of data consistency at the enterprise level.

For reliable data storage, large organisations use Storage Area Networks (SANs), which are expensive because of the proprietary hardware and software stack required to build them. Duplicating data in a SAN leads not only to even higher storage costs but also to higher costs for the experts needed to manage it. Sometimes the same data is replicated as many as 70 times in a SAN to meet different business needs.
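The effect of duplication on raw capacity is simple arithmetic, sketched below in Python. This is only an illustration: the function names and the 2 TB dataset size are hypothetical, and only the figure of 70 copies comes from the text above.

```python
def replicated_footprint_tb(dataset_tb: float, copies: int) -> float:
    """Total capacity consumed when one dataset is stored as `copies` full copies."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return dataset_tb * copies


def redundant_tb(dataset_tb: float, copies: int) -> float:
    """Capacity spent on copies beyond the first one."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return dataset_tb * (copies - 1)


# Hypothetical 2 TB dataset, kept as 70 copies (the replication figure cited above).
total = replicated_footprint_tb(2.0, 70)
wasted = redundant_tb(2.0, 70)
print(f"total: {total} TB, redundant: {wasted} TB")
```

Under these assumptions, all but one copy's worth of capacity is redundant, and every redundant terabyte is billed at SAN prices.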

Unstructured data in the enterprise, such as emails, contracts and proposals, lies unutilised because traditional Enterprise Data Warehouse (EDW) technology was designed to manage data that can be neatly organised in rows and columns, in third or higher normal form. However, human-to-human communication often leaves its trail as unstructured data such as emails. Leaving this data out of important decisions means that a significant piece of information, one that could improve the quality of the decision, might be missing.

Devices such as servers, networking components, access control systems and even printers generate enormous amounts of log data, which can be used to monitor costs or support preventive controls. Enterprises are often not even aware that they generate massive amounts of machine data that could be put to work towards organisational goals.

Many enterprises are aware that a lot of data is available on the World Wide Web that could give them important insights into customer behaviour. But the lack of data integration and management tools, combined with the sheer cost of storing this data in a SAN, has made analysing it infeasible.

The traditional ETL development used to build a data warehouse typically takes from a couple of weeks to several months before the data is available for reporting and analysis. This is unacceptable to the business, which wants insight faster.

The Data Lake concept expands the often unfulfilled vision of the Enterprise Data Warehouse. It proposes building a new data platform with the capabilities to store and analyse massive data volumes, typically ranging from several hundred terabytes to several hundred petabytes, at a small fraction of the cost of building and maintaining a traditional EDW.

So the short answer to the question posed in the title of this blog is that you have to build your Data Lake and fill it with Data.

A few useful links to explain the Data Lake concept are here:

To be continued…
