
Databases for the Data Lake

Written by Dan Woods | December 02, 2011

In a series of articles collected in the problem statement "Preparing for Big Data," we have outlined the concept of a data lake, a vision for a much wider, less organized form of storing and managing data for business intelligence purposes.

In a data warehouse, data goes through a structured process of extraction, transformation, and loading and is then stored in its cleaned and consolidated form in info cubes or other SQL structures. The flow in a data lake is less structured and less predictable. In a data lake, operational data from any available source arrives and is stored in a number of ways.

At one extreme in a data lake is streaming data that is only available to monitor. Then perhaps there are files containing operational data such as web logs, transaction logs, network activity logs, call detail records, lists of exceptions, sensor data, and so on.

This operational data may have a fixed structure or a variable structure that needs parsing. One of the first tasks in a data lake is to find rapid ways to evaluate and understand the structure of the operational data.
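
As a rough illustration of that first evaluation step, the sketch below profiles a sample of a log file in Python. It assumes Apache-style combined log lines; the pattern, the file name, and the fields it tallies are illustrative rather than prescriptive.

    import re
    from collections import Counter

    # Assumed input: Apache-style combined log lines; the pattern is illustrative.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)')

    def profile_log(path, sample_size=10000):
        """Parse a sample of the file and report how its fields vary."""
        statuses, methods, unparsed = Counter(), Counter(), 0
        with open(path) as f:
            for i, line in enumerate(f):
                if i >= sample_size:
                    break
                match = LOG_PATTERN.match(line)
                if match is None:
                    unparsed += 1  # lines that do not fit the assumed structure
                    continue
                statuses[match.group('status')] += 1
                methods[match.group('method')] += 1
        return {'statuses': statuses, 'methods': methods, 'unparsed': unparsed}

    print(profile_log('access.log'))  # hypothetical file name

A quick profile like this tells you whether the data fits the structure you expected and which fields are worth keeping before you invest in heavier tooling.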

Traditional system administration tools like Perl are giving way to purpose-built tools such as Splunk, although Python remains popular for munging data. But once you have started to understand what a dataset is telling you, there is usually a need to replicate the sort of intermediate storage that info cubes represent.

Technology like Hadoop or Data Rush can be used to distill the operational data into a condensed or organized form. The question then is how to store it in its partial form.
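
Whatever engine does the distillation, the shape of the job is usually a simple map-and-reduce pass over the raw records. Below is a minimal sketch of that pattern as a Hadoop Streaming-style script in Python; the choice of HTTP status codes as the key, and the field position it relies on, are assumptions for illustration.

    import sys

    def mapper():
        """Emit one (key, 1) pair per record; the key here is an HTTP status code."""
        for line in sys.stdin:
            fields = line.split()
            if len(fields) > 8:               # crude guard against malformed lines
                print('%s\t1' % fields[8])    # position 8 assumes combined log format

    def reducer():
        """Sum counts per key; Streaming delivers mapper output sorted by key."""
        current, total = None, 0
        for line in sys.stdin:
            key, value = line.rstrip('\n').split('\t')
            if key != current:
                if current is not None:
                    print('%s\t%d' % (current, total))
                current, total = key, 0
            total += int(value)
        if current is not None:
            print('%s\t%d' % (current, total))

    if __name__ == '__main__':
        # Run with an argument of "map" to act as the mapper, otherwise as the reducer.
        mapper() if sys.argv[1:] == ['map'] else reducer()

The same script can be handed to the Hadoop Streaming jar as both mapper and reducer, or simply piped through sort on a single machine while you are still experimenting.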

The experimentation that takes place during the evaluation process requires flexibility. You may start over several times as you figure out what sort of questions a data set can answer.

One solution is Splunk's capability for summary indexes, which represent indexed and summarized forms of the data and allow for rapid calculation over huge data sets. Another solution is NoSQL databases.

MarkLogic is a good example of a commercial database of this kind; Cassandra is an open source example. These databases have flexible schemas that can efficiently store either extracted fields or quantities that summarize data.
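
As a concrete sketch of that kind of intermediate storage, the snippet below writes hourly rollups into Cassandra using the DataStax Python driver. The keyspace, table, and column names are hypothetical, and the table is assumed to exist already with a matching CQL schema.

    from cassandra.cluster import Cluster

    # Hypothetical table, created beforehand with CQL along the lines of:
    #   CREATE TABLE lake.hourly_status (
    #       hour text, status text, hits bigint,
    #       PRIMARY KEY (hour, status));
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('lake')

    def store_rollup(hour, status_counts):
        """Persist one hour of pre-aggregated counts as individual rows."""
        for status, hits in status_counts.items():
            session.execute(
                "INSERT INTO hourly_status (hour, status, hits) VALUES (%s, %s, %s)",
                (hour, status, hits))

    store_rollup('2011-12-02T10', {'200': 9421, '404': 310, '500': 12})

The appeal is that the summary rows can evolve as your understanding of the data improves, without the up-front modeling a warehouse load would require.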

ThingWorx takes a different approach, using a graph database that allows extensive tagging and grouping of data. Any of these solutions represents an emergent info cube, if you will.

Once the value of the operational data is understood and the way it needs to be indexed or summarized is clear, programming systems that allow for high-volume processing, such as Data Rush or Hadoop, can be used to process the data into a usable form. That form can then be delivered in any number of ways: into a traditional info cube or data warehouse, into a SQL or NoSQL database, or directly in a format to be consumed by an analysis tool such as SAP Business Objects, IBM Cognos, Oracle Hyperion, SAS, QlikView, Tibco Spotfire, or Tableau.
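
That last hop is often nothing more exotic than writing the distilled data into a flat extract the analysis tool can import. Here is a minimal sketch, reusing the hypothetical hourly rollups from above and emitting a CSV file whose name and columns are illustrative:

    import csv

    def export_rollups(rows, path='hourly_status.csv'):
        """Write pre-aggregated rows as a flat CSV extract for an analysis tool."""
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['hour', 'status', 'hits'])  # header row for the tool
            writer.writerows(rows)

    export_rollups([('2011-12-02T10', '200', 9421),
                    ('2011-12-02T10', '404', 310)])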

The patterns by which data is evaluated for operational purposes and flexible NoSQL databases are put to use are still emerging. I am looking for examples of companies that have a lifecycle for evaluating operational data and storing it in flexible intermediate forms. Please comment on this story at the problem statement mentioned at the beginning of this article.