Wouldn’t it be great if everyone could find important insights from data, just like Hilary Mason and DJ Patil? Yes, it would, but this idea doesn’t have a hope of working unless the citizens in question are armed with the right tools and a sober, realistic vision for how they are supposed to find value in data. Just throwing technology and data at a group of aspiring citizen data scientists won’t work. Here’s what will.
What Is a Citizen Data Scientist?
First of all, we must agree on a definition and a goal for the role of citizen data scientist. Gartner analyst Alexander Linden has been promoting the term recently in research and blogs. In a Q&A titled "Why Should CIOs Consider Advanced Analytics?" Linden defines data science as follows:
"Data science is the discipline of extracting nontrivial knowledge from often complex and voluminous data to improve decision making. It involves a variety of core steps, including business and data understanding and data modeling."
Linden predicts that citizen data scientists will be extensions of a central data science lab, distributed throughout the organization:
"Lots of our most advanced clients are experimenting with the notions of chief data officers (CDOs) or chief analytics officers (CAOs). Sometimes the CDO/CAO will directly command a (virtual) data science lab. We think that those labs must be orchestrated virtually, with the (citizen) data scientists distributed throughout the organization."
Gartner predicts that the number of citizen data scientists will grow five times faster through 2017 than the number of data scientists. If Gartner and Linden are right, a whole lot of data science will flood the world of business.
In my view, the role of a citizen data scientist is to create the same value as a data scientist but using a simpler set of tools. The data scientist is like the video game designer. The citizen data scientist is the person who plays the game. In the most ambitious vision, the output from a citizen data scientist should be visualizations, new kinds of KPIs, simple models created using specialized tools, and reports that make a difference in running a business, although in practice it can take a while to make all of these things possible, and some, like modeling, may be performed mostly by data scientists.
But the citizen data scientist won’t be programming in Python or using Java to create custom Hadoop applications in MapReduce or Cascading. The citizen data scientist plays a simpler game and requires simpler tools that are powerful enough to create advanced results.
The question is: Who is going to create the video game infrastructure? Data scientists are not in the business of creating configurable software systems. They find signals and create ways of making data useful. For a citizen data scientist to thrive, the way data science is done will have to reach a new level of simplicity and end-to-end integration. We will return to this question later.
Citizen Data Science: The Video Game
Here is a vision for what must be addressed by a simpler set of video-game-like tools that will allow a citizen data scientist to succeed:
Data Supply Chain: Without data, a citizen data scientist doesn’t have much to do. Without an expanding collection of data, the impact of a citizen data scientist will be limited. In the past, data was assembled in a highly curated manner to create a data warehouse and also in a highly informal manner to assemble data for use in spreadsheets. Both approaches had their problems. A citizen data scientist needs a large and growing repository of data that comes from internal and external sources. My vision for this is a data supply chain in which raw materials are assembled from a variety of sources and then delivered into a repository, usually called a data lake, that can be used by citizen data scientists to find raw materials. Some of the ETL processes that were part of creating a data warehouse will be performed to ensure that the data arriving is of high quality and integrated in obvious ways so that collections of related data show up as one data set.
Data Catalog: In the world of the data warehouse, the volume of data did not grow rapidly. The contents of the data warehouse were well known, a form of tribal knowledge. The citizen data scientist lives in the world of the data lake. The amount of data is huge and growing. A data catalog that describes each data set—tells where it came from, how it was transformed, and provides any notes offered by those who have used the data set—is a crucial element for success because it allows data to be easily found and contextually understood.
Data Profiling and Cleansing: Once you have located a data set, what does it mean? What can it tell you? In the past, you could use queries to pound away at data, inspect it directly, and get an idea of what it contained. In the world of big data and unstructured data, this is no longer a viable option. To get an idea of what is inside a huge or unstructured data set, you need machine learning and advanced analytics suited to the purpose. Citizen data scientists need an environment that guides them through this process and suggests ways to cleanse and explore the data using advanced technology. In this way, citizen data scientists can discover the signals present in a data set on their own.
Shareable Data Objects: The data that is landed in a data lake is the foundation. As this data is understood with respect to various applications, the need for common objects will appear. For example, one object could collect every key element about a customer into one data set. Another may assemble all the data about a geography or a product. These data sets become shareable objects that are used over and over. It is likely that several levels of shareable data objects will be needed, in addition to data governance that allows for management and access control. There must also be mechanisms to create data pipelines that refresh shareable objects as new data arrives.
Data Lineage: For both the data that arrives from the data supply chain and the shareable data objects, a citizen data scientist must be able to find out where the data came from and how the data set was constructed. The system for capturing lineage should also capture comments from citizen data scientists. This information, which should be part of the data catalog, replaces the tribal knowledge that was used in the era of the data warehouse. It allows citizen data scientists to work independently and also to capture and share thoughts about what they discovered while studying a data set. The data lineage must be carried all the way through the process of making use of data and be available to the end users who are using data in dashboards, visualizations, and reports.
Creation of Models: One of the most challenging tasks of a data scientist is to create a model. Usually such models are equations that show how a set of independent variables can predict a dependent variable. In the past, the creation of such models was a virtuoso activity. But in the modern world, a variety of new technology allows machine learning and advanced analytics to assist citizen data scientists in creating models. Providing such assistance is a key point of leverage to make citizen data scientists maximally productive.
Creation of Visualizations and Dashboards: Presenting what has been discovered in a dashboard, visualization, or report is an art form. Citizen data scientists will vary in their ability to practice this art. For that reason and to make everyone as productive as possible, citizen data scientists require a simplified environment for creating dashboards, visualizations, and reports that provides guidance and assistance. In this way, data can be presented in proven patterns and citizen data scientists don’t have to re-invent a new approach each time.
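To make the catalog and lineage ideas above a little more concrete, here is a minimal sketch in Python of what a catalog entry might record: where a data set came from, how it was transformed, which data sets it was built from, and the notes left by people who used it. Every name here (CatalogEntry, DataCatalog, customer_360, and so on) is hypothetical and invented for illustration; it is not drawn from any particular product, and a real catalog would need storage, governance, and access control on top of this.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One data set as described in a hypothetical data catalog."""
    name: str
    source: str  # where the data came from
    transformations: List[str] = field(default_factory=list)  # how it was built
    derived_from: List[str] = field(default_factory=list)     # parent data sets
    notes: List[str] = field(default_factory=list)            # comments from users


class DataCatalog:
    """Illustrative in-memory catalog; not a real product's API."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def add_note(self, name: str, note: str) -> None:
        self._entries[name].notes.append(note)

    def notes(self, name: str) -> List[str]:
        return list(self._entries[name].notes)

    def lineage(self, name: str) -> List[str]:
        """Return every upstream data set, nearest parents first."""
        seen: List[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].derived_from)
        return seen


catalog = DataCatalog()
catalog.register(CatalogEntry("crm_export", source="internal CRM"))
catalog.register(CatalogEntry("web_clicks", source="external clickstream feed"))
catalog.register(CatalogEntry(
    "customer_360",  # a shareable data object built from two raw feeds
    source="data lake",
    transformations=["join crm_export to web_clicks on customer id"],
    derived_from=["crm_export", "web_clicks"],
))
catalog.register(CatalogEntry(
    "churn_dashboard_input",
    source="data lake",
    transformations=["aggregate customer_360 by month"],
    derived_from=["customer_360"],
))
catalog.add_note("customer_360", "Clickstream rows are sparse for older accounts.")

upstream = catalog.lineage("churn_dashboard_input")
print(upstream)
```

In this sketch, someone looking at the dashboard input can trace it back through the shareable customer object to the two raw feeds, and can read the note a colleague left along the way. That is the kind of written-down context that stands in for tribal knowledge.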
Is assembling all of this a tall order? Absolutely. But without any one of these elements, a program of citizen data science will face a key bottleneck. You don’t have to address these challenges all at once, but to make your citizen data scientists worthy of the name, they need to be fully equipped.
That’s where companies like Platfora come in. Unlike data scientists, Platfora is in the business of creating easy-to-use software that allows citizen data scientists to control powerful technology for managing, understanding, and making use of data. Platfora’s solution addresses the concerns just laid out. While Platfora’s technology won’t turn data science into World of Warcraft, it does provide a system that will allow citizen data scientists to be as effective as their knowledge and talents allow.