Analytics Magazine

Data Lakes: The biggest big data challenges

Why data lakes are an important piece of the overall big data strategy.

By Prashant Tyagi and Haluk Demirkan

In today’s complex business world, many organizations have recognized that the data they own, and how they use it, can set them apart from competitors, helping them innovate, compete more effectively and stay in business [1]. That’s why organizations try to collect and process as much data as possible, transform it into meaningful information through data-driven discoveries, and deliver it to users in the right format for smarter decision-making [2]. Big data analytics has become a key element of the business decision process over the last decade. With the right analytics, data can be turned into actionable intelligence that helps businesses maximize revenue, improve operations and mitigate risks.

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. According to Demirkan and Dal [2], big data has the following six “V” characteristics:

  • Volume (data at rest): terabytes, petabytes, exabytes and zettabytes of data.
  • Velocity (data in motion): capturing and processing streaming data in seconds or milliseconds to meet the need or demand.
  • Variety (data in many forms): structured, unstructured, text, multimedia, video, audio, sensor data, meter data, HTML, emails, etc.
  • Veracity (data in doubt): uncertainty due to data inconsistency and incompleteness, ambiguity, latency, deception, model approximations, accuracy, quality, truthfulness or trustworthiness.
  • Variability (data in change): the differing ways in which the data may be interpreted; different questions require different interpretations. Data flows can be highly inconsistent, with periodic peaks.
  • Value (data for co-creation): the relative importance of different data to the decision-making process.
According to Gartner, 60 percent of companies say they don’t have the skills to make the best use of their data.
Photo courtesy of 123rf.com | Bruce Rolff

IDC predicts revenue from the sales of big data and analytics applications, tools and services will increase more than 50 percent, from nearly $122 billion in 2015 to more than $187 billion in 2019 [3]. Even though 73 percent of companies intend to increase spending on analytics and to make data discovery a more significant part of their architecture, 60 percent feel they don’t have the skills to make the best use of their data [4]. Despite an abundance of knowledge and experience, and a track record of successful data- and analytics-enabled decision support systems, big data initiatives come with high expectations, and many of them are doomed to fail.

Research predicts that half of all big data projects will fail to deliver against their expectations [5]. When Gartner asked companies what their biggest big data challenges were, the responses suggested that while all of them plan to move ahead with big data projects, they still don’t have a good idea of what they’re doing and why [6]. The second major concern is the failure to establish data governance and management [7] (see Table 1).

DATA-RELATED CHALLENGES FOR BIG DATA

Slow onboarding and integration of data:
  • Transforming data from a growing variety of technologies
  • Custom-coded ETL
  • ETL processes that are not reusable
  • Waiting for a defined data set; waiting for data onboarding

Poorly recorded data provenance:
  • Data meaning lost in translation
  • Data transformations tracked in spreadsheets
  • High post-onboarding maintenance and analysis cost
  • Recreating lineage is manual and time-consuming

Target data that is difficult to consume:
  • Optimization favors known analytics not suited to new requirements
  • One-size-fits-all canonical views rather than fit-for-purpose views
  • No conceptual model for easily consuming target data
  • Difficulty identifying what data is available, getting access and integrating it

Industrializing a big data environment that is difficult to manage:
  • Data silos lead to inconsistency and synchronization issues
  • Conflicting objectives of opening access while managing security and governance
  • Rapid business change invalidates data organization and analytics optimizations
  • Managing the integration of, and interaction among, multiple data management technologies in the big data environment

Table 1: The unique data-related challenges for big data.

The Data Lake Journey

In the context of governance and management of big data, the term “data lake” has been widely discussed in recent years. Is a data lake a logical data warehouse to manage the six Vs of big data? How do data lakes relate to data warehouses? How do data lakes help evolve the data management architecture and strategy in organizations?

The data lake concept has been well received by enterprises as a way to capture and store raw data of many different types, at scale and at low cost, and to perform data management transformations, processing and analytics based on specific use cases. The first phase of data lake growth took place in consumer web-based companies. The data lake strategy has already shown positive results for these consumer businesses, helping increase the speed and quality of web search and web advertising (clickstream data) and improving customer interaction and behavior analysis (cross-channel analysis). This led to the next phase for data lakes: augmenting enterprise data warehousing strategies.

A traditional data warehouse supports batch workloads and simultaneous use by thousands of concurrent users performing basic reporting and advanced analytics with a pre-defined data model. However, a lot of cleaning and other work is required before the data is properly captured and ready for modeling. A data lake, on the other hand, is meant to provide faster ingestion of raw data and to execute batch processing at scale on the data.

A warehouse’s fixed schema means that knowledge of the data’s structure must be captured in the code of each program accessing the data. Given a data lake’s capability to ease the ingestion and transformation processes, it became a natural choice for migrating extract, transform and load (ETL) workloads off traditional warehouses in order to provide scale-out ETL for enterprises and big data. This makes a data lake suitable for data ingestion, transformation, federation, batch processing and data discovery.
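To make the schema-on-write versus schema-on-read contrast concrete, below is a minimal PySpark sketch of lake-style ingestion followed by a fit-for-purpose transformation. The bucket paths, the "event_type" and "ts" fields, and the sensor-reading filter are illustrative assumptions, not details of any system described in this article.

# A minimal schema-on-read sketch in PySpark; all names and paths
# here are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: raw JSON is ingested as-is and Spark infers the
# structure at read time; no upfront warehouse schema is required.
raw = spark.read.json("s3a://example-lake/raw/events/")

# A fit-for-purpose view is derived only when a use case needs it,
# instead of forcing one canonical schema at load time.
view = (raw
        .filter(F.col("event_type") == "sensor_reading")
        .withColumn("event_date", F.to_date("ts")))

# Persist the refined view in a columnar format for downstream analytics.
view.write.mode("overwrite").parquet("s3a://example-lake/refined/sensor_readings/")

The point of the sketch is that the transformation logic lives with the use case, not with the load process, which is what makes the lake suited to offloading ETL from the warehouse.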

The implementation characteristics of a data lake, namely inexpensive storage and schema flexibility, make it ideal for insight discovery. However, these traits do not necessarily translate to a high-performance, production-quality analytical platform. Making new insights available to the broadest possible audience requires data optimization, greater maturity of analytical models and semantic consistency. As new insights are discovered, the work passes from the data science team to the data engineering team. Data engineers take the new questions and optimize for new answers. They refine and optimize the raw data, as well as the analytical models. Existing data integration processes can be used, or new processes can be built.

Data Lake Implementation Examples

GE Predix is an industrial data lake platform that provides rigorous data governance capabilities to build, deploy and manage industrial applications that connect to industrial assets, collect and analyze data, and deliver real-time insights for optimizing industrial infrastructure and operations. Following are sample use cases from two major industries:

Aviation: Two factors are key in determining profitability for airline businesses: 1) accuracy in predicting fuel price fluctuations, and 2) predictive maintenance that can improve “time on wings” (the time an aircraft is actually in flight). The aviation group at GE analyzes data from several thousand engines per day to identify parts that are performing sub-optimally. This identification needs to happen almost instantly, making it a use case for real-time analytics.

Sensors deployed on these engines capture physical parameters such as time, temperature and pressure. Correlating this data and identifying insights from those correlations is accomplished by creating and using analytical models. One example is analytics on large historical data sets using physics-based models such as the Numerical Propulsion System Simulation (NPSS) to check engine deterioration, along with other engine utilization algorithms. The use of a data lake enabled 10 times faster analytics, meaning these algorithms and models could be executed in days instead of months.

The company learned that the hot and harsh environments in places such as the Middle East and China clogged engines, causing them to heat up and lose efficiency, thus driving the need for more maintenance. GE found that if it washed the engines more frequently, they stayed much healthier. This insight can save a customer an average of $7 million in jet fuel annually because the engines run more efficiently [8].

Electrical power: The electrical power sector has seen a significant increase in system complexity over the past several years. Data diversity and volume are on the rise, and so is the cost of traditional data management. This required a shift in data strategy to create a pervasive culture of data-driven insights at scale. For example, more than 20,000 users are on the big data platform for this vertical, which generates or handles upward of several million transactions a day.

Power companies provide analytics and historical data extraction services upon customer request. These data requests are typically for a number of tags across a number of units over time and are executed at the fleet level. To serve this demand, the team needs to ingest all historical data from thousands of gas and steam turbines into a data lake from the current operational systems. This is done by implementing a daily ingestion process that keeps the big data platform and data lake within 24 hours of operational data, which is then used to generate the analytics. The use of a data lake has helped reduce the clock time and touch time needed to extract data from weeks to hours. It also reduced the load on operational systems by freeing them of the analytics burden, thereby cutting the data kept in expensive SAN storage systems from over a decade’s worth to only a few months’. This allows operational systems to run “in-memory” with improved performance.
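A minimal sketch of such a daily ingestion job is shown below, in PySpark. The JDBC source, the "turbine_readings" table and "reading_date" column, the credentials and the lake path are all hypothetical placeholders, not details of GE’s actual pipeline.

# A daily batch ingestion sketch in PySpark, assuming a JDBC-accessible
# operational database and a partitioned Parquet lake layout; every
# name, URL and credential is a hypothetical placeholder.
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-lake-ingest").getOrCreate()

# Pull only yesterday's operational data so the lake stays within
# 24 hours of the source systems.
yesterday = (date.today() - timedelta(days=1)).isoformat()

daily = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://ops-db.example.com/scada")
         .option("dbtable",
                 f"(SELECT * FROM turbine_readings "
                 f"WHERE reading_date = DATE '{yesterday}') AS d")
         .option("user", "lake_ingest")
         .option("password", "example-secret")
         .load())

# Append the new day as its own partition; earlier partitions stay
# immutable, so a rerun only ever touches the latest day.
(daily.write
      .mode("append")
      .partitionBy("reading_date")
      .parquet("hdfs:///lake/power/turbine_readings"))

Running a job like this on a daily schedule is what lets the operational systems hold only a few months of data while the lake serves the full history for analytics.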

DATA WAREHOUSES VS. DATA LAKES

Attribute | Data warehouses (control) | Data lakes (flexibility)
Data | Structured, processed | Structured, semi-structured and unstructured; raw
Processing | Schema-on-write | Schema-on-read
Schema | Well-defined upfront | None upfront (schema applied on read)
Storage | Expensive for large data volumes | Designed for low-cost storage
Agility | Less agile; fixed configuration | Highly agile; configure and reconfigure as needed
Usage | Well-defined | Future, experimental
Data quality | Clean, trusted data | Raw data; frictionless ingestion
Control | Mature | Maturing
Security | Mature | Maturing
Governance | Mature | Maturing
Quality | Mature | Maturing
Human resources | Mature | Maturing
Cost | High | Low
Flexibility | Low | High
Reliability | Mature | Maturing
Performance | Mature | Maturing
Reusability | Low | High
Integration | Single model of the truth | Integration is an end-user exercise
Data preparation | Upfront | Late data integration
Skills | Heavy IT reliance (large IT teams: DBAs, data architects, ETL developers, BI developers, DQ developers, data modelers, data stewards) and less technical analysts | Self-service; more technical analysts. IT manages the cluster and ingestion but is not involved when users (data scientists, developers) work with the data

Features of data lakes that add value:
  • Common data model: Data is glued together by business meaning rather than by physical structures dictated by underlying technologies; semantic models relate data by business meaning.
  • Flexible data model: No need to redesign the database to extend or revise the data model.
  • Connecting external data: External data can be sourced with real-time values and delivered at query time.
  • Federation and virtualization: A choice of which data to copy and which to leave in the source of truth; models for all businesses are supported by this data copy.
  • Data definition: Less change management, as expertise is captured in the data definition.

Table 2: Data warehouses vs. data lakes.

Staying on Course to Success

So what should your organization do to stay on course for a successful big data journey? Given the above discussion, the real questions become: Should traditional data warehouses continue to support operationalization and reuse of the data, and provide more upstream business-value capabilities such as in-memory databases, graph databases, NoSQL reports and other analytics capabilities? Or should the lake be expanded to become a data discovery platform, retiring the data warehouses? The goal is to converge these platforms while adding more capabilities for handling all enterprise use cases that exploit IT and transactional data, thereby helping the migration to cloud-based services.

With that in mind, the data lake should not be designed as just a big data platform based on Hadoop; it should be designed using multiple technologies. These technologies help offload the data warehouses by providing warehouse-like capabilities for advanced workloads and data discovery. However, the strategy should not be to replace the data warehouses completely, but only to move the analytics capabilities to the lake. Transactional and operational workloads, such as reporting and closing the books, should stay on traditional warehouses. That reduces the footprint on the more expensive warehouses and reserves them for what they are best suited for.

Data governance is extremely important for the success of big data projects. Based on the value of the data, data lakes can be structured into: 1) governed data (e.g., key business data that is understood with respect to ownership, definition, business rules and quality), 2) lightly governed data (e.g., data that is understood with respect to definition and lineage, but not necessarily controlled with respect to quality or usage), and 3) ungoverned data (e.g., data that is understood only with respect to definition and location; ungoverned data may or may not physically exist in the data lake and may exist only in the data catalog as metadata pointers to external data) [9].
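As a minimal illustration of these three tiers, the following Python sketch tags hypothetical catalog entries with a governance level; the dataset names, fields and policy check are assumptions for illustration, not part of any referenced architecture.

# A sketch of catalog entries tagged with the three governance tiers
# described above; all names here are hypothetical.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GovernanceTier(Enum):
    GOVERNED = "governed"                  # ownership, definition, rules, quality
    LIGHTLY_GOVERNED = "lightly governed"  # definition and lineage only
    UNGOVERNED = "ungoverned"              # definition and location only

@dataclass
class CatalogEntry:
    name: str
    tier: GovernanceTier
    location: str                 # lake path, or external pointer for ungoverned data
    owner: Optional[str] = None
    in_lake: bool = True          # ungoverned data may exist only as a metadata pointer

catalog = [
    CatalogEntry("customer_master", GovernanceTier.GOVERNED,
                 "/lake/governed/customer_master", owner="finance"),
    CatalogEntry("clickstream_raw", GovernanceTier.LIGHTLY_GOVERNED,
                 "/lake/raw/clickstream"),
    CatalogEntry("vendor_weather_feed", GovernanceTier.UNGOVERNED,
                 "https://weather.example.com/feed", in_lake=False),
]

# Example policy hook: only fully governed data is exposed to self-service BI.
bi_ready = [entry.name for entry in catalog if entry.tier is GovernanceTier.GOVERNED]
print(bi_ready)  # ['customer_master']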

Based on the governance of the data, the term “data reservoir” is now being used to describe managed, transformed, filtered, secured, portable and potable data (i.e., data fit for consumption). For every type of data governance, metadata is critically important to the success of big data projects.

Data ingestion:
  • Model-driven
  • Semantic tagging
  • On-demand query
  • Streaming
  • Scheduled batch load
  • Self-service

Data management:
  • Data movement
  • Data provenance
  • Types (in-memory, NoSQL, MapReduce, columnar, graph, semantic, HDFS)
  • Data flow (governed data, lightly governed data, ungoverned data)

Query management:
  • Semantic search
  • Data discovery
  • Analytics directed to the best query engine
  • Capture and share analytics expertise
  • Query data, metadata and provenance

Data lake management:
  • Models (business-unit data optimized to assist analytics)
  • Data assets catalog (ontologies, taxonomies)
  • Workflow (processes, schedules, provenance capture)
  • Access management (AAA; group-, role-, rule- and user-based authorization)
  • Metadata

Table 3: Four components of data lakes.

Organizations need to invest strategically in data lake architectures and implementations. A coordinated effort around a data lake can help bring the overall data strategy for big data and analytics together. Leverage the data lake for rapid ingestion of raw data covering all six Vs, and enable on the lake the technologies that help with data discovery and batch analytics. To complement the capabilities of data lakes, investments also need to be made in platforms that provide real-time and massively parallel processing (MPP) capabilities for data extracted from the lake. It is this entire ecosystem, put in place and executed to work in synergy, that will deliver all the promised benefits of big data. The data lake is a key piece of the overall data strategy, not a one-size-fits-all solution for every data need. Data lakes need four primary components: data ingestion, data management, query management and data lake management (Table 3).


Prashant Tyagi (prashant.tyagi@ge.com) is a director of IoT and analytics platforms at GE Software. Previously, he led the strategy and execution for Cisco’s Smart Services. He has an MBA from the Indian Institute of Management in Bangalore and a master’s degree in computer science from Clemson University.

Haluk Demirkan (haluk@uw.edu) is a professor of Service Innovation and Business Analytics at the Milgard School of Business, University of Washington-Tacoma, and a co-founder and board director of the International Society of Service Innovation Professionals. He has a Ph.D. in information systems and operations management from the University of Florida. He is a longtime member of INFORMS.

REFERENCES

  1. Demirkan, H. and Delen, D., 2013, “Leveraging the Capabilities of Service-Oriented Decision Support Systems: Putting Analytics and Big Data in Cloud,” Decision Support Systems and Electronic Commerce, Vol. 55, No. 1, pp. 412-421.
  2. Demirkan, H. and Dal, B., 2014, “Why Do So Many Analytics Projects Still Fail? Key considerations for deep analytics on big data, learning and insights,” Analytics magazine, pp. 44-52, July-August 2014; http://analytics-magazine.org/the-data-economy-why-do-so-many-analytics-projects-fail/
  3. http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-sales-will-reach-$187-billion-by-2019/d/d-id/1325631
  4. http://data-informed.com/gartner-researchers-predictive-analytics-to-gain-traction-in-business/
  5. http://www.forbes.com/sites/bernardmarr/2015/03/17/where-big-data-projects-fail/#50ee6465264e
  6. http://readwrite.com/2013/09/18/gartner-on-big-data-everyones-doing-it-no-one-knows-why
  7. http://www.ibmbigdatahub.com/blog/10-mistakes-enterprises-make-big-data-projects
  8. Winig, L., 2016, “GE’s Big Bet on Data and Analytics,” MIT Sloan Management Review, Feb. 18; http://sloanreview.mit.edu/case-study/ge-big-bet-on-data-and-analytics/
  9. https://infocus.emc.com/william_schmarzo/data-lake-data-reservoir-data-dumpblah-blah-blah/
