Data Lakes: The biggest big data challenges
Why data lakes are an important piece of the overall big data strategy.
By Prashant Tyagi and Haluk Demirkan
In today’s complex business world, many organizations have noticed that the data they own, and how they use it, can set them apart from others, helping them innovate, compete better and stay in business. That’s why organizations try to collect and process as much data as possible, transform it into meaningful information with data-driven discoveries, and deliver it to the user in the right format for smarter decision-making. Big data analytics has become a key element of the business decision process over the last decade. With the right analytics, data can be turned into actionable intelligence that helps businesses maximize revenue, improve operations and mitigate risks.
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization, among others. According to Demirkan and Dal, big data has the following six “V” characteristics:
- Volume (data at rest): terabytes, petabytes, exabytes and zettabytes of data.
- Velocity (data in motion): capturing and processing streaming data in seconds or milliseconds to meet the need or demand.
- Variety (data in many forms): structured, unstructured, text, multimedia, video, audio, sensor data, meter data, HTML, emails, etc.
- Veracity (data in doubt): uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations, accuracy, quality, truthfulness or trustworthiness.
- Variability (data in change): the differing ways in which the data may be interpreted; different questions require different interpretations. Data flows can be highly inconsistent with periodic peaks.
- Value (data for co-creation): the relative importance of different data to the decision-making process.
IDC predicts revenue from the sales of big data and analytics applications, tools and services will increase more than 50 percent, from nearly $122 billion in 2015 to more than $187 billion in 2019. Even though 73 percent of companies intend to increase spending on analytics and make data discovery a more significant part of their architecture, 60 percent feel they don’t have the skills to make the best use of their data. Given an abundance of knowledge and experience, combined with successful data and analytics-enabled decision support systems, big data initiatives come with high expectations, yet many of them are doomed to fail.
Research predicts that half of all big data projects will fail to deliver against their expectations. When Gartner asked what the biggest big data challenges were, the responses suggested that while the companies all plan to move ahead with big data projects, they still don’t have a good idea of what they’re doing and why. The second major concern is the failure to establish data governance and management (see Table 1).
DATA-RELATED CHALLENGES FOR BIG DATA
|Challenge|Related issues|
|Slow onboarding and integrating data|Transforming data from growing variety of technologies|
||Custom coded ETL|
||ETL processes are not usable|
||Wait for need of a defined data set, wait for data onboarding|
|Poorly recorded data provenance|Data meaning lost in translation|
||Data transformations tracked in spreadsheets|
||Post-onboarding maintenance and analysis cost is high|
||Recreating lineage is manual and time consuming|
|Help consume difficult target data|Optimization favors known analytics not suited for new requirements|
||One-size-fits-all canonical view rather than fit-for-purpose views|
||Lacks a conceptual model to easily consume target data|
||Difficult to identify what data is available, getting access and integration|
|Industrializing the big data environment, which is difficult to manage|Data silos lead to inconsistency and sync issues|
||Conflicting objectives of opening access yet managing security and governance|
||Rapid biz change invalidates data org and analytics optimizations|
||Managing the integration/interaction with multiple DM tech in big data environment|
Table 1: The unique data-related challenges for big data.
The Data Lake Journey
In the context of governance and management of big data, the term “data lake” has been widely discussed in recent years. Is a data lake a logical data warehouse to manage the six Vs of big data? How do data lakes relate to data warehouses? How do data lakes help evolve the data management architecture and strategy in organizations?
The data lake concept has been well received by enterprises as a way to capture and store raw data of many different types at scale and low cost, and then to perform data management transformations, processing and analytics based on specific use cases. The first phase of data lake growth was in consumer web-based companies. The data lake strategy has already shown positive results for these consumer businesses by helping increase the speed and quality of web search and web advertising (click stream data) and by improving customer interaction and behavior analysis (cross-channel analysis). This led to the next phase for data lakes, which was to augment enterprise data warehousing strategies.
A traditional data warehouse supports batch workloads and simultaneous use by thousands of concurrent users performing basic reporting and advanced analytics with a pre-defined data model. However, a lot of cleaning and other work is required before the data is properly captured and ready for modeling. A data lake, on the other hand, is meant to provide faster ingestion of raw data and to execute batch processing at scale on the data.
The lack of a predefined schema means the structure of the data must be captured in the code of each program accessing the data. Given the capability of a data lake to ease the ingestion and transformation processes, it became a natural choice for migrating extract, transform and load (ETL) workloads off traditional warehouses in order to provide scale-out ETL for enterprises and big data. This makes a data lake suitable for data ingestion, transformation, federation, batch processing and data discovery.
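One way to picture a lake that accepts raw data and leaves structure to the reading program is the minimal Python sketch below. The record fields (`engine_id`, `temp_c`, `temp_f`, `pressure_kpa`) and both reader functions are hypothetical and chosen only for illustration; they are not taken from any platform described in this article.

```python
import json

# Hypothetical raw events as they might land in a lake: nothing enforces a
# common shape at ingestion time, so records can differ from one another.
RAW_LINES = [
    '{"engine_id": "E1", "temp_c": 612.5, "ts": "2016-01-01T00:00:00"}',
    '{"engine_id": "E2", "temp_f": 1150.0, "ts": "2016-01-01T00:00:05"}',
    '{"engine_id": "E1", "pressure_kpa": 310.2, "ts": "2016-01-01T00:00:10"}',
]


def read_temperatures(lines):
    """Apply a temperature schema at read time, normalizing to Celsius."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            temp_c = record["temp_c"]
        elif "temp_f" in record:
            # Convert Fahrenheit readings while reading, not while loading.
            temp_c = (record["temp_f"] - 32) * 5 / 9
        else:
            continue  # Record is irrelevant to this reader; skip it.
        rows.append((record["engine_id"], round(temp_c, 1)))
    return rows


def read_pressures(lines):
    """A second reader applies a different schema to the same raw data."""
    records = (json.loads(line) for line in lines)
    return [(r["engine_id"], r["pressure_kpa"]) for r in records if "pressure_kpa" in r]


print(read_temperatures(RAW_LINES))
print(read_pressures(RAW_LINES))
```

The trade-off is visible even at this scale: ingestion stays frictionless, but every reader has to carry its own knowledge of the data's shape, which is the cost a warehouse pays up front instead.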
The implementation characteristics of a data lake, namely inexpensive storage and schema flexibility, make it ideal for insight discovery. However, these traits do not necessarily translate to a high-performance, production-quality analytical platform. Making new insights available to the broadest possible audience requires data optimization, greater maturity of analytical models and semantic consistency. As new insights are discovered, the work passes from the data science team to the data engineering team. Data engineers take the new questions and optimize for new answers. They refine and optimize the raw data, as well as the analytical models. Existing data integration processes can be used, or new processes can be built.
Data Lake Implementation Examples
GE Predix is an industrial data lake platform that provides rigorous data governance capabilities to build, deploy and manage industrial applications that connect to industrial assets, collect and analyze data, and deliver real-time insights for optimizing industrial infrastructure and operations. Following are sample use cases from two major industries:
Aviation: Two factors are key in determining the profitability of airline businesses: 1) the accuracy of fuel price predictions, and 2) predictive maintenance that can improve “time on wings” (the time an aircraft is actually in flight). The aviation group at GE analyzes data from several thousand engines per day to identify sub-optimally performing parts. This identification needs to be done almost instantly, which makes it a use case for real-time analytics.
The data collected by sensors deployed on these engines captures physical parameters such as time, temperature and pressure. Correlating this data, and identifying insights through those correlations, is accomplished by creating and using analytical models. An example is the analysis of large volumes of historical data using physics-based models such as the Numerical Propulsion System Simulation (NPSS) to check engine deterioration, along with other engine utilization algorithms. The use of a data lake made analytics 10 times faster, which meant that these algorithms and models could be executed in days instead of months.
The company learned that the hot and harsh environments in places such as the Middle East and China clogged engines, causing them to heat up and lose efficiency, thus driving the need for more maintenance. GE learned that if it washed the engines more frequently, they stayed much healthier. This information can save a customer an average of $7 million in jet fuel annually because the engines are more efficient.
Electrical power: The electrical power sector has seen a significant increase in system complexity over the past several years. Data diversity and volume are on the rise, and so is the cost of traditional data management. This required a shift in data strategy to create a pervasive culture of data-driven insights at scale. For example, more than 20,000 users are on the big data platform for this vertical, which generates or handles upward of several million transactions a day.
Power companies provide analytics and historical data extraction services upon customer request. These data requests are typically for a number of tags for a number of units over time and are executed at a fleet level. To serve the demand, the team needs to ingest all historical data from thousands of gas and steam turbines into a data lake from current operational systems. This is done by implementing a daily ingestion process that keeps the big data platform and data lake current within 24 hours of operational data, which is used for generating the analytics. The use of a data lake has helped reduce the clock time and touch time needed to extract data from weeks to hours. It also helped reduce the load on operational systems by freeing them of the analytics burden, thereby reducing the data kept in expensive SAN storage systems from over a decade to only a few months. This allows operational systems to run “in-memory” with improved performance.
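The daily ingestion and fleet-level extraction pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only, with in-memory lists standing in for the operational system and the lake; the row fields (`unit`, `tag`, `ts`, `value`) and the function names are assumptions, not details of the platform discussed in this article.

```python
from datetime import datetime, timedelta


def ingest_daily(lake, operational_rows, watermark):
    """Append rows newer than the watermark to the lake; return the new watermark."""
    new_rows = [row for row in operational_rows if row["ts"] > watermark]
    lake.extend(new_rows)
    # If nothing new arrived, the watermark stays where it was.
    return max((row["ts"] for row in new_rows), default=watermark)


def extract(lake, units, tags, start, end):
    """Fleet-level request: the given tags for the given units over [start, end)."""
    return [
        row
        for row in lake
        if row["unit"] in units and row["tag"] in tags and start <= row["ts"] < end
    ]


def is_current(watermark, now, max_lag=timedelta(hours=24)):
    """Check the 'current within 24 hours' freshness target."""
    return now - watermark <= max_lag


# Hypothetical operational data for two turbines.
rows = [
    {"unit": "GT-1", "tag": "exhaust_temp", "ts": datetime(2016, 1, 1, 6), "value": 540.0},
    {"unit": "GT-1", "tag": "vibration", "ts": datetime(2016, 1, 1, 7), "value": 0.2},
    {"unit": "ST-7", "tag": "exhaust_temp", "ts": datetime(2016, 1, 2, 6), "value": 410.0},
]

lake = []
watermark = ingest_daily(lake, rows, datetime(2016, 1, 1, 0))
request = extract(lake, {"GT-1"}, {"exhaust_temp"}, datetime(2016, 1, 1), datetime(2016, 1, 2))
print(len(lake), len(request), is_current(watermark, datetime(2016, 1, 3, 5)))
```

Because requests are answered from the lake, the operational system only has to serve the incremental daily read, which is the mechanism behind the reduced load described above.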
DATA WAREHOUSES VS. DATA LAKES
|Data warehouses (control)|vs.|Data lakes (flexibility)|
|Structured, processed|Data|Structured/semi-structured/unstructured, raw|
|Well-defined schema|Schema|No schema (schema on read)|
|Expensive for large data volumes|Storage|Designed for low-cost storage|
|Less agile, fixed configuration|Agility|Highly agile, configure and reconfigure as needed|
|Well-defined usage|Usage|Future, experimental usage|
|Clean, trusted data||Raw data, frictionless ingestion|
|Single model of the truth||Integration and end user exercise|
|Upfront data preparation||Late data integration|
|Heavy IT reliance (large IT teams: DBAs, data architects, ETL developers, BI developers, DQ developers, data modelers, data stewards) and less technical analysts|Skills|Self-service: more technical analysts (data scientists, developers); IT manages the cluster and ingestion but is not involved when working with data|
|Features in data lakes that add value:|
Table 2: Data warehouses vs. data lakes.
Staying on Course to Success
So what should your organization do to stay on course for a successful big data journey? Given the above discussion, the real questions become: Should traditional data warehouses continue to support operationalization and reuse of the data and provide more upstream business value capabilities such as in-memory databases, graph databases, NoSQL reports and other analytics capabilities? Or should the lake be expanded to become a data discovery platform and retire the data warehouses? The goal is to converge these platforms while adding more capabilities for handling all enterprise use cases that exploit IT and transactional data, thereby helping migration to cloud-based services.
With that in mind, the data lake should not be designed as just a big data platform based on Hadoop; it should be designed using multiple technologies. These technologies will help offload and retire data warehouses by providing enterprise data warehouse-like capabilities to handle advanced workloads and data discovery. However, the strategy should not be to replace the data warehouses completely but only to move the analytics capabilities to the lake. The transactional/operational workloads for reporting, closing books and so on should stay on traditional warehouses. That will help reduce the footprint on the more expensive warehouses and reserve them for the work they are best suited to.
Data governance is extremely important for the success of big data projects. Based on the value of data, data lakes can be structured into three tiers: 1) governed data (e.g., key business data that is understood in terms of ownership, definition, business rules and quality); 2) lightly governed data (e.g., data that is understood in terms of definition and lineage, but not necessarily controlled with respect to quality or usage); and 3) ungoverned data (e.g., data that is understood only in terms of definition and location). Ungoverned data may or may not physically exist in the data lake and may exist only in the data catalog as metadata pointers to external data.
Based on the governance of data, the term “data reservoir” is now being used to describe managed, transformed, filtered, secured, portable and potable (fit for consumption) data. For every tier of data governance, metadata is critically important for the success of big data projects.
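To make the role of metadata in the three governance tiers concrete, the following Python sketch models a data catalog entry and derives its tier from whichever metadata is present. The `CatalogEntry` fields and the `classify` rules are an assumed, simplified reading of the tiers described above, not a standard or a product feature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Governance(Enum):
    GOVERNED = "governed"
    LIGHTLY_GOVERNED = "lightly governed"
    UNGOVERNED = "ungoverned"


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; every field except name is optional metadata."""
    name: str
    definition: Optional[str] = None
    location: Optional[str] = None  # may point to data held outside the lake
    lineage: Optional[str] = None
    owner: Optional[str] = None
    business_rules: Optional[str] = None
    quality_checked: bool = False
    in_lake: bool = True  # False when the catalog holds only a metadata pointer


def classify(entry: CatalogEntry) -> Governance:
    """Assumed rule of thumb: the tier follows from which metadata is known."""
    if all([entry.owner, entry.definition, entry.business_rules, entry.quality_checked]):
        return Governance.GOVERNED
    if entry.definition and entry.lineage:
        return Governance.LIGHTLY_GOVERNED
    return Governance.UNGOVERNED


# An external feed known only by definition and location is ungoverned.
partner_feed = CatalogEntry(
    name="partner_feed",
    definition="Daily partner price list",
    location="https://example.com/feed",
    in_lake=False,
)
print(classify(partner_feed).value)
```

Deriving the tier from metadata rather than assigning it by hand means an entry is promoted as soon as ownership, rules and quality checks are recorded, which keeps the catalog and the governance status from drifting apart.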
|Scheduled batch load|
|Types (In memory, NoSQL, MR, columnar, graph, Semantic, HDFS)|
|Data flow (governed data, lightly governed data, ungoverned data)|
|Analytics directed to best query engine|
|Capture and share analytics expertise|
|Query data, metadata and provenance|
|Data Lake Management|
|Models (Biz unit data optimized to assist analytics)|
|Data assets catalog (ontologies, taxonomies)|
|Workflow (processes, schedules, provenance capture)|
|Access management (AAA, group/role/rule/user-based authorization)|
Table 3: Four components of data lakes.
Organizations need to invest strategically in data lake architectures and implementations. A coordinated effort around a data lake can help bring the data strategy around big data and analytics together. Leverage the data lake for rapid ingestion of raw data that covers all six Vs, and enable the technologies on the lake that help with data discovery and batch analytics. To complement the capabilities of data lakes, investment is also needed for data extracted from the lake, as well as in platforms that provide real-time and massively parallel processing (MPP) capabilities. It is this entire ecosystem, put in place and executed to work in synergy, that will deliver the promised benefits of big data. The data lake is a key piece of the overall data strategy, not a one-size-fits-all solution for every data need. Data lakes need four primary components: data ingestion, data management, query management and data lake management (Table 3).
Prashant Tyagi is a director of IoT and analytics platforms at GE Software. Previously, he led the strategy and execution for Cisco’s Smart Services. He has an MBA from the Indian Institute of Management in Bangalore and a master’s degree in computer science from Clemson University.
Haluk Demirkan is a professor of Service Innovation and Business Analytics at the Milgard School of Business, University of Washington-Tacoma, and a co-founder and board director of the International Society of Service Innovation Professionals. He has a Ph.D. in information systems and operations management from the University of Florida. He is a longtime member of INFORMS.
- Demirkan, H. and Delen, D., 2013, “Leveraging the Capabilities of Service-Oriented Decision Support Systems: Putting Analytics and Big Data in Cloud,” Decision Support Systems and Electronic Commerce, Vol. 55, No. 1, pp. 412-421.
- Demirkan, H. and Dal, B., 2014, “Why Do So Many Analytics Projects Still Fail? Key considerations for deep analytics on big data, learning and insights,” Analytics magazine, pp. 44-52, July-August 2014; http://analytics-magazine.org/the-data-economy-why-do-so-many-analytics-projects-fail/
- Winig, L., 2016, “GE’s Big Bet on Data and Analytics,” MIT Sloan Management Review, Feb. 18; http://sloanreview.mit.edu/case-study/ge-big-bet-on-data-and-analytics/