Many organizations have taken the plunge with data lakes, but many of these efforts are still just treading water. There's good news, though: Apache Spark can help businesses realize the promise of big data lakes.
McKinsey & Co. notes that data lakes are appealing because “data are loaded in ‘raw’ formats rather than reconfigured as they enter company systems.” That means they can be used for more than just basic capture. However, the firm adds, integrating data lakes with other elements of the technology architecture, setting use rules, and finding appropriate talent can be challenges.
Meanwhile, author and consultant Dan Woods, in a January article for Forbes, has this to say about the subject: “With the data lake, while companies can store massive amounts and varieties of data, they have been unable to effectively manage that data and allow a large number of people with moderate expertise levels to explore the data, come up with useful queries, extract the signal through some regular production process that becomes part of the way a business runs.
“Some companies, like Netflix, have managed to operationalize a data lake …. But in most other businesses, the data lake got stuck at the proof of concept stage. That is why in general, the data lake is now in need of salvation. The point of saving the data lake is to understand how we go from having a repository of data with signals to operationalizing that information to provide value to the business.”
One of the challenges is that many companies with data lakes are using expensive, proprietary solutions for the data ingestion, integration, and transformation work around them. But now many of these same organizations are putting a toe in the water with Apache Spark. And that’s a good thing, because Apache Spark is a powerful distributed computing framework that handles end-to-end analytics, data processing, and machine learning requirements.
In a webinar next week, Anand Venugopal, AVP and business head at StreamAnalytix, will offer more details on why Apache Spark is the answer.
Venugopal will present information about cloud-based IoT use cases involving event-time processing, late-arriving data, and watermarks. He’ll talk about Python-based predictive analytics running on Spark in cloud environments. And he’ll offer information on visual interactive development of Apache Spark Structured Streaming pipelines.
He’ll also talk about using Apache Spark related to on-premises data lakes. That conversation will explore on-premises advanced monitoring of Spark pipelines. He’ll also discuss data quality and ETL with Apache Spark using pre-built operators in on-premises environments.
Last year around this time Ian Pointer for InfoWorld wrote “From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, Facebook, IBM, and Microsoft.”