Dremio Continues to Reduce the Zone of Confusion Between Data Lakes and Data Warehouses with New Dart Initiative Release
Dremio, the SQL Lakehouse Platform company, today achieved another milestone in closing the gap between cloud data lakes and cloud data warehouses. Today's release marks the second delivery in the company's Dart Initiative, which enables customers to run all mission-critical SQL workloads directly on the cloud data lake.
This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20211021005387/en/
Figure 1: By providing scale-out metadata collection and storage, this Dart Initiative release allows Dremio to significantly reduce the time needed to collect and store metadata. As you can see here, the performance gains realized are higher the larger the dataset. This allows Dremio to greatly improve data freshness, while continuing to provide interactivity on the data lake. (Graphic: Business Wire)
Dremio embarked on the Dart Initiative in June 2021 to help companies run a greater range of mission-critical BI workloads directly on the data lake, delivering over 2x faster performance and drastically improved resource efficiency over previous Dremio versions. This subsequent Dart Initiative release introduces several more enhancements, including over 5x faster SQL expression processing over previous versions.
According to the 2020 Gartner® Market Guide for Analytics Query Accelerators report, "Analytics query accelerators seek to shrink the performance impact of the zone of confusion. Put another way, they are trying to move the 'line of good enough' to the point where the data lake can provide sufficient optimization on the data to make it suitable for an increasing percentage of workloads."1 With the Dart Initiative, Dremio seeks to leapfrog a "good enough" notion of data lakes and make them the clear and obvious choice for BI and analytics workloads in the enterprise.
"It's clear that the data lake can already support BI workloads of the most mission-critical nature. Three of the Fortune Five companies that already run Dremio in production today are doing just that," said Tomer Shiran, founder and Chief Product Officer at Dremio. "We want to push the boundaries of what's possible in the data lakehouse and deliver the best BI experience for our customers. To that end, the Dart Initiative has been chipping away at the Zone of Confusion between data lakes and warehouses in critical areas such as query performance and acceleration, SQL coverage, and transactionality."
Here are some of the key innovations of the Dremio Dart Initiative Fall 2021 release.
Scale-out Metadata Collection and Storage
Achieving near-instantaneous query startup times has been out of reach for traditional query engines, which must perform a significant amount of work to parse, plan, and gather dataset metadata for each query before it can be executed. In contrast, Dremio enables interactive performance directly on data lake storage by drastically reducing the amount of computation required at runtime. Dremio's ability to efficiently compute, store, and leverage metadata plays a major role in enabling this.
This Dart Initiative release delivers near real-time metadata refresh for datasets, ensuring users are leveraging the most current or near real-time version of data, and receiving timely visibility into recent schema and data changes. Dremio has achieved data freshness through carefully refactoring metadata processing to become a parallel, executor-based process, with metadata now stored and managed in Apache Iceberg tables.
Parallelizing metadata processing across executors and leveraging capabilities and best practices from Iceberg makes all metadata operations much faster and more scalable, which in turn benefits users in several ways. This enhanced metadata management approach enables Dremio to deliver metadata refresh times up to 20x faster than previous versions of Dremio, while governing refreshes with the same workload management capabilities as queries, such as engine routing, priority, and concurrency controls. As demonstrated in Figure 1, the performance gain grows as dataset size increases. Fresher data leads to more accurate insights and business decisions for enterprises across a variety of use cases, including customer experience and loyalty, marketing campaign optimization, operational efficiency, and customer 360.
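To make the idea of parallel, worker-based metadata collection concrete, here is a minimal Python sketch. It is an illustration only, not Dremio's implementation: the in-memory FAKE_LAKE dictionary, the FileStats record, and the collect_stats and refresh_metadata functions are all hypothetical names, and a thread pool stands in for a cluster of executors. Each worker gathers statistics for one file, and the results are merged into a dataset-level summary.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file metadata record (hypothetical shape, for illustration)."""
    path: str
    row_count: int
    min_value: int
    max_value: int


# Hypothetical stand-in for data lake storage: file path -> values of one column.
FAKE_LAKE = {
    "part-0.parquet": [3, 9, 4],
    "part-1.parquet": [12, 7],
    "part-2.parquet": [1, 20, 5, 8],
}


def collect_stats(path: str) -> FileStats:
    """Collect metadata for one file; a real system would read file footers."""
    values = FAKE_LAKE[path]
    return FileStats(path, len(values), min(values), max(values))


def refresh_metadata(paths, workers=4):
    """Fan per-file collection out to workers, then merge into one summary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        per_file = list(pool.map(collect_stats, paths))
    summary = {
        "files": len(per_file),
        "rows": sum(s.row_count for s in per_file),
        "min": min(s.min_value for s in per_file),
        "max": max(s.max_value for s in per_file),
    }
    return per_file, summary


per_file, summary = refresh_metadata(sorted(FAKE_LAKE))
print(summary)
```

Because each file is processed independently, adding workers shortens the refresh as the number of files grows, which is consistent with the larger gains on larger datasets described above.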
Hardware-Optimized Query Processing
Dremio is an in-memory engine powered by Apache Arrow2, an open source columnar standard for in-memory computing that was co-created by Dremio. Gandiva, a component of Arrow, is an LLVM-based toolkit that enables vectorized execution directly on in-memory Arrow buffers, by generating code to evaluate SQL expressions that fully leverage the pipelining and SIMD capabilities of modern CPUs. This Dart Initiative release enables Dremio to dramatically accelerate expression processing rates by over 5x, ultimately providing a significant performance increase for end users.
Expanded SQL Coverage and Data Lakehouse Support
The Summer 2021 Dremio Dart Initiative empowered companies to run an even broader set of enterprise SQL workloads on Dremio by vastly expanding SQL coverage to include additional functions, operators, and SQL grammar constructs. The Fall 2021 Dart Initiative release extends the SQL coverage introduced through the prior Dart release, with functions such as Pivot/Unpivot and filtered aggregates. Risk analysis in insurance, maximizing revenue in travel and transportation, improving clinical trials in pharma, and enabling credit risk assessment in banking are among the many use cases that benefit from the expanded SQL coverage via this Dart release.
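As an illustration of what a filtered aggregate looks like, the following Python sketch runs the standard SQL FILTER (WHERE ...) clause against an in-memory SQLite database. This is an assumption-laden example: it uses SQLite (version 3.30 or later) rather than Dremio, the claims table and its columns are hypothetical, and Dremio's exact syntax may differ. Pivot/Unpivot is not shown because SQLite does not provide it.

```python
import sqlite3

# In-memory SQLite database used purely for illustration (not Dremio).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (region TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("east", "approved", 100.0),
        ("east", "denied", 40.0),
        ("west", "approved", 250.0),
        ("west", "approved", 50.0),
    ],
)

# A filtered aggregate applies its own WHERE condition to a single aggregate,
# so several conditional totals can be computed in one pass over the table.
rows = conn.execute(
    """
    SELECT region,
           SUM(amount) AS total,
           SUM(amount) FILTER (WHERE status = 'approved') AS approved_total,
           COUNT(*) FILTER (WHERE status = 'denied') AS denied_count
    FROM claims
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in rows:
    print(row)
```

Without filtered aggregates, the same result typically requires CASE expressions inside each aggregate or several separate queries joined together.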
Aside from broadening the scope of SQL workloads, this Dart release also expands Dremio's support for open-source table formats. Table formats, such as Apache Iceberg and Delta Lake, enable companies to perform inserts, updates, and deletes with transactional consistency, and time travel, directly on data lake storage. Table formats have surged in popularity as these features were previously only supported by data warehouses. With this release, companies can now run interactive BI workloads on both of the leading lakehouse table formats, Apache Iceberg and Delta Lake.
"At Telenav, we embrace the vision of 'Data is the New Oil'. Data Engineers and Data Scientists are refiners of the digital age. We are executing this vision using a Smart Data Lake and creating semantically enriched data products with flexible visualizations for connected cars and other initiatives. Dremio is the foundation to manifest this vision at Telenav, enabling us to drive value directly from our cloud data lake AWS S3, instead of using data copies, ETL, and manual processes, and thereby lowering costs and reducing complexity. We are able to leverage performance benefits of the Dremio Dart Initiative inherently to close the gap between data lakes and warehouses for BI workloads."
- Kumar Maddali, VP of Product Development, Telenav
Dremio is the SQL Lakehouse Platform company, enabling companies to leverage open data architectures. Dremio's SQL Lakehouse Platform simplifies data engineering and eliminates the need to copy and move data to proprietary data warehouses or create cubes, aggregation tables and BI extracts, providing flexibility and control for data architects and data engineers, and self-service for data consumers. Founded in 2015, Dremio is headquartered in Santa Clara, CA. Investors include Cisco Investments, Insight Partners, Lightspeed Venture Partners, Norwest Venture Partners, Redpoint Ventures, and Sapphire Ventures. For more information, visit www.dremio.com. Connect with Dremio on GitHub, LinkedIn, Twitter, and Facebook.
1 Gartner, "Market Guide for Analytics Query Accelerators", Adam Ronthal, Merv Adrian, Henry Cook, 9 December 2020. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission.
2 Originally Dremio's internal memory format, Apache Arrow is now one of the most popular open-source projects with over 20M monthly downloads.