Different arrow and widget designs represent different types of function and storage. The different types of data stores (relational, flat file, graph, key-value, and so on) have different symbols, and it is clear that a data lake contains many different storage types, while the data warehouse is fully relational. The heavy black arrows show bulk data transfer between data stores, whereas the light blue dashed arrows show access to data stores from applications. The black arrows into and out of the enterprise data warehouse (EDW) denote a strong transformation function (schema on write) and are arranged in funnel patterns to imply consolidation of data in the EDW. These functions are largely IT controlled. The solid blue feeds to the data lake denote minimal transformation of the incoming data and are operated independently, often by data scientists or business departments. Data in a lake is therefore often inconsistent (the so-called data swamp), and it is up to the users to manage its meaning (schema on read) and, a more difficult proposition, its quality.

The data lakehouse concept was introduced early in 2020 by Databricks, a company founded in 2013 by the original creators of Apache Spark™, Delta Lake, and MLflow. According to Databricks, the data lakehouse is a platform that “combines the best elements of data lakes and data warehouses, delivering the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.” It builds on a data lake foundation because, according to the authors of a recent lakehouse Q&A, data lakes often contain more than 90% of the data in the enterprise. The lakehouse therefore attempts to eliminate, or at least greatly reduce, the common practice of copying subsets of that data to a separate data warehouse environment to support the traditional BI that data lake tools struggle to deliver.

According to the original paper defining the architecture, a lakehouse has the following features:

- ACID transactions for concurrent data reads and writes, supporting continuous update.
- Enforcement, evolution, and governance for schemata such as star and snowflake.
- Support for using BI tools directly on the source data, reducing staleness and latency and eliminating the copies of data found in a combined data lake and warehouse solution.
- End-to-end streaming to serve real-time data applications.
- Open and standardized storage formats, such as Parquet, with a standard API supporting a variety of tools and engines, including machine learning.
- Storage decoupled from compute, for scaling to more concurrent users and larger data sizes.
- Support for diverse workloads, including data science, AI, SQL, and analytics.
- Support for diverse data types, ranging from structured to “unstructured” data.

The first four points are directly related to the goal of supporting modern data warehousing from a data lake base, while the remainder relate more to “big data” applications.
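The first four points are easiest to see in code. The sketch below uses Delta Lake on open-source Spark as one possible implementation (the paper does not mandate it); the table path, schema, and sample rows are hypothetical, and the pyspark and delta-spark packages are assumed to be installed.

```python
# A minimal sketch, assuming delta-spark; path and sample rows are made up.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Register Delta's SQL extensions and catalog with Spark.
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/sales_delta"  # in practice an object-store URI, e.g. s3://...

# ACID write: data lands as Parquet files plus a transaction log.
spark.createDataFrame([(1, "widget", 100)], ["id", "product", "amount"]) \
     .write.format("delta").save(path)

# Schema enforcement: an append with a mismatched schema is rejected.
try:
    spark.createDataFrame([(2, "gadget")], ["id", "product"]) \
         .write.format("delta").mode("append").save(path)
except Exception as err:
    print("append rejected:", type(err).__name__)

# Concurrent update: MERGE (upsert) executes as a single transaction.
updates = spark.createDataFrame([(1, "widget", 150)],
                                ["id", "product", "amount"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Earlier versions stay queryable ("time travel") for audit or rollback.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```

The point of the sketch is that the warehouse-style guarantees (transactions, schema enforcement, versioning) come from a storage layer sitting on ordinary Parquet files in the lake, not from a separate database engine.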
As is often the case with vendor-defined architectures, no architectural pattern is offered for the lakehouse. Rather, a product or set of products is suggested as implementing the concept. In this case, the complementary/competing open-source products/projects mentioned are Delta Lake (an open-source storage layer supporting ACID transactions), Apache Hudi (a platform for streaming data with incremental data pipelines), and Apache Iceberg (an open table format for very large analytic datasets). Apache Spark provides an underlying platform. Although building on top of the data lake, the features described and the products mentioned focus heavily on the ingestion, management, and use of highly structured data, as is the case with a data warehouse.
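For comparison, the same engine-plus-open-table-format pattern with Apache Iceberg might look like the hedged sketch below. The catalog name ("local"), warehouse path, and table are invented for illustration, and an iceberg-spark-runtime package matching the Spark version is assumed to be on the classpath.

```python
# A hedged Iceberg sketch; catalog name, path, and table are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.catalog.local",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")  # file-based catalog
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_wh")
    .getOrCreate()
)

# The table format, not the engine, owns schema evolution and snapshots.
spark.sql("""CREATE TABLE IF NOT EXISTS local.db.events
             (id BIGINT, payload STRING) USING iceberg""")
spark.sql("INSERT INTO local.db.events VALUES (1, 'created')")
spark.sql("ALTER TABLE local.db.events ADD COLUMN source STRING")
spark.sql("SELECT * FROM local.db.events").show()
```

Either way, the compute engine and the storage format are independently replaceable, which is exactly the decoupling the feature list calls for.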