A Hybrid of Data Warehouse and Data Lake

There’s a great deal of controversy in the industry these days around data lakes versus data warehouses. For many years, a data warehouse was the only game in town for enterprises to process their data and get insight from it. But over time, the options in the market lagged behind the needs of buyers, and frustration ensued.  

The requirements of the internet changed the game: the sheer scale of data generated by the early pioneers of the web could not be contained within the confines of the data warehouses of the day. This gave rise to Hadoop and then Spark, so-called “Big Data” solutions that use a data lake architecture to solve the scale problem.

If you’re not familiar with the difference between a data lake and a data warehouse, you’re not alone. There’s lots of confusion and there are permutations that make one resemble the other, with the ideal solution probably being a hybrid of the two. 

A data warehouse architecture is built around a database: a managed repository of data that’s stored and processed within the system, whether contained in a single box or distributed across many. Data needs to be ingested into a data warehouse, where it’s stored in a form optimized for processing by the system.
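
To make the ingest step concrete, here’s a minimal sketch of a typical load in Snowflake’s SQL dialect; the table, stage, and bucket names are hypothetical:

```sql
-- A table to hold the ingested data (names are illustrative).
CREATE TABLE raw_orders (
  order_id   NUMBER,
  customer   STRING,
  amount     NUMBER(10,2),
  ordered_at TIMESTAMP
);

-- A stage pointing at files sitting in cloud object storage.
CREATE STAGE orders_stage
  URL = 's3://example-bucket/orders/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Ingest: copy the files into the warehouse's managed storage,
-- where the data is stored in a form optimized for query processing.
COPY INTO raw_orders FROM @orders_stage;
```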

A data lake architecture eliminates the ingest requirement and uses a query engine that scans data stored outside of the system, typically in cloud object storage. Object storage is by nature essentially infinite in scale, so it’s well suited to massive-scale data sets. The lack of an ingest requirement makes things easier for admins, who don’t have to inventory and understand their data and where it lives. They can just leave it where it’s dumped, scan some of it, and ignore most of it.

Many proponents of data lakes call data warehouses a dated or even failed architecture pattern, and say that with some enhancements, a data lake architecture can replace a data warehouse. While the limitations of many data warehouses are real and problematic, canceling an entire category because of the limitations of some examples in the space is shortsighted. One approach to the problem is to adapt lake technology to serve the functions of a data warehouse, an architecture often called a “lakehouse.” Another approach, and the one Snowflake has used since its first cloud service launched almost 10 years ago, is to build a better data warehouse while incorporating much of the functionality common to data lakes.

So, what specifically has Snowflake done to address the problems common to legacy data warehouses? Among other things, Snowflake offers:

  • Near-infinite scale: While most data warehouses are limited by a maximum amount of storage or compute resources, Snowflake solves this problem with a near-unlimited number of T-shirt-sized compute clusters that can all see the same, essentially infinite-scale data repository.
  • Separation of storage and compute: Most data warehouse systems have fixed scaling where you have to add both compute and storage together, wasting resources when you only need one or the other; Snowflake scales each separately so that users don’t have to pay for resources they don’t need.
  • Workload isolation: Snowflake gives each major category of work its own cluster, so each job can scale up and down as needed, with ample resources on hand and no impact on other work running against the same repository, since each workload gets fully separate compute. Most data warehouses either don’t address this issue or rely on unwieldy resource-prioritization tooling that is generally complicated and ineffective.
  • Pay-for-use model: Unlike on-premises data warehouse systems and cloud-based data warehouses that rely on pre-cloud architectures, Snowflake provisions capacity near-instantly, so users pay for only as much capacity as they need at any given moment rather than over-provisioning to accommodate unexpected demand. The sketch after this list shows these scaling and billing knobs in action.
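
To make the scaling model concrete, here’s a hedged sketch of how T-shirt sizing, workload isolation, and pay-for-use fit together in Snowflake SQL; the warehouse names and sizes are invented for illustration:

```sql
-- Separate compute clusters ("virtual warehouses") per workload,
-- each sized independently; all of them see the same shared data.
CREATE WAREHOUSE bi_dashboards_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60     -- suspend after 60 idle seconds, so you pay only while running
  AUTO_RESUME    = TRUE;  -- wake automatically when a query arrives

CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'XLARGE'  -- heavier transforms get a bigger cluster
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Resize on demand as load changes; no data movement is required.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XXLARGE';
```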

And, Snowflake provides many of the features common to data lake systems, including:

  • Unlimited data scale: See above.
  • Mixed data types: While most data warehouses support structured data only, the Snowflake Data Cloud can process structured, semi-structured (JSON, XML, etc.), and unstructured data, allowing Snowflake to cover many of the web-scale and big-data workloads that commonly rely on data lake architectures.
  • Choice of languages: Snowflake uses SQL, like most data warehouses, but Snowpark allows coders to interact with data in the repository in the language of their choice, including Python (in public preview), Scala, and Java.
  • External data access: The Snowflake Data Cloud supports data ingested into its managed internal repository, as well as external scans of data stored in customer-managed cloud object storage via External Tables (see the sketch after this list).
  • Open table format: Snowflake recently announced support for Iceberg, an Apache Software Foundation open table format that’s widely used across the industry by vendors and users alike. For use cases requiring open standards, Iceberg Tables (in private preview) give customers more options for how their data is stored while retaining Snowflake’s performance, security, governance, ease of use, and collaboration.
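
To ground the mixed-data-type and external-access points above, here’s a minimal sketch in Snowflake SQL; the table, stage, and JSON field names are hypothetical:

```sql
-- Semi-structured data: land raw JSON in a VARIANT column...
CREATE TABLE events (payload VARIANT);

-- ...and query into it with path syntax, casting values as needed.
SELECT payload:user.name::STRING     AS user_name,
       payload:items[0].sku::STRING  AS first_sku
FROM events;

-- External data access: query files left in customer-managed object
-- storage without ingesting them into Snowflake.
CREATE EXTERNAL TABLE ext_events
  WITH LOCATION = @my_stage/events/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = TRUE;
```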

Beyond these capabilities, the Snowflake Data Cloud allows customers to do things that data lake-based architectures can’t do well, at least so far:

  • Strong governance of data: A data lake-based architecture can build role-based access controls into the query platform, but because it doesn’t own the data, it can’t fully restrict access to it: users may still be able to see or modify data directly in the object storage console even if the query platform blocks them.
  • Advanced data collaboration: The core architecture of the Snowflake Data Cloud, and the fully automated replication it supports, allows users to share live data across clouds and regions and to fully revoke access to previously shared data, something lake-based solutions struggle with (see the sketch after this list).
  • Consistent performance across many types of workloads: The Snowflake Data Cloud provides consistently high performance for dashboards, data transformation, data science, and nearly everything else enterprises want to do with their data. Competitors may find a few cases where their platform slightly outperforms Snowflake, but Snowflake delivers effective performance across a broad range of demanding use cases: high data volume, high concurrency, complex data relationships, and the other real-world scenarios our customer base faces day in and day out.
  • High levels of automation: While data lake solutions have come a long way toward handling data warehouse-type workloads, that progress generally comes at a high cost in complexity. Snowflake eliminates the manual effort needed for the care and feeding of the platform and lets customers focus on their data instead.
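
As a rough sketch of the governance and collaboration points above, this is approximately what granting, sharing, and revoking access to live data looks like in Snowflake SQL; the role, share, database, and account names are invented:

```sql
-- Governance: access is controlled inside the platform that owns the data.
CREATE ROLE analyst;
GRANT SELECT ON TABLE sales_db.public.orders TO ROLE analyst;

-- Collaboration: share live data with another account without copying it.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;  -- hypothetical consumer account

-- Later: fully revoke the consumer's access to the previously shared data.
ALTER SHARE sales_share REMOVE ACCOUNTS = partner_account;
```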

While Snowflake is well known in the data warehouse arena, the architecture and capabilities we have given customers since the beginning are a hybrid of warehouse and lake. Whether you’re solving for massive data scale, mixed data types, language preferences, access to external data, extensibility, or geographic and cloud diversity, Snowflake can help, regardless of what technical term you use to describe it. We call it the Snowflake Data Cloud, and customers agree. Matthew Jones, Data Science Manager at Kount, says, “Snowflake has been immensely impactful for us data scientists. We can consolidate data into one massive data lake as the single source of truth, yet it’s all easily queryable.”
