Why Apache Iceberg will rule data in the cloud

The cloud has allowed data teams to collect vast quantities of data and store it at reasonable cost, opening the door to new analytics use cases that leverage data lakes, data mesh, and other modern architectures. But for very large volumes of data, generic cloud storage also presents challenges and limitations in how that data can be accessed, managed, and used.

Typical blob storage systems in the cloud lack the information required to show relationships between files or how they correspond to a table, making the job of query engines that much harder. Additionally, files by themselves do not make it easy to change schemas of a table, or to “time travel” over it. Each query engine must have its own view of how to query the files. All of a sudden, what seemed like an easy-to-implement data architecture becomes more difficult than expected.

This is where applying table formats to data becomes extremely useful. Table formats explicitly define a table, its metadata, and the files that compose the table. Instead of applying a schema when the data is read, clients already know the schema before the query is run. Moreover, the table metadata can be saved in a way that offers more fine-grained partitioning. Therefore, applying a table format to the data can offer a number of advantages, such as:

  • Faster performance due to better filtering or partitioning
  • Easier evolution of the schema
  • Ability to “time travel” across the table to view data at a given point in time
  • Table ACID compliance
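
To make these advantages concrete, here is a minimal sketch in Python of what a table format adds on top of plain files. It is a toy model, not Iceberg's actual metadata layout or API: the names DataFile, Snapshot, Table, commit, and plan_scan are hypothetical. The table records its schema up front, every commit publishes a new immutable snapshot listing the files that make up that version, and per-file value ranges let a reader skip files that cannot match a filter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    min_day: str  # smallest value of the "day" column in this file
    max_day: str  # largest value of the "day" column in this file


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # the complete set of DataFile objects visible in this version


@dataclass
class Table:
    schema: dict  # column name -> type, known before any file is read
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Publish a new snapshot; older snapshots are left untouched."""
        current = self.snapshots[-1].files if self.snapshots else ()
        files = tuple(f for f in current if f.path not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def plan_scan(self, day, snapshot_id=None):
        """Return paths of files that may contain rows for `day`."""
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        # ISO-formatted dates compare correctly as plain strings.
        return [f.path for f in snap.files if f.min_day <= day <= f.max_day]


table = Table(schema={"day": "date", "amount": "long"})
v1 = table.commit(added=[DataFile("a.parquet", "2022-08-01", "2022-08-15")])
v2 = table.commit(added=[DataFile("b.parquet", "2022-08-16", "2022-08-31")])
v3 = table.commit(removed={"a.parquet"})

print(table.plan_scan("2022-08-20"))                  # only b.parquet can match
print(table.plan_scan("2022-08-10", snapshot_id=v1))  # time travel to version 1
print(table.plan_scan("2022-08-10"))                  # a.parquet is gone in v3
```

Because each commit swaps in a complete new snapshot rather than editing the old one, a reader sees either the previous version or the next one, never a mix. That is the intuition behind the ACID and time travel items in the list above.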

Why Apache Iceberg?

Choosing which table format to use is an important decision because it can enable or limit the features available. Over the past two years, we have seen significant support emerging for Apache Iceberg, a table format originally developed by Netflix that was open-sourced as an Apache incubator project in 2018 and graduated from the incubator program in 2020.

Iceberg was built from the ground up to address some of the challenges in Apache Hive when working with very large data sets, including issues around scale, usability, and performance. As a Netflix engineer noted at the time, table formats for very large-scale data sets should work as reliably and predictably as SQL, “without any unpleasant surprises.” 

With several options to choose from, we believe Iceberg is superior to the other open table formats. Here are five reasons why.

Iceberg makes a clean break from the past

The past can have a major impact on how a table format works today. Some table formats have evolved from older technologies, while others have made a clean break. Iceberg is in the latter camp. It was built from the ground up to address shortcomings in Apache Hive, which means it has avoided some of the undesirable qualities that held data lakes back in the past. How schema changes, such as renaming a column, are handled is a good example.
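
One way a table format can make a column rename safe is to track each column by a stable ID rather than by its name, which is the approach the Iceberg specification takes. The Python sketch below is a toy illustration of that idea under simplified assumptions, not Iceberg's implementation: the schema and row layouts and the helper names rename_column and read are hypothetical.

```python
# Toy model: data files store values under stable column IDs, not names.
schema = {1: "id", 2: "cust_name"}          # column ID -> current name

old_file = [{1: 100, 2: "Ada"}]             # written before the rename


def rename_column(schema, old_name, new_name):
    """Return a new schema with one column renamed; IDs stay the same."""
    if new_name in schema.values():
        raise ValueError(f"column {new_name!r} already exists")
    if old_name not in schema.values():
        raise KeyError(old_name)
    return {cid: (new_name if name == old_name else name) for cid, name in schema.items()}


def read(rows, schema):
    """Project stored rows onto the current column names by ID."""
    return [{name: row.get(cid) for cid, name in schema.items()} for row in rows]


schema_v2 = rename_column(schema, "cust_name", "customer_name")
new_file = [{1: 101, 2: "Grace"}]           # written after the rename

print(read(old_file + new_file, schema_v2))
```

The rename touches only the metadata. Files written before and after the change are read through the same ID-to-name mapping, so no data has to be rewritten and old rows show up under the new column name.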

Looking ahead, this also means Iceberg does not need to work out how to break further from related tools without disrupting production data applications. Over time, other table formats will likely catch up, but as of now, Iceberg is focused on delivering the next set of new features instead of looking back to fix old problems.

Iceberg is agnostic to processing engine and file format

By decoupling the processing engine from the table format, Iceberg provides greater flexibility and choice. Instead of being forced to use one processing engine, engineers can pick the best tool for the job. Choice is important for at least two key reasons. First, the engines a company uses to process data can change over time. For example, many businesses moved from Hadoop to Spark or Trino. Second, it’s common for large organizations to use several different technologies, and having choice enables them to use several tools interchangeably.

Iceberg also supports multiple file formats, including Apache Parquet, Apache Avro, and Apache ORC. This provides flexibility today, and it also enables better long-term pluggability for file formats that may emerge in the future.

Iceberg is a well-run open source project

The Iceberg project is managed by the Apache Software Foundation, which means it adheres to several important principles of the Apache Way, including earned authority and consensus decision making. This is not necessarily the case for every project calling itself “open source.” Apache Iceberg makes its project management public, so you know who is running the project. Other table formats do not disclose who has decision-making authority. A table format is a fundamental choice in a data architecture, so choosing a project that is truly open and collaborative can significantly reduce the risk of accidental lock-in.

Collaboration in Iceberg is spawning new ideas and help

There are several signs that the collaborative community around Apache Iceberg is benefiting users and setting the project up for long-term success. For users, the Slack channel and GitHub repository show high engagement, both around new ideas and support for existing functionality. Critically, engagement is coming from across the industry, not just one group or the original authors of Iceberg.

The high degree of collaboration is also benefiting the technology itself. The project is soliciting a growing number of proposals that are diverse in their thinking and solve many different use cases. Additionally, the project is spawning new projects and ideas, such as Project Nessie, the Puffin Spec, and the open Metadata API. 

Iceberg includes features that are paid in other table formats

Unlike some other table projects, Iceberg has performance-oriented features built in from the start, which is beneficial for users in a few ways. First, users often assume a project with open code includes performance features, only to discover that they are either not included or only vaguely promised for the future. Second, if you want to move workloads around, which should be easy with a table format, you’re much less likely to run into substantial differences in Iceberg implementations. Third, once you start using open source Iceberg, you’re unlikely to discover that a feature you need is hidden behind a paywall. The distinction between what is open and what isn’t is also not a point-in-time problem; it can shift as a project evolves.

As an open project from the start, Iceberg exists to solve a practical problem, not a business use case. This is a small but important distinction: Vendors with paid products who provide support for Iceberg, such as Snowflake, AWS, Apple, Cloudera, Google Cloud, and others, can compete on how well they implement the Iceberg specification, but the Iceberg project itself is not intended to drive business for a specific company.

Snowflake and Iceberg

At Snowflake, we created our own table format early on, which enabled all sorts of new capabilities. But as businesses move to a cloud data platform, their needs and timelines vary. Some companies have regulatory requirements that restrict where data can be stored, or have existing investments they need to protect.

Supporting an external table format like Iceberg allows our customers to leverage all of their data from within Snowflake, even if some of it needs to reside in a different location. That’s why we added support for Iceberg as an additional table option within Snowflake earlier this year, and more recently introduced a new type of Snowflake table called Iceberg Tables. 

Getting Started with Apache Iceberg

There are some excellent resources within the Apache Iceberg community to learn more about the project and to get involved in the open source effort.

  • The Iceberg Getting Started guide provides examples of how to get started with purely open source Iceberg and Apache Spark.
  • Iceberg has several robust communities where you can get involved, such as the public Slack channels. 
  • If you want to make changes to Iceberg or propose a new idea, create a pull request based on the contribution guide. The community regularly reviews and merges community requests.

If you’re a Snowflake user, you can get started with our Iceberg private-preview support today. Contact your Snowflake account team to learn more about these features or to sign up. 

  • Iceberg Tables: Try out our new table type based entirely on Iceberg and Parquet in external storage, with the benefits of Snowflake tables and similar performance.
  • External Tables for Iceberg: Enable easy connection from Snowflake with an existing Iceberg table via a Snowflake External Table.

James Malone is senior manager of product management at Snowflake.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2022 IDG Communications, Inc.

Originally posted on August 29, 2022 @ 12:44 pm