Why Column-Aware Metadata Is Key to Automating Data Transformation

Data, data, data. It does seem we are not only surrounded by talk about data, but by the actual data itself. We are collecting data from every nook and cranny of the universe (literally!). IoT devices in every industry; geolocation information on our phones, watches, cars, and every other mobile device; every website or app we access—all are collecting data. 

In order to derive value from this avalanche of data, we have to get more agile when it comes to preparing the data for consumption. This process is known as data transformation, and while automation in many areas of the data ecosystem has changed the data industry over the last decade, data transformations have lagged behind. 

That started to change in 2022, and in 2023 I predict we will see an accelerated adoption of platforms that enable data transformation automation. 

Why we need to automate data transformations

If we are going to be truly data-driven, we need to automate every possible task in our data ecosystem. Over the multiple decades I’ve spent in the data industry, one observation has remained nearly constant: the majority of the work in building a data analytics platform revolves around data transformations (what we used to call “the T in ETL or ELT”). This must change. Gone are the days when a few expert data engineers could manage the influx of new data and data types, and quickly apply complex business rules to deliver it to their business consumers. 

We cannot scale our expertise as fast as we can scale the Data Cloud. There are just not enough hours in a day to do all the data profiling, design, and coding required to build, deploy, manage, and troubleshoot an ever-growing set of data pipelines with transformations. Add to that, there is a dearth of expert engineers who can do all that coding and who also have the rapport with business users needed to understand the rules that must be applied. Engineers like this don’t grow on trees. They require very specific technical skills and years of experience to become efficient and effective at their craft.

The solution? Code automation. There are plenty of SQL-savvy data analysts and architects out there who can be trained on modern data tools with user-friendly UIs. The more code we can generate and the more data pipelines we can automate, the more data we can deliver to the folks who need it most, in a timely manner. Add to that, code generated from templates is easier to test and tends to have far fewer (if any) coding errors.
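To make that concrete, here is a minimal sketch, in Python, of what template-driven code generation can look like. The SQL pattern, table names, and column names are purely illustrative assumptions, not any particular tool’s approach: the transformation is rendered from metadata rather than hand-written.

```python
from string import Template

# A reusable transformation template: "keep only the latest row per key."
# The SQL pattern, table names, and column names are illustrative only.
LOAD_LATEST = Template(
    "CREATE OR REPLACE VIEW $target AS\n"
    "SELECT $columns\n"
    "FROM $source\n"
    "QUALIFY ROW_NUMBER() OVER (PARTITION BY $key ORDER BY $updated_at DESC) = 1;"
)

def render_load_latest(source: str, target: str, columns: list[str],
                       key: str, updated_at: str) -> str:
    """Fill the template from metadata instead of hand-writing the SQL."""
    return LOAD_LATEST.substitute(
        source=source,
        target=target,
        columns=", ".join(columns),
        key=key,
        updated_at=updated_at,
    )

print(render_load_latest(
    source="raw.orders",
    target="staging.orders_latest",
    columns=["order_id", "customer_id", "order_total", "updated_at"],
    key="order_id",
    updated_at="updated_at",
))
```

Because every pipeline built this way comes from the same tested template, the generated SQL is consistent and far easier to validate than hundreds of hand-coded variants.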

The fact is, with all this growth, not all that data is in one table or even one database; rather, it is spread across hundreds or even thousands of objects. A single organization may have access to millions of attributes. Translate that to database terms, and that means tens or hundreds of millions of columns that the organization needs to understand and manage.

Legacy solutions, even ones with some automation, are never going to manage and transform the data in all of those columns easily and quickly. How will we know where that data came from, where it went, and how it was changed along the way? With all the privacy laws and regulations, which vary from country to country and from state to state, how will we ever be able to trace the data and audit these transformations—at massive scale—without a better approach?

Using column-level metadata to automate data pipelines

I believe the best answer to these questions is that the automation tools we use need to be column-aware. It is no longer sufficient to keep track of just tables and databases. That is not fine-grained enough for today’s business needs.

For the future, our automation tools must collect and manage metadata at the column level. And the metadata must include more than just the data type and size. We need much more context today if we really want to unlock the power of our data. We need to know the origin of that data, how current the data is, how many hops it made to get to its current state, who has access to which columns, and what rules and transformations were applied along the way (such as masking or encryption). 
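As a rough illustration of what such a record could hold (every field name here is an assumption, not any vendor’s actual schema), consider a simple column-level metadata structure:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ColumnMetadata:
    """Illustrative column-level metadata record; every field name is an assumption."""
    database: str
    table: str
    column: str
    data_type: str
    source_system: str             # where the data originated
    last_loaded_at: datetime       # how current the data is
    hop_count: int                 # how many pipeline steps it has passed through
    transformations: list[str] = field(default_factory=list)  # e.g. ["lowercased", "masked"]
    allowed_roles: list[str] = field(default_factory=list)    # who may access this column
    sensitivity: str = "none"      # e.g. "none", "pii", "phi"

customer_email = ColumnMetadata(
    database="sales",
    table="orders",
    column="customer_email",
    data_type="VARCHAR(255)",
    source_system="webshop",
    last_loaded_at=datetime(2023, 1, 20, 6, 0),
    hop_count=3,
    transformations=["lowercased", "masked"],
    allowed_roles=["data_steward"],
    sensitivity="pii",
)
```

With records like this, an automation tool can reason about lineage, freshness, access, and transformation rules per column rather than per table.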

Column awareness is the next level of innovation needed to allow us to attain the agility, governance, and scalability that today’s data world demands. Legacy ETL and integration tools won’t cut it anymore. Not only do they lack column awareness, they can’t handle the scale and diversity of data we have today in the cloud. 

So, in 2023 I expect to see a much greater adoption of, and demand for, column-aware automation tools to enable us to derive value from all this data faster. It will be a new era for data transformation and delivery platforms. The legacy ETL and ELT tools that got us this far will fall by the wayside as modern automation tools come to the fore with their simplicity and ease of use.

A word about data sharing

Many have said it, but it bears repeating—data sharing and data collaboration are becoming critical to the success of all organizations as they strive for better customer service and better outcomes. Since my involvement in the early days of data warehousing, I have talked about the dream of enriching our internal data with external, third-party data. Thanks to the Snowflake Data Cloud, that dream is now a reality. We just have to take advantage of it.

I believe that 2023 will be the Year of Data Collaboration and Data Sharing. The technology is ready, and the industry is ready. Taking advantage of the collaboration and data sharing capabilities of Snowflake will provide the competitive edge that will allow many organizations to become or remain leaders in their industries. In this new age of advanced analytics, data science, ML, and AI, taking advantage of third-party data through data sharing and collaboration is essential if you want to be truly data-driven and stay ahead of the competition.

Successful organizations must, and will, not only consume data from their partners, constituents, and other data providers, but also make their data available for others to consume. For many this will lead to a related benefit: the ability to monetize data. Again, thanks to Snowflake, it is easier than ever to create shareable data products and make them available on Snowflake Marketplace, at an appropriate price.

With these new capabilities, properly managing and governing the data that is being shared will be paramount, and it must happen at the column level. Just as automating data transformations at scale has been enabled by using column-level metadata, data sharing and governance most certainly need to be at the column level—especially when it comes to sensitive data like PII and PHI. Automating the build of your data transformations using a column-aware transformation tool will be a critical success factor for organizations seeking to accelerate their development of shared data products, now and into the foreseeable future.
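As a rough sketch of how column-aware metadata could drive governed sharing, the snippet below generates a secure share view that hashes any column tagged as PII or PHI. The column names, the sensitivity tags, and the use of SHA2 hashing as the masking technique are illustrative assumptions only:

```python
# Illustrative only: pick a column expression for a shared view based on the
# column's sensitivity tag, so sensitive values are never exposed raw.
COLUMNS = {
    "order_id":       {"type": "NUMBER",  "sensitivity": "none"},
    "customer_email": {"type": "VARCHAR", "sensitivity": "pii"},
    "diagnosis_code": {"type": "VARCHAR", "sensitivity": "phi"},
}

def share_expression(name: str, meta: dict) -> str:
    """Return the expression to expose for this column in the shared view."""
    if meta["sensitivity"] in ("pii", "phi"):
        return f"SHA2({name}) AS {name}"  # hashing stands in for a real masking policy
    return name

select_list = ",\n  ".join(share_expression(c, m) for c, m in COLUMNS.items())
print(
    "CREATE OR REPLACE SECURE VIEW share.orders_shared AS\n"
    f"SELECT\n  {select_list}\nFROM sales.orders;"
)
```

In a real platform, the masking rule would come from the same column-level metadata that drives the transformations, so governance and automation stay in lockstep.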

If you want to get a jump on this, take a look at a modern data automation tool from Snowflake partner Coalesce.io and see how much faster you can get value from your data and bring some of that data to market.
