Data is becoming more valuable, which means data downtime is more expensive. Here’s what some of the best data leaders are doing to address it.
In 2008, a minute of Amazon.com being unavailable would have cost the company $31,000. In 2021, that same minute of downtime would cost approximately $9 million.
The reason for this jump is simple: As the company and its online retail operations grew and more orders were placed every minute, downtime became more costly. This story is not unique to Amazon, of course. Web and SaaS applications have become mission critical to virtually every organization, which means any downtime has a significant cost to the business.
In response to this change, organizations didn’t say, “Well, the nature of applications and software development means it will always be a bit messy.” Instead they scaled their investments in line with the growing severity of the problem, which resulted in the widespread adoption of disciplines like site reliability engineering and technologies like observability.
Just as SaaS has moved from powering websites to being trusted with increasingly critical tasks, from banking to real-time navigation, data needs to move from solely powering executive dashboards to creating valuable data applications such as machine learning models and real-time marketing.
However, just as the SaaS renaissance was not possible until cloud computing and application performance management solutions emerged to solve scaling and reliability issues, the data renaissance will not flourish until those same challenges are addressed.
Snowflake has helped solve the scaling challenge in a big way. We’re starting to see similar progress on the data reliability side as well. This includes the emerging discipline of data reliability engineering; technologies such as data testing and data observability; and emerging best practices such as data SLAs, data contracts, DataOps, circuit breakers, and schema change management.
In this post, I’ll walk through a four-step process data leaders can use to further improve and scale their data reliability to better drive data trust and adoption in their organization.
Step 1: Assess your current state
It’s somewhere between ironic and tragic, but data teams don’t always have great metrics in place to measure their data reliability or overall data health.
As a result, many teams are judged by business stakeholders on a qualitative basis, often weighted heavily toward how much time has passed since the last data incident. The other consequence is that data teams routinely underestimate the severity of the problem and underinvest in systemic fixes for data reliability.
Here are some example metrics and baselines for you to consider when assessing your current data reliability.
- Data downtime: To calculate this metric, take your number of data incidents and multiply it by the sum of your average time to detection (TTD) and your average time to resolution (TTR). Without end-to-end data monitoring, however, you will likely only know the number of incidents you caught. If that’s the case, here’s a benchmark to help you create an estimate: the average organization experiences about 67 data incidents a year for every 1,000 tables in its environment. Once you have your estimated number of incidents, multiply by your average TTD plus TTR. In a recent survey of 300 data professionals, Wakefield Research found that most respondents took more than 4 hours to detect an issue and averaged 9 hours to resolve it. (For a quick back-of-the-envelope calculation, see the sketch after this list.)
- Total engineering resources spent on data quality issues: Survey your engineering team to understand what percentage of their time they spend on data quality issues. Most industry surveys, including our own, consistently peg this between 30 and 50%. From there, it’s a simple process to convert those hours into salary to understand your labor cost. You can also review your OKRs and KPIs to see how many are related to improving data quality or are a consequence of poor data quality.
- Data trust: Survey your stakeholders to see how much they trust the data your team is responsible for. You can get a heuristic measure for this by seeing how many times data issues are caught by people outside your team. Accenture found that only a third of businesses trust their data enough to use it effectively.
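To make this concrete, here’s a minimal back-of-the-envelope sketch in Python. The incident rate, TTD, TTR, and time-spent figures are the survey benchmarks cited above; the table count, team size, and salary are hypothetical placeholders you would swap for your own numbers.

```python
# Back-of-the-envelope estimate of data downtime and the labor cost of data quality work.
# Benchmark figures (incident rate, TTD, TTR, % of time) come from the surveys cited above;
# the table count, team size, and salary are hypothetical placeholders.

TABLES = 2_500                  # tables in your environment (hypothetical)
INCIDENTS_PER_1K_TABLES = 67    # benchmark: ~67 incidents per year per 1,000 tables
AVG_TTD_HOURS = 4               # benchmark: average time to detection
AVG_TTR_HOURS = 9               # benchmark: average time to resolution

ENGINEERS = 10                  # data engineers on the team (hypothetical)
AVG_LOADED_SALARY = 150_000     # fully loaded annual cost per engineer (hypothetical)
PCT_TIME_ON_DATA_QUALITY = 0.4  # surveys peg this between 30 and 50%

incidents_per_year = TABLES / 1_000 * INCIDENTS_PER_1K_TABLES
data_downtime_hours = incidents_per_year * (AVG_TTD_HOURS + AVG_TTR_HOURS)
labor_cost = ENGINEERS * AVG_LOADED_SALARY * PCT_TIME_ON_DATA_QUALITY

print(f"Estimated incidents per year: {incidents_per_year:.0f}")
print(f"Estimated data downtime: {data_downtime_hours:,.0f} hours per year")
print(f"Estimated labor cost of data quality work: ${labor_cost:,.0f} per year")
```

Even a rough estimate like this gives you a dollar figure and an hours figure you can put in front of leadership when you make the case for systemic fixes.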
By assessing your current state of data reliability, you can set realistic goals for improvement as well as create a consensus for preventive measures before a major and painful incident arises.
Step 2: Identify priorities and set goals
For this step, you will need to talk to the business to understand how they use data in their daily workflows. You can then translate this into data SLAs to better track performance and determine the interventions required to reach the future state. Examples include the following (a sketch of how a few of these could be codified as automated checks follows the list):
- Freshness: Data will be refreshed by 7 a.m. daily (great for cases where the CEO or other key executives are checking their dashboards at 7:30 a.m.); data will never be older than X hours.
- Distribution: Column X will never be null; column Y will always be unique; field X will always be equal to or greater than field Y.
- Volume: Table X will never decrease in size.
- Schema: No fields will be deleted on this table.
- Overall data downtime: We will reduce our number of incidents by X%, our time to detection by X%, and our time to resolution by X%.
- Ingestion (great for keeping external partners accountable): Data will be received by 5 a.m. each morning from partner Y.
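As a rough illustration of what these SLAs look like once they’re codified, here’s a minimal Python sketch. The `fetch_table_stats` helper, the table name, and the thresholds are hypothetical stand-ins for the metadata queries your warehouse or observability tooling would actually run.

```python
from datetime import datetime, timedelta, timezone

# A minimal, illustrative SLA checker. `fetch_table_stats` is a hypothetical helper
# that would query your warehouse for the metadata each check needs (last load time,
# row counts, null counts, column list); here it returns canned values so the sketch
# runs on its own.

def fetch_table_stats(table: str) -> dict:
    return {
        "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=2),
        "row_count": 1_250_000,
        "previous_row_count": 1_180_000,
        "null_counts": {"customer_id": 0},
        "columns": {"customer_id", "order_total", "created_at"},
    }

def check_slas(table: str, expected_columns: set[str]) -> dict[str, bool]:
    stats = fetch_table_stats(table)
    now = datetime.now(timezone.utc)
    return {
        # Freshness: data is never older than 24 hours.
        "freshness": now - stats["last_loaded_at"] <= timedelta(hours=24),
        # Distribution: customer_id is never null.
        "no_null_customer_id": stats["null_counts"]["customer_id"] == 0,
        # Volume: the table never decreases in size.
        "volume_not_decreasing": stats["row_count"] >= stats["previous_row_count"],
        # Schema: no expected fields have been deleted.
        "schema_intact": expected_columns <= stats["columns"],
    }

if __name__ == "__main__":
    results = check_slas("analytics.orders", {"customer_id", "order_total", "created_at"})
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Checks like these can run on a schedule, and their pass/fail history feeds directly into the data downtime metrics from Step 1.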
Step 3: Track improvements and optimize processes
Now that you can reliably measure your data health, you can proactively invest resources where they are needed. For example, you may have six warehouses running smoothly while one warehouse and its domain have repeated issues and need systemic solutions.
This is also a great time to work on your team’s data quality culture and processes, such as proactively handling schema changes, ensuring easy discoverability, and improving incident response.
Another example is leveraging an improved understanding of data lineage and how your data assets are connected to prioritize efforts around your key assets, or conversely to deprecate old, unused assets without worrying about unexpected downstream consequences.
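As a simple illustration of that idea, here’s a minimal sketch: given a (hypothetical) lineage graph of upstream-to-downstream edges, a breadth-first walk tells you how many assets depend on a given table, which helps you decide what to prioritize and what is safe to deprecate. In practice, the edges would come from your lineage or metadata tooling rather than being hardcoded.

```python
from collections import deque

# Hypothetical lineage graph: parent table -> tables built from it.
# In practice you would pull these edges from your lineage or metadata tool.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.daily_revenue", "analytics.orders_enriched"],
    "analytics.orders_enriched": ["ml.churn_features"],
    "analytics.daily_revenue": [],
    "ml.churn_features": [],
}

def downstream_assets(root: str) -> set[str]:
    """Breadth-first walk of everything that depends on `root`."""
    seen: set[str] = set()
    queue = deque(LINEAGE.get(root, []))
    while queue:
        asset = queue.popleft()
        if asset not in seen:
            seen.add(asset)
            queue.extend(LINEAGE.get(asset, []))
    return seen

# Prioritize assets with many dependents; candidates with none may be safe to deprecate.
for table in LINEAGE:
    print(f"{table}: {len(downstream_assets(table))} downstream asset(s)")
```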
Step 4: Proactively communicate data reliability
With these efforts, you should have dramatically improved your data reliability and should already be seeing higher adoption and a more data-driven culture across your organization. But now it’s time to accelerate that progress and proactively answer one of the most frequently asked questions of any data leader: “How do I know how much I can trust this data?”
Data certification is a great way to answer this question before it’s even asked. It is the process by which data assets are approved for use across the organization after meeting mutually agreed-upon SLAs (often tiered gold/silver/bronze) for data quality, observability, ownership and accountability, issue resolution, and communication.
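Here’s one illustrative way those tiers could be encoded. The specific thresholds are hypothetical placeholders; the point is simply that each tier maps to explicit, agreed-upon requirements that data consumers can see.

```python
# An illustrative (hypothetical) encoding of tiered certification criteria.
# The thresholds below are placeholders to be agreed upon with stakeholders.
CERTIFICATION_TIERS = {
    "gold": {
        "freshness_sla_hours": 1,
        "monitored": True,           # covered by automated observability monitors
        "owner_required": True,      # named owner accountable for incidents
        "max_time_to_resolution_hours": 4,
    },
    "silver": {
        "freshness_sla_hours": 24,
        "monitored": True,
        "owner_required": True,
        "max_time_to_resolution_hours": 24,
    },
    "bronze": {
        "freshness_sla_hours": 72,
        "monitored": False,
        "owner_required": False,
        "max_time_to_resolution_hours": None,  # best effort
    },
}

def certify(asset: str, tier: str) -> str:
    """Produce a certification label that consumers can see in the catalog."""
    criteria = CERTIFICATION_TIERS[tier]
    return f"{asset} certified {tier.upper()} (freshness <= {criteria['freshness_sla_hours']}h)"

print(certify("analytics.daily_revenue", "gold"))
```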
By labeling assets and communicating data reliability proactively, you can prevent versioning issues and increase efficiency.
Data adoption is at the crossroads of data access and reliability
A data renaissance is on the horizon, but to unleash its true potential, data leaders need to prioritize reliability. Now is the time to invest in improving your data reliability metrics, goals, processes, and proactive communication.
Until then—here’s wishing you no data downtime!