The dawn of intelligent and automated data orchestration

The exponential growth of data, and in particular unstructured data, is a problem enterprises have been wrestling with for decades. IT organizations are in a constant battle between ensuring that data is accessible to users, on the one hand, and that the data is globally protected and in compliance with data governance policies, on the other. Added to this is the need to ensure that files are stored in the most cost-effective manner possible, on whichever storage is best at that point in time.

The problem is there is no such thing as a one-size-fits-all storage platform that can serve as the shared repository for all of an organization’s data, especially across multiple locations. Instead, there are myriad storage choices available from as many vendors, each of which is best suited for a particular performance requirement, access protocol, and cost profile for each phase of the data’s life cycle. Users and applications simply want reliable, persistent access to their files. But data policies inevitably require files to move to different storage platforms or locations over time. This creates additional cost and complexity for IT and disrupts user workflows. 

The rise of AI and machine learning applications has sparked a new explosion of data that is only making this problem worse. Not only is data being created even faster, but AI applications also need access to legacy data repositories for training and inferencing workloads. This typically requires copying data from lower-cost, lower-performance storage systems into much higher-cost, higher-performance platforms.

In the consumer space, people have become used to the fact that when they open their iPhone or Android device, they simply see their files where they expect them, regardless of where the files are actually located. If they get a new device, the files are immediately available. Their view of the files is persistent, and abstracted from the physical location of the files themselves. Even if the files move from cloud to on-premises storage, or from old device to new, from the user’s perspective the files are just there where they always were. This data orchestration between platforms is a background operation, transparent to the user. 

This same capability is desperately needed by the enterprise, where data volumes and performance levels can be extreme. Migrating data between platforms or locations is difficult in large part because it is disruptive to users and applications. This creates what is often called data gravity, where the operational cost of moving the data to a different platform outweighs any savings the move would deliver, so the data stays where it is. When multiple sites and the cloud are added to the equation, the problem becomes even more acute.

The need for automated data orchestration

The traditional IT infrastructures that house unstructured data are inevitably siloed. Users and applications access their data via file systems, the metadata layer that translates the ones and zeros on storage platforms into the usable file and folder structures we see on our desktops.

The problem is that in traditional IT architectures, file systems are buried in the infrastructure, at the storage layer, which typically locks them, and your data, into a proprietary storage vendor platform. Moving the data from one vendor’s storage type to another, or to a different location or cloud, involves creating a new copy of both the file system metadata and the file data itself. This proliferation of file copies, and the complexity needed to manage copies across silos, interrupts user access and inhibits IT modernization and consolidation efforts.

This reality also impacts data protection, which can become fragmented across the silos. Operationally, it impacts users, who must remain online and productive even as the infrastructure changes around them. It also creates economic inefficiencies when multiple redundant copies of data are created, or when idle data gets stuck on expensive high-performance storage systems when it would be better managed elsewhere.

What is needed is a way to provide users and applications with seamless multi-protocol access to all their data, which is often fragmented across multiple vendor storage silos, including across multiple sites and cloud providers. In addition to global user access, IT administrators need to be able to automate cross-platform data services for workflow management, data protection, tiering, etc., but do so without interrupting users or applications.

To keep existing operations across many interconnected departmental stakeholders running at peak efficiency, while modernizing IT infrastructure to keep up with the next generation of data-centric use cases, organizations need the ability to step above vendor silos and focus on outcomes.

Defining data orchestration

Data orchestration is the automated process of ensuring files are where they need to be when they need to be there, regardless of which vendor platform, location, or cloud is required for that stage of the data life cycle. By definition, data orchestration is a background operation, completely transparent to users and applications. When data is being actively processed, it may need to be placed in high-performance storage close to compute resources. But once the processing run is finished, that data should shift to a lower-cost storage type, to the cloud, or to another location, without interrupting user or application access.

Data orchestration is different from the traditional methods of shuffling data copies between silos, sites, and clouds precisely because it is a background operation that is transparent to users and applications. From a user perspective, the data has not moved. It remains in the expected file/folder structure on their desktop in a cross-platform global namespace. Which actual storage device or location the files sit on at the moment is driven by workflow requirements, and will change as workflows require.

Proper vendor-neutral data orchestration means that these file placement actions do not disrupt user access, or cause any change to the presentation layer of the file hierarchy in the global namespace. This is true whether the files are moving between silos in a single data center or across multiple data centers or the cloud. A properly automated data orchestration system ensures that data placement actions never impact users, even on live data that is being actively used.
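To make that separation concrete, here is a minimal sketch in Python of a logical namespace that stays fixed while the physical placement of a file changes behind it. The record structure, paths, and storage names are hypothetical and purely illustrative; this is not Hammerspace code, just the general idea of metadata-level presentation decoupled from physical location.

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    logical_path: str       # what users and applications see; never changes
    physical_location: str  # which silo, site, or cloud holds the bytes right now

# A toy namespace with a single entry (paths and storage names are made up).
namespace = {
    "/projects/genomics/run42.bam": FileRecord(
        "/projects/genomics/run42.bam", "nvme-cluster-a"),
}

def resolve(logical_path: str) -> str:
    """Clients open files by logical path; placement is looked up behind the scenes."""
    return namespace[logical_path].physical_location

def orchestrate(logical_path: str, new_location: str) -> None:
    """Move the file's bytes in the background; the logical path is untouched."""
    record = namespace[logical_path]
    # ... copy bytes from record.physical_location to new_location, verify, release ...
    record.physical_location = new_location

print(resolve("/projects/genomics/run42.bam"))  # nvme-cluster-a
orchestrate("/projects/genomics/run42.bam", "cloud-archive-bucket")
print(resolve("/projects/genomics/run42.bam"))  # cloud-archive-bucket; same path for users
```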

Enabling a global data environment

Instead of managing data by copying files from silo to silo, which interrupts user access and adds complexity, Hammerspace offers a software-defined data orchestration and storage solution that provides unified file access via a high-performance parallel global file system that can span different storage types from any vendor, as well as across geographic locations, public and private clouds, and cloud regions. As a vendor-neutral, software-defined solution, Hammerspace bridges silos across one or more locations to enable a cross-platform global data environment.

This global data environment can dynamically expand or contract to accommodate burst workflows to cloud or remote sites, for example, all while enabling uninterrupted and secure global file access to users and applications across them all. And rather than needing to rely on vendor-specific point solutions to shuffle copies between silos and locations, Hammerspace leverages multiple metadata types, including workflow-defined custom metadata, to automate cross-platform data services and data placement tasks. This includes data tiering and placement policies, but also data protection functions such as cross-platform global audit records, undelete, versioning, transparent disaster recovery, write once read many (WORM), and much more.

All data services can be globally automated, and invoked even on live data without user interruption across all storage types and locations.

Hammerspace automatically assimilates file metadata from data in place, with no need to migrate data off existing storage. In this way, even in very large environments, users and applications can mount the global file system within minutes to get cross-platform access, via industry-standard SMB and NFS file protocols, to all of their data globally, spanning all existing and new storage types and locations. No client software is needed for users or applications to directly access their files, and the file system views are identical to what they are used to.
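As a rough illustration, assume the global namespace has been mounted over NFS or SMB at a hypothetical path such as /mnt/global. From there, ordinary file APIs work unchanged, as in this Python sketch; the paths and file names are invented for the example.

```python
from pathlib import Path

# Hypothetical mount point for the global namespace (mounted via NFS or SMB).
MOUNT_POINT = Path("/mnt/global")

# Walk the share exactly as you would any local directory tree.
for path in MOUNT_POINT.rglob("*.csv"):
    size_mb = path.stat().st_size / 1_048_576
    print(f"{path} ({size_mb:.1f} MB)")

# Reads and writes are plain POSIX file I/O, regardless of which backend
# storage system or location currently holds the bytes.
with (MOUNT_POINT / "reports" / "summary.txt").open("a", encoding="utf-8") as f:
    f.write("written by a client with no vendor-specific software\n")
```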

The result is that file metadata is truly shared across all users, applications, and locations in a global namespace, and is no longer trapped at the infrastructure level in proprietary vendor silos. The silos between different storage platforms and locations disappear.

The power of global metadata

In a traditional storage array, users don’t know or care which individual disk drive within the system their files are on at the moment, or which drive they may move to later. All of the orchestration of the raw data bits across platters and drives in a storage array is transparent to them, because users interact with the storage system’s file system metadata, which lives above the hardware level.

In the same way, when users access their files via the Hammerspace file system all data movement between storage silos and locations is just as transparent to them as the movement of bits between drives and platters on their storage arrays. The files and folders are simply where they expect them to be on their desktop, because their view of those files comes via the global file system metadata above the infrastructure level. Data can remain on existing storage or move to new storage or the cloud transparently. Users simply see their file system as always, in a unified global namespace, with no change to their workflows.

It is as if all files on all storage types and locations were aggregated into a giant local network-attached storage (NAS) platform, with unified standards-based access from anywhere.

For IT organizations, this now opens a world of possibilities by enabling them to centrally manage their data across all storage types and locations without the risk of disrupting user access. In addition, it lets them control those storage resources and automate data services globally from a single pane of glass. And it is here that we can begin to see the power of global metadata.

That is, IT administrators can now use any combination of multiple metadata types to automate critical data services globally across otherwise incompatible vendor silos. And they can do this completely in the background, without proprietary point solutions or disruption to users.

Using Hammerspace automation tools called Objectives, administrators can proactively define any number of rules for how different classes of data should be managed, placed, and protected across the enterprise. This can be done on a per-file basis, with metadata variables providing a level of intelligence about what the data is and the value it has to the organization.
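The sketch below conveys the general idea of such metadata-driven rules. It is hypothetical Python, not the actual syntax of Hammerspace Objectives: it simply shows per-file metadata, including workflow-defined custom labels, being evaluated against declarative rules to choose a storage target. The field names, labels, and tier names are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field

DAY = 86_400  # seconds

@dataclass
class FileMeta:
    path: str
    size_bytes: int
    last_access: float                           # epoch seconds
    labels: dict = field(default_factory=dict)   # workflow-defined custom metadata

def place(meta: FileMeta) -> str:
    """Evaluate rules in priority order and return a storage target name."""
    if meta.labels.get("compliance") == "worm":
        return "worm-object-bucket"      # immutable retention tier
    if meta.labels.get("project") == "ai-training":
        return "nvme-tier"               # keep training data close to compute
    if time.time() - meta.last_access > 90 * DAY:
        return "cloud-archive"           # cold data moves off premium storage
    return "general-nas"

features = FileMeta("/projects/ml/features.parquet", 4_000_000_000,
                    time.time() - 2 * DAY, {"project": "ai-training"})
print(place(features))  # nvme-tier
```

In a real deployment, rules like these would be evaluated continuously and the resulting placements carried out in the background, without changing what users see in the namespace.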

This means that data services can be fine-tuned to align with business rules. These include services such as tiering across silos, locations, and the cloud, data migration and other data placement tasks, staging data between storage types and locations to automate workflows, extending on-prem infrastructure to the cloud, performing global snapshots, implementing global disaster recovery processes, and much more. All can now be automated globally without interruption to users.

And in environments where AI and machine learning workflows enable enterprises to discover new value from their existing data, the ability to automate orchestration for training and inferencing workflows, using data in place on existing silos rather than creating new aggregated repositories, is even more relevant.

This powerful data-centric approach to managing data across storage silos dramatically reduces complexity for IT staff, which can both reduce operating costs and increase storage utilization. This enables customers to get better use out of their existing storage and delay the need to add more storage.

The days of enterprises struggling with a siloed, distributed, and inefficient data environment are over. It’s time to start expecting more from your data architectures with automated data orchestration.

Trond Myklebust is co-founder and CTO of Hammerspace. As the maintainer and lead developer for the Linux kernel NFS client, Trond has helped to architect and develop several generations of networked file systems. Before joining Hammerspace, Trond worked at NetApp and the University of Oslo. Trond holds an MS degree in quantum field theory and fundamental fields from Imperial College, London. He worked in high-energy physics at the University of Oslo and CERN.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.

Copyright © 2024 IDG Communications, Inc.
