Better application networking and security with CAKES

Modern software applications are underpinned by a large and growing web of APIs, microservices, and cloud services that must be highly available, fault tolerant, and secure. The underlying networking technology must support all of these requirements, of course, but also explosive growth.

Unfortunately, the previous generation of technologies is too expensive, brittle, and poorly integrated to adequately solve this challenge. Add in suboptimal organizational practices, regulatory compliance requirements, and the need to deliver software faster, and it becomes clear that a new generation of technology is needed to address these API, networking, and security challenges.

CAKES is an open-source application networking stack whose pieces are built to integrate with each other and to better solve these challenges. This stack is intended to be coupled with modern practices like GitOps, declarative configuration, and platform engineering. CAKES is built on the following open-source technologies:

  • C – CNI (container network interface) / Cilium, Calico
  • A – Ambient Mesh / Istio
  • K – Kubernetes
  • E – Envoy / API gateway
  • S – SPIFFE / SPIRE

In this article, we explore why we need CAKES and how these technologies fit together in a modern cloud environment, with a focus on speeding up delivery, reducing costs, and improving compliance.

Why CAKES?

Existing technology and organization structures are impediments to solving the problems that arise with the explosion in APIs, the need for iteration, and an increased speed of delivery. Best-of-breed technologies that integrate well with each other, that are based on modern cloud principles, and that have been proven at scale are better equipped to handle the challenges we see.

Conway’s law strikes again

A major challenge in enterprises today is keeping up with the networking needs of modern architectures while also keeping existing technology investments running smoothly. Large organizations have multiple IT teams responsible for these needs, but at times, the information sharing and communication between these teams is less than ideal. Those responsible for connectivity, security, and compliance typically live across networking operations, information security, platform/cloud infrastructure, and/or API management. These teams often make decisions in silos, which causes duplication and integration friction with other parts of the organization. Oftentimes, “integration” between these teams is through ticketing systems.

For example, a networking operations team generally oversees technology for connectivity, DNS, subnets, micro-segmentation, load balancing, firewall appliances, monitoring/alerting, and more. An information security team is usually involved in policy for compliance and audit, managing web app firewalls (WAF), penetration testing, container scanning, deep packet inspection, and so on. An API management team takes care of onboarding, securing, cataloging, and publishing APIs.

If each of these teams independently picks the technology for their silo, then integration and automation will be slow, brittle, and expensive. Changes to policy, routing, and security will reveal cracks in compliance. Teams may become confused about which technology to use, as inevitably there will be overlap. Lead times for changes in support of app developer productivity will get longer and longer. In short, Conway’s law, which states that a system’s design tends to mirror the communication structure of the organization that builds it, rears its ugly head.


Figure 1. Technology silos lead to fragmented technology choices, expensive and brittle integrations, and overlap.

Sub-optimal organizational practices

Conway’s law isn’t the only issue here. Organizational practices in this area can be sub-optimal. Implementations on a use-case-by-use-case basis result in many isolated “network islands” within an organization because that’s how things “have always been done.”

For example, a new line of business spins up that will provide services to other parts of the business and consume services from them. The modus operandi is to create a new VPC (virtual private cloud), install new F5 load balancers and new Palo Alto firewalls, create a new team to configure and manage it all, and so on. Doing this use case by use case causes a proliferation of these network islands, which are difficult to integrate and manage.

As time goes on, each team solves challenges in their environments independently. Little by little, these network islands start to move away from each other. For example, we at Solo.io have worked with large financial institutions where it’s common to find dozens if not hundreds of these drifting network islands. Organizational security and compliance requirements become very difficult to keep consistent and auditable in an environment like that.


Figure 2. Existing practices lead to expensive duplication and complexity.

Outdated networking assumptions and controls

Lastly, the assumptions we’ve made about perimeter network security and the controls we use to enforce security policy and network policy are no longer valid. We’ve traditionally assigned a lot of trust to the network perimeter and “where” services are deployed within network islands or network segments. The “perimeter” deteriorates as we punch more holes in the firewall, use more cloud services, and deploy more APIs and microservices on premises and in public clouds (or in multiple public clouds as demanded by regulations). Once a malicious actor makes it past the perimeter, they have lateral access to other systems and can get access to sensitive data. Security and compliance policies are typically based on IP addresses and network segments, which are ephemeral and can be reassigned. With rapid changes in the infrastructure, “policy bit rot” happens quickly and unpredictably.

Policy bit rot happens when we intend to enforce a policy, but because of a change in complex infrastructure and IP-based networking rules, the policy becomes skewed or invalid. Let’s take a simple example of service A running on VM 1 with IP address 10.0.1.1 and service B running on VM 2 with IP address 10.0.1.2. We can write a policy that says “service A should be able to talk to service B” and implement that as firewall rules allowing 10.0.1.1 to talk to 10.0.1.2.
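To make that concrete, here is a minimal sketch of how such an intent often ends up encoded. The format below is a hypothetical, YAML-style firewall rule list, not any particular vendor’s syntax; the point is that the rules capture addresses, not the services themselves:

```yaml
# Hypothetical IP-based rules (illustrative only) encoding "service A may call service B"
- action: allow
  source: 10.0.1.1/32        # VM 1, which currently happens to run service A
  destination: 10.0.1.2/32   # VM 2, which currently happens to run service B
  port: 443
- action: deny
  source: 10.0.1.1/32
  destination: 0.0.0.0/0     # everything else
```

Nothing in these rules mentions service A or service B by name; the intent lives only in someone’s head or in a ticket.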


Figure 3. Service A calling Service B on two different VMs with IP-based policy.

Two simple things could happen here to rot our policy. First, a new service C could be deployed to VM 2. The result, which may not be intended, is that service A can now call service C. Second, VM 2 could become unhealthy and be recycled with a new IP address, and the old IP address could be reassigned to a VM 3 running a service D. Now service A can call service D but potentially not service B.


Figure 4. Policy bit rot can happen quickly and go undetected when relying on ephemeral networking controls.

The previous example is a very simple use case, but extend it to hundreds of VMs with hundreds if not thousands of complex firewall rules and you can see how policy in environments like this gets skewed as changes accumulate. When policy bit rot happens, it’s very difficult to understand what the current policy actually is unless something breaks. But just because traffic isn’t breaking right now doesn’t mean the policy posture hasn’t become vulnerable.

Conway’s law, complex infrastructure, and outdated networking assumptions make for a costly quagmire that slows the speed of delivery. Making changes in these environments leads to unpredictable security and policy impacts, makes auditing difficult, and undermines modern cloud practices and automation. For these reasons, we need a modern, holistic approach to application networking.

A better approach to application networking

Technology alone won’t solve some of the organizational challenges discussed above. More recently, the practices that have formed around platform engineering appear to give us a path forward. Organizations that invest in platform engineering teams to automate and abstract away the complexity around networking, security, and compliance enable their application teams to go faster.

Platform engineering teams take on the heavy lifting around integration and home in on the right user experience for the organization’s developers. By centralizing common practices, taking a holistic view of an organization’s networking, and using workflows based on GitOps to drive delivery, a platform engineering team can get the benefits of best practices, reuse, and economies of scale. This improves agility, reduces costs, and allows app teams to focus on delivering new value to the business.
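As a rough sketch of what “GitOps to drive delivery” can look like in practice, the following is a minimal Argo CD Application manifest (the repository URL, path, and namespaces are hypothetical) that continuously syncs declarative networking and security policy from Git into a cluster:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-networking-policy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/networking-policy.git  # hypothetical repo
    targetRevision: main
    path: policies/production
  destination:
    server: https://kubernetes.default.svc
    namespace: istio-system
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert out-of-band changes back to the declared state
```

With a workflow like this, the Git repository, rather than a ticket queue, becomes the audit trail for every networking and security change.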


Figure 5. A platform engineering team abstracts away infrastructure complexity and presents a developer experience to application developer teams through an internal developer portal.

For a platform engineering team to be successful, we need to give them tools that are better equipped to live in this modern, cloud-native world. When thinking about networking, security, and compliance, we should be thinking in terms of roles, responsibilities, and policy that can be mapped directly to the organization.

We should avoid relying on “where” things are deployed, what IP addresses are being used, and what micro-segmentation or firewall rules exist. We should be able to quickly look at our “intended” posture and easily compare it to the existing deployment or policy. This will make auditing simpler and compliance easier to ensure. How do we achieve this? We need three simple but powerful foundational concepts in our tools:

  • Declarative configuration
  • Workload identity
  • Standard integration points

Declarative configuration

Intent and current state are often muddied by the complexities of an organization’s infrastructure. Trying to wade through thousands of lines of firewall rules based on IP addresses and network segmentation to understand intent can be nearly impossible. Declarative configuration formats help solve this.

Instead of thousands of imperative steps to achieve a desired posture, declarative configuration allows us to state very clearly what the intent or end state of the system should be. We can look at the live state of a system and compare it with its intended state much more easily with declarative configuration than by trying to reverse engineer complex steps and rules. If the infrastructure changes, we can “recompile” the declarative policy against the new target, which allows for agility.
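As a minimal sketch, here is the earlier “service A should be able to talk to service B” intent expressed as a standard Kubernetes NetworkPolicy (the namespace and labels are assumptions for illustration), selecting workloads by what they are rather than by IP address:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-service-b
  namespace: payments              # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: service-b               # applies to service B's pods, wherever they are scheduled
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: service-a       # only service A's pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Because the selectors follow labels rather than addresses, rescheduling a pod or recycling a node does not change what the policy means, and the live state can be diffed against the declared intent.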


Figure 6. Declare what, not how.

Writing network policy as declarative configuration is not enough, however. We’ve seen large organizations build nice declarative configuration models, but the complexity of their infrastructure still leads to complex rules and brittle automation. Declarative configuration should be written in terms of strong workload identity that is tied to services mapped to the organization’s structure. This workload identity is independent of the infrastructure, IP addresses, or micro-segmentation. Workload identity helps reduce policy bit rot and configuration drift, and makes it easier to reason about the intended state of the system versus its actual state.

Workload identity

Previous methods of building policy based on “where” workloads are deployed are too susceptible to policy bit rot. Constructs like IP addresses and network segments are not durable; they are ephemeral, can be changed or reassigned, or may not even be relevant. Changes to these constructs can nullify intended policy. We need to identify workloads based on what they are and how they map to the organizational structure, independently of where they are deployed. This decoupling allows intended policy to resist drift when the infrastructure changes, spans hybrid environments, or experiences faults and failures.


Figure 7. Strong workload identity should be assigned to workloads at startup. Policies should be written in terms of durable identity regardless of where workloads are deployed.

With a more durable workload identity, we can write authentication and authorization policies as declarative configuration that is easier to audit and that maps clearly to compliance requirements. A high-level compliance requirement such as “test and developer environments cannot interact with production environments or data” becomes easier to enforce. With workload identity, we know which workloads belong to which environments because the environment is encoded in each workload’s identity.
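As a sketch of how such a requirement might look in the CAKES stack, here is an Istio AuthorizationPolicy that denies traffic from non-production namespaces to workloads in production. The namespace names are assumptions for illustration; the source namespaces are derived from the callers’ SPIFFE identities (for example, spiffe://cluster.local/ns/dev/sa/billing), not from their IP addresses:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-nonprod-to-prod
  namespace: prod                       # hypothetical production namespace
spec:
  action: DENY
  rules:
    - from:
        - source:
            namespaces: ["dev", "test"] # hypothetical non-production namespaces,
                                        # matched against the caller's mTLS/SPIFFE identity
```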

Most organizations already have existing investments in identity and access management systems, so the last piece of the puzzle here is the need for standard integration points.

Standard integration points

A big pain point in existing networking and security implementations is the expensive integration between systems that were not designed to work well together or that expose proprietary integration points. Some of these integrations are heavily UI-based, which makes them difficult to automate. Any system built on declarative configuration and strong workload identity will also need to integrate with other layers in the stack and with supporting technology.
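As one sketch of what a standard, declarative integration point can look like, here is an Istio RequestAuthentication resource (the issuer and JWKS URL are hypothetical) that plugs an existing OIDC-based identity provider into the mesh through configuration, instead of a proprietary, UI-driven integration:

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: corp-idp-jwt
  namespace: istio-system                 # root namespace, so the rule applies mesh-wide
spec:
  jwtRules:
    - issuer: https://idp.example.com     # hypothetical existing IAM / OIDC provider
      jwksUri: https://idp.example.com/.well-known/jwks.json
```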
