Matt Wyskiel

The Blurred Lines of Cloud Resource Ownership

September 6, 2024

I didn't expect to write a follow-up to my post on architecture, organization, and ownership so soon, but a new use case came up that I couldn't pass up discussing.

So this team came to me to ask my advice about a new integration they were planning. They have two existing services on ECS, and they want them to communicate asynchronously via SNS and SQS.

The first service would send a message to a Topic, which would fan out to subscriptions. They would follow the best practice of putting a Queue as the endpoint, for rate-limiting and retrying purposes, between the topic and a Lambda Function. The Function would then process the message and perform an action on the second service.
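
As a rough sketch of what that Function might look like: when SNS delivers to SQS with raw message delivery left off (the SNS default), each SQS record body is a JSON envelope whose `Message` field carries the published payload. The handler below unwraps that envelope. The names `extract_sns_messages` and `notify_service_2` are hypothetical, and the call into Service 2 is only a placeholder, since the post doesn't describe that API.

```python
import json


def extract_sns_messages(sqs_event):
    """Return the published SNS payloads from an SQS-triggered Lambda event."""
    messages = []
    for record in sqs_event.get("Records", []):
        # Assumes raw message delivery is disabled, so the SQS body is the
        # SNS JSON envelope and the payload lives under "Message".
        envelope = json.loads(record["body"])
        messages.append(envelope["Message"])
    return messages


def notify_service_2(message):
    """Hypothetical placeholder for the action performed on Service 2."""
    print(f"would forward to Service 2: {message}")


def handler(event, context):
    # If notify_service_2 raises, Lambda does not delete the batch, so SQS
    # redelivers it once the visibility timeout expires.
    for message in extract_sns_messages(event):
        notify_service_2(message)
    return {"processed": len(event.get("Records", []))}
```

The Queue in the middle is what makes the retry behavior possible: a failed invocation leaves the message on the Queue instead of losing it.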

ownership-original-infra

Their problem statement: They needed to come up with a plan to organize, on the developer-side, the Infrastructure-as-Code that would make this happen.

It was a design question about which service owned which resources, so they could potentially optimize their workflow.

Their first thought was the following:

"The Topic serves as Service 1's 'notification sender', and the Queues' and Lambdas' purpose is to handle those messages for the sake of Service 2. So, we should deploy the topic with Service 1 and the other pieces with Service 2!"

Which, on the surface, sounds incredibly reasonable. Its logical flow and ownership structure seem clear. It would ensure that you know where a component belongs, so that it can be updated alongside its 'owning' Service.

ownership-each-service-owning

There are two quirks with this design, of course, which I explained.

  1. Our custom-built CloudFormation deployment pipelines distinguish 'Server-ful', container-based services from Serverless Infrastructure, as they require different build and deployment steps. The system is meant to support one or the other at a given time, not both in one repository. It's technically possible, but it would be additional cognitive load and effort to maintain.
  2. As we saw in our last situation with SNS Topic Subscriptions, the topic would need to own the subscriptions. This would involve a song-and-dance where the Topic is deployed first 'empty', then the Queues are deployed with the proper permissions for that Topic in their policies, and then the Topic is updated with subscriptions to the Queues by ARN. This is a fragile infrastructure deployment flow, prone to breakage if you don't follow the process exactly.
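
To make the second quirk concrete, here is a minimal sketch of the two Topic deployments, with the template built as a plain Python dict. The function name `topic_template`, the logical ID `NotificationTopic`, and the queue ARN are all placeholders I made up for illustration; this is not our actual pipeline code.

```python
import json


def topic_template(queue_arns):
    """Build a Service 1 template whose Topic owns its subscriptions inline."""
    properties = {}
    if queue_arns:
        # The Topic's inline Subscription list needs the Queue ARNs, which
        # do not exist until the Service 2 stack has been deployed.
        properties["Subscription"] = [
            {"Protocol": "sqs", "Endpoint": arn} for arn in queue_arns
        ]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "NotificationTopic": {"Type": "AWS::SNS::Topic", "Properties": properties}
        },
        # Ref on an SNS Topic resolves to the Topic ARN.
        "Outputs": {"TopicArn": {"Value": {"Ref": "NotificationTopic"}}},
    }


# Step 1: deploy the Topic 'empty', with Service 1.
step_1 = topic_template([])

# Step 2 (not shown): deploy the Queues with Service 2, with policies
# that allow the Topic from step 1 to send to them.

# Step 3: redeploy the Topic, now pointing at the Queues by ARN.
placeholder_queue_arn = "arn:aws:sqs:us-east-1:123456789012:example-queue"
step_3 = topic_template([placeholder_queue_arn])

print(json.dumps(step_3, indent=2))
```

The same template has to be deployed twice with different inputs, and the second deployment depends on a stack owned by the other service. Get the order wrong and the subscription or its permissions are missing.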

The tech lead I was talking to was okay with these conditions, but I could tell he was reluctant about it.

In the back of my head I was also unsure about the game plan, but we talked for a while more, coordinating as if we were moving forward.

And then, towards the end of the meeting, I interrupted and said, "Wait a minute, I have another idea."

I saw what potentially many of you saw when taking a look at that first diagram. The infrastructure enabling async communication between the two services could instead be grouped together and deployed as a cohesive unit!

ownership-separate-infra

This has multiple advantages:

  1. Better compatibility with our deployment pipelines, as these components can be separated from the services and optimized for an Infra Pipeline type, rather than being shoehorned into a Service Pipeline type.
  2. The components can much more easily integrate with each other, via local CloudFormation Outputs. In this particular case, this means that the Topic Subscription can be defined inline using endpoints defined in the same place, tightening the coupling between resources, which is how CloudFormation works best with SNS.
  3. Defined input and output points: a) Service 1 could receive the Topic's ARN as a CloudFormation Export, and send messages to it via the SNS API as usual. b) The Lambdas could communicate with Service 2 via API call, which is already a stateless communication method that only needs a URL.
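
Here is a rough sketch of what that single stack could look like, again with the template built as a Python dict. The logical IDs (`WorkQueue`, `NotificationTopic`, `WorkQueuePolicy`), the function name, and the export name are assumptions for illustration, and the Lambda and its event source mapping are left out to keep it short.

```python
import json


def messaging_stack_template(export_name="service-1-notification-topic-arn"):
    """Build one template holding the Topic, Queue, and Queue policy together."""
    queue_arn = {"Fn::GetAtt": ["WorkQueue", "Arn"]}
    topic_arn = {"Ref": "NotificationTopic"}  # Ref on a Topic is its ARN
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "WorkQueue": {"Type": "AWS::SQS::Queue"},
            "NotificationTopic": {
                "Type": "AWS::SNS::Topic",
                "Properties": {
                    # Inline subscription: the Queue lives in the same
                    # template, so no second deployment is needed.
                    "Subscription": [{"Protocol": "sqs", "Endpoint": queue_arn}]
                },
            },
            "WorkQueuePolicy": {
                "Type": "AWS::SQS::QueuePolicy",
                "Properties": {
                    "Queues": [{"Ref": "WorkQueue"}],  # Ref on a Queue is its URL
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Principal": {"Service": "sns.amazonaws.com"},
                                "Action": "sqs:SendMessage",
                                "Resource": queue_arn,
                                # Only this Topic may send to the Queue.
                                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
                            }
                        ],
                    },
                },
            },
        },
        "Outputs": {
            # The single input point: Service 1 imports this ARN.
            "TopicArn": {"Value": topic_arn, "Export": {"Name": export_name}}
        },
    }


print(json.dumps(messaging_stack_template(), indent=2))
```

Service 1's own template could then pick up the ARN with `Fn::ImportValue` on the same export name, and CloudFormation works out the ordering of Queue, Topic, and policy within one deployment.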

The key piece to this story that enables us to go the separated-component route is that there are no current plans from the development team to expand the final endpoint of the asynchronous notifications to other services, much less those not owned by the team.

This is the ideal of a self-contained system. "It just works" because the team owns all the components. That means they can tailor their resource architecture to their own workload and processes.

The team is now pursuing this option, all of us relieved at the prospect of easier maintainability.

Just because the purpose of the Topic is to be the 'notification center' for Service 1 does not mean it has to be dogmatically attached to Service 1 for the sake of Ownership Consistency (TM), especially if that makes the team's life harder in maintaining the stack, both now and in the future.

The team wants to deliver. The team is always looking to strike a balance between quick iteration and reducing technical debt for later.

Ownership rules keep things simple, until they don't. In those cases, break ownership rules to make your life easier and deliver faster.