
Matt Wyskiel

Architecture is Organizational

August 27, 2024

A few months ago I was faced with an architectural question from a team I was helping with their cloud deployments.

They were developing a data processing solution that needed to hook into a shared S3 bucket owned by another team, which was already integrated into that team's own system:

[Diagram: s3-notif-architecture-new-solution]

My organization has empowered teams that build and deploy their own infrastructure alongside their code, so I encouraged them to take ownership of their solution as far as possible.

In their case, this meant adding the Infrastructure-as-Code for the S3 Notification to their own repository, deployed at their own cadence. Something like the following:

---
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  NotificationBucket:
    Type: String
    Description: S3 bucket that's used for the Lambda event notification

Resources:
  S3NotificationLambdaHandler:
    Type: 'AWS::Lambda::Function'
    Properties:
      ...

  LambdaInvokePermission:
    Type: 'AWS::Lambda::Permission'
    Properties:
      FunctionName: !GetAtt S3NotificationLambdaHandler.Arn
      Action: 'lambda:InvokeFunction'
      Principal: s3.amazonaws.com
      SourceAccount: !Ref 'AWS::AccountId'
      SourceArn: !Sub 'arn:aws:s3:::${NotificationBucket}'

  LambdaIAMRole:
    Type: 'AWS::IAM::Role'
    Properties:
      ...
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - 's3:GetBucketNotification'
                  - 's3:PutBucketNotification'
                Resource: !Sub 'arn:aws:s3:::${NotificationBucket}'
              - ...

  CustomResourceLambdaFunction:
    Type: 'AWS::Lambda::Function'
    Properties:
      Handler: index.lambda_handler
      Role: !GetAtt LambdaIAMRole.Arn
      Code:
        ZipFile: |
          <code that does CRUD on adding the S3 Notification for the external bucket>
      Runtime: python3.9
      Timeout: 50

  LambdaTrigger:
    Type: 'Custom::LambdaTrigger'
    DependsOn: LambdaInvokePermission
    Properties:
      ServiceToken: !GetAtt CustomResourceLambdaFunction.Arn
      LambdaArn: !GetAtt S3NotificationLambdaHandler.Arn
      Bucket: !Ref NotificationBucket
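
For illustration, here is a minimal sketch of the read-merge-write logic a custom resource like the one above has to perform. It is not the code the team deployed: the function names and the `notification_id` parameter are hypothetical. It assumes a boto3 S3 client, whose `put_bucket_notification_configuration` call replaces the bucket's entire notification configuration, so the handler has to read the current configuration first and preserve entries it does not own.

```python
"""Read-merge-write for a notification on a bucket that another team owns (sketch)."""
import copy

LAMBDA_KEY = "LambdaFunctionConfigurations"


def upsert_lambda_notification(config, notification_id, lambda_arn, events):
    """Return a copy of `config` with our Lambda notification added or replaced."""
    merged = copy.deepcopy(config)
    # boto3 responses carry ResponseMetadata, which is not valid input for the put call.
    merged.pop("ResponseMetadata", None)
    # Keep every entry we do not own; drop any older version of our own entry.
    entries = [c for c in merged.get(LAMBDA_KEY, []) if c.get("Id") != notification_id]
    entries.append({
        "Id": notification_id,
        "LambdaFunctionArn": lambda_arn,
        "Events": list(events),
    })
    merged[LAMBDA_KEY] = entries
    return merged


def remove_lambda_notification(config, notification_id):
    """Return a copy of `config` without our Lambda notification."""
    merged = copy.deepcopy(config)
    merged.pop("ResponseMetadata", None)
    remaining = [c for c in merged.get(LAMBDA_KEY, []) if c.get("Id") != notification_id]
    if remaining:
        merged[LAMBDA_KEY] = remaining
    else:
        merged.pop(LAMBDA_KEY, None)
    return merged


def apply_to_bucket(s3, bucket, request_type, notification_id, lambda_arn, events):
    """Apply a custom resource request ("Create", "Update" or "Delete") to the bucket.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    current = s3.get_bucket_notification_configuration(Bucket=bucket)
    if request_type == "Delete":
        desired = remove_lambda_notification(current, notification_id)
    else:
        desired = upsert_lambda_notification(current, notification_id, lambda_arn, events)
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=desired
    )
    return desired
```

Notice that this only protects the other team's entries from us. Nothing here protects our entry from a write made by the bucket's owner.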

Knowing in the back of my mind that Custom Resources are almost always a hack, we nonetheless deployed to Production, and it worked smoothly.

That is, until a few weeks later when the notifications suddenly stopped flowing.

We investigated immediately and traced it to a quirk in CloudFormation. Because the external bucket is also deployed via CloudFormation, its stack treats changes made outside of its template as 'drift', even when they are deliberate and support production-critical functions, and an update from that stack deleted our notification configuration.
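
The failure mode can be sketched with a toy model. This is a deliberate simplification, not CloudFormation's actual implementation: it assumes that an update from the owning stack re-applies the bucket's notification configuration exactly as the template declares it. The function names and entry Ids are hypothetical.

```python
"""Toy model of an owner's declarative update overwriting an external change."""
import copy


def owner_stack_update(template_config):
    """The live configuration becomes exactly what the owner's template declares."""
    return copy.deepcopy(template_config)


def entries_lost_on_update(live_config, template_config):
    """Return the Ids of live entries that the owner's template does not declare."""
    lost = []
    for kind, entries in live_config.items():
        declared = {entry["Id"] for entry in template_config.get(kind, [])}
        lost.extend(entry["Id"] for entry in entries if entry["Id"] not in declared)
    return sorted(lost)


if __name__ == "__main__":
    # What the bucket owner's template declares.
    owner_template = {"TopicConfigurations": [{"Id": "owner-topic"}]}
    live = owner_stack_update(owner_template)

    # The other team adds its hook from outside the owner's template.
    live.setdefault("LambdaFunctionConfigurations", []).append({"Id": "new-solution-hook"})
    print("would be lost:", entries_lost_on_update(live, owner_template))

    # The owner's next update knows nothing about the hook, so it disappears.
    live = owner_stack_update(owner_template)
    print("after update:", live)
```

A check along the lines of `entries_lost_on_update` is one way the owning team could spot externally added entries before an update, but it still requires that team to know the other team's hook exists.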

The good news is that the workaround is just a redeploy from the hooking-in team's side, which gets everything back up and running.

The bad news is that time-sensitive communication to coordinate production releases cancels out the benefits of a model built on independent, empowered teams.

We ended up resolving this issue by moving responsibility for the notification hook to the team that owns the original bucket.

CloudFormation makes this much easier, which is a sign that it was the right call:

---
AWSTemplateFormatVersion: 2010-09-09
Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      ...
      NotificationConfiguration:
        TopicConfigurations:
          - Topic: 'arn:aws:sns:us-east-1:123456789012:TestTopic'
            Event: 's3:ObjectCreated:*'
Outputs:
  ...

In another case, there was a shared SNS Topic which a team wanted to subscribe to and handle with a Lambda function:

[Diagram: sns-existing-with-new]

This is much easier to do in a decentralized way with CloudFormation:

---
AWSTemplateFormatVersion: '2010-09-09'

Parameters:
  Topic:
    Type: String

Resources:
  Function:
    Type: AWS::Lambda::Function
    Properties:
      ...

  FunctionInvokePermission:
    Type: 'AWS::Lambda::Permission'
    Properties:
      Action: 'lambda:InvokeFunction'
      FunctionName: !Ref Function
      Principal: sns.amazonaws.com

  MySubscription:
    Type: AWS::SNS::Subscription
    Properties:
      # Lambda subscriptions need the function ARN; !Ref on a function returns only its name
      Endpoint: !GetAtt Function.Arn
      Protocol: lambda
      TopicArn: !Ref Topic

But wouldn't you know it, the same situation happened!

The topic was updated from its own stack, kicking out the externally-defined Subscription.

The solution was much the same: transferring ownership of the subscription to the owner of the Topic.

---
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  example-lambda-sns

  Example CloudFormation template to subscribe a lambda to an SNS Topic.
Resources:
  Topic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: 'example-sns-topic'
      TopicName: 'example-sns-topic'
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt Function.Arn

  Function:
    Type: AWS::Lambda::Function
    Properties:
      ...

  FunctionInvokePermission:
    Type: 'AWS::Lambda::Permission'
    Properties:
      Action: 'lambda:InvokeFunction'
      FunctionName: !Ref Function
      Principal: sns.amazonaws.com

  ExampleTopicPolicy:
    Type: 'AWS::SNS::TopicPolicy'
    Properties:
      ...

However, this got me thinking critically about ownership and architecture, and how both depend on how people are working together in the first place.

So, a few points on that.

There's a fine line between independence and silos.

Teams should be able to maintain their work with as few dependencies as possible. There is nothing more time-sucking than waiting on another team to deliver something you need to get your work done. The microservice revolution did wonders to enable teams to deliver much more in much less time. Further, in a world of continuous integration and delivery, any need to coordinate deployments and configurations synchronously with another team is a regression into a past we don't want to return to.

However, some architectures, especially those based on shared resources like Buckets or Topics, will require cross-team communication. As it turns out, the infrastructure will enforce that, however hard you try to work around it.

Be prepared at the design stage for the side effects of ownership decisions.

Shared resources are useful for following the Don't Repeat Yourself rule (and believe me, refactoring a project to make that a reality is one of my favorite tasks), but creating them inherently creates a dependency that has to be managed, and that risk must be accepted up front.

The second you build a shared resource that other teams start using, it is on you to communicate changes to that resource and ensure that those other teams' projects do not break because you made a change that they didn't account for.

A secrets rotation might be a more obvious example of this, since the breakage is immediate if dependent services aren't updated. However, as I've shown here, even resources that sit further in the background, like S3 bucket notifications, are subject to the same rule.

Organizations need to make decisions that best suit their needs. Everyone needs to be clear about what using shared resources will entail in terms of responsibilities, and communicate well to ensure that production and business-critical systems don't break unexpectedly. These factors need to be considered, and decisions made, at the design stage, so that everyone is on the same page for the long term of the project.