
Additive Infrastructure Changes

The philosophy and CDK implementation of layering new infrastructure alongside existing resources: add first, validate on test domains, cut over DNS, then decommission.

DATE: APR.07.2025
READ: 14 MIN

Infrastructure migrations are among the most anxiety-inducing changes in cloud engineering. A misconfigured load balancer swap can take down production. An accidental Route53 record deletion can leave customers staring at DNS errors for hours while TTLs expire. A CloudFormation rollback touching stateful resources can corrupt data.

After managing AWS CDK infrastructure for ECS Fargate services across multiple environments, we adopted a strict philosophy: every infrastructure change should be additive. New resources are created alongside existing ones. Nothing is modified or removed in the same deployment. Validation happens on test domains before any DNS cutover. Rollback is as simple as reverting the CDK code and redeploying.

This post walks through exactly how we layered API Gateway with VPC Links and internal ALBs on top of existing public-facing ALBs, without touching a single existing resource.


The four principles

Principle 1: Never modify existing resources in the same change that adds new ones

When you look at your cdk diff output, you should see only additions. Zero modifications. Zero deletions. If your diff shows a change to an existing security group, listener, or target group — stop and redesign.

++
$ cdk diff ApiGatewayStack

Stack ApiGatewayStack
Resources
[+] AWS::ElasticLoadBalancingV2::LoadBalancer InternalALB
[+] AWS::ElasticLoadBalancingV2::Listener InternalALB/HTTPSListener
[+] AWS::ElasticLoadBalancingV2::TargetGroup InternalTG
[+] AWS::EC2::SecurityGroup InternalALBSG
[+] AWS::ApiGatewayV2::Api HttpApi
[+] AWS::ApiGatewayV2::VpcLink VpcLink
[+] AWS::ApiGatewayV2::Integration ALBIntegration
[+] AWS::ApiGatewayV2::Route DefaultRoute
[+] AWS::ApiGatewayV2::Stage ProdStage
[+] AWS::Route53::RecordSet GwTestRecord
++

Every line starts with [+]. That is the goal.
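To make this check harder to skip, the diff text can be scanned by a small script. The following is a minimal sketch, not part of our pipeline: it assumes the diff has been captured as text in the format shown above, and the function name `non_additive_lines` is hypothetical.

```python
import re
from typing import List

# Matches top-level resource lines such as
# "[+] AWS::EC2::SecurityGroup InternalALBSG".
# The markers (+, -, ~) follow the diff excerpts shown in this post.
RESOURCE_LINE = re.compile(r"^\[([+\-~])\]\s+(.+)$")


def non_additive_lines(diff_text: str) -> List[str]:
    """Return every resource line whose marker is not [+]."""
    offenders = []
    for raw in diff_text.splitlines():
        line = raw.strip()
        match = RESOURCE_LINE.match(line)
        if match and match.group(1) != "+":
            offenders.append(line)
    return offenders


sample_diff = """Stack ApiGatewayStack
Resources
[+] AWS::ElasticLoadBalancingV2::LoadBalancer InternalALB
[+] AWS::ApiGatewayV2::Api HttpApi
"""

# An empty list means the diff is purely additive.
print(non_additive_lines(sample_diff))
```

In a CI job, a non-empty result would be the signal to fail the build and send the change back for redesign.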

Principle 2: Validate on test domains before DNS cutover

Create gw-* prefixed subdomains for every service. If the production domain is api.example.com, create gw-api.example.com pointing to the new API Gateway endpoint. Run full integration tests against the new path without touching production traffic.

++
route53.ARecord(
    self, "GwTestRecord",
    zone=hosted_zone,
    record_name=f"gw-{service_name}",
    target=route53.RecordTarget.from_alias(
        targets.ApiGatewayv2DomainProperties(
            regional_domain_name=custom_domain.attr_regional_domain_name,
            regional_hosted_zone_id=custom_domain.attr_regional_hosted_zone_id,
        )
    ),
)
++
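Once the test record resolves, a simple smoke test can request the same paths on both hostnames and report any differences. This is a rough sketch using only the standard library; the helper names (`gw_test_hostname`, `fetch_status`, `compare_paths`) are hypothetical, and comparing status codes alone is an assumption that real integration tests would go well beyond.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Tuple


def gw_test_hostname(production_hostname: str) -> str:
    """Apply the gw-* convention: api.example.com -> gw-api.example.com."""
    label, _, parent = production_hostname.partition(".")
    if not label or not parent:
        raise ValueError(f"expected a fully qualified name, got {production_hostname!r}")
    return f"gw-{label}.{parent}"


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, including 4xx and 5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def compare_paths(
    production_hostname: str,
    paths: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, Tuple[int, int]]:
    """Return {path: (production_status, gw_status)} for paths that disagree."""
    test_hostname = gw_test_hostname(production_hostname)
    mismatches = {}
    for path in paths:
        prod_status = fetch(f"https://{production_hostname}{path}")
        gw_status = fetch(f"https://{test_hostname}{path}")
        if prod_status != gw_status:
            mismatches[path] = (prod_status, gw_status)
    return mismatches
```

Calling `compare_paths("api.example.com", ["/health"])` would hit both hostnames over the network. The `fetch` parameter exists so the comparison logic can be exercised offline with a stub.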

Principle 3: Dual-register services in both old and new target groups

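An ECS service can be registered with multiple target groups simultaneously. Register the existing ECS service with both the original public ALB’s target group and the new internal ALB’s target group, so both paths serve traffic at the same time. This is the one step that does touch an existing resource: the extra target group is added to the service’s load balancer configuration, so expect the ECS service to show up as a modification in the diff of whichever stack owns it, and review that change on its own.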

++
# The existing public ALB target group attachment is untouched
# We ADD a second target group without modifying the first
fargate_service.attach_to_application_target_group(internal_target_group)
++
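After the deployment, it is worth confirming that the service really lists both target groups. The sketch below assumes the JSON output of `aws ecs describe-services` has been saved or captured as a string, and reads the `services[].loadBalancers[].targetGroupArn` fields from it. The function names and the short placeholder ARNs in the example are hypothetical.

```python
import json
from typing import Iterable, List, Set


def registered_target_groups(describe_services_json: str, service_name: str) -> Set[str]:
    """Collect the target group ARNs attached to one service."""
    payload = json.loads(describe_services_json)
    for service in payload.get("services", []):
        if service.get("serviceName") == service_name:
            return {
                lb["targetGroupArn"]
                for lb in service.get("loadBalancers", [])
                if "targetGroupArn" in lb
            }
    raise LookupError(f"service {service_name!r} not found in describe-services output")


def missing_target_groups(
    describe_services_json: str,
    service_name: str,
    expected_arns: Iterable[str],
) -> List[str]:
    """Return the expected ARNs that the service is NOT registered with."""
    registered = registered_target_groups(describe_services_json, service_name)
    return sorted(set(expected_arns) - registered)


# Placeholder data shaped like the CLI output; real ARNs are much longer.
example_output = json.dumps({
    "services": [{
        "serviceName": "api",
        "loadBalancers": [
            {"targetGroupArn": "arn:public-tg", "containerName": "app", "containerPort": 8000},
            {"targetGroupArn": "arn:internal-tg", "containerName": "app", "containerPort": 8000},
        ],
    }]
})

# An empty list means the service is dual-registered as intended.
print(missing_target_groups(example_output, "api", ["arn:public-tg", "arn:internal-tg"]))
```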

Principle 4: Plan for rollback — reverting code removes only new resources

Because we only added new resources and never modified existing ones, rollback is trivial:

++
# Assumes the additive change is the most recent commit
git revert HEAD
cdk deploy ApiGatewayStack
# CloudFormation deletes the resources that commit added and leaves everything else as it was
++

The full additive migration lifecycle

The migration proceeds through four phases, each a separate deployment, each independently reversible:

Phase 1: Add new infrastructure. Deploy the internal ALB, VPC Link, and API Gateway, dual-register the ECS services, and create test DNS records. Only [+] entries in the changeset.

Phase 2: Validate on test domains. Run full integration tests against gw-api.example.com. Compare latency, error rates, and throughput against the existing path. Fix any issues by modifying only the new resources.

Phase 3: DNS cutover. Update Route53 to point api.example.com at the API Gateway endpoint. Monitor production traffic. If anything looks wrong, revert DNS; the old path is still fully functional.

Phase 4: Decommission old resources. Remove old Route53 records, then the old ALB, in separate deployments with a cooling-off period between each.
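The Phase 2 comparison is easier to act on when "looks about the same" is turned into explicit thresholds. Below is a minimal sketch of such a gate. The class and function names are hypothetical, the default tolerances (20% on p95 latency, 0.1 percentage points on error rate) are illustrative assumptions rather than values we measured, and throughput is left out for brevity.

```python
import statistics
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PathSamples:
    """Measurements collected from one path (old or new) over the same window."""
    latencies_ms: Sequence[float]
    error_count: int
    request_count: int

    @property
    def error_rate(self) -> float:
        if self.request_count <= 0:
            raise ValueError("request_count must be positive")
        return self.error_count / self.request_count

    @property
    def p95_ms(self) -> float:
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        return statistics.quantiles(self.latencies_ms, n=20)[-1]


def regressions(
    old: PathSamples,
    new: PathSamples,
    max_latency_ratio: float = 1.2,
    max_error_rate_delta: float = 0.001,
) -> List[str]:
    """Describe every way the new path is worse than the old one beyond tolerance."""
    problems = []
    if new.p95_ms > old.p95_ms * max_latency_ratio:
        problems.append(f"p95 latency {new.p95_ms:.1f} ms vs {old.p95_ms:.1f} ms")
    if new.error_rate > old.error_rate + max_error_rate_delta:
        problems.append(f"error rate {new.error_rate:.4f} vs {old.error_rate:.4f}")
    return problems
```

An empty list from `regressions(old, new)` would be the condition for moving on to Phase 3. Anything else is fixed on the new resources while production keeps using the old path.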


CDK implementation: the additive stack

++
from aws_cdk import (
    Stack, Duration,
    aws_ec2 as ec2,
    aws_elasticloadbalancingv2 as elbv2,
    aws_apigatewayv2 as apigwv2,
    aws_route53 as route53,
    aws_route53_targets as targets,
    aws_certificatemanager as acm,
    CfnOutput,
)
from constructs import Construct


class AdditiveApiGatewayStack(Stack):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        vpc: ec2.IVpc,
        existing_ecs_service,
        hosted_zone: route53.IHostedZone,
        certificate: acm.ICertificate,
        service_name: str,
        service_port: int = 8000,
        **kwargs,
    ) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # New internal ALB — does not touch the existing public ALB
        internal_alb_sg = ec2.SecurityGroup(
            self, "InternalALBSG",
            vpc=vpc,
            description=f"Internal ALB SG for {service_name}",
            allow_all_outbound=True,
        )

        internal_alb = elbv2.ApplicationLoadBalancer(
            self, "InternalALB",
            vpc=vpc,
            internet_facing=False,
            security_group=internal_alb_sg,
        )

        internal_tg = elbv2.ApplicationTargetGroup(
            self, "InternalTG",
            vpc=vpc,
            port=service_port,
            protocol=elbv2.ApplicationProtocol.HTTP,
            target_type=elbv2.TargetType.IP,
            health_check=elbv2.HealthCheck(
                path="/health",
                interval=Duration.seconds(30),
                timeout=Duration.seconds(5),
                healthy_threshold_count=2,
                unhealthy_threshold_count=3,
            ),
        )

        # Keep a handle on the listener so the integration can reference it directly
        https_listener = internal_alb.add_listener(
            "HTTPSListener",
            port=443,
            certificates=[certificate],
            default_target_groups=[internal_tg],
        )

        # Dual-register: attach existing service to the new target group
        # The existing public ALB registration is NOT modified
        existing_ecs_service.attach_to_application_target_group(internal_tg)

        # VPC Link for API Gateway
        vpc_link = apigwv2.CfnVpcLink(
            self, "VpcLink",
            name=f"{service_name}-vpc-link",
            subnet_ids=[s.subnet_id for s in vpc.private_subnets],
            security_group_ids=[internal_alb_sg.security_group_id],
        )

        # HTTP API Gateway
        http_api = apigwv2.CfnApi(
            self, "HttpApi",
            name=f"{service_name}-api",
            protocol_type="HTTP",
        )

        integration = apigwv2.CfnIntegration(
            self, "ALBIntegration",
            api_id=http_api.ref,
            integration_type="HTTP_PROXY",
            integration_uri=https_listener.listener_arn,
            integration_method="ANY",
            connection_type="VPC_LINK",
            connection_id=vpc_link.ref,
            payload_format_version="1.0",
            # The listener is HTTPS, so tell API Gateway to use TLS and which
            # name to verify. Assumes `certificate` covers this hostname.
            tls_config=apigwv2.CfnIntegration.TlsConfigProperty(
                server_name_to_verify=f"gw-{service_name}.example.com",
            ),
        )

        apigwv2.CfnRoute(
            self, "DefaultRoute",
            api_id=http_api.ref,
            route_key="$default",
            target=f"integrations/{integration.ref}",
        )

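        default_stage = apigwv2.CfnStage(
            self, "ProdStage",
            api_id=http_api.ref,
            stage_name="$default",
            auto_deploy=True,
        )

        # Test domain only — production domain is NOT modified
        custom_domain = apigwv2.CfnDomainName(
            self, "GwDomain",
            domain_name=f"gw-{service_name}.example.com",
            domain_name_configurations=[
                apigwv2.CfnDomainName.DomainNameConfigurationProperty(
                    certificate_arn=certificate.certificate_arn,
                    endpoint_type="REGIONAL",
                )
            ],
        )

        api_mapping = apigwv2.CfnApiMapping(
            self, "GwApiMapping",
            api_id=http_api.ref,
            domain_name=custom_domain.ref,
            stage="$default",
        )
        # The stage is named by a literal string above, so CloudFormation cannot
        # infer the ordering; declare it so the mapping is created after the stage
        # (older CDK v2 releases spell this add_depends_on)
        api_mapping.add_dependency(default_stage)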

        route53.ARecord(
            self, "GwTestRecord",
            zone=hosted_zone,
            record_name=f"gw-{service_name}",
            target=route53.RecordTarget.from_alias(
                targets.ApiGatewayv2DomainProperties(
                    regional_domain_name=custom_domain.attr_regional_domain_name,
                    regional_hosted_zone_id=custom_domain.attr_regional_hosted_zone_id,
                )
            ),
        )

        CfnOutput(self, "TestEndpoint",
            value=f"https://gw-{service_name}.example.com")
++

Validation: inspect the changeset before deploying

Two validation steps before any deployment:

++
# Step 1: Synthesize and check for unexpected references to existing resources
# (only meaningful if your existing resources carry "Existing" in their logical IDs)
cdk synth ApiGatewayStack > /tmp/template.yaml
grep -c "Existing" /tmp/template.yaml  # Expected: 0

# Step 2: Create a changeset without executing it
cdk deploy ApiGatewayStack --no-execute
# Review in CloudFormation console — every action should be "Add"
++

Watch for these red flags in the changeset:

+--------------------+-------------+--------------------+
| Action             | Risk Level  | What to do         |
+--------------------+-------------+--------------------+
| Add                | Safe        | Proceed            |
+--------------------+-------------+--------------------+
| Modify (no         | Medium risk | Investigate what's |
| replacement)       |             | changing           |
+--------------------+-------------+--------------------+
| Modify             | High risk   | Redesign           |
| (conditional       |             |                    |
| replacement)       |             |                    |
+--------------------+-------------+--------------------+
| Modify             | Critical    | Stop and redesign  |
| (replacement)      |             |                    |
+--------------------+-------------+--------------------+
| Remove             | Critical    | Stop and redesign  |
+--------------------+-------------+--------------------+

If you see anything other than “Add”, stop. The deployment design needs to change.
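The table can also be applied mechanically. The sketch below assumes the change set has been exported with `aws cloudformation describe-change-set` and that its JSON is available as a string. It reads each entry's `ResourceChange.Action` and `ResourceChange.Replacement` fields and maps them onto the risk levels from the table. The function names are hypothetical.

```python
import json
from typing import List, Tuple


def classify(resource_change: dict) -> Tuple[str, str]:
    """Map one ResourceChange entry to (risk level, what to do)."""
    action = resource_change.get("Action")
    replacement = resource_change.get("Replacement")
    if action == "Add":
        return ("Safe", "Proceed")
    if action == "Remove":
        return ("Critical", "Stop and redesign")
    if action == "Modify":
        if replacement == "True":
            return ("Critical", "Stop and redesign")
        if replacement == "Conditional":
            return ("High risk", "Redesign")
        return ("Medium risk", "Investigate what's changing")
    # Anything else (for example an import) is outside the table, so flag it.
    return ("Unknown", "Investigate what's changing")


def review_change_set(describe_change_set_json: str) -> List[Tuple[str, str, str]]:
    """Return (logical ID, risk level, advice) for every change that is not a plain Add."""
    payload = json.loads(describe_change_set_json)
    findings = []
    for change in payload.get("Changes", []):
        resource_change = change.get("ResourceChange")
        if not resource_change:
            continue
        risk, advice = classify(resource_change)
        if risk != "Safe":
            findings.append((resource_change.get("LogicalResourceId", "?"), risk, advice))
    return findings
```

An empty result from `review_change_set` corresponds to a change set in which every action is "Add".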


The cost tradeoff

Running both old and new infrastructure simultaneously has a cost. During a two-week validation window, you pay for:

  • An additional internal ALB (~$16/month)
  • Additional API Gateway requests (minimal for test traffic)
  • VPC Link (no additional charge)

For our environment, the overlap period cost approximately $20-30. Compare that to the cost of even 5 minutes of production downtime, and the math is straightforward.
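For a rough sense of scale, the fixed part of that bill can be prorated from the monthly ALB figure above. This sketch takes the ~$16/month number from the list, assumes a 30-day month, and ignores request-based charges. The function name is hypothetical, and the two durations are simply the two-week validation window and a six-week overall timeline used as example inputs.

```python
ALB_MONTHLY_USD = 16.0   # approximate figure quoted in this post
DAYS_PER_MONTH = 30      # simplifying assumption


def alb_overlap_cost(overlap_days: int, monthly_usd: float = ALB_MONTHLY_USD) -> float:
    """Prorated cost of running one extra ALB for the given number of days."""
    return monthly_usd * overlap_days / DAYS_PER_MONTH


for days in (14, 42):
    print(f"{days} days of overlap: ${alb_overlap_cost(days):.2f}")
# 14 days of overlap: $7.47
# 42 days of overlap: $22.40
```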


The decommission phase

Once the new path is validated and DNS has cut over, remove old resources in stages:

++
Week 1-2: Both paths active, test domain validation
Week 3:   DNS cutover to API Gateway path
Week 4:   Monitor production traffic on new path
Week 5:   Remove old Route53 records (keep old ALB as fallback)
Week 6:   Remove old ALB and target groups
++

Each removal is its own deployment. Each is independently reversible. At week 5, if something looks wrong, re-add the Route53 records and you’re back to dual-path in minutes.

++
# Week 5 deployment: only deletions of old DNS records
cdk diff DecommissionPhase1

Stack DecommissionPhase1
Resources
[-] AWS::Route53::RecordSet OldApiRecord
++

When we discovered a bug

During validation of one VPC Link, we discovered a misconfigured security group — the internal ALB’s SG wasn’t allowing traffic from the VPC Link. The fix was straightforward:

++
# Only needed on the new resources
internal_alb_sg.add_ingress_rule(
    ec2.Peer.ipv4(vpc.vpc_cidr_block),
    ec2.Port.tcp(443),
    "VPC Link to internal ALB"
)
++

The changeset for that fix touched only the new resources. The existing production path was completely unaffected throughout the entire debugging session. That’s the value of this approach — you can iterate on the new infrastructure without any production risk.

The mantra: add, validate, cut over, decommission. Never skip a step. Never combine steps into a single deployment. And always verify that cdk diff shows only [+] entries before deploying new infrastructure.