Additive Infrastructure Changes
The philosophy and CDK implementation of layering new infrastructure alongside existing resources: add first, validate on test domains, cut over DNS, then decommission.
- DATE: APR.07.2025
- READ: 14 MIN
Infrastructure migrations are among the most anxiety-inducing changes in cloud engineering. A misconfigured load balancer swap can take down production. An accidental Route53 record deletion can leave customers staring at DNS errors for hours while TTLs expire. A CloudFormation rollback touching stateful resources can corrupt data.
After managing AWS CDK infrastructure for ECS Fargate services across multiple environments, we adopted a strict philosophy: every infrastructure change should be additive. New resources are created alongside existing ones. Nothing is modified or removed in the same deployment. Validation happens on test domains before any DNS cutover. Rollback is as simple as reverting the CDK code and redeploying.
This post walks through exactly how we layered API Gateway with VPC Links and internal ALBs on top of existing public-facing ALBs, without touching a single existing resource.
The four principles
Principle 1: Never modify existing resources in the same change that adds new ones
When you look at your cdk diff output, you should see only additions. Zero modifications. Zero deletions. If your diff shows a change to an existing security group, listener, or target group — stop and redesign.
$ cdk diff ApiGatewayStack
Stack ApiGatewayStack
Resources
[+] AWS::ElasticLoadBalancingV2::LoadBalancer InternalALB
[+] AWS::ElasticLoadBalancingV2::Listener InternalALB/HTTPSListener
[+] AWS::ElasticLoadBalancingV2::TargetGroup InternalTG
[+] AWS::EC2::SecurityGroup InternalALBSG
[+] AWS::ApiGatewayV2::Api HttpApi
[+] AWS::ApiGatewayV2::VpcLink VpcLink
[+] AWS::ApiGatewayV2::Integration ALBIntegration
[+] AWS::ApiGatewayV2::Route DefaultRoute
[+] AWS::ApiGatewayV2::Stage ProdStage
[+] AWS::Route53::RecordSet GwTestRecord
Every line starts with [+]. That is the goal.
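The "only [+]" rule can be enforced mechanically rather than by eye. The following is a minimal sketch, not part of the original tooling: it assumes resource lines in the `cdk diff` output begin with `[+]`, `[~]`, or `[-]` as in the listing above, and the function name `non_additive_changes` is hypothetical.

```python
import re
from typing import List

# Matches resource lines in `cdk diff` output such as
# "[+] AWS::EC2::SecurityGroup InternalALBSG".
# The marker is "+" (add), "~" (modify), or "-" (remove).
RESOURCE_LINE = re.compile(r"^\s*\[([+~-])\]\s+\S+")


def non_additive_changes(diff_text: str) -> List[str]:
    """Return every resource line in the diff that is not a pure addition."""
    offenders = []
    for line in diff_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match and match.group(1) != "+":
            offenders.append(line.strip())
    return offenders
```

In a pipeline, one option is to capture the diff text (some CDK versions may write it to stderr, so capturing both streams is the safer assumption), pass it to this function, and fail the job if the returned list is non-empty.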
Principle 2: Validate on test domains before DNS cutover
Create gw-* prefixed subdomains for every service. If the production domain is api.example.com, create gw-api.example.com pointing to the new API Gateway endpoint. Run full integration tests against the new path without touching production traffic.
route53.ARecord(
    self, "GwTestRecord",
    zone=hosted_zone,
    record_name=f"gw-{service_name}",
    target=route53.RecordTarget.from_alias(
        targets.ApiGatewayv2DomainProperties(
            regional_domain_name=custom_domain.attr_regional_domain_name,
            regional_hosted_zone_id=custom_domain.attr_regional_hosted_zone_id,
        )
    ),
)
Principle 3: Dual-register services in both old and new target groups
ECS services can be registered with multiple target groups simultaneously. Register existing ECS tasks with both the original public ALB’s target group and the new internal ALB’s target group. Both paths work at the same time.
# The existing public ALB target group attachment is untouched
# We ADD a second target group without modifying the first
fargate_service.attach_to_application_target_group(internal_target_group)
Principle 4: Plan for rollback; reverting code removes only new resources
Because we only added new resources and never modified existing ones, rollback is trivial:
git revert HEAD
cdk deploy ApiGatewayStack
# CloudFormation deletes all new resources, leaves everything else identical
The full additive migration lifecycle
The migration proceeds through four phases, each a separate deployment, each independently reversible:
Phase 1: Add new infrastructure
Deploy internal ALB, VPC Link, API Gateway, dual-register ECS services, create test DNS records. Only [+] entries in the changeset.
Phase 2: Validate on test domains
Run full integration tests against gw-api.example.com. Compare latency, error rates, throughput against the existing path. Fix any issues by modifying only the new resources.
Phase 3: DNS cutover
Update Route53 to point api.example.com at the API Gateway endpoint. Monitor production traffic. If anything looks wrong, revert DNS; the old path is still fully functional.
Phase 4: Decommission old resources
Remove old Route53 records, then the old ALB, in separate deployments with a cooling-off period between each.
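The Phase 2 comparison between the old and new paths can be reduced to a simple pass/fail check. The sketch below is illustrative only: it assumes you have already collected (status code, latency) samples from both api.example.com and gw-api.example.com with whatever load tool you use, and the thresholds (`max_latency_ratio`, `max_error_delta`) and all names are hypothetical defaults, not values from our environment. Throughput is left out for brevity.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List, Tuple

Sample = Tuple[int, float]  # (HTTP status code, latency in milliseconds)


@dataclass
class PathStats:
    error_rate: float       # fraction of responses with status >= 500
    p95_latency_ms: float   # 95th percentile latency


def summarize(samples: List[Sample]) -> PathStats:
    """Reduce raw samples to an error rate and p95 latency (needs >= 2 samples)."""
    errors = sum(1 for status, _ in samples if status >= 500)
    latencies = [latency for _, latency in samples]
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies, n=20)[-1]
    return PathStats(error_rate=errors / len(samples), p95_latency_ms=p95)


def new_path_is_acceptable(
    old: PathStats,
    new: PathStats,
    max_latency_ratio: float = 1.2,   # hypothetical: allow up to 20% slower p95
    max_error_delta: float = 0.001,   # hypothetical: allow +0.1 percentage points
) -> bool:
    """True when the new path is within tolerance of the existing path."""
    latency_ok = new.p95_latency_ms <= old.p95_latency_ms * max_latency_ratio
    errors_ok = (new.error_rate - old.error_rate) <= max_error_delta
    return latency_ok and errors_ok
```

Comparing against the existing path, rather than against absolute targets, keeps the check meaningful even when the service itself is slow: the question is only whether the new route makes things worse.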
CDK implementation: the additive stack
from aws_cdk import (
    Stack, Duration,
    aws_ec2 as ec2,
    aws_ecs as ecs,
    aws_elasticloadbalancingv2 as elbv2,
    aws_apigatewayv2 as apigwv2,
    aws_route53 as route53,
    aws_route53_targets as targets,
    aws_certificatemanager as acm,
    CfnOutput,
)
from constructs import Construct


class AdditiveApiGatewayStack(Stack):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        vpc: ec2.IVpc,
        existing_ecs_service: ecs.FargateService,
        hosted_zone: route53.IHostedZone,
        certificate: acm.ICertificate,
        service_name: str,
        service_port: int = 8000,
        **kwargs,
    ) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # New internal ALB: does not touch the existing public ALB
        internal_alb_sg = ec2.SecurityGroup(
            self, "InternalALBSG",
            vpc=vpc,
            description=f"Internal ALB SG for {service_name}",
            allow_all_outbound=True,
        )
        internal_alb = elbv2.ApplicationLoadBalancer(
            self, "InternalALB",
            vpc=vpc,
            internet_facing=False,
            security_group=internal_alb_sg,
        )
        internal_tg = elbv2.ApplicationTargetGroup(
            self, "InternalTG",
            vpc=vpc,
            port=service_port,
            protocol=elbv2.ApplicationProtocol.HTTP,
            target_type=elbv2.TargetType.IP,
            health_check=elbv2.HealthCheck(
                path="/health",
                interval=Duration.seconds(30),
                timeout=Duration.seconds(5),
                healthy_threshold_count=2,
                unhealthy_threshold_count=3,
            ),
        )
        # Keep a handle on the listener so the integration can reference its ARN
        https_listener = internal_alb.add_listener(
            "HTTPSListener",
            port=443,
            certificates=[certificate],
            default_target_groups=[internal_tg],
        )

        # Dual-register: attach existing service to the new target group
        # The existing public ALB registration is NOT modified
        existing_ecs_service.attach_to_application_target_group(internal_tg)

        # VPC Link for API Gateway
        vpc_link = apigwv2.CfnVpcLink(
            self, "VpcLink",
            name=f"{service_name}-vpc-link",
            subnet_ids=[s.subnet_id for s in vpc.private_subnets],
            security_group_ids=[internal_alb_sg.security_group_id],
        )

        # HTTP API Gateway
        http_api = apigwv2.CfnApi(
            self, "HttpApi",
            name=f"{service_name}-api",
            protocol_type="HTTP",
        )
        integration = apigwv2.CfnIntegration(
            self, "ALBIntegration",
            api_id=http_api.ref,
            integration_type="HTTP_PROXY",
            integration_uri=https_listener.listener_arn,
            integration_method="ANY",
            connection_type="VPC_LINK",
            connection_id=vpc_link.ref,
            payload_format_version="1.0",
        )
        apigwv2.CfnRoute(
            self, "DefaultRoute",
            api_id=http_api.ref,
            route_key="$default",
            target=f"integrations/{integration.ref}",
        )
        apigwv2.CfnStage(
            self, "ProdStage",
            api_id=http_api.ref,
            stage_name="$default",
            auto_deploy=True,
        )

        # Test domain only: production domain is NOT modified
        custom_domain = apigwv2.CfnDomainName(
            self, "GwDomain",
            domain_name=f"gw-{service_name}.example.com",
            domain_name_configurations=[
                apigwv2.CfnDomainName.DomainNameConfigurationProperty(
                    certificate_arn=certificate.certificate_arn,
                    endpoint_type="REGIONAL",
                )
            ],
        )
        apigwv2.CfnApiMapping(
            self, "GwApiMapping",
            api_id=http_api.ref,
            domain_name=custom_domain.ref,
            stage="$default",
        )
        route53.ARecord(
            self, "GwTestRecord",
            zone=hosted_zone,
            record_name=f"gw-{service_name}",
            target=route53.RecordTarget.from_alias(
                targets.ApiGatewayv2DomainProperties(
                    regional_domain_name=custom_domain.attr_regional_domain_name,
                    regional_hosted_zone_id=custom_domain.attr_regional_hosted_zone_id,
                )
            ),
        )
        CfnOutput(self, "TestEndpoint",
                  value=f"https://gw-{service_name}.example.com")
Validation: inspect the changeset before deploying
Two validation steps before any deployment:
# Step 1: Synthesize and check for unexpected references to existing resources
cdk synth ApiGatewayStack > /tmp/template.yaml
grep -c "Existing" /tmp/template.yaml # Expected: 0
# Step 2: Create a changeset without executing it
cdk deploy ApiGatewayStack --no-execute
# Review in CloudFormation console: every action should be "Add"
Watch for these red flags in the changeset:
+----------------------------------+-------------+-----------------------------+
| Action                           | Risk level  | What to do                  |
+----------------------------------+-------------+-----------------------------+
| Add                              | Safe        | Proceed                     |
| Modify (no replacement)          | Medium risk | Investigate what's changing |
| Modify (conditional replacement) | High risk   | Redesign                    |
| Modify (replacement)             | Critical    | Stop and redesign           |
| Remove                           | Critical    | Stop and redesign           |
+----------------------------------+-------------+-----------------------------+
If you see anything other than “Add”, stop. The deployment design needs to change.
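The risk table can also be applied programmatically to a changeset created with --no-execute. This is a sketch under assumptions: it takes the `Changes` list in the shape returned by CloudFormation's DescribeChangeSet API (each entry carrying a `ResourceChange` with `Action`, `LogicalResourceId`, and, for modifications, a `Replacement` of "True", "False", or "Conditional"). Fetching that list (for example with boto3) is left out, and the function names are hypothetical. Treating unrecognised actions as critical is a conservative choice of this sketch, not something the table specifies.

```python
from typing import Dict, List


def classify_change(resource_change: Dict) -> str:
    """Map one ResourceChange entry to the risk levels from the table above."""
    action = resource_change.get("Action")
    replacement = resource_change.get("Replacement")
    if action == "Add":
        return "safe"
    if action == "Modify":
        if replacement == "True":
            return "critical"
        if replacement == "Conditional":
            return "high"
        return "medium"
    # "Remove", plus anything unrecognised, is treated as critical.
    return "critical"


def blocking_changes(changes: List[Dict]) -> List[str]:
    """Describe every change in the changeset that is not a plain Add."""
    blockers = []
    for change in changes:
        resource_change = change.get("ResourceChange", {})
        risk = classify_change(resource_change)
        if risk != "safe":
            blockers.append(
                f"{resource_change.get('LogicalResourceId')}: "
                f"{resource_change.get('Action')} ({risk})"
            )
    return blockers
```

An empty result from `blocking_changes` corresponds to the "every action should be Add" condition; anything else is a reason to stop before executing the changeset.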
The cost tradeoff
Running both old and new infrastructure simultaneously has a cost. During a two-week validation window, you pay for:
- An additional internal ALB (~$16/month)
- Additional API Gateway requests (minimal for test traffic)
- VPC Link (no additional charge)
For our environment, the overlap period cost approximately $20-30. Compare that to the cost of even 5 minutes of production downtime, and the math is straightforward.
The decommission phase
Once the new path is validated and DNS has cut over, remove old resources in stages:
Week 1-2: Both paths active, test domain validation
Week 3: DNS cutover to API Gateway path
Week 4: Monitor production traffic on new path
Week 5: Remove old Route53 records (keep old ALB as fallback)
Week 6: Remove old ALB and target groups
Each removal is its own deployment. Each is independently reversible. At week 5, if something looks wrong, re-add the Route53 records and you’re back to dual-path in minutes.
# Week 5 deployment: only deletions of old DNS records
cdk diff DecommissionPhase1
Stack DecommissionPhase1
Resources
[-] AWS::Route53::RecordSet OldApiRecord
When we discovered a bug
During validation of one VPC Link, we discovered a misconfigured security group — the internal ALB’s SG wasn’t allowing traffic from the VPC Link. The fix was straightforward:
# Only needed on the new resources
internal_alb_sg.add_ingress_rule(
    ec2.Peer.ipv4(vpc.vpc_cidr_block),
    ec2.Port.tcp(443),
    "VPC Link to internal ALB"
)
The changeset for that fix touched only the new resources. The existing production path was completely unaffected throughout the entire debugging session. That’s the value of this approach: you can iterate on the new infrastructure without any production risk.
The mantra: add, validate, cut over, decommission. Never skip a step. Never combine steps into a single deployment. And always verify that cdk diff shows only [+] entries before deploying.