Stop Writing Spaghetti Terraform: The Module Architecture That Scales to 50 Teams

I’ve walked into enough platform engineering engagements to recognise the smell. It hits you before you even open a single .tf file. Someone says something like “We have a main.tf that’s getting a bit long”, and when you finally pull up the repo, you’re staring at 4,000 lines of raw Terraform with hardcoded AMI IDs, copy-pasted security group rules, and a variables.tf that’s grown into a philosophical document no one actually reads.

This isn’t a failure of the engineers. They were moving fast, shipping features, doing their jobs. But the architecture (or rather, the absence of one) has turned what should be a force multiplier into a grinding liability. Every new team that onboards copies the existing mess and adds to it. The blast radius of a typo grows. The audit logs become a horror show. And when SOC 2 auditors ask you to demonstrate least-privilege IAM and consistent tagging across all 200 of your cloud resources, someone quietly leaves the room.

I’ve spent years fixing this. What follows is the module architecture I reach for when a platform needs to scale from one team to fifty without collapsing under its own weight.


The Architecture Context: Why Your Flat Terraform Breaks at Scale

Most Terraform repos start flat. Everything in one directory, one state file, one workspace. For a single team, it’s fine. You can hold the entire mental model in your head. But organisational Terraform problems aren’t technical: they’re Conway’s Law problems in disguise.

When you have 10 squads all deploying into the same AWS account using the same Terraform, three things happen simultaneously and inevitably:

  1. State contention. One slow plan blocks everyone. One botched apply corrupts shared state and takes down the whole afternoon.
  2. Config drift. The “Compute Squad” starts tweaking the security group rules to unblock a sprint. Six weeks later, the “Data Squad” has a completely different network topology in the same VPC, and neither knows it.
  3. Compliance collapse. There’s no single place to enforce that every resource has a data_classification tag. Every team does it differently. Your auditor finds 47 S3 buckets with no tags. You spend a week doing archaeology.

The fix is a three-layer module architecture: a Foundation Layer, a Service Module Layer, and a Product Configuration Layer. Think of it as a franchise model: corporate sets the standards (foundation), the kitchen equipment is standardised (service modules), and each franchise location configures its own menu (product config).

Here’s the high-level picture:

┌──────────────────────────────────────────────────────┐
│              PRODUCT CONFIGURATION LAYER             │
│  (Per-team: compute-squad/, data-squad/, etc.)       │
│  Instantiates service modules with team-specific     │
│  variables. No raw resources. Just module calls.     │
└───────────────────┬──────────────────────────────────┘
                    │ calls
┌───────────────────▼──────────────────────────────────┐
│              SERVICE MODULE LAYER                    │
│  (Reusable: terraform-aws-eks/, terraform-aws-rds/)  │
│  Opinionated, versioned, compliance-baked-in.        │
│  Exposes only safe knobs to consumers.               │
└───────────────────┬──────────────────────────────────┘
                    │ references
┌───────────────────▼──────────────────────────────────┐
│              FOUNDATION LAYER                        │
│  (Shared: VPC, IAM roles, KMS keys, S3 backends)     │
│  Separate state. Separate pipeline. Platform-owned.  │
│  Changes here require a CAB review.                  │
└──────────────────────────────────────────────────────┘

Implementation Details

Layer 1: The Foundation – The One Thing You Get Right Once

The foundation is sacred. It contains your VPC, your Transit Gateway attachments, your root KMS keys, your centralised CloudTrail, and your Terraform remote state backends. It is owned by the Platform team. It changes rarely. It changes carefully.

Here’s how I structure the state backend for a multi-team org:

# foundation/backend.tf
# This state is the source of truth for shared infrastructure.
# Encryption at rest and state locking are non-negotiable.

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state-prod"
    key            = "foundation/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123"
    dynamodb_table = "terraform-state-locks"
  }
}

The foundation outputs are published to SSM Parameter Store and consumed from there by everything above it. No team ever touches this state directly.

# foundation/outputs.tf
# Expose only what downstream modules need. Nothing more.

output "vpc_id" {
  description = "The shared production VPC ID"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnets across all AZs"
  value       = module.vpc.private_subnets
}

output "kms_key_arn" {
  description = "Default KMS key for service encryption"
  value       = aws_kms_key.default.arn
  sensitive   = true
}
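
The publishing side of that contract looks something like this. A minimal sketch: the parameter paths match the ones the product layer reads later in this article, while the file and resource names are my own choices.

# foundation/ssm_publish.tf
# Publish foundation outputs to SSM Parameter Store so downstream
# layers can consume them without read access to this state file.

resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/foundation/vpc_id"
  type  = "String"
  value = module.vpc.vpc_id
}

resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/platform/foundation/private_subnet_ids"
  type  = "StringList" # stored comma-separated; consumers split(",", ...)
  value = join(",", module.vpc.private_subnets)
}

resource "aws_ssm_parameter" "kms_key_arn" {
  name   = "/platform/foundation/kms/default_key_arn"
  type   = "SecureString" # encrypted at rest; read with with_decryption = true
  value  = aws_kms_key.default.arn
  key_id = aws_kms_key.default.key_id
}

Because the StringList value is comma-joined, the consumer side reconstructs the list with split(",", ...), which is exactly what the product configuration example does below.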

Architect’s Note

The temptation is to put your foundation and your service modules in the same Terraform state to “keep things simple.” This is how you end up with a `terraform destroy` that accidentally deletes your VPC. Separate state files are not bureaucracy; they are blast radius control. In practice, I enforce this with separate AWS accounts for foundation, platform, and product workloads using AWS Organizations. A rogue `apply` in the Compute Squad’s account literally cannot touch the networking layer. This is why we publish these outputs to SSM Parameter Store: it provides a stable, audit-logged contract between layers without exposing the raw state files.


Layer 2: Service Modules – The Paved Road

This is where your compliance lives permanently. A service module is a versioned, opinionated wrapper around an AWS resource (or set of resources) that bakes in your security and compliance requirements by default. Consumers get a small set of safe knobs. They cannot, for example, accidentally create an unencrypted RDS instance or an internet-facing EKS API endpoint.

Here’s a minimal but complete example for a compliant EKS cluster module:

# modules/terraform-aws-eks-compliant/main.tf

# This module enforces:
# - Private API endpoint only (no public access)
# - Envelope encryption of Kubernetes secrets via KMS
# - IRSA enabled (no node-level IAM roles with broad permissions)
# - Mandatory tagging for cost allocation and compliance

resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false  # Hard-coded. Not a variable. Not negotiable.
    security_group_ids      = [aws_security_group.cluster.id]
  }

  encryption_config {
    provider {
      key_arn = var.kms_key_arn
    }
    resources = ["secrets"]
  }

  enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  tags = merge(var.tags, local.mandatory_tags)
}

locals {
  # These tags are injected by the module regardless of what the caller passes.
  # They are required for SOC 2 CC6.1 (logical access controls) and cost allocation.
  mandatory_tags = {
    managed_by          = "terraform"
    compliance_scope    = "soc2-hipaa"
    data_classification = var.data_classification
    cost_centre         = var.cost_centre
    squad               = var.squad
    environment         = var.environment
  }
}

# modules/terraform-aws-eks-compliant/variables.tf

variable "cluster_name" {
  type        = string
  description = "EKS cluster name. Used as prefix for all associated resources."
}

variable "kubernetes_version" {
  type        = string
  description = "Kubernetes version. Must be within N-1 of current AWS EKS latest."
  # Enforce version constraints at the module level.
  validation {
    condition     = can(regex("^1\\.(2[6-9]|[3-9][0-9])$", var.kubernetes_version))
    error_message = "Kubernetes version must be 1.26 or later. Older versions are EOL and out of compliance."
  }
}

variable "data_classification" {
  type        = string
  description = "Data sensitivity level. Drives encryption tier and audit logging scope."
  validation {
    condition     = contains(["public", "internal", "confidential", "restricted"], var.data_classification)
    error_message = "data_classification must be one of: public, internal, confidential, restricted."
  }
}

variable "cost_centre" {
  type        = string
  description = "Cost centre code for billing allocation. Required for all production resources."
}

variable "kms_key_arn" {
  type        = string
  description = "KMS key ARN for secrets encryption. Sourced from foundation outputs."
  sensitive   = true
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnet IDs. Sourced from foundation outputs. Public subnets are rejected."
}

variable "tags" {
  type        = map(string)
  description = "Additional tags to apply. Mandatory tags are always injected by the module."
  default     = {}
}

Notice what’s missing from the variables: any way to enable public API access, any way to skip encryption, any way to omit the mandatory tags. This is intentional. The module’s job is to make compliance the path of least resistance.
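
The “Public subnets are rejected” promise in the variables above needs teeth. One way to provide them, sketched here under the assumption of Terraform 1.4+ (for terraform_data), is to look up each supplied subnet and fail the plan if any of them auto-assigns public IPs:

# modules/terraform-aws-eks-compliant/validation.tf
# Fail the plan, not the apply, if a caller passes a public subnet.

data "aws_subnet" "candidate" {
  for_each = toset(var.private_subnet_ids)
  id       = each.value
}

resource "terraform_data" "subnet_guard" {
  lifecycle {
    precondition {
      condition = alltrue([
        for s in data.aws_subnet.candidate : !s.map_public_ip_on_launch
      ])
      error_message = "One or more supplied subnets auto-assign public IPs. Pass private subnets only."
    }
  }
}

The precondition could equally live on the cluster resource itself; the point is that the rejection happens at plan time, inside the module, not in a wiki page.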

Architect’s Note

Version your modules like you version your APIs. Use semantic versioning in a private Terraform registry (or a simple Git tag convention) and require version pinning in all consumer configurations. A floating `source = "git::https://…"` reference without a `ref` is a silent bomb. I’ve seen a breaking change in a shared module silently propagate through eight squads’ pipelines over a weekend. Pin your versions. Enforce it in CI with a pre-commit hook that rejects any module source without an explicit `version` or `ref`. This is the difference between a controlled upgrade path and a 2am incident.


Layer 3: Product Configuration – Where Teams Live

This is the only layer individual squads touch. It’s almost boring by design. It should look like configuration, not programming.

# teams/compute-squad/main.tf

# The Compute Squad never writes a raw aws_* resource.
# They instantiate pre-approved, compliant modules.

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state-prod"
    key            = "teams/compute-squad/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123"
    dynamodb_table = "terraform-state-locks"
  }
}

# Pull shared infrastructure outputs from SSM Parameter Store.
# This decouples the squads from the foundation's backend configuration.
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/foundation/vpc_id"
}

data "aws_ssm_parameter" "private_subnet_ids" {
  name = "/platform/foundation/private_subnet_ids"
}

data "aws_ssm_parameter" "kms_key_arn" {
  name            = "/platform/foundation/kms/default_key_arn"
  with_decryption = true
}

module "compute_squad_eks" {
  source  = "app.terraform.io/acme/eks-compliant/aws"
  version = "3.2.1"  # Pinned. Always pinned.

  cluster_name        = "compute-squad-prod"
  kubernetes_version  = "1.30"
  data_classification = "confidential"
  cost_centre         = "CC-1042"
  squad               = "compute"
  environment         = "prod"
  kms_key_arn         = data.aws_ssm_parameter.kms_key_arn.value
  private_subnet_ids  = split(",", data.aws_ssm_parameter.private_subnet_ids.value)

  tags = {
    service = "compute-api"
  }
}

This is under 50 lines. It deploys a fully SOC 2 and HIPAA-compliant EKS cluster. The Compute Squad cannot misconfigure it even if they try.


Pitfalls & Optimisations

The “Mega-Module” Trap. The most common mistake I see when teams adopt this pattern is building one enormous “infrastructure module” that creates VPCs, EKS clusters, RDS databases, and SQS queues all at once. This feels efficient. It is a future catastrophe. Every apply locks the entire resource graph, changes in one component force a full plan of unrelated resources, and when something breaks during apply you’re debugging a 40-resource state transaction. Keep modules focused. One module, one conceptual resource. Compose them at the product configuration layer.
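
Composition at the product layer stays just as boring as the earlier example. A sketch, assuming a hypothetical rds-compliant module and a cluster_security_group_id output on the EKS module (neither is defined in this article):

# teams/data-squad/main.tf
# Two focused modules composed through outputs, not one mega-module.

module "analytics_eks" {
  source  = "app.terraform.io/acme/eks-compliant/aws"
  version = "3.2.1"

  cluster_name        = "data-squad-prod"
  kubernetes_version  = "1.30"
  data_classification = "restricted"
  cost_centre         = "CC-2077" # illustrative cost centre code
  squad               = "data"
  environment         = "prod"
  kms_key_arn         = data.aws_ssm_parameter.kms_key_arn.value
  private_subnet_ids  = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
}

module "analytics_rds" {
  source  = "app.terraform.io/acme/rds-compliant/aws" # hypothetical module
  version = "2.4.0"

  identifier          = "data-squad-analytics"
  data_classification = "restricted"
  cost_centre         = "CC-2077"
  squad               = "data"
  environment         = "prod"
  kms_key_arn         = data.aws_ssm_parameter.kms_key_arn.value
  subnet_ids          = split(",", data.aws_ssm_parameter.private_subnet_ids.value)

  # Composition happens here: the database admits traffic only from
  # the cluster's security group, wired through a module output.
  allowed_security_group_ids = [module.analytics_eks.cluster_security_group_id]
}

Each module keeps its own small resource graph; the wiring between them is a single line at the layer that is allowed to change often.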

State File Proliferation. Yes, separate state per team means more state files to manage. The operational overhead is real. The answer is a private Terraform Cloud or Atlantis instance with consistent backend conventions, not collapsing your state back into one file. I enforce backend configuration through a Terraform module (a meta-module, if you like) that generates the backend.tf as part of team provisioning. The backend configuration itself is code.
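
A minimal sketch of such a meta-module, using the hashicorp/local provider to render each team's backend.tf during provisioning (team names and paths are illustrative):

# platform/team-provisioning/backend_gen.tf
# Generates a backend.tf per onboarded team from one template, so
# backend conventions cannot drift between squads.

variable "teams" {
  type    = set(string)
  default = ["compute-squad", "data-squad"]
}

resource "local_file" "team_backend" {
  for_each = var.teams

  filename = "${path.root}/../teams/${each.value}/backend.tf"
  content  = <<-EOT
    terraform {
      backend "s3" {
        bucket         = "acme-terraform-state-prod"
        key            = "teams/${each.value}/terraform.tfstate"
        region         = "eu-west-1"
        encrypt        = true
        dynamodb_table = "terraform-state-locks"
      }
    }
  EOT
}

Adding a team becomes a one-line change to var.teams, and every generated backend lands in the same bucket with the same locking table and key convention.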

Module Versioning Drift. Within six months, your teams will be on seven different versions of your EKS module. This isn’t a culture problem; it’s an infrastructure problem. Solve it with automation: a weekly CI job that opens PRs against team repos for module updates, combined with a policy-as-code rule (OPA or Conftest) that flags anything more than one major version behind. Enforce at the pipeline level, not through honour systems.

Circular Dependency Between Foundation and Modules. If a service module tries to create IAM policies that reference the foundation’s KMS key ARN, and the foundation references module outputs, you’ve created a circular dependency that Terraform cannot resolve. Keep the data flow strictly one-directional: Foundation outputs → Service modules consume → Product configs compose. Nothing flows upward.

Don’t Abstract Too Early. The other failure mode is building elaborate module hierarchies before you’ve deployed anything twice. Write the raw Terraform for the first two or three use cases. Only abstract into a module when you’ve seen the pattern repeat and understand which variables are genuinely configurable versus which should be hard-coded compliance controls. Premature abstraction produces modules that are either too rigid or expose every knob and provide no safety guarantees.


Unlocked: Your Key Takeaways

  • Three layers, strict separation: Foundation (shared, sacred, rare changes) β†’ Service Modules (compliant, versioned, opinionated) β†’ Product Config (team-owned, configuration only, no raw resources).
  • Compliance belongs in modules, not wikis. When encryption, private endpoints, and mandatory tagging are enforced by the module, they cannot be skipped by accident or under sprint pressure.
  • Separate state files are blast radius control. A broken apply in one team’s config cannot corrupt another team’s infrastructure. This is non-negotiable at scale.
  • Version pin everything. Floating module references are silent bombs. Use semantic versioning, pin at the consumer level, and automate upgrade PRs.
  • Data flows one way. Foundation β†’ Modules β†’ Config. Circular dependencies are an architecture smell, not a Terraform problem.
  • Don’t abstract prematurely. Write raw Terraform first. Extract to modules only when the pattern is proven and you understand which knobs are safe to expose.

At 50 teams, your Terraform is either a platform that lets engineers ship safely and fast, or it’s the thing that gets blamed every time an audit fails or a production change takes three hours to plan. The architecture above is the difference between the two.

If your team is facing this challenge, I specialise in architecting these secure, audit-ready systems.

Email me for a strategic consultation: [email protected]

Explore my projects and connect on Upwork: DevOps Unlocked on Upwork


Atif Farrukh is the founder of DevOps Unlocked, a consulting practice specialising in compliance-driven cloud infrastructure for health-tech and fintech companies. He architects SOC 2, HIPAA, and ISO 27001-ready systems on AWS using Terraform and Kubernetes.
