Hallmarks of Resilient Terraform Code: How to Create & Maintain It

Helpful HashiCorp Links For Resilient Code

Index – Terraform Recommended Practices – HashiCorp Developer: This guide is meant for enterprise users looking to advance their Terraform usage from a few individuals to a full organization. It covers the steps to start using Terraform tools, with special attention to the foundational practices they rely on.
Announcing the Terraform Recommended Practices Guide: This blog post introduces the Terraform Recommended Practices guide and explains why collaborative infrastructure as code is the best way to provision cloud-based infrastructure for organizations.
Opinionated Terraform Best Practices and Anti-Patterns – HashiCorp: This video presentation discusses some of the common pitfalls and best practices when using Terraform, such as using environment variables, refactoring monolithic configurations, and creating reusable modules.
Plugin Development – Best Practices | Terraform – HashiCorp Developer: This page describes best practices that generally apply to most Providers, with a brief description of each and a link to read more.
Terraform Repository Best Practices, Parts 1 & 2 (from 2020): This is a presentation on how to standardize your Terraform code and eliminate duplicate Terraform code. It reviews best practices for laying out Terraform configuration files and Terraform modules, and how to use Terraform to build and manage multiple environments supporting live applications.

Overview

When it comes to Infrastructure as Code (IaC), resiliency goes beyond merely crafting code; it’s about creating an adaptable, reliable, and sustainable ecosystem. Resilient Terraform code ultimately instills confidence and assurance in every user interaction, from minor adjustments to major overhauls and updates.

Defining Resiliency in Terraform Codebases

Confidence in Code Changes & Speculative Plans

Resilient Terraform code means having the capability to alter the code, execute a plan, and have a high level of confidence that a successful speculative plan equates to a successful apply. This is about ensuring that a user’s expectations align with the actual outcomes.
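One way to move failures from apply time to plan time is a precondition. The following is a minimal sketch, not a prescribed pattern: the variable names, the terraform_data resource standing in for a real cluster, and the three-node rule are all assumptions made for illustration. It relies on Terraform 1.4 or later for terraform_data.

```hcl
# Hypothetical inputs, used only for this illustration.
variable "environment" {
  type = string
}

variable "node_count" {
  type = number
}

# terraform_data is a built-in resource, so this runs without any
# cloud credentials. It stands in for a real cluster resource.
resource "terraform_data" "cluster_settings" {
  input = {
    environment = var.environment
    node_count  = var.node_count
  }

  lifecycle {
    # Evaluated during plan when the values are known, so a bad
    # combination fails the speculative plan instead of the apply.
    precondition {
      condition     = var.environment != "prod" || var.node_count >= 3
      error_message = "Production clusters need at least 3 nodes."
    }
  }
}
```

Running a plan with environment set to "prod" and node_count set to 1 should fail with the error message above, before anything is changed.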

Mitigation of Manual Configuration Modifications

A resilient codebase should ideally be able to be applied to a greenfield environment, then destroyed, and then re-applied successfully with minimal manual intervention or configuration. While achieving this level of autonomy is challenging, it should remain a central goal across the lifetime of the codebase.

Clear Ownership

Resiliency requires having a distinct team of owners who undertake the responsibility to maintain and update the code. This approach ensures the code’s relevance and continued usability.

Churn-Free Operations

The hallmark of resilient code is its stability, reflected by an absence of churn between consecutive applies. Churn is any change that keeps showing up in Terraform plans even though no one touched the code, and it is occasionally induced by API backend alterations. A resilient codebase incorporates these alterations so that the only changes reported at plan and apply time are the expected ones based on user modifications.
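As one possible way to absorb a known source of churn, the lifecycle meta-argument can tell Terraform to stop reporting differences on a specific attribute. This sketch assumes a bucket whose labels are also modified outside of Terraform; the bucket name, location, and label values are hypothetical.

```hcl
resource "google_storage_bucket" "logs" {
  name     = "example-org-logs" # hypothetical bucket name
  location = "US"

  labels = {
    team = "platform"
  }

  lifecycle {
    # Assumption for this example: another system adds or rewrites
    # labels on this bucket, which would otherwise appear as a change
    # in every plan.
    ignore_changes = [labels]
  }
}
```

Keep ignore_changes as narrow as possible. It also hides real drift on the ignored attribute, so it is a trade-off rather than a free fix.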

Clear Intuitive File & Directory Organization

Resilient code is organized in an intuitive, user-centric format. This organization ensures the logical arrangement of dependencies from top to bottom in Terraform files. It also provides clear, discernible file naming that reflects the primary resource, service, component, or “blade” each file manages, even though Terraform itself is inherently agnostic to file structuring.

Testing & Feedback Loops

A resilient Terraform codebase incorporates rigorous testing workflows and feedback mechanisms. This means facilitating continuous testing of code alterations and prioritizing user feedback.
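As a rough sketch of such a feedback loop, Terraform’s built-in test framework (the terraform test command, available in Terraform 1.6 and later) can assert on a plan. The two files below are hypothetical and shown together for brevity; the variable and resource names are assumptions made for illustration.

```hcl
# --- main.tf (hypothetical configuration under test) ---
variable "service_names" {
  type = list(string)
}

# One built-in terraform_data instance per service name, standing in
# for whatever real resources the codebase would create.
resource "terraform_data" "service" {
  for_each = toset(var.service_names)
  input    = each.value
}

# --- tests/services.tftest.hcl (separate file, run with `terraform test`) ---
variables {
  service_names = ["api", "worker"]
}

run "creates_one_entry_per_service" {
  # Only plan; nothing is applied during this run.
  command = plan

  assert {
    condition     = length(terraform_data.service) == 2
    error_message = "Expected exactly one terraform_data.service instance per service name."
  }
}
```

Running a test like this on every pull request gives users quick feedback on their change before a plan ever reaches a real environment.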

Scoped Accessibility

A resilient Terraform codebase values appropriate scoping. This principle ensures that users are not granted excessive permissions that would allow them to make changes across the entire codebase. Instead, users should ideally be able to add entries to a list or object, subject to validation through testing, with Terraform determining which resources need to be created as a result of those changes. This delineation in access and control guarantees that changes are intentional, accurate, and verified, mitigating the risk of inadvertent alterations, especially to areas of the code that are outside a user’s scope of concern.
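The “add an entry to a list” workflow described above could look something like the following sketch. The variable name, the naming rule, and the choice of google_storage_bucket are assumptions made for illustration. The idea is that a user only edits the list (for example in a .tfvars file), and the validation block and for_each do the rest.

```hcl
# Hypothetical input: the only thing a user is expected to edit.
variable "buckets" {
  description = "Buckets requested by application teams."
  type = list(object({
    name     = string
    location = string
  }))
  default = []

  validation {
    # Reject badly named entries at plan time, before any resource logic runs.
    condition = alltrue([
      for b in var.buckets : can(regex("^[a-z0-9-]{3,63}$", b.name))
    ])
    error_message = "Each bucket name must be 3-63 characters of lowercase letters, digits, or hyphens."
  }
}

# Terraform works out which resources to create from the list entries.
resource "google_storage_bucket" "this" {
  for_each = { for b in var.buckets : b.name => b }

  name     = each.value.name
  location = each.value.location
}
```

Keying for_each on the bucket name means that adding or removing one entry only touches that entry’s resource, not its neighbors in the list.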

Comprehensive Trackability

A resilient Terraform codebase emphasizes the meticulous tracking of changes, including alterations to input variables per environment, versions of secrets, and user modifications. It allows stakeholders to monitor and trace the history of applies to state files. This precise trackability ensures a transparent and accountable environment, allowing teams to identify the origins of changes, resolve discrepancies, and maintain a reliable and secure codebase.

Why Resiliency Ultimately Matters

One day, something will go wrong with your Terraform. It may be an unexpected provider error, a corruption of a state file (hopefully not), an errant change that got approved and applied and destroyed something it shouldn’t have, a resource that gets hung up during an apply and prevents other resources or services from being updated, or a necessary security feature or other feature that needs to be updated in your code but requires a destroy and recreate on the resource.

The point is, resiliency in a codebase not only frees team members up to address these issues, but also means that they spend less time fixing the same issues, and are always working on new problems.

It also means that everyone’s job is easier because Terraform is doing the work it’s supposed to do, ensuring consistently successful applies with minimal churn in an incredibly stable and predictable ecosystem.

When something does go wrong with a Terraform codebase, the principles outlined in this blog post will hopefully help teams address the issue sooner and with less anxiety.

Pre-Development Considerations

It’s important to first share a set of fundamental considerations that can help shape the direction, scope, and complexity of Terraform code. These considerations lay the groundwork for creating resilient and maintainable Terraform configurations.

User Identification

Identifying the users directly influences how a Terraform codebase is developed. It clarifies the expectations, responsibilities, and interactions between the users and the code, laying down the prerequisites for maintaining and modifying the codebase.

Some Important Questions to Help with User Identification

Roles:

Are the users operations people or application teams, or is there a blend of both?

Skill Level:

What is their level of expertise with Terraform, and what are their expectations from the code?

Accountability:

Who will be responsible for maintaining the code and fixing issues with it?

Scope of Management:

Is the user base a small team of 6-10 people, or is it an entire organization?

Scope & Lifecycle Identification

Establishing a codebase’s scope & lifecycle is pivotal to ensuring manageability and scalability.

Some Important Questions to Establish Scope and Lifecycle

Scale:

How many resources are expected to be managed by this Terraform codebase?

A couple hundred resources, a couple thousand, or over 5,000? (Keep in mind that I would not recommend Terraform state files over 3,000 resources, as plans and applies tend to slow down around that benchmark. Terraform itself can support upwards of 10,000 resources, but I would NEVER recommend maintaining a codebase anywhere over 5,000, for many very good reasons.)

Dependencies:

Are there dependencies between these specific resources?

Can they be split into other state files, each scoped to its own set of files within that same repository?

Potential Blockers:

What would happen if there was an error in this codebase and changes to differing services within the codebase could not be made?

Would that influence the decision to differentiate resource types into varying state files?

Lifecycle: Do we expect this repository and Terraform codebase to exist in:

3 months
6 months
1 year
3 years
6 years

Considerations for Scope and Lifecycle

It’s WAY easier to combine state files at a later date than to split them up later. TRUST ME on this.

So when thinking about scale, you will want to err on the side of splitting things up as much as possible where no dependencies exist, but do this within reason. For many apps, this may not be an issue, as they may never conceivably get above 4,000 or 5,000 resources.
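To make the idea of splitting state concrete, here is a minimal sketch of two root modules in one repository, each with its own state, where one reads the other’s outputs. The directory names, the bucket name, the prefixes, and the network_id output are all hypothetical, and the gcs backend is just one possible choice.

```hcl
# --- networking/backend.tf (first root module, its own state) ---
terraform {
  backend "gcs" {
    bucket = "example-tf-state" # hypothetical state bucket
    prefix = "networking"
  }
}

# --- apps/backend.tf (second root module, a separate state) ---
terraform {
  backend "gcs" {
    bucket = "example-tf-state"
    prefix = "apps"
  }
}

# --- apps/main.tf ---
# Read-only view of the networking state, so "apps" can consume its
# outputs without managing any of its resources.
data "terraform_remote_state" "networking" {
  backend = "gcs"

  config = {
    bucket = "example-tf-state"
    prefix = "networking"
  }
}

locals {
  # Assumes the networking module declares an output named "network_id".
  network_id = data.terraform_remote_state.networking.outputs.network_id
}
```

With this layout, an error or a stuck apply in one state does not block changes in the other, which is the point of the “Potential Blockers” questions above.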

There are a number of situations where Terraform will NOT scale well without a lot of forward thinking. Those scenarios include:

When using multiple iterations of a Terraform module that has a for_each meta-argument on a role assignment where each member of the role assignment requires a single resource. This is where assigning roles to GROUPS, as opposed to individual user accounts or service accounts/service principals, is really a requirement. Imagine adding 100 service accounts as members of a single role assignment on an org unit within a cloud platform: that’s 100 resource “entries” occupied in your state file.

Any “helper” modules where a data block is called using a for_each meta-argument iterating over a list, map, or object. These helper modules may pull information about particular cloud resources and, in so doing, occupy a single resource “entry” in your state file. You may see this particularly in GCP where a data source is used to pull individual permissions for a base role – each base role passed into such a helper module represents a single additional data source that is created in your state file.

Terraform resources that do not support an array of resources even when the deployment of many of those resources is typical. An example of this would be the google_project_service resource in GCP. In older GCP provider versions, the google_project_services resource (plural) managed a list of project services, altogether, in a single resource entry. That resource was later removed in favor of google_project_service, which has a 1:1 relationship between resource and project service. Therefore, if you pass in 30 project services to be enabled by the resource using a for_each, that will occupy 30 resource “entries” or “slots” in your state file, multiplied by the number of projects being created.
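The first and third scenarios above can be sketched roughly as follows. The variable names, the role, and the group address are hypothetical, and the two IAM resources are shown side by side only for comparison; in practice you would pick one approach.

```hcl
variable "project_id" {
  type = string
}

# Hypothetical list of service account emails.
variable "service_accounts" {
  type = set(string)
}

# Hypothetical list of APIs to enable, e.g. ["compute.googleapis.com"].
variable "services" {
  type = list(string)
}

# Per-member approach: 100 service accounts means 100 state entries.
resource "google_project_iam_member" "viewer_per_account" {
  for_each = var.service_accounts

  project = var.project_id
  role    = "roles/viewer"
  member  = "serviceAccount:${each.value}"
}

# Group approach: one state entry, no matter how many accounts are in
# the group. Group membership is then managed outside this resource.
resource "google_project_iam_member" "viewer_group" {
  project = var.project_id
  role    = "roles/viewer"
  member  = "group:platform-readers@example.com"
}

# 1:1 resource: 30 services means 30 state entries for this one
# project, multiplied again if this sits inside a per-project module.
resource "google_project_service" "enabled" {
  for_each = toset(var.services)

  project = var.project_id
  service = each.value
}
```

Running terraform state list against a configuration like this is a quick way to see how fast these entries add up.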

Usage Patterns

Recognizing and understanding the expected usage patterns and most frequent operations in your Terraform codebase are crucial for optimizing workflows and minimizing toil, allowing for a smoother, more predictable development and operational experience.

Questions to Determine Expected Usage Patterns

How frequently are iterations of similar patterns created?

Are there specific, recurring tasks or modifications that are made regularly?