Hallmarks of Resilient Terraform Code: How to Create & Maintain It

Index

    Helpful HashiCorp Links For Resilient Code

    Overview

    When it comes to Infrastructure as Code (IaC), resiliency goes beyond merely crafting code; it’s about creating an adaptable, reliable, and sustainable ecosystem. Resilient Terraform code ultimately instills confidence and assurance in user interactions, from minor adjustments to major overhauls and updates.

    Defining Resiliency in Terraform Codebases

    Confidence in Code Changes & Speculative Plans

    Resilient Terraform code means having the capability to alter the code, execute a plan, and have a high level of confidence that a successful speculative plan equates to a successful apply. This is about ensuring that a user’s expectations align with the actual outcomes.
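
    One way to narrow the gap between plan and apply is to move checks that an API would otherwise perform at apply time into variable validation, so that a bad value fails the plan instead. Here is a minimal sketch, assuming a GCS bucket; the variable names are hypothetical, and the regex is a simplified approximation of the bucket naming rules rather than a complete one:

    ```hcl
    variable "bucket_name" {
      type        = string
      description = "Hypothetical input: name of the bucket to create."

      validation {
        # Simplified check: 3-63 characters, lowercase letters, digits,
        # dots, underscores, and dashes, starting and ending with a
        # letter or digit. Real naming rules have more cases than this.
        condition     = can(regex("^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$", var.bucket_name))
        error_message = "bucket_name must be 3-63 characters of lowercase letters, digits, dots, underscores, or dashes."
      }
    }

    variable "location" {
      type        = string
      description = "Hypothetical input: bucket location, for example US."
    }

    resource "google_storage_bucket" "logs" {
      name     = var.bucket_name
      location = var.location
    }
    ```

    With this in place, a name such as "My Bucket" is rejected while planning, rather than by the API halfway through an apply.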

    Mitigation of Manual Configuration Modifications

    A resilient codebase should ideally be able to be applied to a greenfield environment, then destroyed, and then re-applied successfully with minimal manual intervention or configuration. While achieving this level of autonomy is challenging, it should remain a central goal across the lifetime of the codebase.

    Clear Ownership

    Resiliency requires having a distinct team of owners who undertake the responsibility to maintain and update the code. This approach ensures the code’s relevance and continued usability.

    Churn-Free Operations

    The hallmark of resilient code is its stability, reflected by an absence of churn between consecutive applies. Churn is when the same changes keep showing up in Terraform plans even though nobody touched the code, which is occasionally induced by alterations on the API backend. A resilient codebase incorporates these alterations so that the only feedback provided at plan and apply time is the set of expected changes based on user modifications.
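
    One common way to absorb this kind of churn is the lifecycle ignore_changes meta-argument. The sketch below assumes a GKE node pool whose initial_node_count can drift when the pool is resized outside of Terraform; the pool name and the two variables are hypothetical:

    ```hcl
    variable "cluster_name" {
      type        = string
      description = "Hypothetical input: name of an existing GKE cluster."
    }

    variable "location" {
      type        = string
      description = "Hypothetical input: location of that cluster."
    }

    resource "google_container_node_pool" "workers" {
      name               = "workers"
      cluster            = var.cluster_name
      location           = var.location
      initial_node_count = 1

      autoscaling {
        min_node_count = 1
        max_node_count = 3
      }

      lifecycle {
        # Resizing the pool outside Terraform can change this value and
        # would otherwise show up as a diff on every plan.
        ignore_changes = [initial_node_count]
      }
    }
    ```

    Keep the ignore list as narrow as possible, since ignore_changes hides genuine drift on the listed attributes as well as the noise.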

    Clear Intuitive File & Directory Organization

    Resilient code is organized in an intuitive, user-centric format. This organization ensures the logical arrangement of dependencies from top to bottom in Terraform files. It also provides clear, discernible file naming, reflecting the primary resource, service, component, or “blade” each file manages, even though Terraform itself is inherently agnostic to file structuring.

    Testing & Feedback Loops

    A resilient Terraform codebase incorporates rigorous testing workflows and feedback mechanisms. This means facilitating continuous testing of code alterations and prioritizing user feedback.
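
    As a small illustration of what a testing workflow can look like, here is a sketch using the native terraform test command, assuming Terraform 1.6 or later. The file names, the variable, and the naming convention are all hypothetical, and the two files are shown in one block for brevity:

    ```hcl
    # --- main.tf (configuration under test) ---
    variable "environment" {
      type = string
    }

    locals {
      bucket_name = "${var.environment}-app-logs"
    }

    output "bucket_name" {
      value = local.bucket_name
    }

    # --- tests/naming.tftest.hcl ---
    variables {
      environment = "dev"
    }

    run "bucket_name_follows_convention" {
      # Plan only, so the test creates no real infrastructure.
      command = plan

      assert {
        condition     = output.bucket_name == "dev-app-logs"
        error_message = "Bucket name should be prefixed with the environment."
      }
    }
    ```

    Running terraform test from the root of the configuration executes the run block and reports the assertion result, which makes it easy to wire into a pull request check.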

    Scoped Accessibility

    A resilient Terraform codebase values appropriate scoping. This principle ensures that users are not granted excessive permissions that would allow them to make changes across the entire codebase. Instead, users should ideally be able to add entries to a list or object, subject to validation through testing, with Terraform determining which resources need to be created as a result of those changes. This delineation in access and control helps ensure that changes are intentional, accurate, and verified, mitigating the risk of inadvertent alterations, especially to areas of the code that fall outside a user’s scope of concern.
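
    A sketch of that pattern, assuming the thing users request is a GCP service account: the user only edits a map in a tfvars file, validation guards the input, and for_each works out which resources to create. The variable name and the example entries are hypothetical:

    ```hcl
    variable "service_accounts" {
      description = "Hypothetical input: map of account ID to settings."
      type = map(object({
        display_name = string
      }))
      default = {}

      validation {
        # Account IDs: 6-30 characters, lowercase letters, digits, and
        # dashes, starting with a letter and not ending with a dash.
        condition = alltrue([
          for id in keys(var.service_accounts) :
          can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", id))
        ])
        error_message = "Each service account ID must be 6-30 characters of lowercase letters, digits, or dashes."
      }
    }

    resource "google_service_account" "this" {
      for_each = var.service_accounts

      account_id   = each.key
      display_name = each.value.display_name
    }

    # Example entries a user would add in a tfvars file:
    #
    # service_accounts = {
    #   "billing-reader" = { display_name = "Billing reader" }
    #   "deploy-runner"  = { display_name = "Deploy runner" }
    # }
    ```

    The user never touches the resource block, so a review of their change is a review of a few lines of data rather than of Terraform logic.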

    Comprehensive Trackability

    A resilient Terraform codebase emphasizes the meticulous tracking of changes, including alterations to input variables per environment, versions of secrets, and user modifications. It allows stakeholders to monitor and trace the history of applies to state files. This precise trackability ensures a transparent and accountable environment, allowing teams to identify the origins of changes, resolve discrepancies, and maintain a reliable and secure codebase.

    Why Resiliency Ultimately Matters

    One day, something will go wrong with your Terraform. It may be an unexpected provider error, a corruption of a state file (hopefully not), an errant change that got approved and applied and destroyed something it shouldn’t have, a resource that gets hung up during an apply and prevents other resources or services from being updated, or a necessary security feature or other feature that needs to be updated in your code but requires a destroy and recreate on the resource.

    The point is, resiliency in a codebase not only frees team members up to address these issues, but also means that they spend less time fixing the same issues, and are always working on new problems.

    It also means that everyone’s job is easier because Terraform is doing the work it’s supposed to do, ensuring consistently successful applies with minimal churn in an incredibly stable and predictable ecosystem.

    When something does go wrong with a Terraform codebase, the principles outlined in this blog post will hopefully help teams address the issue sooner and with less anxiety.

    Pre-Development Considerations

    It’s important first to share a set of fundamental considerations that can help shape the direction, scope, and complexity of Terraform code. These considerations lay the groundwork for creating resilient and maintainable Terraform configurations.

    User Identification

    Identifying the users directly influences how a Terraform codebase is developed. It clarifies the expectations, responsibilities, and interactions between the users and the code, laying down the prerequisites for maintaining and modifying the codebase.

    Some Important Questions to Help with User Identification

    Roles:

    Are the users operations people or application teams, or is there a blend of both?

    Skill Level:

    What is their level of expertise with Terraform, and what are their expectations from the code?

    Accountability:

    Who will be responsible for maintaining the code and fixing issues with it?

    Scope of Management:

    Is the user base a small team of 6-10 people, or is it an entire organization?

    Scope & Lifecycle Identification

    Establishing a codebase’s scope & lifecycle is pivotal to ensuring manageability and scalability.

    Some Important Questions to Establish Scope and Lifecycle

    Scale:

    How many resources are expected to be managed by this Terraform codebase?

    A few hundred resources, a few thousand, or over 5,000? (Keep in mind that I would not recommend Terraform state files of over 3,000 resources, as plans and applies tend to slow down around that benchmark. Terraform itself can support upwards of 10,000 resources, but I would NEVER recommend maintaining a codebase of anywhere over 5,000, for many very good reasons.)

    Dependencies:

    Are there dependencies between these specific resources?

    Can they be split to other state files scoped to files within that same repository?

    Potential Blockers:

    What would happen if there was an error in this codebase and changes to differing services within the codebase could not be made?

    Would that influence the decision to differentiate resource types into varying state files?

    Lifecycle:

    Do we expect this repository and Terraform codebase to exist in:

    • 3 months
    • 6 months
    • 1 year
    • 3 years
    • 6 years

    Considerations for Scope and Lifecycle

    It’s WAY easier to combine state files at a later date than to split them up later. TRUST ME on this.

    So when thinking about scale, you will want to err on the side of splitting things up as much as possible where no dependencies exist, but do this within reason. For many apps, this may not be an issue, as they may never conceivably get above 4,000 or 5,000 resources.
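
    To make the idea of splitting concrete, here is a sketch of two root modules in the same repository, each with its own state, where the second reads an output of the first through the terraform_remote_state data source. It assumes a GCS backend; the bucket name, prefixes, directory names, region, and CIDR range are all hypothetical, and the files are shown in one block for brevity:

    ```hcl
    # --- network/main.tf (first state file) ---
    terraform {
      backend "gcs" {
        bucket = "example-tf-state"
        prefix = "network"
      }
    }

    resource "google_compute_network" "main" {
      name                    = "main"
      auto_create_subnetworks = false
    }

    output "network_id" {
      value = google_compute_network.main.id
    }

    # --- app/main.tf (second state file) ---
    terraform {
      backend "gcs" {
        bucket = "example-tf-state"
        prefix = "app"
      }
    }

    # Read-only view of the outputs stored in the network state.
    data "terraform_remote_state" "network" {
      backend = "gcs"

      config = {
        bucket = "example-tf-state"
        prefix = "network"
      }
    }

    resource "google_compute_subnetwork" "app" {
      name          = "app"
      region        = "us-central1"
      ip_cidr_range = "10.10.0.0/24"
      network       = data.terraform_remote_state.network.outputs.network_id
    }
    ```

    An error in the app directory no longer blocks changes to the network directory, and each plan only has to refresh its own slice of resources.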

    There are a number of situations where Terraform will NOT scale well without a lot of forward thinking. Those scenarios include:

    When using multiple iterations of a Terraform module that has a for_each meta-argument on a role assignment where each member of the role assignment requires a single resource. This is where assigning roles to GROUPS, as opposed to individual user accounts or service accounts/service principals, is really a requirement. Imagine adding 100 service accounts as members of a single role assignment on an org unit within a cloud platform: that’s 100 resource “entries” occupied in your state file.

    Any “helper” modules where a data block is called using a for_each meta-argument iterating over a list, map, or object. These helper modules may pull information about particular cloud resources and, in so doing, occupy a single resource “entry” in your state file. You may see this particularly in GCP where a data source is used to pull individual permissions for a base role – each base role passed into such a helper module represents a single additional data source that is created in your state file.

    Terraform resources that do not support an array of resources even when the deployment of many of those resources is typical. An example of this would be the google_project_service resource in GCP. In previous GCP provider versions, the plural google_project_services resource managed a list of project services, altogether, in a single resource entry. However, that resource has since been removed, leaving google_project_service, which has a 1:1 relationship, resource to project service. Therefore, if you pass in 30 project services to be enabled by the resource using a for_each, that will occupy 30 resource “entries” or “slots” in your state file, multiplied by the number of projects being created.
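
    The first and third scenarios above can be sketched as follows. This is an illustration only: the variable names, role, and group are hypothetical, and the two IAM resources are shown side by side for comparison rather than as something you would apply together:

    ```hcl
    variable "project_ids" {
      type = set(string)
    }

    variable "services" {
      type        = set(string)
      description = "For example compute.googleapis.com."
    }

    variable "service_accounts" {
      type        = set(string)
      description = "Service account emails."
    }

    variable "group_email" {
      type = string
    }

    # One state entry per member: 100 service accounts means 100 entries.
    resource "google_project_iam_member" "per_account" {
      for_each = var.service_accounts

      project = one(var.project_ids)
      role    = "roles/viewer"
      member  = "serviceAccount:${each.value}"
    }

    # One state entry in total: membership is managed in the group.
    resource "google_project_iam_member" "via_group" {
      project = one(var.project_ids)
      role    = "roles/viewer"
      member  = "group:${var.group_email}"
    }

    locals {
      # Every project/service pair becomes its own resource instance,
      # so 30 services across 10 projects is 300 state entries.
      project_services = {
        for pair in setproduct(var.project_ids, var.services) :
        "${pair[0]}/${pair[1]}" => { project = pair[0], service = pair[1] }
      }
    }

    resource "google_project_service" "enabled" {
      for_each = local.project_services

      project = each.value.project
      service = each.value.service
    }
    ```

    Note that one() only succeeds here when project_ids holds a single project; it is used purely to keep the IAM part of the sketch short.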

    Usage Patterns

    Recognizing and understanding the expected usage patterns and most frequent operations in your Terraform codebase are crucial for optimizing workflows and minimizing toil, allowing for a smoother, more predictable development and operational experience.

    Questions to Determine Expected Usage Patterns

    How frequently are iterations of similar patterns created?

    Are there specific, recurring tasks or modifications that are made regularly?
