Managing Terraform State With Google Cloud Storage Object Versioning

Introduction

Terraform is an extremely useful tool for managing cloud resources: it keeps a record of the resources it manages (the state) and plans changes by comparing your configuration against that state. Terraform uses two files to maintain this: a state file, and, to stop multiple processes from modifying the state at the same time, a lock file. To allow the state to persist beyond a single local run of the terraform CLI, the state file can be stored remotely; for Google Cloud resources, a Google Cloud Storage bucket is commonly used to hold it.

Additionally, Google Cloud Storage can retain old versions of objects, allowing you to restore or roll back to a previous version. This is particularly useful for the bucket that contains the state files: together with version-controlling the Terraform configuration itself, it provides an effective way to manage the state.
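If versioning is not already enabled on the state bucket, it can be switched on with a single command; a minimal sketch using the BUCKETNAME placeholder used throughout this post:

# Enable object versioning on the state bucket.
gsutil versioning set on gs://BUCKETNAME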

However, this combination of state management and object versioning sets up a scenario where noncurrent versions of the state file, along with the noncurrent versions left behind by deleted lock files, accumulate in the bucket with every run and are never cleaned up.

Setup

Configure your Terraform backend to point to the GCS bucket, as below, within the backend.tf file:

terraform {
  backend "gcs" {
    bucket                      = "BUCKETNAME"
    prefix                      = "terraform/state"
    impersonate_service_account = "terraform-service-account@PROJECTID.iam.gserviceaccount.com"
  }
}
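With backend.tf in place, initialise the backend (this assumes the impersonated service account already has read/write access to the bucket):

# Initialise the GCS backend; add -migrate-state if moving an existing local state file.
terraform init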

Then add a few resources to your GCP project, perhaps using the Cloud Foundation Fabric modules.
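If you just want something quick to apply, any small resource is enough to generate state; a purely illustrative sketch using the google provider directly, with hypothetical names:

# Hypothetical example resource, only here so that a state file gets written to the bucket.
resource "google_pubsub_topic" "state_demo" {
  project = "PROJECTID"
  name    = "tf-state-demo-topic"
}

Run terraform apply and the state file will be written to the bucket.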

You should see the following objects in the bucket:

ashires@cloudshell:~ (PROJECTID)$ gsutil ls gs://BUCKETNAME/terraform/state
gs://BUCKETNAME/terraform/state/default.tfstate

but in reality, listing all object versions shows there is more in the bucket:

ashires@cloudshell:~ (PROJECTID)$ gsutil ls -a gs://BUCKETNAME/terraform/state
gs://BUCKETNAME/terraform/state/default.tflock#1679481957785980
gs://BUCKETNAME/terraform/state/default.tflock#1679481976448017
gs://BUCKETNAME/terraform/state/default.tflock#1679481982792349
gs://BUCKETNAME/terraform/state/default.tfstate#1678972431855325
gs://BUCKETNAME/terraform/state/default.tfstate#1679302949130701
gs://BUCKETNAME/terraform/state/default.tfstate#1679303000787007
gs://BUCKETNAME/terraform/state/default.tfstate#1679304435736834
gs://BUCKETNAME/terraform/state/default.tfstate#1679304598531061
gs://BUCKETNAME/terraform/state/default.tfstate#1679481966396872
gs://BUCKETNAME/terraform/state/default.tfstate#1679482233534511

where there are the old (noncurrent) versions of the state file, plus the noncurrent versions left behind by the deleted default.tflock files.

This is the result of only three runs of Terraform; in an enterprise setting, the number of versions will grow dramatically!
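A quick way to keep an eye on this growth is to count every version with the same listing used above:

# Count all live and noncurrent versions under the state prefix.
gsutil ls -a gs://BUCKETNAME/terraform/state | wc -l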

Lifecycle rules

One key part of cleaning up these objects is to add a lifecycle rule, but the usual version clean-up conditions rely on a newer, live version of the object existing, which the deleted lock files never have. We therefore need to add an additional rule that specifically cleans up the noncurrent default.tflock files:

lifecycle_rules = [{
  action = {
    type = "Delete"
  }
  condition = {
    daysSinceNoncurrentTime = 1
    matchesSuffix           = [".tflock"]
  }
}]
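If the bucket is managed directly with the google provider rather than via a module, a sketch of the equivalent configuration looks like the following (assuming a provider version recent enough to support matches_suffix, and a hypothetical bucket location):

resource "google_storage_bucket" "tf_state" {
  name     = "BUCKETNAME"
  location = "EU" # hypothetical location

  versioning {
    enabled = true
  }

  # Delete noncurrent .tflock versions one day after they become noncurrent.
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      days_since_noncurrent_time = 1
      matches_suffix             = [".tflock"]
    }
  }
}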

Testing

To test this, we need to apply the lifecycle rule to the bucket, either through Terraform, the console, or the gcloud storage command line:

gcloud storage buckets update gs://BUCKETNAME --lifecycle-file=LIFECYCLE_CONFIG_FILE
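The lifecycle config file itself is plain JSON; a minimal sketch matching the rule above, written to a hypothetical lifecycle.json, would be:

# Write a minimal lifecycle config and apply it to the bucket.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"daysSinceNoncurrentTime": 1, "matchesSuffix": [".tflock"]}
    }
  ]
}
EOF
gcloud storage buckets update gs://BUCKETNAME --lifecycle-file=lifecycle.json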

Lifecycle rules can take up to 24 hours to take effect, so with a bit of patience, we can come back and check that all the default.tflock files have been deleted!
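Once the rule has had time to run, a quick listing confirms the clean-up (again using the placeholder bucket name):

# List any remaining .tflock versions; prints a message when none are left.
gsutil ls -a gs://BUCKETNAME/terraform/state | grep tflock || echo "no tflock versions remain"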

Next steps

Always test your lifecycle rules before deploying to production!
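A simple first check is to read back the rules actually applied to the bucket:

# Print the lifecycle configuration currently set on the bucket.
gsutil lifecycle get gs://BUCKETNAME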