Operations

Day-to-day workflows for working with Config Connector resources: adding, updating, deleting, and troubleshooting.

KCC is managed through GitOps. The flow for every change is the same: edit YAML, open a PR, merge, ArgoCD syncs, KCC reconciles. There is no kubectl apply step and no gcloud step.

Repository rule. From AGENTS.md:

Do not ever try to kubectl apply Kubernetes manifests or use terraform apply directly. This repository follows GitOps principles. All changes to resources are handled via CI/CD. You may use the kustomize command to verify changes in kustomization overlays.

Add a new GCP resource

  1. Pick the right tree and namespace. Use the table in Managed resources → How to find a resource to choose where the manifest lives. The directory’s API group dictates which CRD apiVersion to write.
  2. Write the manifest. Set the resource’s namespace to one that already has a ConfigConnectorContext (e.g. tb-infra-mgmt-project, tb-platform-dev, tb-platform-vpc-prod).
  3. Register it with kustomize. Add the new file to the closest kustomization.yaml. If you created a new directory, also add it to the parent’s resource list.
  4. Validate locally. Render the affected overlay with kustomize build:
    kustomize build k8s/tb-platform-infra/env/dev | less
    
    Make sure the resource appears, the namespace is correct, and no other resource was accidentally dropped.
  5. Open a PR. Atlantis runs terraform plan for Terraform changes; for KCC, review is by humans. Once merged, ArgoCD syncs the Application (or the ApplicationSet generates/refreshes the Application) and KCC creates the GCP resource.
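
As a rough pre-PR check for step 3, the sketch below lists manifest files in a directory that its kustomization.yaml never mentions. The helper name unregistered_manifests is hypothetical and not part of this repo. The matching is deliberately naive: it treats every YAML list entry in kustomization.yaml as a candidate and does not handle entries with trailing comments.

```shell
# Hypothetical helper: print manifests in DIR that kustomization.yaml never lists.
unregistered_manifests() {
  dir="$1"
  # Collect every YAML list entry ("- foo.yaml" or "- ./foo.yaml").
  entries="$(sed -n 's/^[[:space:]]*-[[:space:]]*//p' "$dir/kustomization.yaml" | sed 's|^\./||')"
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue
    name="$(basename "$f")"
    [ "$name" = "kustomization.yaml" ] && continue
    # -F: literal match, -x: whole line, so "a.yaml" does not match "aa.yaml".
    printf '%s\n' "$entries" | grep -Fxq -- "$name" || echo "$name"
  done
}
```

Run it against the directory you touched, e.g. unregistered_manifests <dir>; empty output means every manifest file is referenced. It complements kustomize build and does not replace it.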

Cross-env changes (env/base) require careful review. A change to env/base/iam/identity-access.yaml will roll into dev, qa, and prod in sequence as their Applications sync.

Update an existing GCP resource

  1. Edit the resource YAML in the directory where it already lives.
  2. kustomize build the affected overlay to confirm the change renders as expected.
  3. Open a PR. After merge, ArgoCD syncs and KCC reconciles. Most updates are picked up within ~3 minutes; large IAM policy updates can take longer because KCC throttles itself.
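
One way to confirm that an edit did not drop or rename anything is to reduce a rendered overlay to a sorted list of kind/namespace/name identifiers and compare that list before and after the change. The sketch below assumes the block-style YAML that kustomize build emits (top-level kind:, with name: and namespace: indented two spaces under metadata:). resource_ids is a hypothetical helper name.

```shell
# Hypothetical helper: read rendered multi-document YAML on stdin and
# print one "Kind/namespace/name" line per resource, sorted.
resource_ids() {
  awk '
    /^---/        { if (kind) print kind "/" ns "/" name; kind = ns = name = ""; inmeta = 0; next }
    /^kind:/      { kind = $2 }
    /^metadata:/  { inmeta = 1; next }
    /^[^ ]/       { inmeta = 0 }          # any other top-level key ends metadata
    inmeta && /^  name:/      { name = $2 }
    inmeta && /^  namespace:/ { ns = $2 }
    END           { if (kind) print kind "/" ns "/" name }
  ' | sort
}
```

For example, kustomize build k8s/tb-platform-infra/env/dev | resource_ids > after.txt, then diff the result against the same list rendered from the main branch. Cluster-scoped resources show an empty namespace segment.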

Field ownership and Server-Side Apply

Every KCC Application and ApplicationSet in this repo uses syncOptions: [ServerSideApply=true]. That means Kubernetes tracks which fields ArgoCD owns and which fields the KCC controller owns. If you remove a field from a manifest, Argo releases ownership of it, but the value is cleared only if no other field manager, such as the KCC controller, also owns it; otherwise the live value stays as it was.

If you really need to unset a value, set the field to its zero value (enabled: false, count: 0, [], etc.) explicitly rather than removing the key.
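
To see who owns what on a live object, a read-only query can list the field managers recorded in metadata.managedFields. This is a sketch, assuming kubectl 1.21 or later (for --show-managed-fields). field_managers is a hypothetical helper, and the manager names it prints depend on how ArgoCD and KCC are configured in the cluster.

```shell
# Hypothetical helper: list the distinct field managers on a live object.
# Usage: field_managers <namespace> <kind> <name>
field_managers() {
  kubectl -n "$1" get "$2" "$3" -o yaml --show-managed-fields \
    | sed -n 's/^ *manager: *//p' \
    | sort -u
}
```

For example, field_managers tb-platform-dev iamserviceaccount platform-runner. The sed filter is naive: a spec field that happens to be called manager would show up as well, so read the full -o yaml output when in doubt.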

Delete a GCP resource

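Resources are expected to be abandon-on-delete: removing the Kubernetes manifest lets KCC stop managing the resource but does not delete the underlying GCP object. This depends on the cnrm.cloud.google.com/deletion-policy: "abandon" annotation being present on the rendered resource; upstream Config Connector deletes the GCP object when a resource without that annotation is deleted, so check the kustomize build output rather than assuming it. To actually delete the GCP resource, annotate the manifest so that KCC tears it down:

metadata:
  annotations:
    # "abandon": keep the GCP object when the Kubernetes resource is removed
    # "delete": also delete the GCP object
    cnrm.cloud.google.com/deletion-policy: "delete"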

Workflow for an actual delete:

  1. Add cnrm.cloud.google.com/deletion-policy: "delete" to the resource, open and merge a PR.
  2. After ArgoCD has synced and the KCC status shows Healthy, open a second PR that removes the manifest and its entry from the kustomization.
  3. Verify the underlying GCP resource is gone (Cloud Console or a read-only gcloud query is fine for verification, but never for mutation).

The two-step pattern exists so that “remove the file” never silently turns into “delete the production database”.
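
Before opening the second PR, it can help to confirm that the live object really carries the delete policy. A minimal read-only sketch, assuming kubectl access to the cluster; both function names are hypothetical:

```shell
# Print the deletion-policy annotation of a live KCC object (empty if unset).
# Usage: live_deletion_policy <namespace> <kind> <name>
live_deletion_policy() {
  kubectl -n "$1" get "$2" "$3" \
    -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/deletion-policy}'
}

# Succeeds only when removing the object would also delete it in GCP.
will_delete_in_gcp() {
  [ "$(live_deletion_policy "$@")" = "delete" ]
}
```

For example, will_delete_in_gcp tb-platform-dev iamserviceaccount platform-runner && echo "ready for the removal PR".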

The config-connector-operator Application has automated.prune: false for exactly this reason - we never want a botched merge to delete the operator’s CRDs and orphan the entire fleet.

Inspect a resource’s reconcile state

You can connect to a developer cluster and check KCC’s view of any resource:

kubectl -n tb-platform-dev get gcp           # list everything KCC owns in this namespace
kubectl -n tb-platform-dev describe iamserviceaccount platform-runner

Look for:

  • status.conditions[].type=Ready with status=True means KCC successfully reconciled.
  • status.conditions[].reason=UpdateFailed / DependencyNotReady indicates a fixable problem - the message field tells you what.
  • status.observedGeneration should match metadata.generation.

KCC also reports a ManagementConflict error when its conflict prevention (governed by the cnrm.cloud.google.com/management-conflict-prevention-policy annotation) finds that another manager already holds the same GCP resource - usually a sign that the same GCP resource is also managed from somewhere else.
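
The Ready and observedGeneration checks above can be combined into one read-only query. This is a sketch under the assumption that the resource exposes the usual KCC status fields; kcc_ready is a hypothetical helper name.

```shell
# Succeeds when the resource is Ready=True and KCC has observed the latest spec.
# Usage: kcc_ready <namespace> <kind> <name>
kcc_ready() {
  out="$(kubectl -n "$1" get "$2" "$3" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.metadata.generation} {.status.observedGeneration}')" || return 1
  # Split e.g. "True 7 7" into $1 (Ready status), $2 (generation), $3 (observedGeneration).
  set -- $out
  [ "$1" = "True" ] && [ -n "$2" ] && [ "$2" = "$3" ]
}
```

For example, kcc_ready tb-platform-dev iamserviceaccount platform-runner || kubectl -n tb-platform-dev describe iamserviceaccount platform-runner falls back to the full description only when something is off.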

Troubleshooting

KCC resource stuck in Updating

  1. Check kubectl describe <kind>/<name> for the latest condition message.
  2. If the message references a missing dependency (e.g. an IAMServiceAccount that doesn’t exist yet), find or create it. KCC will retry automatically.
  3. If the message references a permission error, the workload-identity GSA (gke-platform-infra@tb-infra-mgmt-gke-prod-uk-40fd.iam.gserviceaccount.com) is likely missing a role on the target project. Add the binding via the appropriate KCC IAMPartialPolicy/IAMPolicyMember and re-sync.

Argo shows OutOfSync for a KCC Application

This is almost always an SSA field-ownership conflict: another manager (KCC itself, the GCP API defaulting a field, or a manual kubectl edit) wrote a field the manifest doesn’t declare. Fix by either declaring the field in Git or by removing the offending live edit.

Never resolve drift by manually changing the live object. The next sync will overwrite it and you will have lost the audit trail.

ConfigConnectorContext not Healthy

If a namespace’s CCC reports status.healthy: false, the most common causes are:

  • The referenced GSA does not exist (typo in spec.googleServiceAccount).
  • The GKE workload-identity binding is missing - the per-namespace controller manager’s KSA (cnrm-controller-manager-<namespace> in cnrm-system) needs roles/iam.workloadIdentityUser on the GSA.
  • The operator is itself unhealthy: kubectl -n configconnector-operator-system get pods should show the operator running.
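
A quick read-only way to check the flag before digging into the causes above, sketched under the assumption that the namespace holds a single ConfigConnectorContext; ccc_healthy is a hypothetical helper name.

```shell
# Succeeds when the namespace's ConfigConnectorContext reports status.healthy: true.
# Usage: ccc_healthy <namespace>
ccc_healthy() {
  [ "$(kubectl -n "$1" get configconnectorcontext \
        -o jsonpath='{.items[*].status.healthy}')" = "true" ]
}
```

For example, ccc_healthy tb-platform-dev || kubectl -n tb-platform-dev describe configconnectorcontext shows the full status only when the context is unhealthy or missing.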

Operator upgrade

To upgrade the operator manifest version:

  1. Fetch the new manifest for the GKE Autopilot operator from Google.
  2. Replace k8s/infra-services/gcp-config-connector/autopilot-configconnector-operator.yaml with the new file.
  3. Verify the diff is what you expect - new CRD versions, RBAC tweaks, deployment image bumps.
  4. kustomize build k8s/infra-services/gcp-config-connector to confirm it still composes with the existing configconnector.yaml.
  5. Open a PR. The same Kustomization is delivered to every cluster that runs KCC (hub + the three tb-platform spokes via the tb-platform-config-connector-operator ApplicationSet), so the version is bumped fleet-wide in one merge.

Because the config-connector-operator Application is not pruned automatically, a botched upgrade will not strip CRDs - you can revert the PR and Argo will re-apply the previous manifest.

What you can do safely without merging

  • kustomize build <path> - render a tree to YAML for review.
  • kubectl --context=<cluster> get <kind> / describe / logs against any live KCC resource.
  • kubectl -n cnrm-system logs deployment/cnrm-resource-stats-recorder for fleet-wide KCC counts.
  • kubectl -n <ns> get events --sort-by=.lastTimestamp to see KCC’s recent activity in a namespace.

What you must not do

  • Run gcloud or kubectl to mutate state directly.
  • kubectl apply a KCC manifest from your laptop.
  • kubectl delete a KCC resource without first removing the manifest from Git - Argo will undo the deletion on the next sync, or worse, prune the resource on its own schedule.
  • Edit any of the auto-generated argocd-ha manifests in k8s/infra-services/argocd/base/ or the KCC operator CRDs in autopilot-configconnector-operator.yaml (regenerate by replacing the whole file with the upstream version).