Operations
KCC is GitOps. The flow for every change is the same: edit YAML, open a PR, merge, ArgoCD syncs, KCC reconciles. There is no kubectl apply step and no gcloud step.
Repository rule. From AGENTS.md: Do not ever try to kubectl apply Kubernetes manifests or use terraform apply directly. This repository follows GitOps principles. All changes to resources are handled via CI/CD. You may use the kustomize command to verify changes in kustomization overlays.
Add a new GCP resource
- Pick the right tree and namespace. Use the table in Managed resources → How to find a resource to choose where the manifest lives. The directory’s API group dictates which CRD apiVersion to write.
- Write the manifest. Set the resource’s namespace to the namespace that already has a ConfigConnectorContext (e.g. tb-infra-mgmt-project, tb-platform-dev, tb-platform-vpc-prod).
- Register it with kustomize. Add the new file to the closest kustomization.yaml. If you created a new directory, also add it to the parent’s resource list.
- Validate locally. Render the affected overlay with kustomize build:
  kustomize build k8s/tb-platform-infra/env/dev | less
  Make sure the resource appears, the namespace is correct, and no other resource was accidentally dropped.
- Open a PR. Atlantis runs terraform plan for Terraform changes; for KCC, review is by humans. Once merged, ArgoCD syncs the Application (or the ApplicationSet generates/refreshes the Application) and KCC creates the GCP resource.
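As a sketch of what the steps above produce, the fragment below shows a minimal IAMServiceAccount manifest and the matching kustomization.yaml entry. The file name iam-platform-runner.yaml and the displayName are hypothetical; the resource name and namespace are taken from the inspection example on this page, and the real directory layout may differ.

```yaml
# iam-platform-runner.yaml (hypothetical file name)
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: platform-runner
  # Must be a namespace that already has a ConfigConnectorContext.
  namespace: tb-platform-dev
spec:
  displayName: Platform runner  # illustrative value
---
# kustomization.yaml in the same directory
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - iam-platform-runner.yaml  # register the new file so it renders
```

If the new file is missing from the resources list, kustomize build simply omits it, which is why the local render step matters.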
Cross-env changes (env/base) require careful review. A change to env/base/iam/identity-access.yaml will roll into dev, qa, and prod in sequence as their Applications sync.
Update an existing GCP resource
- Edit the resource YAML in the directory where it already lives.
- kustomize build the affected overlay to confirm the change renders as expected.
- Open a PR. After merge, ArgoCD syncs and KCC reconciles. Most updates are picked up within ~3 minutes; large IAM policy updates can take longer because KCC throttles itself.
Field ownership and Server-Side Apply
Every KCC Application and ApplicationSet in this repo uses syncOptions: [ServerSideApply=true]. That means Kubernetes tracks which fields the ArgoCD apply controller owns and which fields the KCC controller owns. If you remove a field from a manifest, Argo releases its ownership of that field, but the value stays on the live object whenever another manager (typically the KCC controller) also owns it or the API re-applies a default.
If you really need to drop a default, set the field to its zero value (enabled: false, count: 0, [], etc.) explicitly rather than removing the key.
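For reference, the sync option sits under syncPolicy in an ArgoCD Application. The fragments below are a sketch, not a copy of anything in this repo; the exampleFeature field in the second fragment is hypothetical and only illustrates writing a zero value explicitly.

```yaml
# Fragment of an ArgoCD Application spec (illustrative)
spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
---
# Fragment of a KCC resource spec: keep the key and write the zero value
# instead of deleting the line. "exampleFeature" is a hypothetical field.
spec:
  exampleFeature:
    enabled: false
```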
Delete a GCP resource
KCC defaults to abandon-on-delete: removing the Kubernetes manifest will let KCC stop managing the resource but will not delete the underlying GCP object. To actually delete the GCP resource you must annotate the manifest so KCC tears it down:
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: "abandon" # default
    # OR
    cnrm.cloud.google.com/deletion-policy: "delete" # actually delete in GCP
Workflow for an actual delete:
- Add cnrm.cloud.google.com/deletion-policy: "delete" to the resource, open and merge a PR.
- After ArgoCD has synced and the KCC status shows Healthy, open a second PR that removes the manifest and its entry from the kustomization.
- Verify the underlying GCP resource is gone (Cloud Console or a read-only gcloud query is fine for verification, but never for mutation).
The two-step pattern exists so that “remove the file” never silently turns into “delete the production database”.
The config-connector-operator Application has automated.prune: false for exactly this reason - we never want a botched merge to delete the operator’s CRDs and orphan the entire fleet.
Inspect a resource’s reconcile state
You can shell into a developer cluster and check KCC’s view of any resource:
kubectl -n tb-platform-dev get gcp # list everything KCC owns in this namespace
kubectl -n tb-platform-dev describe iamserviceaccount platform-runner
Look for:
- status.conditions[].type=Ready → True means KCC successfully reconciled.
- status.conditions[].reason=UpdateFailed / DependencyNotReady indicates a fixable problem - the message field tells you what.
- status.observedGeneration should match metadata.generation.
KCC also writes the cnrm.cloud.google.com/management-conflict annotation when two managers try to own the same resource - usually a sign that the same GCP resource was created out of band (Terraform, gcloud) before KCC.
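For orientation, the status of a reconciled resource looks roughly like the fragment below. This is a sketch: the generation numbers are made up, and the exact reason and message text may vary by KCC version.

```yaml
# Illustrative shape of a reconciled KCC resource (values are examples)
metadata:
  generation: 3
status:
  conditions:
    - type: Ready
      status: "True"
      reason: UpToDate
      message: The resource is up to date
  observedGeneration: 3  # matches metadata.generation
```

If observedGeneration lags behind metadata.generation, KCC has not yet processed the latest spec, so the conditions describe an older revision.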
Troubleshooting
KCC resource stuck in Updating
- Check kubectl describe <kind>/<name> for the latest condition message.
- If the message references a missing dependency (e.g. an IAMServiceAccount that doesn’t exist yet), find or create it. KCC will retry automatically.
- If the message references a permission error, the workload-identity GSA (gke-platform-infra@tb-infra-mgmt-gke-prod-uk-40fd.iam.gserviceaccount.com) is missing a role on the target project. Add the binding via the appropriate KCC IAMPartialPolicy/IAMPolicyMember and re-sync.
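A minimal sketch of such a binding follows. The resource name, namespace, role, and target project ID are placeholders chosen for illustration; only the GSA email comes from this page.

```yaml
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: gke-platform-infra-example-role  # hypothetical name
  namespace: tb-infra-mgmt-project       # assumption: use the namespace that manages the target project
spec:
  member: serviceAccount:gke-platform-infra@tb-infra-mgmt-gke-prod-uk-40fd.iam.gserviceaccount.com
  role: roles/compute.networkAdmin       # placeholder: use the role that grants the missing permission
  resourceRef:
    kind: Project
    external: projects/example-target-project  # hypothetical project ID
```

IAMPolicyMember adds a single member/role pair without touching other bindings on the project, which makes it the narrower choice for a one-off fix.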
Argo shows OutOfSync for a KCC Application
Almost always an SSA field-ownership conflict: another manager (KCC itself, the GCP API defaulting a field, or a manual kubectl edit) wrote a field the manifest doesn’t declare. Fix by either declaring the field in Git or by removing the offending live edit.
Never resolve drift by manually changing the live object. The next sync will overwrite it and you will have lost the audit trail.
ConfigConnectorContext not Healthy
If a namespace’s CCC reports status.healthy: false, the most common causes are:
- The referenced GSA does not exist (typo in spec.googleServiceAccount).
- The GKE workload-identity binding is missing - the per-namespace controller manager’s KSA (cnrm-controller-manager in cnrm-system) needs roles/iam.workloadIdentityUser on the GSA.
- The operator is itself unhealthy: kubectl -n configconnector-operator-system get pods should show the operator running.
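When checking the first cause, it helps to know what a well-formed context looks like. The fragment below is a sketch: the namespace is one of the examples from this page, and pairing it with this particular GSA is an assumption, since each namespace may reference a different service account.

```yaml
apiVersion: core.cnrm.cloud.google.com/v1beta1
kind: ConfigConnectorContext
metadata:
  # KCC expects this exact name for the per-namespace context.
  name: configconnectorcontext.core.cnrm.cloud.google.com
  namespace: tb-platform-dev
spec:
  # Assumption: the GSA for this namespace. A typo here leaves the context unhealthy.
  googleServiceAccount: gke-platform-infra@tb-infra-mgmt-gke-prod-uk-40fd.iam.gserviceaccount.com
```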
Operator upgrade
To upgrade the operator manifest version:
- Fetch the new manifest for the GKE Autopilot operator from Google.
- Replace k8s/infra-services/gcp-config-connector/autopilot-configconnector-operator.yaml with the new file.
- Verify the diff is what you expect - new CRD versions, RBAC tweaks, deployment image bumps.
- kustomize build k8s/infra-services/gcp-config-connector to confirm it still composes with the existing configconnector.yaml.
- Open a PR. The same Kustomization is delivered to every cluster that runs KCC (hub + the three tb-platform spokes via the tb-platform-config-connector-operator ApplicationSet), so the version is bumped fleet-wide in one merge.
Because the config-connector-operator Application is not pruned automatically, a botched upgrade will not strip CRDs - you can revert the PR and Argo will re-apply the previous manifest.
What you can do safely without merging
- kustomize build <path> - render a tree to YAML for review.
- kubectl --context=<cluster> get <kind> / describe / logs against any live KCC resource.
- kubectl -n cnrm-system logs deployment/cnrm-resource-stats-recorder for fleet-wide KCC counts.
- kubectl -n <ns> get events --sort-by=.lastTimestamp to see KCC’s recent activity in a namespace.
What you must not do
- Run gcloud or kubectl to mutate state directly.
- kubectl apply a KCC manifest from your laptop.
- kubectl delete a KCC resource without first removing the manifest from Git - Argo will re-sync the deletion away, or worse, prune the resource on its own schedule.
- Edit any of the auto-generated argocd-ha manifests in k8s/infra-services/argocd/base/ or the KCC operator CRDs in autopilot-configconnector-operator.yaml (regenerate by replacing the whole file with the upstream version).
Related
- Config Connector overview - mode, IAM, hub-and-spoke architecture.
- Managed resources - what lives where.
- Managed Services → Config Connector - ArgoCD Application/ApplicationSet definitions.
- Upstream: GCP Config Connector operations guide.