Monitoring Stack

Grafana, Loki, Alloy, and Tempo for observability

The observability stack on the Infra Management Cluster consists of Grafana for dashboards, Loki for logs, Alloy for telemetry collection, and Tempo for distributed tracing.

ArgoCD Resources

Application        Namespace      Source Type  Chart/Path
grafana            monitoring     Kustomize    k8s/infra-services/grafana/base
grafana-loki       grafana-loki   Helm         loki (grafana.github.io)
grafana-alloy-hub  grafana-alloy  Helm         alloy (grafana.github.io)
grafana-tempo      grafana-tempo  Helm         tempo-distributed (grafana.github.io)

File Paths

Application        File
grafana            apps/monitoring-services.yaml
grafana-loki       apps/grafana-loki.yaml
grafana-alloy-hub  apps/grafana-alloy.yaml
grafana-tempo      apps/grafana-tempo.yaml

Grafana

Dashboards and visualisation platform.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana
spec:
  syncPolicy:
    automated: {}
  destination:
    namespace: monitoring
    server: https://kubernetes.default.svc
  project: infra-services
  source:
    path: k8s/infra-services/grafana/base
    repoURL: https://github.com/Titanbay/infra-services
    targetRevision: 'main'

Source Structure:

k8s/infra-services/grafana/
└── base/
    └── ... (Grafana manifests)

Grafana Loki

Log aggregation system deployed in distributed mode.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana-loki
spec:
  syncPolicy:
    automated:
      prune: true
    syncOptions:
      - ServerSideApply=true
  destination:
    namespace: grafana-loki
    server: https://kubernetes.default.svc
  project: infra-services
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 6.31.0
    helm:
      valuesObject:
        deploymentMode: Distributed
        loki:
          auth_enabled: false
          storage:
            type: gcs
            bucketNames:
              chunks: tb-grafana-loki
              ruler: tb-grafana-loki
              admin: tb-grafana-loki
        serviceAccount:
          create: false
          name: grafana-loki

Key Configuration:

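  • Deployment mode: Distributed (for production scale)
  • Storage: GCS bucket tb-grafana-loki, shared by chunks, ruler, and admin data
  • Auth: Disabled (auth_enabled: false; internal use only)
  • Service account: existing grafana-loki account, not created by the chart
  • Tracing: Enabled with OTEL export to Alloy (not part of the values shown above)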

Grafana Alloy

Telemetry collection agent (successor to Grafana Agent).

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana-alloy-hub
spec:
  syncPolicy:
    automated:
      prune: true
    syncOptions:
      - ServerSideApply=true
  destination:
    namespace: grafana-alloy
    server: https://kubernetes.default.svc
  project: infra-services
  source:
    chart: alloy
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 1.1.2
    helm:
      valuesObject:
        fullnameOverride: grafana-alloy
        alloy:
          configMap:
            create: false
            name: grafana-alloy
            key: config.alloy
          clustering:
            enabled: true
            name: "grafana-alloy-hub"
          extraPorts:
          - name: "otel-grpc"
            port: 4317
          - name: "otel-http"
            port: 4318
        controller:
          type: 'deployment'
          replicas: 2
        serviceAccount:
          create: false
          name: grafana-alloy

Key Configuration:

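  • Clustering enabled for HA (cluster name grafana-alloy-hub)
  • OTEL ports exposed (4317 gRPC, 4318 HTTP)
  • External ConfigMap grafana-alloy (key config.alloy) for configuration
  • 2 replicas, run as a Deployment, with topology spread (the spread constraints are not part of the values shown above)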

Grafana Tempo

Distributed tracing backend.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana-tempo
spec:
  syncPolicy:
    automated: {}
  destination:
    namespace: grafana-tempo
    server: https://kubernetes.default.svc
  project: infra-services
  source:
    chart: tempo-distributed
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 1.46.0
    helm:
      valuesObject:
        fullnameOverride: 'grafana-tempo'
        serviceAccount:
          create: false
          name: tempo-gcs
        ingester:
          replicas: 2
        metricsGenerator:
          enabled: true
        distributor:
          replicas: 1
        compactor:
          replicas: 2
        querier:
          replicas: 2

Key Configuration:

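  • Distributed deployment mode (tempo-distributed chart)
  • Metrics generator enabled
  • GCS storage via Workload Identity, using the existing tempo-gcs service account
  • Two replicas each for the ingester, compactor, and querier; the distributor runs a single replica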
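
A Workload Identity binding is normally expressed as an annotation on the Kubernetes ServiceAccount. The following is a minimal sketch of what the pre-created tempo-gcs account might look like, assuming GKE Workload Identity. The Google service account name and project ID are placeholders, not values taken from this repository:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tempo-gcs
  namespace: grafana-tempo
  annotations:
    # Placeholder: replace with the Google service account that can access the Tempo bucket.
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
```

Because the chart is told not to create this account (serviceAccount.create: false), it has to exist in the grafana-tempo namespace before the Tempo pods start.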

Source Structure

k8s/infra-services/
├── grafana/
│   └── base/                    # Grafana Kustomize manifests
├── grafana-alloy/               # Alloy ConfigMaps and resources
│   └── ... 
└── loki/                        # Additional Loki resources
    └── ...

How to Update

Upgrading Helm Chart Versions

  1. Review the chart’s changelog for breaking changes
  2. Update targetRevision in the Application YAML
  3. Update valuesObject if the new chart version requires it
  4. Commit and push to main
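
For example, a Loki chart upgrade would typically touch only the targetRevision line in apps/grafana-loki.yaml. The fragment below is a sketch; NEW_VERSION is a placeholder, not a recommended release:

```yaml
spec:
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    # Previously 6.31.0. Replace NEW_VERSION with the release whose changelog you reviewed.
    targetRevision: NEW_VERSION
```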

Modifying Configuration

For Helm-based applications:

  1. Edit the valuesObject in the Application YAML
  2. Commit and push to main
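
As an illustration, raising the Tempo distributor from one replica to two would be a one-line change in apps/grafana-tempo.yaml. The new value is an example only, not a recommended setting:

```yaml
spec:
  source:
    helm:
      valuesObject:
        distributor:
          replicas: 2   # was 1; example value only
```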

For Grafana (Kustomize):

  1. Edit manifests in k8s/infra-services/grafana/base/
  2. Commit and push to main

Alloy Configuration

Alloy uses an external ConfigMap for its configuration:

  1. Edit the grafana-alloy ConfigMap under k8s/infra-services/grafana-alloy/; the Alloy config lives in its config.alloy key
  2. Commit and push to main
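
The following is a minimal sketch of what such a ConfigMap could look like, assuming Alloy receives OTLP traces on the two exposed ports and forwards them to Tempo. The pipeline contents and the Tempo distributor address are assumptions for illustration; the real configuration is whatever the repository contains:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-alloy
  namespace: grafana-alloy
data:
  config.alloy: |
    // Receive OTLP over gRPC (4317) and HTTP (4318).
    otelcol.receiver.otlp "default" {
      grpc {
        endpoint = "0.0.0.0:4317"
      }
      http {
        endpoint = "0.0.0.0:4318"
      }
      output {
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }

    // Forward traces to Tempo. The endpoint is an assumed in-cluster address.
    otelcol.exporter.otlp "tempo" {
      client {
        endpoint = "grafana-tempo-distributor.grafana-tempo.svc.cluster.local:4317"
        tls {
          insecure = true
        }
      }
    }
```

The ConfigMap name and key have to match alloy.configMap.name and alloy.configMap.key in the Application values, since the chart does not create the ConfigMap itself.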

Integration

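Applications send logs, traces, and metrics to Alloy. Alloy forwards logs to Loki, traces to Tempo, and metrics to GCP Monitoring, and Grafana reads from all three backends: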

graph LR
    A[Applications] -->|logs| B[Alloy]
    A -->|traces| B
    A -->|metrics| B
    B -->|logs| C[Loki]
    B -->|traces| D[Tempo]
    B -->|metrics| E[GCP Monitoring]
    C --> F[Grafana]
    D --> F
    E --> F

Resource                     Purpose
grafana-loki ServiceAccount  Workload Identity for GCS access
tempo-gcs ServiceAccount     Workload Identity for GCS access
grafana-alloy ConfigMap      Alloy pipeline configuration