Kubernetes has become the de facto standard for deploying containerized applications, and analytics platforms are no exception. Running ClickHouse, Matomo, Plausible, and related tools on Kubernetes provides scalability, resilience, and operational efficiency that's hard to achieve with traditional deployments.

This guide covers everything you need to know about deploying and operating analytics workloads on Kubernetes, from Helm charts to production best practices.

⚠️ Important Notice: PostHog officially sunsetted Kubernetes support in May 2023. If you're looking to deploy PostHog, use their cloud offering or the Docker Compose "hobby" deployment instead. This guide has been updated to focus on ClickHouse and other analytics tools that actively support Kubernetes deployments.

Why Kubernetes for Analytics?

Before diving into implementation, let's understand why Kubernetes is well-suited for analytics workloads:

  • Elastic scaling: Handle traffic spikes during product launches or marketing campaigns
  • Self-healing: Automatic pod restarts and rescheduling on failures
  • Resource efficiency: Bin-packing optimizes hardware utilization
  • Declarative configuration: Version-controlled, reproducible deployments
  • Rich ecosystem: Operators, service meshes, and observability tools

Analytics Platform Options for Kubernetes

Several open-source analytics platforms actively support Kubernetes deployments:

  • ClickHouse: High-performance columnar database ideal for real-time analytics (fully supported via Altinity Operator)
  • Matomo: Full-featured Google Analytics alternative with self-hosting support
  • Plausible: Lightweight, privacy-focused web analytics
  • Metabase: Business intelligence and data visualization platform
  • Apache Superset: Modern data exploration and visualization platform

Prerequisites

Before deploying analytics workloads on Kubernetes, ensure you have:

  • Kubernetes cluster (1.28+) with at least 4 nodes (upstream currently supports 1.33, 1.34, and 1.35)
  • kubectl and Helm 3 installed
  • Storage class supporting dynamic provisioning
  • Ingress controller (nginx, traefik, or cloud provider)
  • At least 8GB RAM per node for ClickHouse workloads
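
A quick way to confirm the basics before proceeding (the ingress-nginx namespace is an assumption; adjust for your controller):

kubectl version
helm version --short
kubectl get storageclass
kubectl get pods -n ingress-nginx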

ClickHouse on Kubernetes

ClickHouse is the backbone of many analytics platforms and requires special attention due to its stateful nature and performance requirements.

Using the Altinity ClickHouse Operator

The Altinity ClickHouse Operator (currently v0.25.6) is the most popular and recommended way to run ClickHouse in Kubernetes. As of version 0.24.0, it includes native support for ClickHouse Keeper, eliminating the need for external ZooKeeper installations.

# Install the operator via Helm (recommended)
helm repo add clickhouse-operator https://docs.altinity.com/clickhouse-operator
helm repo update

# Install the operator
helm upgrade --install --create-namespace \
  --namespace clickhouse \
  clickhouse-operator \
  clickhouse-operator/altinity-clickhouse-operator

# Verify installation
kubectl get pods -n clickhouse
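
The operator ships several CRDs; confirming they registered is a useful sanity check:

kubectl get crd | grep altinity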

ClickHouse Cluster with ClickHouse Keeper

Modern deployments should use ClickHouse Keeper instead of ZooKeeper for coordination:

# clickhouse-keeper.yaml
apiVersion: clickhouse-keeper.altinity.com/v1
kind: ClickHouseKeeperInstallation
metadata:
  name: keeper-analytics
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: keeper
        layout:
          replicasCount: 3
  templates:
    podTemplates:
      - name: keeper-pod
        spec:
          containers:
            - name: clickhouse-keeper
              resources:
                requests:
                  cpu: "500m"
                  memory: 1Gi
                limits:
                  cpu: "1"
                  memory: 2Gi
    volumeClaimTemplates:
      - name: keeper-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
          storageClassName: fast-ssd
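
Apply the Keeper manifest and wait for all three replicas to report ready before creating the ClickHouse cluster (the namespace creation is included for completeness):

kubectl create namespace analytics --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f clickhouse-keeper.yaml
kubectl -n analytics get clickhousekeeperinstallations,pods

With Keeper up, define the cluster itself:
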
# clickhouse-cluster.yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: analytics-clickhouse
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: analytics
        layout:
          shardsCount: 2
          replicasCount: 2
        templates:
          podTemplate: clickhouse-pod
          volumeClaimTemplate: clickhouse-storage
    zookeeper:
      nodes:
        - host: keeper-keeper-analytics
          port: 2181
    users:
      admin/password_sha256_hex: "your-sha256-password-hash"
      admin/networks/ip:
        - 10.0.0.0/8
      readonly/password: "readonly-password"
      readonly/profile: readonly
    profiles:
      readonly:
        readonly: 1
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
                limits:
                  cpu: "4"
                  memory: 16Gi
    volumeClaimTemplates:
      - name: clickhouse-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 500Gi
          storageClassName: fast-ssd
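
Apply the cluster manifest and watch the installation reconcile; pod names follow the operator's chi-<installation>-<cluster>-<shard>-<replica> pattern, so the exec target below is illustrative:

kubectl apply -f clickhouse-cluster.yaml
kubectl -n analytics get clickhouseinstallations
kubectl -n analytics exec -it chi-analytics-clickhouse-analytics-0-0-0 -- clickhouse-client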

Matomo on Kubernetes

Matomo is a mature, full-featured analytics platform that deploys cleanly on Kubernetes via the Bitnami Helm chart:

# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Create namespace
kubectl create namespace matomo

# Install Matomo
helm install matomo bitnami/matomo \
  --namespace matomo \
  --values matomo-values.yaml

A starter values file:

# matomo-values.yaml
replicaCount: 2

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

mariadb:
  enabled: true
  architecture: replication
  auth:
    rootPassword: "secure-root-password"
    database: matomo
  primary:
    persistence:
      size: 50Gi
      storageClass: fast-ssd

ingress:
  enabled: true
  hostname: analytics.example.com
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls: true

persistence:
  enabled: true
  size: 10Gi
  storageClass: standard
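
The chart stores the generated admin password in a Secret; for a release named matomo, retrieval typically looks like this (key names can vary between chart versions):

kubectl -n matomo get secret matomo -o jsonpath="{.data.matomo-password}" | base64 -d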

Plausible on Kubernetes

Plausible is a lightweight, privacy-focused alternative that's easy to deploy:

# plausible-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plausible
  namespace: analytics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: plausible
  template:
    metadata:
      labels:
        app: plausible
    spec:
      containers:
        - name: plausible
          image: ghcr.io/plausible/community-edition:v2.1
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: database-url
            - name: CLICKHOUSE_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: clickhouse-url
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: secret-key
            - name: BASE_URL
              value: "https://analytics.example.com"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 2Gi
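
The Deployment expects a plausible-secrets Secret to exist; a sketch with placeholder connection strings (replace hosts and credentials with your own):

kubectl -n analytics create secret generic plausible-secrets \
  --from-literal=database-url='postgres://plausible:CHANGE_ME@postgres:5432/plausible' \
  --from-literal=clickhouse-url='http://clickhouse:8123/plausible_events_db' \
  --from-literal=secret-key="$(openssl rand -base64 48)"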

Resource Management

Proper resource allocation is critical for analytics workloads:

CPU and Memory Guidelines

Component            CPU Request   Memory Request   Notes
ClickHouse           2000m         8Gi              More RAM = better query performance
ClickHouse Keeper    500m          1Gi              3 replicas minimum for HA
Matomo Web           250m          512Mi            Scale horizontally for traffic
Plausible            250m          512Mi            Lightweight, easy to scale
PostgreSQL           1000m         2Gi              Depends on metadata volume
Redis                250m          1Gi              Size based on cache needs
Kafka                1000m         4Gi              Per broker; scale brokers for throughput

Resource Quotas

Protect your cluster with resource quotas:

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: analytics-quota
  namespace: analytics
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    persistentvolumeclaims: "20"
    requests.storage: 2Ti
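
Apply it and review consumption against the caps:

kubectl apply -f resource-quota.yaml
kubectl -n analytics describe resourcequota analytics-quota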

Auto-Scaling Configuration

Configure Horizontal Pod Autoscaler for variable workloads:

HPA for Web Services

# hpa-analytics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-web-hpa
  namespace: analytics
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
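
Note that resource-metric HPAs depend on metrics-server; if kubectl top returns data, the HPA can scale:

kubectl top pods -n analytics
kubectl -n analytics get hpa analytics-web-hpa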

KEDA for Event-Driven Scaling

KEDA (Kubernetes Event-Driven Autoscaling) enables scaling based on external metrics like Kafka consumer lag. KEDA is a CNCF graduated project with 70+ built-in scalers:

# Install KEDA via Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

Then define a ScaledObject for the worker deployment:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: analytics-worker-scaler
  namespace: analytics
spec:
  scaleTargetRef:
    name: analytics-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: analytics-workers
        topic: events_ingestion
        lagThreshold: "1000"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total{job="analytics"}[2m]))
        threshold: "100"

Persistent Storage

Analytics workloads require reliable, performant storage:

Storage Class Selection

  • ClickHouse: Use fastest available storage (gp3/io2 on AWS, pd-ssd on GCP)
  • Kafka: High-throughput storage with good IOPS
  • PostgreSQL: Standard SSD storage is usually sufficient

For example, a gp3-backed class on AWS EBS:

# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "10000"
  throughput: "500"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
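
Because allowVolumeExpansion is enabled, a PVC can be grown in place by patching its requested size (the PVC name here is illustrative, and the new size must exceed the current one):

kubectl -n analytics patch pvc clickhouse-data-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"750Gi"}}}}'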

Volume Snapshots for Backup

# volume-snapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: clickhouse-backup-daily
  namespace: analytics
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: clickhouse-data-0
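
Restoring is the inverse: create a new PVC that uses the snapshot as its data source. A sketch, assuming the snapshot above exists and the requested size matches the original volume:

# pvc-from-snapshot.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data-restored
  namespace: analytics
spec:
  storageClassName: fast-ssd
  dataSource:
    name: clickhouse-backup-daily
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi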

Networking Best Practices

Network Policies

Restrict traffic between components:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-network-policy
  namespace: analytics
spec:
  podSelector:
    matchLabels:
      app: clickhouse
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: analytics-web
        - podSelector:
            matchLabels:
              app: clickhouse
      ports:
        - port: 8123
        - port: 9000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: clickhouse
        - podSelector:
            matchLabels:
              app: clickhouse-keeper
      ports:
        - port: 9000
        - port: 2181
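
One caveat: once Egress appears in policyTypes, all other outbound traffic is denied, including DNS, so name resolution breaks unless explicitly allowed. An additional egress rule to append (assumes CoreDNS in kube-system):

# append under the same egress: list
- to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
  ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP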

Service Mesh Considerations

If using Istio or Linkerd:

  • Exclude ClickHouse native protocol (port 9000) from mTLS initially
  • Configure proper retry policies for transient failures
  • Set appropriate timeouts for long-running queries (ClickHouse queries can run for minutes)

Monitoring Stack

Comprehensive monitoring is essential for production operations:

Prometheus ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: clickhouse-monitor
  namespace: analytics
spec:
  selector:
    matchLabels:
      app: clickhouse
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Key Metrics to Monitor

  • ClickHouse: ClickHouseProfileEvents_Query, ClickHouseMetrics_ReplicasMaxQueueSize, ClickHouseAsyncMetrics_ReplicasSumQueueSize
  • Kafka: kafka_consumer_lag, kafka_messages_in_per_sec
  • Kubernetes: Pod restarts, resource utilization, PVC usage
  • Application: Request latency, error rates, ingestion throughput

Grafana Dashboards

Deploy pre-built dashboards for visibility. The Altinity Operator includes Prometheus alerting rules and Grafana dashboards:

# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  clickhouse-overview.json: |
    {
      "title": "ClickHouse Overview",
      "panels": [...]
    }

Production Best Practices

Pod Disruption Budgets

Ensure availability during maintenance:

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
  namespace: analytics
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: clickhouse

Pod Anti-Affinity

Spread replicas across nodes and zones:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: clickhouse
        topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: clickhouse
          topologyKey: topology.kubernetes.io/zone

Priority Classes

Ensure critical components get resources first:

# priority-class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: analytics-critical
value: 1000000
globalDefault: false
description: "Priority class for critical analytics components"
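
Reference the class from the pod template of any workload that should win scheduling contention:

# in the pod template of critical workloads
priorityClassName: analytics-critical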

Security Considerations

Secrets Management

Use external secrets operators for sensitive data:

# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: clickhouse-credentials
  namespace: analytics
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: clickhouse-credentials
  data:
    - secretKey: admin-password
      remoteRef:
        key: analytics/clickhouse
        property: admin-password
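
This presumes the External Secrets Operator and a ClusterSecretStore named aws-secrets-manager are already configured; the operator itself installs via Helm:

helm repo add external-secrets https://charts.external-secrets.io
helm repo update
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace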

Pod Security Standards

securityContext:
  runAsNonRoot: true
  runAsUser: 101
  fsGroup: 101
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

Upgrade Strategies

Rolling Updates

Configure safe rolling updates:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Blue-Green Deployments

For major version upgrades, consider blue-green deployments:

  1. Deploy new version alongside existing
  2. Run both versions with traffic splitting
  3. Validate data consistency and performance
  4. Switch traffic to the new version (see the Service sketch after this list)
  5. Keep old version for quick rollback
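
On Kubernetes, the traffic switch in step 4 can be as simple as flipping a label selector on a Service; a sketch (names, labels, and ports are illustrative):

# blue-green-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: analytics-web
  namespace: analytics
spec:
  selector:
    app: analytics-web
    version: blue  # change to "green" to cut over, or back to roll back
  ports:
    - port: 80
      targetPort: 8000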

ClickHouse Operator Upgrades

The Altinity Operator now uses SYSTEM SHUTDOWN instead of pod recreation when applying configuration changes, significantly speeding up updates on nodes with large volumes.

Troubleshooting Common Issues

Pod Evictions

If pods are being evicted:

  • Check resource limits and requests
  • Review node resource pressure with kubectl describe node
  • Consider using priority classes
  • Check for memory leaks in long-running queries

Storage Performance

If queries are slow:

  • Verify storage class IOPS limits
  • Check for throttling on cloud provider
  • Monitor disk utilization metrics
  • Consider using local NVMe storage for ClickHouse

Network Timeouts

For connection issues:

  • Verify network policies allow required traffic
  • Check service mesh sidecar logs
  • Review DNS resolution with CoreDNS
  • Ensure ClickHouse Keeper/ZooKeeper is healthy

ClickHouse CrashLoopBackOff

If ClickHouse pods fail to start:

  • Check configuration templates for syntax errors
  • Verify backward compatibility when upgrading versions
  • Override the container entrypoint (for example, run sleep infinity) so you can exec in and inspect the container; see the commands after this list
  • Review Altinity's troubleshooting guide for specific error messages
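
To start, capture the crash output and recent events; the pod name below follows the operator's naming convention and is illustrative:

kubectl -n analytics logs chi-analytics-clickhouse-analytics-0-0-0 --previous
kubectl -n analytics describe pod chi-analytics-clickhouse-analytics-0-0-0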

Cost Optimization

Right-sizing Resources

  • Use Vertical Pod Autoscaler (VPA) in recommendation mode to identify optimal resource requests
  • Implement KEDA scale-to-zero for development environments
  • Use spot/preemptible instances for non-critical workloads

Storage Tiering

ClickHouse supports tiered storage — use cold storage for historical data:

<storage_configuration>
  <disks>
    <hot>
      <path>/var/lib/clickhouse/hot/</path>
    </hot>
    <cold>
      <type>s3</type>
      <endpoint>https://s3.amazonaws.com/bucket/</endpoint>
    </cold>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot><disk>hot</disk></hot>
        <cold><disk>cold</disk></cold>
      </volumes>
      <move_factor>0.1</move_factor>
    </tiered>
  </policies>
</storage_configuration>
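
Once the policy is defined, assign it per table so TTL moves push aging parts to the cold volume. A minimal sketch (table and columns are illustrative):

CREATE TABLE analytics.events
(
    event_time DateTime,
    user_id UInt64,
    event_name String
)
ENGINE = MergeTree
ORDER BY (event_name, event_time)
TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';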

Next Steps

Successfully running analytics on Kubernetes requires ongoing attention:

  1. Start with the official Helm charts and customize incrementally
  2. Implement comprehensive monitoring before going to production
  3. Document runbooks for common operational tasks
  4. Practice disaster recovery procedures regularly
  5. Stay updated with upstream releases and security patches
  6. Consider managed services (Altinity.Cloud, ClickHouse Cloud) if operational burden is too high

Final Thoughts

Kubernetes provides a powerful foundation for scalable analytics infrastructure, but it requires investment in operational expertise to realize its full potential.