Kubernetes has become the de facto standard for deploying containerized applications, and analytics platforms are no exception. Running ClickHouse, Matomo, Plausible, and related tools on Kubernetes provides scalability, resilience, and operational efficiency that's hard to achieve with traditional deployments.

This guide covers everything you need to know about deploying and operating analytics workloads on Kubernetes, from Helm charts to production best practices.

⚠️ Important Notice: PostHog officially sunsetted Kubernetes support in May 2023. If you're looking to deploy PostHog, use their cloud offering or the Docker Compose "hobby" deployment instead. This guide has been updated to focus on ClickHouse and other analytics tools that actively support Kubernetes deployments.

Why Kubernetes for Analytics?

Before diving into implementation, let's understand why Kubernetes is well-suited for analytics workloads:

  • Elastic scaling: Handle traffic spikes during product launches or marketing campaigns
  • Self-healing: Automatic pod restarts and rescheduling on failures
  • Resource efficiency: Bin-packing optimizes hardware utilization
  • Declarative configuration: Version-controlled, reproducible deployments
  • Rich ecosystem: Operators, service meshes, and observability tools

Analytics Platform Options for Kubernetes

Several open-source analytics platforms actively support Kubernetes deployments:

  • ClickHouse: High-performance columnar database ideal for real-time analytics (fully supported via Altinity Operator)
  • Matomo: Full-featured Google Analytics alternative with self-hosting support
  • Plausible: Lightweight, privacy-focused web analytics
  • Metabase: Business intelligence and data visualization platform
  • Apache Superset: Modern data exploration and visualization platform

Prerequisites

Before deploying analytics workloads on Kubernetes, ensure you have:

  • Kubernetes cluster (1.28+) with at least 4 nodes (upstream currently supports 1.33, 1.34, and 1.35)
  • kubectl and Helm 3 installed
  • Storage class supporting dynamic provisioning
  • Ingress controller (nginx, traefik, or cloud provider)
  • At least 8GB RAM per node for ClickHouse workloads
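
A quick way to confirm the basics before proceeding (the ingress-nginx namespace is an assumption; adjust for your controller):

kubectl version
helm version --short
kubectl get storageclass
kubectl get pods -n ingress-nginx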

ClickHouse on Kubernetes

ClickHouse is the backbone of many analytics platforms and requires special attention due to its stateful nature and performance requirements.

Using the Altinity ClickHouse Operator

The Altinity ClickHouse Operator (currently v0.25.6) is the most popular and recommended way to run ClickHouse in Kubernetes. As of version 0.24.0, it includes native support for ClickHouse Keeper, eliminating the need for external ZooKeeper installations.

# Install the operator via Helm (recommended)
helm repo add clickhouse-operator https://docs.altinity.com/clickhouse-operator
helm repo update

# Install the operator
helm upgrade --install --create-namespace \
  --namespace clickhouse \
  clickhouse-operator \
  clickhouse-operator/altinity-clickhouse-operator

# Verify installation
kubectl get pods -n clickhouse
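
The operator ships several CRDs; confirming they registered is a useful sanity check:

kubectl get crd | grep altinity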

ClickHouse Cluster with ClickHouse Keeper

Modern deployments should use ClickHouse Keeper instead of ZooKeeper for coordination:

# clickhouse-keeper.yaml
apiVersion: clickhouse-keeper.altinity.com/v1
kind: ClickHouseKeeperInstallation
metadata:
  name: keeper-analytics
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: keeper
        layout:
          replicasCount: 3
  templates:
    podTemplates:
      - name: keeper-pod
        spec:
          containers:
            - name: clickhouse-keeper
              resources:
                requests:
                  cpu: "500m"
                  memory: 1Gi
                limits:
                  cpu: "1"
                  memory: 2Gi
    volumeClaimTemplates:
      - name: keeper-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
          storageClassName: fast-ssd
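
Apply the Keeper manifest and wait for all three replicas to report ready before creating the ClickHouse cluster (the namespace creation is included for completeness):

kubectl create namespace analytics --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f clickhouse-keeper.yaml
kubectl -n analytics get clickhousekeeperinstallations,pods

With Keeper up, define the cluster itself:
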
# clickhouse-cluster.yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: analytics-clickhouse
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: analytics
        layout:
          shardsCount: 2
          replicasCount: 2
        templates:
          podTemplate: clickhouse-pod
          volumeClaimTemplate: clickhouse-storage
    zookeeper:
      nodes:
        - host: keeper-keeper-analytics
          port: 2181
    users:
      admin/password_sha256_hex: "your-sha256-password-hash"
      admin/networks/ip:
        - 10.0.0.0/8
      readonly/password: "readonly-password"
      readonly/profile: readonly
    profiles:
      readonly:
        readonly: 1
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
                limits:
                  cpu: "4"
                  memory: 16Gi
    volumeClaimTemplates:
      - name: clickhouse-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 500Gi
          storageClassName: fast-ssd
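
Apply the cluster manifest and watch the installation reconcile; pod names follow the operator's chi-<installation>-<cluster>-<shard>-<replica> pattern, so the exec target below is illustrative:

kubectl apply -f clickhouse-cluster.yaml
kubectl -n analytics get clickhouseinstallations
kubectl -n analytics exec -it chi-analytics-clickhouse-analytics-0-0-0 -- clickhouse-client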

Matomo on Kubernetes

Matomo is a mature, full-featured analytics platform that deploys cleanly on Kubernetes via the Bitnami Helm chart:

# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Create namespace
kubectl create namespace matomo

# Install Matomo
helm install matomo bitnami/matomo \
  --namespace matomo \
  --values matomo-values.yaml

A starter values file:

# matomo-values.yaml
replicaCount: 2

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

mariadb:
  enabled: true
  architecture: replication
  auth:
    rootPassword: "secure-root-password"
    database: matomo
  primary:
    persistence:
      size: 50Gi
      storageClass: fast-ssd

ingress:
  enabled: true
  hostname: analytics.example.com
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls: true

persistence:
  enabled: true
  size: 10Gi
  storageClass: standard
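
The chart stores the generated admin password in a Secret; for a release named matomo, retrieval typically looks like this (key names can vary between chart versions):

kubectl -n matomo get secret matomo -o jsonpath="{.data.matomo-password}" | base64 -d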

Plausible on Kubernetes

Plausible is a lightweight, privacy-focused alternative that's easy to deploy:

# plausible-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plausible
  namespace: analytics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: plausible
  template:
    metadata:
      labels:
        app: plausible
    spec:
      containers:
        - name: plausible
          image: ghcr.io/plausible/community-edition:v2.1
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: database-url
            - name: CLICKHOUSE_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: clickhouse-url
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: secret-key
            - name: BASE_URL
              value: "https://analytics.example.com"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 2Gi
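
The Deployment expects a plausible-secrets Secret to exist; a sketch with placeholder connection strings (replace hosts and credentials with your own):

kubectl -n analytics create secret generic plausible-secrets \
  --from-literal=database-url='postgres://plausible:CHANGE_ME@postgres:5432/plausible' \
  --from-literal=clickhouse-url='http://clickhouse:8123/plausible_events_db' \
  --from-literal=secret-key="$(openssl rand -base64 48)"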

Resource Management

Proper resource allocation is critical for analytics workloads:

CPU and Memory Guidelines

Component            CPU Request   Memory Request   Notes
ClickHouse           2000m         8Gi              More RAM = better query performance
ClickHouse Keeper    500m          1Gi              3 replicas minimum for HA
Matomo Web           250m          512Mi            Scale horizontally for traffic
Plausible            250m          512Mi            Lightweight, easy to scale
PostgreSQL           1000m         2Gi              Depends on metadata volume
Redis                250m          1Gi              Size based on cache needs
Kafka                1000m         4Gi              Per broker; scale brokers for throughput

Resource Quotas

Protect your cluster with resource quotas:

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: analytics-quota
  namespace: analytics
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    persistentvolumeclaims: "20"
    requests.storage: 2Ti
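
Apply it and review consumption against the caps:

kubectl apply -f resource-quota.yaml
kubectl -n analytics describe resourcequota analytics-quota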

Auto-Scaling Configuration

Configure Horizontal Pod Autoscaler for variable workloads:

HPA for Web Services

# hpa-analytics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-web-hpa
  namespace: analytics
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
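
Note that resource-metric HPAs depend on metrics-server; if kubectl top returns data, the HPA can scale:

kubectl top pods -n analytics
kubectl -n analytics get hpa analytics-web-hpa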

KEDA for Event-Driven Scaling

KEDA (Kubernetes Event-Driven Autoscaling) enables scaling based on external metrics like Kafka consumer lag. KEDA is a CNCF graduated project with 70+ built-in scalers:

# Install KEDA via Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

Then define a ScaledObject for the worker deployment:

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: analytics-worker-scaler
  namespace: analytics
spec:
  scaleTargetRef:
    name: analytics-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: analytics-workers
        topic: events_ingestion
        lagThreshold: "1000"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total{job="analytics"}[2m]))
        threshold: "100"

Persistent Storage

Analytics workloads require reliable, performant storage:

Storage Class Selection

  • ClickHouse: Use fastest available storage (gp3/io2 on AWS, pd-ssd on GCP)
  • Kafka: High-throughput storage with good IOPS
  • PostgreSQL: Standard SSD storage is usually sufficient

For example, a gp3-backed class on AWS EBS:

# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "10000"
  throughput: "500"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
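
Because allowVolumeExpansion is enabled, a PVC can be grown in place by patching its requested size (the PVC name here is illustrative, and the new size must exceed the current one):

kubectl -n analytics patch pvc clickhouse-data-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"750Gi"}}}}'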

Volume Snapshots for Backup

# volume-snapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: clickhouse-backup-daily
  namespace: analytics
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: clickhouse-data-0
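
Restoring is the inverse: create a new PVC that uses the snapshot as its data source. A sketch, assuming the snapshot above exists and the requested size matches the original volume:

# pvc-from-snapshot.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data-restored
  namespace: analytics
spec:
  storageClassName: fast-ssd
  dataSource:
    name: clickhouse-backup-daily
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi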

Networking Best Practices

Network Policies

Restrict traffic between components:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-network-policy
  namespace: analytics
spec:
  podSelector:
    matchLabels:
      app: clickhouse
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: analytics-web
        - podSelector:
            matchLabels:
              app: clickhouse
      ports:
        - port: 8123
        - port: 9000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: clickhouse
        - podSelector:
            matchLabels:
              app: clickhouse-keeper
      ports:
        - port: 9000
        - port: 2181
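
One caveat: once Egress appears in policyTypes, all other outbound traffic is denied, including DNS, so name resolution breaks unless explicitly allowed. An additional egress rule to append (assumes CoreDNS in kube-system):

# append under the same egress: list
- to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
  ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP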

Service Mesh Considerations

If using Istio or Linkerd:

  • Exclude ClickHouse native protocol (port 9000) from mTLS initially
  • Configure proper retry policies for transient failures
  • Set appropriate timeouts for long-running queries (ClickHouse queries can run for minutes)

Monitoring Stack

Comprehensive monitoring is essential for production operations:

Prometheus ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: clickhouse-monitor
  namespace: analytics
spec:
  selector:
    matchLabels:
      app: clickhouse
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Key Metrics to Monitor

  • ClickHouse: ClickHouseProfileEvents_Query, ClickHouseMetrics_ReplicasMaxQueueSize, ClickHouseAsyncMetrics_ReplicasSumQueueSize
  • Kafka: kafka_consumer_lag, kafka_messages_in_per_sec
  • Kubernetes: Pod restarts, resource utilization, PVC usage
  • Application: Request latency, error rates, ingestion throughput

Grafana Dashboards

Deploy pre-built dashboards for visibility. The Altinity Operator includes Prometheus alerting rules and Grafana dashboards:

# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  clickhouse-overview.json: |
    {
      "title": "ClickHouse Overview",
      "panels": [...]
    }

Production Best Practices

Pod Disruption Budgets

Ensure availability during maintenance:

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
  namespace: analytics
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: clickhouse

Pod Anti-Affinity

Spread replicas across nodes and zones:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: clickhouse
        topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: clickhouse
          topologyKey: topology.kubernetes.io/zone

Priority Classes

Ensure critical components get resources first:

# priority-class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: analytics-critical
value: 1000000
globalDefault: false
description: "Priority class for critical analytics components"
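
Reference the class from the pod template of any workload that should win scheduling contention:

# in the pod template of critical workloads
priorityClassName: analytics-critical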

Security Considerations

Secrets Management

Use external secrets operators for sensitive data:

# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: clickhouse-credentials
  namespace: analytics
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: clickhouse-credentials
  data:
    - secretKey: admin-password
      remoteRef:
        key: analytics/clickhouse
        property: admin-password
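
This presumes the External Secrets Operator and a ClusterSecretStore named aws-secrets-manager are already configured; the operator itself installs via Helm:

helm repo add external-secrets https://charts.external-secrets.io
helm repo update
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace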

Pod Security Standards

securityContext:
  runAsNonRoot: true
  runAsUser: 101
  fsGroup: 101
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

Upgrade Strategies

Rolling Updates

Configure safe rolling updates:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Blue-Green Deployments

For major version upgrades, consider blue-green deployments:

  1. Deploy new version alongside existing
  2. Run both versions with traffic splitting
  3. Validate data consistency and performance
  4. Switch traffic to the new version (see the Service sketch after this list)
  5. Keep old version for quick rollback
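
On Kubernetes, the traffic switch in step 4 can be as simple as flipping a label selector on a Service; a sketch (names, labels, and ports are illustrative):

# blue-green-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: analytics-web
  namespace: analytics
spec:
  selector:
    app: analytics-web
    version: blue  # change to "green" to cut over, or back to roll back
  ports:
    - port: 80
      targetPort: 8000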

ClickHouse Operator Upgrades

The Altinity Operator now uses SYSTEM SHUTDOWN instead of pod recreation when applying configuration changes, significantly speeding up updates on nodes with large volumes.

Troubleshooting Common Issues

Pod Evictions

If pods are being evicted:

  • Check resource limits and requests
  • Review node resource pressure with kubectl describe node
  • Consider using priority classes
  • Check for memory leaks in long-running queries

Storage Performance

If queries are slow:

  • Verify storage class IOPS limits
  • Check for throttling on cloud provider
  • Monitor disk utilization metrics
  • Consider using local NVMe storage for ClickHouse

Network Timeouts

For connection issues:

  • Verify network policies allow required traffic
  • Check service mesh sidecar logs
  • Review DNS resolution with CoreDNS
  • Ensure ClickHouse Keeper/ZooKeeper is healthy

ClickHouse CrashLoopBackOff

If ClickHouse pods fail to start:

  • Check configuration templates for syntax errors
  • Verify backward compatibility when upgrading versions
  • Override the container entrypoint (for example, run sleep infinity) so you can exec in and inspect the container; see the commands after this list
  • Review Altinity's troubleshooting guide for specific error messages
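
To start, capture the crash output and recent events; the pod name below follows the operator's naming convention and is illustrative:

kubectl -n analytics logs chi-analytics-clickhouse-analytics-0-0-0 --previous
kubectl -n analytics describe pod chi-analytics-clickhouse-analytics-0-0-0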

Cost Optimization

Right-sizing Resources

  • Use Vertical Pod Autoscaler (VPA) in recommendation mode to identify optimal resource requests
  • Implement KEDA scale-to-zero for development environments
  • Use spot/preemptible instances for non-critical workloads

Storage Tiering

ClickHouse supports tiered storage — use cold storage for historical data:

<storage_configuration>
  <disks>
    <hot>
      <path>/var/lib/clickhouse/hot/</path>
    </hot>
    <cold>
      <type>s3</type>
      <endpoint>https://s3.amazonaws.com/bucket/</endpoint>
    </cold>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot><disk>hot</disk></hot>
        <cold><disk>cold</disk></cold>
      </volumes>
      <move_factor>0.1</move_factor>
    </tiered>
  </policies>
</storage_configuration>
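
Once the policy is defined, assign it per table so TTL moves push aging parts to the cold volume. A minimal sketch (table and columns are illustrative):

CREATE TABLE analytics.events
(
    event_time DateTime,
    user_id UInt64,
    event_name String
)
ENGINE = MergeTree
ORDER BY (event_name, event_time)
TTL event_time + INTERVAL 90 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';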

Next Steps

Successfully running analytics on Kubernetes requires ongoing attention:

  1. Start with the official Helm charts and customize incrementally
  2. Implement comprehensive monitoring before going to production
  3. Document runbooks for common operational tasks
  4. Practice disaster recovery procedures regularly
  5. Stay updated with upstream releases and security patches
  6. Consider managed services (Altinity.Cloud, ClickHouse Cloud) if operational burden is too high

Final Thoughts

Kubernetes provides a powerful foundation for scalable analytics infrastructure, but it requires investment in operational expertise to realize its full potential.