Kubernetes has become the de facto standard for deploying containerized applications, and analytics platforms are no exception. Running ClickHouse, Matomo, Plausible, and related tools on Kubernetes provides scalability, resilience, and operational efficiency that's hard to achieve with traditional deployments.
This guide covers everything you need to know about deploying and operating analytics workloads on Kubernetes, from Helm charts to production best practices.
Why Kubernetes for Analytics?
Before diving into implementation, let's understand why Kubernetes is well-suited for analytics workloads:
- Elastic scaling: Handle traffic spikes during product launches or marketing campaigns
- Self-healing: Automatic pod restarts and rescheduling on failures
- Resource efficiency: Bin-packing optimizes hardware utilization
- Declarative configuration: Version-controlled, reproducible deployments
- Rich ecosystem: Operators, service meshes, and observability tools
Analytics Platform Options for Kubernetes
Several open-source analytics platforms actively support Kubernetes deployments:
- ClickHouse: High-performance columnar database ideal for real-time analytics (fully supported via Altinity Operator)
- Matomo: Full-featured Google Analytics alternative with self-hosting support
- Plausible: Lightweight, privacy-focused web analytics
- Metabase: Business intelligence and data visualization platform
- Apache Superset: Modern data exploration and visualization platform
Prerequisites
Before deploying analytics workloads on Kubernetes, ensure you have:
- Kubernetes cluster on a currently supported version (check the upstream release schedule) with at least 4 nodes
- kubectl and Helm 3 installed
- Storage class supporting dynamic provisioning
- Ingress controller (nginx, traefik, or cloud provider)
- At least 8GB RAM per node for ClickHouse workloads
ClickHouse on Kubernetes
ClickHouse is the backbone of many analytics platforms and requires special attention due to its stateful nature and performance requirements.
Using the Altinity ClickHouse Operator
The Altinity ClickHouse Operator (currently v0.25.6) is the most popular and recommended way to run ClickHouse in Kubernetes. As of version 0.24.0, it includes native support for ClickHouse Keeper, eliminating the need for external ZooKeeper installations.
```bash
# Install the operator via Helm (recommended)
helm repo add clickhouse-operator https://docs.altinity.com/clickhouse-operator
helm repo update

# Install the operator
helm upgrade --install --create-namespace \
  --namespace clickhouse \
  clickhouse-operator \
  clickhouse-operator/altinity-clickhouse-operator

# Verify installation
kubectl get pods -n clickhouse
```
ClickHouse Cluster with ClickHouse Keeper
Modern deployments should use ClickHouse Keeper instead of ZooKeeper for coordination:
```yaml
# clickhouse-keeper.yaml
apiVersion: clickhouse-keeper.altinity.com/v1
kind: ClickHouseKeeperInstallation
metadata:
  name: keeper-analytics
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: keeper
        layout:
          replicasCount: 3
  templates:
    podTemplates:
      - name: keeper-pod
        spec:
          containers:
            - name: clickhouse-keeper
              resources:
                requests:
                  cpu: "500m"
                  memory: 1Gi
                limits:
                  cpu: "1"
                  memory: 2Gi
    volumeClaimTemplates:
      - name: keeper-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
          storageClassName: fast-ssd
```
```yaml
# clickhouse-cluster.yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: analytics-clickhouse
  namespace: analytics
spec:
  configuration:
    clusters:
      - name: analytics
        layout:
          shardsCount: 2
          replicasCount: 2
        templates:
          podTemplate: clickhouse-pod
          volumeClaimTemplate: clickhouse-storage
    zookeeper:
      nodes:
        - host: keeper-keeper-analytics
          port: 2181
    users:
      admin/password_sha256_hex: "your-sha256-password-hash"
      admin/networks/ip:
        - 10.0.0.0/8
      readonly/password: "readonly-password"
      readonly/profile: readonly
    profiles:
      readonly/readonly: 1
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
                limits:
                  cpu: "4"
                  memory: 16Gi
    volumeClaimTemplates:
      - name: clickhouse-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 500Gi
          storageClassName: fast-ssd
```
Matomo on Kubernetes
Matomo is a mature, full-featured analytics platform with comprehensive Kubernetes support:
```bash
# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Create namespace
kubectl create namespace matomo

# Install Matomo
helm install matomo bitnami/matomo \
  --namespace matomo \
  --values matomo-values.yaml
```
```yaml
# matomo-values.yaml
replicaCount: 2

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

mariadb:
  enabled: true
  architecture: replication
  auth:
    rootPassword: "secure-root-password"
    database: matomo
  primary:
    persistence:
      size: 50Gi
      storageClass: fast-ssd

ingress:
  enabled: true
  hostname: analytics.example.com
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
  tls: true

persistence:
  enabled: true
  size: 10Gi
  storageClass: standard
```
Plausible on Kubernetes
Plausible is a lightweight, privacy-focused alternative that's easy to deploy:
```yaml
# plausible-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plausible
  namespace: analytics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: plausible
  template:
    metadata:
      labels:
        app: plausible
    spec:
      containers:
        - name: plausible
          image: ghcr.io/plausible/community-edition:v2.1
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: database-url
            - name: CLICKHOUSE_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: clickhouse-url
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: plausible-secrets
                  key: secret-key
            - name: BASE_URL
              value: "https://analytics.example.com"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 2Gi
```
Resource Management
Proper resource allocation is critical for analytics workloads:
CPU and Memory Guidelines
| Component | CPU Request | Memory Request | Notes |
|---|---|---|---|
| ClickHouse | 2000m | 8Gi | More RAM = better query performance |
| ClickHouse Keeper | 500m | 1Gi | 3 replicas minimum for HA |
| Matomo Web | 250m | 512Mi | Scale horizontally for traffic |
| Plausible | 250m | 512Mi | Lightweight, easy to scale |
| PostgreSQL | 1000m | 2Gi | Depends on metadata volume |
| Redis | 250m | 1Gi | Size based on cache needs |
| Kafka | 1000m | 4Gi | Per broker, scale brokers for throughput |
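To keep individual pods within guidelines like these, a LimitRange can enforce per-container defaults for the namespace. A minimal sketch (the default values here are illustrative, not prescriptive):

```yaml
# limit-range.yaml (illustrative defaults; tune per workload)
apiVersion: v1
kind: LimitRange
metadata:
  name: analytics-defaults
  namespace: analytics
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: 250m
        memory: 512Mi
      default:              # applied when a container omits limits
        cpu: "1"
        memory: 2Gi
```

Containers that declare their own requests and limits, like the ClickHouse pods above, are unaffected; the defaults only catch workloads that forget to set them.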
Resource Quotas
Protect your cluster with resource quotas:
```yaml
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: analytics-quota
  namespace: analytics
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    persistentvolumeclaims: "20"
    requests.storage: 2Ti
```
Auto-Scaling Configuration
Configure Horizontal Pod Autoscaler for variable workloads:
HPA for Web Services
```yaml
# hpa-analytics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-web-hpa
  namespace: analytics
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
```
KEDA for Event-Driven Scaling
KEDA (Kubernetes Event-Driven Autoscaling) enables scaling based on external metrics like Kafka consumer lag. KEDA is a CNCF graduated project with 70+ built-in scalers:
```bash
# Install KEDA via Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
```
```yaml
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: analytics-worker-scaler
  namespace: analytics
spec:
  scaleTargetRef:
    name: analytics-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: analytics-workers
        topic: events_ingestion
        lagThreshold: "1000"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total{job="analytics"}[2m]))
        threshold: "100"
```
Persistent Storage
Analytics workloads require reliable, performant storage:
Storage Class Selection
- ClickHouse: Use fastest available storage (gp3/io2 on AWS, pd-ssd on GCP)
- Kafka: High-throughput storage with good IOPS
- PostgreSQL: Standard SSD storage is usually sufficient
```yaml
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "10000"
  throughput: "500"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
Volume Snapshots for Backup
```yaml
# volume-snapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: clickhouse-backup-daily
  namespace: analytics
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: clickhouse-data-0
```
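The snapshot above references a `csi-aws-vsc` VolumeSnapshotClass, which must already exist in the cluster. A sketch of such a class, assuming the AWS EBS CSI driver:

```yaml
# volume-snapshot-class.yaml (assumes the AWS EBS CSI driver)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com
deletionPolicy: Retain   # keep the underlying snapshot if the object is deleted
```

On other clouds, substitute the matching CSI driver name and any provider-specific parameters.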
Networking Best Practices
Network Policies
Restrict traffic between components:
```yaml
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: clickhouse-network-policy
  namespace: analytics
spec:
  podSelector:
    matchLabels:
      app: clickhouse
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: analytics-web
        - podSelector:
            matchLabels:
              app: clickhouse
      ports:
        - port: 8123
        - port: 9000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: clickhouse
        - podSelector:
            matchLabels:
              app: clickhouse-keeper
      ports:
        - port: 9000
        - port: 2181
```
Service Mesh Considerations
If using Istio or Linkerd:
- Exclude ClickHouse native protocol (port 9000) from mTLS initially
- Configure proper retry policies for transient failures
- Set appropriate timeouts for long-running queries (ClickHouse queries can run for minutes)
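If you run Istio, the first point can be expressed as a port-level mTLS override on the ClickHouse workload. A sketch, assuming Istio's PeerAuthentication API (selector labels are illustrative):

```yaml
# peer-authentication.yaml (Istio only; illustrative)
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: clickhouse-native-protocol
  namespace: analytics
spec:
  selector:
    matchLabels:
      app: clickhouse
  mtls:
    mode: STRICT        # mTLS for everything else
  portLevelMtls:
    9000:
      mode: PERMISSIVE  # allow plain-text native protocol initially
```

Once all clients speak through the mesh, the port-level override can be removed to enforce STRICT mode everywhere.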
Monitoring Stack
Comprehensive monitoring is essential for production operations:
Prometheus ServiceMonitor
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: clickhouse-monitor
  namespace: analytics
spec:
  selector:
    matchLabels:
      app: clickhouse
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Key Metrics to Monitor
- ClickHouse: `ClickHouseProfileEvents_Query`, `ClickHouseMetrics_ReplicasMaxQueueSize`, `ClickHouseAsyncMetrics_ReplicasSumQueueSize`
- Kafka: `kafka_consumer_lag`, `kafka_messages_in_per_sec`
- Kubernetes: Pod restarts, resource utilization, PVC usage
- Application: Request latency, error rates, ingestion throughput
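These metrics can drive alerts as well as dashboards. A minimal PrometheusRule sketch, assuming the Prometheus Operator is installed and scraping ClickHouse's built-in metrics endpoint (the threshold is illustrative):

```yaml
# prometheusrule.yaml (illustrative threshold; tune for your cluster)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: clickhouse-alerts
  namespace: analytics
spec:
  groups:
    - name: clickhouse
      rules:
        - alert: ClickHouseReplicationQueueBacklog
          expr: ClickHouseMetrics_ReplicasMaxQueueSize > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Replication queue is backing up on {{ $labels.pod }}"
```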
Grafana Dashboards
Deploy pre-built dashboards for visibility. The Altinity Operator includes Prometheus alerting rules and Grafana dashboards:
```yaml
# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  clickhouse-overview.json: |
    {
      "title": "ClickHouse Overview",
      "panels": [...]
    }
```
Production Best Practices
Pod Disruption Budgets
Ensure availability during maintenance:
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
  namespace: analytics
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: clickhouse
```
Pod Anti-Affinity
Spread replicas across nodes and zones:
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: clickhouse
        topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: clickhouse
          topologyKey: topology.kubernetes.io/zone
```
Priority Classes
Ensure critical components get resources first:
```yaml
# priority-class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: analytics-critical
value: 1000000
globalDefault: false
description: "Priority class for critical analytics components"
```
Security Considerations
Secrets Management
Use external secrets operators for sensitive data:
```yaml
# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: clickhouse-credentials
  namespace: analytics
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: clickhouse-credentials
  data:
    - secretKey: admin-password
      remoteRef:
        key: analytics/clickhouse
        property: admin-password
```
Pod Security Standards
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 101
  fsGroup: 101
  seccompProfile:
    type: RuntimeDefault

containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
```
Upgrade Strategies
Rolling Updates
Configure safe rolling updates:
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
Blue-Green Deployments
For major version upgrades, consider blue-green deployments:
- Deploy new version alongside existing
- Run both versions with traffic splitting
- Validate data consistency and performance
- Switch traffic to new version
- Keep old version for quick rollback
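One common way to implement the traffic switch is a Service whose selector includes a version label; steps 4 and 5 then reduce to patching the selector. A sketch (names and labels are illustrative):

```yaml
# Service initially routing to the "blue" Deployment; switch traffic by
# changing version: blue to version: green (e.g. with kubectl patch),
# and flip it back for a quick rollback
apiVersion: v1
kind: Service
metadata:
  name: analytics-web
  namespace: analytics
spec:
  selector:
    app: analytics-web
    version: blue
  ports:
    - port: 80
      targetPort: 8000
```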
ClickHouse Operator Upgrades
The Altinity Operator now uses SYSTEM SHUTDOWN instead of pod recreation when applying configuration changes, significantly speeding up updates on nodes with large volumes.
Troubleshooting Common Issues
Pod Evictions
If pods are being evicted:
- Check resource limits and requests
- Review node resource pressure with `kubectl describe node`
- Consider using priority classes
- Check for memory leaks in long-running queries
Storage Performance
If queries are slow:
- Verify storage class IOPS limits
- Check for throttling on cloud provider
- Monitor disk utilization metrics
- Consider using local NVMe storage for ClickHouse
Network Timeouts
For connection issues:
- Verify network policies allow required traffic
- Check service mesh sidecar logs
- Review DNS resolution with CoreDNS
- Ensure ClickHouse Keeper/ZooKeeper is healthy
ClickHouse CrashLoopBackOff
If ClickHouse pods fail to start:
- Check configuration templates for syntax errors
- Verify backward compatibility when upgrading versions
- Change the entrypoint to debug: modify the pod to run `sleep infinity` so you can exec into the container
- Review Altinity's troubleshooting guide for specific error messages
Cost Optimization
Right-sizing Resources
- Use Vertical Pod Autoscaler (VPA) in recommendation mode to identify optimal resource requests
- Implement KEDA scale-to-zero for development environments
- Use spot/preemptible instances for non-critical workloads
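Running the VPA in recommendation mode looks like the following sketch, assuming the VPA components are installed in the cluster; `updateMode: "Off"` records recommendations without evicting pods:

```yaml
# vpa-recommend.yaml (assumes VPA is installed; target name is illustrative)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: analytics-web-vpa
  namespace: analytics
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-web
  updatePolicy:
    updateMode: "Off"   # recommend only; do not evict pods
```

Inspect the recommendations with `kubectl describe vpa analytics-web-vpa` and fold them back into your resource requests manually.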
Storage Tiering
ClickHouse supports tiered storage — use cold storage for historical data:
```xml
<storage_configuration>
  <disks>
    <hot>
      <path>/var/lib/clickhouse/hot/</path>
    </hot>
    <cold>
      <type>s3</type>
      <endpoint>https://s3.amazonaws.com/bucket/</endpoint>
    </cold>
  </disks>
  <policies>
    <tiered>
      <volumes>
        <hot><disk>hot</disk></hot>
        <cold><disk>cold</disk></cold>
      </volumes>
      <move_factor>0.1</move_factor>
    </tiered>
  </policies>
</storage_configuration>
```
Next Steps
Successfully running analytics on Kubernetes requires ongoing attention:
- Start with the official Helm charts and customize incrementally
- Implement comprehensive monitoring before going to production
- Document runbooks for common operational tasks
- Practice disaster recovery procedures regularly
- Stay updated with upstream releases and security patches
- Consider managed services (Altinity.Cloud, ClickHouse Cloud) if operational burden is too high
Additional Resources
- Altinity Kubernetes Operator Documentation
- ClickHouse Official Documentation
- KEDA Documentation
- Matomo Self-Hosting Guide
- Plausible Self-Hosting Documentation
Kubernetes provides a powerful foundation for scalable analytics infrastructure, but it requires investment in operational expertise to realize its full potential.