Self-hosting your analytics infrastructure gives you complete control over your data, eliminates vendor lock-in, and can significantly reduce costs at scale. However, designing a robust self-hosted architecture requires careful planning across multiple dimensions: high availability, scalability, security, and operational efficiency.
This guide covers the architectural patterns and best practices for deploying ClickHouse, Matomo, and related analytics components in your own infrastructure.
Architecture Overview
A production-ready self-hosted analytics stack typically consists of several interconnected layers:
- Ingestion Layer: Handles incoming events from SDKs and APIs
- Processing Layer: Transforms, enriches, and validates event data
- Storage Layer: Persists data for querying and long-term retention
- Query Layer: Serves dashboards, reports, and ad-hoc analysis
- Application Layer: The analytics UI and API endpoints
High Availability Patterns
For production workloads, single points of failure are unacceptable. Here's how to design for high availability:
Multi-Node Deployments
Run multiple instances of each component behind load balancers:
# Example: HAProxy configuration for analytics frontend
frontend analytics_frontend
    bind *:443 ssl crt /etc/ssl/analytics.pem
    default_backend analytics_servers

backend analytics_servers
    balance roundrobin
    option httpchk GET /health
    server analytics1 10.0.1.10:8000 check
    server analytics2 10.0.1.11:8000 check
    server analytics3 10.0.1.12:8000 check
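The httpchk directive above polls a /health endpoint on each backend, so each application server needs to expose one. As a minimal sketch (the path and port mirror the HAProxy config; a real readiness check would verify database connectivity, queue depth, and so on rather than always returning 200):

```python
# Minimal /health endpoint for HAProxy's httpchk to poll (sketch only;
# a production check should verify downstream dependencies).
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the frequent health polls out of access logs

# To serve on the port HAProxy expects:
# HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```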
Database Replication
Configure PostgreSQL with streaming replication for the application database:
- Primary: Handles all write operations
- Standby(s): Synchronous or asynchronous replicas for reads and failover
- Automated failover: Use Patroni (recommended for complex deployments), repmgr, or pg_auto_failover (simplest option)
Patroni is the most fully-featured option, leveraging distributed configuration stores like etcd for consensus. For simpler setups, pg_auto_failover provides an easier path with its monitor-based architecture.
ClickHouse Cluster Design
ClickHouse is the powerhouse behind high-performance analytics queries. For high availability, use ClickHouse Keeper (recommended) instead of ZooKeeper for cluster coordination. ClickHouse Keeper provides better reliability, uses fewer resources, and is purpose-built for ClickHouse:
<!-- clickhouse-config.xml -->
<remote_servers>
<analytics_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-01</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-02</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-03</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-04</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-05</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-06</host>
<port>9000</port>
</replica>
</shard>
</analytics_cluster>
</remote_servers>
<!-- ClickHouse Keeper configuration (on dedicated keeper nodes) -->
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>1</server_id>
<raft_configuration>
<server>
<id>1</id>
<hostname>keeper-01</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>keeper-02</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>keeper-03</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
Key considerations for ClickHouse clusters:
- Sharding: Distribute data across shards for horizontal write scaling
- Replication: Each shard should have 3 replicas for high availability (allows one failure while maintaining recovery capacity)
- ClickHouse Keeper: Deploy as a 3 or 5 node ensemble on dedicated hosts; Keeper is sensitive to disk I/O latency, so use SSDs
- Availability Zones: Distribute replicas across different availability zones with less than 20ms round-trip latency
- Table Engines: Always use ReplicatedMergeTree or similar replicated engines for data you want to persist
Matomo Architecture
Matomo remains fully supported for self-hosted production deployments and can handle over 1 billion pageviews per month with proper architecture.
Standard Deployment
# Matomo components
- PHP Application (multiple pods behind LB)
- MySQL/MariaDB (primary + replica)
- Redis (session storage, caching)
- Archiving Cron (dedicated worker)
High-Volume Configuration
For sites processing millions of pageviews, Matomo recommends a tiered architecture:
- Single server: Suitable for smaller sites with adequate hardware (4 CPU, 8GB RAM, 250GB SSD minimum)
- Multi-server setup: 2+ Matomo servers with shared storage and a single database
- Enterprise architecture: Separate tracking servers from reporting/API servers, dedicated archiving workers, database cluster with failover
Essential plugins and configurations for high traffic:
- QueuedTracking plugin: Buffer tracking requests in Redis before processing to handle traffic spikes
- Redis for sessions: Configure Redis for session handling in load-balanced environments
- Disable browser archiving: Set enable_browser_archiving_triggering = 0 and use cron-based archiving
- Tune archiving interval: Increase time_before_today_archive_considered_outdated (e.g., 3600 seconds for hourly processing)
Networking Architecture
Proper network design is critical for security and performance:
Network Segmentation
# Example network topology
Public Subnet (DMZ):
- Load Balancers
- CDN Edge
- WAF/DDoS Protection
Private Subnet (Application):
- Matomo PHP
- API Gateways
- Application Servers
Private Subnet (Data):
- ClickHouse Cluster
- PostgreSQL
- Redis Cluster
- Kafka Cluster (if used)
Management Subnet:
- ClickHouse Keeper nodes
- Monitoring (Prometheus, Grafana)
- Bastion hosts
Security Considerations
- TLS everywhere: Encrypt all traffic, including internal services; use TLS 1.3 where possible
- Network policies: Restrict traffic between components using security groups or Kubernetes network policies
- Private endpoints: Keep databases off public networks; use VPC endpoints for cloud services
- VPN/Bastion: Secure administrative access through jump hosts or VPN
- Secrets management: Use HashiCorp Vault, AWS Secrets Manager, or similar for credential management
Ingestion Endpoints
For global deployments, consider edge ingestion:
- Deploy lightweight collectors in multiple regions
- Use Kafka or a similar durable log for reliable cross-region transport; treat cross-region delivery as at-least-once and make downstream processing idempotent
- Centralize processing in your primary datacenter
- Consider CDN-level data collection for reduced latency
Storage Architecture
Analytics workloads are storage-intensive. Plan accordingly:
ClickHouse Storage
- NVMe SSDs: Essential for query performance on hot data
- Tiered storage: Hot data on SSD, cold data on HDD or object storage (S3/GCS)
- Compression: LZ4 for speed, ZSTD for better compression ratios on cold data
Configure tiered storage policies in XML configuration (not SQL):
<!-- /etc/clickhouse-server/config.d/storage.xml -->
<clickhouse>
<storage_configuration>
<disks>
<nvme_disk>
<path>/var/lib/clickhouse/hot/</path>
</nvme_disk>
<s3_disk>
<type>s3</type>
<endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
<access_key_id>YOUR_ACCESS_KEY</access_key_id>
<secret_access_key>YOUR_SECRET_KEY</secret_access_key>
<metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
</s3_disk>
</disks>
<policies>
<tiered_policy>
<volumes>
<hot>
<disk>nvme_disk</disk>
<move_factor>0.1</move_factor>
</hot>
<cold>
<disk>s3_disk</disk>
<prefer_not_to_merge>true</prefer_not_to_merge>
</cold>
</volumes>
</tiered_policy>
</policies>
</storage_configuration>
</clickhouse>
Apply storage policy to tables using TTL rules:
CREATE TABLE events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type String,
properties String
) ENGINE = ReplicatedMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, event_date, user_id)
TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered_policy';
Capacity Planning
Estimate storage needs based on event volume:
- Raw events: ~0.5-1KB per event (compressed with LZ4)
- Session recordings: ~100KB-1MB per session (if applicable)
- Retention: Plan for 1-3 years of queryable data
- Growth buffer: Add 30-50% overhead for merges and temporary data
Example calculation:
Monthly events: 100M
Storage per event: 0.75KB (compressed average)
Monthly growth: 75GB
3-year retention: 2.7TB
With 40% buffer: ~3.8TB hot storage
Cold storage (S3): Plan for roughly 5x that volume for historical data
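The arithmetic above is simple enough to wrap in a small sizing helper. The inputs (100M events/month, 0.75KB per compressed event, 40% buffer) are the same illustrative assumptions as the worked example, not measured figures:

```python
# Rough hot-storage estimate, mirroring the worked example above.
# All inputs are illustrative assumptions; substitute your own measurements.

def estimate_hot_storage_tb(monthly_events: int,
                            bytes_per_event: float,
                            retention_months: int,
                            buffer_factor: float = 1.4) -> float:
    """Estimated hot-storage need in TB (1 TB = 1e12 bytes here)."""
    raw_bytes = monthly_events * bytes_per_event * retention_months
    return raw_bytes * buffer_factor / 1e12

# 100M events/month at 0.75KB each, 3-year retention, 40% merge/temp buffer
tb = estimate_hot_storage_tb(100_000_000, 750, 36, 1.4)
print(f"{tb:.1f} TB")  # ~3.8 TB, matching the worked example
```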
Backup Strategies
Data loss is catastrophic. Implement comprehensive backup strategies:
ClickHouse Backups
ClickHouse provides native backup commands that support local disk, S3, and Azure Blob Storage:
-- Full database backup to S3
BACKUP DATABASE analytics TO S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
-- Incremental backup (reference previous backup)
BACKUP DATABASE analytics TO S3(
'https://s3.amazonaws.com/your-bucket/backups/incr_2025_01_23',
'ACCESS_KEY',
'SECRET_KEY'
) SETTINGS base_backup = S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
-- Restore from backup
RESTORE DATABASE analytics FROM S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
For more complex backup workflows, the clickhouse-backup tool by Altinity provides additional features like parallel uploads, compression options, and integration with various storage backends.
Database Backups
- PostgreSQL: pg_dump for logical backups, pg_basebackup for physical; consider pgBackRest for production
- MySQL/MariaDB: mysqldump or Percona XtraBackup for hot backups
Backup Schedule
# Recommended backup schedule
- Full backup: Weekly (Sunday 02:00 UTC)
- Incremental: Daily (02:00 UTC)
- Transaction logs/WAL: Continuous streaming
- Retention: 30 days for daily + monthly archives for 1 year
- Off-site replication: Immediately after backup completion
Backup Validation
Untested backups are not backups:
- Automated restore tests: Weekly restore to staging environment
- Integrity checks: Verify backup checksums and data consistency
- RTO/RPO validation: Document and test that recovery meets business requirements
- Runbook testing: Ensure operations team can execute recovery procedures
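An automated integrity check can be as simple as recomputing checksums against a manifest. This is a minimal sketch using a hypothetical manifest format (one `sha256  filename` pair per line), not the output of any particular backup tool:

```python
# Verify backup files against a checksum manifest.
# The manifest format ("<sha256>  <filename>" per line) is a hypothetical
# example, not tied to any specific backup tool.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(backup_dir: Path, manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose checksum differs."""
    failures = []
    for line in manifest.read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        target = backup_dir / name
        if not target.exists() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```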
Disaster Recovery
Plan for complete infrastructure failure:
- Cross-region replication: Replicate critical data to secondary region
- Infrastructure as Code: Terraform, Pulumi, or OpenTofu for rapid re-deployment
- Runbooks: Documented procedures for recovery scenarios with clear ownership
- Regular DR drills: Test recovery procedures quarterly; document results and improvements
Monitoring and Observability
You can't manage what you can't measure:
Recommended Monitoring Stack
- Metrics: Prometheus + Grafana (or Victoria Metrics for scale)
- Logs: Loki, Elasticsearch, or ClickHouse itself
- Tracing: Jaeger or Tempo for distributed tracing
- Alerting: Alertmanager with PagerDuty/Opsgenie integration
Key Metrics
- Ingestion: Events per second, latency percentiles (p50/p95/p99), error rates, queue depth
- Storage: Disk usage by volume, replication lag, merge queue size, parts count
- Queries: Query latency p50/p95/p99, concurrent queries, failed queries, slow query log
- System: CPU utilization, memory usage, network throughput, disk I/O latency
- Cluster health: Node availability, Keeper/ZooKeeper session status, replication status
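The p50/p95/p99 figures above are normally computed by your metrics backend (e.g., Prometheus histograms), but purely to make the notion concrete, a sketch of deriving them from raw samples:

```python
# Derive latency percentiles from raw samples (illustration only; in
# production this is done by the metrics backend, not application code).
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = [12.0, 15.0, 14.0, 250.0, 13.0, 16.0, 11.0, 900.0, 14.5, 13.5]
p = latency_percentiles(samples)
# p50 sits near the typical request; p95/p99 expose the tail outliers
```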
Alerting Thresholds
# Example alerting rules
- Ingestion latency p95 > 5s for 5 minutes: Warning
- Ingestion latency p95 > 30s for 5 minutes: Critical
- Replication lag > 30s: Warning, > 5 minutes: Critical
- Disk usage > 70%: Warning, > 85%: Critical
- Query p99 > 30s: Warning
- Keeper quorum lost: Critical
- Any replica in readonly mode: Critical
- Failed backup: Critical
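Rules like these live in Alertmanager/Prometheus rule files in practice; purely to illustrate the two-tier warning/critical pattern, a sketch in Python using the thresholds from this guide (illustrative values, not universal defaults):

```python
# Map metric readings to severities, mirroring the two-tier thresholds above.
# Threshold values are the illustrative ones from this guide.

def disk_usage_severity(used_fraction: float) -> str:
    if used_fraction > 0.85:
        return "critical"
    if used_fraction > 0.70:
        return "warning"
    return "ok"

def replication_lag_severity(lag_seconds: float) -> str:
    if lag_seconds > 300:   # > 5 minutes
        return "critical"
    if lag_seconds > 30:
        return "warning"
    return "ok"
```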
Scaling Considerations
Design for growth from the start:
Horizontal Scaling
- Stateless services: Add replicas behind load balancer; use Kubernetes HPA for auto-scaling
- ClickHouse: Add shards for write scaling, replicas for read scaling; plan sharding strategy early as resharding is complex
- Kafka: Add partitions for parallelism, brokers for throughput
Vertical Scaling
- ClickHouse: Benefits significantly from more RAM (for caches) and faster storage; tune max_server_memory_usage appropriately
- PostgreSQL: More RAM for shared_buffers (typically 25% of RAM) and work_mem
Sharding Strategy
Choose a sharding key that distributes data evenly. Avoid keys that create hotspots (e.g., timestamps, sequential IDs). Good candidates include user_id hashes or composite keys. Apply schema changes across all shards using the ON CLUSTER clause.
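Hashing user_id, as suggested above, yields even distribution and deterministic routing. A minimal sketch of the idea (the shard count and hash choice are illustrative; inside ClickHouse itself you would normally let a Distributed table apply a sharding expression such as intHash64(user_id)):

```python
# Route events to shards by hashing user_id. A cryptographic hash is used
# because Python's built-in hash() is randomized per process for strings.
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same user always lands on the same shard, keeping each user's
# events co-located for per-user queries.
assert shard_for(42, 4) == shard_for(42, 4)
```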
Cost Optimization
Self-hosting should be cost-effective:
- Right-size instances: Monitor actual usage and adjust; avoid over-provisioning
- Reserved capacity: Commit to 1-3 year terms for 30-60% savings on compute
- Spot/Preemptible instances: Use for non-critical batch processing and development
- Data lifecycle: Implement automated TTL policies; archive or delete data that's no longer needed
- Storage tiering: Move cold data to cheaper object storage
- Compression: Use ZSTD for cold data to reduce storage costs
Kubernetes Deployment Considerations
If deploying on Kubernetes:
- Operators: Use the Altinity Operator for ClickHouse for simplified cluster management
- StatefulSets: Required for stateful components (ClickHouse, PostgreSQL, Kafka)
- Persistent Volumes: Use high-performance storage classes; avoid network-attached storage for ClickHouse if possible
- Pod Disruption Budgets: Ensure availability during node maintenance
- Resource requests/limits: Set appropriately to prevent noisy neighbor issues
- Anti-affinity rules: Distribute replicas across nodes and availability zones
Next Steps
Building a self-hosted analytics infrastructure is a significant undertaking. Start with:
- Document requirements: Volume projections, retention policies, availability SLAs, compliance requirements
- Choose deployment platform: VMs, Kubernetes, or hybrid based on team expertise
- Start small and iterate: Begin with a minimal viable deployment; avoid premature optimization
- Invest in automation: Infrastructure as Code from day one; GitOps for configuration management
- Plan for operations: Monitoring dashboards, alerting, backup verification, incident response procedures
- Build expertise: Invest in training; consider vendor support contracts for critical components
A well-designed self-hosted architecture provides the foundation for analytics that scale with your business while keeping your data under your control. Remember that the operational burden is significant—carefully weigh the benefits of self-hosting against managed services for your specific use case.