Self-hosting your analytics infrastructure gives you complete control over your data, eliminates vendor lock-in, and can significantly reduce costs at scale. However, designing a robust self-hosted architecture requires careful planning across multiple dimensions: high availability, scalability, security, and operational efficiency.
This guide covers the architectural patterns and best practices for deploying ClickHouse, Matomo, and related analytics components in your own infrastructure.
Architecture Overview
A production-ready self-hosted analytics stack typically consists of several interconnected layers:
- Ingestion Layer: Handles incoming events from SDKs and APIs
- Processing Layer: Transforms, enriches, and validates event data
- Storage Layer: Persists data for querying and long-term retention
- Query Layer: Serves dashboards, reports, and ad-hoc analysis
- Application Layer: The analytics UI and API endpoints
High Availability Patterns
For production workloads, single points of failure are unacceptable. Here's how to design for high availability:
Multi-Node Deployments
Run multiple instances of each component behind load balancers:
# Example: HAProxy configuration for analytics frontend
frontend analytics_frontend
    bind *:443 ssl crt /etc/ssl/analytics.pem
    default_backend analytics_servers

backend analytics_servers
    balance roundrobin
    option httpchk GET /health
    server analytics1 10.0.1.10:8000 check
    server analytics2 10.0.1.11:8000 check
    server analytics3 10.0.1.12:8000 check
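The httpchk directive above polls a /health endpoint on each backend, so each application server needs to expose one. As a minimal sketch (the path and port mirror the HAProxy config; a real readiness check would verify database connectivity, queue depth, and so on rather than always returning 200):

```python
# Minimal /health endpoint for HAProxy's httpchk to poll (sketch only;
# a production check should verify downstream dependencies).
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the frequent health polls out of access logs

# To serve on the port HAProxy expects:
# HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```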
Database Replication
Configure PostgreSQL with streaming replication for the application database:
- Primary: Handles all write operations
- Standby(s): Synchronous or asynchronous replicas for reads and failover
- Automated failover: Use Patroni (recommended for complex deployments), repmgr, or pg_auto_failover (simplest option)
Patroni is the most fully-featured option, leveraging distributed configuration stores like etcd for consensus. For simpler setups, pg_auto_failover provides an easier path with its monitor-based architecture.
ClickHouse Cluster Design
ClickHouse is the powerhouse behind high-performance analytics queries. For high availability, use ClickHouse Keeper (recommended) instead of ZooKeeper for cluster coordination. ClickHouse Keeper provides better reliability, uses fewer resources, and is purpose-built for ClickHouse:
<!-- clickhouse-config.xml -->
<remote_servers>
<analytics_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-01</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-02</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-03</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-04</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-05</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-06</host>
<port>9000</port>
</replica>
</shard>
</analytics_cluster>
</remote_servers>
<!-- ClickHouse Keeper configuration (on dedicated keeper nodes) -->
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>1</server_id>
<raft_configuration>
<server>
<id>1</id>
<hostname>keeper-01</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>keeper-02</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>keeper-03</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
Key considerations for ClickHouse clusters:
- Sharding: Distribute data across shards for horizontal write scaling
- Replication: Each shard should have 3 replicas for high availability (allows one failure while maintaining recovery capacity)
- ClickHouse Keeper: Deploy as a 3 or 5 node ensemble on dedicated hosts; Keeper is sensitive to disk I/O latency, so use SSDs
- Availability Zones: Distribute replicas across different availability zones with less than 20ms round-trip latency
- Table Engines: Always use ReplicatedMergeTree or similar replicated engines for data you want to persist
Matomo Architecture
Matomo remains fully supported for self-hosted production deployments and can handle over 1 billion pageviews per month with proper architecture.
Standard Deployment
# Matomo components
- PHP Application (multiple pods behind LB)
- MySQL/MariaDB (primary + replica)
- Redis (session storage, caching)
- Archiving Cron (dedicated worker)
High-Volume Configuration
For sites processing millions of pageviews, Matomo recommends a tiered architecture:
- Single server: Suitable for smaller sites with adequate hardware (4 CPU, 8GB RAM, 250GB SSD minimum)
- Multi-server setup: 2+ Matomo servers with shared storage and a single database
- Enterprise architecture: Separate tracking servers from reporting/API servers, dedicated archiving workers, database cluster with failover
Essential plugins and configurations for high traffic:
- QueuedTracking plugin: Buffer tracking requests in Redis before processing to handle traffic spikes
- Redis for sessions: Configure Redis for session handling in load-balanced environments
- Disable browser archiving: Set enable_browser_archiving_triggering = 0 and use cron-based archiving
- Tune archiving interval: Increase time_before_today_archive_considered_outdated (e.g., 3600 seconds for hourly processing)
Networking Architecture
Proper network design is critical for security and performance:
Network Segmentation
# Example network topology
Public Subnet (DMZ):
- Load Balancers
- CDN Edge
- WAF/DDoS Protection
Private Subnet (Application):
- Matomo PHP
- API Gateways
- Application Servers
Private Subnet (Data):
- ClickHouse Cluster
- PostgreSQL
- Redis Cluster
- Kafka Cluster (if used)
Management Subnet:
- ClickHouse Keeper nodes
- Monitoring (Prometheus, Grafana)
- Bastion hosts
Security Considerations
- TLS everywhere: Encrypt all traffic, including internal services; use TLS 1.3 where possible
- Network policies: Restrict traffic between components using security groups or Kubernetes network policies
- Private endpoints: Keep databases off public networks; use VPC endpoints for cloud services
- VPN/Bastion: Secure administrative access through jump hosts or VPN
- Secrets management: Use HashiCorp Vault, AWS Secrets Manager, or similar for credential management
Ingestion Endpoints
For global deployments, consider edge ingestion:
- Deploy lightweight collectors in multiple regions
- Use Kafka or a similar durable log for reliable cross-region transport; treat cross-region delivery as at-least-once and make downstream processing idempotent
- Centralize processing in your primary datacenter
- Consider CDN-level data collection for reduced latency
Storage Architecture
Analytics workloads are storage-intensive. Plan accordingly:
ClickHouse Storage
- NVMe SSDs: Essential for query performance on hot data
- Tiered storage: Hot data on SSD, cold data on HDD or object storage (S3/GCS)
- Compression: LZ4 for speed, ZSTD for better compression ratios on cold data
Configure tiered storage policies in XML configuration (not SQL):
<!-- /etc/clickhouse-server/config.d/storage.xml -->
<clickhouse>
<storage_configuration>
<disks>
<nvme_disk>
<path>/var/lib/clickhouse/hot/</path>
</nvme_disk>
<s3_disk>
<type>s3</type>
<endpoint>https://s3.amazonaws.com/your-bucket/clickhouse/</endpoint>
<access_key_id>YOUR_ACCESS_KEY</access_key_id>
<secret_access_key>YOUR_SECRET_KEY</secret_access_key>
<metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
</s3_disk>
</disks>
<policies>
<tiered_policy>
<volumes>
<hot>
<disk>nvme_disk</disk>
<move_factor>0.1</move_factor>
</hot>
<cold>
<disk>s3_disk</disk>
<prefer_not_to_merge>true</prefer_not_to_merge>
</cold>
</volumes>
</tiered_policy>
</policies>
</storage_configuration>
</clickhouse>
Apply storage policy to tables using TTL rules:
CREATE TABLE events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type String,
properties String
) ENGINE = ReplicatedMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, event_date, user_id)
TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered_policy';
Capacity Planning
Estimate storage needs based on event volume:
- Raw events: ~0.5-1KB per event (compressed with LZ4)
- Session recordings: ~100KB-1MB per session (if applicable)
- Retention: Plan for 1-3 years of queryable data
- Growth buffer: Add 30-50% overhead for merges and temporary data
Example calculation:
Monthly events: 100M
Storage per event: 0.75KB (compressed average)
Monthly growth: 75GB
3-year retention: 2.7TB
With 40% buffer: ~3.8TB hot storage
Cold storage (S3): Plan for roughly 5x that volume for historical data
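The arithmetic above is simple enough to wrap in a small sizing helper. The inputs (100M events/month, 0.75KB per compressed event, 40% buffer) are the same illustrative assumptions as the worked example, not measured figures:

```python
# Rough hot-storage estimate, mirroring the worked example above.
# All inputs are illustrative assumptions; substitute your own measurements.

def estimate_hot_storage_tb(monthly_events: int,
                            bytes_per_event: float,
                            retention_months: int,
                            buffer_factor: float = 1.4) -> float:
    """Estimated hot-storage need in TB (1 TB = 1e12 bytes here)."""
    raw_bytes = monthly_events * bytes_per_event * retention_months
    return raw_bytes * buffer_factor / 1e12

# 100M events/month at 0.75KB each, 3-year retention, 40% merge/temp buffer
tb = estimate_hot_storage_tb(100_000_000, 750, 36, 1.4)
print(f"{tb:.1f} TB")  # ~3.8 TB, matching the worked example
```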
Backup Strategies
Data loss is catastrophic. Implement comprehensive backup strategies:
ClickHouse Backups
ClickHouse provides native backup commands that support local disk, S3, and Azure Blob Storage:
-- Full database backup to S3
BACKUP DATABASE analytics TO S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
-- Incremental backup (reference previous backup)
BACKUP DATABASE analytics TO S3(
'https://s3.amazonaws.com/your-bucket/backups/incr_2025_01_23',
'ACCESS_KEY',
'SECRET_KEY'
) SETTINGS base_backup = S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
-- Restore from backup
RESTORE DATABASE analytics FROM S3(
'https://s3.amazonaws.com/your-bucket/backups/full_2025_01_22',
'ACCESS_KEY',
'SECRET_KEY'
);
For more complex backup workflows, the clickhouse-backup tool by Altinity provides additional features like parallel uploads, compression options, and integration with various storage backends.
Database Backups
- PostgreSQL: pg_dump for logical backups, pg_basebackup for physical; consider pgBackRest for production
- MySQL/MariaDB: mysqldump or Percona XtraBackup for hot backups
Backup Schedule
# Recommended backup schedule
- Full backup: Weekly (Sunday 02:00 UTC)
- Incremental: Daily (02:00 UTC)
- Transaction logs/WAL: Continuous streaming
- Retention: 30 days for daily + monthly archives for 1 year
- Off-site replication: Immediately after backup completion
Backup Validation
Untested backups are not backups:
- Automated restore tests: Weekly restore to staging environment
- Integrity checks: Verify backup checksums and data consistency
- RTO/RPO validation: Document and test that recovery meets business requirements
- Runbook testing: Ensure operations team can execute recovery procedures
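An automated integrity check can be as simple as recomputing checksums against a manifest. This is a minimal sketch using a hypothetical manifest format (one `sha256  filename` pair per line), not the output of any particular backup tool:

```python
# Verify backup files against a checksum manifest.
# The manifest format ("<sha256>  <filename>" per line) is a hypothetical
# example, not tied to any specific backup tool.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(backup_dir: Path, manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose checksum differs."""
    failures = []
    for line in manifest.read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        target = backup_dir / name
        if not target.exists() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```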
Disaster Recovery
Plan for complete infrastructure failure:
- Cross-region replication: Replicate critical data to secondary region
- Infrastructure as Code: Terraform, Pulumi, or OpenTofu for rapid re-deployment
- Runbooks: Documented procedures for recovery scenarios with clear ownership
- Regular DR drills: Test recovery procedures quarterly; document results and improvements
Monitoring and Observability
You can't manage what you can't measure:
Recommended Monitoring Stack
- Metrics: Prometheus + Grafana (or Victoria Metrics for scale)
- Logs: Loki, Elasticsearch, or ClickHouse itself
- Tracing: Jaeger or Tempo for distributed tracing
- Alerting: Alertmanager with PagerDuty/Opsgenie integration
Key Metrics
- Ingestion: Events per second, latency percentiles (p50/p95/p99), error rates, queue depth
- Storage: Disk usage by volume, replication lag, merge queue size, parts count
- Queries: Query latency p50/p95/p99, concurrent queries, failed queries, slow query log
- System: CPU utilization, memory usage, network throughput, disk I/O latency
- Cluster health: Node availability, Keeper/ZooKeeper session status, replication status
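The p50/p95/p99 figures above are normally computed by your metrics backend (e.g., Prometheus histograms), but purely to make the notion concrete, a sketch of deriving them from raw samples:

```python
# Derive latency percentiles from raw samples (illustration only; in
# production this is done by the metrics backend, not application code).
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = [12.0, 15.0, 14.0, 250.0, 13.0, 16.0, 11.0, 900.0, 14.5, 13.5]
p = latency_percentiles(samples)
# p50 sits near the typical request; p95/p99 expose the tail outliers
```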
Alerting Thresholds
# Example alerting rules
- Ingestion latency p95 > 5s for 5 minutes: Warning
- Ingestion latency p95 > 30s for 5 minutes: Critical
- Replication lag > 30s: Warning, > 5 minutes: Critical
- Disk usage > 70%: Warning, > 85%: Critical
- Query p99 > 30s: Warning
- Keeper quorum lost: Critical
- Any replica in readonly mode: Critical
- Failed backup: Critical
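Rules like these live in Alertmanager/Prometheus rule files in practice; purely to illustrate the two-tier warning/critical pattern, a sketch in Python using the thresholds from this guide (illustrative values, not universal defaults):

```python
# Map metric readings to severities, mirroring the two-tier thresholds above.
# Threshold values are the illustrative ones from this guide.

def disk_usage_severity(used_fraction: float) -> str:
    if used_fraction > 0.85:
        return "critical"
    if used_fraction > 0.70:
        return "warning"
    return "ok"

def replication_lag_severity(lag_seconds: float) -> str:
    if lag_seconds > 300:   # > 5 minutes
        return "critical"
    if lag_seconds > 30:
        return "warning"
    return "ok"
```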
Scaling Considerations
Design for growth from the start:
Horizontal Scaling
- Stateless services: Add replicas behind load balancer; use Kubernetes HPA for auto-scaling
- ClickHouse: Add shards for write scaling, replicas for read scaling; plan sharding strategy early as resharding is complex
- Kafka: Add partitions for parallelism, brokers for throughput
Vertical Scaling
- ClickHouse: Benefits significantly from more RAM (for caches) and faster storage; tune max_server_memory_usage appropriately
- PostgreSQL: More RAM for shared_buffers (typically 25% of RAM) and work_mem
Sharding Strategy
Choose a sharding key that distributes data evenly. Avoid keys that create hotspots (e.g., timestamps, sequential IDs). Good candidates include user_id hashes or composite keys. Apply schema changes across all shards using the ON CLUSTER clause.
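Hashing user_id, as suggested above, yields even distribution and deterministic routing. A minimal sketch of the idea (the shard count and hash choice are illustrative; inside ClickHouse itself you would normally let a Distributed table apply a sharding expression such as intHash64(user_id)):

```python
# Route events to shards by hashing user_id. A cryptographic hash is used
# because Python's built-in hash() is randomized per process for strings.
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same user always lands on the same shard, keeping each user's
# events co-located for per-user queries.
assert shard_for(42, 4) == shard_for(42, 4)
```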
Cost Optimization
Self-hosting should be cost-effective:
- Right-size instances: Monitor actual usage and adjust; avoid over-provisioning
- Reserved capacity: Commit to 1-3 year terms for 30-60% savings on compute
- Spot/Preemptible instances: Use for non-critical batch processing and development
- Data lifecycle: Implement automated TTL policies; archive or delete data that's no longer needed
- Storage tiering: Move cold data to cheaper object storage
- Compression: Use ZSTD for cold data to reduce storage costs
Kubernetes Deployment Considerations
If deploying on Kubernetes:
- Operators: Use the Altinity Operator for ClickHouse for simplified cluster management
- StatefulSets: Required for stateful components (ClickHouse, PostgreSQL, Kafka)
- Persistent Volumes: Use high-performance storage classes; avoid network-attached storage for ClickHouse if possible
- Pod Disruption Budgets: Ensure availability during node maintenance
- Resource requests/limits: Set appropriately to prevent noisy neighbor issues
- Anti-affinity rules: Distribute replicas across nodes and availability zones
Next Steps
Building a self-hosted analytics infrastructure is a significant undertaking. Start with:
- Document requirements: Volume projections, retention policies, availability SLAs, compliance requirements
- Choose deployment platform: VMs, Kubernetes, or hybrid based on team expertise
- Start small and iterate: Begin with a minimal viable deployment; avoid premature optimization
- Invest in automation: Infrastructure as Code from day one; GitOps for configuration management
- Plan for operations: Monitoring dashboards, alerting, backup verification, incident response procedures
- Build expertise: Invest in training; consider vendor support contracts for critical components
A well-designed self-hosted architecture provides the foundation for analytics that scale with your business while keeping your data under your control. Remember that the operational burden is significant—carefully weigh the benefits of self-hosting against managed services for your specific use case.