ClickStack Architecture Documentation
Version: 1.0.0
Last Updated: 2025-11-28
Status: Production-Ready (Local Testing)
Maintainer: PNE Infrastructure Team
Table of Contents
- Overview
- Architecture Diagrams
- Component Specifications
- Configuration Reference
- Scaling Considerations
- Improvement Proposals
- Buffering & Resilience
- Operations Guide
- VPS Observability Stack
- Reference Configurations
Overview
Purpose
ClickStack is a local OpenTelemetry observability stack for testing and validating trace pipelines before deploying to production. The primary goals are:
- Trace Loss Verification: Ensure 0% trace loss in the pipeline (a verification sketch follows this list)
- Configuration Testing: Validate otelcol configs before production
- Integration Testing: Test nginx ngx_otel_module behavior
- Performance Benchmarking: Measure throughput and latency
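A minimal verification sketch for the zero-loss goal, assuming the helper scripts described under Configuration Reference, one span per generated trace, and the default unauthenticated ClickHouse HTTP user (the 5-minute window is illustrative):
# Send a known number of traces, then compare against what ClickHouse stored
SENT=100
./scripts/test-traces.sh "$SENT"
sleep 15   # allow the batch processor (10s timeout) to flush
STORED=$(curl -s 'http://localhost:18123/' --data-binary \
  "SELECT count() FROM otel.otel_traces WHERE Timestamp > now() - INTERVAL 5 MINUTE")
echo "sent=$SENT stored=$STORED"   # equal counts => zero loss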
Technology Stack
| Component | Version | Base Image | Purpose |
|---|---|---|---|
| nginx | 1.27.3 | AlmaLinux 9.3 | Trace generation (ngx_otel_module) |
| otelcol-contrib | 0.116.0 | AlmaLinux 9.3 | Telemetry routing |
| ClickHouse | 24.8 | Official | Trace/Log/Metric storage |
| Grafana | 11.2.0 | Official | Visualization |
Port Mapping
| Service | Internal | External | Protocol |
|---|---|---|---|
| nginx | 80 | 18080 | HTTP |
| otelcol gRPC | 4317 | 4317 | gRPC |
| otelcol HTTP | 4318 | 4318 | HTTP |
| otelcol Health | 13133 | 13133 | HTTP |
| otelcol Metrics | 8888 | 8888 | HTTP |
| ClickHouse HTTP | 8123 | 18123 | HTTP |
| ClickHouse Native | 9000 | 19000 | TCP |
| Grafana | 3000 | 13000 | HTTP |
Architecture Diagrams
High-Level Architecture
flowchart TB
subgraph Client["Client Layer"]
Browser["Browser/curl"]
TestGen["generate_traces.py"]
end
subgraph Ingress["Ingress Layer"]
nginx["nginx:18080<br/>ngx_otel_module"]
end
subgraph Collector["Telemetry Collector"]
otelcol["otelcol-contrib:4317/4318<br/>receivers → processors → exporters"]
end
subgraph Storage["Storage Layer"]
subgraph Local["Local Storage"]
CH["ClickHouse:18123<br/>otel.otel_traces<br/>otel.otel_logs<br/>otel.otel_metrics"]
FileExport["file/traces<br/>traces.json"]
end
subgraph External["External Storage"]
Uptrace["uptrace-2.pne.io<br/>Distributed Tracing"]
VLogs["logs.pnetest.biz:9428<br/>VictoriaLogs"]
end
end
subgraph Visualization["Visualization"]
Grafana["Grafana:13000<br/>ClickHouse Datasource"]
end
Browser -->|HTTP| nginx
TestGen -->|OTLP HTTP| otelcol
nginx -->|OTLP gRPC| otelcol
otelcol -->|traces| CH
otelcol -->|traces| FileExport
otelcol -->|traces| Uptrace
otelcol -->|logs| CH
otelcol -->|logs| VLogs
otelcol -->|metrics| CH
CH -->|SQL| Grafana
Data Flow Diagram
sequenceDiagram
participant C as Client
participant N as nginx
participant O as otelcol
participant CH as ClickHouse
participant U as Uptrace
participant F as File
C->>N: HTTP Request
activate N
N->>N: Generate Span (ngx_otel)
N->>C: HTTP Response
N->>O: OTLP gRPC (trace)
deactivate N
activate O
O->>O: memory_limiter
O->>O: attributes processor
O->>O: resource processor
O->>O: batch processor
par Fan-out Export
O->>CH: ClickHouse (primary)
O->>U: Uptrace (backup)
O->>F: File (verification)
end
deactivate O
Note over CH: TTL: 30 days
Note over F: Rotation: 100MB/1day
Component Interaction
graph LR
subgraph podman["podman-compose network: clickstack (172.31.0.0/16)"]
N[nginx<br/>172.31.0.x]
O[otelcol<br/>172.31.0.x]
CH[ClickHouse<br/>172.31.0.x]
G[Grafana<br/>172.31.0.x]
end
subgraph host["Host Network"]
H[host.containers.internal]
end
N -->|"otel_exporter<br/>host.containers.internal:4317"| H
H -->|"port forward"| O
O -->|"tcp://clickhouse:9000"| CH
G -->|"clickhouse:8123"| CH
style N fill:#f9f,stroke:#333
style O fill:#bbf,stroke:#333
style CH fill:#bfb,stroke:#333
style G fill:#fbb,stroke:#333
OTEL Collector Pipeline
flowchart LR
subgraph Receivers
OTLP_GRPC["otlp/grpc<br/>:4317"]
OTLP_HTTP["otlp/http<br/>:4318"]
end
subgraph Processors
ML["memory_limiter<br/>512MiB limit"]
ATTR["attributes<br/>upstream extraction"]
RES["resource<br/>env + namespace"]
BATCH["batch<br/>10s / 1024 items"]
end
subgraph Exporters
CH["clickhouse<br/>tcp://:9000"]
UP["otlphttp/uptrace<br/>https://uptrace-2.pne.io"]
VL["otlphttp/victorialogs<br/>http://logs.pnetest.biz"]
FILE["file/traces<br/>/var/lib/otelcol/traces"]
DBG["debug<br/>sampling: 5/200"]
end
OTLP_GRPC --> ML
OTLP_HTTP --> ML
ML --> ATTR
ATTR --> RES
RES --> BATCH
BATCH --> CH
BATCH --> UP
BATCH --> VL
BATCH --> FILE
Component Specifications
nginx with ngx_otel_module
Build: Multi-stage on AlmaLinux 9.3
Stage 1 (Builder):
- Compiles ngx_otel_module from source
- Dependencies: grpc-devel, protobuf-devel, abseil-cpp (EPEL + CRB)
Stage 2 (Runtime):
- nginx 1.27.3 mainline from nginx.org RPM
- Runtime: grpc, re2, c-ares, abseil-cpp
Configuration:
# otel.conf
otel_exporter {
endpoint host.containers.internal:4317; # podman rootless
}
otel_service_name "nginx-clickstack";
otel_trace on;
otel_trace_context propagate;
Key Insight: podman rootless requires host.containers.internal for inter-container communication.
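A quick way to confirm both halves of that claim from inside the container (the container name nginx and the presence of curl in the runtime image are assumptions):
# 1. The special hostname must resolve inside the nginx container
podman exec nginx getent hosts host.containers.internal
# 2. The collector must be reachable via the host (13133 is forwarded 1:1)
podman exec nginx curl -s -o /dev/null -w '%{http_code}\n' \
  http://host.containers.internal:13133/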
otelcol-contrib
Build: Simple binary extraction on AlmaLinux 9.3
Pipelines:
| Pipeline | Receivers | Processors | Exporters |
|---|---|---|---|
| traces | otlp | memory_limiter → attributes → resource → batch | clickhouse, uptrace, file |
| logs | otlp | memory_limiter → resource → batch | clickhouse, victorialogs |
| metrics | otlp | memory_limiter → batch | clickhouse |
Resource Limits:
- Memory: 512 MiB limit, 128 MiB spike
- Batch: 10s timeout, 1024 items, 2048 max
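Whether the memory_limiter is actually rejecting data under load can be read from the collector's self-metrics; a hedged check (metric names follow the 0.116.0 Prometheus conventions and may differ in other builds):
# Non-zero refused counters mean the 512 MiB limit is being enforced
curl -s http://localhost:8888/metrics \
  | grep -E 'otelcol_processor_refused_(spans|log_records|metric_points)'
# Process memory for context
curl -s http://localhost:8888/metrics | grep otelcol_process_memory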
ClickHouse
Version: 24.8 (Official Docker image)
Schema:
- Database: otel
- Tables: otel_traces, otel_logs, otel_metrics
- TTL: 720h (30 days)
- Engine: MergeTree (automatic)
Connection:
- Native: tcp://clickhouse:9000 (internal)
- HTTP: http://localhost:18123 (external)
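Both the engine and the TTL clause can be confirmed over the external HTTP port, assuming the default unauthenticated user:
# The output should contain ENGINE = MergeTree and a TTL clause matching 30 days
curl -s 'http://localhost:18123/' --data-binary \
  'SHOW CREATE TABLE otel.otel_traces FORMAT TSVRaw'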
Grafana
Version: 11.2.0-oss
Plugins: grafana-clickhouse-datasource
Provisioning: /etc/grafana/provisioning/datasources/clickhouse.yaml
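To verify that provisioning actually registered the datasource, query the Grafana API; a sketch assuming the default admin:admin credentials:
curl -s -u admin:admin http://localhost:13000/api/datasources \
  | python3 -m json.tool | grep -E '"(name|type)"'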
Configuration Reference
Directory Structure
infra/clickstack-local/
├── podman-compose.yml # Service definitions
├── configs/
│ ├── otelcol/
│ │ └── config.yaml # OTEL Collector config
│ ├── nginx/
│ │ ├── nginx.conf # Main nginx config
│ │ ├── otel.conf # OTEL module config
│ │ └── default.conf # Virtual host
│ └── grafana/
│ └── provisioning/
│ └── datasources/
│ └── clickhouse.yaml
├── docker/
│ ├── otelcol/
│ │ ├── Dockerfile # AlmaLinux 9.3 based
│ │ └── otelcol-contrib_0.116.0_linux_amd64.tar.gz
│ └── nginx/
│ └── Dockerfile # Multi-stage AlmaLinux 9.3
├── scripts/
│ ├── start.sh # Start stack
│ ├── test-traces.sh # Generate test traces
│ ├── check-trace-loss.sh # Verify no loss
│ └── generate_traces.py # Python OTEL SDK generator
├── www/
│ └── index.html # Test page
└── docs/
└── kb/
└── CLICKSTACK-OTEL-TROUBLESHOOTING.md
Scaling Considerations
Current Limits
| Resource | Current | Max Recommended |
|---|---|---|
| Traces/sec | ~100 | ~1000 |
| Memory (otelcol) | 512 MiB | 2 GiB |
| ClickHouse disk | No quota configured | Monitor disk usage |
| Retention | 30 days | Adjust per needs |
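A rough way to see where the stack sits against these limits is to diff the collector's accepted-span counter around a test burst; a sketch, assuming the 0.116.0 metric name otelcol_receiver_accepted_spans:
spans() { curl -s http://localhost:8888/metrics \
  | awk '/^otelcol_receiver_accepted_spans/ {s+=$2} END {printf "%.0f\n", s}'; }
BEFORE=$(spans)
./scripts/test-traces.sh 1000
sleep 15                       # let the 10s batch window flush
AFTER=$(spans)
echo "accepted during burst: $((AFTER - BEFORE)) spans"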
Horizontal Scaling
flowchart TB
subgraph LB["Load Balancer"]
HAProxy["HAProxy/Traefik"]
end
subgraph Collectors["OTEL Collector Pool"]
O1["otelcol-1"]
O2["otelcol-2"]
O3["otelcol-3"]
end
subgraph Storage["ClickHouse Cluster"]
CH1["CH Shard 1"]
CH2["CH Shard 2"]
end
HAProxy --> O1
HAProxy --> O2
HAProxy --> O3
O1 --> CH1
O2 --> CH1
O3 --> CH2
Improvement Proposals
P1: Critical Improvements
1.1 Add Sampling for Production
# Add to otelcol config
processors:
  # Option A: head-based sampling (cheap, random)
  probabilistic_sampler:
    sampling_percentage: 10  # keep 10% of traces
  # Option B: tail-based sampling (keeps errors and slow traces)
  tail_sampling:
decision_wait: 10s
num_traces: 100
expected_new_traces_per_sec: 10
policies:
- name: errors
type: status_code
status_code: {status_codes: [ERROR]}
- name: slow-traces
type: latency
latency: {threshold_ms: 1000}
1.2 Add Trace Loss Alerting
# Prometheus alert rule
groups:
- name: otelcol
rules:
- alert: TraceLoss
expr: |
rate(otelcol_exporter_send_failed_spans[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "OTEL Collector dropping traces"
P2: Security Improvements
2.1 Enable TLS Between Components
# otelcol config
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
tls:
cert_file: /etc/otelcol/tls/server.crt
key_file: /etc/otelcol/tls/server.key
2.2 Add Authentication
# nginx → otelcol authentication
extensions:
bearertokenauth:
token: "${OTEL_AUTH_TOKEN}"
receivers:
otlp:
protocols:
grpc:
auth:
authenticator: bearertokenauth
P3: Observability Improvements
3.1 Self-Monitoring Stack
flowchart LR
otelcol -->|:8888/metrics| Prometheus
Prometheus --> Grafana
Prometheus --> AlertManager
AlertManager --> Notify["Slack / PagerDuty"]
3.2 Add Trace Context to Logs
# otelcol transform processor
processors:
transform:
log_statements:
- context: log
statements:
- set(attributes["trace_id"], trace_id)
- set(attributes["span_id"], span_id)
P4: Architecture Improvements
4.1 Add Kafka Buffer
flowchart LR
nginx --> otelcol
otelcol --> Kafka["Kafka<br/>(buffer)"]
Kafka --> Consumer["otelcol-consumer"]
Consumer --> ClickHouse
Benefits:
- Decouple ingestion from storage
- Handle traffic spikes
- Replay failed exports
4.2 Add Redis Caching
flowchart TB
nginx -->|trace| otelcol
otelcol -->|dedupe check| Redis
Redis -->|unique| otelcol
otelcol --> ClickHouse
Benefits:
- Deduplicate traces
- Rate limiting
- Circuit breaker state
Buffering & Resilience
Implemented: File Storage (WAL)
The OTEL Collector is configured with persistent queues using the file_storage extension. This provides:
- Write-Ahead Log (WAL): All pending exports are persisted to disk
- Automatic recovery: After restart, queued data is re-exported
- Compaction: Automatic cleanup of processed entries
- fsync: Forced sync to disk for durability
# Current configuration in configs/otelcol/config.yaml
extensions:
file_storage/traces:
directory: /var/lib/otelcol/storage
timeout: 10s
create_directory: true
compaction:
on_start: true
on_rebound: true
directory: /var/lib/otelcol/storage/compaction
max_transaction_size: 65536
fsync: true
exporters:
clickhouse:
sending_queue:
enabled: true
num_consumers: 2
queue_size: 5000
storage: file_storage/traces # Links to WAL
Buffering Architecture
flowchart TB
subgraph nginx["nginx (trace generator)"]
N[ngx_otel_module]
end
subgraph otelcol["OTEL Collector"]
R[Receivers]
P[Processors]
Q[Sending Queue<br/>In-Memory + WAL]
E[Exporters]
subgraph storage["Persistent Storage"]
WAL[file_storage/traces<br/>/var/lib/otelcol/storage]
end
end
subgraph backends["Backends"]
CH[ClickHouse]
UP[Uptrace]
VL[VictoriaLogs]
end
N -->|OTLP gRPC| R
R --> P
P --> Q
Q <-->|persist/recover| WAL
Q --> E
E -->|retry on failure| CH
E -->|retry on failure| UP
E -->|retry on failure| VL
style WAL fill:#ffeb3b,stroke:#333
style Q fill:#4caf50,stroke:#333
Buffering Options Comparison
| Feature | file_storage (WAL) | Redis | Kafka |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Durability | Good | Good | Excellent |
| Capacity | Disk-limited | RAM + AOF | Broker disk (scales out) |
| Best for | Single collector | Moderate outages | Distributed |
| Recovery | Auto on restart | External dep | External dep |
| Latency | Low | Low | Medium |
Optional: Redis Storage Extension
Redis can be used as an alternative buffer backend:
# configs/otelcol/config-redis.yaml (optional)
extensions:
redisstorage:
address: "redis:6379"
backlog_check_interval: 30
process_key_expiration: 120
exporters:
clickhouse:
sending_queue:
enabled: true
storage: redisstorage
queue_size: 10000
requeue_enabled: true
Enable Redis profile:
podman-compose --profile redis up -d
Testing Resilience
Use the disconnect test script to verify buffering:
# Test with default settings (50 traces, 20s outage)
./scripts/test-disconnect.sh
# Custom test (100 traces, 30s outage)
./scripts/test-disconnect.sh 100 30
The test (a runnable sketch follows the list):
- Generates traces before outage
- Stops ClickHouse (simulates backend failure)
- Generates traces during outage
- Restores ClickHouse
- Verifies all traces delivered (zero loss)
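A minimal reproduction of those steps by hand, in case the script is unavailable (timings, counts, and the check-trace-loss.sh arguments are illustrative; the real script may differ):
./scripts/test-traces.sh 25            # 1. traces before outage
podman-compose stop clickhouse         # 2. simulate backend failure
./scripts/test-traces.sh 25            # 3. traces during outage -> WAL-backed queue
sleep 20
podman-compose start clickhouse        # 4. restore backend
sleep 30                               #    give retry_on_failure time to flush
./scripts/check-trace-loss.sh 50 10    # 5. verify all 50 traces arrived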
Retry Configuration
All exporters have retry enabled:
retry_on_failure:
enabled: true
initial_interval: 5s # First retry after 5s
max_interval: 30s # Max backoff 30s
max_elapsed_time: 300s # Give up after 5 min
Queue Monitoring
Monitor queue health via Prometheus metrics:
# Queue size
curl -s http://localhost:8888/metrics | grep otelcol_exporter_queue_size
# Failed exports
curl -s http://localhost:8888/metrics | grep otelcol_exporter_send_failed
# Successful exports
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent
Operations Guide
Quick Commands
# Start stack
cd /opt/work/pne/infra/clickstack-local
podman-compose up -d
# Check status
podman-compose ps
podman ps | grep clickstack
# Health checks
curl http://localhost:18080/health # nginx
curl http://localhost:13133/ # otelcol
curl http://localhost:18123/ping # ClickHouse
curl http://localhost:13000/api/health # Grafana
# Generate test traces
./scripts/test-traces.sh 100
# Check trace loss
./scripts/check-trace-loss.sh 1000 10
# View logs
podman-compose logs -f otelcol
podman-compose logs -f nginx
# Restart
podman-compose restart
# Full rebuild
podman-compose down
podman-compose build --no-cache
podman-compose up -d
Monitoring Queries
-- Total traces
SELECT COUNT(*) FROM otel.otel_traces;
-- Traces by service
SELECT ServiceName, COUNT(*) as count
FROM otel.otel_traces
GROUP BY ServiceName
ORDER BY count DESC;
-- Slow traces (>1s)
SELECT TraceId, SpanName, Duration/1000000 as duration_ms
FROM otel.otel_traces
WHERE Duration > 1000000000
ORDER BY Duration DESC
LIMIT 10;
-- Trace rate (last hour)
SELECT
toStartOfMinute(Timestamp) as minute,
COUNT(*) as traces
FROM otel.otel_traces
WHERE Timestamp > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
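None of these require a ClickHouse client on the host; they run fine over the external HTTP port (default unauthenticated user assumed):
# Traces-by-service, straight over HTTP with readable output
curl -s 'http://localhost:18123/' --data-binary \
  'SELECT ServiceName, count() AS c FROM otel.otel_traces
   GROUP BY ServiceName ORDER BY c DESC FORMAT PrettyCompact'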
Troubleshooting
See: docs/kb/CLICKSTACK-OTEL-TROUBLESHOOTING.md
VPS Observability Stack (pnetest.biz)
Architecture
The VPS stack on 65.109.106.217 provides external observability services accessible via Cloudflare Tunnel.
flowchart TB
subgraph QA["QA Servers (clubber.me)"]
qa8["qa8<br/>nginx 1.28.0 + otelcol"]
qa10["qa10<br/>nginx 1.28.0 + otelcol"]
end
subgraph PROD["Production (pne.io)"]
uptrace_prod["uptrace-2.pne.io"]
clickstack_prod["clickstack-2.pne.io"]
end
subgraph VPS["VPS 65.109.106.217 (pnetest.biz)"]
subgraph CF["Cloudflare Tunnel"]
tunnel["clickstack-docs<br/>55dbf3ec-..."]
end
subgraph Services["Observability Services"]
grafana["grafana.pnetest.biz<br/>Grafana"]
uptrace["uptrace.pnetest.biz<br/>Uptrace"]
metrics["metrics.pnetest.biz<br/>VictoriaMetrics"]
logs["logs.pnetest.biz<br/>VictoriaLogs"]
docs["docs.pnetest.biz<br/>Documentation"]
end
end
qa8 -->|"traces (PROD)"| uptrace_prod
qa8 -->|"traces (PROD)"| clickstack_prod
qa8 -->|"logs (VPS)"| logs
qa10 -->|"traces (PROD)"| uptrace_prod
qa10 -->|"traces (PROD)"| clickstack_prod
tunnel --> grafana
tunnel --> uptrace
tunnel --> metrics
tunnel --> logs
tunnel --> docs
VPS Services
| Service | Domain | Port | Purpose |
|---|---|---|---|
| Grafana | grafana.pnetest.biz | 3000 | Dashboards |
| Uptrace | uptrace.pnetest.biz | 14317 | Distributed tracing |
| VictoriaMetrics | metrics.pnetest.biz | 8428 | Metrics storage |
| VictoriaLogs | logs.pnetest.biz | 9428 | Log storage |
| Documentation | docs.pnetest.biz | 8080 | This site |
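A hedged availability sweep for these services (Grafana's /api/health and the Victoria* /health endpoints are standard; reachability depends on the tunnel and your network):
curl -sf https://grafana.pnetest.biz/api/health  && echo 'grafana: OK'
curl -sf http://metrics.pnetest.biz:8428/health  && echo 'victoriametrics: OK'
curl -sf http://logs.pnetest.biz:9428/health     && echo 'victorialogs: OK'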
Reference Configurations
qa8 OTEL Collector Config
# /opt/otelcol/config.yaml on qa8
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
attributes:
actions:
- key: upstream_addr
pattern: '^(?P<upstream_primary>[^,\s]+)(?:,\s*(?P<upstream_secondary>[^,\s]+))?'
action: extract
batch:
timeout: 10s
send_batch_size: 1024
send_batch_max_size: 2048
exporters:
# PROD - always enabled
otlphttp/uptrace:
traces_endpoint: https://uptrace-2.pne.io/v1/traces
headers:
uptrace-dsn: "http://***@uptrace-2.pne.io?grpc=4317"
compression: gzip
timeout: 30s
otlphttp/clickstack:
traces_endpoint: https://clickstack-2.pne.io/v1/traces
logs_endpoint: https://clickstack-2.pne.io/v1/logs
headers:
Authorization: "***"
compression: gzip
timeout: 30s
# VPS - for testing
otlphttp/victorialogs:
logs_endpoint: http://logs.pnetest.biz:9428/insert/opentelemetry/v1/logs
compression: gzip
timeout: 30s
# Local - for debugging
otlphttp/signoz:
traces_endpoint: http://localhost:5318/v1/traces
logs_endpoint: http://localhost:5318/v1/logs
service:
pipelines:
traces:
receivers: [otlp]
processors: [attributes, batch]
exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/signoz]
logs:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp/clickstack, otlphttp/victorialogs, otlphttp/signoz]
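Before restarting the collector on qa8, the file can be sanity-checked with the binary's built-in validate subcommand (the binary being on PATH is an assumption):
otelcol-contrib validate --config /opt/otelcol/config.yaml && echo 'config OK'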
qa10 OTEL Collector Config
# /opt/otelcol-1/config.yaml on qa10
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
attributes:
actions:
- key: upstream_addr
pattern: '^(?P<upstream_primary>[^,\s]+)(?:,\s*(?P<upstream_secondary>[^,\s]+))?'
action: extract
batch:
timeout: 10s
send_batch_size: 1024
send_batch_max_size: 2048
exporters:
otlphttp/uptrace:
traces_endpoint: https://uptrace-2.pne.io/v1/traces
logs_endpoint: https://uptrace-2.pne.io/v1/logs
headers:
uptrace-dsn: "http://***@uptrace-2.pne.io?grpc=4317"
compression: gzip
timeout: 30s
sending_queue:
storage: file_storage/otc # ← Only qa10 has this!
otlphttp/openobserve:
endpoint: https://openobserve-2.pne.io/api/default
headers:
Authorization: Basic ***
otlphttp/clickstack:
traces_endpoint: https://clickstack-2.pne.io/v1/traces
logs_endpoint: https://clickstack-2.pne.io/v1/logs
headers:
Authorization: "***"
compression: gzip
timeout: 30s
extensions:
file_storage/otc:
directory: ./traces
timeout: 10s
create_directory: true
health_check:
endpoint: 0.0.0.0:13133
service:
extensions: [health_check, file_storage/otc]
pipelines:
traces:
receivers: [otlp]
processors: [attributes, batch]
exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/openobserve]
logs:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/openobserve]
nginx OTEL Module Config (qa8/qa10)
# /etc/nginx/nginx.conf
load_module modules/ngx_otel_module.so;
http {
otel_exporter {
endpoint 127.0.0.1:4317;
}
otel_service_name "nginx-qa8"; # or nginx-qa10
otel_span_attr host_name "$hostname";
otel_span_attr upstream_addr "$upstream_addr";
otel_span_attr upstream_response_time "$upstream_response_time";
otel_span_attr upstream_status "$upstream_status";
otel_span_attr http_user_agent "$http_user_agent";
otel_trace_context propagate;
otel_trace on;
# ... rest of config
}
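Because otel_trace_context propagate is set, nginx continues an incoming W3C trace context instead of starting a new root span. A quick check against the local stack (the traceparent value is the W3C example ID; any valid one works):
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
  http://localhost:18080/
# The stored span in otel.otel_traces should carry TraceId 4bf92f3577b34da6a3ce929d0e0e4736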
References
- OpenTelemetry Collector: https://opentelemetry.io/docs/collector/
- nginx ngx_otel_module: https://nginx.org/en/docs/ngx_otel_module.html
- ClickHouse OTEL Exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
- Grafana ClickHouse Plugin: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource/
- VictoriaLogs: https://docs.victoriametrics.com/victorialogs/
Document generated: 2025-12-05
Stack version: ClickStack Local 1.0.0 + VPS 1.0.0