ClickStack Architecture Documentation

Version: 1.0.0 | Last Updated: 2025-11-28 | Status: Production-Ready (Local Testing) | Maintainer: PNE Infrastructure Team


Table of Contents

  1. Overview
  2. Architecture Diagrams
  3. Component Specifications
  4. Configuration Reference
  5. Scaling Considerations
  6. Improvement Proposals
  7. Buffering & Resilience
  8. Operations Guide
  9. VPS Observability Stack
  10. Reference Configurations

Overview

Purpose

ClickStack is a local OpenTelemetry observability stack for testing and validating trace pipelines before deploying to production. The primary goals are:

  1. Trace Loss Verification: Ensure 0% trace loss in the pipeline
  2. Configuration Testing: Validate otelcol configs before production (see the example after this list)
  3. Integration Testing: Test nginx ngx_otel_module behavior
  4. Performance Benchmarking: Measure throughput and latency
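
For goal 2, configs can be checked offline before the stack is started. Recent collector releases ship a validate subcommand; the binary name and config path below are the local ones:

# Validate the collector config without starting any pipeline
otelcol-contrib validate --config=configs/otelcol/config.yaml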

Technology Stack

Component       | Version | Base Image    | Purpose
----------------|---------|---------------|------------------------------------
nginx           | 1.27.3  | AlmaLinux 9.3 | Trace generation (ngx_otel_module)
otelcol-contrib | 0.116.0 | AlmaLinux 9.3 | Telemetry routing
ClickHouse      | 24.8    | Official      | Trace/Log/Metric storage
Grafana         | 11.2.0  | Official      | Visualization

Port Mapping

Service           | Internal | External | Protocol
------------------|----------|----------|---------
nginx             | 80       | 18080    | HTTP
otelcol gRPC      | 4317     | 4317     | gRPC
otelcol HTTP      | 4318     | 4318     | HTTP
otelcol Health    | 13133    | 13133    | HTTP
otelcol Metrics   | 8888     | 8888     | HTTP
ClickHouse HTTP   | 8123     | 18123    | HTTP
ClickHouse Native | 9000     | 19000    | TCP
Grafana           | 3000     | 13000    | HTTP

Architecture Diagrams

High-Level Architecture

flowchart TB
subgraph Client["Client Layer"]
Browser["Browser/curl"]
TestGen["trace-generator.py"]
end

subgraph Ingress["Ingress Layer"]
nginx["nginx:18080<br/>ngx_otel_module"]
end

subgraph Collector["Telemetry Collector"]
otelcol["otelcol-contrib:4317/4318<br/>receivers → processors → exporters"]
end

subgraph Storage["Storage Layer"]
subgraph Local["Local Storage"]
CH["ClickHouse:18123<br/>otel.otel_traces<br/>otel.otel_logs<br/>otel.otel_metrics"]
FileExport["file/traces<br/>traces.json"]
end
subgraph External["External Storage"]
Uptrace["uptrace-2.pne.io<br/>Distributed Tracing"]
VLogs["logs.pnetest.biz:9428<br/>VictoriaLogs"]
end
end

subgraph Visualization["Visualization"]
Grafana["Grafana:13000<br/>ClickHouse Datasource"]
end

Browser -->|HTTP| nginx
TestGen -->|OTLP HTTP| otelcol
nginx -->|OTLP gRPC| otelcol

otelcol -->|traces| CH
otelcol -->|traces| FileExport
otelcol -->|traces| Uptrace
otelcol -->|logs| CH
otelcol -->|logs| VLogs
otelcol -->|metrics| CH

CH -->|SQL| Grafana

Data Flow Diagram

sequenceDiagram
participant C as Client
participant N as nginx
participant O as otelcol
participant CH as ClickHouse
participant U as Uptrace
participant F as File

C->>N: HTTP Request
activate N
N->>N: Generate Span (ngx_otel)
N->>C: HTTP Response
N->>O: OTLP gRPC (trace)
deactivate N

activate O
O->>O: memory_limiter
O->>O: attributes processor
O->>O: resource processor
O->>O: batch processor

par Fan-out Export
O->>CH: ClickHouse (primary)
O->>U: Uptrace (backup)
O->>F: File (verification)
end
deactivate O

Note over CH: TTL: 30 days
Note over F: Rotation: 100MB/1day

Component Interaction

graph LR
subgraph podman["podman-compose network: clickstack (172.31.0.0/16)"]
N[nginx<br/>172.31.0.x]
O[otelcol<br/>172.31.0.x]
CH[ClickHouse<br/>172.31.0.x]
G[Grafana<br/>172.31.0.x]
end

subgraph host["Host Network"]
H[host.containers.internal]
end

N -->|"otel_exporter<br/>host.containers.internal:4317"| H
H -->|"port forward"| O
O -->|"tcp://clickhouse:9000"| CH
G -->|"clickhouse:8123"| CH

style N fill:#f9f,stroke:#333
style O fill:#bbf,stroke:#333
style CH fill:#bfb,stroke:#333
style G fill:#fbb,stroke:#333

OTEL Collector Pipeline

flowchart LR
subgraph Receivers
OTLP_GRPC["otlp/grpc<br/>:4317"]
OTLP_HTTP["otlp/http<br/>:4318"]
end

subgraph Processors
ML["memory_limiter<br/>512MiB limit"]
ATTR["attributes<br/>upstream extraction"]
RES["resource<br/>env + namespace"]
BATCH["batch<br/>10s / 1024 items"]
end

subgraph Exporters
CH["clickhouse<br/>tcp://:9000"]
UP["otlphttp/uptrace<br/>https://uptrace-2.pne.io"]
VL["otlphttp/victorialogs<br/>http://logs.pnetest.biz"]
FILE["file/traces<br/>/var/lib/otelcol/traces"]
DBG["debug<br/>sampling: 5/200"]
end

OTLP_GRPC --> ML
OTLP_HTTP --> ML
ML --> ATTR
ATTR --> RES
RES --> BATCH

BATCH --> CH
BATCH --> UP
BATCH --> VL
BATCH --> FILE

Component Specifications

nginx with ngx_otel_module

Build: Multi-stage on AlmaLinux 9.3

Stage 1 (Builder):

  • Compiles ngx_otel_module from source
  • Dependencies: grpc-devel, protobuf-devel, abseil-cpp (EPEL + CRB)

Stage 2 (Runtime):

  • nginx 1.27.3 mainline from nginx.org RPM
  • Runtime: grpc, re2, c-ares, abseil-cpp

Configuration:

# otel.conf
otel_exporter {
    endpoint host.containers.internal:4317;  # podman rootless
}
otel_service_name "nginx-clickstack";
otel_trace on;
otel_trace_context propagate;

Key Insight: under rootless podman, nginx reaches the collector via host.containers.internal:4317 (the host gateway and the host-published port) rather than by container name.
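
For that route to work, the collector's OTLP ports must be published to the host. An illustrative excerpt, consistent with the port mapping table; the authoritative definition lives in podman-compose.yml:

# podman-compose.yml (illustrative excerpt)
services:
  otelcol:
    ports:
      - "4317:4317"   # OTLP gRPC - reached by nginx via host.containers.internal
      - "4318:4318"   # OTLP HTTP - used by the Python trace generator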

otelcol-contrib

Build: Simple binary extraction on AlmaLinux 9.3

Pipelines:

Pipeline | Receivers | Processors                                      | Exporters
---------|-----------|-------------------------------------------------|---------------------------
traces   | otlp      | memory_limiter → attributes → resource → batch  | clickhouse, uptrace, file
logs     | otlp      | memory_limiter → resource → batch               | clickhouse, victorialogs
metrics  | otlp      | memory_limiter → batch                          | clickhouse

Resource Limits:

  • Memory: 512 MiB limit, 128 MiB spike
  • Batch: 10s timeout, 1024 items, 2048 max
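
The pipeline table above maps onto the service section of configs/otelcol/config.yaml; a sketch, assuming the component names shown in the diagrams (they must match the names declared in that file):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, resource, batch]
      exporters: [clickhouse, otlphttp/uptrace, file/traces]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [clickhouse, otlphttp/victorialogs]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [clickhouse]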

ClickHouse

Version: 24.8 (Official Docker image)

Schema:

  • Database: otel
  • Tables: otel_traces, otel_logs, otel_metrics
  • TTL: 720h (30 days)
  • Engine: MergeTree (automatic)
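
To confirm the schema and TTL that the exporter created, queries along these lines can be run via clickhouse-client or the HTTP port 18123:

-- Show the generated table definition, including the 720h TTL
SHOW CREATE TABLE otel.otel_traces;

-- Disk usage per OTEL table (active parts only)
SELECT table, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'otel' AND active
GROUP BY table;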

Connection: otelcol writes over the native protocol (tcp://clickhouse:9000 inside the network, host port 19000); Grafana queries over HTTP (clickhouse:8123, host port 18123).

Grafana

Version: 11.2.0-oss

Plugins: grafana-clickhouse-datasource

Provisioning: /etc/grafana/provisioning/datasources/clickhouse.yaml
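
A sketch of what that provisioning file might contain; the jsonData field names are assumptions based on the v4 grafana-clickhouse-datasource provisioning format and should be checked against the plugin version in use:

# /etc/grafana/provisioning/datasources/clickhouse.yaml (sketch)
apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    access: proxy
    jsonData:
      host: clickhouse        # container name on the clickstack network
      port: 9000              # native protocol
      protocol: native
      defaultDatabase: otel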


Configuration Reference

Directory Structure

infra/clickstack-local/
├── podman-compose.yml                  # Service definitions
├── configs/
│   ├── otelcol/
│   │   └── config.yaml                 # OTEL Collector config
│   ├── nginx/
│   │   ├── nginx.conf                  # Main nginx config
│   │   ├── otel.conf                   # OTEL module config
│   │   └── default.conf                # Virtual host
│   └── grafana/
│       └── provisioning/
│           └── datasources/
│               └── clickhouse.yaml
├── docker/
│   ├── otelcol/
│   │   ├── Dockerfile                  # AlmaLinux 9.3 based
│   │   └── otelcol-contrib_0.116.0_linux_amd64.tar.gz
│   └── nginx/
│       └── Dockerfile                  # Multi-stage AlmaLinux 9.3
├── scripts/
│   ├── start.sh                        # Start stack
│   ├── test-traces.sh                  # Generate test traces
│   ├── check-trace-loss.sh             # Verify no loss
│   └── generate_traces.py              # Python OTEL SDK generator
├── www/
│   └── index.html                      # Test page
└── docs/
    └── kb/
        └── CLICKSTACK-OTEL-TROUBLESHOOTING.md

Scaling Considerations

Current Limits

Resource          | Current   | Max Recommended
------------------|-----------|-----------------
Traces/sec        | ~100      | ~1000
Memory (otelcol)  | 512 MiB   | 2 GiB
ClickHouse disk   | Unlimited | Monitor
Retention         | 30 days   | Adjust per needs

Horizontal Scaling

flowchart TB
subgraph LB["Load Balancer"]
HAProxy["HAProxy/Traefik"]
end

subgraph Collectors["OTEL Collector Pool"]
O1["otelcol-1"]
O2["otelcol-2"]
O3["otelcol-3"]
end

subgraph Storage["ClickHouse Cluster"]
CH1["CH Shard 1"]
CH2["CH Shard 2"]
end

HAProxy --> O1
HAProxy --> O2
HAProxy --> O3

O1 --> CH1
O2 --> CH1
O3 --> CH2
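
A hedged HAProxy sketch for balancing OTLP gRPC across the collector pool (host names are placeholders; gRPC needs end-to-end HTTP/2, hence proto h2 on both the bind and server lines):

# haproxy.cfg (sketch) - OTLP gRPC load balancing
frontend otlp_grpc
    mode http
    bind *:4317 proto h2
    default_backend otelcol_pool

backend otelcol_pool
    mode http
    balance roundrobin
    server otelcol-1 otelcol-1:4317 check proto h2
    server otelcol-2 otelcol-2:4317 check proto h2
    server otelcol-3 otelcol-3:4317 check proto h2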

Improvement Proposals

P1: Critical Improvements

1.1 Add Sampling for Production

# Add to otelcol config
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # 10% sampling

  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
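
Either sampler then has to be inserted into the traces pipeline; a sketch, assuming the exporter names used by the local config:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [clickhouse, otlphttp/uptrace, file/traces]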

1.2 Add Trace Loss Alerting

# Prometheus alert rule
groups:
  - name: otelcol
    rules:
      - alert: TraceLoss
        expr: |
          rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OTEL Collector dropping traces"
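
The rule assumes Prometheus already scrapes the collector's self-metrics endpoint on :8888 (see Port Mapping); a minimal scrape config:

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: otelcol
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8888"]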

P2: Security Improvements

2.1 Enable TLS Between Components

# otelcol config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/tls/server.crt
          key_file: /etc/otelcol/tls/server.key

2.2 Add Authentication

# nginx → otelcol authentication
extensions:
  bearertokenauth:
    token: "${OTEL_AUTH_TOKEN}"

receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: bearertokenauth

P3: Observability Improvements

3.1 Self-Monitoring Stack

flowchart LR
otelcol -->|:8888/metrics| Prometheus
Prometheus --> Grafana
Prometheus --> AlertManager
AlertManager --> Notify["Slack / PagerDuty"]

3.2 Add Trace Context to Logs

# otelcol transform processor
processors:
  transform:
    log_statements:
      - context: log
        statements:
          # use the .string accessor so the IDs land in attributes as hex strings
          - set(attributes["trace_id"], trace_id.string)
          - set(attributes["span_id"], span_id.string)

P4: Architecture Improvements

4.1 Add Kafka Buffer

flowchart LR
nginx --> otelcol
otelcol --> Kafka["Kafka<br/>(buffer)"]
Kafka --> Consumer["otelcol-consumer"]
Consumer --> ClickHouse

Benefits:

  • Decouple ingestion from storage
  • Handle traffic spikes
  • Replay failed exports
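
A hedged sketch of the two collector legs, using the kafka exporter and receiver from otelcol-contrib (broker address and topic name are placeholders):

# Ingest collector: buffer traces into Kafka
exporters:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_spans
    encoding: otlp_proto

# Consumer collector: drain Kafka into ClickHouse
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_spans
    encoding: otlp_proto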

4.2 Add Redis Caching

flowchart TB
nginx -->|trace| otelcol
otelcol -->|dedupe check| Redis
Redis -->|unique| otelcol
otelcol --> ClickHouse

Benefits:

  • Deduplicate traces
  • Rate limiting
  • Circuit breaker state

Buffering & Resilience

Implemented: File Storage (WAL)

The OTEL Collector is configured with persistent queues using the file_storage extension. This provides:

  • Write-Ahead Log (WAL): All pending exports are persisted to disk
  • Automatic recovery: After restart, queued data is re-exported
  • Compaction: Automatic cleanup of processed entries
  • fsync: Forced sync to disk for durability

# Current configuration in configs/otelcol/config.yaml
extensions:
  file_storage/traces:
    directory: /var/lib/otelcol/storage
    timeout: 10s
    create_directory: true
    compaction:
      on_start: true
      on_rebound: true
      directory: /var/lib/otelcol/storage/compaction
      max_transaction_size: 65536
    fsync: true

exporters:
  clickhouse:
    sending_queue:
      enabled: true
      num_consumers: 2
      queue_size: 5000
      storage: file_storage/traces  # Links to WAL

Buffering Architecture

flowchart TB
subgraph nginx["nginx (trace generator)"]
N[ngx_otel_module]
end

subgraph otelcol["OTEL Collector"]
R[Receivers]
P[Processors]
Q[Sending Queue<br/>In-Memory + WAL]
E[Exporters]

subgraph storage["Persistent Storage"]
WAL[file_storage/traces<br/>/var/lib/otelcol/storage]
end
end

subgraph backends["Backends"]
CH[ClickHouse]
UP[Uptrace]
VL[VictoriaLogs]
end

N -->|OTLP gRPC| R
R --> P
P --> Q
Q <-->|persist/recover| WAL
Q --> E
E -->|retry on failure| CH
E -->|retry on failure| UP
E -->|retry on failure| VL

style WAL fill:#ffeb3b,stroke:#333
style Q fill:#4caf50,stroke:#333

Buffering Options Comparison

Feature    | file_storage (WAL) | Redis            | Kafka
-----------|--------------------|------------------|-------------
Complexity | Low                | Medium           | High
Durability | Good               | Good             | Excellent
Capacity   | Disk-limited       | RAM + AOF        | Unlimited
Best for   | Single collector   | Moderate outages | Distributed
Recovery   | Auto on restart    | External dep     | External dep
Latency    | Low                | Low              | Medium

Optional: Redis Storage Extension

Redis can be used as an alternative buffer backend:

# configs/otelcol/config-redis.yaml (optional)
extensions:
  redisstorage:
    address: "redis:6379"
    backlog_check_interval: 30
    process_key_expiration: 120

exporters:
  clickhouse:
    sending_queue:
      enabled: true
      storage: redisstorage
      queue_size: 10000
      requeue_enabled: true

Enable Redis profile:

podman-compose --profile redis up -d

Testing Resilience

Use the disconnect test script to verify buffering:

# Test with default settings (50 traces, 20s outage)
./scripts/test-disconnect.sh

# Custom test (100 traces, 30s outage)
./scripts/test-disconnect.sh 100 30

The test:

  1. Generates traces before outage
  2. Stops ClickHouse (simulates backend failure)
  3. Generates traces during outage
  4. Restores ClickHouse
  5. Verifies all traces delivered (zero loss)
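
A simplified sketch of that flow; the real logic lives in scripts/test-disconnect.sh and may differ in detail, and the unauthenticated HTTP query assumes the local default ClickHouse user:

#!/usr/bin/env bash
set -euo pipefail
TRACES=${1:-50}   # traces per phase
OUTAGE=${2:-20}   # outage duration in seconds

count() { curl -s 'http://localhost:18123/?query=SELECT%20count()%20FROM%20otel.otel_traces'; }
before=$(count)

./scripts/test-traces.sh "$TRACES"      # phase 1: backend healthy
podman-compose stop clickhouse          # simulate backend failure
./scripts/test-traces.sh "$TRACES"      # phase 2: spans buffered in the WAL-backed queue
sleep "$OUTAGE"
podman-compose start clickhouse         # restore backend
sleep 60                                # let retry_on_failure drain the queue

after=$(count)
echo "spans delivered: $((after - before)) (expected roughly 2 x $TRACES)"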

Retry Configuration

All exporters have retry enabled:

retry_on_failure:
  enabled: true
  initial_interval: 5s    # First retry after 5s
  max_interval: 30s       # Max backoff 30s
  max_elapsed_time: 300s  # Give up after 5 min

Queue Monitoring

Monitor queue health via Prometheus metrics:

# Queue size
curl -s http://localhost:8888/metrics | grep otelcol_exporter_queue_size

# Failed exports
curl -s http://localhost:8888/metrics | grep otelcol_exporter_send_failed

# Successful exports
curl -s http://localhost:8888/metrics | grep otelcol_exporter_sent

Operations Guide

Quick Commands

# Start stack
cd /opt/work/pne/infra/clickstack-local
podman-compose up -d

# Check status
podman-compose ps
podman ps | grep clickstack

# Health checks
curl http://localhost:18080/health # nginx
curl http://localhost:13133/ # otelcol
curl http://localhost:18123/ping # ClickHouse
curl http://localhost:13000/api/health # Grafana

# Generate test traces
./scripts/test-traces.sh 100

# Check trace loss
./scripts/check-trace-loss.sh 1000 10

# View logs
podman-compose logs -f otelcol
podman-compose logs -f nginx

# Restart
podman-compose restart

# Full rebuild
podman-compose down
podman-compose build --no-cache
podman-compose up -d

Monitoring Queries

-- Total traces
SELECT COUNT(*) FROM otel.otel_traces;

-- Traces by service
SELECT ServiceName, COUNT(*) as count
FROM otel.otel_traces
GROUP BY ServiceName
ORDER BY count DESC;

-- Slow traces (>1s)
SELECT TraceId, SpanName, Duration/1000000 as duration_ms
FROM otel.otel_traces
WHERE Duration > 1000000000
ORDER BY Duration DESC
LIMIT 10;

-- Trace rate (last hour)
SELECT
toStartOfMinute(Timestamp) as minute,
COUNT(*) as traces
FROM otel.otel_traces
WHERE Timestamp > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;

Troubleshooting

See: docs/kb/CLICKSTACK-OTEL-TROUBLESHOOTING.md


VPS Observability Stack (pnetest.biz)

Architecture

The VPS stack on 65.109.106.217 provides external observability services accessible via Cloudflare Tunnel.

flowchart TB
subgraph QA["QA Servers (clubber.me)"]
qa8["qa8<br/>nginx 1.28.0 + otelcol"]
qa10["qa10<br/>nginx 1.28.0 + otelcol"]
end

subgraph PROD["Production (pne.io)"]
uptrace_prod["uptrace-2.pne.io"]
clickstack_prod["clickstack-2.pne.io"]
end

subgraph VPS["VPS 65.109.106.217 (pnetest.biz)"]
subgraph CF["Cloudflare Tunnel"]
tunnel["clickstack-docs<br/>55dbf3ec-..."]
end

subgraph Services["Observability Services"]
grafana["grafana.pnetest.biz<br/>Grafana"]
uptrace["uptrace.pnetest.biz<br/>Uptrace"]
metrics["metrics.pnetest.biz<br/>VictoriaMetrics"]
logs["logs.pnetest.biz<br/>VictoriaLogs"]
docs["docs.pnetest.biz<br/>Documentation"]
end
end

qa8 -->|"traces (PROD)"| uptrace_prod
qa8 -->|"traces (PROD)"| clickstack_prod
qa8 -->|"logs (VPS)"| logs

qa10 -->|"traces (PROD)"| uptrace_prod
qa10 -->|"traces (PROD)"| clickstack_prod

tunnel --> grafana
tunnel --> uptrace
tunnel --> metrics
tunnel --> logs
tunnel --> docs

VPS Services

Service         | Domain               | Port  | Purpose
----------------|----------------------|-------|--------------------
Grafana         | grafana.pnetest.biz  | 3000  | Dashboards
Uptrace         | uptrace.pnetest.biz  | 14317 | Distributed tracing
VictoriaMetrics | metrics.pnetest.biz  | 8428  | Metrics storage
VictoriaLogs    | logs.pnetest.biz     | 9428  | Log storage
Documentation   | docs.pnetest.biz     | 8080  | This site
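
A quick reachability check of the tunneled services from any host, using only the domains from the table above (it just hits the root of each site and prints the HTTP status code):

for host in grafana uptrace metrics logs docs; do
  printf '%-24s %s\n' "$host.pnetest.biz" \
    "$(curl -s -o /dev/null -w '%{http_code}' "https://$host.pnetest.biz/")"
done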

Reference Configurations

qa8 OTEL Collector Config

# /opt/otelcol/config.yaml on qa8
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  attributes:
    actions:
      - key: upstream_addr
        pattern: '^(?P<upstream_primary>[^,\s]+)(?:,\s*(?P<upstream_secondary>[^,\s]+))?'
        action: extract

  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  # PROD - always enabled
  otlphttp/uptrace:
    traces_endpoint: https://uptrace-2.pne.io/v1/traces
    headers:
      uptrace-dsn: "http://***@uptrace-2.pne.io?grpc=4317"
    compression: gzip
    timeout: 30s

  otlphttp/clickstack:
    traces_endpoint: https://clickstack-2.pne.io/v1/traces
    logs_endpoint: https://clickstack-2.pne.io/v1/logs
    headers:
      Authorization: "***"
    compression: gzip
    timeout: 30s

  # VPS - for testing
  otlphttp/victorialogs:
    logs_endpoint: http://logs.pnetest.biz:9428/insert/opentelemetry/v1/logs
    compression: gzip
    timeout: 30s

  # Local - for debugging
  otlphttp/signoz:
    traces_endpoint: http://localhost:5318/v1/traces
    logs_endpoint: http://localhost:5318/v1/logs

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/signoz]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/clickstack, otlphttp/victorialogs, otlphttp/signoz]

qa10 OTEL Collector Config

# /opt/otelcol-1/config.yaml on qa10
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  attributes:
    actions:
      - key: upstream_addr
        pattern: '^(?P<upstream_primary>[^,\s]+)(?:,\s*(?P<upstream_secondary>[^,\s]+))?'
        action: extract

  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  otlphttp/uptrace:
    traces_endpoint: https://uptrace-2.pne.io/v1/traces
    logs_endpoint: https://uptrace-2.pne.io/v1/logs
    headers:
      uptrace-dsn: "http://***@uptrace-2.pne.io?grpc=4317"
    compression: gzip
    timeout: 30s
    sending_queue:
      storage: file_storage/otc  # ← Only qa10 has this!

  otlphttp/openobserve:
    endpoint: https://openobserve-2.pne.io/api/default
    headers:
      Authorization: Basic ***

  otlphttp/clickstack:
    traces_endpoint: https://clickstack-2.pne.io/v1/traces
    logs_endpoint: https://clickstack-2.pne.io/v1/logs
    headers:
      Authorization: "***"
    compression: gzip
    timeout: 30s

extensions:
  file_storage/otc:
    directory: ./traces
    timeout: 10s
    create_directory: true

  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check, file_storage/otc]

  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/openobserve]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/uptrace, otlphttp/clickstack, otlphttp/openobserve]

nginx OTEL Module Config (qa8/qa10)

# /etc/nginx/nginx.conf
load_module modules/ngx_otel_module.so;

http {
    otel_exporter {
        endpoint 127.0.0.1:4317;
    }

    otel_service_name "nginx-qa8";  # or nginx-qa10
    otel_span_attr host_name "$hostname";
    otel_span_attr upstream_addr "$upstream_addr";
    otel_span_attr upstream_response_time "$upstream_response_time";
    otel_span_attr upstream_status "$upstream_status";
    otel_span_attr http_user_agent "$http_user_agent";
    otel_trace_context propagate;
    otel_trace on;

    # ... rest of config
}
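
After editing, the config can be validated and applied without dropping traffic; a request through nginx should then emit one span to the local collector:

nginx -t && nginx -s reload              # validate, then hot-reload
curl -s -o /dev/null http://localhost/   # assumes nginx listens on port 80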


Document generated: 2025-12-05 | Stack version: ClickStack Local 1.0.0 + VPS 1.0.0