Kubernetes Deployment
Production-ready Kubernetes deployment.
Table of contents
- Overview
- Quick Start
- Helm Chart Values
- Rivven Operator (CRDs)
- Manual Deployment
- Scaling
- Monitoring
- Backup & Recovery
- Troubleshooting
- Production Checklist
- Next Steps
Overview
Rivven is designed for cloud-native deployment with:
- StatefulSet for ordered, stable pod identities
- Headless Service for peer discovery
- PersistentVolumeClaims for data durability
- Horizontal Pod Autoscaler for dynamic scaling
- Prometheus ServiceMonitor for observability
Quick Start
Helm Installation
# Add Rivven Helm repository
helm repo add rivven https://rivven.hupe1980.github.io/rivven/helm
helm repo update
# Install Rivven cluster
helm install rivven rivven/rivven \
  --namespace rivven \
  --create-namespace \
  --set cluster.replicas=3 \
  --set storage.size=100Gi
Verify Deployment
# Check pods
kubectl get pods -n rivven -w
# Expected output:
# NAME READY STATUS RESTARTS AGE
# rivven-0 1/1 Running 0 2m
# rivven-1 1/1 Running 0 1m
# rivven-2 1/1 Running 0 30s
# Check cluster status
kubectl exec -n rivven rivven-0 -- rivven cluster status
Helm Chart Values
Basic Configuration
# values.yaml
cluster:
  replicas: 3
image:
  repository: ghcr.io/hupe1980/rivven
  tag: latest
  pullPolicy: IfNotPresent
storage:
  size: 100Gi
  storageClass: "" # Use default
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 8Gi
service:
  type: ClusterIP
  port: 9292
High Availability
# ha-values.yaml
cluster:
  replicas: 5
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: rivven
        topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: rivven
persistence:
  storageClass: premium-ssd
  size: 500Gi
TLS Configuration
# tls-values.yaml
tls:
  enabled: true
  certManager:
    enabled: true
    issuerRef:
      name: letsencrypt-prod
      kind: ClusterIssuer
  # Or use existing secret
  # existingSecret: rivven-tls
  mtls:
    enabled: true
    clientCA:
      secretName: rivven-client-ca
Authentication
auth:
  enabled: true
  mechanism: mtls # mtls, scram, token
  # For SCRAM
  scram:
    existingSecret: rivven-users
rbac:
  enabled: true
  configMapName: rivven-rbac
Rivven Operator (CRDs)
The Rivven Operator provides declarative management via Custom Resource Definitions.
Leader Election
When deployed with multiple replicas, the operator uses Kubernetes Lease-based leader election to ensure only one instance actively reconciles resources. Enable it with --leader-election:
args:
- --leader-election
- Lease duration: 15 seconds
- Renewal interval: 10 seconds
- Lease resource: rivven-operator-leader in the operator's namespace (or kube-system)
- Non-leader replicas wait passively until they acquire the lease
- When running a single replica, omit --leader-election (a warning is logged)
- Lease-loss detection: The operator tracks consecutive lease renewal failures. After exceeding the threshold (ceil(lease_duration / renew_interval)), the process exits so that Kubernetes can restart it with a clean state. This prevents split-brain scenarios in which a stale leader continues reconciling.
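As a worked example of the threshold formula, the 15-second lease and 10-second renewal interval listed above give ceil(15 / 10) = 2. The sketch below only reproduces that arithmetic; the function name is illustrative and is not part of the operator.

```shell
# Illustrative helper (not part of the Rivven operator):
# ceil(lease_duration / renew_interval) using integer arithmetic.
lease_failure_threshold() {
  lease_duration=$1   # seconds
  renew_interval=$2   # seconds
  echo $(( (lease_duration + renew_interval - 1) / renew_interval ))
}

lease_failure_threshold 15 10   # prints 2
```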
Install the Operator
# Install CRDs
kubectl apply -f https://github.com/hupe1980/rivven/releases/latest/download/rivven-crds.yaml
# Deploy operator
kubectl apply -f https://github.com/hupe1980/rivven/releases/latest/download/rivven-operator.yaml
RivvenCluster CRD
apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenCluster
metadata:
  name: production
spec:
  replicas: 3
  version: "0.0.22"
  storage:
    size: 100Gi
    storageClassName: fast-ssd
  config:
    defaultPartitions: 3
    defaultReplicationFactor: 2
  metrics:
    enabled: true
RivvenConnect CRD
Manage CDC pipelines declaratively with generic connector configs (Kafka Connect style):
apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenConnect
metadata:
  name: cdc-pipeline
spec:
  clusterRef:
    name: production
  replicas: 2
  sources:
    - name: postgres-cdc
      connector: postgres-cdc
      topic: cdc.events
      configSecretRef: postgres-credentials
      # All connector config in generic 'config' field (Kafka Connect style)
      config:
        slotName: rivven_slot
        publication: rivven_pub
        snapshotMode: initial
        decodingPlugin: pgoutput
        # Tables are inside CDC config
        tables:
          - schema: public
            table: orders
            columns: [id, customer_id, total]
          - schema: public
            table: customers
            columnMasks:
              email: "***@***.***"
  sinks:
    - name: s3-archive
      connector: s3
      topics: ["cdc.*"]
      consumerGroup: s3-archiver
      configSecretRef: s3-credentials
      # All connector config in generic 'config' field
      config:
        bucket: data-lake
        format: parquet
        compression: zstd
Advanced CDC Configuration
For production CDC deployments, configure advanced features using the generic config field:
apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenConnect
metadata:
  name: production-cdc
spec:
  clusterRef:
    name: production
  replicas: 3
  sources:
    - name: postgres-cdc
      connector: postgres-cdc
      topic: cdc.events
      configSecretRef: postgres-credentials
      # All connector config in generic 'config' field (Kafka Connect style)
      config:
        slotName: rivven_slot
        publication: rivven_pub
        snapshotMode: initial
        tables:
          - schema: public
            table: orders
        # Snapshot configuration
        snapshot:
          batchSize: 20000 # Rows per batch
          parallelTables: 8 # Parallel table snapshots
          queryTimeoutSecs: 600 # Query timeout
          maxRetries: 5 # Retry failed batches
        # Incremental snapshots (non-blocking)
        incrementalSnapshot:
          enabled: true
          chunkSize: 2048
          watermarkStrategy: update_and_insert
          maxConcurrentChunks: 4
        # Heartbeat monitoring
        heartbeat:
          enabled: true
          intervalSecs: 5
          maxLagSecs: 60
          emitEvents: true
        # Deduplication (bloom filter + LRU)
        deduplication:
          enabled: true
          bloomExpectedInsertions: 500000
          bloomFpp: 0.001
          lruSize: 50000
          windowSecs: 7200
        # Transaction metadata
        transactionTopic:
          enabled: true
          topicName: cdc.transactions
          includeDataCollections: true
        # Schema change capture
        schemaChangeTopic:
          enabled: true
          topicName: cdc.schema-changes
          includeColumns: true
        # Parallel processing
        parallel:
          enabled: true
          concurrency: 8
          perTableBuffer: 2000
          outputBuffer: 20000
          workStealing: true
        # Health monitoring
        health:
          enabled: true
          checkIntervalSecs: 5
          maxLagMs: 10000
          failureThreshold: 3
          autoRecovery: true
        # Event routing
        router:
          enabled: true
          defaultDestination: cdc.events.default
          deadLetterQueue: cdc.dlq
          rules:
            - conditionType: table
              conditionValue: orders
              destination: cdc.orders
              priority: 100
            - conditionType: schema
              conditionValue: audit
              destination: cdc.audit
              priority: 50
        # Custom partitioning
        partitioner:
          enabled: true
          numPartitions: 32
          strategy: key_hash
          keyColumns: [id]
        # SMT (Single Message Transforms)
        transforms:
          - type: extract_new_record_state
            config:
              add_fields: table,op
          - type: mask_field
            config:
              fields: ssn,credit_card
              replacement: "***"
Advanced CDC Feature Reference
| Feature | Description |
|---|---|
| Snapshot | Batch size, parallelism, timeouts, retries for initial load |
| Incremental Snapshot | Non-blocking re-snapshots without stopping CDC stream |
| Signal Table | Ad-hoc snapshot control via database table or Rivven topic |
| Heartbeat | Connection health monitoring with lag detection |
| Deduplication | Bloom filter + LRU cache for duplicate prevention |
| Transaction Topic | Transaction boundary metadata |
| Schema Change Topic | DDL change capture |
| Tombstone | Delete event handling configuration |
| Field Encryption | Field-level encryption with AES-256-GCM |
| Read-Only Replica | PostgreSQL replica support with lag tracking |
| Event Router | Dynamic routing with rules, DLQ, priorities |
| Partitioner | Custom partitioning strategies |
| SMT Transforms | 17 transform types (mask, filter, flatten, etc.) |
| Parallel | Multi-threaded processing with work stealing |
| Outbox | Transactional outbox pattern support |
| Health Monitor | Auto-recovery with exponential backoff |
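To get a feel for what the deduplication settings cost in memory, the textbook Bloom filter sizing formulas can be applied to the example values bloomExpectedInsertions: 500000 and bloomFpp: 0.001. This is a rough sketch that assumes Rivven sizes its filter with the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2; the source does not confirm this, and the function name is illustrative.

```shell
# Textbook Bloom filter sizing (assumption: not confirmed as Rivven's method).
# Prints "<bits> <hash functions>" for n expected insertions and
# target false-positive probability p.
bloom_size() {
  awk -v n="$1" -v p="$2" 'BEGIN {
    ln2 = log(2)
    m = -n * log(p) / (ln2 * ln2)   # total bits in the filter
    k = (m / n) * ln2               # optimal number of hash functions
    printf "%.0f %.0f\n", m, k
  }'
}

# Roughly 7.19 million bits (under 1 MiB) and 10 hash functions.
bloom_size 500000 0.001
```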
Custom Connectors
For custom connectors, use the generic config field:
sources:
  - name: my-custom-source
    connector: my-plugin
    topic: custom.events
    config: # Generic JSON for custom connectors
      customField: value
      nested:
        option: true
RivvenTopic CRD
Manage topics declaratively for GitOps workflows:
apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenTopic
metadata:
  name: orders-events
  namespace: default
spec:
  clusterRef:
    name: production
  partitions: 12
  replicationFactor: 3
  config:
    retentionMs: 604800000 # 7 days
    cleanupPolicy: delete
    compressionType: lz4
    minInsyncReplicas: 2
  acls:
    - principal: "user:order-service"
      operations: ["Read", "Write"]
    - principal: "user:analytics"
      operations: ["Read"]
  # Keep topic when CRD is deleted
  deleteOnRemove: false
  topicLabels:
    team: orders
    environment: production
Topic Configuration Options
| Field | Description | Default |
|---|---|---|
| retentionMs | Retention time in milliseconds | 604800000 (7 days) |
| retentionBytes | Retention size per partition (-1 = unlimited) | -1 |
| segmentBytes | Segment file size | 1073741824 (1GB) |
| cleanupPolicy | delete, compact, or delete,compact | delete |
| compressionType | none, gzip, snappy, lz4, zstd | lz4 |
| minInsyncReplicas | Minimum ISR for writes | 1 |
| maxMessageBytes | Maximum message size | 1048576 (1MB) |
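The raw numbers in the table are easier to adjust when derived rather than typed by hand. The helpers below are illustrative (not part of any Rivven tooling) and simply reproduce the defaults: 7 days in milliseconds, and binary sizes where 1073741824 bytes is 1024^3 and 1048576 bytes is 1024^2.

```shell
# Illustrative conversions for retentionMs, segmentBytes, maxMessageBytes.
days_to_ms()   { echo $(( $1 * 24 * 60 * 60 * 1000 )); }
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

days_to_ms 7     # prints 604800000  (default retentionMs)
gib_to_bytes 1   # prints 1073741824 (default segmentBytes)
mib_to_bytes 1   # prints 1048576    (default maxMessageBytes)
```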
Check Topic Status
# List all topics
kubectl get rivventopics
# NAME CLUSTER PARTITIONS REPLICATION PHASE AGE
# orders-events production 12 3 Ready 5m
# Detailed status
kubectl describe rivventopic orders-events
RivvenSchemaRegistry CRD
Deploy and manage a high-performance Schema Registry declaratively:
apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenSchemaRegistry
metadata:
  name: schema-registry
spec:
  clusterRef:
    name: production
  replicas: 2
  version: "0.0.22"
  # Server configuration
  server:
    port: 8081
    requestTimeoutMs: 30000
    corsEnabled: true
  # Schema storage in broker
  storage:
    mode: broker
    topic: _schemas
    replicationFactor: 3
  # Compatibility settings
  compatibility:
    defaultLevel: BACKWARD
    perSubject:
      "order-events-value": FULL
      "user-profile-value": FORWARD
  # Supported formats
  formats:
    avro: true
    jsonSchema: true
    protobuf: true
  # JWT authentication
  auth:
    enabled: true
    method: jwt
    jwt:
      issuerUrl: "https://auth.example.com"
      audience: "schema-registry"
  # TLS
  tls:
    enabled: true
    certSecretName: schema-registry-tls
  # Prometheus metrics
  metrics:
    enabled: true
    serviceMonitorEnabled: true
  resources:
    requests:
      cpu: "500m"
      memory: 1Gi
Schema Compatibility Levels
| Level | Description |
|---|---|
| NONE | No compatibility checking |
| BACKWARD | New schema can read data written with old schema |
| BACKWARD_TRANSITIVE | New schema can read data from all previous versions |
| FORWARD | Old schema can read data written with new schema |
| FORWARD_TRANSITIVE | All previous schemas can read new data |
| FULL | Both backward and forward compatible |
| FULL_TRANSITIVE | Full compatibility with all versions |
Authentication Options
Basic Auth:
auth:
  enabled: true
  method: basic
  users:
    - username: admin
      passwordSecretKey: admin-password
      role: admin
    - username: reader
      passwordSecretKey: reader-password
      role: reader
JWT/OIDC Auth:
auth:
  enabled: true
  method: jwt
  jwt:
    issuerUrl: "https://auth.example.com"
    jwksUrl: "https://auth.example.com/.well-known/jwks.json"
    audience: "schema-registry"
Cedar Policy Auth:
auth:
  enabled: true
  method: cedar
  cedar:
    policySecretRef: cedar-policies
External Registry Sync
Sync schemas with external schema registries or AWS Glue:
externalRegistry:
  enabled: true
  registryType: compatible
  registryUrl: "https://external-sr.example.com"
  syncMode: mirror # mirror, push, bidirectional
  syncSubjects:
    - "orders-*"
  syncIntervalSeconds: 300
  credentialsSecretRef: external-creds
Check Schema Registry Status
# List schema registries
kubectl get rivvenschemaregistries
# NAME CLUSTER REPLICAS PHASE AGE
# schema-registry production 2 Running 5m
# Detailed status
kubectl describe rivvenschemaregistry schema-registry
# Test the API
kubectl port-forward svc/rivven-schema-schema-registry 8081:8081
curl http://localhost:8081/subjects
Manual Deployment
Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: rivven
  labels:
    app.kubernetes.io/name: rivven
ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: rivven-config
  namespace: rivven
data:
  rivven.yaml: |
    node:
      data_dir: /data
    cluster:
      bootstrap_expect: 3
      transport: quic
    storage:
      segment_size: 1073741824 # 1 GB
      retention_bytes: 107374182400 # 100 GB
    observability:
      metrics:
        enabled: true
        port: 9090
StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rivven
  namespace: rivven
spec:
  serviceName: rivven-headless
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  template:
    metadata:
      labels:
        app.kubernetes.io/name: rivven
    spec:
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: rivven
          image: ghcr.io/hupe1980/rivven:latest
          ports:
            - name: client
              containerPort: 9292
            - name: cluster
              containerPort: 9393
            - name: metrics
              containerPort: 9090
          env:
            - name: RIVVEN_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: RIVVEN_ADVERTISE_ADDR
              value: "$(RIVVEN_NODE_ID).rivven-headless.rivven.svc.cluster.local:9393"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /etc/rivven
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 8Gi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 9090
            initialDelaySeconds: 10
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /health/startup
              port: 9090
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: config
          configMap:
            name: rivven-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Services
# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: rivven-headless
  namespace: rivven
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: rivven
  ports:
    - name: cluster
      port: 9393
---
# Client service
apiVersion: v1
kind: Service
metadata:
  name: rivven
  namespace: rivven
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: rivven
  ports:
    - name: client
      port: 9292
      targetPort: 9292
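Because the StatefulSet is governed by the headless service, each pod gets a stable DNS name of the form pod-name.service-name.namespace.svc.cluster.local, which is what RIVVEN_ADVERTISE_ADDR is built from. The sketch below lists the peer addresses implied by the manifests above; it assumes the default cluster.local cluster domain, and the function name is illustrative.

```shell
# Print the per-pod cluster addresses for a given replica count.
# Names follow the manifests above: StatefulSet "rivven", headless
# service "rivven-headless", namespace "rivven", cluster port 9393.
peer_addresses() {
  replicas=$1
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "rivven-$i.rivven-headless.rivven.svc.cluster.local:9393"
    i=$(( i + 1 ))
  done
}

peer_addresses 3
```

These are the same names used by the connectivity check in the Troubleshooting section.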
Ingress (External Access)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rivven
  namespace: rivven
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - rivven.example.com
      secretName: rivven-tls
  rules:
    - host: rivven.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rivven
                port:
                  number: 9292
Scaling
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rivven
  namespace: rivven
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: rivven
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
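For a rough idea of how the autoscaler reacts, the Kubernetes HPA computes desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization) for each metric and then clamps the result to the min/max bounds. The sketch below applies that formula to a single metric with the 3 to 10 replica bounds from the manifest above. It deliberately ignores the controller's tolerance band (10% by default), the scale-down policy, and the stabilization window, so treat it as an approximation; the function name is illustrative.

```shell
# Approximate HPA decision for one utilization metric.
# Arguments: current replicas, observed utilization (%), target utilization (%).
hpa_desired_replicas() {
  current_replicas=$1
  current_util=$2
  target_util=$3
  # Integer ceiling of current_replicas * current_util / target_util.
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt 3 ]; then desired=3; fi    # minReplicas
  if [ "$desired" -gt 10 ]; then desired=10; fi  # maxReplicas
  echo "$desired"
}

hpa_desired_replicas 3 90 70   # prints 4: ceil(3 * 90 / 70)
```

When several metrics are configured, as with CPU and memory here, the HPA uses the largest of the per-metric results.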
Manual Scaling
# Scale up
kubectl scale statefulset rivven -n rivven --replicas=5
# Scale down (graceful)
kubectl scale statefulset rivven -n rivven --replicas=3
Storage Expansion
# Expand PVC (if storage class supports it)
kubectl patch pvc data-rivven-0 -n rivven \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
PVC Cleanup on Cluster Deletion
When a RivvenCluster custom resource is deleted, the operator automatically deletes the associated PersistentVolumeClaims. Kubernetes retains PVCs created from a StatefulSet's volumeClaimTemplates by default, so without this cleanup they would persist after cluster teardown and leak storage.
The operator identifies PVCs by the label app.kubernetes.io/instance={cluster-name} and deletes them as part of the finalizer cleanup process. No manual intervention is required.
# Verify PVCs are cleaned up after cluster deletion
kubectl delete rivvencluster my-cluster -n rivven
kubectl get pvc -n rivven -l app.kubernetes.io/instance=my-cluster # Should return empty
Reconciliation Error Backoff
The operator uses exponential backoff for reconciliation errors to avoid overwhelming the API server:
- Initial delay: 30 seconds
- Maximum delay: 600 seconds (10 minutes)
- Backoff multiplier: 2×
- Reset: Error count resets on successful reconciliation
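The parameters above imply the following retry schedule. This is a sketch of plain capped exponential backoff under the listed values (30 s initial, 2× multiplier, 600 s cap); whether the operator adds jitter is not stated in this document, and the function name is illustrative.

```shell
# Delay in seconds before the next retry, given the number of
# consecutive failed reconciliations (starting at 1).
backoff_delay() {
  failures=$1
  delay=30
  i=1
  # Double once per additional failure, stopping early once the cap is reached.
  while [ "$i" -lt "$failures" ] && [ "$delay" -lt 600 ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt 600 ]; then delay=600; fi
  echo "$delay"
}

# Prints 30, 60, 120, 240, 480, 600, 600 (one value per line).
for n in 1 2 3 4 5 6 7; do
  backoff_delay "$n"
done
```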
Monitoring
ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rivven
  namespace: rivven
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
PodMonitor
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rivven
  namespace: rivven
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
Grafana Dashboard
Import the Rivven dashboard:
kubectl apply -f https://raw.githubusercontent.com/hupe1980/rivven/main/deploy/grafana-dashboard.yaml
Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rivven-alerts
  namespace: rivven
spec:
  groups:
    - name: rivven
      rules:
        - alert: RivvenNodeDown
          expr: up{job="rivven"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Rivven node is down"
        - alert: RivvenHighLag
          expr: rivven_consumer_lag_records > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag on "
        - alert: RivvenDiskUsageHigh
          expr: (rivven_storage_bytes_used / rivven_storage_bytes_total) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Disk usage > 85% on "
Backup & Recovery
Scheduled Backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rivven-backup
  namespace: rivven
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: ghcr.io/hupe1980/rivven:latest
              command:
                - /bin/sh
                - -c
                - |
                  rivven backup create \
                    --output s3://backups/rivven/$(date +%Y%m%d) \
                    --compress
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
          restartPolicy: OnFailure
Recovery
# List backups
rivven backup list --source s3://backups/rivven/
# Restore to new cluster
rivven backup restore \
  --source s3://backups/rivven/20260125 \
  --target /data
Troubleshooting
Check Cluster Health
# Cluster status
kubectl exec -n rivven rivven-0 -- rivven cluster status
# Node info
kubectl exec -n rivven rivven-0 -- rivven cluster nodes
# Raft status
kubectl exec -n rivven rivven-0 -- rivven cluster raft-status
View Logs
# All pods
kubectl logs -n rivven -l app.kubernetes.io/name=rivven --tail=100
# Specific pod
kubectl logs -n rivven rivven-0 -f
# Previous container
kubectl logs -n rivven rivven-0 --previous
Debug Pod
# Shell into pod
kubectl exec -n rivven -it rivven-0 -- /bin/sh
# Check disk usage
kubectl exec -n rivven rivven-0 -- df -h /data
# Check network connectivity
kubectl exec -n rivven rivven-0 -- \
  nc -zv rivven-1.rivven-headless.rivven.svc.cluster.local 9393
Common Issues
| Issue | Solution |
|---|---|
| Pods stuck in Pending | Check PVC binding, storage class |
| Split brain | Verify network policies, check quorum |
| High latency | Check resource limits, disk IOPS |
| OOM kills | Increase memory limits |
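When checking quorum for a suspected split brain, it helps to know how many nodes must be reachable. The sketch below assumes Rivven's Raft layer uses standard majority quorum, floor(n / 2) + 1, which is how Raft is normally defined but is not spelled out in this document; the function names are illustrative.

```shell
# Majority quorum and tolerated node failures for an n-node cluster.
quorum_size()        { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

quorum_size 3          # prints 2
tolerated_failures 3   # prints 1
quorum_size 5          # prints 3
tolerated_failures 5   # prints 2
```

Under this assumption an even replica count adds no fault tolerance over the next lower odd count (4 nodes still tolerate only 1 failure), which is why odd sizes such as 3 or 5 are the usual choice.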
Production Checklist
- 3+ replicas across availability zones
- Pod anti-affinity rules configured
- PersistentVolumes with adequate IOPS
- TLS enabled for all traffic
- Network policies applied
- Resource requests and limits set
- Liveness/readiness probes configured
- Monitoring and alerting enabled
- Backup schedule configured
- Disaster recovery tested
Next Steps
- Security — Kubernetes security hardening
- Architecture — Distributed design
- Getting Started — Basic operations