Kubernetes Deployment

Deploy Rivven on Kubernetes for production, using the Helm chart, the Rivven Operator, or plain manifests.

Overview

Rivven is designed for cloud-native deployment with:

StatefulSet for ordered, stable pod identities
Headless Service for peer discovery
PersistentVolumeClaims for data durability
Horizontal Pod Autoscaler for dynamic scaling
Prometheus ServiceMonitor for observability

Quick Start

Helm Installation

# Add Rivven Helm repository
helm repo add rivven https://rivven.hupe1980.github.io/rivven/helm
helm repo update

# Install Rivven cluster
helm install rivven rivven/rivven \
  --namespace rivven \
  --create-namespace \
  --set cluster.replicas=3 \
  --set storage.size=100Gi

Verify Deployment

# Watch pods until all replicas are Running (press Ctrl+C to stop watching)
kubectl get pods -n rivven -w

# Expected output:
# NAME       READY   STATUS    RESTARTS   AGE
# rivven-0   1/1     Running   0          2m
# rivven-1   1/1     Running   0          1m
# rivven-2   1/1     Running   0          30s

# Check cluster status
kubectl exec -n rivven rivven-0 -- rivven cluster status
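
For scripted installs, a blocking check can stand in for watching the pods by hand. The following is a minimal sketch, assuming the chart creates a StatefulSet named rivven (inferred from the pod names above, not confirmed by the chart). The function name wait_for_rivven and the 300-second timeout are illustrative choices.

```shell
# Block until every replica of the StatefulSet is rolled out and Ready,
# or fail once the timeout expires.
# Arguments: namespace, StatefulSet name (both default to "rivven").
wait_for_rivven() {
  namespace="${1:-rivven}"
  name="${2:-rivven}"   # assumed StatefulSet name, inferred from the pod names
  kubectl rollout status "statefulset/${name}" \
    --namespace "${namespace}" \
    --timeout=300s
}
```

A typical call would be `wait_for_rivven rivven rivven && kubectl exec -n rivven rivven-0 -- rivven cluster status`, so the status check only runs once the rollout has finished.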

Helm Chart Values

Basic Configuration

# values.yaml
cluster:
  replicas: 3
  
image:
  repository: ghcr.io/hupe1980/rivven
  tag: latest  # Pin to a specific release tag in production
  pullPolicy: IfNotPresent

storage:
  size: 100Gi
  storageClass: ""  # Use default

resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 8Gi

service:
  type: ClusterIP
  port: 9292

High Availability

# ha-values.yaml
cluster:
  replicas: 5

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: rivven
        topologyKey: kubernetes.io/hostname

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: rivven

storage:
  storageClass: premium-ssd
  size: 500Gi

TLS Configuration

# tls-values.yaml
tls:
  enabled: true
  certManager:
    enabled: true
    issuerRef:
      name: letsencrypt-prod
      kind: ClusterIssuer
  
  # Or use existing secret
  # existingSecret: rivven-tls

mtls:
  enabled: true
  clientCA:
    secretName: rivven-client-ca

Authentication

auth:
  enabled: true
  mechanism: mtls  # mtls, scram, token
  
  # For SCRAM
  scram:
    existingSecret: rivven-users
  
  rbac:
    enabled: true
    configMapName: rivven-rbac

Rivven Operator (CRDs)

The Rivven Operator provides declarative management via Custom Resource Definitions.

Leader Election

When deployed with multiple replicas, the operator uses Kubernetes Lease-based leader election to ensure only one instance actively reconciles resources. Enable it with --leader-election:

args:
  - --leader-election

Lease duration: 15 seconds
Renewal interval: 10 seconds
Lease resource: rivven-operator-leader in the operator’s namespace (or kube-system)
Non-leader replicas wait passively until they acquire the lease
When running a single replica, omit --leader-election (a warning is logged)
Lease-loss detection: the operator tracks consecutive lease renewal failures. Once the count exceeds the threshold, ceil(lease_duration / renew_interval), the process exits so that Kubernetes restarts it with a clean state. This prevents split-brain scenarios in which a stale leader keeps reconciling.
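
As a worked example of the lease-loss threshold, the arithmetic can be checked in the shell. This sketch only evaluates the formula with the documented values. The helper name ceil_div is hypothetical and is not part of the operator.

```shell
# Integer ceiling division: ceil(a / b) for positive integers.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

lease_duration=15   # seconds, from the list above
renew_interval=10   # seconds, from the list above

threshold=$(ceil_div "$lease_duration" "$renew_interval")
echo "renewal-failure threshold: ${threshold}"
```

With a 15-second lease and a 10-second renewal interval the threshold is ceil(15 / 10) = 2.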

Install the Operator

# Install CRDs
kubectl apply -f https://github.com/hupe1980/rivven/releases/latest/download/rivven-crds.yaml

# Deploy operator
kubectl apply -f https://github.com/hupe1980/rivven/releases/latest/download/rivven-operator.yaml

RivvenCluster CRD

apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenCluster
metadata:
  name: production
spec:
  replicas: 3
  version: "0.0.22"
  storage:
    size: 100Gi
    storageClassName: fast-ssd
  config:
    defaultPartitions: 3
    defaultReplicationFactor: 2
  metrics:
    enabled: true

RivvenConnect CRD

Manage CDC pipelines declaratively with generic connector configs (Kafka Connect style):

apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenConnect
metadata:
  name: cdc-pipeline
spec:
  clusterRef:
    name: production
  replicas: 2
  
  sources:
    - name: postgres-cdc
      connector: postgres-cdc
      topic: cdc.events
      configSecretRef: postgres-credentials
      # All connector config in generic 'config' field (Kafka Connect style)
      config:
        slotName: rivven_slot
        publication: rivven_pub
        snapshotMode: initial
        decodingPlugin: pgoutput
        # Tables are inside CDC config
        tables:
          - schema: public
            table: orders
            columns: [id, customer_id, total]
          - schema: public
            table: customers
            columnMasks:
              email: "***@***.***"
  
  sinks:
    - name: s3-archive
      connector: s3
      topics: ["cdc.*"]
      consumerGroup: s3-archiver
      configSecretRef: s3-credentials
      # All connector config in generic 'config' field
      config:
        bucket: data-lake
        format: parquet
        compression: zstd

Advanced CDC Configuration

For production CDC deployments, configure advanced features using the generic config field:

apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenConnect
metadata:
  name: production-cdc
spec:
  clusterRef:
    name: production
  replicas: 3
  
  sources:
    - name: postgres-cdc
      connector: postgres-cdc
      topic: cdc.events
      configSecretRef: postgres-credentials
      # All connector config in generic 'config' field (Kafka Connect style)
      config:
        slotName: rivven_slot
        publication: rivven_pub
        snapshotMode: initial
        tables:
          - schema: public
            table: orders
        
        # Snapshot configuration
        snapshot:
          batchSize: 20000        # Rows per batch
          parallelTables: 8       # Parallel table snapshots
          queryTimeoutSecs: 600   # Query timeout
          maxRetries: 5           # Retry failed batches
        
        # Incremental snapshots (non-blocking)
        incrementalSnapshot:
          enabled: true
          chunkSize: 2048
          watermarkStrategy: update_and_insert
          maxConcurrentChunks: 4
        
        # Heartbeat monitoring
        heartbeat:
          enabled: true
          intervalSecs: 5
          maxLagSecs: 60
          emitEvents: true
        
        # Deduplication (bloom filter + LRU)
        deduplication:
          enabled: true
          bloomExpectedInsertions: 500000
          bloomFpp: 0.001
          lruSize: 50000
          windowSecs: 7200
        
        # Transaction metadata
        transactionTopic:
          enabled: true
          topicName: cdc.transactions
          includeDataCollections: true
        
        # Schema change capture
        schemaChangeTopic:
          enabled: true
          topicName: cdc.schema-changes
          includeColumns: true
        
        # Parallel processing
        parallel:
          enabled: true
          concurrency: 8
          perTableBuffer: 2000
          outputBuffer: 20000
          workStealing: true
        
        # Health monitoring
        health:
          enabled: true
          checkIntervalSecs: 5
          maxLagMs: 10000
          failureThreshold: 3
          autoRecovery: true
        
        # Event routing
        router:
          enabled: true
          defaultDestination: cdc.events.default
          deadLetterQueue: cdc.dlq
          rules:
            - conditionType: table
              conditionValue: orders
              destination: cdc.orders
              priority: 100
            - conditionType: schema
              conditionValue: audit
              destination: cdc.audit
              priority: 50
        
        # Custom partitioning
        partitioner:
          enabled: true
          numPartitions: 32
          strategy: key_hash
          keyColumns: [id]
        
        # SMT (Single Message Transforms)
        transforms:
          - type: extract_new_record_state
            config:
              add_fields: table,op
          - type: mask_field
            config:
              fields: ssn,credit_card
              replacement: "***"
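
To get a feel for the deduplication settings above, the memory cost of a Bloom filter can be estimated from bloomExpectedInsertions and bloomFpp. The sketch below uses the textbook sizing formulas. Whether Rivven sizes its filter this way is an assumption, and the function name bloom_size is hypothetical.

```shell
# Textbook Bloom filter sizing (an assumption about Rivven's internals):
#   bits   m = ceil(-n * ln(p) / (ln 2)^2)
#   hashes k = round((m / n) * ln 2)
# Arguments: expected insertions n, false-positive probability p.
bloom_size() {
  awk -v n="$1" -v p="$2" 'BEGIN {
    ln2 = log(2)
    m = -n * log(p) / (ln2 * ln2)
    bits = int(m); if (m > bits) bits++       # round up to a whole bit
    k = int((bits / n) * ln2 + 0.5)           # round to nearest integer
    printf "%d %d\n", bits, k
  }'
}

# Prints "<bits> <hash functions>" for the values used above.
bloom_size 500000 0.001
```

For 500000 expected insertions at a 0.001 false-positive rate this comes to roughly 7.2 million bits (under 1 MiB) and 10 hash functions, so the filter itself is cheap compared with the LRU cache.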

Advanced CDC Feature Reference

Feature	Description
Snapshot	Batch size, parallelism, timeouts, retries for initial load
Incremental Snapshot	Non-blocking re-snapshots without stopping CDC stream
Signal Table	Ad-hoc snapshot control via database table or Rivven topic
Heartbeat	Connection health monitoring with lag detection
Deduplication	Bloom filter + LRU cache for duplicate prevention
Transaction Topic	Transaction boundary metadata
Schema Change Topic	DDL change capture
Tombstone	Delete event handling configuration
Field Encryption	Field-level encryption with AES-256-GCM
Read-Only Replica	PostgreSQL replica support with lag tracking
Event Router	Dynamic routing with rules, DLQ, priorities
Partitioner	Custom partitioning strategies
SMT Transforms	17 transform types (mask, filter, flatten, etc.)
Parallel	Multi-threaded processing with work stealing
Outbox	Transactional outbox pattern support
Health Monitor	Auto-recovery with exponential backoff

Custom Connectors

For custom connectors, use the generic config field:

sources:
  - name: my-custom-source
    connector: my-plugin
    topic: custom.events
    config:  # Generic JSON for custom connectors
      customField: value
      nested:
        option: true

RivvenTopic CRD

Manage topics declaratively for GitOps workflows:

apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenTopic
metadata:
  name: orders-events
  namespace: default
spec:
  clusterRef:
    name: production
  
  partitions: 12
  replicationFactor: 3
  
  config:
    retentionMs: 604800000      # 7 days
    cleanupPolicy: delete
    compressionType: lz4
    minInsyncReplicas: 2
  
  acls:
    - principal: "user:order-service"
      operations: ["Read", "Write"]
    - principal: "user:analytics"
      operations: ["Read"]
  
  # Keep topic when CRD is deleted
  deleteOnRemove: false
  
  topicLabels:
    team: orders
    environment: production

Topic Configuration Options

Field	Description	Default
`retentionMs`	Retention time in milliseconds	604800000 (7 days)
`retentionBytes`	Retention size per partition (-1 = unlimited)	-1
`segmentBytes`	Segment file size	1073741824 (1 GiB)
`cleanupPolicy`	`delete`, `compact`, or `delete,compact`	`delete`
`compressionType`	`none`, `gzip`, `snappy`, `lz4`, `zstd`	`lz4`
`minInsyncReplicas`	Minimum ISR for writes	1
`maxMessageBytes`	Maximum message size	1048576 (1 MiB)
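
Because retentionMs is expressed in milliseconds, it helps to derive the value rather than type it by hand. A small sketch follows. The helper name days_to_ms is hypothetical.

```shell
# Convert a retention period in days to milliseconds for retentionMs.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7    # 604800000, the documented default
days_to_ms 30   # 2592000000
```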

Check Topic Status

# List all topics
kubectl get rivventopics

# NAME            CLUSTER     PARTITIONS   REPLICATION   PHASE   AGE
# orders-events   production  12           3             Ready   5m

# Detailed status
kubectl describe rivventopic orders-events

RivvenSchemaRegistry CRD

Deploy and manage a high-performance Schema Registry declaratively:

apiVersion: rivven.hupe1980.github.io/v1alpha1
kind: RivvenSchemaRegistry
metadata:
  name: schema-registry
spec:
  clusterRef:
    name: production
  
  replicas: 2
  version: "0.0.22"
  
  # Server configuration
  server:
    port: 8081
    requestTimeoutMs: 30000
    corsEnabled: true
  
  # Schema storage in broker
  storage:
    mode: broker
    topic: _schemas
    replicationFactor: 3
  
  # Compatibility settings
  compatibility:
    defaultLevel: BACKWARD
    perSubject:
      "order-events-value": FULL
      "user-profile-value": FORWARD
  
  # Supported formats
  formats:
    avro: true
    jsonSchema: true
    protobuf: true
  
  # JWT authentication
  auth:
    enabled: true
    method: jwt
    jwt:
      issuerUrl: "https://auth.example.com"
      audience: "schema-registry"
  
  # TLS
  tls:
    enabled: true
    certSecretName: schema-registry-tls
  
  # Prometheus metrics
  metrics:
    enabled: true
    serviceMonitorEnabled: true
  
  resources:
    requests:
      cpu: "500m"
      memory: 1Gi

Schema Compatibility Levels

Level	Description
`NONE`	No compatibility checking
`BACKWARD`	New schema can read data written with old schema
`BACKWARD_TRANSITIVE`	New schema can read data from all previous versions
`FORWARD`	Old schema can read data written with new schema
`FORWARD_TRANSITIVE`	All previous schemas can read new data
`FULL`	Both backward and forward compatible
`FULL_TRANSITIVE`	Full compatibility with all versions

Authentication Options

Basic Auth:

auth:
  enabled: true
  method: basic
  users:
    - username: admin
      passwordSecretKey: admin-password
      role: admin
    - username: reader
      passwordSecretKey: reader-password
      role: reader

JWT/OIDC Auth:

auth:
  enabled: true
  method: jwt
  jwt:
    issuerUrl: "https://auth.example.com"
    jwksUrl: "https://auth.example.com/.well-known/jwks.json"
    audience: "schema-registry"

Cedar Policy Auth:

auth:
  enabled: true
  method: cedar
  cedar:
    policySecretRef: cedar-policies

External Registry Sync

Sync schemas with external schema registries or AWS Glue:

externalRegistry:
  enabled: true
  registryType: compatible
  registryUrl: "https://external-sr.example.com"
  syncMode: mirror  # mirror, push, bidirectional
  syncSubjects:
    - "orders-*"
  syncIntervalSeconds: 300
  credentialsSecretRef: external-creds

Check Schema Registry Status

# List schema registries
kubectl get rivvenschemaregistries

# NAME              CLUSTER     REPLICAS   PHASE     AGE
# schema-registry   production  2          Running   5m

# Detailed status
kubectl describe rivvenschemaregistry schema-registry

# Test the API (port-forward blocks, so run it in a separate terminal before calling curl)
kubectl port-forward svc/rivven-schema-schema-registry 8081:8081
curl http://localhost:8081/subjects

Manual Deployment

Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: rivven
  labels:
    app.kubernetes.io/name: rivven

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: rivven-config
  namespace: rivven
data:
  rivven.yaml: |
    node:
      data_dir: /data
    
    cluster:
      bootstrap_expect: 3
      transport: quic
    
    storage:
      segment_size: 1073741824  # 1 GiB
      retention_bytes: 107374182400  # 100 GiB
    
    observability:
      metrics:
        enabled: true
        port: 9090
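
The byte values in the storage section above are binary multiples (GiB). The sketch below shows how they are derived. The helper name gib_to_bytes is hypothetical.

```shell
# Convert GiB to bytes for settings such as segment_size and retention_bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 1     # 1073741824   (segment_size)
gib_to_bytes 100   # 107374182400 (retention_bytes)
```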

StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rivven
  namespace: rivven
spec:
  serviceName: rivven-headless
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  template:
    metadata:
      labels:
        app.kubernetes.io/name: rivven
    spec:
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: rivven
          image: ghcr.io/hupe1980/rivven:latest
          ports:
            - name: client
              containerPort: 9292
            - name: cluster
              containerPort: 9393
            - name: metrics
              containerPort: 9090
          env:
            - name: RIVVEN_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: RIVVEN_ADVERTISE_ADDR
              value: "$(RIVVEN_NODE_ID).rivven-headless.rivven.svc.cluster.local:9393"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /etc/rivven
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 8Gi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 9090
            initialDelaySeconds: 10
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /health/startup
              port: 9090
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: config
          configMap:
            name: rivven-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
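
Each pod's RIVVEN_ADVERTISE_ADDR in the StatefulSet above follows the stable DNS pattern that Kubernetes assigns to StatefulSet pods behind a headless Service. The sketch below prints the addresses for a given replica count, assuming the default cluster domain cluster.local as used in the manifest. The function name peer_addresses is hypothetical.

```shell
# Print the stable per-pod address for each replica.
# Pattern: <statefulset>-<ordinal>.<headless-service>.<namespace>.svc.cluster.local:<port>
peer_addresses() {
  name="$1"; service="$2"; namespace="$3"; replicas="$4"; port="$5"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${name}-${i}.${service}.${namespace}.svc.cluster.local:${port}"
    i=$(( i + 1 ))
  done
}

peer_addresses rivven rivven-headless rivven 3 9393
```

These are the same names used for the connectivity check in the troubleshooting section, so the output can be fed to a loop that probes every peer.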

Services

# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: rivven-headless
  namespace: rivven
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: rivven
  ports:
    - name: cluster
      port: 9393
---
# Client service
apiVersion: v1
kind: Service
metadata:
  name: rivven
  namespace: rivven
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: rivven
  ports:
    - name: client
      port: 9292
      targetPort: 9292

Ingress (External Access)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rivven
  namespace: rivven
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - rivven.example.com
      secretName: rivven-tls
  rules:
    - host: rivven.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rivven
                port:
                  number: 9292

Scaling

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rivven
  namespace: rivven
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: rivven
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
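
To reason about how the autoscaler above will react, the core Kubernetes HPA calculation can be reproduced by hand: desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. The sketch below is a simplification that ignores the controller's tolerance band and the scaleDown behavior policies. The function name hpa_desired_replicas is hypothetical.

```shell
# Simplified HPA calculation for a single utilization metric.
# Arguments: current replicas, current utilization (%), target utilization (%),
#            minReplicas, maxReplicas.
hpa_desired_replicas() {
  current="$1"; utilization="$2"; target="$3"; min="$4"; max="$5"
  # Integer ceiling of current * utilization / target.
  desired=$(( (current * utilization + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

hpa_desired_replicas 3 90 70 3 10    # 4  (CPU at 90% against a 70% target)
hpa_desired_replicas 5 30 70 3 10    # 3  (clamped to minReplicas)
hpa_desired_replicas 8 100 70 3 10   # 10 (clamped to maxReplicas)
```

When several metrics are configured, as with CPU and memory here, the controller evaluates each one and uses the largest resulting replica count.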

Manual Scaling

# Scale up
kubectl scale statefulset rivven -n rivven --replicas=5

# Scale down (graceful)
kubectl scale statefulset rivven -n rivven --replicas=3

Storage Expansion

# Expand a PVC (requires allowVolumeExpansion: true on the StorageClass); repeat for each replica's PVC
kubectl patch pvc data-rivven-0 -n rivven \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

PVC Cleanup on Cluster Deletion

When a RivvenCluster custom resource is deleted, the operator automatically cleans up associated PersistentVolumeClaims. This prevents storage leaks from PVCs with Retain reclaim policy that would otherwise persist after cluster teardown.

The operator identifies PVCs by the label app.kubernetes.io/instance={cluster-name} and deletes them as part of the finalizer cleanup process. No manual intervention is required.

# Verify PVCs are cleaned up after cluster deletion
kubectl delete rivvencluster my-cluster -n rivven
kubectl get pvc -n rivven -l app.kubernetes.io/instance=my-cluster  # Should return empty

Reconciliation Error Backoff

The operator uses exponential backoff for reconciliation errors to avoid overwhelming the API server:

Initial delay: 30 seconds
Maximum delay: 600 seconds (10 minutes)
Backoff multiplier: 2×
Reset: Error count resets on successful reconciliation
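
The resulting retry schedule can be sketched from the parameters above. This is an illustration of the arithmetic only. It assumes the first retry uses the initial delay, which the list does not state explicitly, and the function name backoff_delay is hypothetical.

```shell
# Delay in seconds before the retry that follows the Nth consecutive error:
# 30s doubled per additional error, capped at 600s.
backoff_delay() {
  attempt="$1"; delay=30; max=600
  n=1
  while [ "$n" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
  if [ "$delay" -gt "$max" ]; then delay="$max"; fi
  echo "$delay"
}

for attempt in 1 2 3 4 5 6 7; do
  echo "error ${attempt}: retry in $(backoff_delay "$attempt")s"
done
```

Under these assumptions the delays are 30, 60, 120, 240, 480, 600, 600 seconds, so the cap is reached on the sixth consecutive error.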

Monitoring

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rivven
  namespace: rivven
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rivven
  namespace: rivven
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rivven
  podMetricsEndpoints:
    - port: metrics
      interval: 15s

Grafana Dashboard

Import the Rivven dashboard:

kubectl apply -f https://raw.githubusercontent.com/hupe1980/rivven/main/deploy/grafana-dashboard.yaml

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rivven-alerts
  namespace: rivven
spec:
  groups:
    - name: rivven
      rules:
        - alert: RivvenNodeDown
          expr: up{job="rivven"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Rivven node {{ $labels.instance }} is down"
        
        - alert: RivvenHighLag
          expr: rivven_consumer_lag_records > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag (more than 10000 records for 5 minutes)"
        
        - alert: RivvenDiskUsageHigh
          expr: (rivven_storage_bytes_used / rivven_storage_bytes_total) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Disk usage > 85% on {{ $labels.instance }}"

Backup & Recovery

Scheduled Backups

apiVersion: batch/v1
kind: CronJob
metadata:
  name: rivven-backup
  namespace: rivven
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: ghcr.io/hupe1980/rivven:latest
              command:
                - /bin/sh
                - -c
                - |
                  rivven backup create \
                    --output s3://backups/rivven/$(date +%Y%m%d) \
                    --compress
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
          restartPolicy: OnFailure

Recovery

# List backups
rivven backup list --source s3://backups/rivven/

# Restore to new cluster
rivven backup restore \
  --source s3://backups/rivven/20260125 \
  --target /data

Troubleshooting

Check Cluster Health

# Cluster status
kubectl exec -n rivven rivven-0 -- rivven cluster status

# Node info
kubectl exec -n rivven rivven-0 -- rivven cluster nodes

# Raft status
kubectl exec -n rivven rivven-0 -- rivven cluster raft-status

View Logs

# All pods
kubectl logs -n rivven -l app.kubernetes.io/name=rivven --tail=100

# Specific pod
kubectl logs -n rivven rivven-0 -f

# Previous container
kubectl logs -n rivven rivven-0 --previous

Debug Pod

# Shell into pod
kubectl exec -n rivven -it rivven-0 -- /bin/sh

# Check disk usage
kubectl exec -n rivven rivven-0 -- df -h /data

# Check network connectivity
kubectl exec -n rivven rivven-0 -- \
  nc -zv rivven-1.rivven-headless.rivven.svc.cluster.local 9393

Common Issues

Issue	Solution
Pods stuck in Pending	Check PVC binding, storage class
Split brain	Verify network policies, check quorum
High latency	Check resource limits, disk IOPS
OOM kills	Increase memory limits
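
When checking quorum for a suspected split brain, it helps to know how many nodes must be reachable. The sketch below assumes Rivven's Raft layer uses a standard majority quorum, which this page implies but does not state. The function names are hypothetical.

```shell
# Majority quorum for a cluster of N voting nodes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while a majority remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for nodes in 3 4 5; do
  echo "${nodes} nodes: quorum $(quorum "$nodes"), tolerates $(tolerated_failures "$nodes") failure(s)"
done
```

A 4-node cluster tolerates no more failures than a 3-node cluster, which is why the examples on this page use odd replica counts (3 and 5).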

Production Checklist

Next Steps

Security — Kubernetes security hardening
Architecture — Distributed design
Getting Started — Basic operations