Kubernetes has become essential for running data workloads at scale. After deploying numerous data pipelines, streaming applications, and ML services on Kubernetes, I’ve learned that the orchestration platform offers tremendous benefits—but also requires understanding its patterns and pitfalls. In this guide, I’ll cover the concepts, patterns, and tooling you need to run data applications on Kubernetes reliably.
## Why Kubernetes for Data?
Consider the challenges of running data applications without orchestration:
| Challenge | Traditional Approach | Kubernetes Solution |
|---|---|---|
| Scaling | Manual server provisioning | Auto-scaling (HPA/VPA) |
| Fault tolerance | Custom scripts | Self-healing pods |
| Resource isolation | Virtual machines | Namespaces, quotas |
| Service discovery | Hardcoded endpoints | Kubernetes Services |
| Configuration | Environment-specific configs | ConfigMaps, Secrets |
| Rolling updates | Downtime during deploys | Zero-downtime updates |
### When Kubernetes Makes Sense

**Good fit:**

- Microservices architectures
- Variable workloads needing auto-scaling
- Multi-region deployments
- Complex service dependencies
- ML model serving

**Maybe overkill:**

- A single monolithic application
- Stable, predictable workloads
- A small team without DevOps expertise
- Simple batch jobs
## Kubernetes Architecture Overview

```
┌──────────────────────────────────────────────────────────┐
│                      Control Plane                       │
│  ┌────────────┐ ┌──────┐ ┌───────────┐ ┌──────────────┐  │
│  │ API Server │ │ etcd │ │ Scheduler │ │  Controller  │  │
│  │            │ │      │ │           │ │   Manager    │  │
│  └────────────┘ └──────┘ └───────────┘ └──────────────┘  │
└──────────────────────────────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │  Node 1  │      │  Node 2  │      │  Node 3  │
    │ kubelet  │      │ kubelet  │      │ kubelet  │
    │  Pod 1   │      │  Pod 1   │      │  Pod 1   │
    │  Pod 2   │      │  Pod 2   │      │  Pod 2   │
    └──────────┘      └──────────┘      └──────────┘
```
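The controller manager above embodies the core Kubernetes idea: a reconciliation loop that continuously compares desired state against observed state and acts on the difference. A toy sketch of the pattern (the names and dict shapes here are illustrative, not the real controller API):

```python
# Toy reconciliation loop illustrating the controller pattern.
# Real controllers watch the API server; here desired/observed are plain dicts.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    want, have = desired["replicas"], observed["replicas"]
    if have < want:
        actions.append(f"create {want - have} pod(s)")
    elif have > want:
        actions.append(f"delete {have - want} pod(s)")
    return actions

# One pass of the loop: 3 replicas desired, only 1 running
print(reconcile({"replicas": 3}, {"replicas": 1}))  # ['create 2 pod(s)']
```

Kubernetes runs loops like this for every resource type, which is why a killed pod simply reappears: the next reconciliation pass notices the gap and fixes it.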
## Core Kubernetes Concepts

### Pods: The Smallest Deployable Unit

A pod is one or more containers that share:

- A network namespace (same IP and port space)
- Storage volumes
- Pod-level labels and annotations
- A lifecycle (they are scheduled, started, and stopped together)
```yaml
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-processor
  labels:
    app: data-pipeline
    component: processor
    version: v1.0.0
spec:
  containers:
    - name: processor
      image: myregistry/data-processor:1.0.0
      ports:
        - containerPort: 8080
          name: http
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: LOG_LEVEL
          value: "INFO"
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: data-pvc
```
### Deployments: Declarative Updates

Deployments manage ReplicaSets and provide:

- Declarative rolling updates
- Rollback capability
- Horizontal scaling
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-pipeline
  namespace: data-engineering
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime deployment
  selector:
    matchLabels:
      app: etl-pipeline
  template:
    metadata:
      labels:
        app: etl-pipeline
        version: v2.1.0
    spec:
      containers:
        - name: etl
          image: myregistry/etl-pipeline:2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            exec:
              command:  # placeholder probe; replace with a real health check
                - python
                - -c
                - "import sys; sys.exit(0)"
            initialDelaySeconds: 30
            periodSeconds: 10
```
```bash
# Scale the deployment
kubectl scale deployment etl-pipeline --replicas=5

# Rolling update to a new image
kubectl set image deployment/etl-pipeline etl=myregistry/etl-pipeline:2.2.0

# Rollback if something goes wrong
kubectl rollout undo deployment/etl-pipeline

# Check rollout status
kubectl rollout status deployment/etl-pipeline
```
### Services: Stable Networking

Services provide stable endpoints for accessing pods:
```yaml
# ClusterIP - internal access only
apiVersion: v1
kind: Service
metadata:
  name: etl-service
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
# NodePort - reachable from outside the cluster on each node's IP
apiVersion: v1
kind: Service
metadata:
  name: etl-external
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080
  type: NodePort
---
# LoadBalancer - provisions a cloud provider load balancer
apiVersion: v1
kind: Service
metadata:
  name: etl-loadbalancer
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```
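Inside the cluster, a ClusterIP Service is reachable at a predictable DNS name, `<service>.<namespace>.svc.<cluster-domain>`, which is exactly what makes hardcoded endpoints unnecessary. A sketch of building those names (this only demonstrates the naming convention; no live DNS lookup is performed):

```python
def service_dns(service: str, namespace: str = "default",
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

# The etl-service defined above, as seen from any pod in the cluster
url = f"http://{service_dns('etl-service', 'data-engineering')}:80"
print(url)  # http://etl-service.data-engineering.svc.cluster.local:80
```

Pods in the same namespace can also use the short form `http://etl-service:80`; the resolver appends the namespace automatically.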
### StatefulSets: For Stateful Applications

StatefulSets manage stateful workloads such as databases and message brokers, giving each replica a stable network identity and its own persistent volume:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-cluster
spec:
  serviceName: kafka  # headless service providing per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd
```
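Each StatefulSet replica gets a stable ordinal name (`kafka-cluster-0` through `kafka-cluster-2`) and a per-pod DNS entry under the headless service, so clients can enumerate brokers deterministically. A sketch of deriving a Kafka bootstrap list from those naming conventions (the namespace and port values are assumptions for illustration):

```python
def bootstrap_servers(statefulset: str, service: str, namespace: str,
                      replicas: int, port: int = 9092) -> str:
    """Build a Kafka bootstrap string from StatefulSet pod DNS conventions."""
    hosts = [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    return ",".join(hosts)

print(bootstrap_servers("kafka-cluster", "kafka", "data-engineering", 3))
```

This stability is the whole point of StatefulSets: `kafka-cluster-0` keeps its name and its volume across restarts and rescheduling.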
### ConfigMaps and Secrets: Configuration Management
```yaml
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etl-config
data:
  database_host: "postgres.database.svc.cluster.local"
  database_port: "5432"
  log_level: "INFO"
  batch_size: "1000"
  # File-style configuration (can be mounted as a file)
  spark.conf: |
    spark.executor.memory=4g
    spark.driver.memory=2g
    spark.sql.shuffle.partitions=200
---
# secret.yaml
# Note: Secrets are only base64-encoded, not encrypted, unless etcd
# encryption at rest is enabled; keep real values out of version control.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  username: admin
  password: SuperSecretPassword123!
  url: "postgresql://admin:SuperSecretPassword123!@postgres:5432/analytics"
```
```yaml
# Using the values in a pod spec
env:
  - name: DB_HOST
    valueFrom:
      configMapKeyRef:
        name: etl-config
        key: database_host
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
```
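On the application side, these injected values arrive as ordinary environment variables; a minimal sketch of reading them with typed defaults (the variable names follow the manifests above, and the defaults are illustrative):

```python
import os

def load_config() -> dict:
    """Read pipeline settings injected via ConfigMap/Secret env vars."""
    return {
        "db_host": os.environ.get("DB_HOST", "localhost"),
        "db_port": int(os.environ.get("DB_PORT", "5432")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
        "batch_size": int(os.environ.get("BATCH_SIZE", "1000")),
    }

config = load_config()
print(config["log_level"])
```

Keeping this parsing in one function makes the container's configuration surface explicit and easy to test locally with `export DB_HOST=...`.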
## Running Data Workloads on Kubernetes

### Batch Processing with Jobs
```yaml
# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: daily-etl
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  activeDeadlineSeconds: 3600  # 1-hour timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: etl
          image: myregistry/etl-job:1.0.0
          env:
            - name: EXECUTION_DATE
              value: "2026-03-04"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
---
# CronJob for scheduled execution
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl-scheduler
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid  # Don't run overlapping jobs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl
              image: myregistry/etl-job:1.0.0
```
### Spark on Kubernetes
```yaml
# spark-operator deployment (normally installed via its Helm chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spark-operator  # must match the selector
    spec:
      serviceAccountName: spark-operator
      containers:
        - name: spark-operator
          image: gcr.io/spark-operator/spark-operator:v1beta2-1.3.6-3.1.1
          args:
            - -logtostderr
            - -enable-ui-service
            - -ui-service-port=8080
          ports:
            - containerPort: 8080
```
---
# SparkApplication CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: data-pipeline
namespace: data-engineering
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: myregistry/spark-pipeline:1.0.0
mainApplicationFile: local:///app/main.py
sparkVersion: "3.4.1"
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 3
onSubmissionFailureRetryInterval: 10
driver:
cores: 2
coreLimit: "2500m"
memory: "4g"
labels:
version: 1.0.0
serviceAccount: spark
executor:
cores: 2
instances: 5
memory: "8g"
labels:
version: 1.0.0
dynamicAllocation:
enabled: true
initialExecutors: 2
minExecutors: 1
maxExecutors: 20
### Airflow on Kubernetes (KubernetesExecutor)

```yaml
# airflow-values.yaml (official Airflow Helm chart values)
executor: KubernetesExecutor

# Each task runs in its own pod built from this template
config:
  kubernetes_executor:
    pod_template_file: "/opt/airflow/pod_templates/pod_template.yaml"

workers:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

postgresql:
  enabled: true
```
```python
# In your DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

run_etl = KubernetesPodOperator(
    task_id='run_etl',
    namespace='data-engineering',
    image='myregistry/etl-pipeline:1.0.0',
    name='etl-task',
    env_vars={'EXEC_DATE': '{{ ds }}'},  # templated logical date
    container_resources=k8s.V1ResourceRequirements(
        limits={'cpu': '1000m', 'memory': '2Gi'},
    ),
    get_logs=True,
    is_delete_operator_pod=True,
)
```
## Helm: Kubernetes Package Management

### Creating a Helm Chart

```bash
# Create the chart skeleton
helm create data-pipeline-chart
```

The resulting chart structure:

```
data-pipeline-chart/
├── Chart.yaml
├── values.yaml
├── values-production.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── _helpers.tpl
    └── NOTES.txt
```
### Chart.yaml

```yaml
apiVersion: v2
name: data-pipeline
description: A Helm chart for data pipeline deployment
type: application
version: 1.0.0
appVersion: "1.0.0"
keywords:
  - data-engineering
  - etl
  - pipeline
maintainers:
  - name: Furkanul Islam
    email: furkan@example.com
```
### values.yaml

```yaml
# Default values
replicaCount: 2

image:
  repository: myregistry/data-pipeline
  tag: "1.0.0"
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

config:
  logLevel: INFO
  batchSize: 1000

secrets:
  databaseUrl: ""  # Set via --set or a secrets file

service:
  type: ClusterIP
  port: 8080
```
### Deployment Template

```yaml
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "data-pipeline.fullname" . }}
  labels:
    {{- include "data-pipeline.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "data-pipeline.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "data-pipeline.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          env:
            - name: LOG_LEVEL
              value: {{ .Values.config.logLevel | quote }}
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "data-pipeline.fullname" . }}-secrets
                  key: database-url
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
{{- if .Values.autoscaling.enabled }}
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "data-pipeline.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "data-pipeline.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
```
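Under the hood, the HPA scales with a simple proportional rule: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max bounds. A sketch of that arithmetic (this mirrors the documented formula; the real controller also applies tolerances and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Proportional scaling rule used by the HorizontalPodAutoscaler."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 120% CPU against an 80% target: scale up
print(desired_replicas(3, 120, 80))  # 5
```

This is why the `requests` values in `values.yaml` matter so much: CPU utilization is measured relative to the request, so an inaccurate request skews every scaling decision.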
### Deploy with Helm

```bash
# Install
helm install data-pipeline ./data-pipeline-chart \
  --namespace data-engineering \
  --create-namespace \
  --set secrets.databaseUrl="postgresql://user:pass@host:5432/db"

# Use production values
helm install data-pipeline ./data-pipeline-chart \
  -f values-production.yaml \
  --namespace production

# Upgrade
helm upgrade data-pipeline ./data-pipeline-chart \
  --set image.tag="2.0.0"

# Roll back to revision 1
helm rollback data-pipeline 1

# View release history
helm history data-pipeline
```
## Monitoring and Observability

### Prometheus + Grafana Stack

```yaml
# prometheus-values.yaml (kube-prometheus-stack Helm chart)
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    serviceMonitorSelectorNilUsesHelmValues: false

grafana:
  enabled: true
  adminPassword: admin  # change this anywhere beyond a demo
  datasources:
    datasources.yaml:
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus:9090
          isDefault: true
```
### Custom Metrics with Prometheus

```python
# Instrumenting your application
import time

from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
PREDICTION_COUNTER = Counter(
    'predictions_total',
    'Total predictions made',
    ['model_version', 'status'],
)

PREDICTION_LATENCY = Histogram(
    'prediction_latency_seconds',
    'Prediction latency',
    ['model_version'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)


def predict(data):
    """Wrap model inference with success/error counts and latency."""
    start_time = time.time()
    try:
        result = model.predict(data)  # `model` is loaded elsewhere
        PREDICTION_COUNTER.labels(model_version='v1', status='success').inc()
        PREDICTION_LATENCY.labels(model_version='v1').observe(time.time() - start_time)
        return result
    except Exception:
        PREDICTION_COUNTER.labels(model_version='v1', status='error').inc()
        raise
```
ServiceMonitor for Auto-Discovery
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: data-pipeline-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: data-pipeline
endpoints:
- port: metrics
interval: 15s
path: /metrics
## Production Best Practices

### 1. Resource Requests and Limits

```yaml
resources:
  requests:
    memory: "2Gi"  # Guaranteed memory (used for scheduling)
    cpu: "1000m"   # Guaranteed CPU share
  limits:
    memory: "4Gi"  # Max memory before the container is OOMKilled
    cpu: "2000m"   # Max CPU (throttled if exceeded)
```
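The requests/limits combination also determines the pod's QoS class, which controls eviction order under node memory pressure: Guaranteed (requests equal limits for cpu and memory), Burstable (some requests or limits set), or BestEffort (none set). A sketch of that classification for a single-container pod (a simplification of the real multi-container rules):

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate the QoS class Kubernetes assigns to a single-container pod."""
    if not requests and not limits:
        return "BestEffort"
    effective_requests = requests or dict(limits)  # requests default to limits
    if set(limits) >= {"cpu", "memory"} and effective_requests == limits:
        return "Guaranteed"
    return "Burstable"

# The snippet above sets limits higher than requests, so it is Burstable
print(qos_class({"cpu": "1000m", "memory": "2Gi"},
                {"cpu": "2000m", "memory": "4Gi"}))  # Burstable
```

For latency-sensitive data services, setting requests equal to limits (Guaranteed) buys predictable performance at the cost of bin-packing efficiency.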
2. Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: etl-pdb
spec:
minAvailable: 2 # Keep at least 2 pods running
selector:
matchLabels:
app: etl-pipeline
3. Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: restrict-data-access
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: etl-pipeline
ports:
- protocol: TCP
port: 5432
### 4. Use Managed Kubernetes

For production, consider a managed control plane:

- **EKS (AWS)**: mature, integrates well with AWS services
- **GKE (GCP)**: polished managed experience, automatic upgrades
- **AKS (Azure)**: good Azure integration, free control plane
5. Implement Proper Logging
import logging
import sys
# Configure structured logging
logging.basicConfig(
level=logging.INFO,
format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}',
stream=sys.stdout
)
logger = logging.getLogger(__name__)
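One caveat with the format string above: a message containing quotes or newlines produces invalid JSON. A stdlib-only formatter that serializes each record properly (a sketch, not a drop-in library):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Serialize each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # json.dumps escapes quotes/newlines
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("etl").info('message with "quotes" stays valid JSON')
```

One JSON object per line is exactly what log pipelines (Fluent Bit, Loki, CloudWatch) expect, so downstream parsing never breaks on odd messages.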
## Key Takeaways

Kubernetes for data engineering means:

- **Declarative infrastructure**: define the desired state and let Kubernetes reconcile toward it
- **Auto-scaling**: handle variable workloads efficiently
- **Self-healing**: automatic restarts and rescheduling
- **Resource efficiency**: better utilization than dedicated VMs
- **Ecosystem**: rich tooling (Helm, Operators, service meshes)
The complexity is worth it for production data systems at scale.
Questions about Kubernetes for data? Reach out through the contact page or connect on LinkedIn.