Kubernetes has become essential for running data workloads at scale. After deploying numerous data pipelines, streaming applications, and ML services on Kubernetes, I’ve learned that the orchestration platform offers tremendous benefits—but also requires understanding its patterns and pitfalls. In this guide, I’ll cover the concepts, patterns, and tooling you need to run data applications on Kubernetes reliably.
## Why Kubernetes for Data?
Consider the challenges of running data applications without orchestration:
| Challenge | Traditional Approach | Kubernetes Solution |
|---|---|---|
| Scaling | Manual server provisioning | Auto-scaling (HPA/VPA) |
| Fault tolerance | Custom scripts | Self-healing pods |
| Resource isolation | Virtual machines | Namespaces, quotas |
| Service discovery | Hardcoded endpoints | Kubernetes Services |
| Configuration | Environment-specific configs | ConfigMaps, Secrets |
| Rolling updates | Downtime during deploys | Zero-downtime updates |
### When Kubernetes Makes Sense

**Good fit:**

- Microservices architectures
- Variable workloads needing auto-scaling
- Multi-region deployments
- Complex service dependencies
- ML model serving

**Maybe overkill:**

- A single monolithic application
- Stable, predictable workloads
- A small team without DevOps expertise
- Simple batch jobs
## Kubernetes Architecture Overview

```
┌──────────────────────────────────────────────────────────┐
│                      Control Plane                       │
│  ┌────────────┐ ┌──────┐ ┌───────────┐ ┌──────────────┐  │
│  │ API Server │ │ etcd │ │ Scheduler │ │  Controller  │  │
│  │            │ │      │ │           │ │   Manager    │  │
│  └────────────┘ └──────┘ └───────────┘ └──────────────┘  │
└──────────────────────────────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │  Node 1  │      │  Node 2  │      │  Node 3  │
    │ kubelet  │      │ kubelet  │      │ kubelet  │
    │  Pod 1   │      │  Pod 1   │      │  Pod 1   │
    │  Pod 2   │      │  Pod 2   │      │  Pod 2   │
    └──────────┘      └──────────┘      └──────────┘
```
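The controller manager above embodies the core Kubernetes idea: a reconciliation loop that continuously compares desired state against observed state and acts on the difference. A toy sketch of the pattern (the names and dict shapes here are illustrative, not the real controller API):

```python
# Toy reconciliation loop illustrating the controller pattern.
# Real controllers watch the API server; here desired/observed are plain dicts.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    want, have = desired["replicas"], observed["replicas"]
    if have < want:
        actions.append(f"create {want - have} pod(s)")
    elif have > want:
        actions.append(f"delete {have - want} pod(s)")
    return actions

# One pass of the loop: 3 replicas desired, only 1 running
print(reconcile({"replicas": 3}, {"replicas": 1}))  # ['create 2 pod(s)']
```

Kubernetes runs loops like this for every resource type, which is why a killed pod simply reappears: the next reconciliation pass notices the gap and fixes it.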
## Core Kubernetes Concepts

### Pods: The Smallest Deployable Unit

A pod is one or more containers that share:

- A network namespace (same IP and port space)
- Storage volumes
- Pod-level labels and annotations
- A lifecycle (they are scheduled, started, and stopped together)
```yaml
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-processor
  labels:
    app: data-pipeline
    component: processor
    version: v1.0.0
spec:
  containers:
    - name: processor
      image: myregistry/data-processor:1.0.0
      ports:
        - containerPort: 8080
          name: http
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: LOG_LEVEL
          value: "INFO"
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: data-pvc
```
### Deployments: Declarative Updates

Deployments manage ReplicaSets and provide:

- Declarative rolling updates
- Rollback capability
- Horizontal scaling
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-pipeline
  namespace: data-engineering
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime deployment
  selector:
    matchLabels:
      app: etl-pipeline
  template:
    metadata:
      labels:
        app: etl-pipeline
        version: v2.1.0
    spec:
      containers:
        - name: etl
          image: myregistry/etl-pipeline:2.1.0
          ports:
            - containerPort: 8080
          env:
            - name: ENVIRONMENT
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            exec:
              command:  # placeholder probe; replace with a real health check
                - python
                - -c
                - "import sys; sys.exit(0)"
            initialDelaySeconds: 30
            periodSeconds: 10
```
```bash
# Scale the deployment
kubectl scale deployment etl-pipeline --replicas=5

# Rolling update to a new image
kubectl set image deployment/etl-pipeline etl=myregistry/etl-pipeline:2.2.0

# Rollback if something goes wrong
kubectl rollout undo deployment/etl-pipeline

# Check rollout status
kubectl rollout status deployment/etl-pipeline
```
### Services: Stable Networking

Services provide stable endpoints for accessing pods:
```yaml
# ClusterIP - internal access only
apiVersion: v1
kind: Service
metadata:
  name: etl-service
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
# NodePort - reachable from outside the cluster on each node's IP
apiVersion: v1
kind: Service
metadata:
  name: etl-external
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080
  type: NodePort
---
# LoadBalancer - provisions a cloud provider load balancer
apiVersion: v1
kind: Service
metadata:
  name: etl-loadbalancer
spec:
  selector:
    app: etl-pipeline
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```
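Inside the cluster, a ClusterIP Service is reachable at a predictable DNS name, `<service>.<namespace>.svc.<cluster-domain>`, which is exactly what makes hardcoded endpoints unnecessary. A sketch of building those names (this only demonstrates the naming convention; no live DNS lookup is performed):

```python
def service_dns(service: str, namespace: str = "default",
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

# The etl-service defined above, as seen from any pod in the cluster
url = f"http://{service_dns('etl-service', 'data-engineering')}:80"
print(url)  # http://etl-service.data-engineering.svc.cluster.local:80
```

Pods in the same namespace can also use the short form `http://etl-service:80`; the resolver appends the namespace automatically.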
### StatefulSets: For Stateful Applications

StatefulSets manage stateful workloads such as databases and message brokers, giving each replica a stable network identity and its own persistent volume:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-cluster
spec:
  serviceName: kafka  # headless service providing per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd
```
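Each StatefulSet replica gets a stable ordinal name (`kafka-cluster-0` through `kafka-cluster-2`) and a per-pod DNS entry under the headless service, so clients can enumerate brokers deterministically. A sketch of deriving a Kafka bootstrap list from those naming conventions (the namespace and port values are assumptions for illustration):

```python
def bootstrap_servers(statefulset: str, service: str, namespace: str,
                      replicas: int, port: int = 9092) -> str:
    """Build a Kafka bootstrap string from StatefulSet pod DNS conventions."""
    hosts = [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    return ",".join(hosts)

print(bootstrap_servers("kafka-cluster", "kafka", "data-engineering", 3))
```

This stability is the whole point of StatefulSets: `kafka-cluster-0` keeps its name and its volume across restarts and rescheduling.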
### ConfigMaps and Secrets: Configuration Management
```yaml
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etl-config
data:
  database_host: "postgres.database.svc.cluster.local"
  database_port: "5432"
  log_level: "INFO"
  batch_size: "1000"
  # File-style configuration (can be mounted as a file)
  spark.conf: |
    spark.executor.memory=4g
    spark.driver.memory=2g
    spark.sql.shuffle.partitions=200
---
# secret.yaml
# Note: Secrets are only base64-encoded, not encrypted, unless etcd
# encryption at rest is enabled; keep real values out of version control.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  username: admin
  password: SuperSecretPassword123!
  url: "postgresql://admin:SuperSecretPassword123!@postgres:5432/analytics"
```
```yaml
# Using the values in a pod spec
env:
  - name: DB_HOST
    valueFrom:
      configMapKeyRef:
        name: etl-config
        key: database_host
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
```
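On the application side, these injected values arrive as ordinary environment variables; a minimal sketch of reading them with typed defaults (the variable names follow the manifests above, and the defaults are illustrative):

```python
import os

def load_config() -> dict:
    """Read pipeline settings injected via ConfigMap/Secret env vars."""
    return {
        "db_host": os.environ.get("DB_HOST", "localhost"),
        "db_port": int(os.environ.get("DB_PORT", "5432")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
        "batch_size": int(os.environ.get("BATCH_SIZE", "1000")),
    }

config = load_config()
print(config["log_level"])
```

Keeping this parsing in one function makes the container's configuration surface explicit and easy to test locally with `export DB_HOST=...`.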
## Running Data Workloads on Kubernetes

### Batch Processing with Jobs
```yaml
# job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: daily-etl
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  activeDeadlineSeconds: 3600  # 1-hour timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: etl
          image: myregistry/etl-job:1.0.0
          env:
            - name: EXECUTION_DATE
              value: "2026-03-04"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
---
# CronJob for scheduled execution
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl-scheduler
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid  # Don't run overlapping jobs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl
              image: myregistry/etl-job:1.0.0
```
### Spark on Kubernetes
```yaml
# spark-operator deployment (normally installed via its Helm chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spark-operator  # must match the selector
    spec:
      serviceAccountName: spark-operator
      containers:
        - name: spark-operator
          image: gcr.io/spark-operator/spark-operator:v1beta2-1.3.6-3.1.1
          args:
            - -logtostderr
            - -enable-ui-service
            - -ui-service-port=8080
          ports:
            - containerPort: 8080
```
---
# SparkApplication CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: data-pipeline
namespace: data-engineering
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: myregistry/spark-pipeline:1.0.0
mainApplicationFile: local:///app/main.py
sparkVersion: "3.4.1"
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 3
onSubmissionFailureRetryInterval: 10
driver:
cores: 2
coreLimit: "2500m"
memory: "4g"
labels:
version: 1.0.0
serviceAccount: spark
executor:
cores: 2
instances: 5
memory: "8g"
labels:
version: 1.0.0
dynamicAllocation:
enabled: true
initialExecutors: 2
minExecutors: 1
maxExecutors: 20
### Airflow on Kubernetes (KubernetesExecutor)

```yaml
# airflow-values.yaml (official Airflow Helm chart values)
executor: KubernetesExecutor

# Each task runs in its own pod built from this template
config:
  kubernetes_executor:
    pod_template_file: "/opt/airflow/pod_templates/pod_template.yaml"

workers:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

postgresql:
  enabled: true
```
```python
# In your DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

run_etl = KubernetesPodOperator(
    task_id='run_etl',
    namespace='data-engineering',
    image='myregistry/etl-pipeline:1.0.0',
    name='etl-task',
    env_vars={'EXEC_DATE': '{{ ds }}'},  # templated logical date
    container_resources=k8s.V1ResourceRequirements(
        limits={'cpu': '1000m', 'memory': '2Gi'},
    ),
    get_logs=True,
    is_delete_operator_pod=True,
)
```
## Helm: Kubernetes Package Management

### Creating a Helm Chart

```bash
# Create the chart skeleton
helm create data-pipeline-chart
```

The resulting chart structure:

```
data-pipeline-chart/
├── Chart.yaml
├── values.yaml
├── values-production.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── configmap.yaml
    ├── secret.yaml
    ├── _helpers.tpl
    └── NOTES.txt
```
### Chart.yaml

```yaml
apiVersion: v2
name: data-pipeline
description: A Helm chart for data pipeline deployment
type: application
version: 1.0.0
appVersion: "1.0.0"
keywords:
  - data-engineering
  - etl
  - pipeline
maintainers:
  - name: Furkanul Islam
    email: furkan@example.com
```
### values.yaml

```yaml
# Default values
replicaCount: 2

image:
  repository: myregistry/data-pipeline
  tag: "1.0.0"
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

config:
  logLevel: INFO
  batchSize: 1000

secrets:
  databaseUrl: ""  # Set via --set or a secrets file

service:
  type: ClusterIP
  port: 8080
```
### Deployment Template

```yaml
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "data-pipeline.fullname" . }}
  labels:
    {{- include "data-pipeline.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "data-pipeline.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "data-pipeline.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          env:
            - name: LOG_LEVEL
              value: {{ .Values.config.logLevel | quote }}
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "data-pipeline.fullname" . }}-secrets
                  key: database-url
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
{{- if .Values.autoscaling.enabled }}
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "data-pipeline.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "data-pipeline.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
{{- end }}
```
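Under the hood, the HPA scales with a simple proportional rule: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max bounds. A sketch of that arithmetic (this mirrors the documented formula; the real controller also applies tolerances and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Proportional scaling rule used by the HorizontalPodAutoscaler."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 120% CPU against an 80% target: scale up
print(desired_replicas(3, 120, 80))  # 5
```

This is why the `requests` values in `values.yaml` matter so much: CPU utilization is measured relative to the request, so an inaccurate request skews every scaling decision.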
### Deploy with Helm

```bash
# Install
helm install data-pipeline ./data-pipeline-chart \
  --namespace data-engineering \
  --create-namespace \
  --set secrets.databaseUrl="postgresql://user:pass@host:5432/db"

# Use production values
helm install data-pipeline ./data-pipeline-chart \
  -f values-production.yaml \
  --namespace production

# Upgrade
helm upgrade data-pipeline ./data-pipeline-chart \
  --set image.tag="2.0.0"

# Roll back to revision 1
helm rollback data-pipeline 1

# View release history
helm history data-pipeline
```
## Monitoring and Observability

### Prometheus + Grafana Stack

```yaml
# prometheus-values.yaml (kube-prometheus-stack Helm chart)
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m
    serviceMonitorSelectorNilUsesHelmValues: false

grafana:
  enabled: true
  adminPassword: admin  # change this anywhere beyond a demo
  datasources:
    datasources.yaml:
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus:9090
          isDefault: true
```
### Custom Metrics with Prometheus

```python
# Instrumenting your application
import time

from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
PREDICTION_COUNTER = Counter(
    'predictions_total',
    'Total predictions made',
    ['model_version', 'status'],
)

PREDICTION_LATENCY = Histogram(
    'prediction_latency_seconds',
    'Prediction latency',
    ['model_version'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)


def predict(data):
    """Wrap model inference with success/error counts and latency."""
    start_time = time.time()
    try:
        result = model.predict(data)  # `model` is loaded elsewhere
        PREDICTION_COUNTER.labels(model_version='v1', status='success').inc()
        PREDICTION_LATENCY.labels(model_version='v1').observe(time.time() - start_time)
        return result
    except Exception:
        PREDICTION_COUNTER.labels(model_version='v1', status='error').inc()
        raise
```
ServiceMonitor for Auto-Discovery
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: data-pipeline-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: data-pipeline
endpoints:
- port: metrics
interval: 15s
path: /metrics
## Production Best Practices

### 1. Resource Requests and Limits

```yaml
resources:
  requests:
    memory: "2Gi"  # Guaranteed memory (used for scheduling)
    cpu: "1000m"   # Guaranteed CPU share
  limits:
    memory: "4Gi"  # Max memory before the container is OOMKilled
    cpu: "2000m"   # Max CPU (throttled if exceeded)
```
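The requests/limits combination also determines the pod's QoS class, which controls eviction order under node memory pressure: Guaranteed (requests equal limits for cpu and memory), Burstable (some requests or limits set), or BestEffort (none set). A sketch of that classification for a single-container pod (a simplification of the real multi-container rules):

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Approximate the QoS class Kubernetes assigns to a single-container pod."""
    if not requests and not limits:
        return "BestEffort"
    effective_requests = requests or dict(limits)  # requests default to limits
    if set(limits) >= {"cpu", "memory"} and effective_requests == limits:
        return "Guaranteed"
    return "Burstable"

# The snippet above sets limits higher than requests, so it is Burstable
print(qos_class({"cpu": "1000m", "memory": "2Gi"},
                {"cpu": "2000m", "memory": "4Gi"}))  # Burstable
```

For latency-sensitive data services, setting requests equal to limits (Guaranteed) buys predictable performance at the cost of bin-packing efficiency.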
2. Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: etl-pdb
spec:
minAvailable: 2 # Keep at least 2 pods running
selector:
matchLabels:
app: etl-pipeline
3. Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: restrict-data-access
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: etl-pipeline
ports:
- protocol: TCP
port: 5432
### 4. Use Managed Kubernetes

For production, consider a managed control plane:

- **EKS (AWS)**: mature, integrates well with AWS services
- **GKE (GCP)**: polished managed experience, automatic upgrades
- **AKS (Azure)**: good Azure integration, free control plane
5. Implement Proper Logging
import logging
import sys
# Configure structured logging
logging.basicConfig(
level=logging.INFO,
format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}',
stream=sys.stdout
)
logger = logging.getLogger(__name__)
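One caveat with the format string above: a message containing quotes or newlines produces invalid JSON. A stdlib-only formatter that serializes each record properly (a sketch, not a drop-in library):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Serialize each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),  # json.dumps escapes quotes/newlines
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("etl").info('message with "quotes" stays valid JSON')
```

One JSON object per line is exactly what log pipelines (Fluent Bit, Loki, CloudWatch) expect, so downstream parsing never breaks on odd messages.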
## Key Takeaways

Kubernetes for data engineering means:

- **Declarative infrastructure**: define the desired state and let Kubernetes reconcile toward it
- **Auto-scaling**: handle variable workloads efficiently
- **Self-healing**: automatic restarts and rescheduling
- **Resource efficiency**: better utilization than dedicated VMs
- **Ecosystem**: rich tooling (Helm, Operators, service meshes)
The complexity is worth it for production data systems at scale.
Questions about Kubernetes for data? Reach out through the contact page or connect on LinkedIn.