PostgreSQL has been my go-to database for data engineering workloads. Whether it’s powering analytics dashboards, storing pipeline state, or serving as a data warehouse, PostgreSQL handles it all. But out-of-the-box configurations rarely deliver optimal performance. In this guide, I’ll share the optimization techniques that have yielded 10-100x performance improvements in my projects.
Understanding PostgreSQL Architecture
Before optimizing, understand how PostgreSQL works:
┌─────────────────────────────┐
│     Client Applications     │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────────────────────┐
│              PostgreSQL Server              │
│  ┌───────────────────────────────────────┐  │
│  │   Shared Buffers (In-Memory Cache)    │  │
│  │  [Page 1] [Page 2] [Page 3] [Page 4]  │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐  │
│  │   Query   │ │ Background│ │    WAL    │  │
│  │  Planner  │ │  Workers  │ │  Buffer   │  │
│  └───────────┘ └───────────┘ └───────────┘  │
└──────────────────────┬──────────────────────┘
                       ▼
┌─────────────────────────────────────────────┐
│                Disk Storage                 │
│  [Table files] [Index files] [WAL] [Temp]   │
└─────────────────────────────────────────────┘
Indexing Strategies
B-Tree Indexes (Default)
Best for equality and range queries:
-- Create index
CREATE INDEX idx_users_email ON users(email);
-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
-- Partial index (index only matching rows)
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Covering index (includes additional columns)
CREATE INDEX idx_orders_covering ON orders(customer_id) INCLUDE (total_amount, status);
When to Use B-Tree
| Query Pattern | Use B-Tree? |
|---|---|
| WHERE column = value | ✅ Yes |
| WHERE column > value | ✅ Yes |
| WHERE column BETWEEN x AND y | ✅ Yes |
| ORDER BY column | ✅ Yes |
| WHERE column IN (...) | ✅ Yes |
| WHERE column LIKE '%text%' | ❌ No (use GIN with pg_trgm) |
| WHERE column @> json | ❌ No (use GIN) |
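For the LIKE '%text%' case, a trigram GIN index is the usual escape hatch. A minimal sketch, assuming a users table with a name column (pg_trgm ships with PostgreSQL's contrib package):

```sql
-- Trigram index to accelerate leading-wildcard LIKE/ILIKE searches
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_users_name_trgm ON users USING GIN (name gin_trgm_ops);

-- This can now use the index instead of a sequential scan
SELECT * FROM users WHERE name LIKE '%smith%';
```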
GIN Indexes (JSONB and Arrays)
-- JSONB index
CREATE INDEX idx_users_metadata ON users USING GIN (metadata);
-- Query with JSONB index
SELECT * FROM users
WHERE metadata @> '{"premium": true}';
SELECT * FROM users
WHERE metadata ? 'email_verified';
-- Array index
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);
-- Query with array index
SELECT * FROM posts
WHERE tags && ARRAY['postgresql', 'database']; -- Overlap operator
SELECT * FROM posts
WHERE tags @> ARRAY['postgresql']; -- Contains operator
GiST Indexes (Geometric and Full-Text)
-- Geometric index
CREATE INDEX idx_locations_coords ON locations USING GiST (coordinates);
-- Full-text search index (GiST works, but GIN is usually preferred
-- for full-text search on mostly-static data)
CREATE INDEX idx_articles_search ON articles USING GiST (to_tsvector('english', title || ' ' || content));
-- Query with full-text search
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('postgresql & performance');
BRIN Indexes (Large Time-Series Data)
-- Block Range Index for naturally ordered data
CREATE INDEX idx_events_timestamp_brin ON events USING BRIN (timestamp);
-- Much smaller than B-tree for time-series
-- Works best when data is inserted in order
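To check whether BRIN is a good fit and what it actually saves, compare index sizes and verify that the table is physically ordered on the column. A sketch, assuming the events table above:

```sql
-- Compare sizes: BRIN is often orders of magnitude smaller than B-tree
SELECT pg_size_pretty(pg_relation_size('idx_events_timestamp_brin')) AS brin_size;

-- BRIN only pays off when physical order matches column order;
-- a correlation close to 1.0 (or -1.0) in pg_stats means well-ordered data
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'timestamp';
```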
Analyzing Query Performance
EXPLAIN ANALYZE
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Explain with actual execution
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- Include buffers (shows I/O)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 123;
-- Include timing
EXPLAIN (ANALYZE, TIMING, BUFFERS) SELECT * FROM orders WHERE customer_id = 123;
Reading EXPLAIN Output
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=0.43..8.46 rows=1 width=36) (actual time=0.025..0.026 rows=0 loops=1)
-> Index Scan using customers_pkey on customers (cost=0.43..8.45 rows=1 width=16)
(actual time=0.022..0.023 rows=1 loops=1)
Index Cond: (id = 123)
-> Index Scan using orders_customer_idx on orders (cost=0.00..8.01 rows=1 width=20)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (customer_id = 123)
Planning Time: 0.152 ms
Execution Time: 0.058 ms
Key metrics:
- cost: Estimated cost (startup..total)
- rows: Estimated rows
- actual time: Actual execution time (first row..all rows)
- loops: How many times this node executed
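The biggest red flag in EXPLAIN output is estimated rows diverging from actual rows by orders of magnitude: the planner then picks the wrong join strategy. The usual causes are stale statistics or correlated columns. A sketch of the fixes, assuming a hypothetical orders table with correlated city and country columns:

```sql
-- Stale statistics: re-sample the table
ANALYZE orders;

-- Correlated columns break the planner's independence assumption;
-- extended statistics (PostgreSQL 10+) capture the dependency
CREATE STATISTICS orders_city_country (dependencies)
    ON city, country FROM orders;
ANALYZE orders;
```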
Identifying Slow Queries
-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slowest queries
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
rows / calls AS avg_rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Find most frequently executed queries
SELECT
query,
calls,
total_exec_time,
mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
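Two practical companions to those queries: rank by total time (a cheap query called millions of times can outweigh one slow query), and reset the counters after a tuning change so before/after numbers are comparable. A sketch:

```sql
-- Biggest aggregate drain on the server
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset counters to measure the effect of a change cleanly
SELECT pg_stat_statements_reset();
```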
Query Optimization Techniques
1. Avoid SELECT *
-- BAD: Fetches all columns, prevents index-only scans
SELECT * FROM orders WHERE customer_id = 123;
-- GOOD: Fetch only needed columns
SELECT order_id, total_amount, status
FROM orders
WHERE customer_id = 123;
2. Use CTEs for Complex Queries
-- Hard to read and optimize
SELECT
c.name,
SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.active = true
AND o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.id, c.name
HAVING SUM(o.total) > 1000
ORDER BY SUM(o.total) DESC;
-- Clearer with CTEs; since PostgreSQL 12 they are inlined by default,
-- so readability no longer costs performance (earlier versions treat
-- every CTE as an optimization fence)
WITH recent_orders AS (
SELECT * FROM orders
WHERE created_at > NOW() - INTERVAL '30 days'
),
active_customers AS (
SELECT * FROM customers WHERE active = true
),
customer_totals AS (
SELECT
c.id,
c.name,
SUM(o.total) AS total_spent
FROM active_customers c
JOIN recent_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
)
SELECT name, total_spent
FROM customer_totals
WHERE total_spent > 1000
ORDER BY total_spent DESC;
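When you *want* the old fence behavior, for example to compute an expensive CTE once and reuse it, you can ask for it explicitly (PostgreSQL 12+):

```sql
-- MATERIALIZED forces the CTE to be evaluated once and cached
WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders
    WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT customer_id, SUM(total) AS total_spent
FROM recent_orders
GROUP BY customer_id;
```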
3. Use EXISTS Instead of IN for Subqueries
-- Often slower with large subquery results (though modern planners
-- frequently rewrite IN as a semi-join)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);
-- Faster with EXISTS
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id AND o.total > 1000
);
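The negated forms deserve extra care: NOT IN silently returns zero rows if the subquery produces even one NULL, while NOT EXISTS behaves as expected. A sketch:

```sql
-- Returns no rows at all if any orders.customer_id is NULL -- a classic trap
SELECT * FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);

-- NULL-safe, and usually plans as an efficient anti-join
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
```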
4. Batch Operations
-- BAD: Many round trips
INSERT INTO orders (customer_id, total) VALUES (1, 100);
INSERT INTO orders (customer_id, total) VALUES (2, 200);
INSERT INTO orders (customer_id, total) VALUES (3, 300);
-- GOOD: Single round trip
INSERT INTO orders (customer_id, total) VALUES
(1, 100),
(2, 200),
(3, 300);
-- Even better for bulk inserts
COPY orders(customer_id, total)
FROM STDIN WITH (FORMAT CSV);
1,100
2,200
3,300
\.
5. Proper JOIN Techniques
-- Ensure statistics are up to date for optimal join plans
ANALYZE customers;
ANALYZE orders;
-- Use explicit JOIN syntax
SELECT c.name, o.total
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
WHERE o.created_at > '2026-01-01';
-- For small tables, consider lateral joins
SELECT c.id, c.name, recent.total
FROM customers c
CROSS JOIN LATERAL (
SELECT total FROM orders
WHERE customer_id = c.id
ORDER BY created_at DESC
LIMIT 1
) recent;
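For this "latest row per group" pattern, DISTINCT ON is a simpler alternative to the lateral join when you want one row for every customer in a single pass:

```sql
-- One most-recent order per customer; pairs well with an index on
-- (customer_id, created_at DESC)
SELECT DISTINCT ON (customer_id)
    customer_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;
```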
Configuration Tuning
Key PostgreSQL Settings
# postgresql.conf
#------------------------------------------------------------------------------
# MEMORY SETTINGS
#------------------------------------------------------------------------------
# Shared buffer cache (typically 25% of RAM)
shared_buffers = 4GB
# Effective cache size (estimate of OS cache, typically 75% of RAM)
effective_cache_size = 12GB
# Memory per sort/hash operation (per query node, not per connection --
# a single query can use several multiples of this, so size it against max_connections)
work_mem = 256MB
# Memory for maintenance operations (VACUUM, CREATE INDEX)
maintenance_work_mem = 1GB
# Memory for autovacuum workers
autovacuum_work_mem = 512MB
#------------------------------------------------------------------------------
# WRITE AHEAD LOG (WAL)
#------------------------------------------------------------------------------
# WAL level for replication
wal_level = replica
# Number of WAL segments to keep
max_wal_size = 4GB
min_wal_size = 1GB
# Checkpoint completion target (spread checkpoint I/O)
checkpoint_completion_target = 0.9
#------------------------------------------------------------------------------
# QUERY PLANNER
#------------------------------------------------------------------------------
# Random page cost (lower for SSDs: 1.1-1.5)
random_page_cost = 1.1
# Effective I/O concurrency (higher for SSDs: 200-1000)
effective_io_concurrency = 200
# Enable parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
#------------------------------------------------------------------------------
# CONNECTIONS
#------------------------------------------------------------------------------
# Maximum connections
max_connections = 200
# Superuser reserved connections
superuser_reserved_connections = 5
#------------------------------------------------------------------------------
# AUTOVACUUM
#------------------------------------------------------------------------------
# Enable autovacuum
autovacuum = on
# Number of autovacuum workers
autovacuum_max_workers = 3
# Time between autovacuum runs
autovacuum_naptime = 60s
# Threshold for triggering vacuum
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.1
# Threshold for triggering analyze
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.05
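Rather than editing postgresql.conf by hand, most of these settings can be changed with ALTER SYSTEM and applied with a reload, no downtime (shared_buffers and max_connections still require a restart). A sketch:

```sql
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Verify, and see whether any change is waiting on a restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('work_mem', 'random_page_cost', 'shared_buffers');
```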
Configuration by Workload
# OLTP (High Transaction Volume)
shared_buffers = 25% of RAM
work_mem = 64-256MB
effective_cache_size = 75% of RAM
checkpoint_completion_target = 0.9
wal_buffers = 16MB
# Analytics/OLAP (Complex Queries)
shared_buffers = 25% of RAM
work_mem = 512MB-2GB
effective_cache_size = 75% of RAM
random_page_cost = 1.1
effective_io_concurrency = 200
max_parallel_workers_per_gather = 4-8
# Mixed Workload
shared_buffers = 25% of RAM
work_mem = 256MB
effective_cache_size = 75% of RAM
checkpoint_completion_target = 0.8
Maintenance Tasks
VACUUM and ANALYZE
-- Manual vacuum (marks dead rows reusable without an exclusive lock;
-- doesn't shrink the file on disk)
VACUUM orders;
-- Verbose vacuum (shows progress)
VACUUM VERBOSE orders;
-- Full vacuum (rewrites the table; takes an ACCESS EXCLUSIVE lock)
VACUUM FULL orders;
-- Update statistics
ANALYZE orders;
-- Combined
VACUUM ANALYZE orders;
-- Check table bloat
SELECT
schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size,
n_dead_tup,
n_live_tup,
n_dead_tup::float / nullif(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
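For the large, busy tables this query flags, the global autovacuum thresholds are often too lax: a scale factor of 0.1 on a billion-row table means 100M dead tuples accumulate before vacuum fires. Per-table overrides help (table name illustrative):

```sql
-- Vacuum the events table after ~1% churn instead of the default 10%
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);
```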
Index Maintenance
-- Check index usage
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Find unused indexes (candidates for removal -- but keep indexes that
-- enforce unique or exclusion constraints even if never scanned)
SELECT
schemaname,
relname,
indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%_pkey'
ORDER BY relname;
-- Reindex (rebuild index)
REINDEX INDEX idx_orders_customer;
-- Reindex entire table
REINDEX TABLE orders;
-- Concurrent reindex (doesn't block writes; PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer;
Monitoring and Alerting
Key Metrics to Monitor
-- Connection usage
SELECT
    setting::int AS max_connections,
    (SELECT count(*) FROM pg_stat_activity) AS active_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') AS idle_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS running_queries
FROM pg_settings
WHERE name = 'max_connections';
-- Cache hit ratio (should be >99%)
SELECT
sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;
-- Long-running queries
SELECT
pid,
usename,
now() - query_start AS duration,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > INTERVAL '5 minutes'
ORDER BY duration DESC;
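When that query surfaces a runaway session, you can cancel just the statement, or terminate the whole backend if cancellation isn't enough (pid comes from pg_stat_activity; 12345 is a placeholder):

```sql
-- Polite: cancel the current query, keep the connection alive
SELECT pg_cancel_backend(12345);

-- Forceful: kill the backend entirely
SELECT pg_terminate_backend(12345);
```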
-- Lock information
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
Setting Up Alerts
# monitoring/alerts.py
import psycopg2
from typing import Dict, List


class PostgreSQLMonitor:
    def __init__(self, connection_string: str):
        self.conn = psycopg2.connect(connection_string)

    def check_slow_queries(self, threshold_minutes: int = 5) -> List[Dict]:
        """Alert on long-running queries."""
        # Bind the threshold as a proper parameter instead of splicing
        # it into a quoted INTERVAL literal
        query = """
            SELECT pid, usename, now() - query_start AS duration, query
            FROM pg_stat_activity
            WHERE state != 'idle'
              AND now() - query_start > %s * INTERVAL '1 minute'
        """
        with self.conn.cursor() as cur:
            cur.execute(query, (threshold_minutes,))
            return [
                {'pid': r[0], 'user': r[1], 'duration': str(r[2]), 'query': r[3]}
                for r in cur.fetchall()
            ]

    def check_replication_lag(self, threshold_mb: int = 100) -> Dict:
        """Alert on replication lag (run against the primary)."""
        query = """
            SELECT client_addr,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication
        """
        with self.conn.cursor() as cur:
            cur.execute(query)
            alerts = []
            for r in cur.fetchall():
                lag_mb = r[1] / 1024 / 1024
                if lag_mb > threshold_mb:
                    alerts.append({'client': r[0], 'lag_mb': lag_mb})
            return {'alerts': alerts, 'status': 'critical' if alerts else 'ok'}

    def check_connection_usage(self, threshold_percent: float = 80) -> Dict:
        """Alert on high connection usage."""
        query = """
            SELECT
              (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn,
              (SELECT count(*) FROM pg_stat_activity) AS active_conn
        """
        with self.conn.cursor() as cur:
            cur.execute(query)
            max_conn, active_conn = cur.fetchone()
            usage_percent = (active_conn / max_conn) * 100
            return {
                'max_connections': max_conn,
                'active_connections': active_conn,
                'usage_percent': usage_percent,
                'status': 'warning' if usage_percent > threshold_percent else 'ok'
            }


# Usage (send_alert is whatever notification hook you use -- Slack, PagerDuty, email)
monitor = PostgreSQLMonitor('postgresql://user:pass@host:5432/db')

slow_queries = monitor.check_slow_queries(threshold_minutes=5)
if slow_queries:
    send_alert(f"Found {len(slow_queries)} slow queries", slow_queries)

replication = monitor.check_replication_lag(threshold_mb=100)
if replication['status'] == 'critical':
    send_alert("Replication lag critical!", replication)

connections = monitor.check_connection_usage(threshold_percent=80)
if connections['status'] == 'warning':
    send_alert("High connection usage!", connections)
Key Takeaways
PostgreSQL optimization requires:
- Right indexes: Choose index types based on query patterns
- Query analysis: Use EXPLAIN ANALYZE to understand execution plans
- Configuration tuning: Adjust settings for your workload
- Regular maintenance: VACUUM, ANALYZE, REINDEX
- Monitoring: Track key metrics and set up alerts
Proper optimization can yield 10-100x performance improvements.
Questions about PostgreSQL optimization? Reach out through the contact page or connect on LinkedIn.