PostgreSQL has been my go-to database for data engineering workloads. Whether it’s powering analytics dashboards, storing pipeline state, or serving as a data warehouse, PostgreSQL handles it all. But out-of-the-box configurations rarely deliver optimal performance. In this guide, I’ll share the optimization techniques that have yielded 10-100x performance improvements in my projects.
Understanding PostgreSQL Architecture
Before optimizing, understand how PostgreSQL works:
┌─────────────────────────────┐
│     Client Applications     │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────────────────────┐
│              PostgreSQL Server              │
│  ┌───────────────────────────────────────┐  │
│  │   Shared Buffers (In-Memory Cache)    │  │
│  │  [Page 1] [Page 2] [Page 3] [Page 4]  │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐  │
│  │   Query   │ │ Background│ │    WAL    │  │
│  │  Planner  │ │  Workers  │ │  Buffer   │  │
│  └───────────┘ └───────────┘ └───────────┘  │
└──────────────────────┬──────────────────────┘
                       ▼
┌─────────────────────────────────────────────┐
│                Disk Storage                 │
│  [Table files] [Index files] [WAL] [Temp]   │
└─────────────────────────────────────────────┘
Indexing Strategies
B-Tree Indexes (Default)
Best for equality and range queries:
-- Create index
CREATE INDEX idx_users_email ON users(email);
-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
-- Partial index (index only matching rows)
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Expression index
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Covering index (includes additional columns)
CREATE INDEX idx_orders_covering ON orders(customer_id) INCLUDE (total_amount, status);
When to Use B-Tree
| Query Pattern | Use B-Tree? |
|---|---|
| WHERE column = value | ✅ Yes |
| WHERE column > value | ✅ Yes |
| WHERE column BETWEEN x AND y | ✅ Yes |
| ORDER BY column | ✅ Yes |
| WHERE column IN (...) | ✅ Yes |
| WHERE column LIKE '%text%' | ❌ No (use GIN with pg_trgm) |
| WHERE column @> json | ❌ No (use GIN) |
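For the LIKE '%text%' case, a trigram GIN index is the usual escape hatch. A minimal sketch, assuming a users table with a name column (pg_trgm ships with PostgreSQL's contrib package):

```sql
-- Trigram index to accelerate leading-wildcard LIKE/ILIKE searches
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_users_name_trgm ON users USING GIN (name gin_trgm_ops);

-- This can now use the index instead of a sequential scan
SELECT * FROM users WHERE name LIKE '%smith%';
```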
GIN Indexes (JSONB and Arrays)
-- JSONB index
CREATE INDEX idx_users_metadata ON users USING GIN (metadata);
-- Query with JSONB index
SELECT * FROM users
WHERE metadata @> '{"premium": true}';
SELECT * FROM users
WHERE metadata ? 'email_verified';
-- Array index
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);
-- Query with array index
SELECT * FROM posts
WHERE tags && ARRAY['postgresql', 'database']; -- Overlap operator
SELECT * FROM posts
WHERE tags @> ARRAY['postgresql']; -- Contains operator
GiST Indexes (Geometric and Full-Text)
-- Geometric index
CREATE INDEX idx_locations_coords ON locations USING GiST (coordinates);
-- Full-text search index (GiST works, but GIN is usually preferred
-- for full-text search on mostly-static data)
CREATE INDEX idx_articles_search ON articles USING GiST (to_tsvector('english', title || ' ' || content));
-- Query with full-text search
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('postgresql & performance');
BRIN Indexes (Large Time-Series Data)
-- Block Range Index for naturally ordered data
CREATE INDEX idx_events_timestamp_brin ON events USING BRIN (timestamp);
-- Much smaller than B-tree for time-series
-- Works best when data is inserted in order
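To check whether BRIN is a good fit and what it actually saves, compare index sizes and verify that the table is physically ordered on the column. A sketch, assuming the events table above:

```sql
-- Compare sizes: BRIN is often orders of magnitude smaller than B-tree
SELECT pg_size_pretty(pg_relation_size('idx_events_timestamp_brin')) AS brin_size;

-- BRIN only pays off when physical order matches column order;
-- a correlation close to 1.0 (or -1.0) in pg_stats means well-ordered data
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'timestamp';
```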
Analyzing Query Performance
EXPLAIN ANALYZE
-- Basic explain
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Explain with actual execution
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- Include buffers (shows I/O)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 123;
-- Include timing
EXPLAIN (ANALYZE, TIMING, BUFFERS) SELECT * FROM orders WHERE customer_id = 123;
Reading EXPLAIN Output
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=0.43..8.46 rows=1 width=36) (actual time=0.025..0.026 rows=0 loops=1)
-> Index Scan using customers_pkey on customers (cost=0.43..8.45 rows=1 width=16)
(actual time=0.022..0.023 rows=1 loops=1)
Index Cond: (id = 123)
-> Index Scan using orders_customer_idx on orders (cost=0.00..8.01 rows=1 width=20)
(actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (customer_id = 123)
Planning Time: 0.152 ms
Execution Time: 0.058 ms
Key metrics:
- cost: Estimated cost (startup..total)
- rows: Estimated rows
- actual time: Actual execution time (first row..all rows)
- loops: How many times this node executed
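The biggest red flag in EXPLAIN output is estimated rows diverging from actual rows by orders of magnitude: the planner then picks the wrong join strategy. The usual causes are stale statistics or correlated columns. A sketch of the fixes, assuming a hypothetical orders table with correlated city and country columns:

```sql
-- Stale statistics: re-sample the table
ANALYZE orders;

-- Correlated columns break the planner's independence assumption;
-- extended statistics (PostgreSQL 10+) capture the dependency
CREATE STATISTICS orders_city_country (dependencies)
    ON city, country FROM orders;
ANALYZE orders;
```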
Identifying Slow Queries
-- Enable pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slowest queries
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
rows / calls AS avg_rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Find most frequently executed queries
SELECT
query,
calls,
total_exec_time,
mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
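Two practical companions to those queries: rank by total time (a cheap query called millions of times can outweigh one slow query), and reset the counters after a tuning change so before/after numbers are comparable. A sketch:

```sql
-- Biggest aggregate drain on the server
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset counters to measure the effect of a change cleanly
SELECT pg_stat_statements_reset();
```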
Query Optimization Techniques
1. Avoid SELECT *
-- BAD: Fetches all columns, prevents index-only scans
SELECT * FROM orders WHERE customer_id = 123;
-- GOOD: Fetch only needed columns
SELECT order_id, total_amount, status
FROM orders
WHERE customer_id = 123;
2. Use CTEs for Complex Queries
-- Hard to read and optimize
SELECT
c.name,
SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.active = true
AND o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.id, c.name
HAVING SUM(o.total) > 1000
ORDER BY SUM(o.total) DESC;
-- Clearer with CTEs; since PostgreSQL 12 they are inlined by default,
-- so readability no longer costs performance (earlier versions treat
-- every CTE as an optimization fence)
WITH recent_orders AS (
SELECT * FROM orders
WHERE created_at > NOW() - INTERVAL '30 days'
),
active_customers AS (
SELECT * FROM customers WHERE active = true
),
customer_totals AS (
SELECT
c.id,
c.name,
SUM(o.total) AS total_spent
FROM active_customers c
JOIN recent_orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
)
SELECT name, total_spent
FROM customer_totals
WHERE total_spent > 1000
ORDER BY total_spent DESC;
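When you *want* the old fence behavior, for example to compute an expensive CTE once and reuse it, you can ask for it explicitly (PostgreSQL 12+):

```sql
-- MATERIALIZED forces the CTE to be evaluated once and cached
WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders
    WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT customer_id, SUM(total) AS total_spent
FROM recent_orders
GROUP BY customer_id;
```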
3. Use EXISTS Instead of IN for Subqueries
-- Often slower with large subquery results (though modern planners
-- frequently rewrite IN as a semi-join)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);
-- Faster with EXISTS
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id AND o.total > 1000
);
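The negated forms deserve extra care: NOT IN silently returns zero rows if the subquery produces even one NULL, while NOT EXISTS behaves as expected. A sketch:

```sql
-- Returns no rows at all if any orders.customer_id is NULL -- a classic trap
SELECT * FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);

-- NULL-safe, and usually plans as an efficient anti-join
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
```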
4. Batch Operations
-- BAD: Many round trips
INSERT INTO orders (customer_id, total) VALUES (1, 100);
INSERT INTO orders (customer_id, total) VALUES (2, 200);
INSERT INTO orders (customer_id, total) VALUES (3, 300);
-- GOOD: Single round trip
INSERT INTO orders (customer_id, total) VALUES
(1, 100),
(2, 200),
(3, 300);
-- Even better for bulk inserts
COPY orders(customer_id, total)
FROM STDIN WITH (FORMAT CSV);
1,100
2,200
3,300
\.
5. Proper JOIN Techniques
-- Ensure statistics are up to date for optimal join plans
ANALYZE customers;
ANALYZE orders;
-- Use explicit JOIN syntax
SELECT c.name, o.total
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
WHERE o.created_at > '2026-01-01';
-- For small tables, consider lateral joins
SELECT c.id, c.name, recent.total
FROM customers c
CROSS JOIN LATERAL (
SELECT total FROM orders
WHERE customer_id = c.id
ORDER BY created_at DESC
LIMIT 1
) recent;
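For this "latest row per group" pattern, DISTINCT ON is a simpler alternative to the lateral join when you want one row for every customer in a single pass:

```sql
-- One most-recent order per customer; pairs well with an index on
-- (customer_id, created_at DESC)
SELECT DISTINCT ON (customer_id)
    customer_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;
```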
Configuration Tuning
Key PostgreSQL Settings
# postgresql.conf
#------------------------------------------------------------------------------
# MEMORY SETTINGS
#------------------------------------------------------------------------------
# Shared buffer cache (typically 25% of RAM)
shared_buffers = 4GB
# Effective cache size (estimate of OS cache, typically 75% of RAM)
effective_cache_size = 12GB
# Memory per sort/hash operation (per query node, not per connection --
# a single query can use several multiples of this, so size it against max_connections)
work_mem = 256MB
# Memory for maintenance operations (VACUUM, CREATE INDEX)
maintenance_work_mem = 1GB
# Memory for autovacuum workers
autovacuum_work_mem = 512MB
#------------------------------------------------------------------------------
# WRITE AHEAD LOG (WAL)
#------------------------------------------------------------------------------
# WAL level for replication
wal_level = replica
# Number of WAL segments to keep
max_wal_size = 4GB
min_wal_size = 1GB
# Checkpoint completion target (spread checkpoint I/O)
checkpoint_completion_target = 0.9
#------------------------------------------------------------------------------
# QUERY PLANNER
#------------------------------------------------------------------------------
# Random page cost (lower for SSDs: 1.1-1.5)
random_page_cost = 1.1
# Effective I/O concurrency (higher for SSDs: 200-1000)
effective_io_concurrency = 200
# Enable parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
#------------------------------------------------------------------------------
# CONNECTIONS
#------------------------------------------------------------------------------
# Maximum connections
max_connections = 200
# Superuser reserved connections
superuser_reserved_connections = 5
#------------------------------------------------------------------------------
# AUTOVACUUM
#------------------------------------------------------------------------------
# Enable autovacuum
autovacuum = on
# Number of autovacuum workers
autovacuum_max_workers = 3
# Time between autovacuum runs
autovacuum_naptime = 60s
# Threshold for triggering vacuum
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.1
# Threshold for triggering analyze
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.05
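Rather than editing postgresql.conf by hand, most of these settings can be changed with ALTER SYSTEM and applied with a reload, no downtime (shared_buffers and max_connections still require a restart). A sketch:

```sql
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Verify, and see whether any change is waiting on a restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('work_mem', 'random_page_cost', 'shared_buffers');
```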
Configuration by Workload
# OLTP (High Transaction Volume)
shared_buffers = 25% of RAM
work_mem = 64-256MB
effective_cache_size = 75% of RAM
checkpoint_completion_target = 0.9
wal_buffers = 16MB
# Analytics/OLAP (Complex Queries)
shared_buffers = 25% of RAM
work_mem = 512MB-2GB
effective_cache_size = 75% of RAM
random_page_cost = 1.1
effective_io_concurrency = 200
max_parallel_workers_per_gather = 4-8
# Mixed Workload
shared_buffers = 25% of RAM
work_mem = 256MB
effective_cache_size = 75% of RAM
checkpoint_completion_target = 0.8
Maintenance Tasks
VACUUM and ANALYZE
-- Manual vacuum (marks dead rows reusable without an exclusive lock;
-- doesn't shrink the file on disk)
VACUUM orders;
-- Verbose vacuum (shows progress)
VACUUM VERBOSE orders;
-- Full vacuum (rewrites the table; takes an ACCESS EXCLUSIVE lock)
VACUUM FULL orders;
-- Update statistics
ANALYZE orders;
-- Combined
VACUUM ANALYZE orders;
-- Check table bloat
SELECT
schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size,
n_dead_tup,
n_live_tup,
n_dead_tup::float / nullif(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
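For the large, busy tables this query flags, the global autovacuum thresholds are often too lax: a scale factor of 0.1 on a billion-row table means 100M dead tuples accumulate before vacuum fires. Per-table overrides help (table name illustrative):

```sql
-- Vacuum the events table after ~1% churn instead of the default 10%
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);
```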
Index Maintenance
-- Check index usage
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Find unused indexes (candidates for removal -- but keep indexes that
-- enforce unique or exclusion constraints even if never scanned)
SELECT
schemaname,
relname,
indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%_pkey'
ORDER BY relname;
-- Reindex (rebuild index)
REINDEX INDEX idx_orders_customer;
-- Reindex entire table
REINDEX TABLE orders;
-- Concurrent reindex (doesn't block writes; PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer;
Monitoring and Alerting
Key Metrics to Monitor
-- Connection usage
SELECT
    setting::int AS max_connections,
    (SELECT count(*) FROM pg_stat_activity) AS active_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') AS idle_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS running_queries
FROM pg_settings
WHERE name = 'max_connections';
-- Cache hit ratio (should be >99%)
SELECT
sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;
-- Long-running queries
SELECT
pid,
usename,
now() - query_start AS duration,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > INTERVAL '5 minutes'
ORDER BY duration DESC;
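When that query surfaces a runaway session, you can cancel just the statement, or terminate the whole backend if cancellation isn't enough (pid comes from pg_stat_activity; 12345 is a placeholder):

```sql
-- Polite: cancel the current query, keep the connection alive
SELECT pg_cancel_backend(12345);

-- Forceful: kill the backend entirely
SELECT pg_terminate_backend(12345);
```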
-- Lock information
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
Setting Up Alerts
# monitoring/alerts.py
import psycopg2
from typing import Dict, List


class PostgreSQLMonitor:
    def __init__(self, connection_string: str):
        self.conn = psycopg2.connect(connection_string)

    def check_slow_queries(self, threshold_minutes: int = 5) -> List[Dict]:
        """Alert on long-running queries."""
        # Bind the threshold as a proper parameter instead of splicing
        # it into a quoted INTERVAL literal
        query = """
            SELECT pid, usename, now() - query_start AS duration, query
            FROM pg_stat_activity
            WHERE state != 'idle'
              AND now() - query_start > %s * INTERVAL '1 minute'
        """
        with self.conn.cursor() as cur:
            cur.execute(query, (threshold_minutes,))
            return [
                {'pid': r[0], 'user': r[1], 'duration': str(r[2]), 'query': r[3]}
                for r in cur.fetchall()
            ]

    def check_replication_lag(self, threshold_mb: int = 100) -> Dict:
        """Alert on replication lag (run against the primary)."""
        query = """
            SELECT client_addr,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication
        """
        with self.conn.cursor() as cur:
            cur.execute(query)
            alerts = []
            for r in cur.fetchall():
                lag_mb = r[1] / 1024 / 1024
                if lag_mb > threshold_mb:
                    alerts.append({'client': r[0], 'lag_mb': lag_mb})
            return {'alerts': alerts, 'status': 'critical' if alerts else 'ok'}

    def check_connection_usage(self, threshold_percent: float = 80) -> Dict:
        """Alert on high connection usage."""
        query = """
            SELECT
              (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_conn,
              (SELECT count(*) FROM pg_stat_activity) AS active_conn
        """
        with self.conn.cursor() as cur:
            cur.execute(query)
            max_conn, active_conn = cur.fetchone()
            usage_percent = (active_conn / max_conn) * 100
            return {
                'max_connections': max_conn,
                'active_connections': active_conn,
                'usage_percent': usage_percent,
                'status': 'warning' if usage_percent > threshold_percent else 'ok'
            }


# Usage (send_alert is whatever notification hook you use -- Slack, PagerDuty, email)
monitor = PostgreSQLMonitor('postgresql://user:pass@host:5432/db')

slow_queries = monitor.check_slow_queries(threshold_minutes=5)
if slow_queries:
    send_alert(f"Found {len(slow_queries)} slow queries", slow_queries)

replication = monitor.check_replication_lag(threshold_mb=100)
if replication['status'] == 'critical':
    send_alert("Replication lag critical!", replication)

connections = monitor.check_connection_usage(threshold_percent=80)
if connections['status'] == 'warning':
    send_alert("High connection usage!", connections)
Key Takeaways
PostgreSQL optimization requires:
- Right indexes: Choose index types based on query patterns
- Query analysis: Use EXPLAIN ANALYZE to understand execution plans
- Configuration tuning: Adjust settings for your workload
- Regular maintenance: VACUUM, ANALYZE, REINDEX
- Monitoring: Track key metrics and set up alerts
Proper optimization can yield 10-100x performance improvements.
Questions about PostgreSQL optimization? Reach out through the contact page or connect on LinkedIn.