Database Connection Leak Detection — Find and Fix Leaks Before They Take You Down
What is a connection leak?#
Your application opens a database connection, uses it, and never returns it to the pool. The pool shrinks. Eventually all connections are exhausted. New requests block or fail. Your application is down — not because the database is overloaded, but because your code forgot to close a connection.
Connection leaks are one of the most common causes of production outages in database-backed applications. They are insidious because they happen slowly, then all at once.
Symptoms of a connection leak#
Before you know the root cause, you see these patterns:
Gradual degradation#
- Response times creep up over hours or days
- The application works fine after a restart but degrades again
- Database connections grow monotonically — they go up but never come back down
Sudden failure#
- Connection pool exhausted errors:
  - Unable to acquire connection from pool
  - PostgreSQL: FATAL: too many connections for role
  - MySQL: Too many connections
- Application threads blocked waiting for a connection
Misleading metrics#
- Database CPU and memory look fine — the database is not the bottleneck
- Application memory grows slowly (each leaked connection holds buffers)
- Load balancer health checks start failing intermittently
Detection with PostgreSQL pg_stat_activity#
PostgreSQL exposes every active connection through the pg_stat_activity view. This is your primary diagnostic tool.
Count connections by state#
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'your_database'
GROUP BY state
ORDER BY count DESC;
Healthy output shows most connections as idle (returned to pool) with a few active (executing queries). If you see hundreds of idle connections growing over time, connections are being opened but not reused properly.
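The "idle count only goes up" pattern can be checked mechanically from periodic snapshots of the query above. A minimal Python sketch (the function name, sampling approach, and the 50-connection threshold are illustrative assumptions, not from any monitoring library):

```python
def idle_growing(samples, min_increase=50):
    """samples: chronological idle-connection counts, e.g. the 'idle'
    row of the state-count query sampled once a minute.
    Flags a suspected leak when the count rises monotonically by more
    than min_increase over the window. A healthy pool fluctuates;
    a leaking one only climbs."""
    if len(samples) < 2:
        return False
    monotonic = all(b >= a for a, b in zip(samples, samples[1:]))
    return monotonic and samples[-1] - samples[0] >= min_increase
```

Feed it the same window you would eyeball on a dashboard; a dip anywhere in the window clears the alarm.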
Find long-lived idle connections#
SELECT pid, usename, application_name, client_addr,
backend_start, state_change,
now() - state_change AS idle_duration,
query
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;
Connections idle for longer than your pool's max idle time are likely leaked. The application_name and client_addr fields tell you which service is leaking.
Track connection age#
SELECT pid, usename, application_name,
backend_start,
now() - backend_start AS connection_age
FROM pg_stat_activity
WHERE datname = 'your_database'
ORDER BY connection_age DESC
LIMIT 20;
In a healthy pool, connections are recycled. If you see connections that are hours or days old while the pool's max lifetime is 30 minutes, those connections were leaked and never returned.
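Given rows from the query above, flagging never-recycled connections is a short filter. A hedged sketch (assumes `(pid, backend_start)` pairs and the 30-minute max lifetime used in the text):

```python
from datetime import datetime, timedelta, timezone

def suspect_leaks(connections, max_lifetime=timedelta(minutes=30)):
    """connections: (pid, backend_start) pairs from pg_stat_activity.
    Any connection older than the pool's configured max lifetime was
    never recycled by the pool, so it is a leak suspect."""
    now = datetime.now(timezone.utc)
    return [pid for pid, started in connections if now - started > max_lifetime]
```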
Identify the leaking query#
SELECT pid, query, state,
now() - query_start AS query_duration,
now() - backend_start AS connection_age
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > interval '5 minutes'
ORDER BY query_duration DESC;
Long-running queries that never complete can hold connections indefinitely. Look for queries stuck in idle in transaction state — this means a transaction was started but never committed or rolled back.
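The query above only catches connections actively running something; `idle in transaction` sessions can be flagged from the same view. A sketch (the tuple shape and the 5-minute threshold are illustrative):

```python
from datetime import datetime, timedelta, timezone

def stuck_transactions(rows, threshold=timedelta(minutes=5)):
    """rows: (pid, state, state_change) tuples from pg_stat_activity.
    Returns pids stuck in 'idle in transaction', i.e. a transaction
    was opened but never committed or rolled back, pinning both a
    connection and any locks it holds."""
    now = datetime.now(timezone.utc)
    return [
        pid for pid, state, changed in rows
        if state == "idle in transaction" and now - changed > threshold
    ]
```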
Connection pool monitoring#
Your connection pool is the chokepoint. Monitor it aggressively.
HikariCP (Java)#
HikariCP exposes metrics via JMX or Micrometer:
- hikaricp_connections_active: currently in use
- hikaricp_connections_idle: available in pool
- hikaricp_connections_pending: threads waiting for a connection
- hikaricp_connections_timeout_total: connection acquisition timeouts
Alert when pending is greater than 0 for more than 30 seconds. This means demand exceeds supply.
Alert when active equals maximum_pool_size for more than 60 seconds. The pool is saturated.
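Both alert conditions have the form "predicate holds continuously for N seconds". A minimal sketch of that check over timestamped metric samples (all names and numbers here are illustrative):

```python
def sustained(samples, predicate, duration):
    """samples: chronological (timestamp_seconds, value) pairs.
    True if predicate(value) has held continuously for at least
    `duration` seconds up to and including the latest sample."""
    if not samples:
        return False
    start = None
    for ts, value in samples:
        if predicate(value):
            start = ts if start is None else start  # streak continues
        else:
            start = None  # streak broken
    return start is not None and samples[-1][0] - start >= duration

# pending > 0 for more than 30 seconds: demand exceeds supply
pending = [(0, 0), (10, 2), (20, 3), (30, 1), (45, 4)]
alert = sustained(pending, lambda v: v > 0, 30)
```

The same helper covers the saturation rule by passing `lambda v: v >= max_pool_size` and a 60-second duration.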
pgBouncer#
If you use pgBouncer as an external connection pooler:
SHOW POOLS;
SHOW CLIENTS;
SHOW SERVERS;
Key fields from SHOW POOLS:
- cl_active: client connections actively linked to a server connection
- cl_waiting: client connections waiting for a server connection
- sv_active: server connections in use
- sv_idle: server connections available
Alert when cl_waiting is greater than 0 persistently.
Generic metrics to track#
Regardless of your pool implementation, track these:
| Metric | Healthy | Leaking |
|---|---|---|
| Active connections | Fluctuates with load | Grows monotonically |
| Idle connections | Stable | Shrinks to zero |
| Pending requests | 0 or brief spikes | Sustained and growing |
| Connection age | Below max lifetime | Exceeds max lifetime |
| Pool utilization | 20-70% | Approaching 100% |
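The table collapses into a rough triage function. A sketch using the article's rules of thumb (the 90% cutoff is an assumed stand-in for "approaching 100%"):

```python
def pool_looks_leaky(utilization, pending_sustained, max_conn_age, max_lifetime):
    """Triage from the table's signals: near-full utilization,
    sustained pending requests, or connections outliving the pool's
    configured max lifetime all point at a leak."""
    return (
        utilization > 0.9
        or pending_sustained
        or max_conn_age > max_lifetime
    )
```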
Leak prevention patterns#
Always use try-with-resources (or equivalent)#
The most common leak: opening a connection and not closing it in the error path.
Java:
// Correct — connection is always returned
try (Connection conn = dataSource.getConnection()) {
    // use connection
} // auto-closed even if exception is thrown
Python:
# Correct — context manager returns connection
with engine.connect() as conn:
    result = conn.execute(query)
# connection returned to pool here
Go:
// Correct — defer ensures close
rows, err := db.Query(query)
if err != nil {
    return err
}
defer rows.Close() // always close rows
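To make the failure mode concrete, here is a toy pool (purely illustrative, not a real driver) that only counts checkouts. The happy-path version leaks one connection when the query raises; the context-manager version does not:

```python
import contextlib

class ToyPool:
    """Minimal stand-in for a connection pool: tracks checkouts only."""
    def __init__(self):
        self.checked_out = 0

    def get(self):
        self.checked_out += 1
        return object()  # pretend connection

    def release(self, conn):
        self.checked_out -= 1

    @contextlib.contextmanager
    def connection(self):
        conn = self.get()
        try:
            yield conn
        finally:
            self.release(conn)  # returned even on exception

def leaky(pool):
    # Leaky pattern: the error path skips release().
    conn = pool.get()
    raise RuntimeError("query failed")
    pool.release(conn)  # never reached

def safe(pool):
    # Safe pattern: cleanup is guaranteed by the context manager.
    with pool.connection():
        raise RuntimeError("query failed")
```

Run both against the same pool and only `leaky` leaves `checked_out` incremented; this is exactly the error-path leak the try-with-resources pattern eliminates.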
Set aggressive pool timeouts#
Do not let connections live forever. Configure:
- Max lifetime: 30 minutes. Connections older than this are closed and replaced.
- Idle timeout: 10 minutes. Idle connections beyond this are removed.
- Connection timeout: 5 seconds. If a connection cannot be acquired in 5 seconds, fail fast rather than queue indefinitely.
- Leak detection threshold: 60 seconds. If a connection is checked out for longer than this, log a warning with the stack trace.
Enable leak detection in your pool#
HikariCP has built-in leak detection:
spring:
  datasource:
    hikari:
      leak-detection-threshold: 60000 # 60 seconds
When a connection is held longer than the threshold, HikariCP logs the stack trace of the code that borrowed it. This pinpoints the exact line that forgot to return the connection.
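Outside the JVM you can approximate the same idea by recording a stack trace at checkout time and reporting any connection held past the threshold. A sketch (the class and method names are invented for illustration, not a real pool API):

```python
import time
import traceback

class LeakDetectingPool:
    """HikariCP-style leak detection, sketched in Python: remember
    when and where each connection was borrowed, and report any
    held past the threshold along with the borrowing stack trace."""
    def __init__(self, threshold_seconds=60):
        self.threshold = threshold_seconds
        self.borrowed = {}  # conn_id -> (checkout_time, stack)

    def checkout(self, conn_id):
        self.borrowed[conn_id] = (time.monotonic(), traceback.format_stack())

    def checkin(self, conn_id):
        self.borrowed.pop(conn_id, None)

    def leaks(self):
        now = time.monotonic()
        return {
            conn_id: stack
            for conn_id, (t, stack) in self.borrowed.items()
            if now - t > self.threshold
        }
```

The captured stack points at the call site that borrowed the connection, which is the line that forgot to return it.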
Use statement timeouts#
Set a maximum query execution time at the database level:
-- PostgreSQL: kill queries running longer than 30 seconds
ALTER ROLE app_user SET statement_timeout = '30s';
This prevents a single runaway query from holding a connection indefinitely.
Wrap connection acquisition in middleware#
In web applications, acquire the connection at the start of the request and return it at the end — regardless of what happens in between:
@app.middleware("http")
async def db_session_middleware(request, call_next):
    async with get_db_session() as session:
        request.state.db = session
        response = await call_next(request)
    # session is returned here, guaranteed
    return response
Automated alerting#
Prometheus + Alertmanager rules#
groups:
  - name: connection_pool_alerts
    rules:
      - alert: ConnectionPoolExhausted
        expr: hikaricp_connections_active / hikaricp_connections_max == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool is fully exhausted"
      - alert: ConnectionPoolLeakSuspected
        expr: delta(hikaricp_connections_active[1h]) > 10 and hikaricp_connections_idle == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Possible connection leak — active growing, idle at zero"
PostgreSQL connection count alert#
- alert: PostgreSQLTooManyConnections
  expr: pg_stat_activity_count > (pg_settings_max_connections * 0.8)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "PostgreSQL connections at 80% of max_connections"
Dashboard checklist#
Build a Grafana dashboard with these panels:
- Connection pool utilization — active / max over time
- Pending connection requests — should be zero
- Connection age distribution — histogram of how long connections have been alive
- pg_stat_activity by state — stacked area chart of idle, active, idle in transaction
- Connection acquisition latency — p50, p95, p99
Emergency response#
When connections are exhausted in production:
- Restart the leaking service — immediate relief, buys time
- Kill idle connections in PostgreSQL:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND now() - state_change > interval '30 minutes';
- Increase max_connections temporarily — not a fix, but buys time
- Deploy a connection pooler (pgBouncer) if you do not have one
- Find and fix the leak using the detection methods above
Key takeaways#
- Connection leaks degrade slowly then fail suddenly — monitor pool utilization before it hits 100%
- pg_stat_activity is your best friend — query it regularly to spot idle and long-lived connections
- Always use try-with-resources or context managers to guarantee connection return
- Enable leak detection in your pool (HikariCP's leak-detection-threshold logs the offending stack trace)
- Set aggressive timeouts — max lifetime, idle timeout, statement timeout
- Alert on pool saturation and monotonically growing active connections before users notice
Article #423 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.