Preventing Max Connection Errors in Go

Earlier this week, I had to debug a production Go system that was crashing with the friendly PostGres error pq: sorry, too many clients already. The culprit was a function that left database transactions open under certain conditions. Although the solution was easy (close your transactions!), the problem was damaging enough that I wanted to find an automated way to guarantee every transaction was being closed throughout the application.

The error I experienced can be replicated by opening more transactions than the database server allows (a common max_connections setting for PostGres is 100):

package main

import "database/sql"
import "log"
import _ "github.com/lib/pq"

func main() {
    conn, err := sql.Open("postgres", "<credentials>")
    if err != nil {
        log.Fatal(err)
    }

    for i := 0; i < 101; i += 1 {
        if _, err := conn.Begin(); err != nil {
            log.Fatal(err)
        }
    }
}

Recent Go versions have introduced more ways to control the database connection pool, such as SetMaxOpenConns, which sets a hard limit on the number of active connections. If set, and when the limit is reached, any requests for new connections will block until another connection is closed. It will not, however, clean up long-running connections (such as unclosed transactions). Therefore, instead of a returning a direct error message when the limit has been reached, the service will timeout - arguably a worse error state.

For PostGres, there are a number of standalone programs (such as PgBouncer) that will implement connection pooling that includes long-running connection clean-up. But not only is this another layer of complexity for the service, its use could mask significant bugs in an application.

The good news is, thanks to Go 1.4’s introduction of TestMain, we can create a test harness for our application that guarantees all connections have been closed upon completion of our tests.

Our first challenge is that any number of testing functions may need to access the database connection pool. Having each call sql.Open, however, will create a new pool per function and prevent us from intelligently managing the active connections. By using sync.Once, we can guarantee that only one connection pool will be created per test run:

var conn *sql.DB // Set package-wide, but not exported
var once sync.Once

func GetConnection() *sql.DB {
    once.Do(func() {
        var err error
        if conn, err = sql.Open("postgres", "<credentials>"); err != nil {
            log.Panic(err)
        }
        conn.SetMaxOpenConns(20) // Sane default
        conn.SetMaxIdleConns(0)
        conn.SetConnMaxLifetime(time.Nanosecond)
    })
    return conn
}

We can then call GetConnection in any test that requires database access:

func TestSomething(t *testing.T) {
    conn := GetConnection()
    for i := 0; i < 10; i++ {
        if err := Something(conn); err != nil {
            t.Error(err)
        }
    }
}

GetConnection also sets some sane defaults around the connection pool that will be useful later.

Next, we need to create a TestMain function. Since our various testing functions will create one and only one connection pool, we only need a setup if there is additional work to be done, such as starting a container image or loading initial SQL fixtures into the database.

func TestMain(m *testing.M) {
    setup()
    retCode := m.Run()
    teardown() // Can't use defer because os.Exit :(
    os.Exit(retCode)
}

The magic of our harness is done in teardown, which is run after all tests have completed. Through a call to Stats we can access - after a brief pause - the current number of open connections. Even one is an error.

func teardown() {
    time.Sleep(time.Millisecond)
    if conn == nil {
        // No tests used the database!
        return
    }
    open := conn.Stats().OpenConnections
    if open > 0 {
        // This could also modify the return code...
        log.Panicf("failed to close %d connections", open)
    }
}

Tests that close their connections will pass as usual:

go test -run=TestGood
PASS
ok      github.com/aodin/go-sql-test-harness    0.071s

But buggy ones will panic during teardown (even if their actual tests pass):

go test -run=TestBad
PASS
2016/05/23 02:04:58 failed to close 10 connections
panic: failed to close 10 connections

Unfortunately, the testing harness cannot tell us in which function(s) the lingering open connections were created. With judicious use of Go test’s -run flag, however, we can quickly locate the buggy function - even if the harness is at best a smoke test.

You can see a basic implementation of this harness on GitHub. Happy hacking!