Gorm Expire Aurora Postgresql Connecitons when master-slave Switch

Describe the feature

System Background:

golang version 1.8
postgresql version: Aurora Postgresql
gorm version: v1.21.15

Issue Summary:

If we set up the connMaxlifetime into sqlDB.SetConnMaxLifetime(time.Hour) the golang application will be fall into ERROR: cannot execute UPDATE in a read-only transaction (SQLSTATE 25006) in an hour without recovering.

Problem Analysis

How connection reuse?

//If the database has
// a concept of per-connection state, such state can be reliably observed
// within a transaction (Tx) or connection (Conn). **Once DB.Begin is called**, the
// returned Tx is bound to a single connection. Once Commit or
// Rollback is called on the transaction, that transaction's
// connection is returned to DB's idle connection pool. The pool size
// can be **controlled with SetMaxIdleConns**.
type DB struct {
...
    freeConn     []*driverConn // free connections ordered by returnedAt oldest to newest
...
}

func (db *DB) begin(ctx context.Context, opts *TxOptions, strategy connReuseStrategy) (tx *Tx, err error) {
    dc, err := db.conn(ctx, strategy)
    if err != nil {
        return nil, err
    }
    return db.beginDC(ctx, dc, dc.releaseConn, opts)
}

// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
...
}

According to the code above, we notice that for each SQL connection, it will fetch a connection by DB.Begin, and it will returns a cached connection if the life time is still available, codes bellow.

        conn := db.freeConn[last]
        db.freeConn = db.freeConn[:last]

Why read-only transaction (SQLSTATE 25006) peresist? It's a tricky and confusing problem for Aurora master-slave switching strategy.

Let's see the following logs while I keep updating data:

Timestamp ErrorMessage
1675653253.223686 unexpected EOF; conn closed
1675653254.041482 connect: connection refused
1675653261.698505 server error (FATAL: the database system is starting up (SQLSTATE 57P03))
1675653262.8552341 ERROR: cannot execute UPDATE in a read-only transaction (SQLSTATE 25006)
1675653274.2220411 Recovered

And the Ping Host logs here:

-- Sensitive data marked.
___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63562->10.*.*.226:postgresql (SYN_SENT)
___go_bui 65222 TCP 10.3.14.43:63584->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63584->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)

We notice that Aurora didn't use a VIP-Server to proxy the backend database entity, it somehow changes the ip for a specific host every time.

Reference here: - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.FastFailover.html - https://docs.aws.amazon.com/whitepapers/latest/amazon-aurora-mysql-db-admin-handbook/dns-caching.html - http://go-database-sql.org/errors.html

So, the root cause is that, golang database engine will reuse the previous connection configuration instead of fetching a new ip for a specific host. After reconnected the target host is now being switching to Readonly, but gorm or golang sql package could not identify. And in this case if the maxLifeTime was configured as Time.Hour, your system will getting down in this time period.

Gorm Expire Aurora Postgresql Connecitons when master-slave Switch

Solution Suggestion

change the configuration db.SetMaxIdleConns(0) or DB.SetConnMaxLifetime(time.Second) Obviously, it will cost some time to reconnect instead of using connection pool.
Looping & Querying with scan: https://github.com/jackc/pgx/issues/1178

My Solution

Using callback here:

func DetectReadonly(db *gorm.DB) {
    db.Callback().Create().Register("gorm:readonly:error", Callback)
    db.Callback().Update().Register("gorm:readonly:error", Callback)
    db.Callback().Delete().Register("gorm:readonly:error", Callback)
    db.Callback().Raw().Register("gorm:readonly:error", Callback)
}

func Callback(db *gorm.DB) {
    sqlDB, _:= PgDb.DB()

    if db.Error != nil && strings.Contains(db.Error.Error(), "SQLSTATE 25006") {
        sqlDB.SetConnMaxLifetime(time.Second)
        fmt.Println("Readonly Exception: ", db.Error.Error())
    } else {
        sqlDB.SetConnMaxLifetime(time.Hour)
    }
}

Ip changes after reset connection max life time

___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:62035->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63562->10.*.*.226:postgresql (SYN_SENT)
___go_bui 65222 TCP 10.3.14.43:63584->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63584->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63594->10.*.*.226:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63625->10.*.*.37:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63625->10.*.*.37:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63625->10.*.*.37:postgresql (ESTABLISHED)
___go_bui 65222 TCP 10.3.14.43:63625->10.*.*.37:postgresql (ESTABLISHED)

Motivation

Actually, I want something that whether we can identify the current pgdb is master or slave, in this case we can detect that our update/insert logic should use the writing database.

Maybe we can add a configuation that allow us to do that.

Related Issues

N.A.

Comment From: whenjonny

It seems that this repo pgconn@v1.10.1 had solved this issue by using the setting target_session_attrs=read-write.

And the libpq-connect has specific that:

target_session_attrs This option determines whether the session must have certain properties to be acceptable. It's typically used in combination with multiple host names to select the first acceptable alternative among several hosts. There are six modes: read-write session must accept read-write transactions by default (that is, the server must not be in hot standby mode and the default_transaction_read_only parameter must be off)