Notes from a PostgreSQL RDS upgrade

inopinatus

So I recently received an RDS maintenance notification:

From: Amazon Web Services, Inc.
Subject: Upgrade now available for your Amazon RDS PostgreSQL database instances

Dear Amazon RDS Customer,

A system update is now available for any Amazon RDS PostgreSQL database instances you created before 13 October 2015. We recommend installing this update to take advantage of several performance improvements and security fixes. You may choose to install this update immediately, or during your next scheduled maintenance window. After approximately six weeks, your RDS instance will be automatically upgraded during your maintenance window. To learn more about scheduling or installing an upgrade, please refer to the RDS documentation: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.OSUpgrades.html.

Installing this update will have an availability impact for a few minutes on your RDS instance (60-120 seconds for Multi-AZ instances). To reduce downtime, you may consider converting your Single-AZ instances to Multi-AZ. If you have any questions or concerns, please contact your TAM or AWS Support.

The database in question is low-traffic but critical to our service. Here are some notes and lessons from executing this upgrade.

Application outage may be much longer

The quoted availability impact of “60-120 seconds” is measured from the database’s point of view, not your application’s. For a Multi-AZ database on a pair of t2.micro instances, the total elapsed time for the upgrade was thirty (30) minutes. For most of that time the PostgreSQL instance remained available; the database outage itself lasted only forty (40) seconds and occurred twenty-one (21) minutes into the procedure. But as a safety move we placed our application into maintenance mode for the whole half-hour, to avoid any application-level data inconsistency.
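
If you want to measure the real database outage inside the window yourself, a once-per-second probe against the endpoint is enough. The following is a minimal sketch, assuming the Ruby "pg" gem and a DATABASE_URL environment variable pointing at the RDS endpoint; adjust to taste.

    #!/usr/bin/env ruby
    # Once-per-second availability probe: prints a timestamped up/DOWN line,
    # making the actual database outage inside the maintenance window visible.
    # Assumes the "pg" gem and DATABASE_URL pointing at the RDS endpoint.
    require "pg"
    require "time"

    loop do
      begin
        conn = PG.connect(ENV.fetch("DATABASE_URL"))
        conn.exec("SELECT 1")
        puts "#{Time.now.iso8601} up"
      rescue PG::Error => e
        puts "#{Time.now.iso8601} DOWN: #{e.class}"
      ensure
        conn&.close
      end
      sleep 1
    end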

Instance may fail over

Upgrading this Multi-AZ instance included a failover. I presume this happens intentionally, to minimise the RDS outage duration. There were two impacts for us.

Most significantly, and as in all RDS failover events, the service IP address changes. Our application uses Ruby on Rails with the Unicorn application server. The entire Unicorn server must be stopped and then started to pick up the new RDS IP address; note that simply issuing a restart will fail, because it merely reloads the master process, and the C library’s cached DNS lookup survives the reload. Yes, I tripped over this, and for ten seconds we were totally unusable :blush:
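
In concrete terms, that stop-and-start looks something like the sketch below. Paths, pid file location and Unicorn options are illustrative only; your init setup will differ.

    # Full stop & start of the Unicorn master (illustrative paths and options).
    # A brand-new master resolves the RDS endpoint afresh when it connects,
    # whereas a reload keeps the old master and its cached DNS lookup.
    PID_FILE = "/var/run/unicorn/unicorn.pid".freeze

    pid = Integer(File.read(PID_FILE).strip)
    Process.kill("QUIT", pid)            # graceful shutdown of master and workers
    sleep 1 while File.exist?(PID_FILE)  # Unicorn unlinks its pid file on exit

    # Start a completely new master in daemon mode.
    started = system("bundle", "exec", "unicorn",
                     "-c", "config/unicorn.rb", "-E", "production", "-D")
    abort("unicorn failed to start") unless started

QUIT gives in-flight requests a chance to finish; TERM or INT would be a quicker but less graceful exit.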

Secondly, we received a CPU credit alarm. This is peculiar to the AWS t2 series of instance types, which use CPU credits to manage workload bursts; that model is a good fit for our database workload. It seems that the Multi-AZ failover process starts an instance with a low initial credit allocation. In our case, the RDS credit balance dropped instantly from ca. 125 to 30 and is now creeping back up at the usual rate.
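
If you want to confirm the drop, or watch the balance recover, the relevant CloudWatch metric is CPUCreditBalance in the AWS/RDS namespace. Here is a quick sketch using the AWS SDK for Ruby; the region and DB instance identifier are placeholders.

    # Fetch recent CPUCreditBalance datapoints for an RDS instance from
    # CloudWatch. Region and DB instance identifier are placeholders.
    require "aws-sdk-cloudwatch"

    cw = Aws::CloudWatch::Client.new(region: "eu-west-1")
    resp = cw.get_metric_statistics(
      namespace:   "AWS/RDS",
      metric_name: "CPUCreditBalance",
      dimensions:  [{ name: "DBInstanceIdentifier", value: "my-db-instance" }],
      start_time:  Time.now - 6 * 3600,   # last six hours
      end_time:    Time.now,
      period:      300,                   # five-minute datapoints
      statistics:  ["Average"]
    )

    resp.datapoints.sort_by(&:timestamp).each do |dp|
      puts "#{dp.timestamp}  #{dp.average.round(1)} credits"
    end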

Lessons

  1. Schedule a decent-sized maintenance window for your application. We chose a thirty-minute window and it only just fitted.
  2. Determine in advance whether your application requires a full stop & start to handle an RDS failover; see the DNS check sketched after this list.
  3. T2 instance users: don’t rely on having a high CPU credit balance. If you frequently burst and consume many credits, be aware that the restarted instance may have a low initial credit balance. Mitigate this by choosing a quiet time for the upgrade. If you never have (or cannot predict) a quiet time, t2 instances may not be for you.
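
On lesson 2, one way to see the problem coming is to watch the endpoint’s DNS record during the window: when a fresh lookup no longer matches the address resolved at the start, any long-lived process that cached the old address is now pointing at the wrong host. A minimal sketch, with a placeholder endpoint name:

    # Watch for the failover DNS flip: cache the first resolution of the RDS
    # endpoint, then report when a fresh lookup diverges from it.
    # The endpoint name below is a placeholder.
    require "resolv"

    ENDPOINT = "mydb.example.eu-west-1.rds.amazonaws.com".freeze

    initial = Resolv.getaddress(ENDPOINT)
    puts "initial resolution: #{initial}"

    loop do
      current = Resolv.getaddress(ENDPOINT)
      if current != initial
        puts "#{Time.now} endpoint now resolves to #{current} (was #{initial});"
        puts "restart anything that cached the old address"
        break
      end
      sleep 5
    end

Resolv performs its own lookups rather than going through the C library, so it reflects the current record even while a process holding a cached resolution does not.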