As a preview of a big upcoming change in BioGPS, we have been working for several months on migrating our database from Oracle to CouchDB. We are also moving our gene annotation to the cloud by hosting our CouchDB instance on Amazon EC2.

We have generally been extremely happy with the performance and stability of EC2 during our internal testing. However, last Thursday morning we experienced our first downtime incident, right in the middle of our weekly code review. We lost the connection to our EC2 server, while the “Amazon Management Console” still reported that it was running normally. Rebooting the server did not bring it back.

On Friday morning, we found our server was back to normal. Amazon did post an “informational message” for Thursday, May 6th, indicating that there was an issue during a matching time frame. But it is still listed under the “Service is operating normally” icon, and their 30-minute window was much shorter than the outage we observed. Other users also reported a similar problem at the EC2 forum (1, 2, 3).

All those green “operating normally” icons on the AWS status page are nice, but they are significantly less reassuring if smaller outages like this are flying under the radar. We have now put our EC2 instance under Nagios monitoring so we can collect our own data on its availability. Still, this incident has slightly slowed down our adoption of AWS.
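
For anyone who wants a similar independent check, here is a minimal sketch of a reachability probe in Python. It is not our actual Nagios configuration. The function name `check_tcp` is made up for illustration, and the only Nagios-specific part is the plugin exit-code convention (0 for OK, 2 for CRITICAL). It simply tries to open a TCP connection from outside and reports whether the server answered.

```python
import socket
import time

# Nagios plugin convention: exit code 0 means OK, 2 means CRITICAL.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection and return (status_code, message).

    Hypothetical helper for illustration; host and port are supplied
    by the caller.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return OK, f"OK - {host}:{port} answered in {elapsed:.3f}s"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

Calling `check_tcp("couchdb.example.org", 5984)` (a placeholder hostname, with CouchDB's default port) returns a status code and a one-line message. A wrapper script could print the message and exit with the code, which is the form Nagios expects from a plugin. A check like this only shows whether the port is reachable from where it runs, so it complements the AWS status page rather than replacing it.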

We’d be very interested to hear whether anybody else has had similar experiences. Do you have a backup plan in case AWS goes down?