Important information about an ongoing Mandrill (Mailchimp) email delivery outage
Incident Report for PayWhirl Inc.
Resolved
We've received an update from Mandrill and services have been fully restored. Here is the note from their team:

Our Engineering team has restored Mandrill’s operations. For most customers, features should be working as expected. However, there may be some residual effects in your account. We’ve included information on that below.

Due to the nature of this outage, it was difficult to determine which individual accounts were affected, so we’ve chosen to notify all Mandrill users of the outage, its effects, and our efforts to resolve it.

What happened and how we responded
For more background information on what caused the outage, please read our previous email to customers. The outage began around 10:30pm EST on Sunday, February 3, and since then our Engineering team has worked around the clock to diagnose the issue and develop a plan to fix it. We made code changes to work around the damaged database and preserve as much functionality for as many users as possible. We added machines, storage, and networking to support a variety of attempts to preserve all data and get back online. Ultimately, we decided that preserving all data would take too long, so we changed direction and truncated specific tables to get back online faster. These efforts were successful.

How we resolved it
We determined that some data was preventing necessary automatic processes from running. We deleted that data, which allowed those processes to successfully run and return the database to a usable state.

Residual effects
You may see some lingering effects in your account as a result of the outage. Stats and metrics that should have been tracked during the outage may be incorrect or missing entirely.

The Mandrill Team

Please let us know if you have any questions!
Team PayWhirl
Posted Feb 06, 2019 - 13:48 PST
Identified
This morning we were notified by one of our partners (Mandrill, which is owned by Mailchimp) that they have been experiencing an outage with their transactional email system.

PayWhirl uses this system to send emails to both businesses and their customers, so you may experience issues related to email delivery until they are able to resolve the matter. Below is a statement provided to us by Mandrill about the issues.

OFFICIAL ANNOUNCEMENT FROM MANDRILL:

We’re contacting you about an ongoing outage with the Mandrill app. This email provides background on what happened and how users are affected, what we’re doing to address the issue, and what’s next for our customers.

What happened
Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, one of our five physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, write activity is completely halted: the database puts itself into read-only mode until offline maintenance (known as vacuuming) can occur.

The database is large: running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track its progress.
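For readers curious about the mechanism: PostgreSQL stores transaction IDs (XIDs) as 32-bit counters and compares them modulo 2^32, so each XID treats roughly two billion IDs as "in the past" and roughly two billion as "in the future." The sketch below is our own simplified illustration modeled on that documented comparison rule; it is not Mandrill's code, and it assumes normal XIDs only (PostgreSQL's reserved special XIDs and frozen rows are ignored). It shows why a row that is never vacuumed can suddenly appear to be in the future, which is the data loss the read-only shutdown is meant to prevent.

```python
# Simplified illustration of PostgreSQL-style transaction ID (XID) comparison.
# Assumption: normal 32-bit XIDs only; reserved special XIDs (values below 3)
# and frozen rows are ignored in this sketch.

XID_MODULUS = 2**32  # XIDs are 32-bit counters that wrap around
HALF = 2**31         # each XID sees ~2^31 IDs as older and ~2^31 as newer


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID `a` is considered older than XID `b`."""
    diff = (a - b) % XID_MODULUS
    # Reinterpret the unsigned 32-bit difference as a signed value.
    if diff >= HALF:
        diff -= XID_MODULUS
    return diff < 0


if __name__ == "__main__":
    # Ordinary case: XID 100 was assigned before XID 200.
    print(xid_precedes(100, 200))              # True
    # Just after the counter wraps: XID 5 is newer than 4294967286.
    print(xid_precedes(XID_MODULUS - 10, 5))   # True
    # The hazard: a row written at XID 1000 and never frozen stops looking
    # "older" once the current XID is more than 2^31 ahead of it.
    print(xid_precedes(1000, 1000 + HALF + 1))  # False
```

Routine vacuuming marks old rows as frozen so they are no longer subject to this comparison; when a table falls too far behind, PostgreSQL stops assigning new XIDs rather than risk the comparison flipping.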

Customer impact
For users, the impact could include opens, clicks, bounces, email sends, inbound email, webhook events, and other activity not being tracked. Right now, the database outage appears to be affecting up to 20% of our outbound volume, as well as a majority of inbound email and webhooks.

What we’re doing to address this
We don’t have an estimated time for when the vacuum process and cleanup work will be complete. We have a parallel set of tasks underway to try to get the database back in working order, but these efforts are also slow and difficult with a database of this size. We’re doing everything we can to finish this process as quickly as possible, but it could take several days or longer. We hope to have more information and a timeline for resolution soon.

In the meantime, you may see errors related to sending and receiving email. We’ll continue to update you on our progress by email and let you know as soon as these issues are fully resolved.

What’s next
We apologize for the disruption to your business. You don’t need to take any action at this time. Again, we’re sorry for the interruption and we hope to have good news to share soon.

We will also keep you posted on any updates we receive from the Mandrill team. Please let us know if you have any questions in the meantime.

Team PayWhirl
Posted Feb 05, 2019 - 09:10 PST
This incident affected: Payment Widgets & Customer Portals.