High volume of errors in API
Incident Report for Ezypay
Postmortem

On Sunday 02/07/2023, the Ezypay platform was returning a high rate of errors through our partner-facing APIs. Our partners (and by extension, our Merchants) were impacted when attempting to initiate core operations, like creating a customer or processing a transaction. The issue was caused by significant performance bottlenecks on our primary databases, triggered by a maintenance event.

Issue Summary

The Ezypay platform consists of a micro-services architecture hosted on a global cloud platform, backed by relational SQL databases. In the early hours of 02/07/2023, our on-call engineers were notified of errors in our API and internal processes via our automated monitoring and alerting platforms. The team immediately pulled together to understand what was happening and set about identifying the root cause. The symptoms presented as database locks and starved connection pools in a few key services. The initial focus was on the application layer; in particular, any batch processes with expensive queries, and how connections were being managed. But we quickly realised the root cause of our problems: our cloud provider (we won’t mention the name, but it’s one of the ‘Big 3’ global cloud providers and it starts with A) had performed a major version upgrade on our primary database cluster.

To the best of our knowledge, we didn’t receive any notification or advance warning that this upgrade was going to occur. Our cloud provider has admitted that, through this program of database upgrades for our Relational Database Management System (RDBMS) of choice, there has been a breakdown in communication to Ezypay and to many other customers of theirs. According to their records, they sent us an advance warning in January 2022 that this upgrade would occur at some stage, with no subsequent follow-up notice sent to us in the following 18 months.

Our Response

Once we understood the root cause, we quickly jumped into action. Despite the timing of the upgrade being a surprise for us, we were somewhat prepared, as we had planned to perform this upgrade ourselves in the near future. We briefly considered trying to roll back to our previous version, but decided the best approach was to ‘lean into’ the upgrade. Our primary concern was that many of our more complicated database queries, along with our host- and cluster-level configuration, were tuned for the previous version of our RDBMS. We had executed the upgrade in a development environment as a pre-requisite to our Production upgrade, but hadn’t yet performed as much validation as we would have liked prior to the Production upgrade.

Still, we knew what to look for, and we had a run-sheet prepared, so we were confident, given the circumstances, that completing the remainder of the DB upgrade steps from our end was the right approach. The biggest concern, and our biggest delay in returning the platform to its fully operational state, was re-calculating statistics for query execution plans across our key transactional databases. Although the automated upgrade process (executed on our behalf) returned the databases to an ‘online’ state, the ‘stale’ query execution plans were causing significant performance bottlenecks on key processes, resulting in the DB locks, timeouts, and starved connection pools which were the primary symptoms. Along with re-calculating statistics, we also significantly scaled up our database resources to absorb any further performance differences.
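To illustrate the statistics step only (this is a generic sketch, not our actual tooling; it assumes a PostgreSQL-compatible engine and the psycopg2 driver, and the database names and host are placeholders):

    # Sketch only: refresh planner statistics on each key transactional database
    # so the post-upgrade optimiser stops relying on stale execution plans.
    import psycopg2

    DATABASES = ["customers", "payments", "subscriptions"]  # hypothetical database names

    for db_name in DATABASES:
        conn = psycopg2.connect(host="db.internal.example", dbname=db_name, user="maintenance")
        conn.autocommit = True  # run each statement outside an explicit transaction
        try:
            with conn.cursor() as cur:
                cur.execute("ANALYZE;")  # other engines use UPDATE STATISTICS / ANALYZE TABLE instead
        finally:
            conn.close()

The exact statement, and whether it runs per table or per database, varies between RDBMS vendors and versions.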

Finally, we spent some time carefully inspecting the batch processes that were interrupted and/or did not complete successfully. Whilst the majority of our batch processes are ‘re-entrant’, we wanted to validate batch process executions and outputs to ensure that there would be no double processing when we re-executed those ‘hung’ jobs. All internal batch processes, including subscription billing processes, were executed successfully on 02/07.
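For illustration, the ‘skip what was already done’ check at the heart of a re-entrant batch job has roughly the following shape; the job structure and helper names here are hypothetical and not our actual batch framework:

    # Sketch of a re-entrant re-execution: skip keys a previous partial run already processed.
    from dataclasses import dataclass, field

    @dataclass
    class BatchJob:
        job_id: str
        processed_keys: set = field(default_factory=set)  # e.g. transaction ids written before the hang

    def reexecute(job: BatchJob, pending_keys: list, process_one) -> int:
        replayed = 0
        for key in pending_keys:
            if key in job.processed_keys:
                continue  # already handled before the interruption: avoid double processing
            process_one(key)
            job.processed_keys.add(key)
            replayed += 1
        return replayed

In practice, the ‘already processed’ check was a validation of each job’s recorded executions and outputs, rather than an in-memory set.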

Our learnings and action items

We firmly believe that how we respond to incidents and outages is just as important as reducing the potential for incidents to occur in the first place; there is always the potential for something to go wrong. In this case, the event was triggered by an external party, our infrastructure provider, without sufficient warning. We don’t take pleasure in passing blame, and we don’t highlight this to soften the impact of this incident on our reputation. We are the first to admit that the majority of incidents with the Ezypay platform have been caused by an internal issue or process. We highlight this because we believe that no matter how robust or mature a platform and its processes are, incidents will still occur, and it’s how we respond that counts. In terms of our response, we got some things right, and we’ve identified some areas where we need to improve.

We are working with our cloud provider to ensure that this doesn’t happen again, which includes re-validating our communication channels for planned maintenance activities and opening further channels with their teams. We’re also refining our playbooks for Root Cause Analysis (RCA), with a focus on the learnings from this incident - our RCA was not as efficient as we would have liked.

One of our bigger learnings from this incident was that we need to better understand how our partners handle interruptions to our service, and how the responses we provide back to our partners may alter their execution path. The majority of our partners averted significant impact to our shared customers by automatically retrying failed API calls to create customers or process transactions. In some cases, a subsequent batch process on the partner platform re-submitted API calls that had failed during the incident window. In other cases, a partner platform continually retried (with exponential backoff) until the API call was successful. In some cases, the partner platform would retry automatically if it received one particular 5xx response code (503 Service Unavailable), but would not automatically retry on any other 5xx response code.
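To illustrate the pattern (a generic sketch, not any specific partner’s integration or an official Ezypay client; the URL, token, and retry policy are placeholders), a retry loop that backs off exponentially and treats the whole 5xx range as retryable looks roughly like this:

    # Sketch only: retry a failed call with exponential backoff on any 5xx response.
    import time
    import requests

    RETRYABLE = {500, 502, 503, 504}  # broader than retrying on 503 alone

    def create_customer_with_retry(payload: dict, max_attempts: int = 5) -> requests.Response:
        for attempt in range(max_attempts):
            resp = requests.post(
                "https://api.example.com/v2/customers",  # placeholder URL
                json=payload,
                headers={"Authorization": "Bearer <token>"},  # placeholder credentials
                timeout=10,
            )
            if resp.status_code not in RETRYABLE:
                return resp  # success, or a non-retryable error such as a 4xx
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
        return resp

Retries on ‘create’-style calls should be paired with an idempotency check or a reconciliation step, which is part of why the transaction lists described below mattered.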

One important part of our response to each partner was to provide a list of the transactions that were successfully processed by Ezypay throughout the incident window. We did this to allow partners to reconcile our list with the transactions that should have been processed by Ezypay, according to their platform. This was particularly important for those partners who utilise Ezypay in an ‘on-demand’ model, where they provide us with the details of a transaction when it is due to be processed according to the partner platform’s schedule. For partners that utilise Ezypay in a ‘Subscription’ model, where Ezypay holds the schedule and processes transactions accordingly, our scheduled billing processes were executed successfully, resulting in no interruption to scheduled billing.
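As a simplified illustration (the inputs and field names are hypothetical), the reconciliation itself reduces to a comparison of transaction references between the two platforms:

    # Sketch only: compare Ezypay's processed list with the partner's expected list.
    def reconcile(processed_by_ezypay: set, expected_by_partner: set):
        missing = expected_by_partner - processed_by_ezypay     # expected but not processed: re-submit or investigate
        unexpected = processed_by_ezypay - expected_by_partner  # processed but not expected: check for duplicates
        return missing, unexpected

Anything in the ‘missing’ set is a candidate for re-submission; anything in the ‘unexpected’ set needs to be checked before any retry, to avoid double charging a customer.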

What we can improve on in this area is providing a better breakdown, to all partners, of the interaction between a partner and the Ezypay platform throughout an incident window. Through our monitoring and logging platforms, we have visibility of the traffic coming from our partners, and how we’re responding to their requests. In certain cases following this recent incident, we shared some insights from this visibility with the partner in question, which helped validate and reconcile transactions between the two platforms. Whilst our partners are modern tech companies like us, and likely have their own great visibility into their integration with us, we have a strong focus on making things ‘Ezy’ for our partners and merchants, which extends to identifying and recovering from issues. In addition, we can trace API calls from our perimeter through to our internal systems, and out to banking and payment partners, so we can provide additional context that is not always available to a partner.

Finally, we can improve on how we handle communications to Merchants. This is a challenging space for Ezypay - some of our partners prefer to handle communications about Ezypay interruptions to their merchants themselves, while others prefer that Ezypay handle those communications. The same applies to some of our larger merchants, who would prefer to handle internal communications to their users and stakeholders themselves. In the case of this incident, this was further compounded by the fact that in many cases Ezypay did not fully understand what the impact to Merchants would be. Some of our ‘on-demand’ partners have billing cycles that include Sunday, while others do not. As covered earlier, many of them retry automatically, while others do not. So it was difficult for Ezypay to provide accurate information to merchants on what the impact would be.

In closing

We apologise for the impact this has caused for our partners and merchants. We have identified some learnings and action items from this incident, particularly around how we can streamline the activities occurring within, and following, a major platform outage.

In terms of Partner and Merchant communication relating to incidents and outages, we are focused on driving communications from our status page (Ezypay Status). We encourage all of our Partners and Merchants to subscribe to this page so that they are pro-actively alerted when there is a service disruption or outage on the Ezypay platform.

Posted Jul 12, 2023 - 11:36 AEST

Resolved
The issue has been resolved. We will continue to monitor closely to ensure that there are no downstream issues.
Posted Jul 02, 2023 - 15:04 AEST
Monitoring
Our engineers have implemented a fix, and the platform is now operational. We will be monitoring closely to ensure there are no further issues.
Posted Jul 02, 2023 - 13:16 AEST
Update
We have identified the root cause as maintenance activity on our database layer which was applied by our hosting provider. We are working with our hosting provider to resolve the issue.
Posted Jul 02, 2023 - 11:11 AEST
Identified
We have identified the source of the issue and we are working on a resolution.
Posted Jul 02, 2023 - 09:33 AEST
Investigating
We are currently investigating a high volume of errors with our public API. Our teams are fully mobilised and investigating, and we will provide updates here on a regular basis.
Posted Jul 02, 2023 - 08:15 AEST
This incident affected: Platform Services (Authentication Services, Webhook Notifications, Merchant reporting), Customer & Payment Method Management (Customer payment method management, Customer Management, Plan & Subscription Management), Backend Payments & Settlement (Card Billing, Bank Direct Debit billing, PayTo Billing, Merchant Settlement), Payment Initiation (Ezypay scheduled billing initiation, Partner initiated on-demand billing), and Customer one-off payment (checkout.ezypay.com, paynow.ezypay.com).