Duplicated Billing in AU Direct Debit/Bank Transactions
Incident Report for Ezypay
Postmortem

On Wednesday, 17/05/2023, anomalies were detected in our Direct Debit billing processes, affecting Australian Merchants' customers. For a number of customers, regular subscription payments were processed twice, and in some cases four times. This issue affected bank direct debit transactions within Australia only, so it affected only a subset of our platform. Even so, we take any issue affecting our billing and settlement processes extremely seriously, and reacted accordingly.

Issue Summary

To explain how this issue occurred, we need to describe a little about our internal batch billing processes. Ezypay’s platform is built on a flexible architecture consisting of many different services, all talking to each other to deliver the functionality our platform offers (a cloud-hosted micro-services model). We follow modern, Agile practices, making smaller, more frequent changes to our services, which can be released at any time. When changes are released, an automated process spins up a new version of a particular service alongside the existing instance, waits until the new instance has started and is ready to process, then shuts down the old instance, allowing it to complete its processing. This ‘deployment strategy’ is known as a ‘Rolling Update’.

Some of our services, in particular the batch-oriented services that are required for legacy payment processes like Direct Debit, must run only one active instance at any time. This is primarily due to the asynchronous, batch/file-based nature of transaction processing on the legacy BECS platform which underpins current Direct Debit billing in Australian banks. Card-based transactions and Real-Time Payment (RTP) platforms (for example, PayTo) don’t suffer the same challenges associated with processing transactions in batches.

Our batch processes have functionality to ensure that only one active instance is running; before processing any batch, they acquire a ‘lock’ to ensure no other instance tries to process the same batch. On 17/05/2023 at approximately 11:55am AEST, a scheduled deployment was initiated, which included one of our batch billing services. Unfortunately, due to a particular type of exception encountered when acquiring the distributed lock, and a flaw in how this exception was handled, there were 2 instances of this batch process running for a brief period of time.
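
To illustrate the kind of flaw described above, here is a minimal Python sketch of a "fail closed" lock acquisition, where any error while acquiring the lock is treated as "lock not held" and the batch is skipped. It is not Ezypay's actual code: the names `LockUnavailable`, `acquire_or_skip`, `process_batch` and the `try_acquire` callable are all hypothetical and stand in for whatever distributed lock the service really uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("batch_lock")


class LockUnavailable(Exception):
    """Hypothetical error raised when the lock store cannot be reached."""


def acquire_or_skip(try_acquire: Callable[[str], bool], batch_id: str) -> bool:
    """Return True only when the lock was definitely acquired.

    Any exception from the lock backend is treated as "lock not held",
    so an error during acquisition can never let a second instance proceed.
    """
    try:
        return try_acquire(batch_id) is True
    except Exception:
        logger.exception("Lock acquisition failed for %s; skipping batch", batch_id)
        return False


def process_batch(
    try_acquire: Callable[[str], bool],
    batch_id: str,
    run: Callable[[str], None],
) -> str:
    """Run the batch only while holding the lock; otherwise skip it."""
    if not acquire_or_skip(try_acquire, batch_id):
        return "skipped"
    run(batch_id)
    return "processed"
```

The design choice in this sketch is that a skipped batch can be retried later, whereas a batch processed by two instances cannot easily be undone.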

Our automated monitoring alerted us to an issue in our batch billing process, and the team started investigating at 12:10pm AEST. The initial focus was on a particular batch (let’s call it batch B) of Direct Debit transactions which could not be written to a file, because 2 instances were trying to write to the file simultaneously. The team worked through that issue and regenerated the file via the platform, to minimise interruptions to scheduled billing. Unknown to the team, a previous batch process (batch A) had been able to complete and successfully write its batch file, which contained duplicated transactions. We have multiple reconciliation processes in place, and the primary one is performed on Direct Debit batch files prior to processing with our banking partner. For batch A, this primary process was overlooked because the team was taking extra precautionary measures in reconciling batch B. Subsequent reconciliation processes detected the anomaly.
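
As an illustration of what a pre-submission reconciliation check can look like, the following Python sketch compares the count and total of a batch file against what the system expected to bill. It is an assumption-laden example, not a description of Ezypay's process: the `Debit` record, the `totals` and `reconcile` functions, and the sample amounts are all hypothetical.

```python
from decimal import Decimal
from typing import Iterable, List, NamedTuple, Tuple


class Debit(NamedTuple):
    """Hypothetical record for one direct debit line."""
    customer_id: str
    amount: Decimal


def totals(debits: Iterable[Debit]) -> Tuple[int, Decimal]:
    """Return (number of debits, sum of amounts)."""
    items = list(debits)
    return len(items), sum((d.amount for d in items), Decimal("0"))


def reconcile(scheduled: Iterable[Debit], batch_file: Iterable[Debit]) -> List[str]:
    """Return a list of discrepancies; an empty list means the file matches."""
    problems: List[str] = []
    exp_count, exp_total = totals(scheduled)
    got_count, got_total = totals(batch_file)
    if got_count != exp_count:
        problems.append(
            f"count mismatch: expected {exp_count}, file has {got_count}"
        )
    if got_total != exp_total:
        problems.append(
            f"total mismatch: expected {exp_total}, file has {got_total}"
        )
    return problems


# Usage: a file containing every debit twice fails both checks.
scheduled = [Debit("c1", Decimal("50.00")), Debit("c2", Decimal("30.00"))]
for problem in reconcile(scheduled, scheduled + scheduled):
    print(problem)
```

A file that fails either check would be held back from submission to the banking partner until someone has reviewed it.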

Our Response

Once we were alerted that duplicate transactions had been processed, we immediately identified a course of action which we believed was in the best interests of our Partners, Merchants and Customers: to communicate with every Merchant and Customer who was impacted, and to process a ‘Refund’ as soon as possible. We began by providing an update on our status page (Ezypay Status), which is always our first method of communicating service impacts to Partners and Merchants, whom we encourage to self-subscribe to updates so they are notified immediately when there are outages or impacts. We then spent some time analysing the data to clearly identify which customers and merchants were impacted, before communicating with the Partners, Merchants and Customers concerned. Within 24 hours we had sent communications to each affected party, and had issued a full ‘Refund’ to each impacted customer.

Technically this wasn’t a traditional refund: we weren’t reimbursing customers for a transaction that had been processed. One of the challenges of the traditional BECS DE processing platform that is used for processing Direct Debit transactions is that it can take 3 days (and longer in some cases) to know whether a transaction has been successful. In order to minimise friction for our merchants, and provide the best customer experience, we processed a full reverse transaction for each duplicated transaction within 24 hours. We genuinely took this action in the interests of Merchants and Customers, knowing that a proportion of the original transactions would fail due to insufficient funds and other decline codes. In these scenarios, customers receive a deposit from Ezypay as reimbursement for a transaction that was ultimately unsuccessful. In effect we have credited their account rather than debited it, and we are then required to attempt to recover the surplus amount. We firmly believe that this was the best outcome for the majority of customers who honour their direct debit agreements.

One of our core principles relating to our incident management processes is to communicate as early as possible. It’s one of the reasons why we have a public status page (Ezypay Status). This incident has highlighted the complexity we face in trying to communicate with Partners, Merchants (including Merchant hierarchies with Head Office and franchise merchants), and Merchants' Customers. The information wasn’t communicated to the standard to which we hold ourselves accountable: there was a lack of detail around what the next steps were and what Merchants and their customers may see and experience in partner platforms, and the dissemination of information wasn’t as fast or accurate as it could have been.

Our learnings and action items

We are proud of our response; issuing communications and a full refund to each affected customer within 24 hours is an industry-leading response. However, we made some mistakes and identified some areas for improvement, both in our response to the incident and in the root cause that allowed the incident to occur in the first place. Below is a summary of the action items that our teams are focused on implementing or improving:

  • System Batch Process Improvements

    • Ensure concurrency locks are resilient to all failures, and ensure at-most-once processing where required
    • Add additional alerting for duplicated transactions in batch staging jobs
    • Have blackout periods for Production deployments around critical batch processes
  • Operational Process Improvements

    • Increased visibility of System and Bank File totals for secondary reconciliation processes
    • Additional automated reconciliation process for bank file transactions
    • Internal staff training
  • Incident Response Improvements

    • Streamlined process for communicating directly to subsets of customers and merchants to reduce communication delays
    • Agreed processes with Partners and Merchants on coordination of communications to Merchants and Customers
    • Process improvements and training around communicating Personally Identifiable Information (PII)
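
Two of the System Batch Process items above (alerting on duplicated transactions in staging, and deployment blackout periods) lend themselves to a short illustration. The Python sketch below is one possible shape for such checks and is not Ezypay's implementation: the row fields (`customer_id`, `subscription_id`, `due_date`), the function names, and the 11:00 to 13:00 window are invented for the example.

```python
from collections import Counter
from datetime import time
from typing import Dict, Iterable, List, Mapping, Tuple

Key = Tuple[str, str, str]


def find_duplicates(rows: Iterable[Mapping[str, str]]) -> Dict[Key, int]:
    """Return keys that appear more than once in a staged batch.

    A key here is (customer, subscription, due date): the same subscription
    should be billed at most once per due date.
    """
    counts = Counter(
        (r["customer_id"], r["subscription_id"], r["due_date"]) for r in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


def in_blackout(now: time, windows: List[Tuple[time, time]]) -> bool:
    """Return True if `now` falls inside any [start, end) blackout window.

    Windows are assumed not to cross midnight.
    """
    return any(start <= now < end for start, end in windows)


# Usage with made-up data: one subscription staged twice for the same date.
staged = [
    {"customer_id": "c1", "subscription_id": "s1", "due_date": "2023-05-17"},
    {"customer_id": "c1", "subscription_id": "s1", "due_date": "2023-05-17"},
    {"customer_id": "c2", "subscription_id": "s9", "due_date": "2023-05-17"},
]
print(find_duplicates(staged))

# Hypothetical blackout window around a midday batch run.
print(in_blackout(time(11, 55), [(time(11, 0), time(13, 0))]))
```

In practice the duplicate check would raise an alert before the batch file is written, and the blackout check would sit in the deployment pipeline so that a release is deferred rather than started during a critical batch run.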

In closing

We wish to apologise for the impact this has caused. We understand that events like this are extremely frustrating, and have a huge impact on our Merchants and their customers. We believe that how we respond to incidents is critical, and whilst we are proud of our overall response, including the turn-around time for communications and refunds, we acknowledge that some mistakes were made, and some processes weren’t as efficient or accurate as they could have been. We will do everything we can to learn from this event and to improve our processes going forward.

Posted May 24, 2023 - 12:40 AEST

Resolved
Refunds have been finalised with our banking partner. Refunds may take up to 3 business days to appear in customers' accounts.
Posted May 17, 2023 - 17:27 AEST
Update
We are currently finalising our refund processes with our banking partner. Refunds may take up to 3 business days to appear in customers' accounts.
Posted May 17, 2023 - 14:32 AEST
Update
We are in the process of preparing full refunds for all customers affected.
Posted May 17, 2023 - 08:29 AEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 16, 2023 - 20:52 AEST
Investigating
We have detected some instances of duplicated transactions in customer billing for Direct Debit (Bank) transactions for Australian customers that were processed today, 16th May. This was caused by an isolated error in our direct debit banking processes, and the root cause has been resolved. Full refunds will be issued for all duplicated transactions as soon as possible.
Posted May 16, 2023 - 20:51 AEST
This incident affected: Backend Payments & Settlement (Bank Direct Debit billing).