By Tom Long, Senior Vice President of Software Quality Assurance at oneZero

Chaos testing is a technique that helps software developers and engineers identify potential weaknesses in a system by intentionally creating chaos. How many times have you heard that it’s never going to be used like that in production? How many times have you experienced serious production issues because the application was used in ways nobody was expecting?

Chaos testing is essential when input to the platform is unpredictable or grows with scale. Its essentials are defining steady states, creating a control group and a test group, introducing chaos, monitoring and repeating, and ensuring minimal impact on end users.

At oneZero, we have developed a chaos testing approach for our core application, the Hub. This involves testing with trade volumes 100x those of the busiest day in production, while the Hub is forced to randomly crash, restart, and recover every few minutes. Despite the intentionally created chaos, the oneZero Hub recovers, performs reliably, and maintains its core trade and quote processing without any data loss.
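As a toy illustration of crashing a service at random and then checking for data loss, here is a minimal sketch. Everything in it is hypothetical (the `JournaledProcessor` class, the in-memory journal standing in for durable storage, the crash probability) and it is not a description of how the Hub works. It only shows the shape of such a test: inject crashes, let the process recover, then assert that every trade survived.

```python
import random


class JournaledProcessor:
    """Hypothetical trade processor that journals each trade before applying it."""

    def __init__(self, journal):
        self.journal = journal      # stands in for durable storage
        self.positions = {}         # volatile state, lost on a crash
        self.recover()

    def recover(self):
        # Rebuild volatile state by replaying every journaled trade.
        self.positions = {}
        for symbol, qty in self.journal:
            self.positions[symbol] = self.positions.get(symbol, 0) + qty

    def process(self, symbol, qty):
        self.journal.append((symbol, qty))   # journal first, then apply
        self.positions[symbol] = self.positions.get(symbol, 0) + qty


def run_chaos(trades, crash_probability, seed=0):
    """Feed trades through the processor, randomly crashing and restarting it."""
    rng = random.Random(seed)
    journal = []
    processor = JournaledProcessor(journal)
    crashes = 0
    for symbol, qty in trades:
        if rng.random() < crash_probability:
            # Simulated crash: throw the process away and restart from the journal.
            processor = JournaledProcessor(journal)
            crashes += 1
        processor.process(symbol, qty)
    return processor.positions, crashes


def expected_positions(trades):
    """Positions a crash-free run would produce."""
    totals = {}
    for symbol, qty in trades:
        totals[symbol] = totals.get(symbol, 0) + qty
    return totals


trades = [("EURUSD", 1), ("GBPUSD", -2), ("EURUSD", 3)] * 1000
positions, crashes = run_chaos(trades, crash_probability=0.05)
assert positions == expected_positions(trades)   # no data loss despite crashes
```

In this sketch crashes only happen between trades, which makes recovery easy. A more realistic test would also crash in the middle of a write and under heavy load.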

Chaos testing is a valuable technique for ensuring that software can handle unexpected situations and remain stable in the face of chaos, which in our case results from the ever-increasing complexity of our Hub and the exponential growth of our business over the past 10 years. Our reality at oneZero is that our application has to scale over time: the testing that supports it may be fine today, but it will need to adapt continuously for tomorrow’s business. Our approach provides a robust testing framework for financial institutions and other businesses that require high levels of reliability and resilience in their software applications.

We operate in financial markets, where our clients require their systems to perform at a predictable baseline even as the pace of markets increases. The funny aspect is that 99% of the time it looks like we are running too much hardware relative to the load on the Hub. But in a matter of minutes, major news or market events can trigger massive activity on our systems, and our ability to perform through this chaos is what differentiates our technology and makes our solution the optimal choice for clients.

Keys to success – questions:

1. Engagement – At what level does your executive leadership invest and engage in quality?
2. Realism – How well do you understand how key clients use the system?
3. Focus – How well can you recreate a client environment in test and test as a client?
4. Scope – Are you focused on core features?
5. Support – Are test tools core application features with their own agile team?
6. Process – How do you fit performance changes into your release schedule?
7. Feedback – Are you prepared for more chaos?

There are no right answers, but chaos testing works at oneZero because we understand what our chaos is and because the conditions are right for it.

The standard testing pyramid tends to look like this:

Chaos testing is about combining all your testing capabilities into a single set of tests. Execute them all at once, and measure the performance and resilience of the application. Then, review and identify areas to improve on. A key to the review process is being able to decouple the set of tests into subsets of targeted tests that help narrow down the root cause of any issue. Fix or reconfigure, rerun those tests, and remeasure performance and resiliency until you meet the standards needed for the future.

oneZero’s technology is complex. We process (in some cases) millions of state updates a second across thousands of financial instruments. This diagram shows our solution for institutional brokers, with the added vendor and internal testing solutions that we have put into place.

We believe that chaos testing is about stability, recovery and resilience.

oneZero’s chaos testing essentials are as follows:

  • Define the steady-states (Quotes and Trades)
    • Biggest most complicated clients
  • Create a control group and a test group
    • Control group = production clients
    • Test group = test environment client setup
  • Introduce Chaos
    • Trades (100x current average busiest day)
    • Replay (a very busy day in the life of a client)
    • Play all quotes and orders (the multiverse test)
    • Exception test (busy day + HW & SW failures)
  • Monitor Results
    • Compare test and control group
      • What’s better? What’s worse?
      • What’s the next problem?
  • Ensure minimal impact to end users
    • Fix bugs ASAP and release when ready

Ideally, you have combined several distinct tests into a single “Big Bang” test. When it identifies issues, it can be decoupled into individual tests that are run to narrow down and identify the root cause.
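One way to picture the decoupling step is the sketch below. It assumes a hypothetical `run_together` callable that executes a list of named tests as one run and returns `True` when the run meets the required standard. The test names echo the chaos types listed above, and `fake_run` is only a stand-in for a real test harness.

```python
def isolate(test_names, run_together):
    """Run all tests as one combined run, then decouple on failure.

    Returns a dict with a status and a list of suspect tests:
    - "pass": the combined run met the standard
    - "isolated": one or more tests also fail when run alone
    - "interaction": every test passes alone, so the combination is the problem
    """
    names = list(test_names)
    if run_together(names):
        return {"status": "pass", "suspects": []}
    # Decouple: rerun each targeted test on its own.
    failing = [name for name in names if not run_together([name])]
    if failing:
        return {"status": "isolated", "suspects": failing}
    return {"status": "interaction", "suspects": names}


def fake_run(names):
    # Hypothetical stand-in for a real run: fails whenever the
    # exception test is part of the run.
    return "exception_test" not in names


result = isolate(["load_100x", "replay", "multiverse", "exception_test"], fake_run)
print(result)
```

The "interaction" case matters for chaos testing in particular, because a combined run can fail even when every individual test passes on its own.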

To summarize:

Keys to Success – Answers…

  1. Engagement – Performance and quality are among the core technical focus areas of oneZero’s CTO.
  2. Realism – Chaos tests are based on client configurations and client Hub usage.
  3. Focus – We have QA dedicated to understanding how to set up and test client setups.
  4. Scope – Trade and quotes are the core focus, but not the only risk.
  5. Support – Test tools are core features supported by product management and a scrum team.
  6. Process – Key performance code changes are treated as a separate release; bug fixes are merged into standard releases.
  7. Feedback – The oneZero team tracks production issues and adds more automation and tools for future chaos. It’s a reason why we receive lots of industry awards.

The Hub can process > 100x current peak demand. Today’s chaos will be tomorrow’s business as usual (BaU): at 93% YoY growth, that level of load will be BaU performance in 5-7 years. You need to plan for future chaos now, as it’s only going to get more challenging.
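The arithmetic behind that estimate can be checked with a few lines of Python. The sketch assumes that the 93% growth compounds annually and applies directly to peak demand. Both are simplifying assumptions made for the example, not statements from oneZero.

```python
import math

growth_rate = 0.93   # 93% year-over-year growth, from the text
headroom = 100       # capacity as a multiple of current peak, from the text

# Solve (1 + growth_rate) ** years == headroom for years.
years_to_exhaust = math.log(headroom) / math.log(1 + growth_rate)

for year in range(1, 8):
    multiple = (1 + growth_rate) ** year
    print(f"year {year}: demand is {multiple:.1f}x today's peak")
print(f"{headroom}x headroom is used up after about {years_to_exhaust:.1f} years")
```

Under these assumptions demand is roughly 27x today’s peak after five years and reaches 100x at about seven years, the upper end of the 5-7 year range.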

Key questions I’d be asking myself if I were considering this: Do we have areas of the system with unpredictable and scaling growth and demand? Is our executive team engaged, and does it understand the need to test future scale (at the expense of resources for “new features testing” and typical regression)? It’s not for everyone, but when you do have these types of challenges, it’s an essential role the QA team can play in the future growth of a business.

---

 

Tom Long is Senior Vice President of Software Quality Assurance at oneZero. Tom started in QA as an IBM Co-op testing the first networked PCs. He then worked for five and a half years in Nepal and South Korea in education, before getting two degrees from Harvard in education and technology. He then got back into QA and implemented a test automation framework that led to his first QA management role. He went on to manage the QA and DevOps team of a rapidly growing start-up, building it into what is now one of the leading FSA Benefits Card platforms, and then moved into senior QA leadership roles in large financial companies and teams involved in Employee Benefits, Custody, Investor Services, Mutual Fund pricing, ETF, TA, etc. The common thread through all of these experiences has been the need for testing with a strategic client focus, a continuous improvement mindset and open communication and collaboration with internal teams and external stakeholders.