
From seeing to observing: What Sherlock Holmes can teach us about observability
Holmes famously quipped, “You see, but you do not observe.” These words aren’t just useful for solving a murder; they hold up a mirror to our tendency to overlook the intricacies hidden in plain sight.
In the realm of modern financial technology, milliseconds can make or break a transaction. We’re dealing with complex systems, so identifying and understanding what is working (or not working) is a constant challenge for us as QA professionals.
Which brings us to observability. The concept of observability is grounded in systems theory: it involves understanding a complex system by analyzing its output. It’s the capacity to monitor and grasp a system’s behavior from the outside, offering insights into its inner workings, even without direct access to those internal elements.
It’s a bit like the climate (a non-digital complex system). Even if we don’t know exactly what the complex mechanics are that shape our weather, we can measure outputs like temperature or rainfall and use those to understand what’s going on within the system.
Clearly, observability is critical for us, but it’s tricky to get right. One of the key challenges we face is the amount of data that a complex digital system generates. The sheer volume makes it hard to separate meaningful insights from the noise.
That’s why at OpenPayd, we’ve taken it a step further, by embracing the 80/20 rule for observability.
Let’s first delve into the 80/20 rule, the principle that rescues us from being overwhelmed by information.
The 80/20 rule, also known as the Pareto Principle, states that roughly 80% of effects come from 20% of causes. A common example might be that a company sees 80% of its revenue coming from 20% of its customers.
For me as a QA professional, the 80/20 rule is a reminder that when we’re identifying the root causes of problems, some causes will be responsible for far more problems than others. Instead of getting lost in the entirety of the technology stack, we should focus on the causes behind the largest number of problems.
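To make the idea concrete, here is a minimal sketch of a Pareto analysis in Python. It assumes we already have a tally of incidents per root cause; the cause names and counts below are invented for illustration, and the function name `vital_few` is our own label, not part of any tool.

```python
def vital_few(cause_counts, threshold=0.8):
    """Return the causes, largest first, whose combined share of all
    incidents first reaches `threshold` (0.8 for the 80/20 rule)."""
    total = sum(cause_counts.values())
    if total == 0:
        return []
    selected = []
    running = 0
    # Walk the causes from most to least frequent.
    ranked = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ranked:
        selected.append(cause)
        running += count
        if running / total >= threshold:
            break
    return selected


# Hypothetical incident tally (illustrative numbers only).
incidents = {
    "timeout-to-bank-api": 48,
    "stale-cache": 34,
    "bad-config": 7,
    "schema-mismatch": 5,
    "flaky-test-env": 4,
    "clock-skew": 2,
}

print(vital_few(incidents))  # ['timeout-to-bank-api', 'stale-cache']
```

In this made-up tally, two of the six causes account for 82% of the incidents, so those two are where investigation time is best spent first.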
That’s the theory. There are a few key tools we use to bring this to life: Kafka Lag Exporter, Kubernetes overview, RabbitMQ metrics, and K6 performance testing.
Kafka Lag Exporter: Bridging the Visibility Gap
In the dynamic landscape of real-time transactions, Kafka has emerged as a cornerstone technology.
Kafka is an open-source event streaming platform designed for real-time data processing and building scalable, fault-tolerant data pipelines.
Yet, keeping track of Kafka consumer lag can be challenging. Our solution to this challenge was the Kafka Lag Exporter. This tool provides real-time insights into the lag between Kafka producers and consumers, giving a clear view of how well the system is keeping up with incoming data.
However, the Kafka Lag Exporter goes beyond mere monitoring. It fits perfectly into the 80/20 rule for observability. Instead of drowning in vast amounts of Kafka data, it lets us focus on the 20% of metrics that truly matter. By doing so, we can not only keep our Kafka clusters in check but also optimize our observability efforts for maximum impact.
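For readers unfamiliar with the metric, consumer lag is conventionally the gap between the latest offset written to a partition and the offset the consumer group has committed. The Python sketch below illustrates only that arithmetic on hand-written numbers; it does not call Kafka or the Kafka Lag Exporter, and the topic name, offsets, and function name are assumptions made for the example.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Lag per partition: latest offset in the log minus the offset
    the consumer group has committed.

    Simplification for this sketch: a partition with no committed
    offset is treated as committed at 0.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical offsets, keyed by (topic, partition).
end_offsets = {("payments", 0): 1500, ("payments", 1): 1480, ("payments", 2): 1510}
committed = {("payments", 0): 1500, ("payments", 1): 1200, ("payments", 2): 1490}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
worst_partition = max(lag, key=lag.get)

print(total_lag)        # 300
print(worst_partition)  # ('payments', 1)
```

A single number like the total lag, plus the worst partition, is the kind of small metric set that tells us quickly whether consumers are keeping up.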
RabbitMQ: Queuing up for Success
Efficient message queuing is paramount in the fintech realm.
At OpenPayd, we use RabbitMQ: an open-source message broker that facilitates the efficient exchange of messages between different systems and applications.
RabbitMQ plays a vital role in OpenPayd’s architecture. Yet ensuring that RabbitMQ performs well is a multifaceted challenge: it requires both careful message management and continuous monitoring, each of which affects the overall efficiency of the messaging system. We tackle this by focusing on the key RabbitMQ metrics that align with the 80/20 rule.
By monitoring critical aspects like queue depths, message rates, and consumer acknowledgements, we gain a holistic view of our RabbitMQ instances. This strategic approach to observability helps us address potential bottlenecks proactively and maintain the integrity of the messaging infrastructure.
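As a rough illustration of how these three signals can be combined, the Python sketch below flags queues that look unhealthy. It works on hand-written snapshots rather than a live broker; the `QueueSnapshot` fields, the queue names, and the depth limit of 1000 are all assumptions chosen for the example, not values from our production setup.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    name: str
    depth: int           # messages currently waiting in the queue
    publish_rate: float  # messages per second arriving
    ack_rate: float      # messages per second acknowledged by consumers


def queues_at_risk(snapshots, max_depth=1000):
    """Return (queue name, reasons) for every queue that is too deep
    or whose consumers acknowledge more slowly than messages arrive."""
    at_risk = []
    for q in snapshots:
        reasons = []
        if q.depth > max_depth:
            reasons.append("depth")
        if q.publish_rate > q.ack_rate:
            reasons.append("falling_behind")
        if reasons:
            at_risk.append((q.name, reasons))
    return at_risk


# Hypothetical snapshots (illustrative numbers only).
snapshots = [
    QueueSnapshot("payments.inbound", depth=120, publish_rate=50.0, ack_rate=52.0),
    QueueSnapshot("notifications", depth=4200, publish_rate=80.0, ack_rate=35.0),
    QueueSnapshot("ledger.sync", depth=300, publish_rate=20.0, ack_rate=12.0),
]

for name, reasons in queues_at_risk(snapshots):
    print(name, reasons)
```

Here the queue that is both deep and falling behind stands out immediately, while a healthy queue produces no output at all.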
K6: Paving the Road to Peak Performance
Performance testing is a non-negotiable part of our quality assurance. K6, a powerful open-source load testing tool, is our preferred solution because it provides the scalability and flexibility needed to simulate real-world traffic scenarios. That lets us evaluate how our system performs under changing conditions, identify bottlenecks, and ultimately deliver a more reliable and resilient platform for our users.
And again, the 80/20 rule comes into play, reshaping how we approach performance testing. Rather than getting entangled in exhaustive testing scenarios, the OpenPayd QA team focuses on the crucial 20% of use cases that can significantly impact our platform’s responsiveness. This targeted approach not only saves time but also keeps our performance testing aligned with real-world usage.
– –
Our strategy offers a new perspective on observability, guided by the 80/20 rule. By focusing on the vital 20% of metrics, tools like Kafka Lag Exporter, Kubernetes, RabbitMQ, and K6 are transformed from data sources into strategic enablers of greater platform performance.
In a world where drowning in data is all too easy, our approach is grounded in efficiency and effectiveness. By looking beyond the surface and embracing the core principles of observability, we are not only navigating the complexities of financial technology, but also thriving in its ever-evolving landscape.