It’s Time for Developer-Driven Reliability

Ten years on from “software is eating the world,” it’s safe to say we live in a new digital age.

Today’s businesses, from banks to hospitals to transportation and telecommunications providers, rely upon digital services to power the infrastructure behind everyday modern life.

This new world has been made possible by rapid advances in how developers build and deliver software. Continuous delivery and infrastructure as code (IaC) have helped developers deliver new features faster than ever before. Cloud native architectures have led to globally distributed services that scale near-instantaneously to meet user demand.

But with this new world comes a whole new set of challenges. Now that nearly all software is consumed as a service, building and shipping rapidly is no longer sufficient. Companies must also ensure their services are stable, available and performant. In short, services must be reliable.

Reliability is broken, and traditional IT can’t fix it

Reliability is no easy feat, however. When a service goes down, the advances in development and deployment are nowhere to be found. On-call tools automatically route alerts to the appropriate engineers, and IT service management or ticketing systems keep track of the incident response workflow. But the work of investigating an error (identifying the root cause, determining the steps to fix it, and then performing those steps) remains mostly manual. Moreover, what automation does exist is driven by business rules configured in a management console, siloed away from the code that powers the rest of the application lifecycle. This manual work, siloed information and IT-first processes slow down remediation, keeping services down and customers affected for far longer than necessary.

Advances in DevOps technology have not made it to the right side of the lifecycle, where production applications fail

Creating a more reliable world, where these processes are truly automated and integrated into the application lifecycle, requires reinventing how reliability is managed. The last few years have seen the rise of Site Reliability Engineering, a discipline that applies engineering principles to the formerly IT-led practice of running reliable software services.

At StackPulse, we believe that site reliability engineering is critical to service reliability in our digital age. And just as automated testing, continuous delivery and infrastructure provisioning are all developer-led, ‘as code’ disciplines today, we see site reliability engineering as the next frontier in developer empowerment.

With that in mind, we set out to reimagine what reliability can (and should) be.

Introducing Reliability as Code

Today, we’re announcing our Seed and Series A fundraising, and making the public debut of the StackPulse Reliability Platform.

With StackPulse, organizations can fully move away from traditional on-call or ITSM practices and tools, and place reliability in the hands of the development teams who understand the running services best.

To start, StackPulse ingests alerts from across your monitoring tools, cloud services and private networks. It enriches, analyzes and triages these alerts, separating those that matter from those that don’t. StackPulse lets you deliver alerts with context and recommended actions directly to on-call teams, shortening the time to incident detection.
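
To make the enrich-and-triage idea concrete, here is a minimal Python sketch. It is an illustration only, not the StackPulse API: the Alert shape, the SERVICE_OWNERS lookup and the enrich and triage functions are all hypothetical names made up for this example, and the rule "keep warnings and above" is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    """Hypothetical alert shape; not StackPulse's actual data model."""
    source: str
    service: str
    severity: str  # "info", "warning" or "critical"
    message: str
    context: Dict[str, str] = field(default_factory=dict)


# Assumed ownership lookup used for enrichment (illustrative data).
SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}

# Assumed policy: only these severities are actionable; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "warning": 1}


def enrich(alert: Alert) -> Alert:
    """Attach ownership context so the responder sees who owns the service."""
    alert.context["owner"] = SERVICE_OWNERS.get(alert.service, "unknown")
    return alert


def triage(alerts: List[Alert]) -> List[Alert]:
    """Drop non-actionable alerts, enrich the rest, most severe first."""
    actionable = [enrich(a) for a in alerts if a.severity in SEVERITY_RANK]
    return sorted(actionable, key=lambda a: SEVERITY_RANK[a.severity])
```

Given one info, one warning and one critical alert, triage returns only the critical and warning alerts, in that order, each carrying an owner in its context.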

When an incident is detected, StackPulse can execute automated playbooks to perform investigation or remediation steps. This speeds incident resolution and reduces manual action and human error. These standardized playbooks help teams distribute knowledge and operational practices globally, ensuring a consistent and swift response no matter who is on call. StackPulse comes with ready-made playbooks that can be easily activated, and their modular construction allows playbooks to be quickly shared across teams.

Throughout the incident lifecycle, StackPulse works in the background to centralize incident data and team communication automatically, while preserving a record of the technical and human elements of each incident. This knowledge is then used to uncover insights that prevent similar incidents from recurring.

This combination of alert enrichment, automated incident response and incident lifecycle management is powerful in itself as a set of platform components that enable reliability. But the real power of StackPulse lies in how we approach reliability.

Every enrichment step, every remediation playbook, every incident update in StackPulse is defined in code. Developers can build, test and deliver incident response operations using the same CI/CD or GitOps pipelines used today for application delivery. No more silos. No more custom scripts. No more cumbersome IT tools built for a world of on-premises servers. StackPulse gives you standardized, scalable incident response, built by developers and delivered as code.
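
What might a playbook defined in code, and tested like code, look like? The Python sketch below is one possible shape, offered under stated assumptions rather than as the StackPulse format: the step functions, the DISK_PLAYBOOK list, the run_playbook helper and the 90% disk threshold are all hypothetical, and the remediation is simulated instead of touching a real host.

```python
from typing import Callable, Dict, List

# Hypothetical model: a playbook is an ordered list of steps that
# read and update one shared state dictionary.
Step = Callable[[Dict], None]


def check_disk(state: Dict) -> None:
    # Investigation step: flag the disk as full at an assumed 90% threshold.
    state["disk_full"] = state["disk_used_pct"] >= 90


def rotate_logs(state: Dict) -> None:
    # Remediation step, simulated here; a real step would act on the host.
    if state["disk_full"]:
        state["disk_used_pct"] -= 30
        state["actions"].append("rotate_logs")


DISK_PLAYBOOK: List[Step] = [check_disk, rotate_logs]


def run_playbook(steps: List[Step], state: Dict) -> Dict:
    """Run each step in order and return the final state."""
    state.setdefault("actions", [])
    for step in steps:
        step(state)
    return state


# Plain test functions like these can run in the same CI pipeline
# as the application's own tests.
def test_playbook_remediates_full_disk() -> None:
    state = run_playbook(DISK_PLAYBOOK, {"disk_used_pct": 95})
    assert state["actions"] == ["rotate_logs"]
    assert state["disk_used_pct"] == 65


def test_playbook_leaves_healthy_disk_alone() -> None:
    state = run_playbook(DISK_PLAYBOOK, {"disk_used_pct": 40})
    assert state["actions"] == []
```

Because the playbook is ordinary code, a change to it can be reviewed in a pull request and gated by tests before it ever runs against production.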

You build it, you run it

In 2006, Amazon CTO Werner Vogels outlined the principle of “You build it, you run it”: the notion that developers are ultimately responsible for the entire lifecycle of their services. Since then, we’ve seen more and more disciplines, from quality assurance to release engineering to infrastructure provisioning, become developer-led. We’re proud to deliver on this principle with the industry’s first reliability platform.

We’re StackPulse. Reliability as code.
