
Prometheus Monitoring: Architecture and Challenges

Prometheus collects and processes metrics from any infrastructure or application component, and it is very frequently used to monitor containerized workloads. The platform integrates with Kubernetes and other components of the cloud native ecosystem, ingests their data, and lets you collect and process four types of metrics: counters, gauges, histograms, and summaries. The Prometheus data model mirrors Kubernetes infrastructure metadata, making it a natural choice for Kubernetes monitoring.

In this article, you will learn:

  • What is Prometheus
  • Prometheus for Kubernetes monitoring
  • How Prometheus monitoring works
  • Types of Prometheus metrics
  • Prometheus architecture: explained
  • Challenges of using Prometheus

What Is Prometheus?

Prometheus is an open-source monitoring and alerting solution designed for large-scale, dynamic environments, and it is very often used to monitor containers and containerized workloads. It offers rich customization and flexible queries with low system resource requirements, and a single server can ingest massive amounts of data from thousands of machines simultaneously if needed.

Prometheus was originally developed at SoundCloud, starting in 2012, to overcome the limitations of the monitoring solutions in use at that time. Federation of multiple Prometheus servers allows, among other things, supporting further growth of the environment.

Prometheus: the Native Kubernetes Monitoring Solution

Prometheus was inspired by Borgmon, the monitoring system Google built for Borg. Google Borg was the predecessor of Kubernetes and introduced many of the principles that underpin modern container orchestrators. Prometheus is one of only a few graduated projects of the Cloud Native Computing Foundation (CNCF), and it was built from the ground up for monitoring dynamic, containerized infrastructure such as Kubernetes.

In particular, Prometheus has several characteristics that make it a good match for Kubernetes. These include:

  • Data model—Prometheus uses a multi-dimensional data model in which metrics are stored as time series identified by a name and key-value label pairs. This mirrors how infrastructure metadata is stored in Kubernetes using labels (see the sketch after this list).
  • Accessible data format and protocols—metrics are exposed in a human-readable text format, eliminating the need to transform data before use. To access metrics, you just need to connect via HTTP, so you can view them easily with a wide range of tools.
  • Service discovery—in addition to acquiring metrics from explicitly specified targets, Prometheus has built-in support for auto-discovery of new targets using a number of methods. This is particularly important in dynamic orchestrated environments, where pods of various services can be spawned and decommissioned on the fly.
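
To make the label-based data model concrete, here is a minimal sketch using the official prometheus_client Python library; the metric and label names are invented for illustration and do not come from the article.

    from prometheus_client import Counter, generate_latest

    # A time series is identified by a metric name plus key-value labels,
    # much like Kubernetes objects are identified by labels.
    HTTP_REQUESTS = Counter(
        "http_requests_total",
        "Total HTTP requests handled by the service",
        ["method", "namespace"],  # label names are illustrative
    )

    HTTP_REQUESTS.labels(method="GET", namespace="prod").inc()

    # generate_latest() renders the default registry in the same human-readable
    # text format that Prometheus scrapes over HTTP, for example:
    #   http_requests_total{method="GET",namespace="prod"} 1.0
    print(generate_latest().decode())

Every unique combination of label values becomes its own time series, which is what lets Prometheus queries slice and aggregate metrics along the same dimensions (namespace, pod, and so on) that Kubernetes itself uses.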

How Prometheus Monitoring Works

Prometheus is made up of three main components that run on a single server: a time series database, a data retrieval worker, and a web server. The data retrieval worker discovers targets (for example, in Kubernetes) and pulls metrics from long-running jobs directly, or from a pushgateway in the case of short-lived jobs. These metrics are then written to the database, from where they can be accessed via HTTP for queries and alerting.
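
For the short-lived-job path mentioned above, here is a hedged sketch that pushes a completion timestamp to a Pushgateway using the prometheus_client Python library; the gateway address and job name are placeholders, not values from the article.

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    # A short-lived batch job cannot wait around to be scraped, so it pushes
    # its final metrics to a Pushgateway, which Prometheus then scrapes.
    registry = CollectorRegistry()
    last_success = Gauge(
        "job_last_success_unixtime",
        "Unix timestamp of the last successful batch run",
        registry=registry,
    )
    last_success.set_to_current_time()

    # "pushgateway:9091" and the job name are placeholders for this sketch.
    push_to_gateway("pushgateway:9091", job="nightly_batch", registry=registry)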

Prometheus Monitoring diagram

4 Types of Metrics Monitored by Prometheus

When monitoring, there are four types of metrics that Prometheus can process. These are:

  • Counter—a cumulative metric that can only increase or reset to zero. It is useful for measures like tasks completed, errors, or number of requests. 
  • Gauge—a point-in-time metric that can increase or decrease. It is useful for measures like concurrent requests or current memory usage.
  • Histogram—a metric that samples observations and counts them in configurable buckets. It is useful for aggregated measures like request durations or response sizes, and it can also be used to calculate Apdex scores, which measure the performance and user satisfaction of applications.
  • Summary—a metric that samples observations and reports the total count of observations, the sum of observed values, and configurable quantiles. A summary is useful for getting an overview of a metric together with its distribution (all four types appear in the sketch after this list).
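
For concreteness, the following is a minimal, hedged sketch of all four metric types using the prometheus_client Python library; the metric names and bucket boundaries are made up for illustration.

    from prometheus_client import Counter, Gauge, Histogram, Summary

    # Counter: only increases (or resets to zero on restart).
    REQUESTS = Counter("app_requests_total", "Total requests handled")

    # Gauge: a point-in-time value that can go up or down.
    IN_PROGRESS = Gauge("app_requests_in_progress", "Requests currently in flight")

    # Histogram: counts observations into configurable buckets (seconds here).
    LATENCY = Histogram(
        "app_request_duration_seconds",
        "Request duration",
        buckets=(0.1, 0.5, 1.0, 2.5, 5.0),
    )

    # Summary: tracks the count and sum of observations
    # (the Python client does not calculate quantiles itself).
    PAYLOAD = Summary("app_response_size_bytes", "Response payload size")

    def handle_request(duration_seconds: float, response_bytes: int) -> None:
        REQUESTS.inc()
        IN_PROGRESS.inc()
        LATENCY.observe(duration_seconds)
        PAYLOAD.observe(response_bytes)
        IN_PROGRESS.dec()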

Prometheus Architecture

Prometheus Monitoring architecture
Source: Prometheus.io

To function smoothly, Prometheus must integrate a wide variety of components into its architecture. Because of the tool’s flexibility, these components vary from one implementation to another, but the following elements should always be included.

Client libraries

Client libraries enable you to add direct instrumentation to the code of your services. This instrumentation then returns metrics data according to Prometheus’ metrics types.

Libraries are readily available for most popular languages, including Go, Java, Python, and Ruby. You can also create your own client library, or simply expose metrics yourself in Prometheus’ text-based exposition format.
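
As a hedged sketch of what direct instrumentation looks like with the official Python client, the snippet below times a function with a Summary metric; the function name, metric name, and port are invented for the example.

    import random
    import time

    from prometheus_client import Summary, start_http_server

    # Summary used to time the function below.
    REQUEST_TIME = Summary("request_processing_seconds",
                           "Time spent processing a request")

    @REQUEST_TIME.time()  # the decorator records one observation per call
    def process_request():
        time.sleep(random.random())

    if __name__ == "__main__":
        start_http_server(8000)  # metrics are then served at /metrics
        while True:
            process_request()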

Exporters

Exporters enable you to collect data from services and components that you cannot instrument directly because you do not have access to their code: for example, databases, OS kernels, routers, or cloud resources. There are numerous officially supported exporters, as well as third-party ones.

Exporters run alongside your services and function as a proxy between Prometheus and the service. These tools accept Prometheus requests, gather data from the service, convert data to the correct format, and return it to the data retrieval worker.
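
The following is a hedged sketch of that proxy pattern as a tiny custom exporter built with the prometheus_client Python library; the queue-depth metric and the fetch_queue_depth function are hypothetical stand-ins for whatever the real, uninstrumentable service exposes.

    import time

    from prometheus_client import start_http_server
    from prometheus_client.core import GaugeMetricFamily, REGISTRY

    def fetch_queue_depth() -> float:
        """Hypothetical call into a service we cannot instrument directly."""
        return 42.0

    class QueueExporter:
        def collect(self):
            # Called on every scrape: gather data from the service, convert it
            # to the Prometheus format, and hand it back to the retrieval worker.
            g = GaugeMetricFamily("external_queue_depth",
                                  "Depth of the external work queue")
            g.add_metric([], fetch_queue_depth())
            yield g

    if __name__ == "__main__":
        REGISTRY.register(QueueExporter())
        start_http_server(9200)  # the port is an arbitrary choice for this sketch
        while True:
            time.sleep(5)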

Service discovery

Prometheus enables you both to configure metrics targets statically and to locate targets dynamically through service-discovery mechanisms. This ensures that metrics are collected across your system, including the components and resources that orchestrators create and decommission dynamically.

To accomplish service discovery, Prometheus allows integration with many existing methods, including those used by Kubernetes, HashiCorp Consul, and all major cloud infrastructure providers.

Scraping

In Prometheus, scraping is the actual process of collecting metrics from endpoints. This is accomplished when Prometheus’ data retrieval worker sends a scrape request via HTTP. When instrumentation or exporters receive these requests, a response is sent with the appropriate data.

Typically, Prometheus also appends scrape-specific data to each response, such as how long the request took and whether the scrape succeeded (exposed as the scrape_duration_seconds and up series, respectively).
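
To make the mechanics concrete, this hedged sketch performs the same kind of HTTP scrape by hand, using the requests library and the prometheus_client parser; the target URL is a placeholder.

    import requests
    from prometheus_client.parser import text_string_to_metric_families

    # A scrape is just an HTTP GET against a target's metrics endpoint;
    # the URL below is a placeholder for whatever target you actually run.
    response = requests.get("http://localhost:8000/metrics", timeout=5)

    for family in text_string_to_metric_families(response.text):
        for sample in family.samples:
            # Each sample is one data point of one time series.
            print(sample.name, sample.labels, sample.value)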

Storage

Prometheus includes persistent storage in the form of a local, on-disk time series database. You can also integrate it with remote storage systems via its remote read/write APIs.

In local storage, data is grouped into two-hour blocks that are eventually compacted into larger ones. While data is being collected, it is held in memory and recorded in a write-ahead log (WAL) to protect against loss due to server failure.

PromQL

PromQL is a functional query language used to select and aggregate time series data from the Prometheus database.
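
As a hedged illustration, the sketch below evaluates a PromQL expression through Prometheus’ HTTP query API using the requests library; the server address and the http_requests_total metric are assumptions made for the example.

    import requests

    PROMETHEUS_URL = "http://localhost:9090"  # placeholder address for this sketch

    # rate(...) computes the per-second increase of a counter over a 5-minute
    # window; sum by (namespace) aggregates the result per namespace label.
    query = 'sum by (namespace) (rate(http_requests_total[5m]))'

    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
        timeout=5,
    )
    resp.raise_for_status()

    for result in resp.json()["data"]["result"]:
        timestamp, value = result["value"]
        print(result["metric"], value)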

Recording rules and Alerting rules

Prometheus supports two types of rules, each based on a PromQL query, covering two distinct use cases:

  • Recording rules evaluate PromQL expressions on a regular schedule and store the results as new time series. This helps distribute the work required to aggregate metrics and speeds up rendering of visualizations that rely on computationally expensive queries.
  • Alerting rules define conditions, again as PromQL queries, that constitute an alert. The actual alert is issued via integrations with external alerting tools: Prometheus sends the result of the rule’s PromQL query to a component such as Alertmanager, which either distributes it through your preferred communication routes or “handles” the alert in any other way you configure. The most common alerting tools for notifying a human operator about the occurrence of a specific condition are Slack, PagerDuty, OpsGenie, and VictorOps.

Dashboards

Prometheus lets you visualize collected time series data in a few different ways. For example, you can use the built-in expression browser for ad-hoc queries and debugging. For more extensive reporting, however, you should use Grafana, console templates, or another visualization tool capable of retrieving data from Prometheus.

Challenges with Prometheus Monitoring

Prometheus is a powerful tool for monitoring Kubernetes (and other) environments, but it’s not without faults. Below are some of the most common challenges you might encounter when using this tool.

  • Storage—the amount of storage space you need for Prometheus is directly related to how many metrics you collect and how frequently. Depending on the size of your Kubernetes deployment, the required storage can rapidly exceed your resources. There is no real way around this; it is just something to keep in mind when budgeting your resources. When using third-party storage via a dedicated Prometheus adapter, various data maintenance and lifecycle policies can be applied.
  • Redundancy—Prometheus uses a single node database to ensure data integrity. However, this does not allow for redundancy and can lead to data loss if your disks fail. To get around this, you can replicate data externally through asynchronous methods or you can run a mirror of your Prometheus server. Neither is a great solution but both will provide redundancy. Additionally, more complicated Prometheus server federation techniques exist.
  • No event-driven metrics—since metrics are scraped (rather than pushed), you cannot easily implement event-driven metrics. You can, however, use Prometheus’ Push Gateway to collect metrics pushed by short-lived jobs or batches. To simulate event-driven metrics, you can scrape this gateway frequently although this can tax your server and produce more data than is useful.
  • Segmented network access—depending on your services, you may run into access restrictions or a lack of external access. This often requires running multiple Prometheus instances to cover your full range of metrics. To view these metrics, you then need to either access different Grafana dashboards or use federation to aggregate metrics from your instances into a single server.
  • Long term storage—Prometheus offers local storage but is not designed for long-term retention. If you keep metrics for a long period, you can severely degrade query performance. To avoid this, you need to either use recording rules to pre-aggregate your metrics data or integrate a distributed long-term storage solution, like Thanos or Cortex.

Conclusion

At the time of writing, Prometheus is the undisputed open source leader in metrics-based monitoring. Its user community grows constantly and rivals those of far more feature-rich commercial solutions in the field.

A successful Prometheus implementation serves as a basis for the operations and site reliability of your software services. Furthermore, working with Prometheus can (and should) be done in ways that fit DevOps paradigms, managing both the metrics and the rules “as code”.

StackPulse has a native integration with Prometheus and allows DevOps, R&D and SRE teams to take the next step and codify operational processes that will take place when an alerting rule identifies a situation that requires response. Implementing smart auto-scaling, collecting troubleshooting information and notifying relevant role-players or performing automatic remediation steps – all this and more can be done fully within the DevOps Lifecycle paradigms when using Prometheus in combination with StackPulse.

Click here to begin your StackPulse evaluation and supercharge your Prometheus to become a real “Mission Control Center” for your software services operations.