Datadog Monitoring: How to Get Started

Datadog monitoring enables you to keep track of events across your applications and infrastructure. Datadog is designed to measure the performance of on-premises and cloud infrastructure, providing capabilities for logging, metrics, event tracing, and alerts.

Datadog is one component of a broader DevOps toolset. Monitoring platforms like Datadog offer the benefit of simplified licensing, pricing, certification, and procurement, and this type of ready-made solution comes with built-in compatibility between components and a relatively smooth learning curve.

In this article, you will learn:

  • What Is Datadog?
  • Datadog Capabilities
  • What Is the Datadog Agent?
  • What Are Datadog Monitors?
  • Datadog Monitoring Best Practices
  • Actionable Automation with StackPulse

What Is Datadog?

Datadog is a monitoring and analytics platform that you can use to measure the performance of on-premises and cloud infrastructures. It includes features for logging, metrics, event tracing, and alerts.

Datadog is built on a backend that includes PostgreSQL, Cassandra, and Kafka. Its front end includes a Go-based agent and a REST API. You can integrate Datadog with a variety of tools, including Kubernetes, Chef, Ansible, and Bitbucket. 
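For example, here is a minimal sketch of pushing a custom metric through the REST API with Datadog's official Python client (datadogpy). The API and application keys are placeholders, and the metric name and tags are purely illustrative.

    import time

    from datadog import initialize, api

    # Placeholder credentials; real keys come from your Datadog account settings.
    initialize(api_key="<DATADOG_API_KEY>", app_key="<DATADOG_APP_KEY>")

    # Submit a single data point for a hypothetical custom metric.
    api.Metric.send(
        metric="myapp.checkout.latency",
        points=[(int(time.time()), 0.42)],
        tags=["env:prod", "service:checkout"],
    )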

You can use Datadog monitoring with Windows, Linux, and macOS, as well as with all major cloud platforms, including AWS, Microsoft Azure, Google Cloud Platform, and Red Hat OpenShift.

To learn about open source monitoring for cloud native environments, check out our article about Prometheus monitoring.

Datadog Capabilities

From the Datadog user interface, you can create customizable dashboards for real-time viewing of multiple data sources. You can also configure notifications and tie those notifications to on-call management tools like PagerDuty or chat platforms like Slack. 

Datadog also includes features for:

  • Automatic detection and analysis of logs, errors, and performance
  • End-to-end request tracing across distributed systems
  • Code instrumentation and custom plug-ins
  • AI-powered user experience testing

What Is the Datadog Agent?

The Datadog Agent is a piece of software that collects event and metrics data from your hosts and sends it to the Datadog database. It is open source and can be found on GitHub.

The Agent is made up of two main components:

  • The collector—gathers data at a set interval. The v6 Agent includes a Python interpreter to run custom checks and enable integrations (a minimal custom check sketch follows this list).
  • The forwarder—sends data via HTTPS to the Datadog server. Data is buffered to prevent corruption and logs are forwarded over TCP connections encrypted with SSL.
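To illustrate the collector's Python interpreter, here is a minimal custom check sketch in the style used by Agent v6+. The check and metric names are placeholders; a real check would live in the Agent's checks.d directory alongside a matching conf.d configuration file.

    # checks.d/hello.py -- a minimal custom Agent check (Agent v6+ style).
    from datadog_checks.base import AgentCheck

    class HelloCheck(AgentCheck):
        def check(self, instance):
            # Report a single gauge value each time the collector runs this check.
            self.gauge("hello.world", 1, tags=["team:devops"])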

Two optional processes can be spawned by the Agent:

  • APM Agent—a tool used to collect tracing data. It is enabled by default.
  • Process Agent—a tool used to collect live process information. It gathers data from live containers, and is disabled if no containers are available.

Other elements that play a role in Datadog agent use include:

  • DogStatsD—an implementation of Etsy’s StatsD metric aggregation daemon. It lets you instrument custom code without adding latency and rolls up arbitrary custom metrics; a minimal usage sketch follows this list.
  • User controls—to monitor and configure your Agents, you can use either the web-based GUI or the CLI. By default the GUI is disabled, but you can enable it via the datadog.yaml file.
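As an illustration, here is a minimal sketch of sending custom metrics to DogStatsD from application code with the datadogpy client. It assumes a local Agent listening on the default DogStatsD port (8125); the metric names and tags are placeholders.

    from datadog import initialize, statsd

    # Point the client at the local Agent's DogStatsD listener (default port 8125).
    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # Fire-and-forget UDP metrics; the daemon aggregates (rolls up) them
    # before the Agent forwards them to Datadog.
    statsd.increment("myapp.page.views", tags=["env:prod"])
    statsd.gauge("myapp.queue.depth", 42, tags=["env:prod", "team:devops"])
    statsd.histogram("myapp.request.duration", 0.251, tags=["env:prod"])

Because the metrics travel over UDP to a local daemon, instrumentation calls return immediately and do not add latency to your requests.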

What Are Datadog Monitors?

Datadog monitors are tools you can use to trigger alerts when a metric crosses a defined threshold. For example, if you want to alert on low disk space, you can use the system.disk.in_use metric, averaged over devices and hosts to get your value.

When setting up monitors, you need to assign a meaningful title and message to ensure that reporting is useful. Titles must be unique and can use template variables for specifics. Messages should inform teams how to manage the alert. For example:

  • Title—“Disk space is low on {{device.name}} / {{host.name}}”
  • Message—“You need to free up disk space for continued operations. Step 1: Remove unused packages. Step 2: Remove duplicate files. …”.

Once the monitor is created, it is automatically propagated to ensure every host or device reporting your metric can trigger an alert. 
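As a sketch, the low disk space monitor described above could also be created programmatically with datadogpy's monitors API. The query, threshold, and notification handle below are illustrative rather than prescriptive.

    from datadog import initialize, api

    initialize(api_key="<DATADOG_API_KEY>", app_key="<DATADOG_APP_KEY>")

    # Multi-alert metric monitor: average disk usage per host and device over 5 minutes.
    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:system.disk.in_use{*} by {host,device} > 0.9",
        name="Disk space is low on {{device.name}} / {{host.name}}",
        message=(
            "You need to free up disk space for continued operations. "
            "Step 1: Remove unused packages. Step 2: Remove duplicate files. "
            "@slack-ops-alerts"  # illustrative notification handle
        ),
        tags=["team:devops"],
        options={"thresholds": {"critical": 0.9}},
    )

Grouping the query by host and device is what lets the {{device.name}} and {{host.name}} template variables resolve for each individual alert.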

[Image: the Datadog monitor overview, showing an example of Datadog monitors]

Datadog Monitoring Best Practices

When using Datadog, there are several best practices you should consider adopting. These Datadog monitoring practices can help you ensure that you’re getting maximum ROI and that your systems are operating at optimal capacity. 

Collect all available metadata tags from your infrastructure

Datadog tags are metadata that you can use to group and identify your various data sources. Most of the platforms you already use have built-in support for monitoring tags and can generate this information automatically. Before you integrate Datadog with these tools, verify that your tagging schema makes sense in those systems and that tags are applied consistently.

This matters because Datadog automatically imports tags when an integration is set up, and it is easier to get tagging right from the start than to fix it afterward. When verifying your tags, follow the standards or recommendations of your source provider; if tags from different Datadog integrations do not align, use custom values instead.

Looking for general monitoring guidelines? Check out our article—DevOps Monitoring: The Abridged Guide.

Pivot between data reporting types

With tags in place, Datadog automatically correlates data, enabling you to switch between dashboards, logs, and traces. In your Datadog dashboard you can see the health of your systems and monitor your services. These dashboards are interactive, enabling you to drill into information displayed by clicking on data points. 

Once a datapoint is selected, you can view the logs associated with that point in the Log Explorer. In the explorer, tags from the dashboard query you used are auto-populated, enabling you to quickly locate the data you need. This functionality also works in the opposite direction, enabling you to move from log to dashboard. You can also go directly to tracing information.

With tracing data, you can view specific traces related to your datapoint or log. From those traces, you can also expand back out to metrics or logs. The connectivity between these features enables you to quickly investigate events and locate issues. 

Assign services to owners, teams, or business units

Assigning owners to your various services can help you ensure that all appropriate teams are aware of service status and are alerted if an issue arises. Owner tags enable you to classify your services and dynamically route alert notifications as needed. You can also use owner classifications to quickly search for any components managed by a specific team or business unit.

When using these tags, you can use a single designation or you can layer tags. For example:

team: support
owner: devops
business-unit: development
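As a sketch of how these layered tags can be applied in practice, the same ownership metadata can be attached to any custom metric or event you submit, so that alerts and searches can be routed by team. The metric and event names below are illustrative.

    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # Layered ownership tags attached to a custom metric and an event.
    OWNER_TAGS = ["team:support", "owner:devops", "business-unit:development"]

    statsd.gauge("billing.jobs.backlog", 17, tags=OWNER_TAGS)
    statsd.event(
        "Billing backlog is growing",
        "The backlog exceeded its expected size; the owning team should investigate.",
        alert_type="warning",
        tags=OWNER_TAGS,
    )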

What Happens When Datadog Alerts Come in?

Continuously monitoring the operation of software services is a fundamental part of both DevOps culture and of software engineering in general. Choosing the right monitoring tools for the job, and keeping the monitoring pipeline up-to-date with the software and infrastructure landscape, will allow you to provide stable and reliable software services.

While monitoring your systems with tools like Datadog enables the operations side of the DevOps culture, it is only a means, not a goal. The goal is to provide uninterrupted service to your users. Operations activities that ensure this goal is met, based on monitoring data, are just as critical as having the monitoring in place. The ability to react to events at scale, before they affect the users of the service, is a business-critical need.

StackPulse is a platform that allows you to automate, analyze, and visualize the operational aspects of Site Reliability Engineering. When an alert or notification is sent, StackPulse triggers an automated playbook, ensuring that a predictable set of operational steps is performed.

The StackPulse platform can help you perform the following actions automatically:

  • Verifying the extent of the problem and its impact on components of the system – is it a mission-critical problem? How high would we prioritize handling it? Is this a false-positive or just a reflection of another issue that is already being handled?
  • Checking potential causes for the problem, either disqualifying them or identifying the direction for the root cause.
  • Updating all relevant persons and systems with the latest state of the potential incident.
  • Ensuring all investigation and remediation steps are audited and documented, in order to enable efficient post-incident analysis.
  • Automatically applying remediations and monitoring their effect on the system to achieve self-healing.

StackPulse ensures that engineering organizations get a high return on investment on monitoring infrastructure, by making every alert and notification count. 

Get early access to the StackPulse platform and try it for yourself!