Abstract

Production incidents are inevitable. When systems fail, how effectively your team responds can mean the difference between a minor disruption and a major outage. Datadog Incident Management gives you a central place to declare, track, and manage incidents from detection to resolution. 

In this course, you'll work through a realistic incident scenario at a fictional e-commerce company, Storedog. You'll configure monitor notifications and incident notification rules so the right people are alerted when issues arise. When high error rates hit two services, you'll declare an incident, investigate the root cause, and share your findings with your team through the incident workbench and timeline. After resolving the incident, you'll generate an AI-powered postmortem and review incident analytics. From there, you'll improve your incident response process by configuring a status page and setting up incident automations.

Learning Objectives

By the end of this course, you will be able to: 

  • Describe the key phases of the incident management lifecycle, from detection and declaration through resolution and post-incident analysis.
  • Configure monitor notifications and incident notification rules to ensure the right people are alerted when issues arise.
  • Declare incidents in Datadog in response to monitor alerts and assign appropriate severity levels, incident commanders, and response teams.
  • Investigate incidents using APM and observability data to identify root causes, and share findings with your team via the incident workbench and timeline.
  • Generate AI-powered postmortems and analyze incident metrics using dashboards to identify opportunities for process improvement.
  • Configure status pages for stakeholder communication and create incident automations to trigger follow-up tasks automatically.

Primary Audience

This course is designed for engineers who use Datadog to monitor their applications and are involved in the incident response lifecycle. It's particularly suitable for DevOps Engineers, Software Engineers, Site Reliability Engineers (SREs), and Engineering Managers who serve as Incident Responders or Incident Commanders.

Prerequisites

The prerequisites for this course are the following: 

Technical Requirements

In order to complete the course, you will need:

  • Google Chrome or Firefox
  • Third-party cookies enabled in your browser (required to access the labs)

Course Navigation

At the bottom of each lesson, click the MARK LESSON COMPLETE AND CONTINUE button so that the lesson is marked complete. You must complete every lesson to receive the certificate at the end of the course.

Course Enrollment Period

Please note that your enrollment in this course ends after 30 days. You can re-enroll at any time and pick up where you left off.

Course Curriculum

    1. Introduction to Incident Management

    1. Incident Management Best Practices

    1. Datadog Incident Management

    2. Lab: Datadog Incident Management

    1. Summary

    2. Feedback Survey

Getting Started with Incident Management

  • 2 hours to complete
  • 3 Lessons
  • Beginner