Site Reliability Engineer Learning Path
Follow this curated path to strengthen your ability to maintain reliable services at scale with Datadog.
Through hands-on courses, you’ll explore how to understand application performance, monitor infrastructure and networking in real time, and implement SLO-driven strategies, so that you can quickly detect, analyze, and resolve system-wide issues.
This path is designed for Site Reliability Engineers (SREs) and other roles tasked with optimizing service uptime and performance.
This path includes the following courses:
Getting Started with APM Metrics & Traces
Monitor service health and performance with Application Performance Monitoring (APM). Explore traces to understand requests and interactions between services. Track key metrics to understand trends that impact system behavior and user experience.
Getting Started with Infrastructure and Cloud Network Monitoring
NEW! Learn how to analyze metrics, visualize network and infrastructure performance, and troubleshoot issues effectively in this introduction to using Datadog’s Infrastructure and Cloud Network Monitoring (CNM).
Getting Started with Service Level Objectives (SLOs)
NEW! In this course, you’ll deepen your understanding of Service Level Objectives (SLOs) and gain hands-on experience using them to solve issues in a web application. This course builds on the concepts covered in Understanding Service Level Objectives.
Introduction to Incident Management
In this course, you’ll learn how to manage incidents by working through a hands-on example with Datadog Incident Management. You’ll also learn how to use Slack to communicate incident status to your team effectively.