Site Reliability Engineer Learning Path
Follow this curated path to strengthen your ability to maintain reliable services at scale with Datadog.
Through hands-on courses, you’ll explore how to understand application performance, monitor infrastructure and networking in real time, and implement SLO-driven strategies so that you can quickly detect, analyze, and resolve system-wide issues.
This path is designed for Site Reliability Engineers (SREs) and other roles tasked with optimizing service uptime and performance.
This path includes the following courses:
Getting Started with Incident Management
NEW! Learn how to manage incidents using Datadog Incident Management. By the end of this course, you'll know how to set up Incident Management, detect and declare incidents, and guide your team through resolution.
Getting Started with APM Metrics & Traces
Monitor service health and performance with Application Performance Monitoring (APM). Explore traces to understand requests and interactions between services. Track key metrics to understand trends that impact system behavior and user experience.
Getting Started with Infrastructure and Cloud Network Monitoring
Learn how to analyze metrics, visualize network and infrastructure performance, and troubleshoot issues effectively in this introduction to using Datadog’s Infrastructure and Cloud Network Monitoring (CNM).
Getting Started with Service Level Objectives (SLOs)
In this course, you’ll deepen your understanding of Service Level Objectives (SLOs) and gain hands-on experience using them to solve issues in a web application. This builds on the concepts covered in Understanding Service Level Objectives.