How do you keep a data-intensive, real-time service that monitors hundreds of thousands of servers up and running around the clock?
How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?
What should the infrastructure look like when Datadog monitors millions of servers and containers? If you think you have the answers, join us as a Site Reliability Engineer (SRE).
What you will do
- Keep our service reliable, available and fast as a member of the operations team.
- Respond to, investigate and fix service issues, whether they be deep in the OS kernel or in the application code.
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
What we're looking for
- You have a BS/MS/PhD in a scientific field or equivalent experience.
- You have a track record as an engineer in the operations of a large site.
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues.
- You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems.
- You have production experience with distributed compute/storage tools, e.g. ZooKeeper, Cassandra, PostgreSQL, Kafka, Elasticsearch, Redis.
- You have submitted bug fixes to the aforementioned projects.
- You are fluent in Python, Ruby and Go.
Is this you? Tell us why, and apply now. Include links to your GitHub, Stack Overflow or other online projects.