Site Reliability Engineer
Join our global force of 400+ innovators, blending the latest in tech with the greatest in soundtracking, from our Stockholm HQ to offices in London, New York, Los Angeles, Berlin, Paris, Oslo, and Seoul. We’re an industry leader with a startup mentality. We take what we do seriously, but we don’t take ourselves too seriously. Creating and collaborating to transform the sound of streaming, content, and culture. Come join us, and let the world feel your work.

As a Site Reliability Engineer at Epidemic Sound, you will be a core member of the central platform team that builds and operates the platform the rest of Engineering ships on - keeping it reliable, scalable, and secure is what this team exists to do. This is infrastructure-flavoured software engineering: you will write the code that defines and automates the platform, and treat it as a product whose customers are the rest of Engineering. The goal is to make the reliable way the easy way - self-service paths that let product teams build and ship safely without waiting for anyone.

Your key responsibilities include
- Build and operate the platform our services run on - GKE clusters, the controllers that extend them, and the Terraform that defines our cloud.
- Own the path from commit to production - CI/CD, GitOps, and the progressive-delivery patterns that turn a merge into a safe release.
- Strengthen the networking and routing layer - traffic management on top of the VPC, firewalls, and network policies that keep it safe and predictable.
- Govern access and guardrails - IAM across every layer, policy-as-code, and break-glass paths - so teams move fast within safe defaults rather than waiting on tickets.
- Grow reliability and observability - alert hygiene, runbooks, SLOs, and the metrics and tracing that show how the platform behaves in production.
- Enable product teams and raise the bar - make production readiness the default, and drive healthy adoption of the standards and docs you would rather share than gatekeep.

Requirements
- Kubernetes fundamentals: a solid grasp of controllers, core components, and CNI and networking - depth in the domain matters more than any single tool (GKE a plus).
- Infrastructure as code and delivery: Terraform, Helm or Kustomize, CI/CD and GitOps (ArgoCD), and the traffic-management and progressive-delivery mechanisms that move releases out safely.
- Networking and access: routing fundamentals, the VPC, firewall, and network-policy primitives beneath it, and IAM and access management at different levels.
- Operational depth: monitoring fundamentals (a clear view of when to reach for metrics versus tracing, and experience with an open-source observability stack), strong troubleshooting across distributed systems, and solid Unix/Linux.
- Agentic development mindset: you use AI agents actively in your own work, knowing where they add leverage and where human judgement is non-negotiable.
- Collaboration and judgement: you do your best work on large, cross-cutting projects, communicate openly, and stay opinionated but open to discussion - reaching for the right tool over your own creation.

It would also be music to our ears if you have
- Familiarity with GCP and an observability stack with Prometheus, Thanos, and Grafana.
- Experience running containerised platforms at scale.
- Service mesh experience with Cilium eBPF, Linkerd, or Istio.
- Familiarity with platform building blocks like cert-manager, external-secrets, or external-dns.

Equal opportunity employer
We believe that bringing people together from different backgrounds, experiences and perspectives makes for a healthy workplace, a more successful business and a better world. We value diversity and encourage everyone to come and soundtrack the world with us.

Application
Ready to make the world feel your work?
Please apply, in English.
Original ad: jobs.ashbyhq.com