Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Worldwide Salaried Open

Job Description:

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.

Requirements:

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, short-term disability (STD), and long-term disability (LTD) insurance plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions
