Senior Engineer - Systems Infrastructure

Overview

Own the infrastructure that runs AI systems in production. You'll design deployment architectures, set up monitoring and alerting, ensure security compliance, and keep everything running smoothly at enterprise scale. This isn't ticket-driven ops work. You're building the platform that enables rapid, reliable delivery across multiple client environments. Expect to write code, design systems, and debug complex distributed failures.

Responsibilities

• Design and implement deployment architectures for client AI systems
• Build CI/CD pipelines and Infrastructure-as-Code configurations
• Set up observability: metrics, logs, traces, alerting
• Ensure security compliance (SOC 2, ISO 27001, client-specific requirements)
• Respond to incidents and implement preventive measures
• Document runbooks and train client teams on operations

Requirements

• 5+ years operating production infrastructure (compute, networking, observability)
• Deep knowledge of containerization and orchestration (Docker, Kubernetes)
• Experience with IaC and cloud platforms (Terraform, AWS/GCP)
• Track record maintaining high-availability services (99.9%+ uptime)

Required Skills

Docker, Kubernetes, Terraform, Systems Design

Preferred Qualifications

• Built CI/CD pipelines from scratch
• Responded to production incidents at 3am and improved alerting so they don't happen again
• Experience with service mesh, API gateways, or distributed tracing
• Open-source contributions to infrastructure projects

Nice to Have

Python, AWS

A Day in the Life

Morning: Review overnight alerts (none, because you built good alerts). Mid-morning: Design review for multi-region deployment. Afternoon: Implement automated failover for critical service. Evening: Write runbook for new deployment pattern.

Why This Role

Build infrastructure for systems that millions depend on. Make architectural decisions that matter. Work with modern tools and patterns. No legacy baggage to maintain.

Technical Challenge

This role requires completing a technical challenge as part of the application process. Challenge: High-Availability Service (medium difficulty).

Frequently Asked Questions

What's the on-call expectation?

Rotating on-call (one week per month). Incidents are rare because we invest in reliability. On-call time and incident response are both compensated.

How much automation?

Everything is code. Manual deployments are failures. You'll spend significant time building automation that prevents repetitive work.

What about the technical challenge?

Required. It is a medium-difficulty infrastructure challenge (6-8 hours). We evaluate system reliability, observability setup, and deployment automation.