Site Reliability Engineer (SRE) - AI/Defense
Ensure reliability of deployed AI systems and defense infrastructure
31
Open Positions
Active Positions (36)
Lead/Manager Site Reliability Engineering Team (Amsterdam) · senior
Together AI · Amsterdam
PagerDuty · Ansible · Terraform
Staff Software Engineer, Site Reliability (SRE) · staff
Harvey AI · Bengaluru
site reliability · scaling across 50+ regions · mission-critical operations
Senior Site Reliability Engineer, Production Engineering · senior
Anduril · Costa Mesa, California, United States
Lattice OS · Production Engineering · mission-critical systems · autonomous command and control · operational environments · reliability engineering
Member of Technical Staff - Platform (Deployment Infrastructure) · staff
xAI · Palo Alto, CA; Washington, D.C.
bare metal provisioning · GPU workloads · air-gapped deployment · Kubernetes manifests · site topology profiles · compliance requirements
Member of Technical Staff - Deployment & Compliance (Air-Gapped Infrastructure) · staff
xAI · Palo Alto, CA
air-gapped GPU infrastructure · ATO package preparation · STIG evaluation · POAM management · security compliance accreditation · classified AI inference platforms
Sr. Engineering Manager, SRE · senior
Abridge · SF Office
SLOs · multi-region deployment · multi-cloud deployment · application reliability roadmap · software replatforming · rearchitecture
Senior Software Engineer, Site Reliability · senior
Anduril · Sydney, New South Wales, Australia
Tactical Networking · command and control (C2) · collaborative autonomy · OSI reference model · Layer-1 Physical · Layer-2 Data
Software Engineer, Infrastructure Reliability · mid
OpenAI · San Francisco
Distributed system performance optimization · System resilience improvement · Observability platform development · Incident response postmortems · Infrastructure scalability patterns · Reliability guardrails
Senior Site Reliability Engineer (x/f/m) · senior
Doctolib · Paris, Paris, France
Database optimization · Datastores health · Database reliability · Database availability · Database performance · Database automation
Engineering Manager - Observability & Reliability Engineering Obsession (x/f/m) · manager
Doctolib · Berlin, Berlin, Germany
Ruby on Rails backend foundations · PostgreSQL scalability · MongoDB integration · Platform as a Product mindset · Backend foundation management · CI pipeline automation
Senior Site Reliability Engineer · senior
Anduril · Irvine, California, United States; Washington, District of Columbia, United States
Lattice OS · sensor fusion · autonomy · Counter Intrusion systems · Air Defense systems · robotic systems
Site Reliability Engineer, Discovery · mid
Anduril · Seattle, Washington, United States
site reliability engineering · mission autonomy · mesh networking · systems integration · robotics · networking
Senior Editor, Taiwan · senior
Spotify · Taiwan
playlist curation · editorial ecosystem · music trends analysis · user behavior analysis · artist discovery · cultural insights
Senior Site Reliability Engineer · senior
Spotify · New York, NY
AI-native workflows · agentic production systems · background coding agents · Backstage · developer portals · agentic developer tooling
Senior Site Reliability Engineer - Developer, Connected Warfare · senior
Anduril · Costa Mesa, California, United States
warfighter capability delivery · deployment engineer support · system integration strategies · fault tolerant system delivery · scalable system delivery · modern technology solutions
Senior Site Reliability Engineer - Tactical Reconnaissance & Strike · senior
Anduril · Atlanta, Georgia, United States
Lattice OS · autonomous drones · solid rocket motors · Ghost · Anvil · Bolt
Site Reliability Engineer II · mid
Dataiku · United States, New York
pretraining · posttraining · science organization · technical operations · program management · execution engine
Engineering Manager SRE (x/f/m) · manager
Doctolib · Paris, Paris, France
Automation Platform · CI/CD automation · Testing infrastructure · Ephemeral development environments · Developer productivity tooling · Contract testing
Senior Site Reliability Engineer - Observability (x/f/m) · senior
Doctolib · Berlin, Berlin, Germany; Paris, Paris, France
observability strategy · logging · metrics · tracing · alerting · incident detection
SRE / Incident Manager Team Leader (x/f/m) · senior
Doctolib · Paris, Paris, France
Incident Management · Problem Management · Operational Excellence · Reliability Engineering · Change Safety · Observability
Senior Site Reliability Engineer · senior
Algolia · Paris, France
AI Search · NeuralSearch · keyword search · semantic search · vector search · AI Re-Ranking
Senior Site Reliability Engineer, AI Research · senior · Remote
Algolia · Remote - Australia
Site Reliability Engineering · cloud-first · service-oriented architectures · Google Cloud Platform · SRE fundamentals · production services
Senior Site Reliability Engineer - Deployed, Connected Warfare · senior
Anduril · Costa Mesa, California, United States
system deployment · hardware installation · software installation · network expansion · customer mission support · mission critical capabilities
Member Of Technical Staff - Government Infrastructure · staff
xAI · Los Angeles, CA; Palo Alto, CA; Washington, D.C.
classified cloud · federal compliance · GPU hardware provisioning · secure infrastructure · hybrid cloud architectures · government projects
Member of Technical Staff - Infrastructure Reliability · staff
xAI · Palo Alto, CA
GPU supercluster reliability · high-QPS production systems · infrastructure automation in Rust · distributed infrastructure monitoring · training throughput optimization · storage infrastructure evolution
Site Reliability Engineer - Cybersecurity · mid
xAI · Palo Alto, CA
X Money platform · P2P payments · money transmission · hybrid cloud security · distributed systems security · security automation
Site Reliability Engineer (SRE) · mid
xAI · London, UK
Buildkite · ArgoCD · Prometheus · Grafana · PagerDuty · Pulumi
Site Reliability Engineer - US Government · mid
xAI · Palo Alto, CA; Washington, D.C.
classified cloud · GPU hardware · large-scale AI workloads · federal compliance requirements · training clusters · inference clusters
DevOps Engineer, IPS · mid
Scale AI · Doha, Qatar
Infrastructure as Code (Terraform) · CloudFormation · CI/CD pipelines · containerized applications · VPCs · VPNs
Site Reliability Engineer / DevOps · mid
Scale AI · Mexico City, MX
robot stations · technical facilities management · on-site infrastructure · network installations · hardware troubleshooting · physical infrastructure
Senior Site Reliability Engineer - Database (x/f/m) · senior
Doctolib · Nantes
LLM · VLM · RAG-based systems · AI Medical Companion · Vector Databases · Google Cloud Platform (GCP)
Site Reliability Engineer · mid
Together AI · San Francisco
usage-based billing · payment processors (Stripe) · product entitlements · customer-facing analytics · commerce platform · API-driven services
Deployment Site Reliability Engineer, Connected Warfare · mid
Anduril · Costa Mesa, California, United States
Lattice OS · autonomy · computer vision · sensor fusion · first principles aircraft design · system safety
Senior IT Systems Engineer · senior
Abridge · SF Office
JAMF · Fleet · MDM platforms · SOC 2 compliance · HIPAA compliance · endpoint lifecycle management
Senior DevOps Engineer, Space · senior
Anduril · Costa Mesa, California, United States
Lattice OS · Space Domain Awareness (SDA) · Space Control · SDANet · Infrastructure pipeline hardening · Test and release pipeline development
Site Reliability Engineer Intern · intern
Dataiku · France, Paris
Dataiku Cloud · fully-managed offering · launchpad · SaaS portal · Cloud Engineering · SRE