Staff DevOps Engineer

🇬🇧 Remote, GB · Remote · IT · Full-time · Published Apr 28, 2026

Location: Remote, GB

Work mode: Remote

Employment type: Full-time

Category: IT

IT category: DevOps / SRE

Published: 28 April 2026

Last checked: 7 May 2026

Runware is building the API layer for the next generation of AI products. Our platform gives teams fast, reliable access to real-time inference across thousands of models through a single flexible API. We help customers build and scale media generation products with better performance, lower cost, and less operational complexity.

Behind this is an infrastructure platform built for speed, reliability, and GPU scale. New models launch constantly. Customer traffic can grow quickly. Performance matters at every layer.

We are looking for a Staff/Senior DevOps Engineer to help build, operate, and scale the infrastructure behind Runware’s global AI inference platform. You’ll play a critical role in making our systems faster, more resilient, easier to operate, and ready for the next stage of growth.

About the role

Runware’s infrastructure is the engine behind some of the fastest-growing AI products in the world. As a Staff/Senior DevOps Engineer, you’ll help design, build, and operate the systems that power real-time AI inference across large-scale GPU fleets and a global production platform.

This is not a traditional DevOps role. You’ll be working at the intersection of bare-metal infrastructure, GPUs, networking, automation, observability, and high-performance distributed systems. Your work will directly shape how quickly we can launch new models, scale customer traffic, recover from failures, and deliver low-latency AI experiences to millions of users.

You’ll turn complex, hardware-driven infrastructure into reliable, automated, developer-friendly platforms. From provisioning and orchestration to deployment pipelines, monitoring, incident response, and capacity scaling, you’ll help remove friction so engineering teams can move faster without compromising reliability.

You’ll build the foundations that let Runware scale with confidence: infrastructure that is fast, resilient, observable, secure, and built for the demands of real-time AI.

What you’ll do

Build and scale the infrastructure that powers real-time AI inference across GPU fleets, bare-metal servers, serverless and containerised production systems
Help evolve Runware’s platform toward more elastic, on-demand infrastructure that can scale quickly with customer traffic and model demand
Make Runware faster, more reliable and more resilient by improving the critical paths behind our request entrypoints, inference services, queues, storage, load balancers and networking layer
Automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback
Build the observability backbone for a high-performance AI platform, with the signals needed to spot issues early, understand capacity and fix problems before customers feel them
Play a leading role in production operations, incident response, debugging and post-incident improvements, helping us turn operational challenges into a stronger platform
Strengthen the security and compliance foundations of our infrastructure through patching, secrets management, access controls, hardening, auditability, documentation and repeatable operational processes