Role Overview
We are seeking a US Infrastructure Operations Technical Lead to drive the operational excellence, technical leadership, and growth of Radiant’s US Infrastructure Operations function. This is a hands-on player-manager role designed for an infrastructure-focused engineering leader with a strong Site Reliability Engineering mindset and deep understanding of large-scale distributed infrastructure environments.
Working closely with the UK Infrastructure Operations Manager during overlapping morning hours (US Eastern Time), you will help coordinate cross-regional operations, strategic planning, incident management, and infrastructure delivery across Radiant’s global AI and HPC platform. During US business hours, you will lead and mentor the local Infrastructure Operations team, currently consisting of three engineers, while helping scale operational maturity and team capability as the business continues to grow.
The ideal candidate will come from a hyperscale, HPC, or large-scale cloud-native SaaS infrastructure background, with experience operating complex distributed systems at scale. This role requires breadth across datacentre compute, Linux systems, networking, and storage fabrics, with the ability to troubleshoot and lead continuous improvement of our infrastructure.
You should be comfortable operating and troubleshooting bare-metal environments, low-latency networking, storage protocols, and core infrastructure technologies underpinning high-performance AI and GPU compute platforms.
This role requires strong operational leadership capabilities, including experience running small engineering teams, participating in ITIL-aligned operational processes, and supporting high-availability production environments through structured incident, change, and problem management practices. You will also participate in an on-call rota to lead major incidents, orchestrating technical resources to quickly resolve large-scale issues.
As Radiant expands its global footprint, your operational leadership, technical expertise, and ability to build high-performing teams will play a critical role in shaping the future of our US infrastructure operations.
What’s in It for You
Join a globally distributed engineering organisation operating cutting-edge GPU, AI, and high-performance compute infrastructure at scale. As the US Infrastructure Operations Technical Lead, you’ll work hands-on with advanced compute and networking technologies powering large-scale AI and machine learning workloads.
This is an opportunity to operate at the forefront of modern infrastructure engineering, helping shape operational standards, automation practices, and reliability engineering across a rapidly scaling global platform. You’ll collaborate with highly skilled engineers across Infrastructure Operations, HPC SRE, Networking, and Platform Engineering teams in an environment that values technical excellence, ownership, and continuous improvement.
We move quickly, solve meaningful infrastructure challenges, and provide engineers with the opportunity to influence how next-generation AI infrastructure is designed, operated, and scaled globally.
You can also expect:
Exposure to industry-leading GPU and AI infrastructure
Opportunities to help build and scale a growing US operations function
A collaborative, inclusive, and globally connected engineering culture
Real ownership and influence across operational strategy and execution
Work at the intersection of reliability, automation, performance, and scale
A flexible remote-first working environment with ambitious growth plans
Key Responsibilities
Leadership & Operational Ownership
Lead a small but high-impact US Infrastructure Operations team, owning both people leadership and technical execution
Ensure 99.9%+ platform uptime across US-region services
Act as the senior US operational owner for production infrastructure, accountable for reliability, incident outcomes, and day-to-day operational execution
Partner tightly with the UK Infrastructure Operations Manager to align priorities, respond to incidents, and execute global infrastructure plans in real time
Own US-side incident leadership, driving fast and effective resolution of production-impacting infrastructure issues
Build and reinforce a strong ownership culture founded on "do, document, automate"
Ensure operational knowledge is captured and shared through lightweight, high-signal documentation rather than process overhead
Hire, onboard, and develop Infrastructure Operations engineers as the team scales
Run direct 1:1s and performance conversations focused on raising the technical bar and operational effectiveness
Ensure disciplined execution of core operational processes (incident, change, problem management) without slowing delivery
Participate in on-call rotation and lead from the front during major incidents
Travel within the US and Europe as required to support infrastructure deployments, data centre work, and cross-regional collaboration (UK-headquartered company)
Help define how Infrastructure Operations scales globally as the company grows
Technical Day-to-Day
Stay hands-on and close to the systems while leading the team — this is not a purely managerial role
Take ownership of real infrastructure problems and actively contribute to debugging, fixing, and improving production systems
Work across production infrastructure spanning compute, storage, networking, and platform services
Assist with resolution of deep infrastructure issues across the stack, including:
Linux systems (performance, stability, kernel behaviour, resource contention)
Networking (routing, switching, DNS, TCP/IP, latency, packet-level troubleshooting)
Storage systems (distributed storage performance, consistency, and failure modes)
Bare-metal infrastructure (hardware issues, firmware, lifecycle and deployment failures)
Operate and improve large-scale Linux environments across on-prem, private cloud, and hybrid infrastructure
Take ownership of infrastructure reliability through automation, configuration management, and system hardening
Build and improve Infrastructure as Code workflows (Terraform, Ansible or equivalent)
Drive observability as a first-class requirement — metrics, logs, traces, and actionable alerting
Lead or directly participate in major incident response, helping drive technical resolution under pressure
Act as a senior technical problem solver across Infrastructure Ops, Networking, Platform, and SRE teams
Identify repetitive operational work and eliminate it through automation and system improvements
Contribute directly to scaling decisions, capacity planning, and reliability improvements
Participate in on-call rotation with real escalation authority and accountability
Essential Skills & Experience
8+ years of experience in infrastructure engineering, SRE, platform operations, or large-scale production infrastructure
2+ years in technical leadership or engineering management with direct reports
Strong experience operating production infrastructure at scale (on-prem, private cloud, or hybrid)
Deep Linux expertise: performance tuning, debugging, kernel/system behaviour, production troubleshooting
Hands-on across the stack: bare metal → OS → network → storage → platform
Strong infrastructure fundamentals: compute, storage, networking in real-world production environments
Incident-heavy environment experience (24x7 ops, on-call, major incident response, postmortems)
Strong networking: TCP/IP, routing, switching, DNS, latency, packet-level debugging
Bare-metal operations experience (Redfish, IPMI, lifecycle management, hardware troubleshooting)
Strong automation + config management (Ansible preferred)
Strong scripting (Python, Bash or similar)
Strong ownership mindset: simplifies, automates, removes operational toil
Highly desirable:
Experience in HPC, AI/ML, GPU compute, or large-scale high-performance infrastructure environments
Distributed / parallel storage experience (Lustre, WEKA, or equivalent)
InfiniBand or other high-performance, low-latency networking experience
Preferred Qualifications
Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent experience
NVIDIA NCP-type qualifications
PMP, ITIL, or equivalent project/operations management certification
LPI or equivalent Linux certifications
Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.
Here are just some of the great things you can expect from us:
20 days of annual leave: we value your peace of mind. With 20 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via UnitedHealthcare.
Participation in the company shares program
Diversity, Equality, Inclusion and Belonging
We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity status, or disability status. To ensure our recruitment processes provide an equal opportunity for all applicants to succeed, we encourage you to let us know if there are any adjustments that we can make.