Job summary by JobGrid
Datacentre Operations Engineering Lead at Radiant: Paris, France; On-site; Lead; IT; IT and systems support. This listing is part of JobGrid's IT support jobs from career pages. JobGrid adds normalized role facts, source context, and a path to the employer application page so candidates can compare the listing before applying.
- Location and workplace: Paris, France, On-site
- Role classification: IT, IT and systems support, Lead
- Source freshness: checked by JobGrid on 2026-06-05.
- Application path: candidates continue to the employer application page with non-personal referral tags.
About Us
We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.
Role Overview
We are seeking an exceptional Datacentre Operations Engineering Lead to own the operational buildout, stabilisation, and long-term SLA execution of our Paris-Saclay deployment, Radiant’s flagship AI infrastructure site in Europe. This is a site-based senior technical leadership role with full accountability for on-the-ground engineering execution, team performance, and the continuous elevation of operational standards across next-generation, ultra-high-density hardware architecture.
You will take command of site commissioning and drive it through to operational steady state—establishing non-negotiable physical standards, embedding advanced liquid cooling workflows, and instituting rigorous working practices for high-density power distribution directly on the data centre floor. You will be the definitive technical authority at the site: the person who sets the bar, enforces it, and continuously raises it.
Working in close partnership with Infrastructure (HPC) SRE, Network Engineering, and Datacentre Strategy teams, you will translate organisational engineering standards into flawlessly executed daily operations. Your most critical mandate in your first months is to become the recognised expert on the site’s most demanding systems—including NVIDIA NVL72-class and busbar-based ultra-high-density compute platforms—and to define and document the working practices that will make these systems reliably supportable at scale.
As Radiant expands its EMEA footprint, the operational blueprints, team capability, and cultural standards you establish here will directly shape how we scale across the region.
What’s in It for You
Join a team operating some of the world’s most advanced high-performance computing infrastructure. As a Datacentre Operations Engineering Lead, you’ll work hands-on with cutting-edge GPU and CPU platforms — including the latest NVIDIA architectures — powering dense, large-scale compute environments used for AI, machine learning, and next-generation workloads.
This is an opportunity to build and lead at the absolute forefront of modern infrastructure, where reliability, scale, and performance matter every day. You’ll direct experienced engineers and collaborate across a globally distributed organisation that values openness, inclusion, technical excellence, and continuous learning.
We move quickly, solve meaningful challenges, and give people the space to make an impact. If you thrive in high-stakes environments, want genuine ownership of a world-class site, and are ready to define what great looks like for ultra-high-density AI compute operations, this role was built for you.
You can also expect:
Exposure to industry-leading GPU and AI infrastructure
Opportunities to grow alongside a rapidly scaling global business
A collaborative, inclusive, and supportive engineering culture
Real ownership and the ability to influence operational excellence
Work that sits at the intersection of people, performance, and technology
A modern, flexible, globally connected workplace with ambitious goals
Key Responsibilities
Leadership & Operational Authority
Own and direct day-to-day operations at the Paris-Saclay site, providing authoritative technical leadership and clear direction to all Datacentre Operations Engineers
Set and enforce site-level operational standards—you are the final word on physical practices, hardware handling, cooling operations, and SLA delivery on the ground
Drive escalation resolution with pace and decisiveness; own the outcome of critical incidents at site level
Work closely with Infrastructure (HPC) SRE teams to implement operational standards, deployment procedures, monitoring improvements, and reliability initiatives
Partner with Datacentre Strategy leadership on operational readiness, service improvement initiatives, and deployment scalability planning
Contribute operational feedback and deployment learnings to support process development and future datacentre onboarding activities across EMEA
Promote operational discipline, documentation standards, and continuous improvement practices across the local engineering team
Commissioning & Operational Readiness
Lead site commissioning from physical infrastructure readiness through to operational steady state, setting uncompromising standards at every phase
Establish and own all working practices for advanced liquid cooling systems, including CDU commissioning, coolant loop management, leak detection protocols, and steady-state monitoring
Define and implement safe working procedures for ultra-high-density compute platforms including NVIDIA NVL72 (GB300 NVLink rack-scale) systems and busbar-based high-density power distribution
Ensure all cabling, power, and cooling infrastructure is deployed to exacting standards, with complete documentation and handover packs for ongoing operations
High-Density Compute & Liquid Cooling Operations
Operate, maintain, and troubleshoot advanced Direct Liquid Cooling (DLC) systems, including Cooling Distribution Units (CDUs), rear-door heat exchangers, in-row cooling, and associated coolant infrastructure
Own and maintain cooling capacity awareness across the site; ensure that real-world conditions are fully reflected in our observability stack and capacity management tooling
Execute and oversee structured break/fix procedures for NVL72-class and equivalent ultra-high-density GPU platforms, including GPU module swap, coolant loop isolation, busbar connection/disconnection, and firmware-level fault isolation
Enforce safe working practices around high-density power distribution, including busbar tap-off management and power capacity distribution
Own the development and continuous refinement of break/fix SOPs for liquid-cooled and high-density compute systems, ensuring repeatability, safety, and minimal MTTR
Capacity Management
Own and maintain site-level capacity management across power, space, and cooling dimensions, ensuring infrastructure headroom is tracked, communicated, and acted upon
Contribute to capacity models aligned with current deployments and forward demand forecasts; flag risks proactively to Datacentre Strategy and Infrastructure teams
Partner with HPC SRE and Network Engineering to ensure capacity planning informs deployment scheduling and prevents operational constraints
Manage lifecycle of on-site hardware assets from delivery through to disposal, maintaining full fidelity in Radiant’s CMDB
Day to Day Responsibilities
Lead rapid diagnosis and resolution of hardware and network issues to maximise uptime; set the pace and standard for incident response across the team
Respond to critical hardware alerts via our monitoring and observability platform, and contribute to ongoing service improvement that strengthens our monitoring capability
Own vendor relationships and RMA processes, ensuring all support cases are resolved within Radiant’s SLOs to meet customer SLAs
Manage on-site assets from purchase and delivery through lifecycle management and disposal, while owning asset management within Radiant’s CMDB system.
In conjunction with the HPC SRE function, run high-performance test suites for newly commissioned infrastructure on site, and use your hardware expertise to continually improve tooling.
Deploy and maintain HPC and AI hardware for uninterrupted operations, including performing low-level system maintenance such as hardware troubleshooting, firmware updates, and replacement of components as needed.
Oversee cooling, power distribution, and other critical data centre technologies to maintain high operational standards.
Develop and maintain datacentre and hardware management SOPs, ensuring alignment with Radiant’s governance and compliance requirements
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
Operate and support services 24x7x365 for production environments, including on-call rotation; set expectations for team on-call readiness and response quality
Lead incident postmortem analyses and root cause investigations; document learnings and drive remediation automation
Mentor engineers, manage performance on the ground, and act as the operational requirements authority for other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of: do, document, automate
Be willing to cross-train and upskill in Infrastructure/Platform SRE practices
Be willing to travel across EMEA to support future datacentre onboarding and train in new technologies
Probation Period Objectives
The following outcomes are expected to be substantially achieved within the probation period:
Independently execute break/fix operations on NVL72-class (or equivalent ultra-high-density) and busbar-based compute platforms to production standard, with no requirement for external vendor support for in-scope procedures
Define, document, and publish working practices and SOPs for the safe operation, maintenance, and break/fix of liquid-cooled and ultra-high-density compute systems at the Paris-Saclay site
Establish clear supportability enhancements for these platforms—including spare parts protocols, escalation paths, and runbooks—that the broader team can operate from
Demonstrate full command of site capacity management: power, space, and cooling models are accurate, maintained, and informing operational decisions
Achieve demonstrable team uplift: Datacentre Operations Engineers are operating to elevated standards under your direction, with evidence of structured coaching and documented process improvements
Essential Skills & Experience
Degree in Computer Science/Electrical Engineering, or 10+ years of directly relevant industry experience in senior data centre operations roles
5+ years of experience in data centre operations, HPC, or related roles, with demonstrable progression into technical leadership
Proven track record in a lead or senior position with responsibility for team direction, standards ownership, and SLA accountability
Strong communication skills in French and English.
Passion for hardware and upholding the highest of operational standards on the ground.
Proven track record working with NVIDIA GPU-based HPC systems or equivalent, high-performance storage, and networking.
Expertise in hardware installation, network configuration, and low-level system maintenance, including hardware troubleshooting and firmware management.
Knowledge of data centre environment technologies, including cooling and power distribution.
Hands-on experience with Direct Liquid Cooling (DLC) systems: CDU operation, coolant loop management, leak detection, thermal monitoring, and associated maintenance practices
Demonstrated experience or strong working knowledge of ultra-high-density compute platforms such as NVIDIA NVL72, busbar-based power distribution, or equivalent systems operating at >30 kW/rack, or clear evidence of rapid proficiency in adjacent high-density technologies and willingness to qualify on NVL72/busbar systems
Expertise in structured break/fix methodology for liquid-cooled and high-density compute: GPU module exchange, coolant loop isolation, busbar operations, and firmware-level fault isolation
Strong capacity management capability: power, space, and cooling modelling; headroom tracking; demand forecasting
Experience in data centre design, greenfield deployments, and operations.
Strong understanding of hardware and spares management, with the ability to handle RMAs and support cases within defined SLOs to meet SLA requirements.
Solid understanding of HPC and AI workloads.
Strong problem-solving abilities and the resilience to thrive in a fast-paced environment.
Excellent communication skills and ability to collaborate with cross-functional, internationally dispersed teams.
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers
Bonus: Vendor-endorsed qualifications from NVIDIA, HPE, or equivalent OEMs for high-density AI compute or liquid cooling systems
Preferred Qualifications
Knowledge of large scale private cloud deployments and capacity planning.
Qualifications in HVAC management and deployments
Certifications in relevant areas - Hardware, Networking
ITIL Foundation level qualification or equivalent experience
Data centre facilities qualifications (e.g., Uptime Institute ATD, EPI CDCP, or equivalent)