Role summary from JobGrid
MLOps Support Team Lead at CloudFactory: Nairobi, Kenya; Hybrid. JobGrid adds normalized role facts, source context, and a path to the employer application page so candidates can compare the listing before applying.
- Location and workplace: Nairobi, Kenya, Hybrid
- Source freshness: checked by JobGrid on 2026-05-30.
- Application path: candidates continue to the employer application page with non-personal referral tags.
At CloudFactory, we are a mission-driven team passionate about unlocking the potential of AI to transform the world. By combining advanced technology with a global network of talented people, we make unusable data usable, driving real-world impact at scale.
More than just a workplace, we’re a global community founded on strong relationships and the belief that meaningful work transforms lives. Our commitment to earning, learning, and serving fuels everything we do as we strive to connect one million people to meaningful work and build leaders worth following.
Our Culture
At CloudFactory, we believe in building a workplace where everyone feels empowered, valued, and inspired to bring their authentic selves to work. We are:
- Mission-Driven: We focus on creating economic and social impact.
- People-Centric: We care deeply about our team’s growth, well-being, and sense of belonging.
- Innovative: We embrace change and find better ways to do things together.
- Globally Connected: We foster collaboration between diverse cultures and perspectives.
If you’re passionate about innovation, collaboration, and making a real impact, we’d love to have you on board!
Role Summary
As the MLOps Operations Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory’s MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.
You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.
Responsibilities:
Service Ownership & Reliability
- Own the operational performance of all production ML systems and pipelines
- Ensure reliability, availability, and supportability across client and internal MLOps workloads
- Establish and enforce SLAs, SLOs, and operational standards
- Act as the escalation point for major incidents and service degradation
Team Leadership & Delivery
- Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
- Define shift patterns, on-call rotations, and coverage models
- Set clear expectations, performance metrics, and development plans
- Foster a strong operational culture focused on accountability and continuous improvement
Incident Management & RCA
- Own incident response processes, including triage, communication, and resolution
- Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
- Drive reduction in repeat incidents through structured problem management
- Improve time to detect (TTD) and time to resolve (TTR) metrics
Monitoring, Observability & MLOps Maturity
- Drive implementation and evolution of monitoring across:
  - pipelines and data flows
  - infrastructure and compute
  - model performance and drift
- Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
- Partner with Engineering to improve instrumentation, logging, and alerting
Support Model & Process Design
- Define and evolve the MLOps support operating model
- Clearly establish boundaries between Support, Engineering, and external partners
- Build and maintain runbooks, playbooks, and escalation paths
- Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)
Stakeholder & Partner Management
- Act as the primary operational interface for:
  - Engineering teams
  - Platform Operations
  - External partners
- Reduce reliance on individuals by formalizing ownership and knowledge sharing
- Provide clear communication during incidents and service updates
Continuous Improvement & Scaling
- Identify trends in incidents and operational inefficiencies
- Drive improvements in:
  - automation
  - alert quality
  - self-healing capabilities
- Support onboarding of new MLOps projects into a standardized support model
- Contribute to building MLOps as a scalable, repeatable service offering
Reporting & Service Health
- Define and track key operational metrics:
  - incident volume and severity
  - SLA adherence
  - system uptime and reliability
- Support regular service reviews and model health reporting
- Provide leadership visibility into risks, trends, and improvement areas