Electrum Software

Software Engineer - Reliability/SRE

🇿🇦 Cape Town, South Africa · On-site · Full-time · Posted May 22, 2026

Location: Cape Town, South Africa

Workplace: On-site

Contract: Full-time

Language: English

Posted: May 22, 2026

Last verified: May 29, 2026

JobGrid context

Job summary by JobGrid

Software Engineer - Reliability/SRE at Electrum Software: Cape Town, South Africa; on-site; full-time. JobGrid adds normalized role facts, source context, and a path to the employer application page so candidates can compare the listing before applying.

Location and workplace: Cape Town, South Africa, on-site
Role classification: Full-time
Source freshness: checked by JobGrid on 2026-05-29.
Application path: candidates continue to the employer application page with non-personal referral tags.

Electrum is a next-generation payment software technology company.

Since 2012, we've delivered trusted, enterprise-grade, cloud-native software to optimise financial transaction processing. Our deep expertise has established us as a respected partner in high-volume, low-value payment schemes, enabling clients to deliver services to millions of South Africans daily.

At Electrum, we are grounded in impact – designing solutions that matter, acting with urgency, and continuously learning as we scale. We believe in creating together – working side by side with our clients and teams to build meaningful, lasting solutions. We prioritise making it safe – encouraging open communication, smart risk-taking, and trust so that creativity and alignment thrive. And we back empowered strong teams – hiring brilliant people, collaborating hard, and holding each other to high standards while leading with empathy and kindness.

The Role

As a Core Reliability Engineer, you will not be joining a traditional 24/7 operations team. Instead, you will act as a central software team enabler, defining the standards, observability tooling, and automation frameworks that allow our stream-aligned product teams to own their service health independently.

Reliability in our specific FinTech niche isn't just about keeping servers up; it's about processing high-volume, wide-impact financial transactions where a dropped message has real-world consequences. We are looking for an innovative systems thinker who wants to solve difficult industry problems, architect for scale alongside reliability, and help us set the industry benchmark for reliability in payments.

Your ultimate goal is to ensure reliable software is easy to build, and when we fall short, we know about it before our clients do.

Responsibilities

Enablement & RelOps Culture

Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build the dashboards that track specific error budgets.
Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).

Frameworks & Automation

Standardised Alerting & On-Call: Continuously improve our company-wide alerting and on-call frameworks to reduce alert fatigue and ensure that, when a pager goes off, the alert is highly actionable and symptom-based.
Disaster Recovery: Drive the evolution of our DR strategies from manual processes into fully automated "runbooks-as-code." You'll build the tooling that allows teams to prove and improve their service’s recoverability through autonomous, evidence-based testing.
Eliminate Toil: Develop systems, automation, and tooling in Python (or similar), e.g. for pre- and post-deployment verification, so that our "hands-off" reliability vision becomes a production reality.
Reliability-as-Code: Lead the drive to manage our entire reliability suite through Infrastructure as Code. Use Terraform to architect, deploy, and configure our observability stack - including ELK, Grafana, Loki, Prometheus, and Tracing - ensuring our monitoring environment is as reliable as our production systems.