Job summary by JobGrid
Senior Site Reliability Engineer at Obsidian Security: Australia; On-site; Senior; IT; DevOps / SRE. JobGrid adds normalized role facts, source context, and a path to the employer application page so candidates can compare the listing before applying.
- Location and workplace: Australia, On-site
- Role classification: IT, DevOps / SRE, Senior
- Source freshness: checked by JobGrid on 2026-06-04.
- Application path: candidates continue to the employer application page with non-personal referral tags.
The DevOps/SRE team at Obsidian ensures that engineering excellence translates into stable, scalable, and high-performing production systems. We work closely with Engineering, Quality Engineering, and Customer Support to deliver end-to-end services that bring code to life and maintain our world-class SaaS security platform.
As part of our Sydney team, you will also play a foundational role in building Sherlock — our AI-powered SRE agent — owning the infrastructure that enables autonomous incident detection, root cause analysis, and remediation at scale.
What You’ll Do
Platform Reliability
- Support and maintain the service quality of our customer-facing SaaS security platform
- Address complex challenges around scalability, reliability, observability, and cost efficiency
- Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines
- Embed into the engineering team so that you understand the application deeply
- Define service verification strategies and implement them as part of the CI/CD process to meet SLAs
- Improve developer experience by optimizing CI/CD workflows and performance
- Participate in the on-call rotation, providing 24/7 support in coordination with our global SRE team
- Monitor, debug, and optimize production infrastructure and services on AWS/GCP
- Own and evolve the observability stack: design and maintain Prometheus/Mimir metrics pipelines, Grafana dashboards, Loki log aggregation, and distributed tracing (e.g. Tempo, Jaeger, or OpenTelemetry)
- Define and instrument SLIs/SLOs across services; build alerting strategies that reduce noise and surface actionable signals
AI SRE Agent (Sherlock)
- Own the Kubernetes infrastructure for Sherlock: five independently-scaled worker pools, each tuned for its agent’s compute profile with HPA autoscaling
- Design and maintain the CloudSQL schema, migration pipeline, task queue (SKIP LOCKED), and pgvector IVFFlat index for 1,000+ RCA entries
- Build Grafana dashboards covering queue depth, worker latency, agent error rates, accuracy trends, and P50/P95 speed
- Own and maintain the benchmark CI gate in GitLab that blocks any prompt version merge that regresses accuracy by more than 5% or speed by more than 15%
- Deliver capacity planning and cost dashboards for Sherlock’s GKE node pools
- By month 3, serve as the primary on-call engineer for all Sherlock infrastructure
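For candidates wondering what the benchmark CI gate in the list above might involve, here is a minimal sketch. It assumes both thresholds are relative changes against a baseline run and that "speed" is measured as P95 latency; the listing does not specify either. The `BenchmarkResult` type and `gate_failures` function are hypothetical names for illustration, not Obsidian's actual implementation.

```python
from dataclasses import dataclass

# Thresholds come from the listing; treating them as relative
# changes against the baseline is an assumption.
MAX_ACCURACY_DROP = 0.05
MAX_SLOWDOWN = 0.15


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical summary of one benchmark run for a prompt version."""
    accuracy: float     # fraction of benchmark cases answered correctly (0..1)
    p95_seconds: float  # P95 time to finish a benchmark case


def gate_failures(baseline: BenchmarkResult, candidate: BenchmarkResult) -> list[str]:
    """Return the reasons the candidate should be blocked (empty list = pass)."""
    failures = []

    # Relative accuracy drop: positive when the candidate is less accurate.
    accuracy_drop = (baseline.accuracy - candidate.accuracy) / baseline.accuracy
    if accuracy_drop > MAX_ACCURACY_DROP:
        failures.append(f"accuracy regressed by {accuracy_drop:.1%}")

    # Relative slowdown: positive when the candidate is slower.
    slowdown = (candidate.p95_seconds - baseline.p95_seconds) / baseline.p95_seconds
    if slowdown > MAX_SLOWDOWN:
        failures.append(f"P95 speed regressed by {slowdown:.1%}")

    return failures
```

In a GitLab CI job, a script built around a check like this would typically exit with a non-zero status when the list is non-empty, which fails the job and blocks the merge.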
Required
- 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS
- Bachelor’s degree in Computer Science or related field
- Production Kubernetes experience: you have authored and owned Deployments, HPAs, and resource limits, not just applied YAML
- Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD
- Deep hands-on experience with the Grafana observability stack: Prometheus/Mimir (metrics), Loki (logs), and distributed tracing (Tempo, Jaeger, or OpenTelemetry)
- Ability to design SLI/SLO frameworks, build alerting rules, and reduce alert fatigue across complex microservices
- PostgreSQL fluency: schema design, indexing, migrations, and query optimisation
- Async / queue-based architecture experience: debugged stuck queues, consumer lag, and duplicate processing
- Programming proficiency in Python or Go
- Strong ownership mindset and comfort with production on-call responsibility
Highly Desired
- GCP expertise: Cloud SQL, GKE, IAM, Pub/Sub
- pgvector or other vector database experience
- CI/CD pipeline ownership (GitLab CI or GitHub Actions)
- Familiarity with LLM APIs (Anthropic, Bedrock, or Vertex)
- Understanding of AI agent design patterns and frameworks
- Experience with Kafka, Elasticsearch, ScyllaDB, Databricks, Dagster, Sentry, or Kong