foresite-labs-fl2024-006

Staff Engineer, CI/CD & Cloud Infrastructure

Modality: On-site
Posted: May 1, 2026
Last verified: May 7, 2026

Location: San Diego, CA

Job Type: Full-Time

Salary Range: $175,000 - $185,000

Position Overview

We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software — spanning Python, C/C++, and CUDA on Linux — is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.

This is a senior hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.

Key Responsibilities

CI/CD & Build Engineering

  • Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms

  • Manage build systems for Python, C/C++, and CUDA codebases on Linux

  • Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines

  • Implement robust versioning, tagging, and artifact management strategies

  • Ensure full traceability of builds, test results, and artifacts from commit to deployment (a minimal sketch follows this list)

  • Manage Docker-based build environments including base images, caching, and reproducibility

  • Maintain and optimize build performance, parallelism, and reliability
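
To make the versioning and traceability bullets concrete, here is a minimal Python sketch, with hypothetical file paths and tag conventions, that derives a version from git describe and writes a metadata record linking commit, version, and artifact checksum:

```python
"""Build-traceability sketch; paths and tag conventions are assumptions."""
import hashlib
import json
import pathlib
import subprocess


def git_output(*args: str) -> str:
    # Run a git command and return its trimmed stdout.
    return subprocess.check_output(["git", *args], text=True).strip()


def build_metadata(artifact: pathlib.Path) -> dict:
    # Link the artifact back to its exact source state.
    return {
        "version": git_output("describe", "--tags", "--always", "--dirty"),
        "commit": git_output("rev-parse", "HEAD"),
        "artifact": artifact.name,
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),
    }


if __name__ == "__main__":
    # "dist/app.tar.gz" is a placeholder for a real build output.
    meta = build_metadata(pathlib.Path("dist/app.tar.gz"))
    pathlib.Path("build-metadata.json").write_text(json.dumps(meta, indent=2))
```

Published alongside the artifact, a record like this lets a deployment on an instrument be traced back to the exact commit that produced it.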

Cloud Infrastructure (AWS)

  • Architect and manage complex AWS infrastructure including:

    • IAM roles, policies, and access management

    • Storage services (S3, EBS, EFS) with tiered lifecycle policies (see the sketch after this list)

    • Databases (RDS, DynamoDB, or similar) with backup and failover strategies

    • Data workflow and pipeline engines (Step Functions, Airflow, or similar)

    • Compute services (EC2, ECS, EKS, Lambda) scaled to workload requirements

  • Implement infrastructure as code using Terraform

  • Manage Kubernetes clusters and Helm charts for containerized workloads

  • Design for scalability, high availability, and disaster recovery

  • Manage cost optimization, resource tagging, and infrastructure governance

  • Support multi-account and multi-region strategies as needed

  • Use familiarity with Azure and GCP to support secondary or hybrid requirements
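
As one way to picture the tiered lifecycle policies mentioned above: in this role the policy would most likely be declared in Terraform, but a boto3 sketch shows its shape (bucket name, prefix, and retention windows are assumptions):

```python
"""Sketch of a tiered S3 lifecycle policy; names and windows are placeholders."""
import boto3

s3 = boto3.client("s3")

lifecycle = {
    "Rules": [
        {
            "ID": "experiment-data-tiering",
            "Filter": {"Prefix": "experiments/"},
            "Status": "Enabled",
            "Transitions": [
                # Hot -> warm after 30 days, warm -> cold after 90.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Retention window is an assumption, not a stated requirement.
            "Expiration": {"Days": 365},
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket="example-experiment-data",  # hypothetical bucket
    LifecycleConfiguration=lifecycle,
)
```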

On-Premises HPC & Hybrid Infrastructure

  • Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing

  • Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration

  • Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)

  • Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute

  • Design and operate hybrid burst-to-cloud infrastructure — provision and manage AWS compute resources that extend local HPC capacity on demand

  • Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements

  • Manage OS patching, driver updates, and GPU runtime environments across HPC nodes

  • Monitor HPC cluster health, utilization, and capacity to inform scaling decisions (a polling sketch follows)
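
A minimal polling sketch for the monitoring bullet, assuming NVIDIA GPUs and the stock nvidia-smi query interface; in production the output would feed an exporter rather than stdout:

```python
"""GPU health polling sketch for an HPC node; assumes nvidia-smi is installed."""
import csv
import io
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total,temperature.gpu"


def gpu_stats() -> list[dict]:
    # nvidia-smi emits one CSV row per GPU with the requested fields.
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    keys = QUERY.split(",")
    return [dict(zip(keys, (v.strip() for v in row)))
            for row in csv.reader(io.StringIO(out))]


if __name__ == "__main__":
    for gpu in gpu_stats():
        print(gpu)  # e.g. feed these values to a Prometheus exporter instead
```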

Experiment Data Management & Pipelines

  • Design and operate data ingestion pipelines for high-volume experiment data from lab instruments

  • Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost

  • Deploy and manage search infrastructure (Elasticsearch/OpenSearch) to make experiment data universally discoverable and queryable

  • Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing

  • Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data

  • Design data lifecycle policies including retention, archival, and compliance requirements

  • Ensure data pipelines are reliable, idempotent, and observable, with clear error handling and retry logic (see the sketch after this list)

  • Work with engineering and science teams to define data schemas, access patterns, and query requirements
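
One way to make ingestion idempotent, as called for above, is a deterministic document ID, so a retried run overwrites rather than duplicates. A sketch against the Elasticsearch Python client (index name, endpoint, and metadata fields are assumptions):

```python
"""Idempotent ingestion sketch; endpoint, index, and fields are placeholders."""
import hashlib
import pathlib

from elasticsearch import Elasticsearch  # the OpenSearch client is nearly identical

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def ingest(path: pathlib.Path, instrument: str, run_id: str) -> str:
    data = path.read_bytes()
    checksum = hashlib.sha256(data).hexdigest()
    # Deterministic ID: the same run + file + content always maps to the
    # same document, so re-running after a failure cannot duplicate records.
    doc_id = hashlib.sha256(f"{run_id}:{path.name}:{checksum}".encode()).hexdigest()
    es.index(
        index="experiment-data",
        id=doc_id,
        document={
            "run_id": run_id,
            "instrument": instrument,
            "file": path.name,
            "size_bytes": len(data),
            "sha256": checksum,
        },
    )
    return doc_id
```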

Deployment & Release Engineering

  • Own deployment workflows for software delivered to embedded instruments in our central lab

  • Manage release processes for a small number of complex, high-value lab-operated instruments

  • Design deployment strategies that account for rollback, validation, and minimal downtime (a manifest sketch follows this list)

  • Coordinate versioned releases across multiple software components and dependencies

  • Support development, staging, and production environment parity
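
A sketch of the rollback-friendly release bookkeeping described above: every deployment writes a manifest pinning each component version, and rollback is simply redeploying the previous manifest (component names and versions here are invented):

```python
"""Versioned release manifests with well-defined rollback; names are hypothetical."""
import json
import pathlib

MANIFESTS = pathlib.Path("releases")


def record_release(release: str, components: dict[str, str]) -> None:
    # Pin every component version that shipped in this release.
    MANIFESTS.mkdir(exist_ok=True)
    (MANIFESTS / f"{release}.json").write_text(json.dumps(components, indent=2))


def rollback_target(current: str) -> dict[str, str]:
    # Rollback is just the newest manifest older than the current release.
    names = sorted(p.stem for p in MANIFESTS.glob("*.json"))
    idx = names.index(current)
    if idx == 0:
        raise RuntimeError("no earlier release to roll back to")
    return json.loads((MANIFESTS / f"{names[idx - 1]}.json").read_text())


if __name__ == "__main__":
    record_release("2026.05.01", {"firmware": "3.2.0", "control-app": "1.9.4"})
    record_release("2026.05.07", {"firmware": "3.2.1", "control-app": "1.9.4"})
    print(rollback_target("2026.05.07"))  # -> the 2026.05.01 component pins
```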

Logging, Observability & Traceability

  • Implement centralized log collection and aggregation across cloud and on-site systems

  • Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)

  • Ensure structured, searchable logging with clear correlation across services (see the sketch after this list)

  • Build dashboards and alerting for infrastructure health, pipeline status, and deployment state

  • Establish traceability standards linking builds, tests, artifacts, and deployments

  • Support diagnostics and post-mortem analysis for production incidents
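
For the correlation bullet above, a stdlib-only sketch of structured JSON logging where every line carries a correlation ID, so events from one pipeline run can be joined across services in Loki or ELK (field names are assumptions):

```python
"""Structured-logging sketch; field names and logger name are placeholders."""
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # One JSON object per line keeps logs machine-searchable.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("pipeline")
log.addHandler(handler)
log.setLevel(logging.INFO)

# One correlation ID per pipeline run, attached to every log line.
run_id = str(uuid.uuid4())
log.info("build started", extra={"correlation_id": run_id})
log.info("artifact uploaded", extra={"correlation_id": run_id})
```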

AI-Augmented DevOps

  • Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting

  • Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks

  • Design guardrails and human-in-the-loop controls for AI-driven automation in production environments (a sketch follows this list)

  • Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling

  • Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability
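
A minimal sketch of the human-in-the-loop guardrail pattern mentioned above; everything here is hypothetical, and the point is only that AI-proposed actions are queued and run solely after explicit approval:

```python
"""Human-in-the-loop guardrail sketch; the whole flow is illustrative."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    description: str
    action: Callable[[], None]
    approved: bool = False


class Guardrail:
    def __init__(self) -> None:
        self.queue: List[Proposal] = []

    def propose(self, description: str, action: Callable[[], None]) -> Proposal:
        # AI tooling can only enqueue proposals, never execute directly.
        p = Proposal(description, action)
        self.queue.append(p)
        return p

    def approve_and_run(self, proposal: Proposal) -> None:
        # In a real system this call sits behind a reviewed UI or chat-ops step.
        proposal.approved = True
        proposal.action()


if __name__ == "__main__":
    g = Guardrail()
    p = g.propose("restart flaky CI runner", lambda: print("runner restarted"))
    g.approve_and_run(p)  # nothing runs until a human approves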

Qualifications

Education:

BS/MS in Computer Science or Engineering

Required:

    Experience & Technical Skills

  • 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles

  • Strong, hands-on Linux expertise (administration, debugging, performance tuning)

  • Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)

  • Proven experience managing complex AWS infrastructure at scale

  • Strong knowledge of Docker including multi-stage builds, registries, and orchestration

  • Experience with infrastructure as code using Terraform

  • Experience with Kubernetes and Helm for container orchestration

  • Solid understanding of versioning strategies, artifact management, and release engineering

  • Experience integrating agentic AI into DevOps workflows and CI/CD pipelines

    Programming & Build Systems

  • Proficiency in Python and shell scripting for automation and tooling

  • Ability to read, debug, and build C/C++ and CUDA applications on Linux

  • Experience integrating build systems (CMake, Make) into CI pipelines (see the sketch after this list)

  • Familiarity with package management and dependency resolution across languages
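
As a small illustration of wiring CMake into CI, a Python helper that runs the standard configure/build/test cycle (paths are placeholders; ctest --test-dir needs CMake 3.20 or newer):

```python
"""CI helper sketch for a CMake project; source and build paths are placeholders."""
import subprocess


def run(cmd: list[str]) -> None:
    # Echo each command so CI logs show exactly what ran.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def build_and_test(src: str = ".", build_dir: str = "build") -> None:
    run(["cmake", "-S", src, "-B", build_dir, "-DCMAKE_BUILD_TYPE=Release"])
    run(["cmake", "--build", build_dir, "--parallel"])
    run(["ctest", "--test-dir", build_dir, "--output-on-failure"])


if __name__ == "__main__":
    build_and_test()
```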

    Cloud & Infrastructure

  • Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services

  • Experience managing on-premises Linux HPC infrastructure alongside cloud resources

  • Experience designing for high availability, failover, and disaster recovery

  • Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)

  • Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)

  • Understanding of tiered storage strategies and data lifecycle management

  • Knowledge of cost management, tagging strategies, and infrastructure governance

    Observability & Traceability

  • Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)

  • Understanding of build and artifact traceability practices

  • Experience with structured logging and distributed tracing concepts

Preferred:

  • Experience deploying software to embedded or lab-operated instruments

  • Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments

  • Experience with CUDA build toolchains and GPU-accelerated workloads

  • Familiarity with Azure or GCP in addition to AWS

  • Experience in regulated or reliability-sensitive environments

  • Experience with GitOps workflows and progressive delivery strategies

  • Familiarity with secrets management (Vault, AWS Secrets Manager)


We are an equal opportunity employer. We thrive on diversity and collaboration.
