nahc

Site Reliability Engineer (SRE)

🇹🇼 Taipei City, Taiwan | On-site | Posted Jun 4, 2026

Work mode: On-site

Language: English

Posted: June 4, 2026

Last checked: June 4, 2026

Where this job is available

2 locations

Taipei City, Taiwan
Taiwan

JobGrid context

JobGrid job summary

Site Reliability Engineer (SRE) at nahc in Taipei City, Taiwan, on-site. JobGrid normalizes this into a comparable role record from the source posting and keeps the employer description separate; no salary was listed in the structured source data.

Primary location is Taipei City, Taiwan, with Taiwan also listed as a broader on-site location.
Source freshness: posted on 2026-06-04 and last checked on 2026-06-04.
Work language is English.
Applicants are routed to the original public application page with non-personal referral parameters.

Our client is an innovative technology company operating large-scale cloud and edge infrastructure supporting AI-driven products and services. As the platform continues to expand, they are looking for a Site Reliability Engineer to help build highly reliable, observable, and secure systems that power mission-critical applications.

This role offers the opportunity to work across cloud infrastructure, Kubernetes, observability, security, automation, and emerging AI operational platforms in a fast-growing environment.

What you will do:

Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
Build visibility into system health through metrics, logs, traces, and performance analytics.
Define and manage SLIs, SLOs, and service reliability targets.
Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.
Deploy, manage, and optimize containerized workloads running on Kubernetes.
Maintain scalable cloud infrastructure across production environments.
Improve system performance, availability, and operational efficiency.
Support infrastructure provisioning through Infrastructure-as-Code practices.
Implement secure access controls and audit mechanisms across infrastructure environments.
Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
Develop alerting and response procedures for security-related incidents.
Contribute to operational security best practices and governance initiatives.
Automate repetitive operational tasks to reduce manual effort and improve reliability.
Build tooling and scripts to streamline infrastructure operations.
Support CI/CD workflows and deployment automation.
Promote documentation, operational standards, and continuous improvement.
Participate in on-call rotations and incident management.
Lead troubleshooting efforts during production incidents.
Conduct root-cause analysis and post-mortem reviews.
Drive long-term improvements that enhance system resilience.
Work closely with software, AI, machine learning, hardware, and product teams.
Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
Support the operational needs of both cloud-based and distributed edge computing environments.

What you will need:

3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.
Hands-on experience with AWS or other major cloud platforms.
Strong understanding of observability and monitoring tools such as Grafana, Prometheus, or similar platforms.
Solid Linux administration and troubleshooting skills.
Experience with Docker, Kubernetes, and containerized workloads.
Experience with Infrastructure as Code tools such as Terraform.
Proficiency in at least one scripting or programming language (Python, Bash, etc.).
Understanding of networking fundamentals and infrastructure security concepts.
Experience supporting production systems and participating in incident response.
Strong automation mindset and commitment to operational excellence.

Nice-to-haves:

Experience operating large-scale edge computing or IoT deployments.
Familiarity with zero-trust access management platforms.
Experience in security operations, threat detection, or infrastructure security.
Exposure to AI infrastructure, LLM-based applications, or workflow automation platforms.
Knowledge of AI-Ops, anomaly detection, or intelligent monitoring solutions.
Familiarity with compliance and security frameworks such as ISO 27001.