Where this job is available
- Taipei City, Taiwan
- Taiwan
Job summary by JobGrid
Site Reliability Engineer (SRE) at nahc in Taipei City, Taiwan, on-site. JobGrid normalizes this into a comparable role record from the source posting and keeps the employer description separate; no salary was listed in the structured source data.
- Primary location is Taipei City, Taiwan, with Taiwan also listed as a broader on-site location.
- Source freshness: posted on 2026-06-04 and last checked on 2026-06-04.
- Work language is English.
- Applicants are routed to the original public application page with non-personal referral parameters.
Our client is an innovative technology company operating large-scale cloud and edge infrastructure supporting AI-driven products and services. As the platform continues to expand, they are looking for a Site Reliability Engineer to help build highly reliable, observable, and secure systems that power mission-critical applications.
This role offers the opportunity to work across cloud infrastructure, Kubernetes, observability, security, automation, and emerging AI operational platforms in a fast-growing environment.
What you will do:
- Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
- Build visibility into system health through metrics, logs, traces, and performance analytics.
- Define and manage SLIs, SLOs, and service reliability targets.
- Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.
- Deploy, manage, and optimize containerized workloads running on Kubernetes.
- Maintain scalable cloud infrastructure across production environments.
- Improve system performance, availability, and operational efficiency.
- Support infrastructure provisioning through Infrastructure-as-Code practices.
- Implement secure access controls and audit mechanisms across infrastructure environments.
- Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
- Develop alerting and response procedures for security-related incidents.
- Contribute to operational security best practices and governance initiatives.
- Automate repetitive operational tasks to reduce manual effort and improve reliability.
- Build tooling and scripts to streamline infrastructure operations.
- Support CI/CD workflows and deployment automation.
- Promote documentation, operational standards, and continuous improvement.
- Participate in on-call rotations and incident management.
- Lead troubleshooting efforts during production incidents.
- Conduct root-cause analysis and post-mortem reviews.
- Drive long-term improvements that enhance system resilience.
- Work closely with software, AI, machine learning, hardware, and product teams.
- Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
- Support the operational needs of both cloud-based and distributed edge computing environments.
What you will need:
- 3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.
- Hands-on experience with AWS or other major cloud platforms.
- Strong understanding of observability and monitoring tools such as Grafana, Prometheus, or similar platforms.
- Solid Linux administration and troubleshooting skills.
- Experience with Docker, Kubernetes, and containerized workloads.
- Experience with Infrastructure as Code tools such as Terraform.
- Proficiency in at least one scripting or programming language (Python, Bash, etc.).
- Understanding of networking fundamentals and infrastructure security concepts.
- Experience supporting production systems and participating in incident response.
- Strong automation mindset and commitment to operational excellence.
Nice-to-haves:
- Experience operating large-scale edge computing or IoT deployments.
- Familiarity with zero-trust access management platforms.
- Experience in security operations, threat detection, or infrastructure security.
- Exposure to AI infrastructure, LLM-based applications, or workflow automation platforms.
- Knowledge of AIOps, anomaly detection, or intelligent monitoring solutions.
- Familiarity with compliance and security frameworks such as ISO 27001.