DevOps / Site Reliability Engineer (SRE)

Summary:

We are looking for DevOps / SRE engineers to join our Server Systems Management and DevOps team and take an active role in managing our highly available, scalable infrastructure.

In this role, you will be responsible for keeping critical systems running across multi-cloud environments, primarily AWS and Huawei Cloud (HWC). You will also help improve our automation, monitoring, and incident management processes.

Education:

  • Bachelor’s degree in Computer Engineering, Software Engineering, or a related engineering discipline, or equivalent practical experience.

Responsibilities:

  • Infrastructure management and optimization in AWS and Huawei Cloud environments
  • Setup, management, and monitoring of Kubernetes clusters (production / pre-production)
  • Setup and improvement of CI/CD pipelines
  • Management of High Availability (HA) and Disaster Recovery (DR) scenarios
  • Management of monitoring and alerting infrastructure using tools such as Prometheus, Grafana, and Loki
  • Performance, capacity, and cost optimization activities
  • Participation in on-call rotation and response to critical incidents
  • Improvement of security, logging, and backup processes
  • Close collaboration with development teams to resolve deployment and runtime issues

Qualifications:

  • Proficiency in Linux system administration (preferably Ubuntu / CentOS)
  • Proficiency in Docker and container concepts
  • Proficiency in Kubernetes core components (Pods, Services, Ingress, Deployments, HPA, etc.)
  • Proficiency in at least one CI/CD tool (GitLab CI, GitHub Actions, Jenkins, etc.)
  • Knowledge of networking fundamentals (TCP/IP, DNS, NAT, Load Balancers, Firewalls)
  • Proficiency in monitoring and logging concepts
  • Strong problem-solving and analytical thinking skills
  • Ability to remain calm and act systematically during critical situations
  • Strong commitment to clear, up-to-date documentation
  • Strong teamwork skills
  • Openness to learning and self-improvement
  • Ability to adapt to rotational on-call duty when required
  • Strong sense of responsibility for minimizing downtime in critical systems
  • Ability to adapt to flexible working hours based on operational requirements

Preferred:

  • Experience with AWS services (EC2, VPC, ALB/NLB, RDS, IAM, CloudWatch)
  • Experience with Huawei Cloud (HWC) or other cloud service providers
  • Knowledge of Infrastructure as Code tools (Terraform, Ansible, Helm)
  • Knowledge of Prometheus, Grafana, Loki, and the ELK stack
  • Experience working with systems such as OpenStack, Ceph, Couchbase, and Elasticsearch
  • Experience with on-call and incident management