DevOps / Site Reliability Engineer (SRE)

As the leading force of Türkiye’s National Technology Initiative, Baykar develops indigenous and national high-technology Unmanned Aerial Vehicle (UAV) systems and creates global impact with proven platforms. With our end-to-end engineering approach from R&D to production, we continue to deliver projects that push technological boundaries.

OUR DEPARTMENT:

The Network, Information Technologies, and Information Security Systems Department is responsible for the end-to-end design, development, management, and security of the organization’s digital infrastructure. Within this department, software development, system and infrastructure management, network technologies, information security, artificial intelligence and data analytics, DevOps processes, ERP systems, and software testing activities are carried out with an integrated approach.

In line with the principles of high availability, scalability, and security, the department aims to contribute to the digital transformation of corporate processes by developing modern software architectures, cloud and on-premise infrastructures, automation solutions, and advanced analytics applications.

POSITION OBJECTIVE:

The purpose of this role, within the Server Systems Management and DevOps team, is to ensure the sustainable and uninterrupted operation of infrastructure in line with high availability and scalability principles. In this context, the role is responsible for managing, automating, and monitoring critical systems operating in multi-cloud environments, primarily AWS and Huawei Cloud (HWC), and for improving the incident management processes for those systems.

WHAT AWAITS YOU:

  • Infrastructure management and optimization in AWS and Huawei Cloud environments,
  • Setup, management, and monitoring of Kubernetes clusters (production / pre-production),
  • Setup and improvement of CI/CD pipelines,
  • Management of High Availability (HA) and Disaster Recovery (DR) scenarios,
  • Management of monitoring and alerting infrastructure using tools such as Prometheus, Grafana, and Loki,
  • Performance, capacity, and cost optimization activities,
  • Participation in on-call rotation and response to critical incidents,
  • Improvement of security, logging, and backup processes,
  • Close collaboration with development teams to resolve deployment and runtime issues.

GENERAL QUALIFICATIONS:

  • Bachelor’s degree in Computer Engineering, Software Engineering, or a related engineering discipline, or equivalent practical experience,
  • Proficiency in Linux system administration (preferably Ubuntu / CentOS),
  • Proficiency in Docker and container concepts,
  • Proficiency in Kubernetes core resources and concepts (Pods, Services, Ingress, Deployments, HPA, etc.),
  • Proficiency in at least one CI/CD tool (GitLab CI, GitHub Actions, Jenkins, etc.),
  • Knowledge of networking fundamentals (TCP/IP, DNS, NAT, Load Balancers, Firewalls),
  • Proficiency in monitoring and logging concepts,
  • Strong problem-solving and analytical thinking skills,
  • Ability to remain calm and act systematically during critical situations,
  • Strong attention to documentation,
  • Strong teamwork skills,
  • Openness to learning and self-improvement,
  • Ability to adapt to rotational on-call duty when required,
  • Strong sense of responsibility for minimizing downtime in critical systems,
  • Ability to adapt to flexible working hours based on operational requirements.

ADDITIONAL QUALIFICATIONS (PREFERRED):

  • Experience with AWS services (EC2, VPC, ALB/NLB, RDS, IAM, CloudWatch),
  • Experience with Huawei Cloud (HWC) or other cloud service providers,
  • Knowledge of Infrastructure as Code tools (Terraform, Ansible, Helm),
  • Knowledge of Prometheus, Grafana, Loki, ELK,
  • Experience working with systems such as OpenStack, Ceph, Couchbase, Elasticsearch,
  • Experience with on-call duty and incident management.