Site Reliability Engineer / Cloud Infrastructure Engineer
Remote · Full-time
- Experience: Any
- Salary: Not specified
- Openings: 1
- Posted: 5 hours ago
- Work mode: Remote
- Eligibility: Doha-based candidates are preferred. Remote candidates may also be considered if they bring strong ownership, reliability, and clear communication.
- Resume: Required to apply
Job Description
About the Role
We are seeking a highly capable Site Reliability Engineer to take charge of the reliability, scalability, security, and day-to-day stability of our live systems. This position sits at the core of the commerce platform and covers the full infrastructure and reliability stack, including AWS, Kubernetes, networking, databases, search services, observability, CI/CD, incident handling, and production support.
This is a hands-on role with significant ownership. It is best suited for someone who can work independently, solve difficult infrastructure challenges, and help establish a dependable foundation for a rapidly growing product.
Our environment includes AWS, EKS/Kubernetes, PostgreSQL, Redis/Valkey, Elasticsearch/OpenSearch, RabbitMQ, API Gateway, WAF, load balancers, Docker, Go/.NET Core microservices, Terraform, web applications, and several third-party integrations.
Responsibilities
- Take ownership of the reliability, uptime, performance, and security of production infrastructure.
- Administer and enhance the AWS environment, including EKS, networking, load balancing, API Gateway, WAF, RDS/PostgreSQL, caching, and managed cloud services.
- Run and fine-tune Kubernetes workloads, including deployments, autoscaling, resource tuning, pod health, service discovery, ingress, and environment setup.
- Improve database stability for PostgreSQL through performance troubleshooting, backup practices, monitoring, replication awareness, connection handling, and incident support.
- Support Elasticsearch/OpenSearch setups used for catalog and search functions.
- Expand observability across the platform through logs, metrics, dashboards, alerts, tracing, and practical production monitoring.
- Improve incident management by strengthening detection, triage, mitigation, post-incident reviews, and prevention of repeat issues.
- Make deployments safer and more dependable by improving CI/CD pipelines and release workflows.
- Partner with backend, mobile, product, and operations teams to support new functionality and confirm production readiness.
- Evaluate architecture and infrastructure choices with attention to reliability, cost, security, and scalability.
- Harden the platform using AWS security practices, WAF rules, IAM discipline, network restrictions, secrets handling, and awareness of vulnerabilities.
- Track cloud spending and identify savings opportunities without reducing reliability.
- Create and maintain operational documentation, runbooks, infrastructure notes, and recovery procedures.
Requirements
- Practical, production-level experience working in AWS environments.
- Hands-on background with Kubernetes, Docker, deployments, services, ingress, scaling, and troubleshooting.
- Strong grasp of networking basics such as DNS, TLS, load balancing, routing, security groups, firewalls, private/public networking, and HTTP traffic flow.
- Experience running PostgreSQL in production, including performance diagnosis, backups, monitoring, and connection issues.
- Exposure to Elasticsearch or OpenSearch in live environments.
- Working knowledge of observability tools and concepts such as metrics, logs, alerts, dashboards, tracing, SLIs/SLOs, and incident detection.
- Experience with CI/CD pipelines and modern release processes.
- Ability to investigate complex incidents across application, infrastructure, database, and network layers.
- Strong ownership mindset, good communication, and steady decision-making during incidents.
- Comfort in a fast-moving startup environment where priorities can shift quickly.
What Success Looks Like
In the early months, the goal is to make the platform more stable, visible, secure, and predictable. Success means stronger production insight, fewer recurring incidents, better infrastructure ownership, and clear operating standards for deployments, alerts, incident handling, and recovery.
We are looking for someone who goes beyond basic server administration and actively strengthens the engineering foundation of the company.
Location
Doha, Qatar.