IFS
Lead / Senior Lead Site Reliability Engineer - (Portfolio Companies: WorkWave)
Company Description
IGT 1 Outsourcing Lanka (Private) Limited, hereafter referred to as ‘IGT 1 Lanka’, is a Port City registered offshore company owned by three of the largest private equity companies, and a sister company of the largest Sri Lankan technology company, IFS.
We are committed to reinventing company success through offshore growth, expansion, diversity, and an unwavering pursuit of quality. As a leading provider of technology and employee offshore services, we help organizations all over the world navigate the complexities of the modern business environment. Our goal is to provide our customers with an operation that maximizes operational efficiency, spurs growth, allows them to develop and deliver world-class SaaS platforms, and creates long-term value.
At IGT1 Lanka, we believe that our people are the key to our collective success. We have developed a workplace culture that promotes diversity, teamwork, and ongoing education. We are presently a team of 300+ employees, with a plan to double this number in the next 12 months.
As such, we are always on the lookout for talented individuals who share our passion for innovation and excellence. Joining IGT1 Lanka means becoming part of a forward-thinking organization that is shaping the future of business within the vibrant new Port City. Together, we can drive change, push boundaries, and build a smarter, more connected world through our offshore operation.
Job Description
The WorkWave Team is seeking an experienced Lead / Senior Lead Site Reliability Engineer (SRE) to drive reliability, scalability, and operational excellence across our cloud-based infrastructure. This role is crucial in ensuring high availability, monitoring, and streamlined deployment processes across various environments, including AWS and hybrid systems. The Lead / Senior Lead SRE will work closely with cross-functional teams to optimize system reliability and efficiency, actively contributing to a robust infrastructure that supports business growth.
Responsibilities
- Design, manage, and optimize scalable infrastructure across cloud environments with a focus on reliability, availability, and performance. Implement comprehensive monitoring and observability systems to ensure proactive issue detection and resolution.
- Lead incident response for critical infrastructure issues across cloud platforms, drive root cause analysis, and implement corrective measures to minimize recurrence.
- Collaborate with cross-functional teams to create efficient, automated CI/CD pipelines that support cloud, hybrid, and on-prem deployments, enabling smooth and reliable delivery.
- Apply IaC best practices across environments using tools that ensure consistent provisioning, configuration, and management of resources in cloud environments.
- Ensure new services meet reliability and scalability requirements across all environments before deployment. Conduct capacity planning and performance tuning to adapt to business needs.
- Develop and maintain comprehensive documentation for infrastructure, deployment workflows, monitoring configurations, and incident management procedures, providing clear guidance across teams.
- Provide mentorship and technical guidance to team members, sharing knowledge of best practices in reliability engineering and infrastructure management.
- Research and integrate new tools and technologies to improve the efficiency, scalability, and resilience of our SRE processes across cloud and hybrid infrastructures.
Qualifications
- Bachelor’s or Master’s Degree in Computer Science, Information Technology, or a related field.
- 4-5+ years of experience in Site Reliability Engineering or DevOps with a focus on multi-environment infrastructure and cloud platforms.
- Strong track record of managing and optimizing infrastructure in production environments, including incident management and system troubleshooting.
- Proficient in CI/CD pipeline automation and infrastructure as code practices across cloud and hybrid environments.
Skills and Competencies
- Expertise in monitoring, observability, and incident management using tools like Grafana, AWS X-Ray, and CloudWatch, with a focus on RCA and proactive alerting.
- Proficiency in automation and scripting (e.g., Python, Bash) and Infrastructure as Code (IaC) tools such as Terraform or AWS CloudFormation.
- In-depth knowledge of AWS services for reliability, including Auto Scaling, Elastic Load Balancing, RDS, and S3, with a focus on high availability and fault tolerance.
- Hands-on experience with CI/CD pipelines using AWS CodePipeline, CodeBuild, or third-party tools integrated with AWS services.
- Excellent communication and collaboration skills to drive system reliability and foster cross-functional teamwork in a cloud-first environment.