The Site Reliability Engineer for Machine Learning Systems will ensure the efficient operation of ML systems, improve resource management, oversee disaster recovery, and enhance service stability. Responsibilities include building tools for monitoring the ML infrastructure and providing global system support.
Responsibilities
- Responsible for ensuring our ML systems run efficiently for large-model deployment, training, evaluation, and inference.
- Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
- Responsible for resource management and planning, including cost and budget, across computing and storage resources.
- Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, as well as improving resource utilisation and operational efficiency.
- Build software tools, products, and systems to monitor and manage the ML infrastructure and services efficiently.
- Be part of the global team roster that ensures system and business on-call support.
Qualifications
- Bachelor's degree or above in Computer Science, Computer Engineering, or a related field.
- Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment.
- Strong hands-on experience with Kubernetes and containers, and at least 3 years of relevant operations and maintenance experience.
Preferred Qualifications
- Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly; a strong sense of responsibility, good learning and communication skills, self-motivation, and team spirit.
- Good documentation habits, with the ability to write and update workflow and technical documentation on time as required.
- Experience operating and maintaining large-scale distributed ML systems.
- Experience operating and maintaining GPU servers.
Top Skills
Go
Python
Shell