Hireio, Inc.

Site Reliability Engineer - Machine Learning Systems

Posted Yesterday
Singapore
Mid level
The Site Reliability Engineer for Machine Learning Systems will ensure the efficient operation of ML systems, improve resource management, oversee disaster recovery, and enhance service stability. Responsibilities include building tools for monitoring the ML infrastructure and providing global system support.
The summary above was generated by AI

Description

Responsibilities

  • Responsible for ensuring our ML systems operate efficiently for large-model deployment, training, evaluation, and inference.
  • Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
  • Responsible for resource management and planning, including cost and budget, across computing and storage resources.
  • Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, as well as improving resource utilisation and operational efficiency.
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently.
  • Be part of the global team roster that ensures system and business on-call support.
Requirements

Qualifications:

  • Bachelor's degree or above, majoring in Computer Science, Computer Engineering or related fields;
  • Strong proficiency in at least one programming language such as Go/Python/Shell in a Linux environment;
  • Strong hands-on experience with Kubernetes and containers, with ≥3 years of relevant operation and maintenance experience.

Preferred Qualifications

  • Excellent logical analysis skills, with the ability to reasonably abstract and decompose business logic; a strong sense of responsibility, good learning and communication abilities, self-motivation and good team spirit;
  • Good documentation principles and habits, with the ability to write and update workflow and technical documentation on time as required;
  • Experience in the operation and maintenance of large-scale distributed ML systems;
  • Experience in operation and maintenance of GPU servers.

Top Skills

Go
Python
Shell
