Hireio, Inc.

Site Reliability Engineer - Machine Learning Systems

Job Posted 4 Days Ago (Reposted 4 Days Ago)
In-Office
Singapore
Mid level
The Site Reliability Engineer will ensure ML systems operate efficiently, manage resources, and improve stability and disaster recovery across multi-cloud environments.
The summary above was generated by AI
Description

Responsibilities

  • Responsible for ensuring our ML systems run efficiently for large model deployment, training, evaluation, and inference.
  • Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
  • Responsible for resource management and planning, including cost and budget, across computing and storage resources.
  • Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, and for improving resource utilisation and operational efficiency.
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently.
  • Join the global team roster that provides on-call support for systems and business services.
Requirements

Qualifications:

  • Bachelor's degree or above in Computer Science, Computer Engineering, or a related field;
  • Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment;
  • Strong hands-on experience with Kubernetes and containers, with ≥3 years of relevant operations and maintenance experience.

Preferred Qualifications

  • Excellent logical analysis skills, with the ability to sensibly abstract and decompose business logic; a strong sense of responsibility, good learning and communication abilities, self-motivation, and good team spirit;
  • Good documentation habits, with the ability to write and update workflow and technical documentation on time as required;
  • Experience operating and maintaining large-scale distributed ML systems;
  • Experience operating and maintaining GPU servers.

Top Skills

Containers
Go
GPU Servers
Kubernetes
Linux
ML Infrastructure
Python
Shell
