Hyphen Connect Limited

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Posted Yesterday
In-Office
Singapore, SGP
Senior level
Design, run, and optimize large-scale LLM pre-training on 1,000+ GPUs. Manage distributed training with PyTorch/DeepSpeed/Megatron-LM, optimize networking and memory, and automate checkpointing and failure recovery for long runs.
The summary above was generated by AI

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
