Hyphen Connect Limited
LLM Pre-training & Distributed Engineer (AI Infrastructure)
Design, run, and optimize large-scale LLM pre-training on 1,000+ GPUs. Manage distributed training with PyTorch/DeepSpeed/Megatron-LM, optimize networking and memory, and automate checkpointing and failure recovery for long runs.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate has a deep understanding of GPU clusters and extensive systems engineering experience, and can keep training runs efficient and reliable.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA), and tune memory management to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Required Skills:
- Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).


