Centre for Strategic Infocomm Technologies

Staff Platform Engineer - High Performance Computing Platform Management

Posted 5 Hours Ago
In-Office
Singapore, SGP
Senior level
Lead design, implementation, and operation of a resilient, scalable HPC platform covering compute, storage, networking, job scheduling, capacity planning, security, and stakeholder support. Optimize cluster performance, implement storage and high-performance networking, manage cloud/container workflows, and lead engineering teams to meet organizational HPC needs.
The summary above was generated by AI
You will be part of the dynamic team responsible for building resilient network infrastructure using cutting-edge technologies such as cloud-based and software-defined networking (e.g. SD-WAN, ACI and NSX). You must have a good understanding of IT infrastructure systems and knowledge of the latest networking technologies and platforms. You will be a technical specialist in a team, and must be keen to take on new challenges and keep abreast of the rapidly evolving technology landscape.

Role

  • We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our HPC infrastructure platform. The successful candidate will have a deep understanding of HPC systems, architectures and technologies, as well as experience with managing large-scale computing environments. The role will involve designing, implementing and maintaining the HPC infrastructure platform, ensuring high availability, scalability and performance.

Responsibilities

  • Lead a team to deliver a resilient, scalable and secure HPC platform, including compute nodes, storage systems, networks and job scheduling systems.
  • Lead, design, implement and manage the HPC infrastructure platform to meet organisational needs.
  • Design and implement storage solutions for HPC workloads to ensure efficient data storage and retrieval.
  • Design and implement high-performance networking solutions, including InfiniBand, Ethernet, and other interconnects.
  • Plan and manage HPC resource capacity, including forecasting, procurement and deployment of new hardware and software.
  • Manage HPC clusters, including optimizing, monitoring and troubleshooting cluster performance, as well as managing job scheduling and resource allocation. 
  • Ensure the security and compliance of the HPC infrastructure platform, including managing access controls, implementing security patches, and conducting regular security checks.
  • Collaborate with stakeholders like data scientists and developers to optimize application performance on the HPC platform and provide technical support on using the HPC infrastructure platform.

Requirements (Minimum Qualifications)

  • Background in Computer Science, Computer Engineering, or a related field.
  • 8+ years of experience in managing HPC systems, including experience with Linux, Unix, or other operating systems.
  • Strong knowledge of HPC architectures, including clusters, grids, and clouds.
  • Experience with HPC job scheduling systems, such as Slurm, Torque and LSF.
  • Strong understanding of storage systems, including SANs, NAS, and object storage.
  • Experience with high-performance networking, including InfiniBand, Ethernet, and other interconnects.
  • Experience with cloud computing platforms, such as AWS, Azure, or Google Cloud.
  • Experience with scripting languages, such as Python, Perl, or Bash.
  • Experience with containerization (Docker, Kubernetes) and proficiency in a range of complementary technologies, including Knative, Run:AI, Grafana, Prometheus, Kyverno, ArgoCD, Rancher and NVIDIA BCM, as well as knowledge of NVIDIA SuperPOD architecture.
  • Experience in leading engineering teams.

Nice to Have

  • Certifications in NVIDIA AI Infrastructure and Operations, and Certified Kubernetes Administrator.
  • Experience with machine learning or deep learning frameworks, such as TensorFlow or PyTorch.
  • Familiarity with agile development methodologies and version control systems, such as Git.

Why join us?

  • The work is purposeful and meaningful 
  • You will work with the best engineers 
  • We work with modern technologies and tech stacks 
  • We have an excellent engineering culture and work-life balance
  • We aspire to engineering and operational excellence
  • We empower you to innovate
  • We grow together as a family 

As CSIT is an agency under the Ministry of Defence (Singapore), only Singapore Citizens will be considered.
