Centre for Strategic Infocomm Technologies
System Reliability Engineer (Data Centre)
Ensure the reliability, availability, and performance of data centre IT operations. Manage day-to-day monitoring, the incident lifecycle, capacity planning, documentation, observability and network monitoring, and remote management tools, and define SRE metrics (SLOs, SLIs, and error budgets). Collaborate with facilities teams on power, cooling, and physical infrastructure.
You will be part of a dynamic team responsible for ensuring the reliability, availability, and performance of our data centre's IT operations. As a System Reliability Engineer (Data Centre), you will oversee the day-to-day IT operations within the data centre, working closely with various teams to ensure seamless IT service delivery. While knowledge of data centre power and cooling infrastructure is beneficial, the primary focus of this role is on IT operations. You will collaborate with Data Centre Facilities teams on matters related to power, cooling, and physical infrastructure as needed. You must have a good understanding of cloud infrastructure technologies, architecture, and site reliability engineering (SRE) principles.
Responsibilities
- Oversee and manage IT operations within the data centre, including day-to-day monitoring, incident management, and problem management
- Lead the end-to-end incident management lifecycle, which encompasses immediate troubleshooting, root cause identification, and resolution implementation to restore services, followed by comprehensive post-incident analysis
- Develop and maintain documentation on IT infrastructure, operations, and procedures within the data centre
- Perform capacity planning to ensure IT infrastructure is scalable for future demands
- Collaborate and coordinate with Data Centre Facilities teams on matters related to power, cooling, and physical infrastructure
- Design and implement a robust observability platform, alongside network monitoring tools, for performance monitoring and real-time alerting of IT devices and networks
- Implement and manage remote management tools for out-of-band access and control of IT devices and servers
- Define, implement, and track SRE metrics, including SLOs, SLIs, and error budgets, to improve data centre IT reliability
Requirements (Minimum Qualifications)
- Background in Computer Science, Computer or Electrical Engineering, Information Technology or a related field
- Good technical knowledge of IT infrastructure, including servers, storage, networking, and cloud technologies
- Proficient in IT management software and tools
- 2 years of working experience in IT operations is preferred
- Fresh graduates are welcome to apply
As CSIT is an agency under the Ministry of Defence (Singapore), only Singapore Citizens will be considered.
Centre for Strategic Infocomm Technologies Singapore Office