As a Site Reliability Engineer, you'll design software to automate operations, integrate AI, manage system performance, and troubleshoot incidents while collaborating with diverse teams.
Job Description
As an experienced Site Reliability Engineering professional, you'll make decisions of a global and strategic nature that impact our customers, clients, and businesses around the globe. Your expertise in analyzing complex data systems, anticipating problems, and finding ways to mitigate risk will be key in leading a high-performing team to successfully design and navigate the program roadmap. By incorporating your knowledge of business drivers, you will effect change and lead the development of innovative improvements and world-class practices. You will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems. Additionally, your skills in Python software development and AI implementation will enhance our capabilities in automating processes and improving system intelligence.
You will be responsible for driving results and implementing multiple complex programs that span the breadth of the business, and the business will look to you to provide the technical leadership to move them forward. And while you will be part of a tight-knit team that shares your passion for technology, you'll also gain access to the best minds in the business, both as part of JPMorgan Chase and through our partnerships with some of the most important tech firms in the world. In this environment, you'll take the lead on relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. As an SRE, you'll be focused on running better production applications and systems, with an emphasis on integrating AI capabilities to enhance system reliability and performance.
Job responsibilities:
- Design, code, test, and deliver software to automate manual operational work, with a focus on Python development and AI integration.
- Partner with application teams throughout the life cycle to understand their application infrastructure and monitoring, and apply Site Reliability principles to baseline and set up SLOs. Identify application patterns and data-driven approaches to defend and improve service level objectives.
- Develop automated software upgrades, change management, release management, and self-healing solutions, leveraging AI technologies where applicable.
- Support application deployment, change management, incident management, capacity upgrades, reporting, and system integration to ensure the availability of stable, well-performing platforms.
- Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents.
- Excellent communication and the ability to build and maintain relationships are essential.
- This role requires a quick learner who is detail-oriented and experienced in working efficiently and effectively in a globally distributed environment.
- Designs and conducts performance tests, identifies bottlenecks, opportunities for optimization, and capacity demand.
- Defines and drives adoption of a best-in-class monitoring framework to accomplish end-to-end flow monitoring and noiseless alerting.
- Facilitates maximum speed of delivery by objectively adhering to the service's error budgets.
- Coaches other team members and manages teams as needed.
- Executes small to medium projects independently with initial direction, eventually graduating to designing and delivering projects on your own.
- Leverages technology to solve business problems by writing high-quality, maintainable, and robust code following best practices in software engineering.
- Participates in triaging, examining, diagnosing, and resolving incidents, and works with others to solve problems at their root.
- Recognizes the toil within your role and proactively works toward eliminating it through either systems engineering or updating application code.
- Understands observability patterns and strives to implement and improve service level indicators and objectives, monitoring, and alerting solutions for optimal transparency and analysis.
Required qualifications, capabilities, and skills:
This role requires a wide variety of strengths and capabilities, including:
- BS/BA degree or equivalent experience.
- Advanced understanding of the application monitoring stack (Metrics, Events, Traces, Alerts, Logs) and ability to visualize and set up end-to-end observability (Infrastructure and Application components).
- Strong experience in using industry-standard monitoring tools (APM, Synthetic monitoring, Splunk/ELK, Prometheus, Grafana, etc.).
- Experience in deploying applications to cloud platforms (Kubernetes, Cloud Foundry preferred).
- Experience in using Continuous Integration and Continuous Deployment tools. Build and Release skills using Jenkins, Maven, Gradle, Groovy, Git, etc.
- Experience in working with automation tools like Ansible, Puppet, etc.
- Good SQL skills and experience with databases.
- Experience in Resiliency patterns (Scalability, infra-as-code), Self-healing, Chaos Engineering, Performance Monitoring, Continuous performance improvements.
- Strong understanding of Site Reliability Engineering concepts (SLOs, SLIs, and error budgets).
- Proficient in Python programming, with experience in AI implementation and scripting.
- Knowledge of Web Services - Apache, Tomcat, SOAP, REST.
- Experience working in an Agile methodology.
- Demonstrated ability to conceptualize, launch, and deliver multiple IT projects on time and within budget.
- Ability to articulate to more experienced management a technical strategy in clear, concise, understandable terms.
- Ability to code in at least one programming language.
- Experience maintaining a cloud-based infrastructure.
- Familiar with site reliability concepts, principles, and practices.
- Familiar with observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
- Familiarity with containers or a common server OS such as Linux or Windows.
- Emerging knowledge of software, applications, and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, Android, etc.).
- Emerging knowledge of continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform.
- Emerging knowledge of common networking technologies.
- Ability to work in a large, collaborative team and willingness to vocalize ideas with peers and managers.
- Understanding of how to prioritize and adjust work plans to adapt to changes in assigned responsibilities and projects.
- Eagerness to participate in learning opportunities to enhance one's effectiveness in executing day-to-day project activities.
- Ability to demonstrate and apply existing and new system processes, methodologies, and skills to contribute to the development of systems.
Preferred qualifications, capabilities, and skills:
- General knowledge of the financial services industry.
Top Skills
AI
Ansible
Apache
Cloud Foundry
ELK
Git
Gradle
Grafana
Groovy
Jenkins
Kubernetes
Maven
Prometheus
Puppet
Python
REST
SOAP
Splunk
SQL
Tomcat
JPMorganChase Singapore Office
One@Changi City, Changi Business Park Central 1, Singapore, 486036