AI Infrastructure Engineer

Bitdeer Group

Reposted 5 Days Ago

In-Office

Singapore, SGP

Senior level

As an AI Infrastructure Engineer at Bitdeer, you'll optimize GPU clusters, manage inference jobs, tune runtimes, and build observability tools in a distributed environment.

The summary above was generated by AI

About Bitdeer:

Bitdeer is a world-leading technology company for Bitcoin mining and AI cloud.
Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers. Apart from designing industry-leading ASIC chips and manufacturing mining rigs, the Group handles complex processes involved in computing across the value chain. This includes equipment procurement, transport logistics, datacenter design and construction, equipment management, and network and facility operations. Bitdeer also offers advanced cloud capabilities to customers with a high demand for artificial intelligence.
Headquartered in Singapore, Bitdeer operates globally with a diversified 3 GW energy portfolio, and deploys Bitcoin mining and HPC datacenters in the United States, Bhutan, Norway, Canada, Malaysia, and Ethiopia.

What you will be responsible for:

Operate and optimize GPU clusters using Kubernetes, Slurm, and Ray across multiple regions.
Implement elastic scheduling and unified orchestration for inference and training jobs (Kueue / NVIDIA KAI Scheduler / KEDA), including preemption and dynamic capacity arbitration between training and serving on the GPU resource pool.
Manage and tune vLLM / SGLang runtimes for high-throughput, low-latency serving — including continuous batching, KV-cache paging, and prefill/decode disaggregation with RDMA / NIXL KV transfer.
Optimize distributed scheduling for multi-replica, multi-tenant serving; own model hot-swapping and zero-downtime rollout paths.
Benchmark and profile performance across workloads and model sizes (dense / MoE, 7B → 70B+, FP8 / AWQ / GPTQ).
Tune distributed communication stacks — NCCL / RCCL, RDMA over RoCEv2, and InfiniBand.
Build observability with Prometheus, Grafana, and Ray Dashboard to monitor GPU utilization, TTFT / ITL latency, and anomalies; integrate with the platform-wide OpenTelemetry + Grafana LGTM+ stack.

How you will stand out:

Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field; PhD preferred for advanced R&D or innovation-oriented roles.
3-5+ years in ML Infrastructure, HPC, or Systems Engineering.
Hands-on experience with Kubernetes, Slurm, or Ray.
Familiarity with vLLM, SGLang, or similar inference frameworks.
Strong background in PyTorch / JAX, distributed systems, and communication stacks (NCCL / RCCL, RDMA).
Proficiency in Python plus one of Go / C++ / Rust.
Experience building observability with Prometheus and Grafana.
Fluent in English; experience working in multinational or cross-cultural environments is a plus.
Experience with major cloud platforms is strongly preferred, particularly in designing large-scale, production-grade architectures or cloud services.

What you will experience working with us:

A culture that values authenticity and diversity of thought and background;
An inclusive and respectful environment with open workspaces and an exciting start-up spirit;
A fast-growing company with the chance to network with industry pioneers and enthusiasts;
Ability to contribute directly and make an impact on the future of the digital asset industry;
Involvement in new projects, developing processes/systems;
Personal accountability, autonomy, fast growth, and learning opportunities;
Attractive welfare benefits and developmental opportunities such as training and mentoring.

--------------------------------------------------------------------

Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants based on characteristics such as race, colour, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, and union membership.

Similar Jobs

Airwallex

Staff Software Engineer

2 Days Ago

In-Office

Singapore, SGP

Senior level

Artificial Intelligence • Fintech • Payments • Business Intelligence • Financial Services • Generative AI

Lead design and development of scalable data infrastructure and self-service tooling for data engineering teams. Provide technical leadership, implement robust systems (Kubernetes/Terraform/GitOps), ensure quality through testing and maintenance, troubleshoot production issues, and collaborate cross-functionally to deliver reliable data platforms.

Top Skills: Flink, GitOps, Kafka, Kubernetes, Spark, Terraform

Hyphen Connect Limited

Platform Engineer

9 Days Ago

In-Office

Singapore, SGP

Mid level

Agency • Artificial Intelligence • Blockchain • Web3

Design, deploy, and maintain MLOps and agentic AI infrastructure: manage model registries and continuous training loops, implement A/B testing, deploy agents as scalable microservices on Kubernetes, and build observability dashboards to monitor token usage, latency, and agent reasoning paths.

Top Skills: Docker, Kubernetes, LangSmith, MLflow, Terraform, Weights & Biases

Hyphen Connect Limited

LLM Pre-training & Distributed Engineer (AI Infrastructure)

9 Days Ago

In-Office

Singapore, SGP

Senior level

Agency • Artificial Intelligence • Blockchain • Web3

Design, run, and optimize large-scale LLM pre-training on 1,000+ GPUs. Manage distributed training with PyTorch/DeepSpeed/Megatron-LM, optimize networking and memory, and automate checkpointing and failure recovery for long runs.

Top Skills: 3D Parallelism, C++, CUDA, DeepSpeed, GPU Clusters, InfiniBand, Kubernetes, Megatron-LM, Python, PyTorch, RDMA, Slurm

What you need to know about the Singapore Tech Scene

The digital revolution has driven a constant demand for tech professionals across industries like software development, data analytics and cybersecurity. In Singapore, one of the largest cities in Southeast Asia, the demand for tech talent is so high that the government continues to invest millions into programs designed to develop a talent pipeline directly from universities while also scaling efforts in pre-employment training and mid-career upskilling to expand and elevate its workforce.
