About Bitdeer:
Bitdeer is a world-leading technology company for Bitcoin mining and AI cloud.
Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers. Apart from designing industry-leading ASIC chips and manufacturing mining rigs, the Group handles complex processes involved in computing across the value chain. This includes equipment procurement, transport logistics, datacenter design and construction, equipment management, and network and facility operations. Bitdeer also offers advanced cloud capabilities to customers with a high demand for artificial intelligence.
Headquartered in Singapore, Bitdeer operates globally with a diversified 3 GW energy portfolio, and deploys Bitcoin mining and HPC datacenters in the United States, Bhutan, Norway, Canada, Malaysia, and Ethiopia.
What you will be responsible for:
- Operate and optimize GPU clusters using Kubernetes, Slurm, and Ray across multiple regions.
- Implement elastic scheduling and unified orchestration for inference and training jobs (Kueue / NVIDIA KAI Scheduler / KEDA), including preemption and dynamic capacity arbitration between training and serving on the GPU resource pool.
- Manage and tune vLLM / SGLang runtimes for high-throughput, low-latency serving — including continuous batching, KV-cache paging, and prefill/decode disaggregation with RDMA / NIXL KV transfer.
- Optimize distributed scheduling for multi-replica, multi-tenant serving; own model hot-swapping and zero-downtime rollout paths.
- Benchmark and profile performance across workloads and model sizes (dense / MoE, 7B → 70B+, FP8 / AWQ / GPTQ).
- Tune distributed communication stacks — NCCL / RCCL, RDMA over RoCEv2, and InfiniBand.
- Build observability with Prometheus, Grafana, and Ray Dashboard to monitor GPU utilization, TTFT / ITL latency, and anomalies; integrate with the platform-wide OpenTelemetry + Grafana LGTM+ stack.
How you will stand out:
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field; PhD preferred for advanced R&D or innovation-oriented roles.
- 3-5+ years of experience in ML Infrastructure, HPC, or Systems Engineering.
- Hands-on experience with Kubernetes, Slurm, or Ray.
- Familiarity with vLLM, SGLang, or similar inference frameworks.
- Strong background in PyTorch / JAX, distributed systems, and communication stacks (NCCL / RCCL, RDMA).
- Proficiency in Python plus one of Go / C++ / Rust.
- Experience building observability with Prometheus and Grafana.
- Fluent in English; experience working in multinational or cross-cultural environments is a plus.
- Experience with major cloud platforms is strongly preferred, particularly in designing large-scale, production-grade architectures or cloud services.
What you will experience working with us:
- A culture that values authenticity and diversity of thought and background;
- An inclusive and respectful environment with open workspaces and an exciting start-up spirit;
- A fast-growing company with the chance to network with industry pioneers and enthusiasts;
- Ability to contribute directly and make an impact on the future of the digital asset industry;
- Involvement in new projects and in developing processes and systems;
- Personal accountability, autonomy, fast growth, and learning opportunities;
- Attractive welfare benefits and developmental opportunities such as training and mentoring.
--------------------------------------------------------------------
Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants on the basis of race, colour, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, or union membership.
Bitdeer Group Singapore, Singapore, SGP Office

