Yang Wang · Software Engineer II, Azure HPC, Microsoft
I build high-performance, production-grade infrastructure for large-scale AI and HPC systems.
My work spans resilient data center networking (MRC and SRv6), topology-aware interconnect benchmarking across NVLink, InfiniBand, and Ethernet, GPU and fabric observability, and agentic tooling that connects LLM agents to real infrastructure workflows. I like building systems that make cloud-scale infrastructure measurable, diagnosable, and resilient in production.
Before joining Microsoft, I received my M.S. in Electrical Engineering from Nagoya University and my B.S. in Electronic and Information Engineering from Xi'an University of Technology.
About
I am a systems engineer who likes working close to the real constraints of infrastructure: throughput, latency, scheduling, failure recovery, and operational clarity. I care about systems that are not just fast on paper but also measurable, diagnosable, and resilient in production.
At Microsoft Azure HPC, I work at the intersection of data center networking, high-performance computing, and LLM-driven automation. Recent work includes resilient transport for large training clusters (MRC and SRv6), the open-source SuperBench and Moneo projects for benchmarking and observability, and an agentic platform that pairs LLM agents with infrastructure evaluation. All of it aims to make failures easier to detect, isolate, and recover from at cluster scale.
Experience Highlights
Azure HPC · Microsoft Canada
Based in Vancouver, working on networking benchmarks, deployment readiness, and infrastructure automation for AI/HPC clusters.
MRC and SRv6
Contributing to resilient AI supercomputer networking for large training clusters, with a focus on production-grade failure tolerance.
SuperBench and Moneo
Open-source, topology-aware benchmarking and GPU/InfiniBand observability for cloud AI infrastructure (SuperBench received a Best Paper Award at USENIX ATC 2024).
Agentic Infrastructure
An agentic platform pairing LLM agents with benchmark selection, infrastructure evaluation, and operational automation.
News
- June 2026: Our cross-company paper The Multipath Reliable Connection (MRC) Transport appeared on arXiv, presenting an open, production-grade transport for large-scale AI/ML training over best-effort Ethernet.
- May 2026: Our cross-company paper Resilient AI Supercomputer Networking using MRC and SRv6 appeared on arXiv, with collaboration across OpenAI, Microsoft, AMD, Broadcom, and NVIDIA.
- Apr. 2026: An extended version of SuperBench was published in ACM Transactions on Computer Systems.
- Nov. 2024: Transferred from Microsoft China in Beijing to the Microsoft Canada Development Centre in Vancouver.
- July 2024: SuperBench was published at USENIX ATC 2024 and received a Best Paper Award.
- Oct. 2021: Joined Microsoft.
- Sept. 2021: Completed my master’s degree at Nagoya University.
Research Interests
- High-performance networking and resilient transport for large-scale AI and HPC clusters
- Interconnect benchmarking, validation, and observability for AI infrastructure
- LLM-driven agents and automation for infrastructure evaluation and operations
Selected Publications
The Multipath Reliable Connection (MRC) Transport
Resilient AI Supercomputer Networking using MRC and SRv6
Authors are grouped by company; within each company, names are listed alphabetically.
