Yang Wang · Software Engineer II, Azure HPC Team, Microsoft
I build high-performance, high-reliability network infrastructure for large-scale AI systems.
My work focuses on resilient data center networking, interconnect benchmarking, telemetry, and agentic infrastructure systems for large-scale AI and HPC clusters. I enjoy building practical systems that make cloud-scale infrastructure more measurable, reliable, and easier to operate.
Before joining Microsoft, I received my M.S. in Electrical Engineering from Nagoya University and my B.S. in Electronic and Information Engineering from Xi'an University of Technology.
About
I am a systems engineer who likes working close to the real constraints of infrastructure: throughput, latency, scheduling, failure recovery, deployment complexity, and operational clarity. I care about systems that are not only fast on paper, but also measurable, diagnosable, and resilient in production.
Currently, my work centers on AI infrastructure at Microsoft. I am especially interested in the intersection of data center networking, high-performance computing, distributed systems, and LLM-driven infrastructure automation.
A major theme of my recent work is building reliable infrastructure for large AI clusters, including networking mechanisms, benchmarking algorithms, telemetry systems, and agentic tools that make failures easier to detect, isolate, and recover from in production environments.
Outside of work, I keep an active interest in machine learning systems, networking, and the broader tooling ecosystem around modern infrastructure.
Experience Highlights
Azure HPC · Microsoft Canada
Based in Vancouver, working on networking benchmarks, deployment readiness, and infrastructure automation for AI/HPC clusters.
MRC and SRv6
Contributing to resilient AI supercomputer networking that gives large training clusters production-grade failure tolerance.
Benchmarking and Telemetry
Building practical benchmarking and observability systems for cloud-scale AI infrastructure.
Agentic Infrastructure
Exploring LLM-driven agents and automation for infrastructure evaluation and operations.
News
- May 2026: Our paper Resilient AI Supercomputer Networking using MRC and SRv6, a collaboration across OpenAI, Microsoft, AMD, Broadcom, and NVIDIA, appeared on arXiv.
- Apr. 2026: An extended version of SuperBench was published in ACM Transactions on Computer Systems.
- Nov. 2024: Transferred from Microsoft China in Beijing to Microsoft Canada Development Centre in Vancouver.
- July 2024: SuperBench was published at USENIX ATC and received a Best Paper Award.
- Oct. 2021: Joined Microsoft.
- Sept. 2021: Completed my master’s degree at Nagoya University.
Research Interests
- High-performance, high-reliability network infrastructure
- AI infrastructure benchmarking and observability
- Resilient networking with MRC and SRv6
Selected Publications
Resilient AI Supercomputer Networking using MRC and SRv6
Authors are grouped by company; within each company, names are listed alphabetically.
