Yang Wang · Software Engineer II, Azure HPC, Microsoft

I build high-performance, production-grade infrastructure for large-scale AI and HPC systems.

My work spans resilient data center networking (MRC and SRv6), topology-aware interconnect benchmarking across NVLink, InfiniBand, and Ethernet, GPU and fabric observability, and agentic tooling that connects LLM agents to real infrastructure workflows. I like building systems that make cloud-scale infrastructure measurable, diagnosable, and resilient in production.

Before joining Microsoft, I received my M.S. in Electrical Engineering from Nagoya University and my B.S. in Electronic and Information Engineering from Xi'an University of Technology.

About

I am a systems engineer who likes working close to the real constraints of infrastructure: throughput, latency, scheduling, failure recovery, and operational clarity. I care about systems that are not just fast on paper but also measurable, diagnosable, and resilient in production.

At Microsoft Azure HPC, I work at the intersection of data center networking, high-performance computing, and LLM-driven automation. Recent work includes resilient transport for large training clusters (MRC and SRv6), the open-source SuperBench and Moneo projects for benchmarking and observability, and an agentic platform that pairs LLM agents with infrastructure evaluation. All of it is aimed at making failures easier to detect, isolate, and recover from at cluster scale.

Experience Highlights

Current

Azure HPC · Microsoft Canada

Based in Vancouver, working on networking benchmarks, deployment readiness, and infrastructure automation for AI/HPC clusters.

Networking

MRC and SRv6

Contributing to resilient AI supercomputer networking that brings production-grade failure tolerance to large training clusters.

Benchmark

SuperBench and Moneo

Open-source, topology-aware benchmarking and GPU/InfiniBand observability for cloud AI infrastructure. SuperBench received a Best Paper Award at USENIX ATC 2024.

Agents

Agentic Infrastructure

An agentic platform pairing LLM agents with benchmark selection, infrastructure evaluation, and operational automation.

News

  • June 2026: Our cross-company paper "The Multipath Reliable Connection (MRC) Transport" appeared on arXiv, presenting an open, production-grade transport for large-scale AI/ML training over best-effort Ethernet.
  • May 2026: Our cross-company paper "Resilient AI Supercomputer Networking using MRC and SRv6" appeared on arXiv, a collaboration across OpenAI, Microsoft, AMD, Broadcom, and NVIDIA.
  • Apr. 2026: An extended version of SuperBench appeared in ACM Transactions on Computer Systems.
  • Nov. 2024: Transferred from Microsoft China in Beijing to the Microsoft Canada Development Centre in Vancouver.
  • July 2024: SuperBench was published at USENIX ATC 2024 and received a Best Paper Award.
  • Oct. 2021: Joined Microsoft.
  • Sept. 2021: Completed my master’s degree at Nagoya University.

Research Interests

  • High-performance networking and resilient transport for large-scale AI and HPC clusters
  • Interconnect benchmarking, validation, and observability for AI infrastructure
  • LLM-driven agents and automation for infrastructure evaluation and operations

Selected Publications

See my full list of publications →