Yang Wang · Software Engineer II, Azure HPC, Microsoft

I build high-performance, production-grade infrastructure for large-scale AI and HPC systems.

My work focuses on resilient data center networking, interconnect benchmarking, infrastructure observability, and agentic systems for large-scale AI and HPC clusters. I enjoy building production-oriented systems that make cloud-scale infrastructure measurable, diagnosable, resilient, and easier to operate.

Before joining Microsoft, I received my M.S. in Electrical Engineering from Nagoya University and my B.S. in Electronic and Information Engineering from Xi'an University of Technology.

About

I am a systems engineer who likes working close to the real constraints of infrastructure: throughput, latency, scheduling, failure recovery, deployment complexity, and operational clarity. I care about systems that are not only fast on paper, but also measurable, diagnosable, and resilient in production.

Currently, my work centers on AI infrastructure at Microsoft. I am especially interested in the intersection of data center networking, high-performance computing, distributed systems, and LLM-driven infrastructure automation.

A major theme of my recent work is building reliable infrastructure for large AI clusters, including networking mechanisms, benchmarking algorithms, observability systems, and agentic tools that make failures easier to detect, isolate, and recover from in production environments.

Outside of work, I keep an active interest in machine learning systems, networking, and the broader tooling ecosystem around modern infrastructure.

Experience Highlights

Current

Azure HPC · Microsoft Canada

Based in Vancouver, working on networking benchmarks, deployment readiness, and infrastructure automation for AI/HPC clusters.

Networking

MRC and SRv6

Contributing to resilient AI supercomputer networking with production-grade failure tolerance for large training clusters.

Benchmark

Benchmarking and Observability

Building practical validation and observability systems for cloud-scale AI infrastructure.

Agents

Agentic Infrastructure

Exploring LLM-driven agents and automation for infrastructure evaluation and operations.

News

  • May 2026: Our cross-company paper, "Resilient AI Supercomputer Networking using MRC and SRv6," appeared on arXiv. It is a collaboration across OpenAI, Microsoft, AMD, Broadcom, and NVIDIA.
  • Apr. 2026: An extended version of SuperBench was published in ACM Transactions on Computer Systems.
  • Nov. 2024: Transferred from Microsoft China in Beijing to the Microsoft Canada Development Centre in Vancouver.
  • July 2024: SuperBench was published at USENIX ATC 2024 and received a Best Paper Award.
  • Oct. 2021: Joined Microsoft.
  • Sept. 2021: Completed my master’s degree at Nagoya University.

Research Interests

  • High-performance networking and resilient systems for large-scale AI and HPC clusters
  • AI infrastructure validation, interconnect benchmarking, and observability systems

Selected Publications

See my full list of publications →