Yang Wang - AI Networking Engineer

Yang Wang · Software Engineer II, Azure HPC, Microsoft

I build high-performance, production-grade infrastructure for large-scale AI and HPC systems.

My work spans resilient data center networking (MRC and SRv6), topology-aware interconnect benchmarking across NVLink, InfiniBand, and Ethernet, GPU and fabric observability, and agentic tooling that connects LLM agents to real infrastructure workflows. I like building systems that make cloud-scale infrastructure measurable, diagnosable, and resilient in production.

Before joining Microsoft, I received my M.S. in Electrical Engineering from Nagoya University and my B.S. in Electronic and Information Engineering from Xi'an University of Technology.


About

I am a systems engineer who likes working close to the real constraints of infrastructure: throughput, latency, scheduling, failure recovery, and operational clarity. I care about systems that are not only fast on paper but also measurable, diagnosable, and resilient in production.

At Microsoft Azure HPC, I work at the intersection of data center networking, high-performance computing, and LLM-driven automation. Recent work includes resilient transport for large training clusters (MRC and SRv6), the open-source SuperBench and Moneo projects for benchmarking and observability, and an agentic platform that pairs LLM agents with infrastructure evaluation. All of it aims to make failures easier to detect, isolate, and recover from at cluster scale.

Experience Highlights

Current

Azure HPC · Microsoft Canada

Based in Vancouver, working on networking benchmarks, deployment readiness, and infrastructure automation for AI/HPC clusters.

Networking

MRC and SRv6

Contributing to resilient AI supercomputer networking with production-grade failure tolerance for large training clusters.

Benchmark

SuperBench and Moneo

Open-source, topology-aware benchmarking and GPU/InfiniBand observability for cloud AI infrastructure (SuperBench: USENIX ATC 2024 Best Paper).

Agents

Agentic Infrastructure

An agentic platform pairing LLM agents with benchmark selection, infrastructure evaluation, and operational automation.

News

June 2026: Our cross-company paper "The Multipath Reliable Connection (MRC) Transport" appeared on arXiv, presenting an open, production-grade transport for large-scale AI/ML training over best-effort Ethernet.
May 2026: Our cross-company paper "Resilient AI Supercomputer Networking using MRC and SRv6" appeared on arXiv, with collaboration across OpenAI, Microsoft, AMD, Broadcom, and NVIDIA.
Apr. 2026: An extended version of SuperBench appeared in ACM Transactions on Computer Systems.
Nov. 2024: Transferred from Microsoft China in Beijing to Microsoft Canada Development Centre in Vancouver.
July 2024: SuperBench was published at USENIX ATC 2024 and received a Best Paper Award.
Oct. 2021: Joined Microsoft.
Sept. 2021: Completed my master’s degree at Nagoya University.

Research Interests

High-performance networking and resilient transport for large-scale AI and HPC clusters
Interconnect benchmarking, validation, and observability for AI infrastructure
LLM-driven agents and automation for infrastructure evaluation and operations

Selected Publications

The Multipath Reliable Connection (MRC) Transport

Rip Sohan, Eric Spada, Eric Davis, Mark Handley, Idan Burstein, Tony Hurson, Jithin Jose, Vivek Kashyap, Rong Pan, Sayantan Sur, Sreevatsa Anantharamu, Aviv Barnea, Adrian Caulfield, Elazar Cohen, Elliot Edmunds, Yamin Friedman, Mahdieh Ghazi, Murali Guramali, Torsten Hoefler, Vipin Jain, Abdul Kabbani, Noam Katz, Yanfang Le, Charlie Mbariky, Guglielmo Morandin, Masoud Moshref, Shane O'Neil, Michael Papamichael, Jonas Pfefferle, Siva Santosh Pyla, Costin Raiciu, David Riddoch, Karen Schramm, Yuval Shpigelman, Shahaf Shuler, Shy Shyman, Raghava Sivaramu, Amin Tootoonchian, Yang Wang

MRC · arXiv 2026

Resilient AI Supercomputer Networking using MRC and SRv6

Authors are grouped by company; within each company, names are listed alphabetically.

OpenAI: Joao Araujo, Alex Chow, Mark Handley, Ryder Lewis, Christoph Paasch, Jitendra Padhye, Michael Papamichael, Greg Steinbrecher, Amin Tootoonchian, Lihua Yuan

Microsoft: S. Anantharamu, Abhishek Dosi, Mohit Garg, Mahdieh Ghazi, Torsten Hoefler, Deepal Jayasinghe, Jithin Jose, Abdul Kabbani, Guohan Lu, Yang Wang

AMD: K. Doddapaneni, Murali Garimella, Vipin Jain, Yanfang Le, H. Nagulapalli, S. Narayanan, Rong Pan, Rathina Sabesan, Raghava Sivaramu, Rip Sohan

Broadcom: Eric Davis, Dragos Dumitrescu, Mohan Kalkunte, Bhaswar Mitra, Guglielmo Morandin, Adrian Popa, Costin Raiciu, Eric Spada, John Spillane, Niranjan Vaidya

NVIDIA: Aviv Barnea, Idan Burstein, Elazar Cohen, Yamin Friedman, Noam Katz, Masoud Moshref, Yuval Shpigelman, Shahaf Shuler, Shy Shyman, Sayantan Sur

MRC · arXiv 2026

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation ⭐ Best Paper Award

Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, Lidong Zhou

ATC · USENIX ATC 2024 / TOCS 2026

Resource Assignment Based on Core-State Value Evaluation to Handle Crosstalk and Spectrum Fragments in SDM Elastic Optical Networks ⭐ Best Student Paper Award

Yang Wang, Yojiro Mori, Hiroshi Hasegawa

OECC · Opto-Electronics and Communications Conference, 2020

See my full list of publications →