Experience

Building high-performance, high-reliability network infrastructure for AI at cloud scale.

I work on networking benchmarks, telemetry, resilient transport, and agentic infrastructure systems for large-scale AI and HPC clusters. My work spans production engineering, research collaboration, and practical systems design for cloud-scale infrastructure.

Software Engineer II · Azure HPC Team

Microsoft · Vancouver, Canada / Beijing, China

MRC and Resilient AI Supercomputer Networking

Network Benchmarking for AI/HPC Infrastructure

  • Work on networking benchmarks and deployment readiness for high-performance AI clusters.
  • Focus on topology-aware benchmarking across NVLink, multi-node NVLink, InfiniBand, and Ethernet fabrics.
  • Develop practical reliability and performance signals for large-scale cluster buildout and production readiness.

Agentic Platform for Infrastructure Workflows

  • Explore an agentic platform for AI infrastructure workflows that connects LLM-driven agents with benchmark selection, infrastructure evaluation, and operational automation.
  • Build reliable interfaces between model capabilities and infrastructure engineering tasks.

Software Engineer · Azure HPC Team

Microsoft China · Beijing, China

SuperBench: Benchmarking and Topology-Aware Evaluation

  • Worked on the open-source SuperBench benchmarking framework for cloud AI infrastructure.
  • Focused on making benchmarking scalable, topology-aware, and useful for production readiness; the related paper received a Best Paper award at USENIX ATC 2024.

Moneo: Telemetry and Performance Observability

  • Worked on the open-source Moneo telemetry stack for GPU, InfiniBand, and custom performance signals.
  • Helped turn low-level system metrics into actionable signals for anomaly detection and infrastructure optimization.

Education

Nagoya University

Nagoya, Japan

M.Eng. in Electrical Engineering · GPA 3.91/4.0

Oct. 2019 – Sept. 2021

Supervisor: Prof. Hiroshi Hasegawa

Thesis: Resource Allocation in Elastic Optical Networks via Reinforcement Learning

Skills

Python · C/C++ · Rust · Golang · Shell · Torch · Slurm · InfiniBand · MPI · NCCL · Megatron-LM · Azure OpenAI · LLM Agents