Network Architecture Lead
Location: Remote, U.S.-based (Bay Area strongly preferred)
Industry: AI Infrastructure | High-Growth Startup | Confidential Client
Travel: Minimal, mostly for integration testing or data center visits.
Work Authorization: U.S. citizens and green card holders only at this time.
About the Company
This high-impact role supports a confidential client that is building one of the world’s most advanced AI compute platforms, scaling from an initial 50 MW deployment to a projected 3 GW over the next few years. With more than $1.5B in funding secured and close partnerships with GPU hardware leaders and major AI labs, the company is redefining what’s possible in hyperscale infrastructure.
About the Role
We are hiring a Network Architecture Lead to build and scale next-gen network fabrics tailored for high-throughput, low-latency AI training workloads. This is a senior-level, hands-on architecture role. You’ll own the end-to-end network design for GPU training clusters at extreme scale, with performance and uptime requirements that exceed hyperscaler norms.
This isn’t an optimization or maintenance role - this is foundational infrastructure design. You’ll work with top minds in AI and systems architecture to design RoCEv2 and InfiniBand fabrics that support thousands of GPUs and cutting-edge model training.
Key Responsibilities
- Architect high-radix, scalable network topologies (Clos, Dragonfly, Fat-Tree, or custom designs) optimized for AI/ML workloads.
- Engineer and validate large-scale RoCEv2 and InfiniBand fabrics that support clusters of 50,000+ GPUs.
- Develop congestion control, traffic engineering, QoS, and tuning strategies for AI training throughput and consistency.
- Lead deployment efforts from turn-up to benchmarking and real-time troubleshooting.
- Collaborate with hardware vendors, data center teams, and internal engineering to meet aggressive scaling and performance goals.
- Integrate telemetry, observability, and automation (CI/CD pipelines, NetOps) into the network stack.
- Own the decision-making process for vendors, silicon, switching, and protocols.
- Mentor other network engineers and help define the long-term architecture vision.
Skills & Experience
Required:
- 10–20 years of experience in high-performance data center networking.
- Hands-on experience with RDMA networking (RoCEv2, InfiniBand) and lossless Ethernet mechanisms such as ECN, PFC, and DCB.
- Strong knowledge of AI, HPC, or hyperscale network design (Clos, Dragonfly, Fat-Tree).
- Proven ability to manage performance, telemetry, and congestion at multi-terabit scale.
- Deep background in debugging, validation, and real-time performance tuning.
- Bachelor’s degree or higher in Computer Engineering, Electrical Engineering, or Computer Science.
Preferred:
- Experience at hyperscale firms (e.g., AWS, Meta, Oracle Cloud, xAI, Tesla).
- Familiarity with NVIDIA networking solutions (e.g., ConnectX, Spectrum, NVLink).
- Exposure to large-scale CI/CD or NetOps environments.
- Deep interest in the intersection of AI infrastructure, systems design, and hardware enablement.
Who You Are
- A systems thinker who thrives in complexity and scale.
- Curious, driven, and not afraid to get into the weeds.
- Motivated by speed, impact, and building something foundational.
- Willing to trade predictability for purpose and opportunity.
- A mentor, leader, and collaborator by nature.
Why Apply?
- Impact: Your designs will power AI systems used globally - starting with OpenAI’s next-gen deployments.
- Ownership: High equity, fast-moving culture, flat hierarchy.
- Velocity: This company builds fast. Decisions are quick, and your work matters.
- Tech: Work with NVIDIA, leading cloud hardware partners, and some of the largest GPU clusters ever assembled.
- People: Join a founding team of industry veterans from top-tier AI labs and cloud platforms.
About Blue Signal:
Blue Signal is an award-winning executive search firm serving a wide range of specialties. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS