GPU Infrastructure & Networking Architect
Maharashtra, India · Full-time
- Experience: 10+ years
- Salary: not disclosed
- Openings: 1
- Posted: 8 hours ago
- Work mode: On-site
- Eligibility: Open to candidates with at least 10 years of relevant experience.
- CV: Application required
Job description
Role overview
This position focuses on low-level design and day-one implementation of a physical GPU compute environment and DPU-driven networking stack for a sovereign AI cloud platform. The architect will translate the high-level solution into deployable, tested configurations that follow NVIDIA reference guidance.
The scope covers the bare-metal GB300 NVL72 compute layer, NVIDIA BlueField-3 DPU deployment in DPF mode, Spectrum-X GPU TAN, Spectrum-3 supporting fabric, and internet egress using F5 and Netris. The role also includes validation of the build against NVIDIA’s dual-plane networking architecture.
Key responsibilities
- Create rack-level and node-level low-level designs for GB300 NVL72 bare-metal GPU nodes, including cabling, power distribution, and cooling layout.
- Design and document BlueField-3 DPU deployment in DPF mode for all relevant node types, and verify DOCA DPF Operator setup and lifecycle handling.
- Define the NVIDIA NICo zero-trust onboarding process for GB300 systems, including pre-boot attestation steps and BMC integration.
- Build IP plans for the GPU TAN on Spectrum-X, the auxiliary compute fabric on Spectrum-3, out-of-band management, and storage networks.
- Prepare day-zero runbooks for Spectrum-X spine-leaf deployment, rail-optimized RoCE configuration, and MTU/ECN tuning for GB300 workloads.
- Design distributed BGP egress with FRR DaemonSet and /32 EIP injection, including eBGP peering between the compute and spine autonomous systems.
- Own the F5 AWAF hardware egress setup as the only internet exit route, and define co-deployment of F5 BNK and DPF on BlueField-3 for north-south traffic.
- Design the OVN-Kubernetes overlay for ancillary nodes and DPF host-trusted integration, and validate SF/VF coexistence on BlueField-3.
- Prepare VXLAN segmentation maps, VRF isolation design, and scaling analysis for HBN VRF within a ZoneVPC DaemonSet model.
- Define RDMA/RoCE policy requirements and lossless Ethernet expectations for large-scale GB300 training workloads.
- Implement Netris-based cloud virtual functions and related networking constructs.
- Design storage connectivity for VAST NFS on GB300 nodes, including performance validation and NFS tuning.
- Design NetApp ONTAP block storage connectivity for ancillary KubeVirt virtual machines, including iSCSI and NVMe-oF path setup.
- Prepare storage low-level designs for StorageGRID object connectivity, zone affinity, and multi-tenant namespace isolation.
- Validate hardware bring-up against NVIDIA’s GB300 NVL72 dual-plane networking reference architecture.
- Work with NVIDIA field engineering on DPF-mode Prometheus port access and UFM 6.x metric naming compatibility.
- Develop test plans and acceptance criteria for the fabric, GPU TAN, and egress path, and support NVIDIA NCP validation reviews.
Experience and technical requirements
- At least 10 years of experience in data-centre infrastructure architecture, including 3+ years working on large-scale GPU or AI cluster deployments of 100+ nodes.
- Practical experience with NVIDIA BlueField DPUs, preferably BF-2 or BF-3, along with DOCA SDK, DPF Operator, and OVN-K integration.
- Strong command of BGP, including both eBGP and iBGP, plus RoCE/RDMA networking and lossless Ethernet design concepts such as PFC, ECN, and DCQCN.
- Solid working knowledge of NVIDIA Spectrum switching, UFM, SHARP, and rail-optimized topologies, or equivalent AI fabric environments.
- Hands-on experience with F5 BIG-IP hardware AWAF and BIG-IP Next for Kubernetes, including use of BIG-IP for Kubernetes ingress and egress.
- Deep Linux networking expertise across VXLAN, VRF, VLAN, OVN/OVS, kernel datapath, SR-IOV, and VF/SF configuration.
- Experience implementing and customizing Netris.
- Ability to use Python and/or Go for automation and infrastructure-as-code, plus Ansible and Terraform for day-zero provisioning.
- Familiarity with NVIDIA NICo zero-trust bare-metal enrollment and attestation is an added advantage.
- Exposure to NetApp ONTAP and VAST Data NFS platforms in high-performance compute environments is preferred.
- Knowledge of NVIDIA UFM 6.x and telemetry streaming for GPU fabric observability is beneficial.
- Prior work with NVIDIA Cloud Partner programs or sovereign AI cloud deployments is a plus.
- Understanding of Kubernetes CNI technologies such as OVN-Kubernetes and Cilium, especially in relation to DPU offload, is desirable.
Additional details
This is a full-time onsite role based in Maharashtra, India, with an additional location listed as Bangalore. There is one opening available. The role is to be filled immediately.
Applicants must have a minimum of 10 years of experience. Internship-specific fields, probation details, and salary/stipend amounts are not provided.