
Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
THE ROLE:
As an AI software development engineer, you will join teams building and optimizing deep learning applications and AI frameworks for AMD GPU compute platforms. You will work as part of an AMD development team and the open-source community to analyze, develop, test, and deploy improvements that make AMD the best platform for machine learning applications.
THE PERSON:
We are looking for an experienced network engineer with a deep understanding of AI/LLM workload requirements, particularly in large-scale GPU-based environments. The ideal candidate should have expertise in designing, optimizing, and troubleshooting high-performance network architectures, with a strong focus on supporting AI training and inference tasks. A collaborative problem-solver, you should bring experience working closely with both internal teams and customers to deliver optimized network solutions that ensure efficient and scalable AI workloads.
KEY RESPONSIBILITIES:
- Network Debugging & Troubleshooting: Resolve network performance issues in AI environments to ensure efficient GPU utilization and smooth data flow during large-scale training and inference.
- Customer Network Solutions: Collaborate with customers to design optimized network configurations for large-scale AI workloads, enhancing performance for training and inference.
- High-Speed Network Design: Develop and optimize network solutions to meet the demands of AI/LLM workloads in training, inference, and storage.
- AI Network System Support: Assist in the development and optimization of large-scale AI network platforms, focusing on fault localization and performance analysis for reliability.
- AI Networking Research & Innovation: Drive research and develop advanced communication frameworks and network protocols to ensure high performance and scalability.
- Low-Latency Interconnect Solutions: Develop ultra-low latency, high-speed interconnects to optimize communication and performance in large-scale AI environments.
- Collaborate on GPU Libraries & Frameworks: Work with internal teams and open-source maintainers to optimize AMD GPU performance in frameworks like TensorFlow and PyTorch for AI workloads.
PREFERRED EXPERIENCE:
- GPU Networking Expertise: Strong expertise in AMD ROCm and other AMD GPU communication technologies, plus familiarity with NVIDIA NVLink and GPUDirect, for optimizing GPU-based systems.
- Low-Latency Networking Protocols: In-depth knowledge of RDMA, InfiniBand, PCIe, and other low-latency protocols and interconnects to optimize high-performance AI workloads on AMD GPUs.
- Distributed Systems Networking: Hands-on experience with distributed systems networking, including proficiency in communication libraries like MPI, NCCL, and RCCL for AI/ML workloads.
- Performance Tuning & Benchmarking: Proven experience in performance tuning and benchmarking AI workloads to optimize throughput, latency, and bandwidth, particularly on AMD GPU systems.
- Cloud & Hybrid Environments: Experience with cloud and hybrid network environments, ensuring seamless integration of on-premise and cloud-based AMD GPU systems for AI workloads.
- Fault Tolerance & Scalability: Strong background in designing fault-tolerant, highly available, and scalable network solutions to ensure system reliability and performance for AI and high-performance computing environments.
- 5+ years of professional experience in technical software development within the network engineering industry, with a strong emphasis on network design, architecture, and optimization for high-performance systems.
ACADEMIC CREDENTIALS:
- Master's or higher in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.