Network Engineer
Company: OpenAI
Location: San Francisco
Posted on: April 1, 2025
Job Description:
As a Network Engineer focused on WAN and LAN, you will play a critical role in developing, managing, and optimizing the front-end network components of OpenAI's supercomputing infrastructure. Your expertise will ensure that our networks are fast, reliable, and scalable enough to meet the demands of training frontier AI models. This includes managing both local (LAN) and long-distance (WAN) connectivity across our data centers, optimizing performance, and ensuring seamless communication between compute nodes and clusters. The role also includes writing code to instrument and observe the network. Our team primarily uses Python and some Rust, so familiarity with or interest in working with this stack is essential.

This role is based in San Francisco, CA, with a hybrid work model of 3 days per week in the office. Relocation assistance is available.

In this role, you will:
- Design, manage, and optimize WAN and LAN infrastructure for OpenAI's supercomputers.
- Develop and maintain data collection and monitoring systems to ensure network visibility and performance.
- Troubleshoot and resolve network issues, such as TCP/IP, BGP, and physical-layer problems.
- Automate network issue detection and resolution to reduce operational overhead.
- Work closely with hardware and systems engineers to meet the performance demands of distributed AI training workloads.

You might thrive in this role if you:
- Have 5+ years of experience in networking or related infrastructure roles.
- Possess strong expertise in networking technologies, protocols, and design principles.
- Have hands-on experience with troubleshooting complex networking issues, including both LAN and WAN environments.
- Deeply understand how to set up TCP/IP networks from scratch (e.g., BGP, ECMP routing).
- Have a deep understanding of network protocols such as TCP/IP, BGP, and VLANs.
- Are familiar with optical connectors and optical circuit switches (OCS).
- Understand advanced concepts in routing, forwarding, and network management systems.
- Have experience with telemetry, traffic engineering, and congestion management to optimize network performance.
- Are skilled in collaborating across teams, combining technical expertise with excellent problem-solving and communication abilities.
- Exhibit ownership of problems end-to-end and maintain a commitment to continuous learning to effectively solve challenges.
- Are familiar with InfiniBand, RoCE, or RDMA in high-performance computing (HPC) or similar environments.

About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability, or any other legally protected status.

For US-based candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.