
Doug Fuller, VP of Software Engineering at Cornelis Networks – Interview Series


As Vice President of Software Engineering, Doug is responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, and embedded device control systems. Before joining Cornelis Networks, Doug led software engineering teams at Red Hat in cloud storage and data services. Doug’s career in HPC and cloud computing began at Ames National Laboratory’s Scalable Computing Laboratory. Following several roles in university research computing, Doug joined the US Department of Energy’s Oak Ridge National Laboratory in 2009, where he developed and integrated new technologies at the world-class Oak Ridge Leadership Computing Facility.

Cornelis Networks is a technology leader delivering purpose-built high-performance fabrics for High Performance Computing (HPC), High Performance Data Analytics (HPDA), and Artificial Intelligence (AI) to leading industrial, scientific, academic, and government organizations.

What initially attracted you to computer science?

I just seemed to enjoy working with technology. I enjoyed working with computers growing up; we had a modem at our school that let me try out the Internet and I found it interesting. As a freshman in college, I met a USDOE computational scientist while volunteering for the National Science Bowl. He invited me to tour his HPC lab and I was hooked. I’ve been a supercomputer geek ever since.

You worked at Red Hat from 2015 to 2019, what were some of the projects you worked on and your key takeaways from this experience?

My major project at Red Hat was Ceph distributed storage. I’d previously focused entirely on HPC and this gave me an opportunity to work on technologies that were critical to cloud infrastructure. It rhymes. Most of the principles of scalability, manageability, and reliability are extremely similar even though they’re aimed at solving slightly different problems. In terms of technology, my most important takeaway was that cloud and HPC have a lot to learn from each other. We’re increasingly building different projects with the same Lego set. It’s really helped me understand how the enabling technologies, including fabrics, can come to bear on HPC, cloud, and AI applications alike. It’s also where I really came to understand the value of Open Source and to practice the Open Source, upstream-first software development philosophy that I brought with me to Cornelis Networks. Personally, Red Hat was where I really grew and matured as a leader.

You’re currently the Vice President of Software Engineering at Cornelis Networks, what are some of your responsibilities and what does your average day look like?

As Vice President of Software Engineering, I’m responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, fabric management, and embedded device control systems. Cornelis Networks is an exciting place to be, especially in this moment and this market. Because of that, I’m not sure I have an “average” day. Some days I’m working with my team to solve the latest technology challenge. Other days I’m interacting with our hardware architects to make sure that our next-generation products will deliver for our customers. I’m often in the field meeting with our amazing community of customers and collaborators, making sure we understand and anticipate their needs.

Cornelis Networks offers next-generation networking for High Performance Computing and AI applications, could you share some details on the hardware that is available?

Our hardware consists of a high-performance switched-fabric networking solution. To that end, we offer all the necessary devices to fully integrate HPC, cloud, and AI fabrics. The Omni-Path Host-Fabric Interface (HFI) is a low-profile PCIe card for endpoint devices. We also produce a 48-port 1U “top-of-rack” switch. For larger deployments, we make two fully-integrated “director-class” switches: one that packs 288 ports into 7U, and an 1152-port, 20U device.

Can you discuss the software that manages this infrastructure and how it’s designed to decrease latency?

First, our embedded management platform provides easy installation and configuration as well as access to a wide range of performance and configuration metrics produced by our switch ASICs.

Our driver software is developed as part of the Linux kernel. In fact, we submit all our software patches directly to the Linux kernel community. That ensures that all of our customers enjoy maximum compatibility across Linux distributions and easy integration with other software such as Lustre. While not in the latency path, having an in-tree driver dramatically reduces installation complexity.

The Omni-Path fabric manager (FM) configures and routes an Omni-Path fabric. By optimizing traffic routes and recovering quickly from faults, the FM provides industry-leading performance and reliability on fabrics from tens to thousands of nodes.

Omni-Path Express (OPX) is our high-performance messaging software, released in November 2022. It was specifically designed to reduce latency compared with our earlier messaging software. We ran cycle-accurate simulations of our send and receive code paths in order to minimize instruction count and cache utilization. This produced dramatic results: when you’re in the microsecond regime, every cycle counts!

We also integrated with the OpenFabrics Interfaces (OFI), an open standard produced by the OpenFabrics Alliance. OFI’s modular architecture helps minimize latency by allowing higher-level software, such as MPI, to leverage fabric features without additional function calls.

The entire network is also designed to increase scalability, could you share some details on how it is able to scale so well?

Scalability is at the core of Omni-Path’s design principles. At the lowest levels, we use Cray link-layer technology to correct link errors with no latency impact. This affects fabrics at all scales but is especially important for large-scale fabrics, which naturally experience more link errors. Our fabric manager is focused both on programming optimal routing tables and on doing so rapidly. This ensures that routing for even the largest fabrics can be completed in a minimal amount of time.

Scalability is also a critical component of OPX. Minimizing cache utilization improves scalability on individual nodes with large core counts. Minimizing latency also improves scalability by improving time to completion for collective algorithms. Using our host-fabric interface resources more efficiently enables each core to communicate with more remote peers. The strategic choice of libfabric allows us to leverage software features like scalable endpoints using standard interfaces.

Could you share some details on how AI is incorporated into some of the workflows at Cornelis Networks?

We’re not quite ready to talk externally about our internal uses of and plans for AI. That said, we do eat our own dog food, so we get to take advantage of the latency and scalability improvements we’ve made to Omni-Path to support AI workloads. It makes us all the more excited to share those benefits with our customers and partners. We have definitely observed that, as in traditional HPC, scaling out infrastructure is the only path forward, but the challenge is that network performance is easily stifled by Ethernet and other traditional networks.

What are some changes that you foresee in the industry with the advent of generative AI?

First off, the use of generative AI will make people more productive – no technology in history has made human beings obsolete. Every technology evolution and revolution we’ve had, from the cotton gin to the automated loom to the telephone, the Internet, and beyond, has made certain jobs more efficient, but we haven’t worked humanity out of existence.

Through the application of generative AI, I think companies will advance technologically at a faster rate because those running the company will have more free time to focus on those advancements. For example, if generative AI provides more accurate forecasting, reporting, planning, etc., companies can focus on innovation in their field of expertise.

I specifically feel that AI will make each of us a multidisciplinary expert. For instance, as a scalable software expert, I understand the connections between HPC, big data, cloud, and AI applications that drive them toward solutions like Omni-Path. Equipped with a generative AI assistant, I can delve deeper into the applications used by our customers. I have little doubt that this will help us design even more effective hardware and software for the markets and customers we serve.

I also foresee an overall improvement in software quality. AI can effectively function as “another set of eyes” to statically analyze code and develop insights into bugs and performance problems. This will be particularly interesting at large scales, where performance issues can be especially difficult to identify and costly to reproduce.

Finally, I hope and believe that generative AI will help our industry to train and onboard more software professionals without previous experience in AI and HPC. Our field can seem daunting to many and it can take time to learn to “think in parallel.” Fundamentally, just as machines made it easier to manufacture things, generative AI will make it easier to consider and reason about concepts.

Is there anything that you would like to share about your work or Cornelis Networks in general?

I’d like to encourage anyone with the interest to pursue a career in computing, especially in HPC and AI. In this field, we’re equipped with the most powerful computing resources ever built and we bring them to bear against humanity’s biggest challenges. It’s an exciting place to be, and I’ve enjoyed it every step of the way. Generative AI brings our field to even greater heights as the demand for computing capability increases drastically. I can’t wait to see where we go next.
