AWS announced significant enhancements to its networking infrastructure, including the deployment of a new network fabric designed to meet the demanding needs of AI-driven workloads. The centerpiece is the 10p10u network fabric, named for the tens of petabits of network capacity and sub-10-microsecond latency it delivers, both essential for powering large-scale AI training clusters.
Scaling for AI: A New Era of Networking
The key, said Peter DeSantis, Senior Vice President of AWS Utility Computing, is scaling cloud infrastructure to handle modern AI applications. "A great AI network shares many similarities with a great cloud network, but with much higher demands," he said. "If this were a Vegas fight, it wouldn't even be a close fight."
The 10p10u network is specifically designed to handle the vast bandwidth and low-latency requirements of AI models. It supports AWS's UltraServer technology, which runs high-performance AI workloads using the new Trainium2 chips. Because every server in such a system communicates with every other server at the same time, a robust and efficient network is crucial to avoid bottlenecks.
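To see why all-to-all communication puts so much pressure on the network, it helps to count how many distinct server pairs exist as a cluster grows. The short Python sketch below does only that arithmetic; the cluster sizes are illustrative assumptions, not figures published by AWS.

```python
def all_to_all_pairs(num_servers: int) -> int:
    """Count the distinct server pairs when every server talks to every other."""
    if num_servers < 0:
        raise ValueError("num_servers must be non-negative")
    return num_servers * (num_servers - 1) // 2


if __name__ == "__main__":
    # Illustrative cluster sizes only (assumed, not AWS figures).
    for n in (64, 1_000, 10_000):
        print(f"{n:>6} servers -> {all_to_all_pairs(n):>12,} communicating pairs")
```

The pair count grows roughly with the square of the server count, which is why a fabric that is adequate for a few racks can become a bottleneck at cluster scale.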
Key Features of the 10p10u Network
The 10p10u fabric offers massive scalability and low-latency connectivity, scaling from a couple of racks to huge clusters within or across data center campuses. That matters for hosting demanding AI jobs, which require continuous, high-speed communication among thousands of servers.
To support the fabric, AWS has also introduced several new technologies:
Trunk Connectors: AWS created a proprietary connector that groups 16 fiber optic cables into a single unit, which greatly simplifies installation and reduces connection errors. The connectors have cut AI rack installation time by 54%, reduced clutter, and made maintenance much easier.
Firefly Optical Plugs: These plug-in modules act as miniature signal reflectors, allowing network connections to be pre-tested and validated before the equipment is even installed at a data center. This lets AWS avoid installation delays and prevent performance issues caused by problems such as dust.
SIDR Protocol for Advanced Routing
Managing the scale of the 10p10u network requires advanced routing technology. AWS introduced the Scalable Intent Driven Routing (SIDR) protocol, which is designed to make the network respond much faster to failures. When a failure occurs, SIDR allows the network to adapt in less than one second, ten times faster than traditional routing methods.
SIDR combines centralized planning with decentralized execution: the network makes autonomous decisions about how to handle issues instantly, so operations continue efficiently and without interruption.
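The general idea of centralized planning with decentralized execution can be illustrated with a toy model. The Python sketch below is not AWS's SIDR implementation, whose internals are not described here; the names `plan_intent` and `Switch`, the four-node topology, and the shortest-path ranking rule are all hypothetical choices made for illustration. A central planner computes, ahead of time, which neighbors each switch may use to reach a destination, and each switch then fails over locally without consulting the planner.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def hop_distances(topology: dict, destination: str) -> dict:
    """Breadth-first search: hop count from every reachable node to the destination."""
    dist = {destination: 0}
    queue = deque([destination])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def plan_intent(topology: dict, destination: str) -> dict:
    """Central step (hypothetical): per switch, list neighbors strictly closer to the destination."""
    dist = hop_distances(topology, destination)
    return {
        node: sorted(n for n in neighbors if dist[n] < dist[node])
        for node, neighbors in topology.items()
        if node in dist and node != destination
    }


@dataclass
class Switch:
    """Local step (hypothetical): a switch holding its pre-planned next hops."""
    name: str
    next_hops: list
    failed: set = field(default_factory=set)

    def mark_link_down(self, neighbor: str) -> None:
        self.failed.add(neighbor)

    def forward(self) -> Optional[str]:
        """Pick the first planned next hop whose link is still up."""
        for hop in self.next_hops:
            if hop not in self.failed:
                return hop
        return None


if __name__ == "__main__":
    # Assumed toy topology: two leaf switches connected through two spines.
    topology = {
        "leaf1": ["spine1", "spine2"],
        "leaf2": ["spine1", "spine2"],
        "spine1": ["leaf1", "leaf2"],
        "spine2": ["leaf1", "leaf2"],
    }
    intent = plan_intent(topology, "leaf2")
    leaf1 = Switch("leaf1", intent["leaf1"])
    print("before failure:", leaf1.forward())
    leaf1.mark_link_down("spine1")  # detected locally, no round trip to the planner
    print("after failure: ", leaf1.forward())
```

Because the alternatives are computed in advance, the failover itself is a local lookup, which is one plausible reason a design of this kind can react faster than a protocol that must reconverge across the whole network.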
NeuronLink: Changing the Way Servers Connect
AWS's commitment to high-bandwidth, low-latency connectivity is further reflected in NeuronLink, an interconnect technology that links multiple Trainium2 servers into a single logical unit. With NeuronLink, servers can directly access the memory of other servers, with two terabytes per second of bandwidth and just one microsecond of latency.
With the added power of the Trainium2 chips, this capability is said to give five times the compute capacity and ten times the memory of existing EC2 AI servers, making it a key component in powering the next generation of AI applications.
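The NeuronLink figures quoted above can be put in perspective with a simple first-order model: transfer time is roughly the fixed latency plus the payload size divided by the bandwidth. The Python sketch below applies that model. It assumes decimal terabytes (2e12 bytes per second) and ignores protocol overhead, contention, and topology, so it is a rough estimate rather than a performance claim.

```python
# Figures from the article, interpreted with decimal units (an assumption).
NEURONLINK_BANDWIDTH_BYTES_PER_S = 2e12  # two terabytes per second
NEURONLINK_LATENCY_S = 1e-6              # one microsecond


def transfer_time_s(
    payload_bytes: float,
    bandwidth: float = NEURONLINK_BANDWIDTH_BYTES_PER_S,
    latency: float = NEURONLINK_LATENCY_S,
) -> float:
    """First-order estimate: fixed latency plus serialization time."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    return latency + payload_bytes / bandwidth


if __name__ == "__main__":
    # Illustrative payload sizes only.
    for label, size in (("4 KiB", 4 * 1024), ("1 MiB", 2**20), ("1 GiB", 2**30)):
        micros = transfer_time_s(size) * 1e6
        print(f"{label:>6}: {micros:12.3f} microseconds")
```

Under this model, small transfers are dominated by the one-microsecond latency and large transfers by bandwidth, which is why both numbers matter for workloads that mix frequent small synchronizations with bulk data movement.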
Positioning AWS for the Future
These network advancements position AWS as a leader in AI infrastructure, allowing its cloud services to keep pace with the growing demands of next-generation AI models. With support for massive AI training clusters, the ability to scale large workloads, and ultra-fast networking, AWS aims to deliver on the most ambitious AI projects of the future.