By Mylaraiah JN, Vice President, Enterprise Business, India & SAARC, CommScope
When popular science fiction depicts the “rise of machine intelligence,” it usually comes with lasers, explosions or, in some of the gentler examples, a mild philosophical dread. But there can be no doubt that interest in the possibilities of artificial intelligence (AI) and machine learning (ML) in real-life applications is on the rise, and new applications are popping up daily.
Millions of users globally are already engaging with AI using ChatGPT, Bard and other AI interfaces. In India, 75% of desk workers are using AI tools to drive productivity. But most of these users don’t realize that their cozy desktop exchanges with a curious AI assistant are actually driven by massive data centers all over the world.
Enterprises are investing in AI clusters within their data centers, building, training and refining their AI models to suit their business strategies. These AI cores consist of racks upon racks of GPUs (graphics processing units) that provide the incredible parallel processing power AI models require for the exhaustive training of their algorithms.
Once the data sets are imported, inference AI analyzes that data and makes sense of it. This is the process that determines whether an image contains a cat or a small dog, based on training that taught the model which characteristics are common to cats but not to dogs. Generative AI can then process that data to create entirely new images or text.
It’s this “intelligent” processing that has captured the imaginations of people, governments and enterprises everywhere—but creating a useful AI algorithm requires vast amounts of data for training purposes, and this is an expensive and power-intensive process.
Efficient training is where it begins
Data centers generally maintain discrete AI and compute clusters, which work together to process the data that trains the AI algorithm. The amount of heat generated by these power-hungry GPUs limits how many can be housed in a given rack space, so optimizing the physical layout is a must in order to manage heat and minimize link latency.
AI clusters require a new data center architecture. GPU servers need far more connectivity between servers, yet fewer servers fit in each rack due to power and heat constraints. The result is more inter-rack cabling than in a traditional data center, with links running at 100G to 400G over distances that copper cannot support.
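To make the cabling implications concrete, here is a minimal back-of-the-envelope sketch. Every figure in it (GPUs per server, servers per rack, uplinks per server) is an illustrative assumption, not a vendor specification.

```python
# Back-of-the-envelope estimate of inter-rack fiber links: an AI rack
# vs. a traditional compute rack. All figures are illustrative
# assumptions, not vendor specifications.

GPUS_PER_SERVER = 8          # typical GPU server (assumption)
AI_SERVERS_PER_RACK = 4      # limited by power and heat (assumption)
TRAD_SERVERS_PER_RACK = 20   # conventional 1U/2U servers (assumption)
TRAD_UPLINKS_PER_SERVER = 2  # shared 25/100G uplinks (assumption)

# In many AI fabrics, each GPU gets its own 400G link to the switch fabric.
ai_links = AI_SERVERS_PER_RACK * GPUS_PER_SERVER
trad_links = TRAD_SERVERS_PER_RACK * TRAD_UPLINKS_PER_SERVER

print(f"AI rack:          {ai_links} x 400G GPU fabric links")
print(f"Traditional rack: {trad_links} x 25/100G server uplinks")
# Despite holding five times fewer servers, the AI rack needs nearly as
# many links, each at 4 to 16 times the data rate and beyond copper reach.
```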
It’s generally held that, when training a large-scale AI model, about 30 percent of the required time is consumed by network latency and the remaining 70 percent by compute time. Since training a large model can cost up to $10 million, this networking time represents a significant cost. Even a latency saving of 50 nanoseconds, roughly the time light takes to traverse 10 meters of fiber, is significant, and nearly all the links in AI clusters are limited to 100-meter reaches.
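The 50-nanoseconds-per-10-meters figure follows directly from the speed of light in glass. Here is a quick sanity check in Python, assuming a typical group index of about 1.47 for silica fiber:

```python
# Sanity check on "50 nanoseconds is about 10 meters of fiber".
# Light in silica fiber travels at c divided by the group index (~1.47).

C = 299_792_458      # speed of light in vacuum, m/s
GROUP_INDEX = 1.47   # typical for silica fiber (assumption)

ns_per_meter = GROUP_INDEX / C * 1e9
print(f"Propagation delay: {ns_per_meter:.2f} ns per meter")  # ~4.90 ns/m
print(f"10 m of fiber:     {10 * ns_per_meter:.0f} ns")       # ~49 ns
```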
Trimming meters, nanoseconds and watts
Operators should carefully consider which optical transceivers and fiber cables they will use in their AI clusters to minimize cost and power consumption.
Some important points to consider:
- Take advantage of transceivers with parallel fiber to avoid the optical multiplexers and demultiplexers required for wavelength division multiplexing
- The transceiver cost savings more than offset the small increase in cost for a multifiber cable instead of a duplex fiber cable
- Links up to 100 meters can be supported by both singlemode and multimode fiber. While multimode fiber costs slightly more than singlemode fiber, the difference between the two multifiber cables is small, since cable costs are dominated by MPO connectors
- In addition, high-speed multimode transceivers use one to two watts less power than their singlemode counterparts. This may seem small; however, across the thousands of links in an AI cluster, any opportunity to save power can deliver significant savings during training and operation, as the sketch after this list illustrates
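Here is the rough math on that power saving. The link count and electricity price below are illustrative assumptions; the one-to-two-watt delta comes from the list above.

```python
# Rough scale of a 1-2 W per-transceiver saving across an AI cluster.
# The link count and electricity price are illustrative assumptions;
# the wattage delta comes from the list above.

TRANSCEIVERS = 8_000    # optical links in a mid-sized cluster (assumption)
WATTS_SAVED = 1.5       # multimode vs. singlemode optics, per the text
HOURS_PER_YEAR = 8_760
USD_PER_KWH = 0.10      # illustrative energy price

kwh_per_year = TRANSCEIVERS * WATTS_SAVED * HOURS_PER_YEAR / 1_000
print(f"Energy saved: {kwh_per_year:,.0f} kWh per year")   # ~105,000 kWh
print(f"Cost saved:   ${kwh_per_year * USD_PER_KWH:,.0f} per year")
# Every watt not burned in optics is also a watt less to cool.
```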
Transceivers vs. active optical cables
Many AI/ML clusters use active optical cables (AOCs), fiber cables with integrated optical transmitters and receivers on either end, to interconnect GPUs and switches. The transmitters and receivers in an AOC may be the same as those in analogous transceivers, but they are typically the castoffs: parts that don't meet the full specifications required of standalone transceivers.
That's because AOC transmitters and receivers need only operate with the specific unit attached to the other end of the cable. On the plus side, since no optical connectors are accessible to the installer, the skills required to clean and inspect fiber connectors are not needed. On the other hand, installing AOCs can be a time-consuming and delicate operation: the cable must be routed with the transceivers already attached, and correctly installing AOCs with breakouts is especially challenging.
Overall, the failure rate for AOCs is double that of equivalent transceivers. When an AOC fails, or when it's time to upgrade the network links, a new AOC must be routed through the network, which takes away from compute time. With transceivers, the fiber cabling is part of the infrastructure and may remain in place for several generations of data rates.
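A simple expected-failures comparison shows why this matters operationally. The link count and baseline failure rate here are assumptions for illustration; the two-times multiplier for AOCs is from the paragraph above.

```python
# Expected annual link failures: transceivers vs. AOCs. The link count
# and baseline failure rate are assumptions for illustration; the 2x
# multiplier for AOCs is from the paragraph above.

LINKS = 8_000            # optical links in the cluster (assumption)
TRANSCEIVER_AFR = 0.005  # 0.5% annual failure rate (assumption)
AOC_AFR = 2 * TRANSCEIVER_AFR

print(f"Transceiver failures per year: {LINKS * TRANSCEIVER_AFR:.0f}")  # 40
print(f"AOC failures per year:         {LINKS * AOC_AFR:.0f}")          # 80
# Each AOC failure means re-routing an entire cable through the cluster;
# a failed transceiver is simply swapped while the fiber stays in place.
```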
The Age of AI and ML in Data Centers
AI/ML has arrived, and it's only going to become a more important and integrated part of the way people, enterprises and devices interact with one another. According to a Salesforce report, about 95% of Indian IT leaders believe that generative AI models will soon have a prominent role in their organizations, an indication of the growing demand.
While interfacing with an AI service can literally happen in the palm of your hand, it still depends on large-scale data center infrastructure and all the power that drives it. The enterprises that train AI quickly and efficiently will have an important leg up in our fast-changing, super-connected world. Careful consideration of the cabling of AI clusters will help save cost, power and installation time. The right fiber cabling will enable organizations to fully benefit from artificial intelligence, and investing today in the advanced fiber infrastructure that drives AI training and operation will deliver incredible results tomorrow.