For Calient Technologies, an approach by one of the world’s largest data centre operators changed the company’s direction.
The company had been selling its 320x320 non-blocking optical circuit switch (OCS) for applications such as submarine cable landing sites and for government intelligence. Then, five years ago, a large internet content provider contacted Calient, saying it had figured out exactly where Calient’s OCS could play a role in the data centre.
But before the hyper-scale data centre operator would adopt Calient’s switch, it wanted the platform re-engineered. It viewed Calient’s then-product as power-hungry and had concerns about the switch’s reliability given it had not been deployed in volume. If Calient could make its switch cheaper, more power efficient and prove its reliability, the internet content provider would use it in its data centres.
Calient undertook the re-engineering challenge. The company did not change the 3D micro electromechanical system (MEMS) chip at the heart of the OCS, but it fundamentally redesigned the optics, electronics and control system surrounding the MEMS. The result is Calient’s S-Series switch family, the first product of which was launched three years ago.
“That switch family represented huge growth for us,” says Daniel Tardent, Calient’s vice president of marketing and product line manager. “We went from making one switch every two weeks to a large number of switches each week, just for this one customer.”
Calient has remained focussed on the data centre ever since, working to understand the key connectivity issues facing the hyper-scale data centre operators.
“All these big cloud facilities have very large engineering teams that work on customising their architectures for exactly the applications they are running,” says Tardent. “There isn’t one solution that fits all.”
There are commonalities among the players in how Calient’s OCS can be deployed, but what differs is the dimensioning and the connectivity each uses.
Greater commonality will be needed by those customers that represent the tier below the largest data centre players, says Tardent: “These don’t have a lot of engineering resource and want a more packaged solution.”
LightConnect fabric manager
Calient unveiled at the OFC 2015 show held in Los Angeles last month its LightConnect fabric manager software. The software, working with Calient’s S-Series switches, is designed to better share the data centre’s computing and storage resources.
Improving the utilisation of data centre resources is a new venture for Calient. Initially, the company tackled how the OCS could improve the data centre network linking the servers and storage, exploring how its OCS products could offload large packet flows - dubbed elephant flows - to improve overall efficiency.
Elephant flows are specific packet flows that need to be moved across the data centre. Examples include moving a virtual machine from one server to another, or replicating or relocating storage. Different data centre operators have different definitions as to what is an elephant flow but one data centre defines it as any piece of data greater than 20 Gbyte, says Calient.
If persistent elephant flows run through the network, they clog up the network buffers, impeding the shorter ‘mice’ flows that are just as important for the efficient working of the data centre. Congested buffers increase latency and adversely affect workloads. “If you are moving a large piece of data across the data centre, you want to move it quickly and efficiently,” says Tardent.
Calient’s OCS, by connecting top-of-rack switches, can be used to offload the elephant flows. In effect, the OCS acts as an optical expressway, bypassing the electrical switch fabric.
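The offload decision described above can be sketched in a few lines. This is an illustrative model only, not Calient’s implementation; the 20 GByte threshold is the one operator’s definition quoted earlier, and the function and path names are hypothetical.

```python
# Illustrative sketch: route flows above a size threshold over an optical
# circuit switch (OCS) "expressway" between top-of-rack switches, keeping
# short "mice" flows on the electrical packet-switched fabric.

ELEPHANT_THRESHOLD_BYTES = 20 * 10**9  # one operator's cutoff: 20 GByte

def choose_path(flow_size_bytes: int) -> str:
    """Return which fabric should carry a flow of the given size."""
    if flow_size_bytes >= ELEPHANT_THRESHOLD_BYTES:
        return "ocs"         # bypass the electrical fabric entirely
    return "electrical"      # mice flows stay on the layer-2/3 network

# A ~32 GByte virtual-machine migration would be offloaded to the OCS,
# while a 1 MByte request/response flow would not.
print(choose_path(32 * 10**9))
print(choose_path(1 * 10**6))
```

Keeping the elephants off the electrical fabric is what protects the buffers, and hence the latency of the mice flows, in the congestion scenario described above.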
Now, with the launch of the LightConnect fabric manager, Calient is tackling a bigger issue: how to benefit the overall economics of the data centre by improving server and storage utilisation.
“What really interests the big data players is how they can better utilise their compute and storage resources because that is where their major cost is,” says Tardent.
Today’s data centres run at up to 40 percent server utilisation. Given that the largest data centres can house 100,000 servers, just one percent improvement in usage has a significant impact on overall cost.
Calient claims that a 1.6 percent improvement in server utilisation covers the cost of introducing its OCS into the data centre, while an improvement of nine to 14 percent covers all the data centre’s networking costs. “The nine to 14 percent is a range that depends on how ‘thin’ or ‘fat’ the network layer is,” says Tardent. “A thin network design has less bandwidth and is less expensive.” Both types exist depending on the particular functions of the data centre, he says.
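The scale of these numbers can be seen with some back-of-envelope arithmetic. The server count and the 1.6 percent break-even point come from the article; the per-server cost is an assumed figure for illustration only.

```python
# Back-of-envelope sketch of the utilisation arithmetic. The cost per
# server is a hypothetical figure; the 1.6% break-even point is Calient's
# claimed threshold for covering the cost of the OCS.

SERVERS = 100_000          # scale of the largest data centres
COST_PER_SERVER = 3_000    # USD, assumed purely for illustration

def value_of_improvement(points: float) -> float:
    """Capital value of freeing `points` percent of server capacity."""
    freed_servers = SERVERS * points / 100
    return freed_servers * COST_PER_SERVER

# A one-point improvement frees the equivalent of 1,000 servers.
print(value_of_improvement(1.0))
# Calient's claimed OCS break-even point of 1.6 points.
print(value_of_improvement(1.6))
```

Even under conservative per-server cost assumptions, a single percentage point of utilisation across 100,000 servers represents the capital equivalent of a thousand machines, which is why the break-even thresholds Calient quotes are so low.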
Data centres are typically organised into pods or clusters. A pod is a collection of servers, storage and networking. Its make-up varies by data centre operator, but one example is 16 rows of eight server racks plus storage.
Pods are popular among the large data centre players because they enable quick replication, whether bringing resources online or enabling pod maintenance by switching in a replacement pod first.
One issue data centre managers must grapple with is when one pod is heavily loaded while others have free resources. One approach is to move the workload to the other lightly-used pods. This is non-trivial, though; it requires policies and advance planning, and it is not something that can be done in real time, says Tardent: “And when you move a big workload between pods, you create a series of elephant flows.”
An alternative approach is to move part of the workload to a less-used pod. But this runs the risk of increasing the latency between different parts of the workload. “In a big cloud facility with some big applications, they require very tightly-coupled resources,” he says.
Instead, data centre players favour over-provisioning: deliberately under-utilising their pods to leave headroom for worst-case workload expansion. “You spend a lot of money to over-provision every pod to allow for a theoretical worst-case,” says Tardent.
Calient proposes that its OCS switch fabric be used to effectively move platform resources to pods that are resource-constrained, rather than move workloads between pods. Hence the term virtual pods, or v-pods.
For example, some of the resources in two under-utilised pods can be connected to a third, heavily-loaded pod to create a virtual pod with effectively more rows of racks. “Because you are doing it at the physical layer as opposed to going through a layer-2 or layer-3 network, it truly is within the same physical pod,” says Tardent. “It is as if you have driven a forklift, picked up that row and moved it to the other pod.”
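The v-pod idea can be sketched as a simple resource-borrowing routine. This is a hypothetical model for illustration: the pod names, the 0.4 load threshold for donor pods, and the data structures are all assumptions, not details of Calient’s system.

```python
# Hypothetical sketch of the v-pod concept: rack rows in lightly loaded
# pods are cross-connected at the physical layer (via the OCS) to a
# heavily loaded pod, forming a "virtual pod" with more rows.

from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    rows: list    # rack rows currently wired into this pod
    load: float   # fraction of capacity in use, 0.0 to 1.0

def build_vpod(hot: Pod, donors: list, rows_needed: int) -> list:
    """Borrow spare rows from lightly loaded pods via OCS cross-connects."""
    borrowed = []
    for pod in donors:
        # Only borrow from pods with headroom (assumed 0.4 load cutoff).
        while rows_needed and pod.load < 0.4 and pod.rows:
            borrowed.append(pod.rows.pop())  # re-route the row's fibres
            rows_needed -= 1
    # The hot pod now spans its own rows plus the borrowed ones.
    return hot.rows + borrowed

hot = Pod("pod-A", rows=[f"A{i}" for i in range(16)], load=0.95)
spare1 = Pod("pod-B", rows=[f"B{i}" for i in range(16)], load=0.20)
spare2 = Pod("pod-C", rows=[f"C{i}" for i in range(16)], load=0.30)
vpod = build_vpod(hot, [spare1, spare2], rows_needed=4)
print(len(vpod))  # 16 own rows plus 4 borrowed rows
```

Because the borrowed rows are attached at the physical layer rather than reached through a layer-2 or layer-3 network, the enlarged pod behaves as a single physical pod, which is the “forklift” effect Tardent describes.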
In practice, data centre managers can pull resources from anywhere in the data centre, or they can allocate particular resources permanently to one pod by not going through the OCS optical layer.
The LightConnect fabric manager software can be used as a standalone system to control and monitor the OCS switch fabric. Or the fabric manager software can be integrated within the existing data centre management system using several application programming interfaces.
Calient has not quoted exact utilisation improvement figures that result from using its OCS switches and LightConnect software.
“We have had acknowledgement that this solution could deliver a significant percentage-utilisation improvement and we will be going into a proof-of-concept deployment with one of the large cloud data centres very soon,” says Tardent. Calient is also in discussion with several other cloud providers.
LightConnect will be available for commercial deployment from mid-year.