Download Ceph* configuration file [2KB]
Introduction
Ceph* is the most popular block and object storage backend. It is an open source distributed storage solution, widely adopted in public and private clouds. As solid-state drives (SSDs) become more affordable and cloud providers work to deliver high-performance, highly reliable, all-flash storage to their customers, there is strong demand for Ceph-based all-flash reference architectures, performance numbers, and optimization best-known methods.
Intel® Optane™ technology provides an unparalleled combination of high throughput, low latency, high quality of service, and high endurance. It is a unique combination of 3D XPoint™ Memory Media, Intel® Memory Controllers and Intel® Storage Controllers, Intel® Interconnect IP, and Intel® software. Together, these building blocks deliver a revolutionary leap forward in decreasing latency and accelerating systems for workloads demanding large capacity and fast storage.
The Intel® Xeon® Scalable processors with Intel® C620 series chipsets are workload-optimized to support hybrid cloud infrastructures and the most high-demand applications, providing high data throughput and low latency. Ideal for storage and data-intensive solutions, Intel Xeon Scalable processors offer a range of performance, scalability, and feature options to meet a wide variety of workloads in the data center, from the entry-level (Intel® Xeon® Bronze 3XXX processor family) to the most advanced (Intel® Xeon® Platinum 8XXX processor family).
As a follow-up to our previous article, Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions, we'd like to share our progress on Ceph all-flash storage system reference architectures and software optimizations based on Intel Xeon Scalable processors. In this paper, we present the latest Ceph reference architectures and performance results with the RADOS Block Device (RBD) interface, using Intel Optane technology with the Intel Xeon Scalable processor family (the Intel Xeon Platinum 8180 processor and the Intel® Xeon® Gold 6140 processor). We also include several Ceph software tunings that resulted in significant performance improvements for random workloads.
Ceph* Performance Optimization History
Working closely with the community, ecosystem, and partners, Intel has tracked Ceph performance since the Ceph Giant release. Figure 1 shows the performance optimization history for 4K random write workloads across Ceph major releases and different Intel platforms. With new Ceph major releases and backend storage changes, combined with platform and SSD upgrades, the 4K random write performance of a single node improved by 27x, from 3,673 input/output operations per second (IOPS) per node to 100,052 IOPS per node. This makes it possible to use Ceph to build high-performance storage solutions.
Figure 1. Ceph 4K RW per node performance optimization history.
Intel® Optane™ Technology with Ceph AFA on Intel® Xeon® Scalable Processors
In this section, we present Intel Optane technology with a Ceph all-flash array (AFA) reference architecture on Intel Xeon Scalable processors, together with performance results and system characteristics for typical workloads.
Configuration with Intel® Xeon® Platinum 8180 processor
The Intel Xeon Platinum 8180 processor is the most advanced processor in the Intel Xeon Scalable processor family. Pairing it with an Intel® Optane™ Solid State Drive (SSD) as the WAL device, NAND-based SSDs as data drives, and Mellanox* 40 GbE network interface cards (NICs) as high-speed Ethernet data ports provides the best-performing configuration in terms of both throughput and latency. It is ideally suited for I/O-intensive workloads.
Figure 2. Cluster topology.
Table 1. Cluster configuration.
Ceph Configuration with Intel Xeon Platinum 8180 Processor | |
---|---|
CPU | Intel Xeon Platinum 8180 processor @ 2.50 GHz |
Memory | 384 GB |
NIC | Mellanox 2x 40 GbE (80 Gb for Ceph nodes), Mellanox 1x 40 GbE (40 Gb for client nodes) |
Storage | Data: 4x Intel® SSD DC P3520 2.0 TB; WAL: 1x Intel® Optane™ SSD DC P4800X 375 GB |
Software Configuration | Ubuntu* 16.04, Linux* Kernel 4.8, Ceph version 12.2.2 |
The test system consists of five Ceph storage servers and five client nodes. Each storage node is configured with an Intel Xeon Platinum 8180 processor and 384 GB memory, using 1x Intel Optane SSD DC P4800X 375 GB as the BlueStore WAL device, 4x Intel® SSD DC P3520 2 TB as data drives, and 2x Mellanox 40 GbE NIC as separate cluster and public networks for Ceph.
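For reference, one way to lay out such an OSD with ceph-volume is sketched below; the device paths are illustrative assumptions, and the actual deployment tooling used for this cluster is not shown.

```
# Sketch, assuming /dev/nvme1n1 is one Intel SSD DC P3520 data drive and
# /dev/nvme0n1p1 is a partition carved from the Intel Optane SSD DC P4800X:
# create one BlueStore OSD with its WAL placed on the Optane partition.
ceph-volume lvm create --bluestore \
    --data /dev/nvme1n1 \
    --block.wal /dev/nvme0n1p1
```

Repeating this for each of the four data drives yields four OSDs per node, each with its write-ahead log on the Optane device.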
For clients, each node is set up with an Intel Xeon Platinum 8180 processor, 384 GB memory, and 1x Mellanox 40GbE NIC.
Ceph 12.2.2 was used, and each Intel® SSD DC P3520 drive ran one object storage daemon (OSD). The RBD pool used for the testing was configured with two replicas; the system topology is shown in Figure 2.
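As a minimal sketch (the pool name, placement-group count, and volume names are illustrative assumptions, not the exact values used in the test), such a two-replica RBD pool and pre-provisioned test volumes could be created as follows:

```
# Create a replicated RBD pool with two copies of each object
# (pool name "rbdbench" and PG count 2048 are illustrative assumptions)
ceph osd pool create rbdbench 2048 2048 replicated
ceph osd pool set rbdbench size 2
ceph osd pool application enable rbdbench rbd   # required on Luminous (12.2.x)

# Create 100 test volumes of 30 GB each; a full sequential write pass
# (for example, with fio) is then used to pre-allocate them and avoid
# thin-provisioning effects during measurement
for i in $(seq 1 100); do
    rbd create rbdbench/vol-$i --size 30720   # size is in MB by default
done
```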
Testing methodology
We designed four different workloads to simulate a typical all-flash Ceph cluster in the cloud, based on fio with librbd, including 4K random read and write, and 64K sequential read and write, to simulate the random workloads and sequential workloads, respectively. For each test case, the throughput (IOPS or bandwidth) was measured with the number of volumes scaling (to 100), with each volume configured to be 30 GB. The volumes were pre-allocated to eliminate the Ceph thin-provision mechanism’s impact to generate stable and reproducible results. The OSD page cache was dropped before each run to eliminate page cache impact. For each test case, fio was configured with a 300-second warm up and 300-second data collection. Detailed fio testing parameters are included in the downloadable Ceph configuration file.
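The exact parameters are in the downloadable Ceph configuration file; the job file below only sketches the shape of the 4K random write case against a single volume (the pool, image, client name, and queue depth are assumptions).

```
; Illustrative fio job for the 4K random write case against one RBD volume,
; using the librbd ioengine. Pool/image/client names and iodepth are assumptions;
; see the downloadable configuration file for the parameters actually used.
[global]
ioengine=rbd
clientname=admin
pool=rbdbench
rw=randwrite
bs=4k
iodepth=16
direct=1
ramp_time=300
runtime=300
time_based

[vol-1]
rbdname=vol-1
```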
Performance overview
The Intel Optane technology-based Ceph AFA cluster demonstrated excellent throughput and latency. The 64K sequential read and write throughput is 21,949 MB/s and 8,714 MB/s, respectively (maximums with 40 GbE NIC). The 4K random read throughput is 2,453K IOPS with 5.36 ms average latency, while 4K random write throughput is 500K IOPS with 12.79 ms average latency.
Table 2. Performance overview.
Workload | Peak Performance | Avg. Latency (ms) | Avg. CPU % | IOPS/CPU |
---|---|---|---|---|
4K Random Write | 500,259 IOPS | 12.79 | 50 | 10,005 |
4K Random Read | 2,453,200 IOPS | 5.36 | 60.87 | 40,302 |
64K Sequential Read | 21,949 MB/s | 36.78 | 30.4 | 722 |
64K Sequential Write | 8,714 MB/s | 45.87 | 18.4 | 474 |
System characteristics
To better understand the system characteristics and for performance projection, we did a deep investigation of system-level characteristics including CPU utilization, memory utilization, network bandwidth, disk IOPS, and latency.
For the random workloads, CPU utilization is 50 percent for 4K random writes and 60 percent for 4K random reads, while memory and network consumption are relatively low. The average IOPS on each P3520 drive is 20K for random writes and 80K for random reads, which leaves plenty of headroom for further performance improvement. For the sequential workloads, CPU utilization and memory consumption are quite low for sequential writes, and the NIC bandwidth is clearly the bottleneck for sequential reads.
4K Random write characteristics
User-space CPU consumption is 37 percent, about 75 percent of total CPU utilization. Profiling showed that most of these CPU cycles are consumed by the Ceph OSD process. The suspected reason the remaining CPU headroom cannot be used is that the software threading and locking model limits Ceph's scale-up ability on a single node; addressing this remains the next optimization step.
Figure 3. System metrics for 4K random write.
4K Random read characteristics
CPU utilization is about 60 percent, of which IOWAIT accounts for about 15 percent, so real CPU consumption is about 45 percent, similar to the random write case. The read IOPS of each OSD disk is quite steady at 80K, and 40 GbE NIC bandwidth is about 2.1 GB/s. No obvious hardware bottleneck was observed; the suspected software bottleneck is similar to the 4K random write case and needs further investigation.
Figure 4. System metrics for 4K random read.
64K Sequential write characteristics
CPU utilization and memory consumption for sequential writes are quite low. Since the OSD replication factor is 2, the transmitted NIC bandwidth is twice the received bandwidth, and the transmit traffic is spread across the two NICs, one for the public network and one for the cluster network, with each NIC carrying about 1.8 GB/s per port. OSD disk await time fluctuates severely, with peak disk latency over 4 seconds, while disk IOPS remains quite steady.
Figure 5. System metrics of 64K sequential write.
64K Sequential read characteristics
For the sequential read case, we observed that the bandwidth of one NIC reaches 4.4 GB/s, about 88 percent of its theoretical 40 GbE bandwidth. CPU utilization and memory consumption are also quite low in this case. OSD disk read IOPS and latency are steady.
Figure 6. System metrics of 64K sequential read.
Performance comparison with Intel® Xeon® processor E5 2699
Table 3. Intel® Xeon® processor E5 2699 cluster configuration.
Ceph Configuration with Intel® Xeon® Processor E5 2699 | |
---|---|
CPU | Intel Xeon processor E5-2699 v4 @ 2.2 GHz |
Memory | 128 GB |
NIC | Mellanox 2x 40 GbE (80 Gb for Ceph nodes), Mellanox 1x 40 GbE (40 Gb for client nodes) |
Storage | Data: 4x Intel® SSD DC P3520 2.0 TB; WAL: 1x Intel® Optane™ SSD DC P4800X 375 GB |
Software Configuration | Ubuntu* 16.04, Linux* Kernel 4.8, Ceph version 12.2.2 |
The test system with Intel® Xeon® processor E5 2699 shares the same cluster topology and hardware configuration as the test system with an Intel Xeon Platinum 8180 processor. The only difference is that each server or client node is set up with Intel Xeon processor E5-2699 v4 and 128 GB memory.
For the software configuration, Ceph 12.0.3 was used in the Intel Xeon processor E5-2699 v4 test, and each Intel® SSD DC P3520 drive ran four OSD daemons. This differs from the Intel Xeon Platinum 8180 processor configuration, which runs only one OSD daemon per drive.
Table 4. Performance comparison overview.
Workload | Intel Xeon Processor E5-2699 v4 + Ceph 12.0.3 | Intel Xeon Platinum 8180 Processor + Ceph 12.2.2 |
---|---|---|
4K Random Write | 452,760 IOPS | 500,259 IOPS |
4K Random Read | 2,037,400 IOPS | 2,453,200 IOPS |
64K Sequential Write | 7,324 MB/s | 8,714 MB/s |
64K Sequential Read | 21,264 MB/s | 21,949 MB/s |
As shown in Table 4, all four input/output pattern test results with the Intel Xeon Platinum 8180 processor are better than with the Intel Xeon processor E5-2699 v4. In particular, for the 4K random write and random read tests, throughput with the Intel Xeon Platinum 8180 processor improved by 10 percent and 20 percent, respectively.
Performance comparison with Intel® Xeon® Gold 6140 processor
Table 5. Cluster configuration.
Ceph Configuration with Intel Xeon Gold 6140 Processor | |
---|---|
CPU | Intel Xeon Gold 6140 processor @ 2.30 GHz |
Memory | 192 GB |
NIC | Mellanox 2x 40 GbE (80 Gb for Ceph nodes), Mellanox 1x 40 GbE (40 Gb for client nodes) |
Storage | Data: 4x Intel® SSD DC P3520 2.0 TB; WAL: 1x Intel® Optane™ SSD DC P4800X 375 GB |
Software Configuration | Ubuntu* 16.04, Linux* Kernel 4.8, Ceph version 12.2.2 |
The test system consists of five Ceph storage servers and five client nodes. For servers, each node is set up with an Intel Xeon Gold 6140 processor and 192 GB memory, using 1x Intel Optane SSD DC P4800X 375 GB as the BlueStore WAL device, 4x Intel SSD DC P3520 2 TB as data drives, and 2x Mellanox 40 GbE NIC as separate cluster and public networks for Ceph.
For clients, each node is set up with an Intel Xeon Gold 6140 processor with 192 GB memory and 1x Mellanox 40 GbE NIC.
Ceph 12.2.2 was used, and each Intel SSD DC P3520 drive ran one OSD daemon. The RBD pool used for the testing was configured with two replicas; the system topology is shown in Figure 2.
Table 6. Performance comparison.
Workload | Intel Xeon Platinum 8180 Processor | Intel Xeon Gold 6140 Processor |
---|---|---|
4K Random Write | 500,259 IOPS | 450,553 IOPS |
4K Random Read | 2,453,200 IOPS | 2,025,400 IOPS |
64K Sequential Write | 8,714 MB/s | 7,379 MB/s |
64K Sequential Read | 21,949 MB/s | 22,182 MB/s |
As shown in Table 6, performance with the Intel Xeon Platinum 8180 processor is better than with the Intel Xeon Gold 6140 processor for 4K random read (1.21x), 4K random write (1.11x), and 64K sequential write (1.18x). Since both configurations hit the 40 GbE hardware limit for 64K sequential reads, their bandwidth results are similar.
Ceph Software Optimization
Background
A Ceph Block Device stripes a block device image over multiple objects in the Ceph Storage Cluster. Each object is mapped to a placement group, and CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) distributes the placement groups across separate Ceph OSD daemons throughout the cluster. When an OSD processes object requests, the requests are sharded by their placement group identifier. Each shard has its own queue, and these queues neither interact nor share information with each other. The number of shards and the number of worker threads per shard can be controlled with the configuration options "osd_op_num_shards" and "osd_op_num_threads_per_shard", respectively. Proper values make better use of CPU and memory and have a direct impact on overall performance.
Table 7. Ceph OSD configuration description.
Configuration | Description |
---|---|
osd_op_num_shards | Number of request queues (shards) per OSD |
osd_op_num_threads_per_shard | Number of worker threads servicing each queue |
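For illustration, these options can be set in the [osd] section of ceph.conf and take effect after the OSD daemons are restarted; the threads-per-shard value below is only an example, not necessarily the tuned value from this study.

```
[osd]
# Number of request queues (shards) per OSD; 64 gave the best result in this study
osd_op_num_shards = 64
# Worker threads servicing each shard queue (illustrative value only)
osd_op_num_threads_per_shard = 2
```

The effective values on a running OSD can be checked with, for example, `ceph daemon osd.0 config show | grep osd_op_num`.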
Performance evaluation of osd_op_num_shards
Figure 7. Ceph OSD tuning performance comparison.
From the performance evaluation results, we observed a 1.19x performance improvement after tuning osd_op_num_shards to 64, while further increasing osd_op_num_shards from 64 to 128 showed a slight performance regression.
Performance evaluation of osd_op_num_threads_per_shard
Figure 8. Ceph OSD tuning performance comparison.
From performance evaluation results, we observed a 1.17x performance improvement after optimization.
Ongoing and Future Optimizations
Based on the performance numbers and system metrics above, further optimization of the Ceph software stack is needed to resolve the OSD scale-up issues and take full advantage of the hardware's capabilities and features.
Better RDMA integration with the Ceph Async Messenger
The CPU utilization flame graphs for 4K random read and 4K random write show that 28.8 percent and 22.24 percent of CPU, respectively, is spent handling network-related work over Ethernet. With the increasing demand to replace Ethernet with remote direct memory access (RDMA), and with an optimized RDMA integration in Ceph, this portion of CPU utilization can be reduced and freed for other applications.
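For reference, Ceph's async messenger already exposes an RDMA backend; a minimal sketch of enabling it in ceph.conf might look like the following (the device name is an assumption that depends on the installed RDMA-capable NIC).

```
[global]
# Switch the async messenger transport from TCP to RDMA
ms_type = async+rdma
# RDMA-capable device to use (name is an assumption; list devices with `ibv_devices`)
ms_async_rdma_device_name = mlx5_0
```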
A new native SPDK/NVMe-Focused object store based on BlueStore
A log-structured BTree object store is now under discussion in the Ceph community, which aims to significantly improve small-object input/output performance on non-volatile memory express (NVMe) devices. This would involve moving the fast paths of the OSD into a reactive framework (Seastar*) and eliminating the severe rewrite performance limit imposed by using RocksDB (a log-structured merge-tree) as a foundational building block. See New ObjectStore.
A new async-osd
The Ceph OSD is now being refactored into an async-osd for future generations. The goal is to integrate Seastar, a futures-based framework designed for shared-nothing user-space scheduling and networking, into the Ceph OSD code so that it works better with coming fast (non-volatile random-access memory speed) devices.
Summary
In this paper, we presented performance results of Intel Optane technology with Ceph AFA reference architecture on Intel Xeon Scalable processors. This configuration demonstrated excellent throughput and latency. The 64K sequential read and write throughput is 21,949 MB/s and 8,714 MB/s, respectively (maximums with 40 GbE NIC). 4K random read throughput is 2,453K IOPS with 5.36 ms average latency, while 4K random write throughput is 500K IOPS with 12.79 ms average latency.
For read-intensive workloads, especially with small blocks, a top-bin processor from the Intel Xeon Scalable processor family, such as the Intel Xeon Platinum 8180 processor, is recommended. It provides up to 20 percent performance improvement compared with the Intel Xeon Gold 6140 processor.
Software tuning and optimization also provided up to a 19 percent performance improvement for both reads and writes compared to the default-configured Intel Optane technology with Ceph AFA cluster on Intel Xeon Scalable processors. Since hardware headroom remains with the current hardware configuration, performance should continue to improve in the near future with ongoing Ceph optimizations such as the RDMA messenger, the NVMe-focused object store, and async-osd.
About the Authors
Chendi Xue is a member of the Cloud Storage Engineering team at Intel Asia-Pacific Research & Development Ltd. She has five years' experience in Linux cloud storage system development, optimization, and benchmarking, including Ceph benchmarking and tuning, CeTune (a Ceph benchmark tool) development, and HDCS (a hyper-converged distributed cache storage system) development.
Jian Zhang manages the cloud storage engineering team in Intel Asia-Pacific Research & Development Ltd. The team’s focus is primarily on open source cloud storage performance analysis and optimization, and building reference solutions for customers based on OpenStack Swift and Ceph. Jian Zhang is an expert on performance analysis and optimization for many open source projects, including Xen, KVM, Swift and Ceph, and benchmarking workloads like SPEC*. He has worked on performance tuning and optimization for seven years and has authored many publications related to virtualization and cloud storage.
Jianpeng Ma is a member of the Cloud Storage Engineering team at Intel Asia-Pacific Research & Development Ltd. He is currently focused on Ceph development and performance tuning for Intel platforms and reference architectures. Before joining Intel, Jianpeng gained software development and performance optimization experience working on the md driver in the Linux kernel.
Jack Zhang is currently a senior SSD Enterprise Architect in Intel’s NVM (non-volatile memory) solution group. He manages and leads SSD solutions and optimizations and next generation 3D XPoint solutions and enabling across various vertical segments. He also leads SSD solutions and optimizations for various open source storage solutions, including SDS, OpenStack, Ceph, and big data. Jack held several senior engineering management positions before joining Intel in 2005. He has many years’ design experience in firmware, hardware, software kernel and drivers, system architectures, as well as new technology ecosystem enabling and market developments.
References
- Ceph website
- Architecture and technology: Intel® Optane™ Technology
- A 3D animation: Intel® 3D NAND Technology Transforms the Economics of Storage
- Our earlier article: Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions
- The New ObjectStore
- Description of async-osd