
A Survey of System Software Techniques for Emerging NVMs

2017-03-29 12:52:49
ZTE Communications, 2017, Issue 1

BAI Tongxin, DONG Zhenjiang, CAI Manyi, FAN Xiaopeng

Abstract The challenges of power consumption and memory capacity in computers have driven rapid development of non-volatile memories (NVMs). NVMs are generally faster than traditional secondary storage devices, retain data persistently, and many offer byte-addressing capability. Despite these appealing features, NVMs are difficult to manage and program, which makes it hard to use them as a drop-in replacement for dynamic random-access memory (DRAM). Instead, a majority of modern systems use NVMs through the IO and file system abstractions. Hiding NVMs under these interfaces poses the challenge of how to exploit the new hardware's performance potential within the existing system software framework. In this article, we survey the key technical issues arising in this area and introduce several recently developed systems, each of which offers novel solutions to these issues.

Keywords non-volatile memory; persistent memory; file system; IO system

1 Introduction

Ever since the Internet and mobile computing came to dominate people's daily lives, the continuing supply of big data, which relies on immense computing power to extract the hidden big values, has demanded higher speed of data storage. The big data trend challenges the design of computer systems, in both hardware and software, to sustain the development of new data-intensive applications. For example, deep data analytics and in-memory computing demand shorter turn-around time between processing iterations, which translates to faster data transportation between storage and processors. At its current stage, in-memory computing uses dynamic random-access memory (DRAM) as the main medium for hot data storage, given its advantage in speed and bandwidth. However, DRAM will soon hit an energy wall as the total memory capacity in a data center keeps growing. According to a recent study, 100 petabytes of DDR3 DRAM main memory would consume 52 MW of power, which is far beyond the energy budget for building a future exascale data center [1], [2].

The combined requirements on capacity, performance, and power have motivated both industry and academia to pursue new technologies and build alternative memory devices to bridge the gap between fast DRAM and slow disks. In recent years, semiconductor manufacturers have invested heavily in non-volatile memory (NVM) devices. As the most mature NVM in its class, Flash is already widely used in commercial servers due to its high density and low static power consumption. However, Flash has notable downsides. Memory wear and block erasure make it ill-suited for random reads and writes, for which DRAM has outstanding performance [3]. Alternative NVM technologies have advanced rapidly, each expected to improve upon some or all of Flash's weaknesses. Among the most influential new types of NVMs are Phase Change Memory (PCM), Spin Transfer Torque RAM (STT-RAM), Resistive RAM (RRAM), Racetrack Memory, and Domain Wall Memory (DWM). Most recently, Intel and Micron jointly announced 3D XPoint and claimed the new NVM delivers 1000 times lower latency and greater endurance than conventional Flash. The density and performance boost is said to be enabled by a novel memory cell structure with a stackable data-access array.

Besides higher read-write performance, most of the new NVMs support some degree of byte-addressable random access. This fine-grained access capability would revolutionize the way data is communicated between on-core and off-core memory. Potentially, the CPU could directly address data on the secondary storage without first sending instructions to a device controller. A secondary storage device with byte-addressing capability is usually called Storage Class Memory (SCM). Once SCM is widely deployed, changes must be applied to the IO interface as well as the file system, because they are traditionally optimized for slow devices. Fast storage makes many of those optimizations less effective, or even harmful, and favors a simpler design of the software stack for better performance.
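To make the byte-addressing idea concrete, the following is a minimal C sketch of load/store access to SCM exposed through a file system with direct-access (DAX) support. The path /mnt/pmem/log is a hypothetical placeholder; mmap() and msync() are standard POSIX calls, while durable stores on real SCM hardware would additionally require platform-specific cache-line flush instructions.

/* Minimal sketch: byte-addressable access to storage class memory.
 * The file path is hypothetical; msync() is used here as a conservative,
 * portable way to force persistence for illustration purposes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* Map the file so the CPU can address its bytes directly with loads and stores. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "record #1");        /* an ordinary store, no read()/write() system call */
    msync(p, 4096, MS_SYNC);       /* force the update to persistent media */

    munmap(p, 4096);
    close(fd);
    return 0;
}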

The promising future of new hardware motivates software innovations that translate the lower-level improvements into higher-level usability. In this article we survey recent advances in how NVM adaptation is addressed in system software, particularly in the IO and file systems.

2 IO Subsystem for NVM

Fig. 1 summarizes a standard hierarchy of system components in which NVM devices share the same IO interface with other block devices. The usual procedure for reading and writing data on a block device begins with a user program issuing a system call with arguments specifying the location and size within the target file in the file system. After the program traps into kernel mode, the system call request transitions through multiple layers of kernel components, including the file system and the block IO interface, until it is finally translated into a sequence of low-level instructions to the device driver. Since accessing block devices can incur long latency, IO requests are usually processed asynchronously, leveraging Direct Memory Access (DMA) and interrupt handling to complete the data transfer while saving a large amount of CPU time. The software overhead during request processing comes from program state transitions, file system management, IO scheduling, interrupt handling, buffer cache management, and so on. Studies show that the software latency for processing a block IO request is on the order of 10 microseconds on modern Intel processors running the Linux operating system (OS). In reality, the exact software overhead varies across systems. Swanson et al. [4] measured that a single 512-byte IO request incurred about 19 microseconds of software overhead on a 2.27 GHz Intel Nehalem processor. In the test conducted by Yang et al. [5], a 512-byte IO operation cost 5 to 7 microseconds in software on a 2.93 GHz Intel Xeon processor. Compared to tens of milliseconds of latency due to disk access, the relative cost of software is very small. However, with NVM storage in place, the relative cost of software increases significantly. Modern Flash solid-state drives (SSDs) offer read-write latency of tens of microseconds, and it is projected that newer generations of NVM can further reduce the state of the art by 10 times [6]. With device latency greatly reduced, the originally small software and interface overhead will dominate the cost of an IO operation. Therefore, in future NVM systems, software and interface optimizations are the key to better storage performance. In the following we introduce a number of recent developments in high performance IO interfaces and processing techniques for NVM storage.
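The entry point of this path can be illustrated from user space. The sketch below, with a hypothetical NVM block device name, issues a single 512-byte read with pread(); O_DIRECT (a Linux-specific flag) bypasses the buffer cache so the request travels through the block layer straight to the driver.

/* Sketch of the IO path entry point: a system call that traps into the
 * kernel, traverses the block IO layers and reaches the device driver.
 * /dev/nvme0n1 is a placeholder device and typically needs root access. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 512) != 0) return 1;   /* O_DIRECT needs aligned buffers */

    ssize_t n = pread(fd, buf, 512, 0);    /* one 512-byte block IO request */
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}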

2.1 Moneta

Caulfield et al. proposed Moneta [7] in 2010, an experimental NVM interface framework. Based on simulations of Phase Change Memory (PCM), the experimental results show that the Moneta interface improves random read and write performance by 18 times over the baseline while reducing software overhead by 60%; 4 KB random read and write throughput can be maintained at around 450k IOPS. Moneta is based on a design comprised of a token ring network and a memory controller array, which improves the IO rate by exploiting the latency and parallelism advantages of the hardware. The detailed structure of Moneta IO is shown in Fig. 2. The IO scheduler is responsible for coordinating data transmission and request scheduling. Data is transferred between memory and the device through the PCIe interface. Requests are exchanged in the form of commands via a token ring network connecting the NVM memory controllers and the request queue. When the system is running, the driver software issues IO requests which are transmitted via the Peripheral Component Interconnect Express (PCIe) interface to the Moneta scheduler and then inserted into a first-in first-out (FIFO) queue. Requests larger than 8 KB need to be decomposed and transmitted in sequence. To streamline data transfer, each Moneta memory controller is equipped with two 8 KB buffers, used for caching the data read from and written to the storage after the requests are successfully processed. To make full use of the 2 GB/s bandwidth of PCIe, it is not enough to expand the capacity of the device interface. As Caulfield et al. [7] show, a series of kernel optimizations are applied to improve the efficiency of request processing on the CPU. These software approaches include 1) avoiding the default IO scheduler, 2) using Moneta's built-in atomic read-write operations in place of locked kernel IO operations, 3) allowing multiple threads to process device interrupts in parallel, and 4) using spinlocks to avoid unnecessary context switch overhead caused by interrupts. Combining the above methods, Moneta reduces the IO access latency to around 1 μs, a significant improvement relative to the baseline latency of 10 μs. Correspondingly, the effective bandwidth increases to hundreds of megabytes per second. Compared to traditional IO, Moneta excels with its comprehensive overhead reduction on interrupts, synchronization and scheduling.
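Moneta's optimizations were implemented inside the kernel and its own driver, so they cannot be reproduced directly here. As a rough user-space analogue of optimization 1), the sketch below selects the no-op ("none") scheduler for a block device through sysfs on a stock Linux system; the device name nvme0n1 is a placeholder and writing to sysfs requires root privileges.

/* User-space approximation of bypassing the default IO scheduler:
 * select the "none" scheduler for a (placeholder) block device. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/block/nvme0n1/queue/scheduler", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("none\n", f);     /* "noop" on older kernels without blk-mq */
    fclose(f);
    return 0;
}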

For evaluation, the Moneta prototype was tested against a set of database applications. One interesting observation is that traditional database optimizations targeting disk storage can counteract Moneta's own optimizations for NVM. In particular, PostgreSQL and MySQL receive less performance gain from the new framework than the simpler BerkeleyDB does.

2.2 Linux Multiqueue Block IO

The low latency and good parallelism of NVM storage would be underutilized if the original Linux IO subsystem kept using a single request queue, which easily becomes a performance bottleneck when the IO request rate approaches a million per second. To solve this scalability problem in NVM-based systems, Linux adopted a new block device interface, blk-mq, starting from kernel version 3.13. The internal structure of blk-mq is shown in Fig. 3. In the new IO framework, each IO request is processed in two separate phases. When an IO request arrives at the kernel through a system call, it is pushed onto a software staging queue dedicated to the CPU core on which the working thread is running. The request stays in the software queue while the kernel applies scheduling logic; it is then transmitted to a hardware dispatch queue, waiting for the hardware to be ready to process it. To achieve high concurrency, a single storage device can be configured with multiple dispatch queues to better utilize the parallel processing capability of CPUs and the in-band signaled interrupts of devices. In fact, as many as several thousand hardware queues can be allocated, a number determined by how many virtual contexts a device can support. For example, a device supporting MSI-X can allocate 2048 queues because it can register 2048 interrupts. This new design of the IO subsystem promotes fast and localized IO request processing, especially by reducing unnecessary remote memory accesses in a NUMA environment.
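For readers interested in the driver side, the following is a rough sketch (not taken from the kernel source) of how a block driver registers multiple hardware dispatch queues with blk-mq. The helper names and return types follow the post-4.x kernel style and have changed across kernel versions, so this is illustrative only.

/* Sketch of blk-mq driver registration: one tag set describing eight
 * hardware dispatch queues, and a queue_rq callback that would hand each
 * request to the (hypothetical) NVM hardware.  Kernel APIs vary by version. */
#include <linux/module.h>
#include <linux/blk-mq.h>

static blk_status_t nvm_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);
    /* ... submit rq to the NVM device here ... */
    blk_mq_end_request(rq, BLK_STS_OK);   /* complete immediately in this sketch */
    return BLK_STS_OK;
}

static const struct blk_mq_ops nvm_mq_ops = {
    .queue_rq = nvm_queue_rq,
};

static struct blk_mq_tag_set nvm_tag_set;
static struct request_queue *nvm_queue;

static int __init nvm_mq_init(void)
{
    nvm_tag_set.ops          = &nvm_mq_ops;
    nvm_tag_set.nr_hw_queues = 8;      /* one dispatch queue per hardware context */
    nvm_tag_set.queue_depth  = 128;
    nvm_tag_set.numa_node    = NUMA_NO_NODE;
    nvm_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

    if (blk_mq_alloc_tag_set(&nvm_tag_set))
        return -ENOMEM;

    nvm_queue = blk_mq_init_queue(&nvm_tag_set);
    if (IS_ERR(nvm_queue)) {
        blk_mq_free_tag_set(&nvm_tag_set);
        return PTR_ERR(nvm_queue);
    }
    return 0;
}

static void __exit nvm_mq_exit(void)
{
    blk_cleanup_queue(nvm_queue);
    blk_mq_free_tag_set(&nvm_tag_set);
}

module_init(nvm_mq_init);
module_exit(nvm_mq_exit);
MODULE_LICENSE("GPL");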

The blk-mq block device interface cleanly separates the software scheduling and hardware message buffering functionalities, which used to share a single request queue. The separation reduces synchronization and buffer congestion and hence leads to much improved IO scalability.

2.3 Poll or Interrupt?

Although a redesign of the IO subsystem's internal structure fundamentally improves IO scalability, other system software overheads still affect IO performance. Most noticeably, the overhead comes from interrupt handling and context switches. In a typical OS setting, after an IO request is submitted to the kernel to be processed, the calling thread either returns or simply blocks waiting for the response. Completion of the request starts with a device interrupt notifying the CPU that the data is ready. After the device driver picks up the interrupt and finishes the handling procedure, it notifies the IO subsystem, which then completes the remaining work and wakes up the suspended thread. This asynchronous style of IO saves valuable CPU time since the program waiting for the response can yield the CPU temporarily to other programs. In theory, however, the benefit of asynchronous IO only exists if the hardware latency is larger than the combined software overhead. When the device latency is reduced to the microsecond level, the benefit diminishes.

Polling provides a lighter-weight means of checking device status than waiting for an interrupt. To find out which is better with high performance NVM, Yang et al. compared throughput and latency results in a simulated environment [8]. The study shows that synchronous completion requires shorter time and yields better CPU utilization when the device latency is as low as several microseconds. Another interesting observation is that, in the synchronous mode, better hardware performance also reduces the software cost, whereas there is no such benefit for asynchronous IO. Furthermore, studying throughput scalability with 512-byte random reads, the authors find that the throughput of synchronous IO scales linearly with the number of CPUs, whereas asynchronous IO achieves only 60%-70% of the synchronous throughput. Because asynchronous IO remains well suited to long waits, a system with a complex device setup may process IO requests in a mixed mode of synchronous and asynchronous IO, balancing the load between the CPU and the devices.
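A minimal user-space sketch of the contrast, assuming a Linux kernel new enough to expose preadv2() with the RWF_HIPRI flag (kernel 4.6+, glibc 2.26+): the first call sleeps until the completion interrupt, while the second asks the kernel to busy-poll the device's completion queue. The flag only takes effect on devices and queues configured for polling, and the device path is a placeholder.

/* Interrupt-driven vs. polled completion for one 512-byte read. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 512) != 0) return 1;
    struct iovec iov = { .iov_base = buf, .iov_len = 512 };

    /* Interrupt-driven completion: the calling thread sleeps until the
     * device interrupt arrives and the kernel wakes it up. */
    ssize_t n1 = preadv2(fd, &iov, 1, 0, 0);

    /* Polled completion: the kernel spins on the completion queue,
     * trading CPU time for lower latency on fast NVM devices. */
    ssize_t n2 = preadv2(fd, &iov, 1, 0, RWF_HIPRI);

    printf("interrupt path: %zd bytes, polled path: %zd bytes\n", n1, n2);
    free(buf);
    close(fd);
    return 0;
}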

2.4 NVM Express

NVM Express (NVMe) [9], [10] is a new software interface specification for accessing NVM devices attached to the PCIe bus. A working group on NVMe was formed in 2007; technical work on the specification started that year, and the first release was completed in 2011. NVMe was designed from the ground up as an open device interface that exploits the low latency and parallelism of future NVM devices, increasing the capacity of the data path between CPU and storage. As a key design goal, the NVMe interface allows the processor to fully utilize the internal parallelism available in NVM devices and the bandwidth of the PCIe bus, hence effectively improving IO performance. An example IO subsystem supporting NVMe is shown in Fig. 4.

NVMe can support up to 65,536 request queues, with request submission and completion handled on separate sets of queues: submission queues (SQs) and completion queues (CQs). Separating the two stages reduces the likelihood of IO congestion, an issue that often arises when a single request queue deals with a large number of IO requests. Actual IO operations over the NVMe interface involve the software and the device exchanging commands and data through a dedicated memory-mapped area in the host program's address space. Moreover, an NVMe device can support up to 2048 virtual contexts. With highly scalable multi-queue based IO scheduling, the NVMe interface supports a high throughput, concurrent data path between CPU and storage.
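The following C sketch gives a simplified view of the two queue entry formats exchanged between host and device; the authoritative layouts are defined in the NVMe specification, and the field grouping here is abridged for illustration.

/* Simplified NVMe queue entries (field grouping abridged for illustration). */
#include <stdint.h>

/* 64-byte submission queue entry (SQE): the host fills one in and rings
 * the submission queue doorbell to notify the device. */
struct nvme_sqe {
    uint8_t  opcode;        /* e.g. read, write, flush */
    uint8_t  flags;
    uint16_t command_id;    /* echoed back in the completion entry */
    uint32_t nsid;          /* namespace identifier */
    uint64_t reserved;
    uint64_t metadata_ptr;
    uint64_t prp1;          /* physical region page pointers to data buffers */
    uint64_t prp2;
    uint32_t cdw10_15[6];   /* command-specific fields (LBA, length, ...) */
};
_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");

/* 16-byte completion queue entry (CQE): the device posts results here and
 * notifies the host by interrupt, or the host polls the phase bit. */
struct nvme_cqe {
    uint32_t result;        /* command-specific result */
    uint32_t reserved;
    uint16_t sq_head;       /* how far the device has consumed the SQ */
    uint16_t sq_id;
    uint16_t command_id;    /* matches the submitted SQE */
    uint16_t status;        /* status code and phase bit */
};
_Static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");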

The ecosystem of NVMe-based systems is being built collaboratively by device manufacturers, chipset providers, and software vendors. The first NVMe drives came from Samsung [11], LSI [12], Kingston [13] and Intel [14]. Operating systems with early support include Linux 3.3, Windows 8.1 and Windows Server 2012 R2. On chipsets, the major vendors have already added NVMe support to their new product lines. To further promote software development using NVMe features, Intel announced the Storage Performance Development Kit (SPDK), a development toolkit providing a user-level, poll-based IO programming interface.

Performance studies on actual NVMe systems have recently been carried out. Xu et al. measured and analyzed the performance differences of several database systems using an NVMe SSD and a Serial Advanced Technology Attachment (SATA) SSD on real machines [15]. The experimental results show that, compared to the SATA SSD, the software overhead of the NVMe SSD is reduced from 25% to 7%, while 4 KB read throughput increases from 70k IOPS to 750k IOPS. NVMe brings about an 8-times performance gain in the tested database applications. This study is solid evidence that a redesigned device interface is necessary for exploiting the potential of new storage hardware. The performance demonstration also assures system designers that NVM storage is ready to be a major investment for boosting overall service performance.

3 File Systems for NVM

A file system provides a data persistence service over a named space organized in directories. It strives to meet two practical goals: to maintain a consistent view of the data and to guarantee persisted data are reliably stored. In order to ensure the reliability and consistency, the file system needs careful organization of data layout as well as correct implementation of the operation semantics.

In a file system, stored objects are comprised of data and metadata, both of which may be accessed and modified by a single file operation. The latency of accessing a conventional storage device can be long and unstable, which is a major concern of modern file system optimization. Techniques such as grouping metadata based on access patterns can improve data locality, leading to better use of the buffer cache. For reliability and consistency, journaling and block-level copy-on-write are typically employed to guard against system crashes and power outages. These techniques bring additional operational overhead, which may contribute to severe performance degradation. When a file system is migrated to an NVM-based system, the change of storage structure can raise a host of new issues while eclipsing the effect of existing optimizations. For example, fast NVMs make data prefetching and caching no longer key mechanisms for latency reduction; moreover, the software overhead introduced by prefetching and caching can be significant in the setting of new systems.
