Authors: Ravi Murty and Rajesh Sudarsan
Introduction
The importance of file IO in HPC applications is well known. Large HPC applications read their inputs from files and write their outputs to files throughout the lifetime of their execution. In some instances the output of one run is reused as input in subsequent runs of the same or a different application, and the data is shared via files. Additionally, applications running on a large number of nodes in a cluster periodically store a snapshot of their execution state as checkpoint data. This checkpoint data can be used to restart the application in the event that it is interrupted, for example by a node failure in the cluster. Other usage models that demand good file IO performance include routine system administration activities such as installing packages (e.g. RPMs).
These usage models demand good file IO performance from the underlying operating system. In this article we describe our investigation of the system calls used for file IO, such as read(2) and write(2), the metrics used to measure their performance, and the optimizations we implemented to improve their performance on the Intel® Xeon Phi™ coprocessor.
The Virtual File System (VFS) layer
Before we get into the details of our investigations and describe the improvements made on the Intel® Xeon Phi™ coprocessor, let’s take a quick look at the VFS layer in Linux*. As shown in Figure 1, the VFS is the generic file system layer in the kernel that provides the file system interface to application programs.
Figure 1: The VFS layer is the heart of the file system in the Linux kernel running on the Intel® Xeon Phi™ coprocessor
Additionally, it provides an abstraction in the kernel which allows different file system implementations to “plug in” and co-exist. The figure shows the commonly used file system types (mounted file systems) available on the Intel® Xeon Phi™ coprocessor, namely tmpfs [1][2], ramfs and NFS [4]. If you have the Linux kernel sources handy, Documentation/filesystems/vfs.txt describes the VFS layer in greater detail.
There are a couple of other interesting points to note here:
- Both tmpfs and ramfs are what we call local file systems. They are similar because data written to files that “live” in either file system (or mount points) is stored in memory and not on any kind of persistent storage device (i.e. the files are lost after the kernel reboots). Documentation/filesystems/tmpfs and Documentation/filesystems/ramfs-rootfs-initramfs.txt describe the tmpfs and ramfs respectively. The differences between the two include the fact that tmpfs provides support for limit checking and the ability to swap out pages of data to a swap partition.
- The network file system (NFS) is what we call a remote file system. In this case, data written to files on this mount point actually lives in persistent storage on a remote server. The local NFS client sends and receives commands (e.g. to open a file) and data over a network transport protocol like TCP to a remote NFS server that reads and writes to storage.
The page cache
One important aspect of the VFS layer is the page cache [5]. It is an application-transparent, in-memory cache of data that is read from or written to files on storage.
Figure 2: Interaction of the page cache with the VFS for write(2)
It should be fairly intuitive why Linux maintains this cache of file data in memory: disk reads and writes are slow, and caching file data in memory speeds up file accesses. Figure 2 shows the involvement of the page cache in the context of the write(2) system call. With the page cache in place, if a read or write system call finds the target page in memory (a “hit”, the right side of the figure), the data in the page can simply be read or written, and the page marked dirty in the case of a write, without requiring access to the disk. If the page isn’t found, one is allocated and either zeroed or updated by reading the latest file data from disk (the left side of the figure). In the 2.6.38 kernel, the page cache is implemented as a radix-tree data structure in which each page is indexed by its offset within the file (the position where the current read or write will take place).
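As a rough illustration of that indexing, the sketch below (not kernel source, but built only from page-cache helpers that exist in this kernel version) shows how a byte offset in a file is converted to a page index and looked up in the radix tree:

```c
/*
 * Illustrative sketch: how a file offset maps to a page-cache lookup.
 * find_get_page() walks the radix tree keyed by the page index and
 * returns the cached page, if any, with a reference held.
 */
#include <linux/pagemap.h>

static bool page_is_cached(struct address_space *mapping, loff_t pos)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;	/* byte offset -> page index */
	struct page *page = find_get_page(mapping, index);

	if (!page)
		return false;			/* miss: caller must allocate and/or read from disk */

	page_cache_release(page);		/* hit: drop the reference we took */
	return true;
}
```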
So how does the page cache help when an existing file is read for the first time? The VFS layer tries to be smart and performs a read-ahead [6] operation when it detects that the application has a sequential read pattern (i.e. the current read system call requests data from the offset exactly where the last read system call stopped reading). In this case, the kernel caches pages of data from the disk in an asynchronous context so that subsequent read or write system calls find the requested page in memory and complete quickly. This is obviously a simplification of read-ahead, but that’s the basic idea.
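A minimal sketch of that sequential-access detection idea is shown below. The real heuristics live in mm/readahead.c and are considerably more sophisticated; the structure and field names here are hypothetical and used only to illustrate the concept:

```c
/*
 * Simplified illustration of sequential-access detection for readahead:
 * a read is treated as "sequential" if it begins exactly where the
 * previous read ended. This is NOT the kernel's actual heuristic.
 */
#include <linux/types.h>

struct seq_tracker {
	loff_t prev_end;	/* file offset where the last read finished */
};

static bool read_is_sequential(struct seq_tracker *t, loff_t pos, size_t len)
{
	bool sequential = (pos == t->prev_end);

	t->prev_end = pos + len;	/* remember where this read ends */
	return sequential;
}
```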
Finally, and with this background about the page cache, you can see that the ramfs and tmpfs are file systems built around the page cache – i.e. all the data for a file lives in pages in the page cache with no backing store on a disk. They are simple memory file systems!
Performance improvements of local file system on Intel® Xeon Phi™ Coprocessor
The “problems”
After analyzing the VFS layer on the Intel® Xeon Phi™ coprocessor, we identified a few “hot spots” in the kernel. We describe them here before delving into the specific optimizations that helped mitigate the performance bottlenecks. For brevity, let’s focus on the write(2) system call as shown in Figure 2. When an application calls write(2), after checking things like user permissions, the VFS executes the following set of operations in a while loop over the entire length of the user request (assuming that the file is new and every page is allocated for the first time); a simplified sketch of this loop follows the list:
- (a) allocate a page and lock it (for IO)
- (b) add the locked page to the page cache radix-tree data structure
- (c) copy data from the passed-in user buffer to the page (the IO operation), taking care of file offsets and faulting in the user buffer
- (d) mark the page cache page uptodate and dirty (since data was copied into the page)
- (e) finally, unlock the page
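The sketch below is modeled loosely on generic_perform_write() in mm/filemap.c. It omits error handling, partial copies, and the atomic-copy/prefault logic the real write path uses, so treat it as an illustration of steps (a) through (e) rather than the actual implementation:

```c
/*
 * Simplified sketch of the per-page loop behind write(2).
 * Modeled loosely on generic_perform_write(); not the real kernel code.
 */
#include <linux/kernel.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

static ssize_t write_loop_sketch(struct address_space *mapping,
				 const char __user *buf, size_t count, loff_t pos)
{
	ssize_t written = 0;

	while (count) {
		pgoff_t index   = pos >> PAGE_CACHE_SHIFT;	/* which page of the file */
		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);	/* offset within that page */
		size_t bytes    = min_t(size_t, PAGE_CACHE_SIZE - offset, count);
		struct page *page;
		void *kaddr;

		/* (a) + (b): find the page or allocate one and insert it into the
		 * page cache radix tree; it is returned locked for IO. */
		page = find_or_create_page(mapping, index, GFP_KERNEL);
		if (!page)
			return written ? written : -ENOMEM;

		/* (c): copy from the user buffer into the page cache page.
		 * The real kernel prefaults the user buffer and uses an atomic
		 * copy here to avoid sleeping with the page locked. */
		kaddr = kmap(page);
		if (copy_from_user(kaddr + offset, buf + written, bytes)) {
			kunmap(page);
			unlock_page(page);
			page_cache_release(page);
			return written ? written : -EFAULT;
		}
		kunmap(page);

		/* (d): the page now holds valid data and is newer than storage */
		SetPageUptodate(page);
		set_page_dirty(page);

		/* (e): unlock the page and drop our reference */
		unlock_page(page);
		page_cache_release(page);

		pos += bytes;
		written += bytes;
		count -= bytes;
	}
	return written;
}
```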
- Cost of allocating pages for the page cache – As mentioned above, if a page is not already in the page cache, the VFS layer allocates a new page to cache data in memory. In Linux, a request for a physical page is satisfied from a per-CPU page pool. In most cases this is a fairly inexpensive operation because accessing a per-CPU pool does not involve any locks, as long as the thread doing it is not preempted. However, when this pool is empty, the kernel physical page allocator is called upon for a free page. This involves grabbing the right zone lock and an LRU lock, and finding a set of physical pages to re-fill the per-CPU pool, from which the requested single 4KB page can be returned. On a processor with lower-frequency, in-order cores, this is slow. Additionally, given the L1 and L2 cache sizes on the Intel® Xeon Phi™ coprocessor, the cache misses incurred while accessing these kernel data structures reduce performance even further.
- Cost of copying data from/to the user buffer – The kernel versions of copy_from_user and copy_to_user (and their _inatomic variants) essentially turn into a memcpy routine with fault handling. On the Intel® Xeon Phi™ coprocessor a memcpy like this is expensive for the same two reasons listed above: a cache miss is expensive, and the single-threaded performance and frequency of the Intel® Xeon Phi™ coprocessor are lower than those of the Intel Xeon® host processors. Additionally, since the kernel version of memcpy does not take advantage of the highly efficient vector instructions available on the Intel® Xeon Phi™ coprocessor, copying a large amount of data in and out of the page cache can be very inefficient. As a side effect, these copies pollute the caches with data that the kernel is not going to touch again.
- Cost of computing the page “wait queue” hash – When a page is being updated with data from a user buffer, the kernel keeps other threads from reading or writing that page to avoid data corruption. This is accomplished by grabbing a “lock” bit associated with the page. If the page is already locked (step (a) above), other threads are put to sleep until the page is unlocked (step (e) above) before they proceed with a read or write to the page. The head of the queue on which these threads sleep is computed, at runtime, via a simple hash of the address of the page structure describing the physical page in question. On most processors, computing this hash on demand is cheaper than storing it and reading it back from memory (which risks a cache miss), so on-demand computation is normally preferred. On the Intel® Xeon Phi™ coprocessor, however, the hash computation itself turns out to be expensive. A sketch of this hash-based lookup appears after this list.
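The sketch below mirrors what page_waitqueue() in mm/filemap.c does in this kernel version: the address of the struct page is hashed into a per-zone table of wait queue heads. It is shown only to illustrate the computation that showed up as a hot spot:

```c
/*
 * Sketch modeled on page_waitqueue(): the page structure's address is
 * hashed into a per-zone table of wait queue heads. The hash_ptr() call
 * is the computation that proved expensive on the coprocessor.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/hash.h>
#include <linux/wait.h>

static wait_queue_head_t *page_waitqueue_sketch(struct page *page)
{
	const struct zone *zone = page_zone(page);

	/* Index into the zone's wait table using a hash of the page address */
	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
```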
Optimizations
After analyzing the hot spots in the VFS layer and identifying the three areas as opportunities for improvements, we implemented the three key optimizations described here:
- Allocating “higher order” pages – The page cache is made up of 4KB pages, and the cost of allocating each page is significant, especially when the buffer being written to the file is large (since many of these pages must be allocated). One important observation about the write(2) system call is that, in a single call, the length parameter describes a sequential set of bytes to be written to the file. This means we know, ahead of time, how many 4KB page cache pages we expect to write to. Performance is gained because the cost of allocating single 4KB pages can be amortized by grabbing a higher order page [7][9] and splitting it into 4KB pages before adding them to the page cache (a sketch of this approach follows the list). A future enhancement is to batch the insertion of a set of pages into the page cache (radix-tree data structure).
- Speeding up “memcpy” – As described in step (c) above, data is copied from the user buffer into a kernel page cache page using a kernel version of memcpy. This is very inefficient; by using the DMA engine or the vector instructions on the Intel® Xeon Phi™ coprocessor we have an opportunity to improve performance. While the DMA engine is extremely efficient and has no cache pollution side effects, there is a cost associated with using it. In addition to the cost of programming the “memory copy” descriptors to get the DMA engine going, the physical pages backing the user buffers have to be pinned in memory before the DMA operation can be initiated. To speed up the process of pinning user pages, our implementation uses __get_user_pages_fast, which grabs no locks if certain conditions are met.
- Pre-computing the hash for the page wait queue head – Noting that page structure addresses do not change after boot, we realized that the VFS layer can avoid computing the hash of the page structure address at runtime if it is pre-computed and stored in a new field in the page structure itself. With the hash pre-computed, unlock_page() in step (e) above can simply use the stored value to reach the head of the wait queue associated with the page and look for any threads waiting for the page to be unlocked. As a side note, if there were a race-free method of knowing whether anyone was on the wait queue, we could avoid doing any of this (another future optimization). One caveat with this optimization is that it adds an 8-byte field to the kernel page data structure to store the pre-computed hash value. This is an overhead in terms of the kernel’s memory footprint because a page structure is required for every physical page in the system. For example, in a system with 16GB of memory, this option requires an additional 32MB of memory (4 million page structures times 8 bytes) for some improvement in performance. For this reason, this optimization is turned off by default at compile time.
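The following sketch shows the higher order allocation idea from the first bullet above. The helper name is hypothetical and this is not the actual Intel® MPSS patch, but alloc_pages() and split_page() are the standard kernel interfaces such a change would build on:

```c
/*
 * Hypothetical helper sketching the higher-order allocation idea: take one
 * order-N block from the buddy allocator and split it into independent 4KB
 * pages that can later be inserted into the page cache, amortizing the
 * allocator cost over 2^N pages.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static int alloc_page_batch_sketch(struct page **pages, unsigned int order)
{
	unsigned int i, nr = 1U << order;
	struct page *block = alloc_pages(GFP_HIGHUSER_MOVABLE, order);

	if (!block)
		return -ENOMEM;

	/* Turn the contiguous block into 2^order independent 4KB pages. */
	split_page(block, order);

	for (i = 0; i < nr; i++)
		pages[i] = block + i;	/* struct pages of one block are contiguous */

	return 0;
}
```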
The “before and after”
The following graphs show the performance improvements due to the VFS optimizations for reads and writes to files in the tmpfs and ramfs file systems. The improvements are measured using IOzone* [8] across a range of buffer sizes from 4MB to 1GB (horizontal axis) using a 4MB record size (other record sizes follow similar trends), and are compared against the existing read/write performance without any optimizations.
Figure 3: write/re-write and read/re-read improvements on tmpfs as measured by IOzone
Figure 4: write/re-write and read/re-read improvements on ramfs as measured by IOzone
In Figure 3 and Figure 4 the “–opt” suffix in the legend denotes the result of our optimizations for read, write, re-read and re-write on both mount points. As seen in these results, our optimizations improve write performance by about 2X and read performance by 3.3X for tmpfs files. On ramfs, the speedup is even higher: 2.3X for write and 3.8X for read. While both file systems are similar, our optimizations provide bigger gains for ramfs because tmpfs has the additional overhead of maintaining directory entries to keep track of the swap cache. Another observation is that, for read, the use of DMA for copying data is the single major contributor to the improvement, while for write it contributes about 75% of the total improvement. The higher order page allocation plays an important role for write, especially when the length of the buffer is large. Note that these results do not include the effect of pre-computing the hash value, since that optimization is turned off by default at compile time.
Figure 5 and Figure 6 show results with page wait queue hash optimization turned on. In this case, our optimizations improve tmpfs write performance by an average of 5% and read performance by an average of 8.2%. This is the performance trade-off for a slightly increased memory footprint.
Figure 5: write/re-write and read/re-read improvements on tmpfs as measured by IOzone (includes page wait queue optimization)
Figure 6: write/re-write and read/re-read improvements on ramfs as measured by IOzone (includes page wait queue optimization)
So how can one best utilize these optimizations on the Intel® Xeon Phi™ coprocessor?
As the results above show, the use of the DMA engine provides the largest gain in performance. To get the best performance when reading and writing files with these optimizations, keep the following in mind:
Data alignment: Since the hardware DMA engine requires addresses to be aligned to cacheline boundaries, using cacheline-aligned buffers provides the best results because the optimizations described above can simply use the efficient DMA engine to copy data. If the user buffers are not cacheline-aligned, the implementation falls back to the vector memory copy routine, resulting in slightly lower performance.
Multiple small files versus a single large file: Reading or writing multiple small files (less than 16KB) may not yield the best performance because the overhead of programming the DMA engine may eclipse any performance benefit it provides. Hence it is recommended that, wherever possible, multiple files be combined into a single file before the read or write operation. With a large file, the overhead of programming the DMA engine is amortized over the entire operation. The short user-space example below illustrates both points.
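A minimal user-space sketch, assuming a 64-byte cacheline and an illustrative output path and file size:

```c
/*
 * Write one large, cacheline-aligned buffer instead of many small
 * unaligned ones. The 64-byte alignment matches the coprocessor's
 * cacheline size; the file name and size are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 20;		/* one 64MB write, not many small ones */
	void *buf = NULL;
	int fd;

	if (posix_memalign(&buf, 64, len) != 0)	/* cacheline-aligned user buffer */
		return 1;
	memset(buf, 0xab, len);			/* fill with dummy data */

	fd = open("/tmp/output.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); free(buf); return 1; }

	if (write(fd, buf, len) != (ssize_t)len)
		perror("write");

	close(fd);
	free(buf);
	return 0;
}
```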
Kernel compile configuration settings
If you have the kernel sources for Intel® MPSS (Intel® Manycore Platform Software Stack) handy, the following kernel configuration options can be changed to enable/disable these features (the default state for each configuration option is in parentheses):
- mm/Kconfig:
- CONFIG_PRECOMPUTE_WAITQ_HEAD (disabled)
- CONFIG_PAGE_HIGH_ORDER_PAGE_ALLOC (enabled)
- CONFIG_PAGE_CACHE_DMA (enabled)
- arch/x86/Kconfig:
- CONFIG_VECTOR_MEMCPY (enabled)
Kernel command line options available in Intel® MPSS
In addition to the compile-time options described above, the following kernel command line options provide additional control to enable or disable the read and write optimizations at boot time.
- vfs_read_optimization: can be turned on or off. If not specified, it is off by default. If on, it enables read side optimizations for files in the tmpfs and ramfs file systems.
- vfs_write_optimization: can be turned on or off. If not specified, it is off by default. If on, it enables write side optimizations for files in the tmpfs and ramfs file systems.
As an example, to enable the read optimizations, add vfs_read_optimization to the ExtraCommandLine as follows (an illustrative configuration fragment is shown after these steps):
- Edit /etc/mpss/default.conf
- Append "vfs_read_optimization=on" to the ExtraCommandLine
- Restart the mpss service.
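The fragment below is illustrative only; the exact directive syntax may vary between Intel® MPSS releases, so check the default.conf shipped with your installation:

```text
# /etc/mpss/default.conf (illustrative excerpt)
# The quoted string is appended to the coprocessor kernel command line at boot.
ExtraCommandLine "vfs_read_optimization=on"
# vfs_write_optimization=on can be added to the same string to enable the write path.
```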
These optimizations are available beginning with Intel® MPSS version 3.2 (initially as an experimental feature).
Conclusions and Future work
File IO performance is critical for HPC applications because of their need to read and write large amounts of data to files. Additionally, application-level checkpointing is another very important aspect of HPC applications that uses files to store periodic checkpoints during execution. The performance of file IO system calls on the Intel® Xeon Phi™ coprocessor falls far below user expectations, mainly because these system calls run single-threaded on lower-frequency, simple in-order cores with smaller caches compared to Intel Xeon® processors. We investigated the key bottlenecks observed in this environment and implemented the key VFS optimizations described in this document. The optimizations result in significant improvements in the read and write performance of file IO system calls, which helps narrow the performance gap between Intel® Xeon Phi™ coprocessors and Intel Xeon® processors. A key point to remember is that these optimizations are generic and can be applied to any system running Linux; since the changes are made at the VFS layer, any underlying file system will directly benefit from them. As next steps, we plan to extend these optimizations to persistent file systems like NFS and Lustre*. The performance of tmpfs can be made to match that of ramfs by eliminating the overhead of maintaining the swap cache data structures when there is no support for swap devices, as is the case on the Intel® Xeon Phi™ coprocessor. Additionally, the higher order page allocation scheme can be enhanced by implementing a batched page cache insertion algorithm.
References
- [1] Shared memory virtual file system: http://www.makelinux.net/books/lvmm/understand015
- [2] tmpfs: A Virtual Memory File System: http://wiki.deimos.fr/images/1/1e/Solaris_tmpfs.pdf
- [3] ramfs: https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt
- [4] NFS: http://linux-nfs.org/wiki/index.php/Main_Page
- [5] Page cache: http://www.makelinux.net/books/lkd2/ch15.html
- [6] Linux Readahead: less tricks for more: http://landley.net/kdocs/ols/2007/ols2007v2-pages-273-284.pdf
- [7] Physical page allocation: https://www.kernel.org/doc/gorman/html/understand/understand009.html
- [8] IOzone File System Benchmark: http://www.iozone.org/
- [9] get_free_pages and friends: http://www.makelinux.net/ldd3/chp-8-sect-3
Performance Disclaimer
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit Performance Test Disclosure