File systems and block I/O (devb-*) drivers

Here are the basic steps for improving the performance of your file systems and block I/O (devb-*) drivers.

  1. Optimize the file system options:
    • Determine how you want to balance system robustness and performance.
    • Concentrate on the cache and vnode (file system-independent inodes) options; the other sizes scale themselves to these.
    • Set the commit option (either globally or as a mount option) to force or disable synchronous writes.
    • Consider using a RAM disk for temporary files (for example, /tmp).
  2. Optimize application code:
    • Read and write in large chunks (16–32 KB is optimal).
    • Read and write in multiples of a disk block on block boundaries (typically 512 bytes, but you can use stat() or statvfs() to determine the value at runtime).
    • Avoid standard I/O where possible; use open(), read(), and write() instead of fopen(), fread(), and fwrite(). The f* functions use an extra layer of buffering. The default buffer size is given by BUFSIZ; you can use setvbuf() to specify a different buffer size.

      As a BlackBerry 10 OS extension, you can use the STDIO_DEFAULT_BUFSIZE environment variable to override BUFSIZ as the default buffer size for stream I/O. The value of STDIO_DEFAULT_BUFSIZE must be greater than that of BUFSIZ.

    • Pregrow files, if you know their ultimate sizes.
    • Use direct I/O (DMA to user space).
    • Use filenames that are no longer than 16 characters. If you do this, the file system won't use the .inodes file, so there won't be any inter-block references. In addition, there will be one less disk write, and hence, one less chance of corruption if the power fails.

      Long filenames (that is, longer than 48 characters) especially slow down the file system.

    • Big directories are slower than small ones, because the file system uses a linear search.
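The advice above about large, aligned transfers and low-level POSIX calls can be sketched in C. This is a minimal illustration (copy_file() is a hypothetical name, not a system routine); the 32 KB chunk size follows the guideline above:

```c
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (32 * 1024)   /* large transfers mean fewer messages to the file system */

/* Copy src to dst using the low-level POSIX calls, bypassing stdio
   buffering; returns the number of bytes copied, or -1 on error. */
long copy_file(const char *src, const char *dst)
{
    char buf[CHUNK];
    long total = 0;
    ssize_t n;
    int in, out;

    if ((in = open(src, O_RDONLY)) == -1)
        return -1;
    if ((out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1) {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n) {
            total = -1;        /* short or failed write */
            break;
        }
        total += n;
    }
    close(in);
    close(out);
    return total;
}
```

On a real system, you'd also round the chunk size to a multiple of the sector size reported by statvfs(), so each transfer lands on a block boundary.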

Performance and robustness

When you design or configure a file system, you have to balance performance and robustness.

  • Robustness means tying a user operation to its physical implementation: the operation is carried out on the storage medium before a successful response is returned to the user.

    For example, the creation of a new file—via creat()—may perform all the physical disk writes that are necessary to add that new filename into a directory on the disk file system and only then reply back to the client.

  • Performance may decouple the actual implementation of the operation from the reply.

    For example, writing data into a file—via write()—might immediately reply to the client, but leave the data in a write-behind in-memory cache in an attempt to merge with later writes and construct a large, contiguous run for a single sequential disk access (but until that occurs, the data is vulnerable to loss if the power fails).

You must decide on the balance between robustness and performance that's appropriate for your installation, expectations, and requirements.

Metadata updates

Metadata is data about data, or all the overhead and attributes involved in storing the user data itself, such as the name of a file, the physical blocks it uses, modification and access timestamps, and so on. The most expensive operation in a file system is updating the metadata. This is because:

  • The metadata is typically located on different disk cylinders from the data and is even disjoint to itself (bitmaps, inodes, directory entries) and hence, incurs seek delays.
  • The metadata is usually written to the disk with more urgency than user data (because the metadata affects the integrity of the filesystem structure) and hence may incur a transfer delay.

Almost all operations on the filesystem (even reading file data, unless you've specified the noatime option—see io-blk.so in Utilities) involve some metadata updates.

Ordering the updates to metadata

Some file system operations affect multiple blocks on disk. For example, consider the situation of creating or deleting a file. Most file systems separate the name of the file (or link) from the actual attributes of the file (the inode); this supports the POSIX concept of hard links, multiple names for the same file. Typically, the inodes reside in a fixed location on disk.

Creating a new filename thus involves allocating a free inode entry and populating it with the details for the new file, and then placing the name for the file into the appropriate directory. Deleting a file involves removing the name from the parent directory and marking the inode as available.

These operations must be performed in this order to prevent corruption should a power failure occur between the two writes. Note that for creation, the inode should be allocated before the name: a crash then results in an allocated inode that isn't referenced by any name (an orphaned resource that a file system's check procedure can later reclaim). If the operations were performed the other way around and a power failure occurred, the result would be a name that refers to a stale or invalid inode, which is undetectable as an error. A similar argument applies, in reverse, for file deletion.

For traditional file systems, the only way of ordering these writes is to perform the first one (or, more generally, all but the last one of a multiple-block sequence) synchronously (that is, immediately and waiting for I/O to complete before continuing). A synchronous write is very expensive, because it involves a disk-head seek, interrupts any active sequential disk streaming, and blocks the thread until the write has completed—potentially milliseconds of dead time.

Throughput

Another key point is the performance of sequential access to a file, or raw throughput, where a large amount of data is written to a file (or an entire file is read). The file system itself can detect this type of sequential access and attempt to optimize the use of the disk, by doing:

  • Read-ahead on reads, so that the disk is being accessed for the predicted new data while the user processes the original data
  • Write-behind of writes to allow a large amount of dirty data to be coalesced into a single contiguous multiple-block write

The most efficient way of accessing the disk for high performance is through the standard POSIX routines that work with file descriptors—open(), read(), and write()—because these allow direct access to the file system with no interference from libc.

If you're concerned about performance, we don't recommend that you use the standard I/O (<stdio.h>) routines that work with FILE variables, because they introduce another layer of code and another layer of buffering. In particular, the default buffer size is BUFSIZ, or 1 KB, so all access to the disk is carved up into chunks of that size, causing a large amount of overhead for passing messages and switching contexts.

There are some cases when the standard I/O facilities are useful, such as when processing a text file one line or character at a time, in which case the 1 KB of buffering provided by standard I/O greatly reduces the number of messages to the file system. You can improve performance by using:

  • setvbuf() or the STDIO_DEFAULT_BUFSIZE environment variable to increase the buffering size
  • fileno() to access the underlying file descriptor directly and to bypass the buffering during performance-critical sections

You can also optimize performance by accessing the disk in suitably sized chunks (large enough to minimize the overheads of BlackBerry 10 OS's context-switching and message-passing, but not so large that you exceed disk-driver limits on blocks per operation or incur overhead from large message-passing); an optimal size is 32 KB.

You should also access the file on block boundaries for whole multiples of a disk sector (since the smallest unit of access to a disk/block device is a single sector, partial writes will require a read/modify/write cycle); you can get the optimal I/O size by calling statvfs(), although most disks are 512 bytes/sector.
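For the run-time query mentioned above, a minimal sketch (io_block_size() is an illustrative name):

```c
#include <sys/statvfs.h>

/* Return the file system's preferred I/O block size for the given path,
   falling back to the common 512-byte sector size on error. */
unsigned long io_block_size(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) == -1)
        return 512;             /* most disks are 512 bytes/sector */
    return vfs.f_bsize;
}
```

Sizing and aligning transfers to a multiple of this value avoids the read/modify/write cycle that partial-sector writes require.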

Finally, for very high performance situations (video streaming, and so on) it's possible to bypass all buffering in the file system and perform DMA directly between the user data areas and the disk. But note these caveats:

  • The disk and disk driver must support such access.
  • No coherency is offered between data transferred directly and any data in the file system buffer cache.
  • Some POSIX semantics (such as file access or modification time updates) are ignored.

We don't currently recommend that you use DMA unless absolutely necessary: not all disk drivers correctly support it, there's no facility to query a disk driver for the DMA-safe requirements of its interface, and naive users can get themselves into trouble!

In some situations, where you know the total size of the final data file, it can be advantageous to pregrow it to this size, rather than allow it to be automatically extended piecemeal by the file system as it is written to. This lets the file system see a single explicit request for allocation instead of many implicit incremental updates; some file systems may be able to exploit this and allocate the file in a more optimal/contiguous fashion. It also reduces the number of metadata updates needed during the write phase, and so, improves the data write performance by not disrupting sequential streaming.

The POSIX function to extend a file is ftruncate(); the standard requires this function to zero-fill the new data space, meaning that the file is effectively written twice, so this technique is suitable when you can prepare the file during an initial phase where performance isn't critical. There's also a non-POSIX devctl() command to extend a file without zero-filling it, which provides the above benefits without the cost of erasing the contents. This command, DCMD_FSYS_PREGROW_FILE, is defined in <sys/dcmd_blk.h> and takes the file size, as an off64_t, as its argument. For example:

int fd;
off64_t sz;

fd = open(...);
sz = ...;

if (devctl(fd, DCMD_FSYS_PREGROW_FILE, &sz, sizeof(sz), NULL) != EOK) {
    /* the file couldn't be pregrown */
}

Configuration

You can control the balance between performance and robustness on either a global or per-file basis.

  • Specifying the O_SYNC bit when opening a file causes all I/O operations on that file (both data and metadata) to be performed synchronously.

    The fsync() and sync() functions let you flush the file system write-behind cache on demand; otherwise, any dirty data is flushed from cache under the control of the global blk delwri= option (the default is two seconds—see io-blk.so in Utilities).

  • You control the global configuration with the commit= option, passed either to io-blk.so (to apply to all file systems) or to the mount command (to apply to a single instance of a mounted file system). The levels are none, low, medium, and high, which differ in the degree to which metadata is written synchronously versus asynchronously, or even time-delayed.

    At any level less robust than the default (that is, medium), the file system doesn't guarantee the same level of integrity following an unexpected power loss, because multiple-block updates might not be ordered correctly.

Block I/O commit level

This table illustrates how the commit= level affects the time it takes to create and delete a file on an x86 PIII-450 machine with a UDMA-2 EIDE disk, running a QNX 4 file system.

The table shows how many 0 KB files could be created and deleted per second:

Commit level    Number created    Number deleted
high            866               1221
medium          1030              2703
low             1211              2710
none            1407              2718

Note that at the commit=high level, all disk writes are synchronous, so there's a noticeable cost in updating the directory entries and the POSIX mtime on the parent directory. At the commit=none level, all disk writes are time-delayed in the write-behind cache, and so multiple files can be created/deleted in the in-memory block without requiring any physical disk access at all (so, of course, any power failure here means that those files will not exist when the system is restarted).

Record size

This example illustrates how the record size affects sequential file access on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 file system. The table lists the rates, in megabytes per second, of writing and reading a 256 MB file:

Record size    Writing    Reading
1 KB           14         16
2 KB           16         19
4 KB           17         24
8 KB           18         30
16 KB          18         35
32 KB          19         36
64 KB          18         36
128 KB         17         37

Note that the sequential read rate more than doubles when you use a suitable record size. This is because the overheads of context-switching and message-passing are reduced; consider that reading the 256 MB file 1 KB at a time requires 262,144 _IO_READ messages, whereas with 16 KB records, it requires only 16,384 such messages, or 1/16th of the non-negligible overhead.

Write performance doesn't show the same dramatic change, because the user data is, by default, placed in the write-behind buffer cache and written in large contiguous runs under timer control—using O_SYNC illustrates a difference. The limiting factor here is the periodic need for synchronous update of the bitmap and inode for block allocation as the file grows.

Double buffering

This example illustrates the effect of double-buffering in the standard I/O library on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 file system.

The table shows the rate, in megabytes per second, of writing and reading a 256 MB file, with a record size of 8 KB:

Scenario           Writing    Reading
File descriptor    18         31
Standard I/O       13         16
setvbuf()          17         30

Here, you can see the effect of the default standard I/O buffer size (BUFSIZ, or 1 KB). When you ask it to transfer 8 KB, the library implements the transfer as 8 separate 1 KB operations. Note how the standard I/O case matches the above benchmark (see Record size) for a 1 KB record, and the file-descriptor case matches the 8 KB scenario.

When you use setvbuf() or the STDIO_DEFAULT_BUFSIZE environment variable to force the standard I/O buffering up to the 8 KB record size, then the results come closer to the optimal file-descriptor case (the small difference is due to the extra code complexity and the additional memcpy() between the user data and the internal standard I/O FILE buffer).

File descriptor vs standard I/O

Here's another example that compares access using file descriptors and standard I/O on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 file system. The table lists the rates, in megabytes per second, for writing and reading a 256 MB file, using file descriptors and standard I/O:

Record size (bytes)    FD write    FD read    Stdio write    Stdio read
32                     1.5         1.7        10.9           12.7
64                     2.8         3.1        11.7           14.3
128                    5.0         5.6        12.0           15.1
256                    8.0         9.0        12.4           15.2
512                    10.8        12.9       13.2           16.0
1024                   14.1        16.9       13.1           16.3
2048                   16.1        20.6       13.2           16.5
4096                   17.1        24.0       13.9           16.5
8192                   18.3        31.4       14.0           16.4
16384                  18.1        37.3       14.3           16.4

Notice how the read() access is very sensitive to the record size; this is because each read() maps to an _IO_READ message and is basically a context-switch and message-pass to the file system; when only small amounts of data are transferred each time, the OS overhead becomes significant.

Since standard I/O access using fread() uses a 1 KB internal buffer, the number of _IO_READ messages remains constant, regardless of the user record size, and the throughput resembles that of the file-descriptor 1 KB access in all cases (with slight degradation at smaller record sizes due to the increased number of libc calls made). Thus, you should consider the anticipated file-access patterns when you choose from these I/O paradigms.

Pregrowing a file

This example illustrates the effect of pregrowing a data file on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 file system. The table shows the times, in milliseconds, required to create and write a 256 MB file in 8 KB records:

Scenario       Creation    Write    Total
write()        0           15073    15073 (15 seconds)
ftruncate()    13908       8510     22418 (22 seconds)
devctl()       55          8479     8534 (8.5 seconds)

Note how extending the file incrementally as a result of each write() call is slower than growing it with a single ftruncate() call, as the file system can allocate larger/contiguous data extents, and needs to update the inode metadata attributes only once. Note also how the time to overwrite already allocated data blocks is much less than that for allocating the blocks dynamically (the sequential writes aren't interrupted by the periodic need to synchronously update the bitmap).

Although the total time to pregrow and overwrite is worse than growing, the pregrowth could be performed during an initialization phase where speed isn't critical, allowing for better write performance later.

The optimal case is to pregrow the file without zero-filling it (using a devctl()) and then overwrite it with the real data contents.

Fine-tuning USB storage devices

If your environment hosts large (for example, media) files on USB storage devices, you should ensure that your configuration allows sufficient RAM for read-ahead processing of large files, such as MP3 files. You can change the configuration by adjusting the cache and vnode values that devb-umass passes to io-blk.so with the blk option.

A reasonable starting configuration for the blk option is: cache=512k,vnode=256. You should, however, establish benchmarks for key activities in your environment, and then adjust these values for optimal performance.
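For example, a driver invocation along these lines (the exact command line is hypothetical; check the devb-umass documentation for your release):

```shell
# Start the USB mass-storage driver with a larger io-blk.so cache
# and more vnodes, for read-ahead on large media files.
devb-umass blk cache=512k,vnode=256 &
```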

Last modified: 2014-11-17


