Linux System Overview – File System

Linux File System Basics

Ext3, the successor of Ext2, is the standard file system for Linux: it is robust, fast and suitable for all fields of use. The main difference between the two is that Ext3 keeps a journal that records pending operations so that it can recover quickly after a system crash. The journal keeps the file system consistent and cuts the time needed to check it after a crash from several hours to a few seconds, because instead of checking the entire disk the system only has to check the areas that the journal lists as having pending operations.

Like all decent Unix file systems, Ext3 uses three general data structures: directories, inodes and data blocks. Directories contain only file names and the inode numbers assigned to them. Each file has one inode, which holds the list of disk blocks that make up the file. The list is needed because a file’s content is normally not stored in contiguous blocks: files are constantly added and deleted and their sizes change, so free space becomes scattered (external fragmentation). If a file’s content is scattered, reading it takes longer because the disk head has to seek more often.
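
To see the name-to-inode mapping on a live system, here is a minimal Python sketch that lists a directory's entries together with their inode numbers. It uses only the standard library (os.scandir and os.lstat); the helper name list_inodes is made up for this example and is not part of any Ext3 tooling.

```python
import os


def list_inodes(directory):
    """Return {file name: inode number} for the entries of `directory`."""
    with os.scandir(directory) as entries:
        return {entry.name: entry.inode() for entry in entries}


if __name__ == "__main__":
    target = "."
    for name, inode in sorted(list_inodes(target).items()):
        # st_blocks counts the 512-byte units allocated to the file (POSIX systems).
        blocks = os.lstat(os.path.join(target, name)).st_blocks
        print(f"{inode:>12}  {blocks:>8}  {name}")
```

The directory itself only supplies the name and the inode number; everything else (size, allocated blocks, timestamps) comes from the inode, which is why the sketch needs a separate lstat call for the block count.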

http://www.heise-online.co.uk/images/110398/0/1

Under the hood, each disk block can span multiple disk sectors, and each sector is 512 bytes. The sector is the smallest addressable unit on a hard disk. Ext3 uses block sizes of 1024, 2048 or 4096 bytes. In theory Ext3 supports block sizes up to 64 KB, but on x86 and x64 architectures 4 KB is the maximum: this block size matches the size of the kernel’s memory pages in RAM, which makes paging easier for the operating system. Ext3 uses 32-bit values (4 bytes, the size of an int in Java) for block numbers, which means it can address only about four billion blocks: 4 TB at a block size of 1024 bytes, 16 TB at 4096 bytes. So a larger block size lets you create a larger file system. On the other hand, large blocks can waste a lot of disk space because a file always occupies whole blocks even if it contains only a few bytes. On average every file wastes half a block, and the larger the blocks and the smaller the files, the more noticeable the effect. This effect is called internal fragmentation.
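
The two sides of the block-size trade-off can be checked with a little arithmetic. The Python sketch below is only an illustration of the numbers quoted above; the function names are invented for this example.

```python
BLOCK_NUMBER_BITS = 32  # Ext3 block numbers are 32-bit values


def max_filesystem_bytes(block_size):
    """Largest volume addressable with 32-bit block numbers."""
    return (2 ** BLOCK_NUMBER_BITS) * block_size


def allocated_bytes(file_size, block_size):
    """Disk space a file occupies once rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


if __name__ == "__main__":
    for block_size in (1024, 2048, 4096):
        limit_tib = max_filesystem_bytes(block_size) / 2 ** 40
        wasted = allocated_bytes(100, block_size) - 100
        print(f"block size {block_size:>4} B: limit {limit_tib:.0f} TiB, "
              f"a 100-byte file wastes {wasted} B")
```

A 100-byte file is an arbitrary example; any file whose size is not a multiple of the block size leaves the tail of its last block unused.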

Optimization with sacrifice

An efficient file system needs to find the data belonging to a file name quickly. For example, for the file name “abc.txt”, the OS has to walk the directory entries along the path (how many depends on the depth of the folder hierarchy) until it locates the file’s inode, and then follow the data block pointers to retrieve the content. To speed this up, Ext3 writes the inodes into static tables on the disk during formatting. One consequence is that the number of inodes can’t be altered after the file system has been set up. As every file needs its own inode, there can’t be more files than inodes, so the design does not scale to very large numbers of files. By default, mke2fs creates one inode for every 4 KB in file systems up to 512 MB, and one inode for every 8 KB otherwise. You can tune this ratio to get more inodes, but only at setup time (it is not dynamic). Each inode itself takes up 128 bytes.
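
As a rough estimate of what a given bytes-per-inode ratio means in practice, the Python sketch below computes the inode count and the space the inode tables take. It assumes the simple division described above and the 128-byte inode size; the real mke2fs rounds per block group, so treat the results as approximations. The function names are hypothetical.

```python
INODE_SIZE = 128  # bytes per inode, as stated in the text


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes created at format time."""
    return fs_bytes // bytes_per_inode


def inode_table_bytes(fs_bytes, bytes_per_inode):
    """Approximate disk space reserved for the static inode tables."""
    return inode_count(fs_bytes, bytes_per_inode) * INODE_SIZE


if __name__ == "__main__":
    one_gib = 2 ** 30
    for ratio in (4096, 8192):
        count = inode_count(one_gib, ratio)
        table_mib = inode_table_bytes(one_gib, ratio) / 2 ** 20
        print(f"1 GiB, one inode per {ratio} B: "
              f"{count} inodes, {table_mib:.0f} MiB of inode tables")
```

The ratio is what the -i option of mke2fs sets at format time, and df -i shows how many inodes an existing file system has left.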

Handling large files

How is it possible to fit the millions of data block numbers required for gigabyte-sized files into a static data structure of 128 bytes? It isn’t: one Ext3 inode stores exactly 15 block numbers. The first twelve point directly to data blocks. Number 13 points to a block that contains block numbers (single indirect), number 14 to a block pointing to blocks with block numbers (double indirect), and number 15 to triple indirect blocks. Therefore, at a block size of 4 KB (that is, 1024 block numbers of 4 bytes each per indirect block), one inode can address 12 + 1024 + 1024^2 + 1024^3 blocks, around a billion block numbers. The resulting maximum file size under this scheme is just over 4 TB.
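
The sum above is easy to reproduce for other block sizes. This Python sketch simply evaluates the direct, single, double and triple indirect terms; it models only the pointer scheme described here, not any other limit the kernel may impose, and the function names are made up for illustration.

```python
POINTER_SIZE = 4       # bytes per 32-bit block number
DIRECT_POINTERS = 12   # block numbers stored directly in the inode


def max_file_blocks(block_size):
    """Blocks reachable via 12 direct + single/double/triple indirect pointers."""
    per_block = block_size // POINTER_SIZE  # block numbers per indirect block
    return DIRECT_POINTERS + per_block + per_block ** 2 + per_block ** 3


def max_file_bytes(block_size):
    """Largest file size the pointer scheme can address."""
    return max_file_blocks(block_size) * block_size


if __name__ == "__main__":
    for block_size in (1024, 2048, 4096):
        size_gib = max_file_bytes(block_size) / 2 ** 30
        print(f"block size {block_size:>4} B: "
              f"{max_file_blocks(block_size)} blocks, {size_gib:.1f} GiB")
```

Because the triple indirect term grows with the cube of the pointers per block, the block size has a much stronger effect on the maximum file size than on the maximum file system size.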

Power of B-Tree Indexing

Now you know how an inode uses hierarchical pointers to handle large files. But if a directory holds a huge number of files, how can the file system locate the right directory entry, and with it the inode, efficiently? Ext2 originally stored the file names within a directory as a linked list. While this is an elegantly simple data structure, it has the disadvantage that operations take longer and longer as the number of entries grows. Ext3 can manage directories in a B-tree structure if the dir_index feature is set (it is not the default). This drastically speeds up directory operations. Performance only drops off when directories are filled with hundreds of thousands of files. This is usually a caching effect: the Linux kernel doesn't use unlimited memory for caching directory structures, even if you add more memory.
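
To get a feel for the difference, the Python sketch below counts the steps needed to find one name in a directory of 100,000 entries, first by scanning a list and then by searching a sorted index. The binary search is only a stand-in for a tree index; it is not how Ext3 lays out its directory index on disk, and all names here are invented for the example.

```python
def linear_lookup(entries, name):
    """Scan (name, inode) pairs in order; return (inode, steps taken)."""
    steps = 0
    for entry_name, inode in entries:
        steps += 1
        if entry_name == name:
            return inode, steps
    return None, steps


def indexed_lookup(sorted_entries, name):
    """Binary search over entries sorted by name; return (inode, steps taken)."""
    lo, hi, steps = 0, len(sorted_entries), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        entry_name, inode = sorted_entries[mid]
        if entry_name == name:
            return inode, steps
        if entry_name < name:
            lo = mid + 1
        else:
            hi = mid
    return None, steps


if __name__ == "__main__":
    # Zero-padded names, so list order and sorted order are the same.
    entries = [(f"file{i:06d}", 1000 + i) for i in range(100_000)]
    print("linear :", linear_lookup(entries, "file099999"))
    print("indexed:", indexed_lookup(entries, "file099999"))
```

The linear scan touches every entry in the worst case, while the indexed search needs a number of steps that grows only with the logarithm of the directory size.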

$ sudo tune2fs -O dir_index /dev/hda1

The command above needs root privileges, hence sudo. Existing directories are not indexed automatically; running e2fsck -fD on the unmounted file system rebuilds them with an index. Do note that the index takes up more disk space, but hard disk space is not too expensive nowadays. If you don't want to change the OS default setting but still want to store a large number of files, you can restructure the directory tree so that no single directory contains that many files. Otherwise, on a default (untuned) Ext3 partition, each subsequent write degrades horribly once a directory holds more than about 2000 files. So keeping each directory to within 2000 files should be fine. If you want to go this route, here are some approaches to restructuring your folders:

  1. Date based - YEAR > MONTH > DATE > HOUR, if your files are uniformly distributed over time.
  2. Hash based - break the hash down into several parts and use them as folder names.
  3. Id based - reverse the id and break it down into groups of 2 digits, so that the files are uniformly distributed.

NOTE: I don't want to use random numbers here because I want to be able to locate the file from its metadata later.
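
The three layouts can be sketched as small Python helpers that turn a file's metadata into a nested path. This is an illustration under assumptions: the function names, the /data root, the choice of SHA-256 and the two-level depth are all mine, not something the approaches above prescribe.

```python
import hashlib
from datetime import datetime
from pathlib import PurePosixPath


def date_path(root, when, filename):
    """YEAR > MONTH > DATE > HOUR layout."""
    return PurePosixPath(root, f"{when:%Y}", f"{when:%m}",
                         f"{when:%d}", f"{when:%H}", filename)


def hash_path(root, key, filename, levels=2, width=2):
    """Use the leading hex digits of a hash of `key` as folder names."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


def id_path(root, numeric_id, filename, levels=2):
    """Reverse the zero-padded id and use 2 digits per folder level."""
    digits = str(numeric_id).zfill(levels * 2)[::-1]
    parts = [digits[i * 2:(i + 1) * 2] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


if __name__ == "__main__":
    print(date_path("/data", datetime(2009, 4, 23, 14, 5), "abc.txt"))
    print(hash_path("/data", "abc.txt", "abc.txt"))
    print(id_path("/data", 123456, "abc.txt"))
```

All three functions are deterministic, so the same metadata always leads back to the same folder. Reversing the id puts its fastest-changing digits first, which spreads consecutive ids across different folders; with two levels of two decimal digits there are 10,000 leaf folders.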

Alternative solution for large number of files in a directory

ReiserFS can handle up to 2^31 files per directory (that's 2 billion), with a maximum of 2^32 (4 billion) files on the whole file system. It can handle up to 64000 subdirectories in a directory. Ext3 is limited to 32000 subdirectories per directory. On Ext3 the maximum number of files per directory is theoretically unlimited (actually around 130 trillion), but performance becomes terrible above 10-15 thousand files. The maximum number of files on the whole file system is limited by the number of inodes you have. With a 1 GB file system and one inode per 4 KB, you have around 260000 inodes, and that is also the maximum number of files you can have.

