| .. SPDX-License-Identifier: GPL-2.0 |
| |
| Directory Entries |
| ----------------- |
| |
| In an ext4 filesystem, a directory is more or less a flat file that maps |
| an arbitrary byte string (usually ASCII) to an inode number on the |
| filesystem. There can be many directory entries across the filesystem |
| that reference the same inode number; these are known as hard links. |
| Because a directory entry stores only an inode number, which has meaning |
| only within its own filesystem, hard links cannot reference files on |
| other filesystems. A directory entry is found by reading the data |
| block(s) associated with a directory file and searching them for the |
| desired name. |
| |
| Linear (Classic) Directories |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| By default, each directory lists its entries in an “almost-linear” |
| array. I write “almost” because it is not a linear array in the memory |
| sense: directory entries are never split across filesystem blocks. |
| Therefore, it is more accurate to say that a directory is a series of |
| data blocks and that each block contains a linear array of directory |
| entries. The end of each per-block array is signified by reaching the |
| end of the block; the last entry in the block has a record length that |
| takes it all the way to the end of the block. The end of the entire |
| directory is of course signified by reaching the end of the file. Unused |
| directory entries are signified by inode = 0. By default the filesystem |
| uses ``struct ext4_dir_entry_2`` for directory entries unless the |
| “filetype” feature flag is not set, in which case it uses |
| ``struct ext4_dir_entry``. |
| |
| The original directory entry format is ``struct ext4_dir_entry``, which |
| is at most 263 bytes long, though on disk you'll need to reference |
| ``dirent.rec_len`` to know for sure. |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Size |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - inode |
| - Number of the inode that this directory entry points to. |
| * - 0x4 |
| - \_\_le16 |
| - rec\_len |
| - Length of this directory entry. Must be a multiple of 4. |
| * - 0x6 |
| - \_\_le16 |
| - name\_len |
| - Length of the file name. |
| * - 0x8 |
| - char |
| - name[EXT4\_NAME\_LEN] |
| - File name. |
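As a rough illustration of the size arithmetic above, the smallest
``rec_len`` that can hold a given name is the 8-byte fixed part of the
entry plus the name, rounded up to a multiple of 4. The sketch below is
written in Python rather than kernel C, and ``min_rec_len`` is a
hypothetical helper name, not a kernel function. It assumes nothing else
is stored in the entry beyond the fields in the table.

```python
def min_rec_len(name_len):
    """Return the smallest valid rec_len for a name of name_len bytes.

    Assumption: the fixed part of the entry (inode, rec_len, name_len)
    is 8 bytes, and rec_len is rounded up to a multiple of 4.
    """
    if not 0 <= name_len <= 255:
        raise ValueError("name_len must be between 0 and 255")
    fixed_part = 8
    # Add 3 and clear the low two bits to round up to a multiple of 4.
    return (fixed_part + name_len + 3) & ~3
```

With this helper, a 1-byte name needs 12 bytes, and a 255-byte name
needs 264 bytes on disk: the 263 bytes mentioned above plus one byte of
padding.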
| |
| Since file names cannot be longer than 255 bytes, the new directory |
| entry format shortens the name\_len field and uses the space for a file |
| type flag, probably to avoid having to load every inode during directory |
| tree traversal. This format is ``struct ext4_dir_entry_2``, which is at |
| most 263 bytes long, though on disk you'll need to reference |
| ``dirent.rec_len`` to know for sure. |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Size |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - inode |
| - Number of the inode that this directory entry points to. |
| * - 0x4 |
| - \_\_le16 |
| - rec\_len |
| - Length of this directory entry. |
| * - 0x6 |
| - \_\_u8 |
| - name\_len |
| - Length of the file name. |
| * - 0x7 |
| - \_\_u8 |
| - file\_type |
| - File type code, see ftype_ table below. |
| * - 0x8 |
| - char |
| - name[EXT4\_NAME\_LEN] |
| - File name. |
| |
| .. _ftype: |
| |
| The directory file type is one of the following values: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 0x0 |
| - Unknown. |
| * - 0x1 |
| - Regular file. |
| * - 0x2 |
| - Directory. |
| * - 0x3 |
| - Character device file. |
| * - 0x4 |
| - Block device file. |
| * - 0x5 |
| - FIFO. |
| * - 0x6 |
| - Socket. |
| * - 0x7 |
| - Symbolic link. |
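The linear layout described above can be walked by following
``rec_len`` from entry to entry and skipping entries whose inode is 0.
The following is a minimal sketch in Python, not the kernel's
implementation. It assumes the ``struct ext4_dir_entry_2`` layout, a
block small enough that ``rec_len`` is stored literally, and a block
that has already been read into memory. ``walk_dir_block`` and
``make_entry`` are hypothetical names used only for illustration.

```python
import struct

# File type codes from the table above.
FILE_TYPES = {
    0x0: "unknown",
    0x1: "regular file",
    0x2: "directory",
    0x3: "character device",
    0x4: "block device",
    0x5: "FIFO",
    0x6: "socket",
    0x7: "symbolic link",
}

# Little-endian: inode (u32), rec_len (u16), name_len (u8), file_type (u8).
DIRENT_HEADER = struct.Struct("<IHBB")


def walk_dir_block(block):
    """Yield (inode, name, file_type) for every in-use entry in one block."""
    offset = 0
    while offset < len(block):
        if offset + DIRENT_HEADER.size > len(block):
            raise ValueError(f"truncated entry at offset {offset}")
        inode, rec_len, name_len, file_type = DIRENT_HEADER.unpack_from(
            block, offset)
        # rec_len must cover the header, be a multiple of 4, and stay
        # inside the block, since entries never cross block boundaries.
        if rec_len < 8 or rec_len % 4 or offset + rec_len > len(block):
            raise ValueError(f"bad rec_len {rec_len} at offset {offset}")
        if inode != 0:  # inode 0 marks an unused entry
            if DIRENT_HEADER.size + name_len > rec_len:
                raise ValueError(f"name overruns entry at offset {offset}")
            start = offset + DIRENT_HEADER.size
            name = bytes(block[start:start + name_len])
            yield inode, name, FILE_TYPES.get(file_type, "unknown")
        offset += rec_len


def make_entry(inode, name, file_type, rec_len):
    """Hypothetical helper: build one entry padded with zeros to rec_len."""
    header = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type)
    return (header + name).ljust(rec_len, b"\0")
```

For example, a 1024-byte block built from ``make_entry(2, b".", 2, 12)``,
``make_entry(2, b"..", 2, 12)``, an unused 16-byte entry, and a final
entry whose ``rec_len`` is 984 yields three names; the last ``rec_len``
carries the walk exactly to the end of the block.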
| |
| In order to add checksums to these classic directory blocks, a phony |
| ``struct ext4_dir_entry`` is placed at the end of each leaf block to |
| hold the checksum. The directory entry is 12 bytes long. The inode |
| number and name\_len fields are set to zero to fool old software into |
| ignoring an apparently empty directory entry, and the checksum is stored |
| in the place where the name normally goes. The structure is |
| ``struct ext4_dir_entry_tail``: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Size |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - det\_reserved\_zero1 |
| - Inode number, which must be zero. |
| * - 0x4 |
| - \_\_le16 |
| - det\_rec\_len |
| - Length of this directory entry, which must be 12. |
| * - 0x6 |
| - \_\_u8 |
| - det\_reserved\_zero2 |
| - Length of the file name, which must be zero. |
| * - 0x7 |
| - \_\_u8 |
| - det\_reserved\_ft |
| - File type, which must be 0xDE. |
| * - 0x8 |
| - \_\_le32 |
| - det\_checksum |
| - Directory leaf block checksum. |
| |
| The leaf directory block checksum is calculated against the FS UUID, the |
| directory's inode number, the directory's inode generation number, and |
| the entire directory entry block up to (but not including) the fake |
| directory entry. |
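A reader can recognize the checksum tail by checking the fixed values
listed in the table. The sketch below, in Python, only locates the tail
and extracts the stored checksum; it does not compute the checksum
itself. ``read_dirent_tail`` and ``checksummed_region`` are hypothetical
helper names, and the block is assumed to be fully loaded in memory.

```python
import struct

# Little-endian: det_reserved_zero1 (u32), det_rec_len (u16),
# det_reserved_zero2 (u8), det_reserved_ft (u8), det_checksum (u32).
DIRENT_TAIL = struct.Struct("<IHBBI")


def read_dirent_tail(block):
    """Return det_checksum if the block ends in a checksum tail, else None."""
    if len(block) < DIRENT_TAIL.size:
        return None
    zero1, rec_len, zero2, file_type, checksum = DIRENT_TAIL.unpack_from(
        block, len(block) - DIRENT_TAIL.size)
    if zero1 == 0 and rec_len == 12 and zero2 == 0 and file_type == 0xDE:
        return checksum
    return None


def checksummed_region(block):
    """Return the part of the block that precedes the 12-byte fake entry."""
    return bytes(block[:len(block) - DIRENT_TAIL.size])
```

A block without the tail (for example, one from a filesystem without
metadata checksums) simply returns ``None``.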
| |
| Hash Tree Directories |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| A linear array of directory entries isn't great for performance, so a |
| new feature was added to ext3 to provide a faster (but peculiar) |
| balanced tree keyed off a hash of the directory entry name. If the |
| EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a |
| hashed btree (htree) to organize and find directory entries. For |
| backwards read-only compatibility with ext2, this tree is actually |
| hidden inside the directory file, masquerading as “empty” directory data |
| blocks! It was stated previously that unused directory entries are |
| signified by inode = 0; this is (ab)used to fool the old linear-scan |
| algorithm into thinking that the rest of the directory block is empty |
| so that it moves on. |
| |
| The root of the tree always lives in the first data block of the |
| directory. By ext2 custom, the '.' and '..' entries must appear at the |
| beginning of this first block, so they are put here as two |
| ``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of |
| the root node contains metadata about the tree and finally a hash->block |
| map to find nodes that are lower in the htree. If |
| ``dx_root.info.indirect_levels`` is non-zero then the htree has two |
| levels; the data block pointed to by the root node's map is an interior |
| node, which is indexed by a minor hash. Interior nodes in this tree |
| contain a zeroed out ``struct ext4_dir_entry_2`` followed by a |
| minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear |
| array of all ``struct ext4_dir_entry_2``; all of these entries |
| (presumably) hash to the same value. If there is an overflow, the |
| entries simply overflow into the next leaf node, and the |
| least-significant bit of the hash (in the interior node map) that gets |
| us to this next leaf node is set. |
| |
| To traverse the directory as an htree, the code calculates the hash of |
| the desired file name and uses it to find the corresponding block |
| number. If the tree is flat, the block is a linear array of directory |
| entries that can be searched; otherwise, the minor hash of the file name |
| is computed and used against this second block to find the corresponding |
| third block number. That third block will be a linear array of |
| directory entries. |
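The "use the hash to find the block" step can be sketched as a search
over a sorted hash->block map. The Python sketch below assumes the map
is already in memory as a list of ``(hash, block)`` pairs sorted by
hash, that the first pair covers the lowest hash, and that the matching
pair is the last one whose hash does not exceed the target. It takes the
hash as an input rather than computing it, and it ignores the overflow
bit described earlier. ``dx_lookup`` is a hypothetical name.

```python
from bisect import bisect_right


def dx_lookup(entries, name_hash):
    """Return the block number for name_hash.

    entries is a list of (hash, block) pairs sorted by hash; the pair
    chosen is the last one whose hash is less than or equal to name_hash.
    """
    hashes = [entry_hash for entry_hash, _ in entries]
    index = bisect_right(hashes, name_hash) - 1
    if index < 0:
        raise ValueError("hash is below the first entry in the map")
    return entries[index][1]
```

In a two-level tree the same search would be applied twice: once against
the root's map to find the interior node, and once against that node's
map to find the leaf block.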
| |
| To traverse the directory as a linear array (such as the old code does), |
| the code simply reads every data block in the directory. The blocks used |
| for the htree will appear to have no entries (aside from '.' and '..') |
| and so only the leaf nodes will appear to have any interesting content. |
| |
| The root of the htree is in ``struct dx_root``, which is the full length |
| of a data block: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - dot.inode |
| - inode number of this directory. |
| * - 0x4 |
| - \_\_le16 |
| - dot.rec\_len |
| - Length of this record, 12. |
| * - 0x6 |
| - u8 |
| - dot.name\_len |
| - Length of the name, 1. |
| * - 0x7 |
| - u8 |
| - dot.file\_type |
| - File type of this entry, 0x2 (directory) (if the feature flag is set). |
| * - 0x8 |
| - char |
| - dot.name[4] |
| - “.\\0\\0\\0” |
| * - 0xC |
| - \_\_le32 |
| - dotdot.inode |
| - inode number of parent directory. |
| * - 0x10 |
| - \_\_le16 |
| - dotdot.rec\_len |
| - block\_size - 12. The record length is long enough to cover all htree |
| data. |
| * - 0x12 |
| - u8 |
| - dotdot.name\_len |
| - Length of the name, 2. |
| * - 0x13 |
| - u8 |
| - dotdot.file\_type |
| - File type of this entry, 0x2 (directory) (if the feature flag is set). |
| * - 0x14 |
| - char |
| - dotdot.name[4] |
| - “..\\0\\0” |
| * - 0x18 |
| - \_\_le32 |
| - struct dx\_root\_info.reserved\_zero |
| - Zero. |
| * - 0x1C |
| - u8 |
| - struct dx\_root\_info.hash\_version |
| - Hash type, see dirhash_ table below. |
| * - 0x1D |
| - u8 |
| - struct dx\_root\_info.info\_length |
| - Length of the tree information, 0x8. |
| * - 0x1E |
| - u8 |
| - struct dx\_root\_info.indirect\_levels |
| - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR |
| feature is set; cannot be larger than 2 otherwise. |
| * - 0x1F |
| - u8 |
| - struct dx\_root\_info.unused\_flags |
| - |
| * - 0x20 |
| - \_\_le16 |
| - limit |
| - Maximum number of dx\_entries that can follow this header, plus 1 for |
| the header itself. |
| * - 0x22 |
| - \_\_le16 |
| - count |
| - Actual number of dx\_entries that follow this header, plus 1 for the |
| header itself. |
| * - 0x24 |
| - \_\_le32 |
| - block |
| - The block number (within the directory file) that goes with hash=0. |
| * - 0x28 |
| - struct dx\_entry |
| - entries[0] |
| - As many 8-byte ``struct dx_entry`` as fit in the rest of the data block. |
| |
| .. _dirhash: |
| |
| The directory hash is one of the following values: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 0x0 |
| - Legacy. |
| * - 0x1 |
| - Half MD4. |
| * - 0x2 |
| - Tea. |
| * - 0x3 |
| - Legacy, unsigned. |
| * - 0x4 |
| - Half MD4, unsigned. |
| * - 0x5 |
| - Tea, unsigned. |
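To make the ``struct dx_root`` offsets concrete, the following Python
sketch decodes the fixed header fields from a root block that is already
in memory and checks the constants the table calls for. It is an
illustration under the layout listed above, not the kernel's parser, and
``parse_dx_root`` is a hypothetical name.

```python
import struct

# Hash type codes from the table above.
HASH_VERSIONS = {
    0x0: "legacy",
    0x1: "half MD4",
    0x2: "TEA",
    0x3: "legacy, unsigned",
    0x4: "half MD4, unsigned",
    0x5: "TEA, unsigned",
}


def parse_dx_root(block):
    """Decode the fixed-size header fields of a dx_root block."""
    dot_inode, dot_rec_len = struct.unpack_from("<IH", block, 0x0)
    dotdot_inode, dotdot_rec_len = struct.unpack_from("<IH", block, 0xC)
    reserved, hash_version, info_length, indirect_levels, _unused = (
        struct.unpack_from("<IBBBB", block, 0x18))
    limit, count, first_block = struct.unpack_from("<HHI", block, 0x20)

    # Sanity checks taken from the values listed in the table.
    if dot_rec_len != 12 or dotdot_rec_len != len(block) - 12:
        raise ValueError("unexpected '.' or '..' record length")
    if reserved != 0 or info_length != 8:
        raise ValueError("unexpected dx_root_info contents")

    return {
        "dot_inode": dot_inode,
        "dotdot_inode": dotdot_inode,
        "hash_version": HASH_VERSIONS.get(hash_version, "unrecognized"),
        "indirect_levels": indirect_levels,
        "limit": limit,
        "count": count,
        "block": first_block,
    }
```

A caller would typically look at ``indirect_levels`` next to decide
whether the blocks named by the root map are leaves or interior nodes.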
| |
| Interior nodes of an htree are recorded as ``struct dx_node``, which is |
| also the full length of a data block: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - fake.inode |
| - Zero, to make it look like this entry is not in use. |
| * - 0x4 |
| - \_\_le16 |
| - fake.rec\_len |
| - The size of the block, in order to hide all of the dx\_node data. |
| * - 0x6 |
| - u8 |
| - fake.name\_len |
| - Zero. There is no name for this “unused” directory entry. |
| * - 0x7 |
| - u8 |
| - fake.file\_type |
| - Zero. There is no file type for this “unused” directory entry. |
| * - 0x8 |
| - \_\_le16 |
| - limit |
| - Maximum number of dx\_entries that can follow this header, plus 1 for |
| the header itself. |
| * - 0xA |
| - \_\_le16 |
| - count |
| - Actual number of dx\_entries that follow this header, plus 1 for the |
| header itself. |
| * - 0xC |
| - \_\_le32 |
| - block |
| - The block number (within the directory file) that goes with the lowest |
| hash value of this block. This value is stored in the parent block. |
| * - 0x10 |
| - struct dx\_entry |
| - entries[0] |
| - As many 8-byte ``struct dx_entry`` as fit in the rest of the data block. |
| |
| The hash maps that exist in both ``struct dx_root`` and |
| ``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes |
| long: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_le32 |
| - hash |
| - Hash code. |
| * - 0x4 |
| - \_\_le32 |
| - block |
| - Block number (within the directory file, not filesystem blocks) of the |
| next node in the htree. |
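The hash map can be read the same way from either node type, because in
both cases ``limit``, ``count`` and ``block`` are followed directly by
the ``struct dx_entry`` array. The Python sketch below assumes those
three fields together occupy one 8-byte slot, as they do at 0x20 to 0x27
in ``struct dx_root``, so the array starts 8 bytes after ``limit``.
``read_dx_entries`` and its parameters are hypothetical names; pass 0x20
as ``header_offset`` for a root block and 0x8 for an interior node.

```python
import struct


def read_dx_entries(block, header_offset, first_hash=0):
    """Return the hash->block map of an htree node as (hash, block) pairs.

    The first pair comes from the header's own 'block' field. Its hash is
    not stored in this node, so the caller supplies it (0 for the root).
    """
    limit, count, first_block = struct.unpack_from(
        "<HHI", block, header_offset)
    # count and limit both include the header slot itself.
    if not 1 <= count <= limit:
        raise ValueError(f"bad count {count} for limit {limit}")
    if header_offset + 8 * count > len(block):
        raise ValueError("entries run past the end of the block")

    entries = [(first_hash, first_block)]
    offset = header_offset + 8
    for _ in range(count - 1):
        # Each dx_entry is a little-endian (hash, block) pair.
        entries.append(struct.unpack_from("<II", block, offset))
        offset += 8
    return entries
```

The returned list is sorted by hash on a consistent filesystem, so it
can be searched directly for the last entry whose hash does not exceed
the target.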
| |
| (If you think this is all quite clever and peculiar, so does the |
| author.) |
| |
| If metadata checksums are enabled, the last 8 bytes of the directory |
| block (precisely the length of one dx\_entry) are used to store a |
| ``struct dx_tail``, which contains the checksum. The ``limit`` and |
| ``count`` entries in the dx\_root/dx\_node structures are adjusted as |
| necessary to fit the dx\_tail into the block. If there is no space for |
| the dx\_tail, the user is notified to run ``e2fsck -D`` to rebuild the |
| directory index (which will ensure that there's space for the |
| checksum). The dx\_tail structure is 8 bytes long and looks like this: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - u32 |
| - dt\_reserved |
| - Zero. |
| * - 0x4 |
| - \_\_le32 |
| - dt\_checksum |
| - Checksum of the htree directory block. |
| |
| The checksum is calculated against the FS UUID, the htree index header |
| (dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in |
| use, and the tail structure (dx\_tail). |
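The effect of the dx\_tail on ``limit`` is simple arithmetic: the tail
takes the place of one 8-byte slot at the end of the block. The Python
sketch below works this out and reads the tail from a block in memory.
``dx_limit`` and ``read_dx_tail`` are hypothetical names, and
``header_offset`` is the offset of the ``limit`` field (0x20 in
``struct dx_root``, 0x8 in ``struct dx_node``). It does not compute or
verify the checksum.

```python
import struct

DX_ENTRY_SIZE = 8
DX_TAIL_SIZE = 8


def dx_limit(block_size, header_offset, metadata_csum):
    """Number of 8-byte slots available, counting the header slot itself."""
    space = block_size - header_offset
    if metadata_csum:
        # The last 8 bytes of the block are reserved for struct dx_tail.
        space -= DX_TAIL_SIZE
    return space // DX_ENTRY_SIZE


def read_dx_tail(block):
    """Return (dt_reserved, dt_checksum) from the last 8 bytes of a block."""
    return struct.unpack_from("<II", block, len(block) - DX_TAIL_SIZE)
```

For a 4096-byte block this gives a root ``limit`` of 508 without
checksums and 507 with them, and an interior-node ``limit`` of 511 and
510 respectively.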