|  | XFS Self Describing Metadata | 
|  | ---------------------------- | 
|  |  | 
|  | Introduction | 
|  | ------------ | 
|  |  | 
|  | The largest scalability problem facing XFS is not one of algorithmic | 
|  | scalability, but of verification of the filesystem structure. Scalability of the | 
|  | structures and indexes on disk and the algorithms for iterating them are | 
|  | adequate for supporting PB scale filesystems with billions of inodes; however, | 
|  | it is this very scalability that causes the verification problem. | 
|  |  | 
|  | Almost all metadata on XFS is dynamically allocated. The only fixed location | 
|  | metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all | 
|  | other metadata structures need to be discovered by walking the filesystem | 
|  | structure in different ways. While this is already done by userspace tools for | 
|  | validating and repairing the structure, there are limits to what they can | 
|  | verify, and this in turn limits the supportable size of an XFS filesystem. | 
|  |  | 
|  | For example, it is entirely possible to manually use xfs_db and a bit of | 
|  | scripting to analyse the structure of a 100TB filesystem when trying to | 
|  | determine the root cause of a corruption problem, but it is still mainly a | 
|  | manual task of verifying that things like single bit errors or misplaced writes | 
|  | weren't the ultimate cause of a corruption event. It may take a few hours to a | 
|  | few days to perform such forensic analysis, so at this scale root cause | 
|  | analysis is entirely possible. | 
|  |  | 
|  | However, if we scale the filesystem up to 1PB, we now have 10x as much metadata | 
|  | to analyse and so that analysis blows out towards weeks/months of forensic work. | 
|  | Most of the analysis work is slow and tedious, so as the amount of analysis goes | 
|  | up, it becomes more likely that the cause will be lost in the noise.  Hence the primary | 
|  | concern for supporting PB scale filesystems is minimising the time and effort | 
|  | required for basic forensic analysis of the filesystem structure. | 
|  |  | 
|  |  | 
|  | Self Describing Metadata | 
|  | ------------------------ | 
|  |  | 
|  | One of the problems with the current metadata format is that apart from the | 
|  | magic number in the metadata block, we have no other way of identifying what it | 
|  | is supposed to be. We can't even identify if it is in the right place. Put simply, | 
|  | you can't look at a single metadata block in isolation and say "yes, it is | 
|  | supposed to be there and the contents are valid". | 
|  |  | 
|  | Hence most of the time spent on forensic analysis is spent doing basic | 
|  | verification of metadata values, looking for values that are in range (and hence | 
|  | not detected by automated verification checks) but are not correct. Finding and | 
|  | understanding how things like cross linked block lists came about (e.g. sibling | 
|  | pointers in a btree that end up with loops in them) is the key to understanding | 
|  | what went wrong, but it is impossible to tell what order the blocks were linked | 
|  | into each other or written to disk after the fact. | 
|  |  | 
|  | Hence we need to record more information into the metadata to allow us to | 
|  | quickly determine if the metadata is intact and can be ignored for the purpose | 
|  | of analysis. We can't protect against every possible type of error, but we can | 
|  | ensure that common types of errors are easily detectable.  Hence the concept of | 
|  | self describing metadata. | 
|  |  | 
|  | The first, fundamental requirement of self describing metadata is that the | 
|  | metadata object contains some form of unique identifier in a well known | 
|  | location. This allows us to identify the expected contents of the block and | 
|  | hence parse and verify the metadata object. If we can't independently identify | 
|  | the type of metadata in the object, then the metadata doesn't describe itself | 
|  | very well at all! | 
|  |  | 
|  | Luckily, almost all XFS metadata has magic numbers embedded already - only the | 
|  | AGFL, remote symlinks and remote attribute blocks do not contain identifying | 
|  | magic numbers. Hence we can change the on-disk format of all these objects to | 
|  | add more identifying information and detect this simply by changing the magic | 
|  | numbers in the metadata objects. That is, if it has the current magic number, | 
|  | the metadata isn't self identifying. If it contains a new magic number, it is | 
|  | self identifying and we can do much more expansive automated verification of the | 
|  | metadata object at runtime, during forensic analysis or repair. | 
|  |  | 
|  | As a primary concern, self describing metadata needs some form of overall | 
|  | integrity checking. We cannot trust the metadata if we cannot verify that it has | 
|  | not been changed as a result of external influences. Hence we need some form of | 
|  | integrity check, and this is done by adding CRC32c validation to the metadata | 
|  | block. If we can verify the block contains the metadata it was intended to | 
|  | contain, a large amount of the manual verification work can be skipped. | 
|  |  | 
|  | CRC32c was selected as metadata cannot be more than 64k in length in XFS and | 
|  | hence a 32 bit CRC is more than sufficient to detect multi-bit errors in | 
|  | metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is | 
|  | fast. So while CRC32c is not the strongest of possible integrity checks that | 
|  | could be used, it is more than sufficient for our needs and has relatively | 
|  | little overhead. Adding support for larger integrity fields and/or algorithms | 
|  | does not really provide any extra value over CRC32c, but it does add a lot of | 
|  | complexity and so there is no provision for changing the integrity checking | 
|  | mechanism. | 
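
The integrity check described above can be sketched in userspace C. This is a
minimal, bit-at-a-time illustration rather than the kernel implementation: the
names crc32c_update() and block_cksum() are hypothetical, and treating the
stored CRC field as zero while checksumming is an assumption made so that the
stored value does not feed into its own calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32c (Castagnoli) using the reflected polynomial 0x82F63B78.
 * Slow but short; production code uses lookup tables or CPU instructions.
 */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Checksum a metadata block, substituting zeroes for the 4-byte CRC field
 * at crc_off (hypothetical helper, not a kernel function).
 */
static uint32_t block_cksum(const uint8_t *blk, size_t len, size_t crc_off)
{
	static const uint8_t zero[4] = { 0, 0, 0, 0 };
	uint32_t crc = 0xFFFFFFFFu;

	assert(blk != NULL && crc_off + sizeof(zero) <= len);
	crc = crc32c_update(crc, blk, crc_off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, blk + crc_off + sizeof(zero),
			    len - crc_off - sizeof(zero));
	return crc ^ 0xFFFFFFFFu;
}
```

Because the CRC field is excluded from the calculation, rewriting the stored
checksum does not change the result, whereas a single flipped bit anywhere else
in the block does.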
|  |  | 
|  | Self describing metadata needs to contain enough information so that the | 
|  | metadata block can be verified as being in the correct place without needing to | 
|  | look at any other metadata. This means it needs to contain location information. | 
|  | Just adding a block number to the metadata is not sufficient to protect against | 
|  | mis-directed writes - a write might be misdirected to the wrong LUN and so be | 
|  | written to the "correct block" of the wrong filesystem. Hence location | 
|  | information must contain a filesystem identifier as well as a block number. | 
|  |  | 
|  | Another key information point in forensic analysis is knowing who the metadata | 
|  | block belongs to. We already know the type, the location, whether it is valid | 
|  | or corrupted, and how long ago it was last modified. Knowing the owner | 
|  | of the block is important as it allows us to find other related metadata to | 
|  | determine the scope of the corruption. For example, if we have an extent btree | 
|  | object, we don't know what inode it belongs to and hence have to walk the entire | 
|  | filesystem to find the owner of the block. Worse, the corruption could mean that | 
|  | no owner can be found (i.e. it's an orphan block), and so without an owner field | 
|  | in the metadata we have no idea of the scope of the corruption. If we have an | 
|  | owner field in the metadata object, we can immediately do top down validation to | 
|  | determine the scope of the problem. | 
|  |  | 
|  | Different types of metadata have different owner identifiers. For example, | 
|  | directory, attribute and extent tree blocks are all owned by an inode, whilst | 
|  | freespace btree blocks are owned by an allocation group. Hence the size and | 
|  | contents of the owner field are determined by the type of metadata object we are | 
|  | looking at.  The owner information can also identify misplaced writes (e.g. | 
|  | freespace btree block written to the wrong AG). | 
|  |  | 
|  | Self describing metadata also needs to contain some indication of when it was | 
|  | written to the filesystem. One of the key information points when doing forensic | 
|  | analysis is how recently the block was modified. Correlation of a set of | 
|  | corrupted metadata blocks based on modification times is important as it can | 
|  | indicate whether the corruptions are related, whether there have been multiple | 
|  | corruption events that led to the eventual failure, and even whether there are | 
|  | corruptions present that the run-time verification is not detecting. | 
|  |  | 
|  | For example, we can determine whether a metadata object is supposed to be free | 
|  | space or still allocated if it is still referenced by its owner by looking at | 
|  | when the free space btree block that contains the block was last written | 
|  | compared to when the metadata object itself was last written.  If the free space | 
|  | block is more recent than the object and the object's owner, then there is a | 
|  | very good chance that the block should have been removed from the owner. | 
|  |  | 
|  | To provide this "written timestamp", each metadata block has the Log Sequence | 
|  | Number (LSN) of the most recent transaction that modified it written into it. | 
|  | This number will always increase over the life of the filesystem, and the only | 
|  | thing that resets it is running xfs_repair on the filesystem. Further, by use of | 
|  | the LSN we can tell if the corrupted metadata all belonged to the same log | 
|  | checkpoint and hence have some idea of how much modification occurred between | 
|  | the first and last instance of corrupt metadata on disk and, further, how much | 
|  | modification occurred between the corruption being written and when it was | 
|  | detected. | 
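
The free space comparison described earlier can be expressed as a small
predicate. This is only a sketch: it treats the LSN as an opaque 64 bit value
that grows over time, and struct lsn_triple and owner_reference_looks_stale()
are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* LSNs read from three related metadata blocks (hypothetical type). */
struct lsn_triple {
	uint64_t freesp_lsn;	/* free space btree block covering the object */
	uint64_t object_lsn;	/* the metadata object under suspicion */
	uint64_t owner_lsn;	/* the owner that still references the object */
};

/*
 * Return true if the free space record was written after both the object
 * and its owner. That ordering suggests the object was freed and the
 * owner's reference to it is stale.
 */
static bool owner_reference_looks_stale(const struct lsn_triple *t)
{
	assert(t != NULL);
	return t->freesp_lsn > t->object_lsn &&
	       t->freesp_lsn > t->owner_lsn;
}
```

A true result is a hint about where to look first, not proof of corruption.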
|  |  | 
|  | Runtime Validation | 
|  | ------------------ | 
|  |  | 
|  | Validation of self-describing metadata takes place at runtime in two places: | 
|  |  | 
|  | - immediately after a successful read from disk | 
|  | - immediately prior to write IO submission | 
|  |  | 
|  | The verification is completely stateless - it is done independently of the | 
|  | modification process, and seeks only to check that the metadata is what it says | 
|  | it is and that the metadata fields are within bounds and internally consistent. | 
|  | As such, we cannot catch all types of corruption that can occur within a block | 
|  | as there may be certain limitations that operational state enforces on the | 
|  | metadata, or there may be corruption of interblock relationships (e.g. corrupted | 
|  | sibling pointer lists). Hence we still need stateful checking in the main code | 
|  | body, but in general most of the per-field validation is handled by the | 
|  | verifiers. | 
|  |  | 
|  | For read verification, the caller needs to specify the expected type of metadata | 
|  | that it should see, and the IO completion process verifies that the metadata | 
|  | object matches what was expected. If the verification process fails, then it | 
|  | marks the object being read as EFSCORRUPTED. The caller needs to catch this | 
|  | error (same as for IO errors), and if it needs to take special action due to a | 
|  | verification error it can do so by catching the EFSCORRUPTED error value. If we | 
|  | need more discrimination of error type at higher levels, we can define new | 
|  | error numbers for different errors as necessary. | 
|  |  | 
|  | The first step in read verification is checking the magic number and determining | 
|  | whether CRC validation is necessary. If it is, the CRC32c is calculated and | 
|  | compared against the value stored in the object itself. Once this is validated, | 
|  | further checks are made against the location information, followed by extensive | 
|  | object specific metadata validation. If any of these checks fail, then the | 
|  | buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. | 
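
The ordering of these checks can be modelled in userspace as follows. This is a
sketch under stated assumptions, not kernel code: struct foo_hdr is a
host-endian stand-in for an on-disk header, FOO_CRC_MAGIC is a made-up magic
number, and the distinct result codes exist only to show which step failed (the
runtime code reports every failure as EFSCORRUPTED).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FOO_CRC_MAGIC	0x464f4f35u	/* hypothetical magic number */

/* Host-endian stand-in for an on-disk header; the layout is illustrative. */
struct foo_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint64_t owner;
	uint64_t blkno;
	uint64_t lsn;
};

enum foo_result {
	FOO_OK, FOO_BAD_MAGIC, FOO_BAD_CRC, FOO_BAD_LOCATION, FOO_BAD_OWNER
};

/* Bitwise CRC32c, reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Checksum the header with the crc field treated as zero. */
static uint32_t foo_cksum(const struct foo_hdr *hdr)
{
	struct foo_hdr tmp = *hdr;

	tmp.crc = 0;
	return crc32c_update(0xFFFFFFFFu, (const uint8_t *)&tmp,
			     sizeof(tmp)) ^ 0xFFFFFFFFu;
}

/* Check magic, then CRC, then location, then object specific fields. */
static enum foo_result foo_read_verify(const struct foo_hdr *hdr,
				       const uint8_t fs_uuid[16],
				       uint64_t read_blkno)
{
	assert(hdr != NULL);
	if (hdr->magic != FOO_CRC_MAGIC)
		return FOO_BAD_MAGIC;
	if (hdr->crc != foo_cksum(hdr))
		return FOO_BAD_CRC;
	if (memcmp(hdr->uuid, fs_uuid, 16) != 0 || hdr->blkno != read_blkno)
		return FOO_BAD_LOCATION;
	if (hdr->owner == 0)
		return FOO_BAD_OWNER;
	return FOO_OK;
}
```

Note that a block with a valid CRC can still fail the location check: a write
misdirected to another filesystem, or to the wrong block of this one, arrives
intact but carries the wrong UUID or block number.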
|  |  | 
|  | Write verification is the opposite of the read verification - first the object | 
|  | is extensively verified and if it is OK we then update the LSN from the last | 
|  | modification made to the object. After this, we calculate the CRC and insert it | 
|  | into the object. Once this is done the write IO is allowed to continue. If any | 
|  | error occurs during this process, the buffer is again marked with an | 
|  | EFSCORRUPTED error for the higher layers to catch. | 
|  |  | 
|  | Structures | 
|  | ---------- | 
|  |  | 
|  | A typical on-disk structure needs to contain the following information: | 
|  |  | 
|  | 	struct xfs_ondisk_hdr { | 
|  | 		__be32  magic;		/* magic number */ | 
|  | 		__be32  crc;		/* CRC, not logged */ | 
|  | 		uuid_t  uuid;		/* filesystem identifier */ | 
|  | 		__be64  owner;		/* parent object */ | 
|  | 		__be64  blkno;		/* location on disk */ | 
|  | 		__be64  lsn;		/* last modification in log, not logged */ | 
|  | 	}; | 
|  |  | 
|  | Depending on the metadata, this information may be part of a header structure | 
|  | separate to the metadata contents, or may be distributed through an existing | 
|  | structure. The latter occurs with metadata that already contains some of this | 
|  | information, such as the superblock and AG headers. | 
|  |  | 
|  | Other metadata may have different formats for the information, but the same | 
|  | level of information is generally provided. For example: | 
|  |  | 
|  | - short btree blocks have a 32 bit owner (ag number) and a 32 bit block | 
|  |   number for location. The two of these combined provide the same | 
|  |   information as @owner and @blkno in the above structure, but using 8 | 
|  |   bytes less space on disk. | 
|  |  | 
|  | - directory/attribute node blocks have a 16 bit magic number, and the | 
|  |   header that contains the magic number has other information in it as | 
|  |   well. Hence the additional metadata headers change the overall format | 
|  |   of the metadata. | 
|  |  | 
|  | A typical buffer read verifier is structured as follows: | 
|  |  | 
|  | 	#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc) | 
|  |  | 
|  | 	static void | 
|  | 	xfs_foo_read_verify( | 
|  | 		struct xfs_buf	*bp) | 
|  | 	{ | 
|  | 		struct xfs_mount *mp = bp->b_target->bt_mount; | 
|  |  | 
|  | 		if ((xfs_sb_version_hascrc(&mp->m_sb) && | 
|  | 		     !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), | 
|  | 						XFS_FOO_CRC_OFF)) || | 
|  | 		    !xfs_foo_verify(bp)) { | 
|  | 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); | 
|  | 			xfs_buf_ioerror(bp, EFSCORRUPTED); | 
|  | 		} | 
|  | 	} | 
|  |  | 
|  | The code ensures that the CRC is only checked if the filesystem has CRCs enabled | 
|  | by checking the superblock for the feature bit, and then if the CRC verifies OK | 
|  | (or is not needed) it verifies the actual contents of the block. | 
|  |  | 
|  | The verifier function will take a couple of different forms, depending on | 
|  | whether the magic number can be used to determine the format of the block. In | 
|  | the case it can't, the code is structured as follows: | 
|  |  | 
|  | 	static bool | 
|  | 	xfs_foo_verify( | 
|  | 		struct xfs_buf		*bp) | 
|  | 	{ | 
|  | 		struct xfs_mount	*mp = bp->b_target->bt_mount; | 
|  | 		struct xfs_ondisk_hdr	*hdr = bp->b_addr; | 
|  |  | 
|  | 		if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) | 
|  | 			return false; | 
|  |  | 
|  | 		/* self describing fields only exist when CRCs are enabled */ | 
|  | 		if (xfs_sb_version_hascrc(&mp->m_sb)) { | 
|  | 			if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) | 
|  | 				return false; | 
|  | 			if (bp->b_bn != be64_to_cpu(hdr->blkno)) | 
|  | 				return false; | 
|  | 			if (hdr->owner == 0) | 
|  | 				return false; | 
|  | 		} | 
|  |  | 
|  | 		/* object specific verification checks here */ | 
|  |  | 
|  | 		return true; | 
|  | 	} | 
|  |  | 
|  | If there are different magic numbers for the different formats, the verifier | 
|  | will look like: | 
|  |  | 
|  | 	static bool | 
|  | 	xfs_foo_verify( | 
|  | 		struct xfs_buf		*bp) | 
|  | 	{ | 
|  | 		struct xfs_mount	*mp = bp->b_target->bt_mount; | 
|  | 		struct xfs_ondisk_hdr	*hdr = bp->b_addr; | 
|  |  | 
|  | 		if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { | 
|  | 			if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) | 
|  | 				return false; | 
|  | 			if (bp->b_bn != be64_to_cpu(hdr->blkno)) | 
|  | 				return false; | 
|  | 			if (hdr->owner == 0) | 
|  | 				return false; | 
|  | 		} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) | 
|  | 			return false; | 
|  |  | 
|  | 		/* object specific verification checks here */ | 
|  |  | 
|  | 		return true; | 
|  | 	} | 
|  |  | 
|  | Write verifiers are very similar to the read verifiers; they just do things in | 
|  | the opposite order. A typical write verifier: | 
|  |  | 
|  | 	static void | 
|  | 	xfs_foo_write_verify( | 
|  | 		struct xfs_buf	*bp) | 
|  | 	{ | 
|  | 		struct xfs_mount	*mp = bp->b_target->bt_mount; | 
|  | 		struct xfs_buf_log_item	*bip = bp->b_fspriv; | 
|  |  | 
|  | 		if (!xfs_foo_verify(bp)) { | 
|  | 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); | 
|  | 			xfs_buf_ioerror(bp, EFSCORRUPTED); | 
|  | 			return; | 
|  | 		} | 
|  |  | 
|  | 		if (!xfs_sb_version_hascrc(&mp->m_sb)) | 
|  | 			return; | 
|  |  | 
|  | 		if (bip) { | 
|  | 			struct xfs_ondisk_hdr	*hdr = bp->b_addr; | 
|  | 			hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); | 
|  | 		} | 
|  | 		xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); | 
|  | 	} | 
|  |  | 
|  | This will verify the internal structure of the metadata before we go any | 
|  | further, detecting corruptions that have occurred as the metadata has been | 
|  | modified in memory. If the metadata verifies OK, and CRCs are enabled, we then | 
|  | update the LSN field (when it was last modified) and calculate the CRC on the | 
|  | metadata. Once this is done, we can issue the IO. | 
|  |  | 
|  | Inodes and Dquots | 
|  | ----------------- | 
|  |  | 
|  | Inodes and dquots are special snowflakes. They have per-object CRC and | 
|  | self-identifiers, but they are packed so that there are multiple objects per | 
|  | buffer. Hence we do not use per-buffer verifiers to do the work of per-object | 
|  | verification and CRC calculations. The per-buffer verifiers simply perform basic | 
|  | identification of the buffer - that they contain inodes or dquots, and that | 
|  | there are magic numbers in all the expected spots. All further CRC and | 
|  | verification checks are done when each inode is read from or written back to the | 
|  | buffer. | 
|  |  | 
|  | The structure of the verifiers and the identifier checks is very similar to the | 
|  | buffer code described above. The only difference is where they are called. For | 
|  | example, inode read verification is done in xfs_iread() when the inode is first | 
|  | read out of the buffer and the struct xfs_inode is instantiated. The inode is | 
|  | already extensively verified during writeback in xfs_iflush_int, so the only | 
|  | addition here is to add the LSN and CRC to the inode as it is copied back into | 
|  | the buffer. | 
|  |  | 
|  | XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of | 
|  | the unlinked list modifications check or update CRCs, neither during unlink nor | 
|  | log recovery. So, it's gone unnoticed until now. This won't matter immediately - | 
|  | repair will probably complain about it - but it needs to be fixed. | 
|  |  |