linux/fs/bcachefs
Linus Torvalds 4a4b30ea80 bcachefs updates for 6.15
On disk format is now soft frozen: no more required/automatic are
 anticipated before taking off the experimental label.
 
 Major changes/features since 6.14:
 
 - Scrub
 
 - Blocksize greater than page size support
 
 - A number of "rebalance spinning and doing no work" issues have been
   fixed; we now check if the write allocation will succeed in
   bch2_data_update_init(), before kicking off the read.
 
   There's still more work to do in this area. Later we may want to add
   another bitset btree, like rebalance_work, to track "extents that
   rebalance was requested to move but couldn't", e.g. due to destination
   target having insufficient online devices.
 
 - We can now support scaling well into the petabyte range: latest
   bcachefs-tools will pick an appropriate bucket size at format time to
   ensure fsck can run in available memory (e.g. a server with 256GB of
   ram and 100PB of storage would want 16MB buckets).
 
 On disk format changes:
 
 - 1.21: cached backpointers (scalability improvement)
 
   Cached replicas now get backpointers, which means we no longer rely on
   incrementing bucket generation numbers to invalidate cached data: this
   lets us get rid of the bucket generation number garbage collection,
   which had to periodically rescan all extents to recompute bucket
   oldest_gen.
 
   Bucket generation numbers are now only used as a consistency check,
   but they're quite useful for that.
 
 - 1.22: stripe backpointers
 
   Stripes now have backpointers: erasure coded stripes have their own
   checksums, separate from the checksums for the extents they contain
   (and stripe checksums also cover the parity blocks). This is required
   for implementing scrub for stripes.
 
 - 1.23: stripe lru (scalability improvement)
 
   Persistent lru for stripes, ordered by "number of empty blocks". This
   is used by the stripe creation path, which depending on free space
   may create a new stripe out of a partially empty existing stripe
   instead of starting a brand new stripe.
 
   This replaces an in-memory heap, and means we no longer have to read
   in the stripes btree at startup.
 
 - 1.24: casefolding
 
   Case insensitive directory support, courtesy of Valve.
 
   This is an incompatible feature, to enable mount with
     -o version_upgrade=incompatible
 
 - 1.25: extent_flags
 
   Another incompatible feature requiring explicit opt-in to enable.
 
   This adds a flags entry to extents, and a flag bit that marks extents
   as poisoned.
 
   A poisoned extent is an extent that was unreadable due to checksum
   errors. We can't move such extents without giving them a new checksum,
   and we may have to move them (for e.g. copygc or device evacuate).
   We also don't want to delete them: in the future we'll have an API
   that lets userspace ignore checksum errors and attempt to deal with
   simple bitrot itself. Marking them as poisoned lets us continue to
   return the correct error to userspace on normal read calls.
 
 Other changes/features:
 
 - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
   command, which shows a live view of all internal filesystem counters.
 
 - Improved journal pipelining: we can now have 16 journal writes in
   flight concurrently, up from 4. We're logging significantly more to
   the journal than we used to with all the recent disk accounting
   changes and additions, so some users should see a performance
   increase on some workloads.
 
 - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
   devices marked as failed. Now we will attempt to read from them, but
   only if we have no better options.
 
 - New option, write_error_timeout: devices will be kicked out of the
   filesystem if all writes have been failing for x number of seconds.
 
   We now also kick devices out when notified by blk_holder_ops that
   they've gone offline.
 
 - Device option handling improvements: the discard option should now be
   working as expected (additionally, in -tools, all device options that
   can be set at format time can now be set at device add time, i.e.
   data_allowed, state).
 
 - We now try harder to read data after a checksum error: we'll do
   additional retries if necessary to a device after after it gave us
   data with a checksum error.
 
 - More self healing work: the full inode <-> dirent consistency checks
   that are currently run by fsck are now also run every time we do a
   lookup, meaning we'll be able to correct errors at runtime. Runtime
   self healing will be flipped on after the new changes have seen more
   testing, currently they're just checking for consistency.
 
 - KMSAN fixes: our KMSAN builds should be nearly clean now, which will
   put a massive dent in the syzbot dashboard.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K
 bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3
 P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV
 pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9
 +OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd
 qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw
 yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL
 L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt
 TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB
 zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa
 1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC
 iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg=
 =k3cV
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
 "On disk format is now soft frozen: no more required/automatic are
  anticipated before taking off the experimental label.

  Major changes/features since 6.14:

   - Scrub

   - Blocksize greater than page size support

   - A number of "rebalance spinning and doing no work" issues have been
     fixed; we now check if the write allocation will succeed in
     bch2_data_update_init(), before kicking off the read.

     There's still more work to do in this area. Later we may want to
     add another bitset btree, like rebalance_work, to track "extents
     that rebalance was requested to move but couldn't", e.g. due to
     destination target having insufficient online devices.

   - We can now support scaling well into the petabyte range: latest
     bcachefs-tools will pick an appropriate bucket size at format time
     to ensure fsck can run in available memory (e.g. a server with
     256GB of ram and 100PB of storage would want 16MB buckets).

  On disk format changes:

   - 1.21: cached backpointers (scalability improvement)

     Cached replicas now get backpointers, which means we no longer rely
     on incrementing bucket generation numbers to invalidate cached
     data: this lets us get rid of the bucket generation number garbage
     collection, which had to periodically rescan all extents to
     recompute bucket oldest_gen.

     Bucket generation numbers are now only used as a consistency check,
     but they're quite useful for that.

   - 1.22: stripe backpointers

     Stripes now have backpointers: erasure coded stripes have their own
     checksums, separate from the checksums for the extents they contain
     (and stripe checksums also cover the parity blocks). This is
     required for implementing scrub for stripes.

   - 1.23: stripe lru (scalability improvement)

     Persistent lru for stripes, ordered by "number of empty blocks".
     This is used by the stripe creation path, which depending on free
     space may create a new stripe out of a partially empty existing
     stripe instead of starting a brand new stripe.

     This replaces an in-memory heap, and means we no longer have to
     read in the stripes btree at startup.

   - 1.24: casefolding

     Case insensitive directory support, courtesy of Valve.

     This is an incompatible feature, to enable mount with
       -o version_upgrade=incompatible

   - 1.25: extent_flags

     Another incompatible feature requiring explicit opt-in to enable.

     This adds a flags entry to extents, and a flag bit that marks
     extents as poisoned.

     A poisoned extent is an extent that was unreadable due to checksum
     errors. We can't move such extents without giving them a new
     checksum, and we may have to move them (for e.g. copygc or device
     evacuate). We also don't want to delete them: in the future we'll
     have an API that lets userspace ignore checksum errors and attempt
     to deal with simple bitrot itself. Marking them as poisoned lets us
     continue to return the correct error to userspace on normal read
     calls.

  Other changes/features:

   - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
     command, which shows a live view of all internal filesystem
     counters.

   - Improved journal pipelining: we can now have 16 journal writes in
     flight concurrently, up from 4. We're logging significantly more to
     the journal than we used to with all the recent disk accounting
     changes and additions, so some users should see a performance
     increase on some workloads.

   - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
     devices marked as failed. Now we will attempt to read from them,
     but only if we have no better options.

   - New option, write_error_timeout: devices will be kicked out of the
     filesystem if all writes have been failing for x number of seconds.

     We now also kick devices out when notified by blk_holder_ops that
     they've gone offline.

   - Device option handling improvements: the discard option should now
     be working as expected (additionally, in -tools, all device options
     that can be set at format time can now be set at device add time,
     i.e. data_allowed, state).

   - We now try harder to read data after a checksum error: we'll do
     additional retries if necessary to a device after after it gave us
     data with a checksum error.

   - More self healing work: the full inode <-> dirent consistency
     checks that are currently run by fsck are now also run every time
     we do a lookup, meaning we'll be able to correct errors at runtime.
     Runtime self healing will be flipped on after the new changes have
     seen more testing, currently they're just checking for consistency.

   - KMSAN fixes: our KMSAN builds should be nearly clean now, which
     will put a massive dent in the syzbot dashboard"

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
  bcachefs: Kill unnecessary bch2_dev_usage_read()
  bcachefs: btree node write errors now print btree node
  bcachefs: Fix race in print_chain()
  bcachefs: btree_trans_restart_foreign_task()
  bcachefs: bch2_disk_accounting_mod2()
  bcachefs: zero init journal bios
  bcachefs: Eliminate padding in move_bucket_key
  bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
  bcachefs: kmsan asserts
  bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
  bcachefs: Disable asm memcpys when kmsan enabled
  bcachefs: Handle backpointers with unknown data types
  bcachefs: Count BCH_DATA_parity backpointers correctly
  bcachefs: Run bch2_check_dirent_target() at lookup time
  bcachefs: Refactor bch2_check_dirent_target()
  bcachefs: Move bch2_check_dirent_target() to namei.c
  bcachefs: fs-common.c -> namei.c
  bcachefs: EIO cleanup
  bcachefs: bch2_write_prep_encoded_data() now returns errcode
  bcachefs: Simplify bch2_write_op_error()
  ...
2025-03-27 13:20:07 -07:00
..
2024-08-08 15:14:02 -04:00
2025-03-24 09:50:36 -04:00
2024-12-21 01:36:15 -05:00
2024-12-21 01:36:20 -05:00
2024-09-09 09:41:49 -04:00
2025-01-09 23:38:41 -05:00
2025-03-14 21:02:16 -04:00
2025-02-26 19:31:05 -05:00
2025-02-26 19:31:05 -05:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-14 21:02:12 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-14 21:02:11 -04:00
2025-03-27 13:20:07 -07:00
2025-03-27 13:20:07 -07:00
2024-12-21 01:36:22 -05:00
2025-03-24 09:50:37 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:36 -04:00
2025-03-24 09:50:37 -04:00
2025-03-24 09:50:35 -04:00
2025-03-24 09:50:36 -04:00
2024-12-21 01:36:16 -05:00
2025-03-14 21:02:12 -04:00
2025-03-14 21:02:12 -04:00
2024-12-21 01:36:20 -05:00
2024-12-21 01:36:20 -05:00
2025-03-24 09:50:36 -04:00
2024-05-08 17:29:19 -04:00
2024-06-23 00:57:21 -04:00
2025-02-26 19:31:05 -05:00
2025-02-26 19:31:05 -05:00
2025-03-24 09:50:37 -04:00
2025-03-14 21:02:16 -04:00
2024-09-09 09:41:49 -04:00
2024-09-09 09:41:49 -04:00
2025-03-24 09:50:34 -04:00
2024-12-21 01:36:20 -05:00