Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast Dedup: FDT-log feature #15895

Closed
wants to merge 14 commits into from
Closed

Conversation

allanjude
Copy link
Contributor

Motivation and Context

Dedup tables have a huge performance overhead in part because they require an update to the table on disk for every write, on every transaction. Adding a write-only journal allows updates to be batched up and deferred, reducing the immediate cost.

Description

To address this, the dedup log was added. If the fast_dedup feature is enabled, at the end of each txg, modified entries will be copied to an in-memory "log" object (ddt_log_t), and appended to an on-disk log. If the same block is requested again, the in-memory object will be checked first, and if its there, the entry inflated back onto the live tree without going to storage. The on-disk log is only read at pool import time, to reload the in-memory log.

Each txg, some amount of the in-memory log will be flushed out to a DDT storage object (ie ZAP) as normal. OpenZFS will try hard to flush enough to keep up with the rate of change on dedup entries, but not so much that it would impact overall throughput, and not using too much memory. See the zfs_dedup_log_* tuneables in zfs(4) for more details.

How Has This Been Tested?

TBD.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

cmd/zdb/zdb.c Outdated Show resolved Hide resolved
Copy link
Contributor

@adamdmoss adamdmoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly-speaking I like the approach and implementation, although:

  • I'm not qualified to vouch for its correctness, LGTM but I'm sure there are 1001 subtleties in the journalling that are beyond me
  • The changes still need to be taken to completion (ifdefs/xxx)
  • I'm concerned about how much this has been exercised, especially when separated from the other DDT optimization PRs

@robn
Copy link
Member

robn commented May 15, 2024

[Fast dedup stack rebased to master 3c941d1]

@robn
Copy link
Member

robn commented Jun 7, 2024

@amotin last push switches from storing count of blocks to count of bytes, as discussed. Note that it still operates on whole blocks internally, it just changes what we store, so we could do something different next time.

@robn
Copy link
Member

robn commented Jun 7, 2024

@adamdmoss

  • The changes still need to be taken to completion (ifdefs/xxx)
  • I'm concerned about how much this has been exercised, especially when separated from the other DDT optimization PRs

Both reasonable things to flag! This PR is near the top of the larger "fast dedup" stack of work, and is waiting for some stuff lower down in the stack to be finalised before this one is finalised. To that extent, yes, it is a draft. There's also some specific workarounds, see the commit list. Those will be resolved before the end.

However, we believe the structure is largely correct, even if some of the fine details are not. Review is still helpful here, because if there's an error in the fundamentals then the polishing won't matter.

@amotin
Copy link
Member

amotin commented Jun 10, 2024

last push switches from storing count of blocks to count of bytes, as discussed. Note that it still operates on whole blocks internally, it just changes what we store, so we could do something different next time.

@robn The on-disk log format is still based on blocks, since you do not allow records to cross block boundaries, so we won't be able to change the algorithm without introducing new DDT version. The benefit from using byte offsets I would see is ability to continue writing incomplete log block from previous TXGs, but as I see your current code always rounds the ddl_size to the next block boundary in ddt_log_commit(), while in ddt_log_begin() you always assign dlu_offset to zero. Would we try to change it later, we would become backward-incompatible, so it would be nice to allow it now, at least making it more flexible.

module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/dsl_scan.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Outdated Show resolved Hide resolved
module/zfs/ddt_log.c Show resolved Hide resolved
@robn
Copy link
Member

robn commented Jun 14, 2024

[Fast dedup stack rebased to master c98295e]

@robn robn force-pushed the fdt-rel-log branch 5 times, most recently from 6fb36a9 to 0f84746 Compare June 18, 2024 11:02
module/zfs/ddt_log.c Fixed Show resolved Hide resolved
@robn
Copy link
Member

robn commented Jun 18, 2024

Alright, so we now have large blocks, and it looks good I think.

To clarify some of the above comments, by design, we never write across a block boundary, and we never write to the same block more than once. Not writing across block boundaries just makes the code easier (and not a lot of waste; records are small). Not writing a block more than once is to ensure the log is always write-only. A big part of the pain of dedup historically is needing to read in order to write, and while appending to an object is very different to updating a complex object like a ZAP, we set it as a design goal so we'd never have to deal with it again in the future.

I don't think there's any issue with changing the format in the future. For one thing, there's version and flag bits, so we can spot an old format and deal with it. For the other, the log can be disabled and flushed out without too much effort, so a format change can be done by emptying the log and then recreating it with the new thing enabled.

TODO from above:

  • byteswaps
  • ignore entries before checkpoint during load
  • scrub start txg

@amotin
Copy link
Member

amotin commented Jun 18, 2024

Not writing a block more than once is to ensure the log is always write-only.

My thinking was about a stream of very short transactions, that may produce a long chain of very small blocks, that may take longer to read on import due to required head seeks and overhead. But I guess in that case log should not grow too long, so it may be not a huge deal, though I haven't looked close on the log flushing math yet. I don't insist on this.

@robn
Copy link
Member

robn commented Jun 20, 2024

Alright, big push today:

  • on-disk formats should now be endian-safe: all u64-based, and use the BF64_* macros to break it down. Untested on a big-endian machine (who has those?) but think this will be right.
  • when loading, just skip entries before the checkpoint, rather than loading them and then destorying them later. Cleaner and faster.
  • fix that "before checkpoint" assert
  • finalise object bs/ibs setup
  • fix scan cutoff txg, clean up the code a bit there

(If it helps to see the new changes, robn/fdt-rel-log-wip has the pre-squash commits).

Home stretch now. Pretty sure the last thing missing is some tests just to ensure any of this works - it obviously does by the numbers, but lets actually prove that now and forever.

include/sys/ddt_impl.h Outdated Show resolved Hide resolved
include/sys/ddt_impl.h Outdated Show resolved Hide resolved
module/zfs/ddt_stats.c Show resolved Hide resolved
module/zfs/ddt.c Outdated Show resolved Hide resolved
module/zfs/ddt.c Outdated Show resolved Hide resolved
@robn robn force-pushed the fdt-rel-log branch 2 times, most recently from 1a289e1 to b56fc1f Compare June 21, 2024 05:36
robn and others added 8 commits August 16, 2024 09:59
The upcoming dedup features break the long held assumption that all
blocks on disk with a 'D' dedup bit will always be present in the DDT,
or will have the same set of DVA allocations on disk as in the DDT.

If the DDT is no longer a complete picture of all the dedup blocks that
will be and should be on disk, then it does us no good to walk and prime
it up front, since it won't necessarily match up with every block we'll
see anyway.

Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.

The DDT and BRT are moved ahead of the space accounting. This will
become important for the "flat" feature, which may need to count a
modified version of the block.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.

This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.

Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.

Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.

Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operators mental model - pool usage is managed elsewhere.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when its not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.

This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.

This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
robn and others added 4 commits August 16, 2024 15:31
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Signed-off-by: Allan Jude <allan@klarasystems.com>
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when its not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.

This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.

This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #15895
@robn
Copy link
Member

robn commented Aug 17, 2024

Thanks all!

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.

Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.

Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.

Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operators mental model - pool usage is managed elsewhere.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when its not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.

This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.

This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15895
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes openzfs#15895
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants