
zfs send - quota bug - userused underflow #3789

Closed
mw171 opened this issue Sep 16, 2015 · 53 comments

@mw171 commented Sep 16, 2015

hi,

there seems to be a bug in the userused computation under 0.6.5:

# dd if=/dev/zero of=/tmp/a bs=1024k seek=1023 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00228316 s, 459 MB/s
# dd if=/dev/zero of=/tmp/b bs=1024k seek=1023 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00203765 s, 515 MB/s
# 
# zpool create a /tmp/a
# zpool create b /tmp/b
# 
# date > /a/x
# 
# zfs snapshot a@1
# zfs send a@1 | zfs recv -F b
# 
# rm /b/x
# zfs set userquota@root=1M b
# sync
# zfs userspace b
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E     1M
# 
# date > /b/x
-bash: /b/x: Disk quota exceeded
# 

Regards,
Martin

@tjikkun commented Sep 18, 2015

We are seeing this bug as well, except in our case I don't know exactly how to reproduce it. It looks like it happens when files are added and removed, and then in some cases userused is decremented too much, or not incremented enough when it should be. Unfortunately, at the moment I am not sure how to reproduce it. It is a very big problem, however, as more and more of our users are getting "Disk quota exceeded".

@DeHackEd (Contributor)

I'm also seeing it on a few datasets.

# zfs userspace rubyweapon/home/dhe/private -p
TYPE        NAME                  USED  QUOTA
POSIX User  root  18446744073708409344   none
POSIX User  dhe            4818886144   none

Which is about equal to -1142272 for the root user.

I went through snapshots looking for a possible indicator of when the issue began. The most recent snapshot I have is dated 2015-08-22, and the bootup/import date was 2015-08-04. Maybe that helps with guessing when it started?

One other thing. "find $mountpoint -uid 0" on a snapshot returns nothing, but userspace on the snapshot shows root has 1.5k of space. Is that pool metadata?
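
For reference, a rough way to list the entries that have wrapped (an untested sketch; it relies on the default whitespace field splitting of "zfs userspace -Hp" output, where the user name is field 3 and the raw byte count is field 4, and the threshold is approximate because awk treats the counters as doubles):

# list entries whose raw USED value sits near 2^64, i.e. the ones
# that "zfs userspace" displays as ~16.0E
zfs userspace -Hp <pool>/<dataset> | awk '$4 > 9.3e18 {print $3, $4}'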

@dweeezil (Contributor)

The culprit is 5f8e1e8. It causes the accounting to be skipped for object creation during zfs receive. Pinging @nedbass.

@dweeezil (Contributor)

I did a bit of digging to get to the root cause. The dbuf_try_add_ref() function requires a dbuf's refcount to be greater than its dirty count in order to be successful. In this case, restore_object() has called dmu_buf_will_dirty() in the context of the zfs program which ultimately adds a reference and results in both the refcount and the dirty count being set to 1. When the sync task does its spa_sync(), the new early return in dmu_objset_userquota_get_ids() is triggered due to the failure of dbuf_try_add_ref(). This is likely an unusual code path which was not exercised prior to the 5f8e1e8 commit.

@tjikkun commented Sep 19, 2015

Just to be clear: I am seeing this symptom on a dataset that is not on the receiving end of zfs send/receive. Do you think it can be the same issue?

@DeHackEd (Contributor)

If it matters, my dataset had send/recv done in the past, but that was last year at least.

@dweeezil (Contributor)

Let me clarify the problem I found a bit further: the specific issue I referred to above causes the entries in the user/group tracking objects to not be incremented as objects are restored to the dataset. The effect is as if they're all set to zero. As soon as storage is allocated in the newly-received filesystem, entries are created for the appropriate user/group and are incremented appropriately. Similarly, as storage is freed, entries are decremented, which is what leads to the negative values.

I've only researched the exact scenario laid out by @mw171 which involves a send/receive and have found the root cause of the problem which is outlined earlier in this issue.

Since most of the code to do the work already exists, I'm working up a patch which will add a command to recompute the usage values for a specific filesystem. This will make it possible to correct things if they ever get out of sync. I expect to have this working tomorrow once I get a chance to finish it up.

It would certainly be interesting to know of any other situations under which the usage values get out of sync.

@dweeezil (Contributor)

Re-reading my comment, even further clarification is needed w.r.t. the phrase "as if they're all set to zero": This is only true in the context of a zfs receive when it creates a new filesystem. In the case of an existing filesystem, I believe the effect is as if the existing counters are not updated at all (by the receive as it consumes space). I've not investigated what happens when a receive operation frees space.

@mw171 (Author) commented Sep 21, 2015

@dweeezil: Great! At the moment the only way we can keep serving the filesystems with bad used counts is to run them without userquota, so a "usage recompute" command would be a great help for us. Thanks!

mw171 closed this as completed Sep 21, 2015
@DeHackEd (Contributor)

You accidentally closed this.

@mw171 (Author) commented Sep 21, 2015

sorry, wrong button...

mw171 reopened this Sep 21, 2015
@dweeezil (Contributor)

Here's an update on the insane hack I've whipped up to fix the user/group usage tables:

It would be nice to develop a more proper and robust patch to rebuild the usage tables, but in the meantime, the https://github.com/dweeezil/zfs/tree/userused branch contains one of the more wretched hacks I've ever developed. It adds zfs userspace_upgrade, which simply exercises the existing function normally invoked when upgrading the ZPL version from 3 to something later. It also adds zfs userspace_rebuild, which does almost the same thing but can instead be used to manipulate the existing user/group usage tables. Here's how it works:

A new module parameter, zfs_dmu_userused_action, determines the manner in which the tables are managed and can be used in conjunction with zfs userspace_rebuild to effect a rebuild of the tables as follows:

echo 1 > /sys/module/zfs/parameters/zfs_dmu_userused_action
zfs userspace_rebuild <pool>/<dataset>
... sync and wait awhile for the operation to complete ...
echo 2 > /sys/module/zfs/parameters/zfs_dmu_userused_action
zfs userspace_rebuild <pool>/<dataset>
... sync and wait awhile for it to complete ...
echo 0 > /sys/module/zfs/parameters/zfs_dmu_userused_action

The module parameter is global and will impact all operations. In other words, this is a hack of epic proportions and must be performed when the pool is otherwise quiescent. I wrote it mainly to understand how these tables are managed and not to provide a "production-ready" solution to a rebuild of the tables. I do have an idea in mind as to how a proper solution can be implemented but am not sure whether it's worth the effort. Mode 1 causes the values to be cleared, mode 2 causes them to be recomputed and mode 0 is normal tracking.

There may be ways other than the send/receive mentioned in the original report from this issue in which the tables can become out-of-sync with reality.

The usual disclaimer applies to this patch but with less dire warnings: It will very likely not eat your pool but if used improperly will definitely mess up your user/group used tables.

@tjikkun commented Sep 22, 2015

@dweeezil awesome work! We have quite a few users that have underflowed (864 at the moment), and probably quite a few more whose used counts have been decremented too far but haven't underflowed yet. As long as the bug is not fixed, however, recalculating would not really help, since things will just get out of sync again. In any case I would really love for there to be a "production-ready" solution; I am not sure how else we could recover from this.

@dweeezil (Contributor)

@tjikkun So far, I'm not aware of any other operation which causes the values to become out of sync. How old is your pool and at what SPA version was it created? How and at what ZPL version were the filesystems created? Have your user/group values been out of sync for a long time? Do you have any idea when they became corrupted, and whether that would correlate with a specific version of the zfs code?

@tjikkun commented Sep 22, 2015

@dweeezil Well, it happens on pools of varying ages; some are only a little more than a week old, some are a year old. I'm not sure exactly what version we started at, but most probably it was 0.6.3.
I had never seen the out-of-sync values before upgrading to 0.6.5.
It looks like the corruption happens when there are more than a few concurrent creates/writes/deletes, but I haven't been able to reproduce it easily. With time, though, the number of accounts that are "under zero" increases; we are at 901 now.
We have not run ZFS development versions, so I only know the issue was introduced between 0.6.4.2 and 0.6.5.

behlendorf added this to the 0.6.5.2 milestone Sep 23, 2015
@behlendorf (Contributor)

@dweeezil how do you want to proceed with this? We absolutely need to address it in the next point release. Shall we revert 5f8e1e8, or do you have a better solution?

@dweeezil (Contributor)

@behlendorf While tracking down the initially reported problem caused by zfs receive, I'm pretty certain I tripped over an underflow caused by something else. It's clear that 5f8e1e8 causes the issue with zfs receive and is likely causing the underflow in other cases. I was hoping to figure out the fundamental problem with 5f8e1e8 but that has turned into a major spelunking mission regarding the origin of dbuf_try_add_ref() and the chain of issues that led to it. In the mean time, particularly if there's going to be another point release, it would seem to make sense to revert it until a different solution can be found for #3443.

I'm going to revert it locally and make sure I can reproduce #3443. Maybe that can shed some light on the manner in which dbuf_try_add_ref() is being used.

@nedbass (Contributor) commented Sep 24, 2015

Thanks for digging into this problem @dweeezil. I never did have a great feeling about 5f8e1e8. I agree it would be best to revert it for now. It might be useful to get input from @scsiguy since he is the author of dbuf_try_add_ref() and 4c7b7ee which introduced #3443.

behlendorf added a commit that referenced this issue Sep 25, 2015
This reverts commit 5f8e1e8.  It
was determined that this patch introduced the quota regression
described in #3789.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
Issue #3789
behlendorf modified the milestones: 0.6.5.3, 0.6.5.2 Sep 25, 2015
@behlendorf (Contributor)

Reverted the offending commit, a3000f9, from master until we have a better fix. We'll do the same for the 0.6.5.2 release.

@scsiguy (Contributor) commented Sep 27, 2015

Can you try the attached patch? I do not have a physical ZFS test rig right now, so this has only gone through ztest. It should go through the ZFS test suite before being committed, especially given that ztest does not test the objset eviction path at all.

Justin

@nedbass (Contributor) commented Sep 28, 2015

Hi @scsiguy, the attachment didn't seem to make it to github.

@scsiguy (Contributor) commented Sep 28, 2015

You can pull it from this FreeBSD bug report:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202607

Justin

@tjikkun commented Oct 27, 2015

Like the comment above (#3789 (comment)), we too have lots of users with the wrong quota. What are the chances of a command like zfs recompute alluserused <zfs>?
If it is not going to happen, we need to make some hard choices about how to deal with this.

@dweeezil (Contributor)

@tjikkun My hack in https://github.com/dweeezil/zfs/tree/userused does actually work. As I mentioned, however, it is not production-ready and must be used when no other access (actually, no new file creation or removal) is being performed on a filesystem. Here's a little script I used to test it:

#!/bin/sh
# Phase 1: wipe the per-user used values for $FS.
# (Presumably run from a directory inside the mounted filesystem, so the
# fix-* files land on the dataset being repaired.)

FS=timtank/zfsroot

# mode 1: clears the used values (see explanation above)
echo 1 > /sys/module/zfs/parameters/zfs_dmu_userused_action
# with default field splitting, $3 of "zfs userspace -H" output is the user name
users=$(zfs userspace -H $FS | awk '{print $3}')
for i in $users; do
    # touch one file per user so that user's entry gets updated (and cleared)
    echo something > fix-$i
    chown $i fix-$i
    sync
done
sync
sleep 5
rm -f fix-*
# back to normal tracking
echo 0 > /sys/module/zfs/parameters/zfs_dmu_userused_action

That script will wipe all the used values and allow them to be recomputed. Once they're wiped, echo 2 into zfs_dmu_userused_action and run "zfs userspace_rebuild tank/whatever" and it'll recompute the values. After it's done, change the parameter back to 0 and it should be fixed.
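
Spelled out as commands, that second phase looks something like this (an untested sketch; zfs userspace_rebuild and zfs_dmu_userused_action only exist with the patched module from the userused branch above):

# phase 2: recompute the (now cleared) per-user values
echo 2 > /sys/module/zfs/parameters/zfs_dmu_userused_action
zfs userspace_rebuild tank/whatever
sync
sleep 5
# back to normal tracking
echo 0 > /sys/module/zfs/parameters/zfs_dmu_userused_action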

@scsiguy (Contributor) commented Oct 27, 2015

I have a ZFS test system on order and will try to root cause this problem once it arrives.

@dweeezil (Contributor)

@scsiguy I'm not sure the original problem happens any more. At least I've not seen it happen with current master code.

@tjikkun commented Oct 27, 2015

@dweeezil While I appreciate that work very much, for me the problem exists in a production environment. My guess is that a userspace recalculation would take about as much time as a scrub, and if that is so, there is no way I can keep those systems offline for that long.

@mw171 (Author) commented Oct 28, 2015

@tjikkun: same here.
It would be very helpful to be able to selectively set userused@ to zero.
Then the following would also be possible for us in production:

chown -R 999 /home/<uid>
zfs set userused@<uid>=0 <zfs>
chown -R <uid> /home/<uid>

@scsiguy (Contributor) commented Oct 28, 2015

@dweeezil I thought @mw171 was able to still repro an underflow with the latest code.

@mw171 (Author) commented Oct 28, 2015

@scsiguy yes, but only on an already affected filesystem:

# zfs userspace b
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E   none
# ps -ef >/b/x
# zfs userspace b
TYPE        NAME   USED  QUOTA
POSIX User  root  73.5K   none
# rm /b/x
# zfs userspace b
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E   none
# modinfo zfs | grep '^version'
version:        0.6.5-29_gb23d543
# 

@scsiguy (Contributor) commented Oct 28, 2015

@mw171 so you have explicitly tested against a "clean pool" and been unable to repro the problem?

@mw171 (Author) commented Oct 28, 2015

@scsiguy yes, just tried once more:

# dd if=/dev/zero of=/tmp/a bs=1024k seek=1023 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00234458 s, 447 MB/s
# dd if=/dev/zero of=/tmp/b bs=1024k seek=1023 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00173433 s, 605 MB/s
# zpool create a /tmp/a
# zpool create b /tmp/b
# date > /a/x
# zfs snapshot a@1
# zfs send a@1 | zfs recv -F b
# rm /b/x
# zfs set userquota@root=1M b
# sync
# zfs userspace b
TYPE        NAME  USED  QUOTA
POSIX User  root   512     1M
# date > /b/x
# modinfo zfs | grep '^version'
version:        0.6.5-29_gb23d543
# 

@scsiguy (Contributor) commented Oct 28, 2015

@tjikkun Can you expand on why augmenting scrub to fix this corruption won't work for you? Scrub doesn't prevent access to the pool and it could be a special version of scrub which only recalculates user/group used without doing full read/verify of user data.

@tjikkun commented Oct 28, 2015

Re: #3789 (comment)
Yes, that, but preferably a way to do zfs set userused@<uid>=<somenumber> <zfs>.
Or alternatively: if underflow were made impossible, then the following would work:
for an underflowed account, add data until it isn't underflowed anymore.
Then, since it cannot underflow again, you can just do:

chown -R 999 /home/<uid>
#now userused will be 0
chown -R <uid> /home/<uid>

@tjikkun commented Oct 28, 2015

@scsiguy augmenting scrub would be totally awesome. If that can be done, that would be best.
What is ready now, however, is something that I assume would take as long as a scrub but requires that no files be created or removed while it runs. That makes it unsuitable for us.

@mw171 (Author) commented Nov 11, 2015

Unfortunately an underflow is still possible:

# zfs userspace bad
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E   none
# cp /usr/bin/vi /bad
# zfs userspace bad
TYPE        NAME   USED  QUOTA
POSIX User  root  2.38M   none
# rm /bad/vi
# zfs userspace bad
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E   none
# modinfo zfs | grep '^version'
version:        0.6.5-31_gf3e2a7a

@mw171 (Author) commented Nov 19, 2015

Same with 0.6.5-35_g5c79067. It would be very helpful for repairing a damaged filesystem if underflows of userused could be prevented. Please help!

@mw171 (Author) commented Nov 22, 2015

What do you think of the following little patch?

*** zfs/module/zfs/zap.c.orig   Tue Oct  6 16:58:34 2015
--- zfs/module/zfs/zap.c        Sun Nov 22 16:53:58 2015
***************
*** 1137,1142 ****
        if (value == 0)
                err = zap_remove(os, obj, name, tx);
!       else
                err = zap_update(os, obj, name, 8, 1, &value, tx);
        return (err);
  }
--- 1137,1144 ----
        if (value == 0)
                err = zap_remove(os, obj, name, tx);
!       else {
!               if (value >> 63) value = 0;
                err = zap_update(os, obj, name, 8, 1, &value, tx);
+       }
        return (err);
  }

it does the job for me:

zfs userspace bad
TYPE        NAME   USED  QUOTA
POSIX User  root  16.0E   none
echo > /bad/x
zfs userspace bad
TYPE        NAME  USED  QUOTA
POSIX User  root   512   none
rm /bad/x
zfs userspace bad
TYPE        NAME  USED  QUOTA
POSIX User  root   512   none

@DeHackEd (Contributor)

From a style standpoint you probably want to write if ((value >> 63) != 0) or someone may get confused.

@dweeezil (Contributor)

Checking back in on this issue: First off, we definitely don't want a patch in the generic ZAP code as @mw171 is proposing since it would affect all other ZAP operations. If we really wanted a hack, do_userquota_update() would be the right place to take care of this problem.

IIRC, there is still some condition under which root's quota can underflow as shown above. I'm not sure why this happens and I've not tried to track it down. AFAIK, this problem doesn't happen for any non-root users with the current code. It would be nice to track down the problem with root, but it would also be interesting to know if anyone can cause an underflow in a non-root user on a fresh filesystem with current code.
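
For anyone who wants to test that, here is a sketch of the original reproducer adapted to a non-root user (untested; it assumes a scratch user named testuser and that the sparse /tmp/a and /tmp/b files from the first comment already exist):

# recreate the two file-backed test pools from the original report
zpool create a /tmp/a
zpool create b /tmp/b
chmod 1777 /a
su testuser -c 'date > /a/x'
zfs snapshot a@1
zfs send a@1 | zfs recv -F b
rm /b/x
sync
zfs userspace b    # does testuser show a sane value, or 16.0E?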

@mw171 (Author) commented Dec 15, 2015

@dweeezil: Is there any ZAP operation for which it makes sense that an underflow can happen?

@tjikkun commented Dec 17, 2015

@dweeezil Would the patch in the generic ZAP code that @mw171 proposed be okay for us to use in a production environment? Or would it be dangerous or have a large performance impact?

@dweeezil (Contributor)

@tjikkun Since the zap_increment() function is only used in the context of the user/group used objects, it would likely be OK. I'd consider patching it as follows (completely untested):

diff --git a/module/zfs/zap.c b/module/zfs/zap.c
index c5ea392..1ee4aa8 100644
--- a/module/zfs/zap.c
+++ b/module/zfs/zap.c
@@ -1124,16 +1124,24 @@ int
 zap_increment(objset_t *os, uint64_t obj, const char *name, int64_t delta,
     dmu_tx_t *tx)
 {
-       uint64_t value = 0;
+       uint64_t ovalue = 0, value = 0;
        int err;

        if (delta == 0)
                return (0);

-       err = zap_lookup(os, obj, name, 8, 1, &value);
+       err = zap_lookup(os, obj, name, 8, 1, &ovalue);
        if (err != 0 && err != ENOENT)
                return (err);
-       value += delta;
+       value = ovalue + delta;
+
+       /* Truncate underflow to zero */
+       if (delta < 0 && value >= ovalue)
+               value = 0;
+       /* Cap overflow to  UINT64_MAX */
+       else if (delta > 0 && value <= ovalue)
+               value = UINT64_MAX;
+
        if (value == 0)
                err = zap_remove(os, obj, name, tx);
        else

I wish these functions didn't have such generic names because they might be tempting to use for ZAPs in other places in which this behavior wouldn't be appropriate.

@tjikkun commented Dec 18, 2015

@dweeezil Thanks, I don't want to prevent overflows just yet though, because my plan is to implement the patch and then overflow the accounts that have underflowed. Maybe I can better use @mw171's suggestion for that.
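
A sketch of that wrap-around repair for a single account (untested and only approximate, since metadata overhead means the counter will not land exactly on zero; it assumes python3 for the 64-bit arithmetic and the same "zfs userspace -Hp" field layout used elsewhere in this thread):

# raw (wrapped) value for the affected user
used=$(zfs userspace -Hp <pool>/<dataset> | awk '$3 == "<user>" {print $4}')
# bytes needed to wrap the counter forward past zero again
deficit=$(python3 -c "print(2**64 - $used)")
echo $deficit
# then create roughly that much data owned by <user> on the filesystem
# so the counter wraps forward past zero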

@mw171 (Author) commented Dec 21, 2015

@dweeezil if I already have a faulty userused of e.g. 2^64-1, how could I get a corrected value near zero with your patch?

behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 26, 2016
@griznog commented Jul 15, 2016

We've been hit by this on several servers and had to disable quotas on some as well as resort to du to calculate usage on the affected filesystems. This isn't a show-stopping bug, but it is a really annoying one. Is there a recommended method of fixing this once a user is affected or an ETA for a method to fix it? Some of the affected servers are larger than 250 TB, so the du workaround is really painful.
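
For anyone else stuck with that workaround, the per-user pass boils down to something like this (a slow, approximate sketch using GNU find instead of du; it walks the whole mountpoint, ignores snapshots, and will only roughly match the ZFS accounting):

# rough per-user usage on one mountpoint; %u is the owner, %b is 512-byte blocks
mnt=/tank/home    # hypothetical mountpoint of an affected filesystem
find "$mnt" -xdev -type f -printf '%u %b\n' 2>/dev/null |
    awk '{used[$1] += $2 * 512} END {for (u in used) printf "%s\t%d\n", u, used[u]}'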

@behlendorf (Contributor)

Closing; to my knowledge this issue hasn't been observed in the 0.7.x series or newer.
