Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault with subleveldown #60

Closed
christianbundy opened this issue Feb 1, 2019 · 27 comments · Fixed by #89
Closed

Segfault with subleveldown #60

christianbundy opened this issue Feb 1, 2019 · 27 comments · Fixed by #89
Labels
bug Something isn't working

Comments

@christianbundy
Copy link

Disclaimer: this is almost definitely my fault. I'm completely unfamiliar with leveldown and I'm probably doing everything wrong.

With that out of the way, I have a reproducible segfault while using subleveldown that's highly correlated with open/close events.

git clone https://github.com/flumedb/flumeview-level
cd flumeview-level
git checkout -b sub
git pull origin sub
npm ci
node test/read.js

Here's the output:

$ node test/read.js
  flumeview-level re-init +0ms
  flumeview-level re-start +2ms
  flumeview-level opened +9ms
  flumeview-level create() +0ms
FATAL ERROR: v8::Object::Cast Could not convert to object
 1: 0x55a75a67d6d1 node::Abort() [node]
 2: 0x55a75a67d71f  [node]
 3: 0x55a75a82fe9b v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: 0x7f52204e2b1c v8::Object::Cast(v8::Value*) [/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node]
 5: 0x7f52204e490c v8::Local<v8::Object> v8::Local<v8::Object>::Cast<v8::Value>(v8::Local<v8::Value>) [/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node]
 6: 0x7f52204e4036 v8::Local<v8::Object> v8::Local<v8::Value>::As<v8::Object>() const [/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node]
 7: 0x7f52204e92d2 leveldown::Database::Get(Nan::FunctionCallbackInfo<v8::Value> const&) [/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node]
 8: 0x7f52204e0ee6  [/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node]
 9: 0x55a75a8b2ab0  [node]
10: 0x55a75a8b40d7  [node]
11: 0x2eccedadbe1d 
[1]    21605 abort (core dumped)  node test/read.js

And of course:

$ coredumpctl info 21605
           PID: 21605 (node)
           UID: 1000 (christianbundy)
           GID: 1000 (christianbundy)
        Signal: 6 (ABRT)
     Timestamp: Fri 2019-02-01 14:36:20 PST (55s ago)
  Command Line: node test/read.js
    Executable: /usr/bin/node
 Control Group: /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
          Unit: user@1000.service
     User Unit: gnome-terminal-server.service
         Slice: user-1000.slice
     Owner UID: 1000 (christianbundy)
       Boot ID: a26e2f3a62384290a1884245840c594f
    Machine ID: 4e83385b599d4d2ea6f5173888d92afc
      Hostname: samus
       Storage: /var/lib/systemd/coredump/core.node.1000.a26e2f3a62384290a1884245840c594f.21605.1549060580000000.lz4
       Message: Process 21605 (node) of user 1000 dumped core.
                
                Stack trace of thread 21605:
                #0  0x00007f5225387d7f raise (libc.so.6)
                #1  0x00007f5225372672 abort (libc.so.6)
                #2  0x000055a75a67d6e4 _ZN4node5AbortEv (node)
                #3  0x000055a75a67d71f n/a (node)
                #4  0x000055a75a82fe9b _ZN2v85Utils16ReportApiFailureEPKcS2_ (node)
                #5  0x00007f52204e2b1c n/a (/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node)
                #6  0x00007f52204e490c n/a (/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node)
                #7  0x00007f52204e4036 n/a (/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node)
                #8  0x00007f52204e92d2 n/a (/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node)
                #9  0x00007f52204e0ee6 n/a (/home/christianbundy/src/flumeview-level/node_modules/leveldown/build/Debug/leveldown.node)
                #10 0x000055a75a8b2ab0 n/a (node)
                #11 0x000055a75a8b40d7 n/a (node)
                #12 0x00002eccedadbe1d n/a (n/a)
                #13 0x00002ecceda918d5 n/a (n/a)
                #14 0x00002ecceda918d5 n/a (n/a)
                #15 0x00002ecceda918d5 n/a (n/a)
                #16 0x00002ecceda918d5 n/a (n/a)
                #17 0x00002ecceda918d5 n/a (n/a)
                #18 0x00002ecceda918d5 n/a (n/a)
                #19 0x00002ecceda918d5 n/a (n/a)
                #20 0x00002ecceda918d5 n/a (n/a)
                #21 0x00002ecceda918d5 n/a (n/a)
                #22 0x00002ecceda918d5 n/a (n/a)
                #23 0x00002ecceda918d5 n/a (n/a)
                #24 0x00002ecceda8ee75 n/a (n/a)
                #25 0x00002ecceda892c1 n/a (n/a)
                #26 0x000055a75ab7eb06 n/a (node)
                #27 0x000055a75ab7f03d n/a (node)
                #28 0x000055a75ab7f0f2 _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ (node)
                #29 0x000055a75a844785 _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_ (node)
                #30 0x000055a75a67e92c _ZN4node20InternalMakeCallbackEPNS_11EnvironmentEN2v85LocalINS2_6ObjectEEENS3_INS2_8FunctionEEEiPNS3_INS2_5ValueEEENS_13async_contextE (node)
                #31 0x000055a75a67ea2d _ZN4node12MakeCallbackEPN2v87IsolateENS0_5LocalINS0_6ObjectEEENS3_INS0_8FunctionEEEiPNS3_INS0_5ValueEEENS_13async_contextE (node)
                #32 0x000055a75a6639e9 _ZN4node11Environment14CheckImmediateEP10uv_check_s (node)
                #33 0x00007f52260d7241 uv__run_check (libuv.so.1)
                #34 0x00007f52260d1ab9 uv_run (libuv.so.1)
                #35 0x000055a75a688986 _ZN4node5StartEPN2v87IsolateEPNS_11IsolateDataERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISB_EESF_ (node)
                #36 0x000055a75a686c50 _ZN4node5StartEiPPc (node)
                #37 0x00007f5225374223 __libc_start_main (libc.so.6)
                #38 0x000055a75a63f7fe _start (node)
                
                Stack trace of thread 21606:
                #0  0x00007f522544bc7e epoll_pwait (libc.so.6)
                #1  0x00007f52260e02aa uv__io_poll (libuv.so.1)
                #2  0x00007f52260d1ab0 uv_run (libuv.so.1)
                #3  0x000055a75a7091b4 _ZZN4node20BackgroundTaskRunner20DelayedTaskScheduler5StartEvENUlPvE_4_FUNES2_ (node)
                #4  0x00007f522551ba9d start_thread (libpthread.so.0)
                #5  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21612:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x00007f52260ccac3 n/a (libuv.so.1)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21615:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x00007f52260ccac3 n/a (libuv.so.1)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21610:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x000055a75a706e1c n/a (node)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21607:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x000055a75a706e1c n/a (node)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21608:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x000055a75a706e1c n/a (node)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21609:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x000055a75a706e1c n/a (node)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21611:
                #0  0x00007f5225524436 do_futex_wait.constprop.1 (libpthread.so.0)
                #1  0x00007f5225524538 __new_sem_wait_slow.constprop.0 (libpthread.so.0)
                #2  0x00007f52260dd894 uv_sem_wait (libuv.so.1)
                #3  0x000055a75a766e61 n/a (node)
                #4  0x00007f522551ba9d start_thread (libpthread.so.0)
                #5  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21613:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x00007f52260ccac3 n/a (libuv.so.1)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)
                
                Stack trace of thread 21614:
                #0  0x00007f5225521afc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007f52260dd80a uv_cond_wait (libuv.so.1)
                #2  0x00007f52260ccac3 n/a (libuv.so.1)
                #3  0x00007f522551ba9d start_thread (libpthread.so.0)
                #4  0x00007f522544bb23 __clone (libc.so.6)

Please let me know if there's anything else I can do to help debug this. Thanks a lot! I've been really enjoying my experience with leveldown so far.

@ralphtheninja
Copy link
Member

Are you sure the following commands are correct?

git clone https://github.com/Level/subleveldown
cd subleveldown
git checkout -b sub
git pull origin sub
npm ci
node test/read.js

@christianbundy
Copy link
Author

Oof, sorry about that. I've double-checked that this is 100% correct:

git clone https://github.com/flumedb/flumeview-level
cd flumeview-level
git checkout -b sub
git pull origin sub
npm ci
node test/read.js

@ralphtheninja
Copy link
Member

I personally get a SIGSEGV:

screenshot from 2019-02-02 00-48-47

It's probably the same issue but blows up somewhere else due to timing etc. It's difficult to see directly where this goes wrong. My gut feeling tells me something is trying to read data after the database has been closed. This is (unfortunately) a known problem with leveldown.

@ralphtheninja
Copy link
Member

Btw, is this test code new or has it been around for a while? Wondering if there's something that has changed in leveldown that all of a sudden broke the tests.

@ralphtheninja
Copy link
Member

@vweevers Any ideas?

@christianbundy
Copy link
Author

Test code is super new. I'm on mobile so this will be brief, but the gist is that on Scuttlebutt we have many instances of leveldown running and I wanted to experiment with a single instance shared by subleveldown. As far as I can tell this passes all previous tests except "retest", which tries to close and re-open to database to ensure persistence.

Is my intuition correct that sharing an instance of leveldown via subleveldown may be more performant, or is this all just a wild goose chase?

Thanks for your help! I'm really appreciative at how quickly you've triaged this issue.

@ralphtheninja
Copy link
Member

Test code is super new.

This is good to know.

Is my intuition correct that sharing an instance of leveldown via subleveldown may be more performant, or is this all just a wild goose chase?

It might be more performant. You'd have to write some benchmarks to verify this. Also depends on what type of performance you're after. Maybe lower memory usage at the expense of more key processing? Just guessing here.

Anyway. I think you should take a closer look at the code dealing with closing and re-opening the database. Take care to close the database once you know all iterators are done (read streams etc), then re-open.

@vweevers
Copy link
Member

vweevers commented Feb 2, 2019

On Windows I get a nullptr in a get() operation:

image

@vweevers
Copy link
Member

vweevers commented Feb 2, 2019

@christianbundy I think the issue (or at least one issue) is that the database isn't actually closed. Because flumedb.close() does not close the flumelog-level instance or its underlying db.

@vweevers
Copy link
Member

vweevers commented Feb 2, 2019

You can check this by adding the following lines to flumelog-level:

db.on('closing', () => console.error('closing'))
db.on('closed', () => console.error('closed'))

Both don't happen, until I make the following patch to flumedb:

index ca81863..e77e925 100644
--- a/index.js
+++ b/index.js
@@ -179,8 +179,11 @@ module.exports = function (log, isReady, mapper) {
           if(sv.close) sv.close(cb)
           else cb()
         }
-      })) (cb)
-
+      }))(function (err) {
+        log.db.close(function (err2) {
+          cb(err || err2)
+        })
+      })
     }
   }
   return flume

@vweevers
Copy link
Member

vweevers commented Feb 2, 2019

Side note: mkdirp() before opening should not be needed, LevelDB does that too.

@ralphtheninja
Copy link
Member

On Windows I get a nullptr in a get() operation:

This leads me to think it hasn't been opened since db is NULL in the Database class until we open it.

https://github.com/Level/leveldown/blob/82b9675a927c9af5482d1acbc7d342b0229a9273/src/database.cc#L26

https://github.com/Level/leveldown/blob/82b9675a927c9af5482d1acbc7d342b0229a9273/src/database.cc#L43

@vweevers
Copy link
Member

vweevers commented Feb 2, 2019

@ralphtheninja with the close() patch above, I no longer get the null pointer. So the issue is not that the test doesn't open the db, but that it opens a db still in use by a previous instance. The "empty close X" tests still fail, but that's a separate issue.

@vweevers
Copy link
Member

vweevers commented Mar 2, 2019

@christianbundy Do you have enough information to continue?

@christianbundy
Copy link
Author

Yeah, it sounds like I'm trying to open an already-open database, but I'm not clear on why that's segfaulting rather than just throwing an error. I've tried to write a minimal reproducible test-case, but I'm not sure exactly why this segfaults.

I'm happy to just use the workaround of not trying to open an already-open database, but I was surprised by the segfault.

@vweevers
Copy link
Member

vweevers commented Mar 2, 2019

True, the segfault does warrant further investigation. I'll have a look.

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

@christianbundy I'm struggling to write a test case (I can't seem to produce a segfault unless I create an iterator, but that's not the issue here AFAICT and already slated to be fixed by #597).

Could you share yours?

@christianbundy
Copy link
Author

Sorry, I mean that I've also been unable to write a minimal test case. You can still use these instructions but I haven't had the time to whittle it down any further.

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

It seems ralphtheninja was on the right track in https://github.com/Level/leveldown/issues/590#issuecomment-459952479.

The chain of db's is not closed properly somewhere, leading to a broken state. Then when reopened, which wraps some of the existing db's in new db's, the underlying LevelDOWN instance is not opened, which then leads to a null pointer on the first get() (or other operation).

Initially, the state is:

image

Once opened, all those statuses change to open, like they should. But before and after closing, the state is:

image

Note that EncodingDOWN stays "open" for some reason. Could be an issue in subleveldown (this is really hard to track).

Then, in the # reload test, the db is re-wrapped, inheriting some of that bad state:

image

Because EncodingDOWN believes it's open, it does not open its underlying LevelDOWN instance. So after open, the state is:

image

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

So subleveldown, because it unwraps the db chain to get to leveldown and then takes control of it, closes that leveldown instance. It bypasses encoding-down because it doesn't want that particular layer; it rewraps leveldown with a new encoding-down instance. But, if you then reuse the original encoding-down instance (directly or via levelup) and it was already opened, you get the above situation.

Hard to explain, but I think I have enough info now to write a test.

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

Also, the reason that the close patch above (https://github.com/Level/leveldown/issues/590#issuecomment-459950304) fixes the situation, is because in that case, the original encoding-down gets closed and therefore is in a good state to be reused.

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

Long story short, the bug is in subleveldown. It opens the levelup instance that you pass in, but then ignores (what that does to the state of) encoding-down. On close, it only closes leveldown, not the originally passed in levelup or by extension encoding-down. To fix, we should either:

  1. Unwrap the levelup instance in the subleveldown constructor, discarding both the levelup and encoding-down layers. Then subleveldown only has to open the leveldown instance.
  2. Keep that part as-is, but also take control of the levelup instance and close it.

I prefer 1, but feel like I'm missing something because the way it currently works suddenly seems needlessly complicated. AFAICT subleveldown does not use the levelup instance for any other purpose than to reopen the underlying store.

Moving this issue to the subleveldown repo.

@vweevers vweevers transferred this issue from Level/leveldown Mar 3, 2019
@vweevers vweevers added the bug Something isn't working label Mar 3, 2019
@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

but feel like I'm missing something

Ah yes. I was missing the fact that the levelup instance that is passed in, may already have subleveldown in its chain. Gah. Also, we lack a mechanism in abstract-leveldown to hook into an opening db (there are no events).

@vweevers
Copy link
Member

vweevers commented Mar 3, 2019

@christianbundy With npm i Level/subleveldown#open-close-hack on top of your test setup I can avoid the segfault, but one of the tests fails on incorrect data.

Could that be because flumelog-level does not use subleveldown? E.g. it reads+writes unprefixed data, conflicting with the keyspace of flumeview-level. I'm not familiar with the design of flume so I can't really investigate that part. Did these tests pass before you introduced subleveldown?

(The "empty close" tests also still fail but to reiterate, that's a separate problem and I want to put that aside for now)

@christianbundy
Copy link
Author

No, these tests never passed -- this was an experiment with adding subleveldown, and when I got a segfault I just reported with my test-case. It's very possible that it's writing unprefixed data, I don't think that I understood that a prefix was necessary so that's probably been implemented incorrectly by me.

@vweevers
Copy link
Member

vweevers commented Mar 4, 2019

OK. FYI, it might take some time for us to find a proper fix for the subleveldown bug. Also, don't use that #open-close-hack branch for anything other than tests 😉

@vweevers
Copy link
Member

@christianbundy Why did you close this? Did you find a fix on your end?

vweevers added a commit that referenced this issue Apr 4, 2020
DeferredOpen means that the db opens itself and defers operations
until it's open. Currently that's only supported by level(up) and
friends. Before, subleveldown would also accept abstract-leveldown
db's that were not wrapped in levelup.

Opening and closing a sublevel no longer opens or closes the parent
db. The sublevel does wait for the parent to open (which in the
case of levelup already happens automatically) but never initiates
a state change.

Drops support of old modules:

- memdb (use level-mem instead)
- deferred-leveldown < 2.0.0 (and thus levelup < 2.0.0)
- abstract-leveldown < 2.4.0

Closes #84, #83 and #60.
vweevers added a commit that referenced this issue Apr 4, 2020
DeferredOpen means that the db opens itself and defers operations
until it's open. Currently that's only supported by level(up) and
friends. Before, subleveldown would also accept abstract-leveldown
db's that were not wrapped in levelup.

Opening and closing a sublevel no longer opens or closes the parent
db. The sublevel does wait for the parent to open (which in the
case of levelup already happens automatically) but never initiates
a state change.

Drops support of old modules:

- memdb (use level-mem instead)
- deferred-leveldown < 2.0.0 (and thus levelup < 2.0.0)
- abstract-leveldown < 2.4.0

Closes #84, #83 and #60.
vweevers added a commit that referenced this issue Apr 5, 2020
DeferredOpen means that the db opens itself and defers operations
until it's open. Currently that's only supported by levelup (and
levelup factories like level). Previously, subleveldown would also
accept abstract-leveldown db's that were not wrapped in levelup.

Opening and closing a sublevel no longer opens or closes the parent
db. The sublevel does wait for the parent to open (which in the
case of levelup already happens automatically) but never initiates
a state change.

If one closes the parent but not the sublevel, subsequent
operations (like get and put) on the sublevel will yield an error,
to prevent segmentation faults from underlying stores.

Drops support of old modules:

- memdb (use level-mem instead)
- deferred-leveldown < 2.0.0 (and thus levelup < 2.0.0)
- abstract-leveldown < 2.4.0

Closes #84, #83 and #60.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants