
feat: switch madsim integration and recovery tests to sql backend #18678

Open
wants to merge 37 commits into base: main
Conversation

yezizp2012
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features. See Sqlsmith: SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@yezizp2012 yezizp2012 marked this pull request as ready for review September 26, 2024 05:56
@yezizp2012 yezizp2012 requested a review from a team as a code owner September 26, 2024 05:56
@yezizp2012 yezizp2012 changed the title feat: [IGNORE ME]switch madsim integration tests to sql backend feat: switch madsim integration and recovery tests to sql backend Sep 26, 2024
Contributor

@kwannoel kwannoel left a comment


Separate issues that need to be handled:

  1. Provide a config that uses in-memory SQLite. We need to spawn it separately, with a `'static` lifetime, so it doesn't get killed when the meta node is killed.

@yezizp2012
Contributor Author

yezizp2012 commented Sep 30, 2024

thread '<unnamed>' panicked at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/context.rs:27:44:
there is no reactor running, must be called from the context of a Madsim runtime
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_display
   3: core::option::expect_failed
   4: expect<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
   5: {closure#0}
   6: try_with<core::cell::RefCell<core::option::Option<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>>, madsim::sim::runtime::context::current_task::{closure_env#0}, alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
   7: with<core::cell::RefCell<core::option::Option<alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>>, madsim::sim::runtime::context::current_task::{closure_env#0}, alloc::sync::Arc<madsim::sim::task::TaskInfo, alloc::alloc::Global>>
   8: current_task
   9: current
             at ./src/meta/src/controller/streaming_job.rs:1717:5
             at ./src/meta/src/manager/metadata.rs:432:26
             at ./src/meta/src/barrier/recovery.rs:761:18
             at ./src/meta/src/barrier/recovery.rs:649:78
             at ./src/meta/src/barrier/recovery.rs:281:34
             at ./src/meta/src/barrier/recovery.rs:421:14
             at ./src/meta/src/barrier/mod.rs:1118:65
             at ./src/meta/src/barrier/mod.rs:867:64
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:17:5:
aborting the process
stack backtrace:
   0:     0x5647a4559845 - std::backtrace_rs::backtrace::libunwind::trace::h81f95f911fafa2e0
   1:     0x5647a4559845 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e88326ce80cfb26
   2:     0x5647a4559845 - std::sys::backtrace::_print_fmt::h3318b90cf1bd0c10
   3:     0x5647a4559845 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hfb260f93257b2f2e
   4:     0x5647a458b7cb - core::fmt::rt::Argument::fmt::h78c29ca6c560b840
   5:     0x5647a458b7cb - core::fmt::write::h70fe4701d8d7a171
   6:     0x5647a45540af - std::io::Write::write_fmt::h14783b7a2197dfce
   7:     0x5647a455ab41 - std::sys::backtrace::BacktraceLock::print::hc4ed4e400debb2bf
   8:     0x5647a455ab41 - std::panicking::default_hook::{{closure}}::hfff4b120f05a97bf
   9:     0x5647a455a81c - std::panicking::default_hook::h52349d4986dd35a0
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:17:5
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:29:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:37:1
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/raw.rs:453:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/runnable.rs:850:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:315:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:240:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/mod.rs:129:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/builder.rs:141:35
thread '<unnamed>' panicked at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:12:13:
aborting the process
stack backtrace:
   0:     0x5647a4559845 - std::backtrace_rs::backtrace::libunwind::trace::h81f95f911fafa2e0
   1:     0x5647a4559845 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e88326ce80cfb26
   2:     0x5647a4559845 - std::sys::backtrace::_print_fmt::h3318b90cf1bd0c10
   3:     0x5647a4559845 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::hfb260f93257b2f2e
   4:     0x5647a458b7cb - core::fmt::rt::Argument::fmt::h78c29ca6c560b840
   5:     0x5647a458b7cb - core::fmt::write::h70fe4701d8d7a171
   6:     0x5647a45540af - std::io::Write::write_fmt::h14783b7a2197dfce
   7:     0x5647a455ab41 - std::sys::backtrace::BacktraceLock::print::hc4ed4e400debb2bf
   8:     0x5647a455ab41 - std::panicking::default_hook::{{closure}}::hfff4b120f05a97bf
   9:     0x5647a455a81c - std::panicking::default_hook::h52349d4986dd35a0
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:12:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:18:1
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:29:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/utils.rs:37:1
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/raw.rs:453:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-task-4.4.0/src/runnable.rs:850:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:315:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/task/mod.rs:240:13
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/mod.rs:129:9
                               at /risingwave/.cargo/registry/src/index.crates.io-6f17d22bba15001f/madsim-0.2.30/src/sim/runtime/builder.rs:141:35
thread '<unnamed>' panicked at library/core/src/panicking.rs:229:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

The integration recovery tests also failed: https://buildkite.com/risingwavelabs/main-cron/builds/3420#0192414d-bde7-490d-b9d0-3a681345a3e0

Encountered another issue in the recovery test. It seems to occur when shutting down the simulation cluster. Cc @kwannoel any ideas?

@kwannoel
Contributor

Encountered another issue in recovery test. Seems like it occurred when trying to shutdown the simulation cluster. Cc @kwannoel any ideas?

Not sure either. It may be related to sqlx. I'm taking a look.

@yezizp2012
Contributor Author

yezizp2012 commented Sep 30, 2024

And this one https://buildkite.com/risingwavelabs/pull-request/builds/59181#0192414d-c986-408c-bf72-81826cef3716:

2022-04-30T13:40:01.912968Z ERROR node{id=5 name="frontend-2"}:task{id=369707}:handle_query{mode="simple query" session_id=0 sql=CREATE MATERIALIZED VIEW mv AS SELECT tm, foo FROM t EMIT ON WINDOW CLOSE}: pgwire::pg_protocol: error when process message error=Failed to run the query: Catalog error: gRPC request to meta service failed: Some entity that we attempted to create already exists: table with name mv exists
2022-04-30T13:40:01.912968Z  INFO node{id=5 name="frontend-2"}:task{id=369707}:handle_query{mode="simple query" session_id=0 sql=CREATE MATERIALIZED VIEW mv AS SELECT tm, foo FROM t EMIT ON WINDOW CLOSE}: pgwire_query_log: status="err" time=31ms
2022-04-30T13:40:01.914625Z  INFO node{id=11 name="client"}:task{id=369708}: tokio_postgres::connection: NOTICE: EMIT ON WINDOW CLOSE is currently an experimental feature. Please use it with caution.    
2022-04-30T13:40:01.914625Z DEBUG node{id=11 name="client"}:task{id=863}: risingwave_simulation::slt: Record Statement { loc: Location { file: "e2e_test/streaming/eowc/eowc_select.slt", line: 17, upper: None }, conditions: [], connection: Default, sql: "create materialized view mv as\nselect tm, foo from t\nemit on window close;", expected: Ok } finished in 102.300447ms
2022-04-30T13:40:02.060582Z  INFO node{id=9 name="compactor-1"}:task{id=368300}: risingwave_storage::hummock::compactor: running_parallelism_count=0 pull_task_ack=false pending_pull_task_count=3
2022-04-30T13:40:02.276794Z  INFO epoch{otel.name="Epoch 2234174752423936" epoch=2234174752423936}: rw_tracing: new barrier enqueued epoch=2234174817959936
2022-04-30T13:40:02.279992Z  WARN node{id=7 name="compute-2"}:task{id=369263}: risingwave_stream::task::barrier_manager: control stream reset with error error=gRPC request failed: Internal error: failed to handle barrier event: Actor 8 exited unexpectedly: actor 3 not found in info table
2022-04-30T13:40:02.279992Z  WARN node{id=7 name="compute-2"}:task{id=369263}: risingwave_stream::task::barrier_manager: actor error overwritten actor_id=7 prev_err=Actor 7 exited unexpectedly: actor 3 not found in info table
2022-04-30T13:40:02.281031Z  WARN node{id=3 name="meta-1"}:task{id=367601}: risingwave_meta::barrier::rpc: get error from response stream node=WorkerNode { id: 3, r#type: ComputeNode, host: Some(HostAddress { host: "192.168.3.2", port: 5688 }), state: Running, property: Some(Property { is_streaming: true, is_serving: true, is_unschedulable: false, internal_rpc_host_addr: "" }), transactional_id: Some(2), resource: Some(Resource { rw_version: "2.1.0-alpha", total_memory_bytes: 66673201152, total_cpu_cores: 2 }), started_at: Some(1651325983), parallelism: 2, node_label: "" } err=gRPC request to stream service failed: Internal error: failed to handle barrier event: Actor 8 exited unexpectedly: actor 3 not found in info table
2022-04-30T13:40:02.281031Z  INFO node{id=3 name="meta-1"}:task{id=367601}: risingwave_telemetry_event: Telemetry tracking_id is not set, event reporting disabled
2022-04-30T13:40:02.281031Z  INFO node{id=3 name="meta-1"}:task{id=367601}:failure_recovery{error=get error from control stream, in worker node 3: gRPC request to stream service failed: Internal error: failed to handle barrier event: Actor 8 exited unexpectedly: actor 3 not found in info table}: risingwave_meta::barrier::recovery: recovery start!
2022-04-30T13:40:02.281031Z DEBUG node{id=3 name="meta-1"}:task{id=369582}: risingwave_meta::stream::stream_manager: stream job failed id=TableId { table_id: 977 }
2022-04-30T13:40:02.281031Z ERROR node{id=3 name="meta-1"}:task{id=369582}: risingwave_meta::rpc::ddl_controller_v2: failed to create streaming job id=977 error=get error from control stream, in worker node 3: gRPC request to stream service failed: Internal error: failed to handle barrier event: Actor 8 exited unexpectedly: actor 3 not found in info table
2022-04-30T13:40:02.281035Z  WARN node{id=3 name="meta-1"}:task{id=369582}: risingwave_meta::rpc::ddl_controller_v2: aborted streaming job id=977

...

2022-04-30T13:40:16.120591Z  INFO node{id=5 name="frontend-2"}:task{id=369707}:handle_query{mode="simple query" session_id=0 sql=SELECT * FROM mv ORDER BY tm}: pgwire_query_log: status="err" time=0ms
2022-04-30T13:40:16.126742Z DEBUG node{id=11 name="client"}:task{id=863}: risingwave_simulation::slt: Record Query { loc: Location { file: "e2e_test/streaming/eowc/eowc_select.slt", line: 31, upper: None }, conditions: [], connection: Default, sql: "select * from mv order by tm;", expected: Results { types: [Text, Integer], sort_mode: None, label: None, results: ["2023-05-06 16:51:00  1", "2023-05-06 16:56:00  8", "2023-05-06 17:30:00  3"] } } finished in 14.281116ms
thread '<unnamed>' panicked at /risingwave/src/tests/simulation/src/slt.rs:321:29:
query failed: db error: ERROR: Failed to run the query

Caused by these errors (recent errors listed first):
  1: Catalog error
  2: table or source not found: mv

[SQL] select * from mv order by tm;

Actually, `mv` failed to create, and on retry the client got the error Catalog error: gRPC request to meta service failed: Some entity that we attempted to create already exists: table with name mv exists. But the streaming job had already been aborted by then. The simulation test treated the retried statement as having finished successfully because of this match arm:

| SqlCmd::CreateMaterializedView { .. }
    if i != 0
        && e.to_string().contains("exists")
        && e.to_string().contains("Catalog error") =>

We can fix it by tightening the match pattern that checks the error messages.
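A minimal sketch of a tightened check (hypothetical helper name, not the actual `slt.rs` code): instead of accepting any error that merely contains "exists" and "Catalog error", require the full "table with name <x> exists" phrase for the specific relation, so unrelated catalog errors are not swallowed.

```rust
// Hypothetical sketch: decide whether an "already exists" error on a retried
// CREATE MATERIALIZED VIEW should be treated as success. Requiring the exact
// "table with name <relation> exists" phrase avoids matching unrelated
// catalog errors that happen to contain the word "exists".
fn retry_is_benign(attempt: usize, err: &str, relation: &str) -> bool {
    attempt != 0
        && err.contains("Catalog error")
        && err.contains(&format!("table with name {relation} exists"))
}

fn main() {
    let err = "Catalog error: gRPC request to meta service failed: \
               Some entity that we attempted to create already exists: \
               table with name mv exists";
    assert!(retry_is_benign(1, err, "mv"));
    // A first attempt, or a different relation, must not be swallowed.
    assert!(!retry_is_benign(0, err, "mv"));
    assert!(!retry_is_benign(1, err, "other_mv"));
    println!("ok");
}
```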

@kwannoel kwannoel force-pushed the feat/switch-sim-integration-tests branch 3 times, most recently from 963f285 to 537c8d8 Compare September 30, 2024 11:29
@kwannoel kwannoel force-pushed the feat/switch-sim-integration-tests branch from e4efa0c to c227f11 Compare September 30, 2024 12:06
@kwannoel
Contributor

kwannoel commented Oct 1, 2024

To summarize, here's what's needed after this PR is merged:

  1. Re-enable kill meta.
  2. Update sqlx patch to 0.7.4
  3. Update sqlx patch to also include madsim-tokio fixes.
  4. Re-introduce MADSIM_TEST_SEED.
    i. Provide a config which can use in-memory sqlite. We need to spawn it separately, in a static lifetime, so it doesn't get killed when meta gets killed.
  5. feat: switch madsim integration and recovery tests to sql backend #18678 (comment) Why do we need this configuration?
  6. Fix the timeout for the e2e deterministic test. Currently TEST_NUM=32 is set to avoid the timeout. I suspect it may be related to in-memory SQLite. Perhaps an in-memory meta store can speed it up.

@kwannoel
Contributor

kwannoel commented Oct 1, 2024

Passed 2 rounds of CI. @xxchan @BugenZhao could you help take a look?

I'd like to get this merged, and work on subsequent parts incrementally. We can always revert if something goes terribly wrong.

@kwannoel
Contributor

kwannoel commented Oct 1, 2024

Meta backup test failed: https://buildkite.com/risingwavelabs/main-cron/builds/3448#01924664-243c-4e6c-a9f6-a9efcc92a0b8

failed to run `e2e_test/backup_restore/tpch_snapshot_create.slt`
Caused by:
    statement failed: db error: ERROR: Failed to run the query
    Caused by these errors (recent errors listed first):
      1: gRPC request to meta service failed: Internal error
      2: Hummock error
      3: SST 2 is rejected from being committed since it's below watermark: SST timestamp 1727758344, meta node timestamp 1727758345, retention_sec 0, watermark 1727758345

Edit: Never mind, it already failed in today's main-cron.

@xxchan
Member

xxchan commented Oct 1, 2024

I'd like to get this merged, and work on subsequent parts incrementally.

I'd be happy to proceed this way.

Member

@xxchan xxchan left a comment


Quickly browsed the changes. I don't think it can do anything too harmful. So rubber stamp 😁

Comment on lines +11 to +13
[meta.developer]
meta_actor_cnt_per_worker_parallelism_soft_limit = 65536
meta_actor_cnt_per_worker_parallelism_hard_limit = 65536
Member


How's this related to etcd vs sql? 👀

Contributor


Let me add this to the list of things to follow up on.

-            DbBackend::Sqlite => {
-                Arc::new(SqlBackendElectionClient::new(id, SqliteDriver::new(conn)))
-            }
+            DbBackend::Sqlite => Arc::new(DummyElectionClient::new(id)),
Member


Should we just disallow multiple meta nodes when the backend is SQLite? (I'm fine leaving it as UB.)
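For context, a dummy election client for the single-meta SQLite case could, as a sketch (trait and type names here are hypothetical, only `DummyElectionClient` appears in the diff), simply always report itself as leader, since there is no peer to contend with:

```rust
// Hypothetical sketch of a "dummy" election client: with an SQLite backend
// only one meta node can run, so leader election trivially elects self.
trait ElectionClient {
    fn id(&self) -> &str;
    fn is_leader(&self) -> bool;
}

struct DummyElection {
    id: String,
}

impl ElectionClient for DummyElection {
    fn id(&self) -> &str {
        &self.id
    }
    // No peers exist, so this node is unconditionally the leader.
    fn is_leader(&self) -> bool {
        true
    }
}

fn main() {
    let client = DummyElection { id: "meta-1".to_string() };
    assert!(client.is_leader());
    assert_eq!(client.id(), "meta-1");
    println!("ok");
}
```

Running a second meta node against the same SQLite file would make both believe they are leader, which is why disallowing that configuration outright (rather than leaving it as UB) may be safer.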

4 participants