
test failed in CI: app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones #6568

Open
bnaecker opened this issue Sep 13, 2024 · 3 comments · Fixed by #6590
Labels: Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

@bnaecker
Collaborator

This test failed on a CI run on pull request #6551:

https://github.com/oxidecomputer/omicron/pull/6551/checks?check_run_id=30071404933

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J7KRQA0QPH39CY96CQJF6Z23/1FrIoqw9yG0jDA7yi1gTIwhUFrVoqNJvyLQAGB0ClHEq3pm0/01J7KRR5P728MRF0TSY1N93XM3#S4699

Excerpt from the log showing the failure:

--- STDERR:              omicron-nexus app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones ---
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.0.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.0.log"
DB URL: postgresql://root@[::1]:41040/omicron?sslmode=disable
DB address: [::1]:41040
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.2.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.2.log"
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.3.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.3.log"
thread 'app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/httptest-0.15.5/src/server.rs:90:13:
received the following unexpected requests:
[
    Request {
        method: GET,
        uri: /inventory,
        version: HTTP/1.1,
        headers: {
            "accept": "application/json",
            "host": "[::1]:50941",
        },
        body: b"",
    },
]
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: verify_and_clear
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/httptest-0.15.5/src/server.rs:90:13
   3: {async_fn#0}
             at ./src/app/background/tasks/blueprint_execution.rs:463:9
   4: {async_block#0}
             at ./src/app/background/tasks/blueprint_execution.rs:267:5
   5: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   8: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
   9: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  12: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  17: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  18: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  21: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  26: test_deploy_omicron_zones
             at ./src/app/background/tasks/blueprint_execution.rs:267:5
  27: omicron_nexus::app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones::{{closure}}
             at ./src/app/background/tasks/blueprint_execution.rs:268:76
  28: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmpqQMj6Z"
	If you would like to access the database for debugging, run the following:

	# Run the database
	cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmpqQMj6Z/data"
	# Access the database. Note the port may change if you run multiple databases.
	cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failed to clean up ClickHouse data dir:
- /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.1-clickhouse-LirrEm: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpz45LCq
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpcak1sV
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpvh7J0T
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpYFkdVu

@bnaecker added the Test Flake label Sep 13, 2024
@jgallagher
Contributor

I think I understand this. This test starts some mock sled-agents, then inserts records for those mock sled-agents into CRDB. It then activates the Nexus RPW under test in a few different ways and confirms that the task talks to the sled-agents as expected.

However, we're running under a #[nexus_test], so other Nexus RPWs entirely unrelated to blueprint execution are also running. Presumably, in this flake, the inventory-collection RPW saw the CRDB records for our mock sled-agents and tried to collect their inventories, causing the mock HTTP server to complain about an unexpected request.

I'm not sure off the top of my head what the best fix is, though. Use a simulated sled-agent instead of a mock? Turn off the other RPWs (if that's possible)? Tell the mock sled-agent to ignore unexpected requests?
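For the last option, here's a rough sketch of what that could look like with `httptest` (the library the backtrace points at). The `PUT /omicron-zones` route and status codes are illustrative placeholders, not the real sled-agent API:

```rust
use httptest::{matchers::*, responders::status_code, Expectation, Server};

let server = Server::run();

// The request the test actually asserts on: exactly one deployment call.
// (`PUT /omicron-zones` is a stand-in for whatever the RPW really sends.)
server.expect(
    Expectation::matching(request::method_path("PUT", "/omicron-zones"))
        .times(1)
        .respond_with(status_code(204)),
);

// Explicitly allow the inventory-collection RPW's request any number of
// times, so it no longer counts as "unexpected".
server.expect(
    Expectation::matching(request::method_path("GET", "/inventory"))
        .times(0..)
        .respond_with(status_code(200)),
);

// verify_and_clear() still checks that the PUT happened exactly once, but
// the background GETs no longer trip the "unexpected requests" panic.
server.verify_and_clear();
```

The downside is that this only whitelists requests we know about; a new RPW hitting a different endpoint would reintroduce the flake.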

@bnaecker
Collaborator Author

> Tell the mock sled-agent to ignore unexpected requests?

I like this approach. It looks like we have complete control over the mock servers, so maybe we can keep track of only the counts we care about. The assertion can then check that one count, rather than all requests.
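A minimal sketch of what that could look like without `httptest`, using a bare tokio listener (every name and path here is illustrative): count only the one request the test cares about and blindly answer everything else.

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

/// Start a mock sled-agent that counts `PUT /omicron-zones` requests (a
/// made-up path standing in for the real one) and answers everything else,
/// e.g. `GET /inventory` from unrelated RPWs, without recording it.
async fn start_counting_mock(
    deploy_count: Arc<AtomicUsize>,
) -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("[::1]:0").await?;
    let addr = listener.local_addr()?;
    tokio::spawn(async move {
        while let Ok((mut sock, _)) = listener.accept().await {
            let deploy_count = Arc::clone(&deploy_count);
            tokio::spawn(async move {
                // Naive single-read HTTP handling; fine for a test mock.
                let mut buf = vec![0u8; 8192];
                let n = sock.read(&mut buf).await.unwrap_or(0);
                let head = String::from_utf8_lossy(&buf[..n]);
                if head.starts_with("PUT /omicron-zones") {
                    deploy_count.fetch_add(1, Ordering::SeqCst);
                }
                let _ = sock
                    .write_all(b"HTTP/1.1 204 No Content\r\ncontent-length: 0\r\n\r\n")
                    .await;
            });
        }
    });
    Ok(addr)
}
```

The test would then assert on `deploy_count.load(Ordering::SeqCst)` after activating the task, instead of verifying the full request stream.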

@jgallagher
Contributor

Still happening after the fix in #6590 (#6593 (comment)). #6594 should fix this, but I'm worried it may just make the flake less likely. I'd like to leave this open for a while even after #6594 lands, in case we see it again (at which point we probably need to rework the test to either use a different HTTP mocking library or do its checks some other way).
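If we do end up swapping libraries, `wiremock` is one candidate whose default behavior fits this situation: requests that match no mock get a 404 and do not fail verification, so unrelated RPW traffic would simply be ignored. A rough sketch (the route is illustrative):

```rust
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

let server = MockServer::start().await;

// Expect exactly one deployment call; the expectation is verified when
// `server` is dropped at the end of the test.
Mock::given(method("PUT"))
    .and(path("/omicron-zones"))
    .respond_with(ResponseTemplate::new(204))
    .expect(1)
    .mount(&server)
    .await;

// Anything else (e.g. `GET /inventory` from the inventory-collection RPW)
// gets a 404 and does not affect verification.
```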

@jgallagher reopened this Sep 18, 2024