
crdb memory budget exceeded error when creating a large number of instances concurrently #5904

Closed
askfongjojo opened this issue Jun 15, 2024 · 18 comments · Fixed by #6551
Labels
known issue To include in customer documentation and training

Comments

@askfongjojo

askfongjojo commented Jun 15, 2024

While creating 70 instances concurrently using the same Terraform plan I have used many times before without any problem, I'm now consistently hitting a number of 500 errors in every run. The errors in the nexus log all look like the one below:

02:12:24.115Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "output": unexpected database error: root: memory budget exceeded: 7854088 bytes requested, 127439909 currently allocated, 134217728 bytes in budget
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0cd0e82/dropshot/src/server.rs:866
    latency_us = 1918719
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.17.42:54427
    req_id = 9c1272f5-2184-4f5b-b7c9-5b6b43adad1c
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/instances?project=fe0da422-5c48-4b52-8010-f2fc401f090f

Using this D script against the nexus process running on sled 8, I was able to capture the database error in question:

conn_id: f2bdf557-5a71-4a0f-9a98-83a50a11f1cd, latency: 2329 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) -- binds: [SagaId(9225d6c7-6369-40fa-b9cf-e3b698434a66), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 7854088 bytes requested, 133920542 currently allocated, 134217728 bytes in budget")}}}}, 2024-06-15T04:17:54.044368305Z, 65a11c18-7f59-41ac-b9e7-680627f996e7]'

That is, however, just an INSERT statement and didn't appear to be the culprit. I had a look at 3 different occurrences of the 500 errors and tried to pinpoint the more expensive queries running around the same time. Here is one of them (latency = 346542 us):

WITH found_row AS MATERIALIZED (
  SELECT 
    "vpc_subnet"."id", 
    "vpc_subnet"."name", 
    "vpc_subnet"."description", 
    "vpc_subnet"."time_created", 
    "vpc_subnet"."time_modified", 
    "vpc_subnet"."time_deleted", 
    "vpc_subnet"."vpc_id", 
    "vpc_subnet"."rcgen", 
    "vpc_subnet"."ipv4_block", 
    "vpc_subnet"."ipv6_block" 
  FROM 
    "vpc_subnet" 
  WHERE 
    (
      ("vpc_subnet"."id" = $1) 
      AND (
        "vpc_subnet"."time_deleted" IS NULL
      )
    ) FOR UPDATE
), 
dummy AS MATERIALIZED (
  SELECT 
    IF(
      EXISTS(
        SELECT 
          "id" 
        FROM 
          found_row
      ), 
      TRUE, 
      CAST(1 / 0 AS BOOL)
    )
), 
updated_row AS MATERIALIZED (
  UPDATE 
    "vpc_subnet" 
  SET 
    "rcgen" = "rcgen" + 1 
  WHERE 
    "id" IN (
      SELECT 
        "id" 
      FROM 
        found_row
    ) RETURNING 1
), 
inserted_row AS (
  INSERT INTO "network_interface" (
    "id", "name", "description", "time_created", 
    "time_modified", "time_deleted", 
    "kind", "parent_id", "vpc_id", "subnet_id", 
    "mac", "ip", "slot", "is_primary"
  ) WITH validated_interface(
    "vpc_id", "subnet_id", "parent_id", 
    "slot", "is_primary"
  ) AS (
    SELECT 
      CAST(
        IF(
          COALESCE(
            (
              SELECT 
                "vpc_id" 
              FROM 
                "network_interface" 
              WHERE 
                "time_deleted" IS NULL 
                AND "parent_id" = $2 
                AND "kind" = $3 
              LIMIT 
                1
            ), $4
          ) = $5, 
          $6, 
          'multiple-vpcs'
        ) AS UUID
      ) AS "vpc_id", 
      CAST(
        IF(
          EXISTS(
            SELECT 
              "subnet_id" 
            FROM 
              "network_interface" 
            WHERE 
              "id" != $7 
              AND "parent_id" = $8 
              AND "kind" = $9 
              AND "time_deleted" IS NULL 
              AND "subnet_id" = $10
          ), 
          'non-unique-subnets', 
          $11
        ) AS UUID
      ) AS "subnet_id", 
      (
        CAST(
          CASE COALESCE(
            (
              SELECT 
                CASE WHEN "active_propolis_id" IS NULL THEN "state" ELSE $12 END 
              FROM 
                "instance" 
              WHERE 
                "id" = $13 
                AND "time_deleted" IS NULL
            ), 
            $14
          ) WHEN $15 THEN $16 WHEN $17 THEN $18 WHEN $19 THEN $20 ELSE $21 END AS UUID
        )
      ), 
      (
        SELECT 
          COALESCE(
            (
              SELECT 
                $22 + "shift" AS "slot" 
              FROM 
                (
                  SELECT 
                    generate_series(0, $23) AS "index", 
                    generate_series(0, $24) AS "shift" 
                  UNION ALL 
                  SELECT 
                    generate_series($25, $26) AS "index", 
                    generate_series($27, -1) AS "shift"
                ) 
                LEFT OUTER JOIN "network_interface" ON (
                  "parent_id", "slot", "time_deleted" IS NULL
                ) = ($28, $29 + "shift", TRUE) 
              WHERE 
                "slot" IS NULL 
              ORDER BY 
                "index" 
              LIMIT 
                1
            ), 0
          )
      ), 
      (
        SELECT 
          NOT EXISTS(
            SELECT 
              1 
            FROM 
              "network_interface" 
            WHERE 
              "parent_id" = $30 
              AND "kind" = $31 
              AND "time_deleted" IS NULL 
            LIMIT 
              1
          )
      )
  ) 
  SELECT 
    $32 AS "id", 
    $33 AS "name", 
    $34 AS "description", 
    $35 AS "time_created", 
    $36 AS "time_modified", 
    $37 AS "time_deleted", 
    $38 AS "kind", 
    $39 AS "parent_id", 
    (
      SELECT 
        "vpc_id" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        "subnet_id" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        $40 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, $41) AS "index", 
            generate_series(0, $42) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($43, $44) AS "index", 
            generate_series($45, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ($46, $47 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "mac", 
    (
      SELECT 
        $48 + "shift" AS "ip" 
      FROM 
        (
          SELECT 
            generate_series(0, $49) AS "index", 
            generate_series(0, $50) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($51, $52) AS "index", 
            generate_series($53, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "subnet_id", "ip", "time_deleted" IS NULL
        ) = ($54, $55 + "shift", TRUE) 
      WHERE 
        "ip" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "ip", 
    (
      SELECT 
        "slot" 
      FROM 
        validated_interface
    ), 
    (
      SELECT 
        "is_primary" 
      FROM 
        validated_interface
    ) RETURNING "network_interface"."id", 
    "network_interface"."name", 
    "network_interface"."description", 
    "network_interface"."time_created", 
    "network_interface"."time_modified", 
    "network_interface"."time_deleted", 
    "network_interface"."kind", 
    "network_interface"."parent_id", 
    "network_interface"."vpc_id", 
    "network_interface"."subnet_id", 
    "network_interface"."mac", 
    "network_interface"."ip", 
    "network_interface"."slot", 
    "network_interface"."is_primary"
) 
SELECT 
  * 
FROM 
  inserted_row -- binds: [7612ed87-eff5-433f-b781-cf4cbeb75def, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, "91a91cab-2fbd-4c1e-a91f-4f2bcae5705b", f0c3f7b4-3190-43d8-b695-3ac56b995077, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, 7612ed87-eff5-433f-b781-cf4cbeb75def, "7612ed87-eff5-433f-b781-cf4cbeb75def", Vmm, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Destroyed, NoVmm, "aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b", Creating, "aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b", Destroyed, "no-instance", "bad-state", 0, 8, 8, 9, 8, 0, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, 0, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, Instance, f0c3f7b4-3190-43d8-b695-3ac56b995077, Name(Name("net0")), "net0", 2024-06-15T04:17:53.697381777Z, 2024-06-15T04:17:53.697381777Z, None, Instance, aa47bc6e-e1b9-4b3a-baa0-83f6e4b06c2b, MacAddr(MacAddr(MacAddr6([168, 64, 37, 248, 102, 112]))), 432527, 432527, 432528, 983039, -550512, 91a91cab-2fbd-4c1e-a91f-4f2bcae5705b, MacAddr(MacAddr(MacAddr6([168, 64, 37, 248, 102, 112]))), V4(Ipv4Network { addr: 192.168.128.5, prefix: 32 }), 249, 249, 250, 249, 0, 7612ed87-eff5-433f-b781-cf4cbeb75def, V4(Ipv4Network { addr: 192.168.128.5, prefix: 32 })]

(A grep of this query pattern shows the latency was as high as 1523657 us during the window when I was running terraform apply and tracing the database queries.)

The complete nexus log and db query tracing output are located in catacomb:/staff/dogfood/omicron-5904.

askfongjojo added this to the 9 milestone Jun 15, 2024
@askfongjojo
Author

The omicron commit deployed in the environment I used (rack2) was 90bc09ed0838dacb7a299b0ffbfd07feb4608ba7.

The last time I ran the same terraform test was on commit d79a51d57bdf324947275841ac849f2b37edff3a, and I didn't hit this issue.

@jgallagher
Contributor

I suspect this subquery is the (or part of the) problem:

    (
      SELECT 
        $40 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, $41) AS "index", 
            generate_series(0, $42) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series($43, $44) AS "index", 
            generate_series($45, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ($46, $47 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1
    ) AS "mac", 

After substituting bind parameters, this becomes:

      SELECT 
        184993468409456 + "shift" AS "mac" 
      FROM 
        (
          SELECT 
            generate_series(0, 432527) AS "index", 
            generate_series(0, 432527) AS "shift" 
          UNION ALL 
          SELECT 
            generate_series(432528, 983039) AS "index", 
            generate_series(-550512, -1) AS "shift"
        ) 
        LEFT OUTER JOIN "network_interface" ON (
          "vpc_id", "mac", "time_deleted" IS NULL
        ) = ('91a91cab-2fbd-4c1e-a91f-4f2bcae5705b', 184993468409456 + "shift", TRUE) 
      WHERE 
        "mac" IS NULL 
      ORDER BY 
        "index" 
      LIMIT 
        1

which has some pretty large generate_series() calls. Running just that query in isolation through EXPLAIN ANALYZE emits:

                                                        info
--------------------------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 431ms
  distribution: full
  vectorized: true
  rows read from KV: 117 (8.6 KiB)
  cumulative time spent in KV: 849µs
  maximum memory usage: 25 MiB
  network usage: 11 MiB (660 messages)

  • render
  │ nodes: n2
  │ actual row count: 1
  │ estimated row count: 1
  │
  └── • top-k
      │ nodes: n2
      │ actual row count: 1
      │ estimated row count: 1
      │ order: +"index"
      │ k: 1
      │
      └── • filter
          │ nodes: n1, n2
          │ actual row count: 982,923
          │ estimated row count: 1
          │ filter: mac IS NULL
          │
          └── • hash join (right outer)
              │ nodes: n1, n2
              │ actual row count: 983,040
              │ estimated max memory allocated: 50 MiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 20
              │ equality: (mac) = (column23)
              │
              ├── • render
              │   │ nodes: n1
              │   │ actual row count: 117
              │   │ KV time: 849µs
              │   │ KV contention time: 0µs
              │   │ KV rows read: 117
              │   │ KV bytes read: 8.6 KiB
              │   │ estimated max memory allocated: 20 KiB
              │   │ estimated row count: 45
              │   │
              │   └── • scan
              │         nodes: n1
              │         actual row count: 117
              │         KV time: 849µs
              │         KV contention time: 0µs
              │         KV rows read: 117
              │         KV bytes read: 8.6 KiB
              │         estimated max memory allocated: 20 KiB
              │         estimated row count: 45 (0.34% of the table; stats collected 3 days ago)
              │         table: network_interface@network_interface_vpc_id_mac_key (partial index)
              │         spans: [/'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' - /'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b']
              │
              └── • render
                  │ nodes: n2
                  │ actual row count: 983,040
                  │ estimated row count: 20
                  │
                  └── • union all
                      │ nodes: n2
                      │ actual row count: 983,040
                      │ estimated row count: 20
                      │
                      ├── • project set
                      │   │ nodes: n2
                      │   │ actual row count: 432,528
                      │   │ estimated row count: 10
                      │   │
                      │   └── • emptyrow
                      │         nodes: n2
                      │         actual row count: 1
                      │
                      └── • project set
                          │ nodes: n2
                          │ actual row count: 550,512
                          │ estimated row count: 10
                          │
                          └── • emptyrow
                                nodes: n2
                                actual row count: 1
(84 rows)


Time: 435ms total (execution 434ms / network 1ms)

Note the actual memory usage was 25 MiB and the estimated max memory was 50 MiB. Our total query memory budget is only ~128 MiB IIRC, so a few of these running concurrently would max us out. The 983039 value is the distance between MacAddr::MIN_GUEST_ADDR and MacAddr::MAX_GUEST_ADDR.
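
As a quick sanity check on where that row count comes from, the two generate_series() branches in the binds account for it exactly (plain arithmetic, runnable as SQL anywhere):

-- First branch:  generate_series(0, 432527)      -> 432,528 rows
-- Second branch: generate_series(432528, 983039) -> 550,512 rows
SELECT (432527 - 0 + 1)      AS first_branch_rows,
       (983039 - 432528 + 1) AS second_branch_rows,
       (432527 - 0 + 1) + (983039 - 432528 + 1) AS total_rows;  -- 983,040

Those 983,040 rows are what the hash join above has to process, and at ~25 MiB per query, roughly five such queries running concurrently would exhaust the ~128 MiB budget.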

@askfongjojo
Author

askfongjojo commented Jun 17, 2024

@jgallagher @bnaecker - Thank you for digging into this. As the logic for locating an available MAC address hasn't changed lately (or for a long time), it is possible that, because we have a lot more background processes running and making database queries, the expensive subquery has become more prone to hitting the memory budget limit.

I wonder if we need to re-evaluate the allocated memory budget, besides tuning this particular subquery.

cc @smklein

@bnaecker
Collaborator

I've been mulling on this for a bit. The next-item style queries all work the same way: join the existing entries with all possible entries, and take the first row where the existing entry is NULL. That is, select the first possible entry that's not already in the table.

This is fine, but obviously expensive, because we form the full set of all possible values. For things like MACs this is expensive, but for IPv6 addresses it would be literally impossible. In theory it should be possible to do this with a self-join instead, between the current set of values and the "next one", e.g., mac and mac + 1. That would at least limit the query to the size of the allocated entries, rather than these existing generate_series() queries, which consume space proportional to the entire possible range. It's strictly less memory, at least.
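
A minimal sketch of that idea (hypothetical and untested; the real query would also need the kind/parent filters and the tuple-equality join used above):

-- For each allocated mac in the VPC, propose mac + 1 as a candidate,
-- then keep the first candidate that is not itself already allocated.
-- (An empty table would need seeding with the range minimum; omitted here.)
SELECT candidate
FROM (
  SELECT mac + 1 AS candidate
  FROM network_interface
  WHERE vpc_id = $1 AND time_deleted IS NULL
) proposals
LEFT OUTER JOIN network_interface taken
  ON taken.vpc_id = $1 AND taken.time_deleted IS NULL AND taken.mac = proposals.candidate
WHERE taken.mac IS NULL
ORDER BY candidate
LIMIT 1;

The working set is then proportional to the number of allocated interfaces in the VPC rather than to the size of the whole address range.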

morlandi7 added the known issue To include in customer documentation and training label Jun 18, 2024
morlandi7 modified the milestones: 9, 10 Jul 11, 2024
morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@augustuswm
Contributor

augustuswm commented Sep 4, 2024

I am trying to do another round of concurrent VM provisioning testing on the colo with R10, but I am hitting this issue much more frequently. I'm not sure if that is related to the release itself or if it is due to having a lot more records with time_deleted set to non-null.

In previous R9 testing I was able to reach about 250 VM creations (with a concurrency of 25), but currently I am not able to get past 75, outside of a single run where I reduced concurrency to 10. Though that run may have just gotten lucky? Further attempts have failed.

Is there anything we want to test around the size of the network_interface table with regard to this? Or has the new instance lifecycle work changed the amount of memory that is available to be used here?

@augustuswm
Contributor

Anecdotal follow-up, but I am starting to see this when provisioning a single VM from the web console now as well. I'm not sure if it is exactly related to that subquery, though.

14:51:45.326Z DEBG 44c35566-dd64-4e4a-896e-c50aaa3df14f (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 132288856 currently allocated, 134217728 bytes in budget")}} })
    node_id = 5
    saga_id = 649ecf66-71a4-406d-9ba6-82cb8118d5f5
14:51:45.461Z WARN 44c35566-dd64-4e4a-896e-c50aaa3df14f (ServerContext): saga finished
    action_error_node_name = "output"
    action_error_source = ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 132288856 currently allocated, 134217728 bytes in budget")}} }
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1044
    result = failure
    saga_id = 649ecf66-71a4-406d-9ba6-82cb8118d5f5
    saga_name = instance-create
    sec_id = 44c35566-dd64-4e4a-896e-c50aaa3df14f
    undo_result = success
14:51:45.461Z INFO 44c35566-dd64-4e4a-896e-c50aaa3df14f (ServerContext): saga finished
    file = nexus/src/app/saga.rs:361
    saga_id = 649ecf66-71a4-406d-9ba6-82cb8118d5f5
    saga_name = instance-create
    saga_result = SagaResult { saga_id: 649ecf66-71a4-406d-9ba6-82cb8118d5f5, saga_log: SagaLog { saga_id: 649ecf66-71a4-406d-9ba6-82cb8118d5f5, unwinding: true, events: [N182 started, N182 succeeded, N000 started, N000 succeeded, N001 started, N001 succeeded, N002 started, N002 succeeded, N003 started, N003 succeeded, N004 started, N004 succeeded, N005 started, N005 failed, N004 undo_started, N004 undo_finished, N003 undo_started, N003 undo_finished, N002 undo_started, N002 undo_finished, N001 undo_started, N001 undo_finished, N000 undo_started, N000 undo_finished, N182 undo_started, N182 undo_finished], node_status: {0: UndoFinished, 1: UndoFinished, 2: UndoFinished, 3: UndoFinished, 4: UndoFinished, 5: Failed(ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 132288856 currently allocated, 134217728 bytes in budget")}} }), 182: UndoFinished} }, kind: Err(SagaResultErr { error_node_name: "output", error_source: ActionFailed { source_error: Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 132288856 currently allocated, 134217728 bytes in budget")}} }, undo_failure: None }) }
14:51:45.461Z INFO 44c35566-dd64-4e4a-896e-c50aaa3df14f (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "output": unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 132288856 currently allocated, 134217728 bytes in budget
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/06c8dab/dropshot/src/server.rs:902
    latency_us = 796337
    local_addr = 172.30.2.7:443
    method = POST
    remote_addr = 75.72.239.35:50424
    req_id = 25ac5702-1c7d-414c-ab34-8b4d078c0706
    response_code = 500
    uri = https://demo-load-test.sys.r3.oxide-preview.com/v1/instances?project=capacity-test

@augustuswm
Contributor

augustuswm commented Sep 4, 2024

OK, looking into this more, I think it is a different query, as this looks related to the saga executor state:

conn_id: 9654a441-5a07-4fec-8950-d428ad800b23, latency: 1860 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(e6b2c0c3-e4ca-4623-b112-1ab6af432e03), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133271896 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:39.971959891Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: 9853a47e-a733-4548-9c25-b80600ca80a6, latency: 2774 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(382488dc-2f15-4190-9c23-f1f4c21f4e82), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133466456 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:40.427465735Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: 9853a47e-a733-4548-9c25-b80600ca80a6, latency: 4456 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(6059e9af-8170-42a5-ba8e-b84bb608faa8), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133425496 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:40.800077478Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: 66a6c0f6-2295-4bb6-beb3-b6ee2406b51b, latency: 2117 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(f859c285-4973-4669-9543-e74039d4fbff), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133507416 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:41.959334519Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: 66a6c0f6-2295-4bb6-beb3-b6ee2406b51b, latency: 2290 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(e898510c-1c4f-449b-a0b8-a955c47783c2), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133497176 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:42.313041649Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: e8a0245e-0643-4e98-bc40-35f937e45e34, latency: 2021 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(a38b3fec-fd8c-47ee-8373-797b75073cfc), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133804376 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:43.428520565Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: e8a0245e-0643-4e98-bc40-35f937e45e34, latency: 2245 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(fd56b608-4e3f-478a-901d-7c7092ca3080), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 3932160 bytes requested, 133712216 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:43.845389668Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'
conn_id: ecefd999-b91c-4600-97b6-fc9468ef0ab9, latency: 6343 us, query: 'INSERT INTO "saga_node_event" ("saga_id", "node_id", "event_type", "data", "event_time", "creator") VALUES ($1, $2, $3, $4, $5, $6) ON CONFLICT ("saga_id", "node_id", "event_type") DO NOTHING -- binds: [SagaId(568443e7-f44e-4ec6-b976-51cca7fa163c), SagaNodeId(5), "failed", Object {"ActionFailed": Object {"source_error": Object {"InternalError": Object {"internal_message": String("unexpected database error: root: memory budget exceeded: 8119640 bytes requested, 126411776 currently allocated, 134217728 bytes in budget")}}}}, 2024-09-04T15:59:47.525018707Z, 44c35566-dd64-4e4a-896e-c50aaa3df14f]'

It is still memory exhaustion during instance create, though.

@davepacheco
Collaborator

Mentioning this because I misread that output at first: what you're seeing there are SQL INSERTs that are storing errors that happened during some previous query's execution. The errors didn't (necessarily) come from that INSERT. The errors would have come from some query that was executed during the execution of that saga node (e.g., saga id e6b2c0c3-e4ca-4623-b112-1ab6af432e03 node 5 -- it looks like node 5 for all of them so I'm guessing it's the same kind of saga).

It looks like they're trying to allocate ~4 MiB but there's already ~130 MiB allocated out of a pool about that size. If I recall correctly, sometimes it's not the query that produces the error that's actually the problem. 4 MiB is larger than I'd expect but it's also not that much memory. I expect the Cockroach docs have more information about tracking this down but I haven't had a chance to look.

@augustuswm
Contributor

augustuswm commented Sep 4, 2024

Thanks! That certainly helps me start sorting out some confusion I had around this. Let me cast a wider net on the queries I'm looking at.

Is there any reason we can think of for there being an affinity between memory exhaustion and specific silos? I have a test silo that I am operating in, and while I have no conclusive evidence, once I trigger the exhaustion, that silo seems to keep triggering it for minutes at a time. Meanwhile, the instance-create saga succeeds in other silos.

@davepacheco
Collaborator

There's an overview of managing this problem in the CockroachDB docs:
https://www.cockroachlabs.com/docs/stable/common-errors#memory-budget-exceeded

There's also a bunch more information here:
https://www.cockroachlabs.com/blog/memory-usage-cockroachdb/

I imagine we'll want to characterize how the memory is being used when this happens to decide if there are problematic queries or if we just need to give CockroachDB more memory.

@davepacheco
Collaborator

Is there any reason we can think of for there being an affinity between memory exhaustion and specific silos? I have a test silo that I am operating in, and while I have no conclusive evidence, once I trigger the exhaustion, that silo seems to keep triggering it for minutes at a time. Meanwhile, the instance-create saga succeeds in other silos.

The only thing that immediately comes to mind is if the Silo has different IP pools or other resources whose queries are more expensive for some reason? It's conceivable.

@davepacheco
Collaborator

This could be #5651?

@augustuswm
Contributor

augustuswm commented Sep 6, 2024

It is likely related, I think. I was testing against two different IP pools, and the results were similar. One pool was a /20 and the other a /22.

Going back to the affinity question: I need to do more testing, but I suspect it may be project/VPC/subnet related. I was not able to drill down past that.

I'll add instructions for running it, but https://github.com/oxidecomputer/up-down is what I was using for testing.

@askfongjojo
Author

askfongjojo commented Sep 6, 2024

The VPC subnet IP range also matters. I run into the CRDB memory budget error consistently (approx. 5 out of 120 instances provisioned with Terraform on the default parallelism setting, which is 10) when specifying the subnet 192.168.0.0/24. After changing it to 192.168.0.0/25, I haven't encountered the CRDB error when applying the same 120-instance TF plan.

To be clear, this can't be a workaround for customers. I'm just identifying the "offenders".

@bnaecker
Collaborator

bnaecker commented Sep 6, 2024 via email

@bnaecker
Collaborator

bnaecker commented Sep 9, 2024

I'm starting to look at this in earnest, and there seem to be two approaches we could take to reducing the memory consumption here:

  • The "self-join" trick I mentioned above, where we'd join the mac column with mac + 1, and take the first entry where mac + 1 doesn't have a corresponding mac in the table.
  • Limit the size of the query, and add a retry mechanism.

Note that the self-join approach would have to be applied to both the MAC and IP addresses, since both use the generate_series()-based NextItem query we have today. I'd probably change the NextItem query itself, or add a new version of it, to make this easier.

Self-join

The query for a self-join would look like this:

root@[fd00:1122:3344:105::3]:32221/omicron  OPEN> select next_mac from (select mac + 1 as next_mac from network_interface where vpc_id = '91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' and time_deleted is null) left outer join (select mac from network_interface where vpc_id = '91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' and time_deleted is null) on mac is null = true limit 1;
     next_mac
-------------------
  184993467953491
(1 row)


Time: 2ms total (execution 2ms / network 0ms)

That takes the next MAC (mac + 1) and does a left outer join with the existing MACs, finding those resulting entries where the mac is NULL, i.e., where mac + 1 doesn't already exist in the table.

In this particular case, the query consumes far less memory than the one using generate_series():

root@[fd00:1122:3344:105::3]:32221/omicron  OPEN> explain analyze select next_mac from (select mac + 1 as next_mac from network_interface where vpc_id = '91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' and time_deleted is null) left outer join (select mac from network_interface where vpc_id = '91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' and time_deleted is null) on mac is null = true limit 1;
                                                        info
--------------------------------------------------------------------------------------------------------------------
  planning time: 342µs
  execution time: 2ms
  distribution: local
  vectorized: true
  rows read from KV: 1 (77 B)
  cumulative time spent in KV: 1ms
  maximum memory usage: 50 KiB
  network usage: 0 B (0 messages)

  • limit
  │ nodes: n2
  │ actual row count: 1
  │ estimated max memory allocated: 10 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 1
  │ count: 1
  │
  └── • cross join (left outer)
      │ nodes: n2
      │ actual row count: 1
      │ estimated max memory allocated: 10 KiB
      │ estimated max sql temp disk usage: 0 B
      │ estimated row count: 1
      │
      ├── • render
      │   │ nodes: n2
      │   │ actual row count: 1
      │   │ estimated row count: 1
      │   │
      │   └── • limit
      │       │ nodes: n2
      │       │ actual row count: 1
      │       │ KV time: 1ms
      │       │ KV contention time: 0µs
      │       │ KV rows read: 1
      │       │ KV bytes read: 77 B
      │       │ estimated max memory allocated: 20 KiB
      │       │ estimated row count: 1
      │       │ count: 1
      │       │
      │       └── • render
      │           │ nodes: n2
      │           │ actual row count: 1
      │           │ KV time: 1ms
      │           │ KV contention time: 0µs
      │           │ KV rows read: 1
      │           │ KV bytes read: 77 B
      │           │ estimated max memory allocated: 20 KiB
      │           │ estimated row count: 95
      │           │
      │           └── • scan
      │                 nodes: n2
      │                 actual row count: 1
      │                 KV time: 1ms
      │                 KV contention time: 0µs
      │                 KV rows read: 1
      │                 KV bytes read: 77 B
      │                 estimated max memory allocated: 20 KiB
      │                 estimated row count: 1 - 95 (0.60% of the table; stats collected 5 days ago)
      │                 table: network_interface@network_interface_vpc_id_mac_key (partial index)
      │                 spans: [/'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b' - /'91a91cab-2fbd-4c1e-a91f-4f2bcae5705b']
      │
      └── • render
          │ nodes: n2
          │ actual row count: 0
          │ estimated row count: 0
          │
          └── • norows
                nodes: n2
                actual row count: 0
(70 rows)


Time: 4ms total (execution 3ms / network 1ms)

So that's 50 KiB, vs around 25 MiB for the one using generate_series(). This particular query is functionally different from the existing one, though, in that it allocates MACs sequentially. The existing query is random. I don't think that matters too much, but I could be wrong. We already allocate IP addresses sequentially.

This does change the next-item query a lot, and so is probably a good bit of work. I'm also less sure of the memory consumption profile.

Limit and retry

In this approach, we'd add some mechanism for creating an iterator over chunks of the search space, and then loop the query until we either (1) find an entry or (2) exhaust the search space. This is how we do the existing VNI allocation, using the VniSearchIter type here:

/// An iterator yielding sequential starting VNIs.
///
/// The VPC insertion query requires a search for the next available VNI, using
/// the `NextItem` query. We limit the search for each query to avoid memory
/// issues on any one query. If we fail to find a VNI, we need to search the
/// next range. This iterator yields the starting positions for the `NextItem`
/// query, so that the entire range can be searched in chunks until a free VNI is
/// found.
//
// NOTE: It's technically possible for this to lead to searching the very
// initial portion of the range twice. If we end up wrapping around so that the
// last position yielded by this iterator is `start - x`, then we'll end up
// searching from `start - x` to `start + (MAX_VNI_SEARCH_RANGE_SIZE - x)`, and
// so search those first few after `start` again. This is both innocuous and
// really unlikely.
#[derive(Clone, Copy, Debug)]
pub struct VniSearchIter {

We've successfully used this approach already, which is kind of nice. But it does involve more machinery, since we need to add an iterator for both the MAC and IP addresses, and then figure out which one to advance based on how the query fails. It also requires picking an arbitrary size for these ranges, which isn't needed for the self-join approach.
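
For concreteness, a single attempt under that scheme would look like today's MAC subquery but with the series bounded to one chunk (a sketch only; the 65,536-candidate chunk size and placeholders are illustrative, and an application-side iterator analogous to VniSearchIter would supply the next starting offset on each retry):

-- $1 = the chunk's starting MAC, $2 = vpc_id; search at most 65,536 candidates.
SELECT $1 + "shift" AS "mac"
FROM (SELECT generate_series(0, 65535) AS "index",
             generate_series(0, 65535) AS "shift")
LEFT OUTER JOIN "network_interface"
  ON ("vpc_id", "mac", "time_deleted" IS NULL) = ($2, $1 + "shift", TRUE)
WHERE "mac" IS NULL
ORDER BY "index"
LIMIT 1;

Each query's memory is then bounded by the chunk size, at the cost of up to range-size / chunk-size round trips when the space is nearly full.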

@andrewjstone
Contributor

This does change the next-item query a lot, and so is probably a good bit of work. I'm also less sure of the memory consumption profile.

Is it easier to write this in SQL than modifying the next-item query? If so, maybe use the escape hatch.

@bnaecker
Collaborator

bnaecker commented Sep 9, 2024

The next-item query does already dip into raw SQL, though via Diesel's QueryFragment trait. This all predates the newer query-builder stuff Sean added, unfortunately. I think the query itself is pretty straightforward, though; I'm doing some tests now.

bnaecker added a commit that referenced this issue Sep 10, 2024
- Adds the `NextItemSelfJoined` query, which implements the next-item
  query as a self-join. The current implementation joins the target
  table with the series of all possible values for the target item.
  Because CRDB eagerly evaluates subqueries, this entire list must be
  buffered in memory. The self-join version instead joins the existing
  table with the literal next item (item + 1), and finds the lowest one
  that's not allocated yet. Some initial testing suggests this can
  consume much less memory and run much faster than the previous query
  implementations. As a tradeoff, this new query cannot be used to
  randomly select _any_ item in a range -- only the next, lowest value
  can be selected. Note that this changes the way we allocate MAC
  addresses, which were previously random.
- Fixes #5904