
[DPE-4114] Test: Scale to zero units #347

Open
wants to merge 27 commits into base: main
Conversation

@BalabaDmitri (Contributor) commented Feb 6, 2024

Issue #445

Solution

Test coverage of the following cases:

  1. Scaling with and without storage.
  2. Whether writes keep increasing.
  3. The shutdown of units after scaling to zero.
  4. Scaling up to 1 unit, attaching its own storage, and checking the data.
  5. Scaling up to 2 units, attaching another cluster's storage.
  6. The charm becoming blocked.
  7. Removing the unit that has another cluster's storage.
  8. Scaling up to 3 units without storage and checking the data.
  9. Checking that the provided storage has been reused.

Cases not covered by the tests:

  1. Scaling to 2 units with different PostgreSQL versions.

Implementation

  1. Handling behavior when using another cluster's storage.
  2. Handling behavior when using different PostgreSQL versions.
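A minimal sketch of the scale-to-zero-and-back flow exercised by these cases is shown below. It assumes the repository's existing integration helpers (add_unit_with_storage, check_writes) plus a storage_id helper for reading a unit's storage identifier; the test name, import paths, and exact waits are illustrative assumptions, not the final test code.

# Rough sketch only: add_unit_with_storage and check_writes mirror helpers
# referenced in this PR; storage_id and the import paths are assumptions.
import pytest
from pytest_operator.plugin import OpsTest

from .helpers import add_unit_with_storage, check_writes, storage_id  # assumed locations

APP_NAME = "postgresql"


@pytest.mark.abort_on_fail
async def test_scale_to_zero_units(ops_test: OpsTest) -> None:
    # Remember which storage each unit uses so it can be re-attached later.
    storage_id_list = [
        storage_id(ops_test, unit.name)
        for unit in ops_test.model.applications[APP_NAME].units
    ]

    # Scale the database down to zero units and wait until they are gone.
    for unit in list(ops_test.model.applications[APP_NAME].units):
        await ops_test.model.destroy_unit(unit.name)
    await ops_test.model.block_until(
        lambda: len(ops_test.model.applications[APP_NAME].units) == 0, timeout=3000
    )

    # Scale back up, re-attaching the previously used storage, and verify writes.
    for store_id in storage_id_list:
        await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
    await check_writes(ops_test)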


codecov bot commented Feb 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.94%. Comparing base (0f2c8c2) to head (9d7bfed).
Report is 24 commits behind head on main.

Current head 9d7bfed differs from pull request most recent head e670781

Please upload reports for the commit e670781 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #347      +/-   ##
==========================================
- Coverage   80.31%   79.94%   -0.37%     
==========================================
  Files          10       10              
  Lines        2301     2169     -132     
  Branches      376      344      -32     
==========================================
- Hits         1848     1734     -114     
+ Misses        369      368       -1     
+ Partials       84       67      -17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@taurus-forever (Contributor) left a comment:

@BalabaDmitri Thank you for the contribution. We need to add a ticket number and a PR body that better describes the what, where, and why.

Re: PR. It is overall a good start!!! We need to cover all the scaling cases:

Re-scaling back with and without a drive. E.g. the default hostpath storage will remove the drive, so re-scaling will start on an empty disk. At the same time, btrfs will keep the disk available, so re-scaling will happen with a disk (either the default one or a manually provided one), which could also be wrongly provided...

I mean we have those cases:

  • The user wants to restore the same storage: remove-unit postgresql/3 && add-unit -n 2 should reuse the old storage since the user didn't specify one (it can be a new disk, e.g. hostpath). => Should be OK: a long SST/clone process for the 2nd+ node, but 0->1 is a fresh DB init since there is no disk.
  • The user wants to provide specific, correct storage. => Should be OK automatically, just a fast resilvering.
  • The user made a mistake and specified the wrong storage (another cluster). => The charm must be blocked. Do NOT corrupt foreign disks!!!
  • The user made a mistake and specified the wrong storage (empty/corrupted/no PostgreSQL data on it). => Block to be on the safe side; force the user to remove the disk and re-scale without any disks.
  • ...

I believe we should test all those cases above.

P.S. The charm should be safe and block bootstrap if it finds a foreign cluster on the disk, etc. (a probable improvement is required here).

P.P.S. On top of this: the disk can come from a different charm revision OR a different PostgreSQL version. We need to test that too. Whether to block or accept is a question for @7annaba3l :-D


# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
A contributor commented:

@marceloneppel

JFMI, should we use the ops lib directly, since the helper refers to this issue, which is now resolved:

    Note: this function exists as a temporary solution until this issue is resolved:
    https://github.com/juju/python-libjuju/issues/695

A member replied:

Yes. We should remove the workaround and use the methods provided by the lib.

A contributor replied:

IIRC this is only available in libjuju 3
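If the workaround is dropped, the direct call could look roughly like the sketch below. It assumes libjuju 3 exposes an attach_storage argument on Application.add_unit, per the linked python-libjuju issue; the parameter name and availability should be verified against the libjuju version the project pins.

# Hedged sketch of replacing the add_unit_with_storage helper with a direct
# libjuju call. attach_storage is assumed to exist in libjuju 3, as discussed
# in https://github.com/juju/python-libjuju/issues/695.
async def add_unit_reusing_storage(ops_test, app: str, storage: str):
    application = ops_test.model.applications[app]
    # Ask Juju to create one new unit and attach the existing storage to it.
    await application.add_unit(count=1, attach_storage=[storage])
    await ops_test.model.wait_for_idle(apps=[app], status="active", timeout=1000)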

@taurus-forever (Contributor) commented:

BTW, @BalabaDmitri, you need to sign the CLA and fix the lint tests here:

lint: commands[1]> poetry run codespell /home/runner/work/postgresql-operator/postgresql-operator/src /home/runner/work/postgresql-operator/postgresql-operator/tests
lint: commands[2]> poetry run ruff /home/runner/work/postgresql-operator/postgresql-operator/src /home/runner/work/postgresql-operator/postgresql-operator/tests
tests/integration/ha_tests/test_self_healing.py:4:1: I001 [*] Import block is un-sorted or un-formatted
tests/integration/ha_tests/test_self_healing.py:604:55: F841 [*] Local variable `e` is assigned to but never used
tests/integration/ha_tests/test_self_healing.py:618:33: W292 [*] No newline at end of file
Found 3 errors.
[*] 3 fixable with the `--fix` option.

Please check https://github.com/canonical/postgresql-operator/blob/main/CONTRIBUTING.md

channel="edge",
)

# Deploy the continuous writes application charm if it wasn't already deployed.
A member commented:

This part can be removed, as the continuous writes application is already deployed by test_build_and_deploy.

)

if wait_for_apps:
    await ops_test.model.wait_for_idle(status="active", timeout=3000)
A member commented:

After the above comment is handled, this line can be moved close to the deployment of the PostgreSQL application.


# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
A member commented:

Yes. We should remove the workaround and use the methods provided by the lib.


# Scale the database to one unit.
logger.info("scaling database to one unit")
await add_unit_with_storage(ops_test, storage=primary_storage, app=APP_NAME)
A member commented:

After the unit starts, we should check whether the data on the storage has actually been restored.
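One possible shape for that check, as a rough sketch: connect to the re-added unit and confirm that the continuous-writes table still contains the rows written before scaling down. The table and column names, the operator user, and the way credentials and the host are obtained are assumptions for illustration; the repository's existing helpers should be preferred where they exist.

# Rough sketch of a post-restore data check (table/column names and the
# connection details are assumptions, not the repo's actual helpers).
import psycopg2


def count_restored_rows(host: str, password: str) -> int:
    """Return the number of rows currently in the continuous_writes table."""
    connection = psycopg2.connect(
        f"dbname='postgres' user='operator' host='{host}' password='{password}'"
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT COUNT(number) FROM continuous_writes;")
            return cursor.fetchone()[0]
    finally:
        connection.close()


# In the test, data written before scaling down should still be present:
# assert count_restored_rows(unit_ip, password) >= writes_before_scale_down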

# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
await check_writes(ops_test)
A member commented:

After the 2nd and 3rd units start, we need to check that the data on them was restored from WAL (not via backup/restore).
Maybe @dragomirp or @marceloneppel know how to check this.

@taurus-forever (Contributor) replied:

Discussed this today on the sync. For the record:

  • Our motto: never destroy customer data!
  • The goal of this test is to block the charm IF the user made a mistake and re-scaled the application with the wrong storage. IF the user wants to re-deploy the application using foreign storage, that is possible (with a warning in the logs), but re-scaling up with foreign storage must be impossible.

Some examples in my mind:

  • Deploy 3 units, generate data, create a backup, generate more data.
  • Scale to 2 units, change data, scale up to 3 units (with the 3rd unit's dirty storage) => the cluster member should join fine.
  • Scale to 2 units, restore the backup, scale up to 3 units (with the 3rd unit's dirty storage) => the cluster member should be blocked, as Patroni will see the data but will not be able to join the cluster. Fix: scale down, remove the 3rd storage, scale up (automated re-silvering is used). All OK.
  • Scale to 2 units, change the 3rd storage to a foreign cluster's, scale up => the unit must be blocked as foreign data is detected. Better safe than sorry.
  • ...

@BalabaDmitri changed the title from deployment test "zero-units" to [DPE-4114] Test: Scale to zero units on Apr 17, 2024
@BalabaDmitri (Contributor, Author) commented:

@marceloneppel Can you run the actions?

tests/integration/helpers.py (outdated review thread, resolved)
Comment on lines 874 to 875
async def get_db_connection(ops_test, dbname, is_primary=True, replica_unit_name=""):
unit_name = await get_primary(ops_test, APP_NAME)
A member commented:

You may add type hints for the returned values and a docstring to make the output even easier to understand. The same also applies to the other functions you created in this file.
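A rough sketch of how get_db_connection could look with type hints and a docstring, assuming it builds and returns a psycopg2 connection string for either the primary or a given replica. The return shape and the get_password/get_unit_address calls are illustrative assumptions, since only the first lines of the helper are shown above.

# Sketch only: the body is a guess at the helper's intent; get_password and
# get_unit_address are assumed helpers in tests/integration/helpers.py.
from pytest_operator.plugin import OpsTest


async def get_db_connection(
    ops_test: OpsTest,
    dbname: str,
    is_primary: bool = True,
    replica_unit_name: str = "",
) -> tuple[str, str]:
    """Build a connection string for the requested database.

    Args:
        ops_test: the ops_test fixture for the current model.
        dbname: name of the database to connect to.
        is_primary: target the primary unit (default) or a specific replica.
        replica_unit_name: unit to target when is_primary is False.

    Returns:
        A tuple of (connection_string, unit_name) for the selected unit.
    """
    unit_name = await get_primary(ops_test, APP_NAME)
    if not is_primary:
        unit_name = replica_unit_name
    password = await get_password(ops_test, APP_NAME)  # assumed existing helper
    host = get_unit_address(ops_test, unit_name)  # assumed existing helper
    connection_string = (
        f"dbname='{dbname}' user='operator' host='{host}' password='{password}'"
    )
    return connection_string, unit_name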

src/charm.py (outdated review thread, resolved)
tests/integration/ha_tests/test_self_healing.py (outdated review thread, resolved)
@taurus-forever (Contributor) left a comment:

LGTM, this is something we were missing. Tnx!

logger.info("database scaling up to two units using third-party cluster storage")
new_unit = await add_unit_with_storage(
ops_test, app=app, storage=second_storage, is_blocked=True
)
A contributor commented:

nit: IMHO, it is worth checking that we are blocked with the right message (foreign disk).
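A rough sketch of that assertion, assuming add_unit_with_storage returns the new libjuju Unit and that the charm's blocked message mentions the other cluster (the exact message text is an assumption and should match the charm's actual status string):

# Hedged sketch: verify both the blocked state and the reason for it.
assert new_unit.workload_status == "blocked"
assert "another cluster" in new_unit.workload_status_message.lower()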

@BalabaDmitri (Contributor, Author) commented:

@dragomirp Can you check?

@dragomirp (Contributor) commented:

Hi @BalabaDmitri, can you sync with main again to retrigger the CI and pick up some fixes? Sorry for asking again.

@dragomirp (Contributor) commented:

Hi, @BalabaDmitri, the new test seems to frequently fail with:

unit-postgresql-2: 14:28:15 ERROR unit.postgresql/2.juju-log database-peers:2: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-2/charm/./src/charm.py", line 1688, in <module>
    main(PostgresqlOperatorCharm)
  File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/ops/main.py", line 506, in _emit
    self.framework.reemit()
  File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/ops/framework.py", line 861, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-postgresql-2/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-postgresql-2/charm/./src/charm.py", line 393, in _on_peer_relation_departed
    member_ip = self._patroni.get_member_ip(departing_member)
  File "/var/lib/juju/agents/unit-postgresql-2/charm/./src/charm.py", line 728, in _patroni
    self._peer_members_ips,
  File "/var/lib/juju/agents/unit-postgresql-2/charm/./src/charm.py", line 758, in _peer_members_ips
    addresses = self.members_ips
  File "/var/lib/juju/agents/unit-postgresql-2/charm/./src/charm.py", line 779, in members_ips
    return set(json.loads(self._peers.data[self.app].get("members_ips", "[]")))
AttributeError: 'NoneType' object has no attribute 'data' 

E.g. here
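The traceback shows members_ips dereferencing self._peers after the peer relation has gone away while scaling to zero. A defensive guard along the lines of the sketch below would avoid the crash; this is a possible shape for a fix, not necessarily the change made in this PR.

@property
def members_ips(self) -> set:
    """Return the set of member IPs stored in the peer relation data.

    Sketch of a possible guard: while the application scales to zero the peer
    relation can disappear, so self._peers may be None here.
    """
    if self._peers is None:
        return set()
    return set(json.loads(self._peers.data[self.app].get("members_ips", "[]")))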

I've also seen it fail the assertion for writes continuing:

Traceback (most recent call last):
  File "/home/runner/work/postgresql-operator/postgresql-operator/tests/integration/ha_tests/helpers.py", line 110, in are_writes_increasing
    more_writes[member] > count
AssertionError: test.postgresql-4: writes not continuing to DB (current writes: 612 - previous writes: 612)
line 529, in inner
    _loop.run_until_complete(task)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/work/postgresql-operator/postgresql-operator/tests/integration/ha_tests/test_self_healing.py", line 631, in test_deploy_zero_units
    await are_writes_increasing(ops_test)
  File "/home/runner/work/postgresql-operator/postgresql-operator/tests/integration/ha_tests/helpers.py", line 101, in are_writes_increasing
    for attempt in Retrying(stop=stop_after_delay(60 * 3), wait=wait_fixed(3)):
  File "/home/runner/work/postgresql-operator/postgresql-operator/.tox/integration/lib/python3.10/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/runner/work/postgresql-operator/postgresql-operator/.tox/integration/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()

E.g. here

Please take another look.

Labels: None yet
Projects: None yet
5 participants