Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] REFRESH Materialized view (non-concurrent) #10510

Closed
fizaaluthra opened this issue Nov 5, 2021 · 1 comment
Closed

[YSQL] REFRESH Materialized view (non-concurrent) #10510

fizaaluthra opened this issue Nov 5, 2021 · 1 comment
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) pgcm

Comments

@fizaaluthra
Copy link
Member

Description

Issue for tracking REFRESH MATERIALIZED VIEW. By default, the refresh is non-concurrent.

@fizaaluthra fizaaluthra added the area/ysql Yugabyte SQL (YSQL) label Nov 5, 2021
@fizaaluthra fizaaluthra self-assigned this Nov 5, 2021
fizaaluthra pushed a commit that referenced this issue Dec 10, 2021
…TERIALIZED VIEW

Summary:
`CREATE MATERIALIZED VIEW`:
- Created like any other YB relation.

`REFRESH MATERIALIZED VIEW`:

- In vanilla PG, a non-concurrent refresh on the materialized view entails the following:
   #  Load the new matview data into a transient relation.
   #  Refresh the matview by swapping it with the transient relation. The swap is done by swapping the `relfilenode` of the two relations. (Therefore the `relfilenode` of the matview is now the `OID` of the transient relation).
   #  Reindex the new matview.

- In YB mode we do the following:
  # Load the new matview data into a transient relation, and swap the `relfilenode` (as in vanilla PG).
  # We implement a `YBGetStorageRelid` utility function that returns, if valid, the `relfilenode` (which will be different from the `OID` if there has been a swap) or the `OID` itself.
  # Any subsequent scans/drops on the matview will now use `YBGetStorageRelid` to retrieve the `OID` of the appropriate relation in DocDB.
  # We reindex the matview by dropping and recreating the matview's indexes.

`REFRESH MATERIALIZED VIEW CONCURRENTLY`:

- In PG, a concurrent refresh entails the following:
  # Ensure there's a unique index on the matview.
  # Load the new data into a temporary table.
  # Ensure there are no duplicate rows without any null values (i.e., duplicate rows are only allowed when at least one column value is null).
  # Create a temporary diff table which tells us what rows are present in the old data and not in the new data and vice versa.
       - The diff table is computed as a full join of the old matview and the new temporary table on the unique index.
  # Delete rows in the matview where row.ctid = any old matview ctid in diff where the newdata is null. (i,e,, if the join did not find a match for this row amidst the new data, this row has been deleted).
  # Insert rows from the new data in the diff table where old matview ctid is null. (i,e., if the join did not find a match amidst the old data, this row has been newly inserted).

- In YB mode:
  # We do essentially the same thing, except that since YB doesn't have ctids we use all the column values instead.
  # Additionally, since there aren't any ctids, if the new data has rows with **all** null columns, the joined diff table will have duplicate entries for such rows.
  # Therefore, in YB mode, we disallow rows with all null columns.

`DROP MATERIALIZED VIEW`:
- The syscatalog tuples are dropped like any other YB relation.
- `YBGetStorageRelid` is used to drop the appropriate relation in DocDB.

Test Plan:
Jenkins: urgent, test regex: .*TestPgRegressFeature.*|.*TestPgRegressExtension.*|.*TestPgRegressPartitions.*

Run the `yb_feature_matview` test using the `TestPgRegressFeature` test suite.

Reviewers: myang, mihnea

Reviewed By: myang, mihnea

Subscribers: bogdan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13792
fizaaluthra pushed a commit that referenced this issue Jan 6, 2022
Summary:
In YB mode, we restrict matviews from having rows with all null values when performing concurrent refreshes. Previously, we performed a full scan of the data to check for completely null rows, which leads to a performance hit. This diff performs the check for completely null rows on the computed diff table instead (which is a much smaller table).

Performance test:
Tested on centOS 7 machine with Intel Broadwell CPU.
```
CREATE TABLE base ... ; // 100 MB dataset
CREATE MATERIALIZED VIEW test AS SELECT * FROM base;
CREATE UNIQUE INDEX ON test ... // Unique index required for concurrent refreshes
INSERT a few rows INTO base;
REFRESH MATERIALIZED VIEW CONCURRENTLY base;
```
The above snippet gets stuck on the scan of the temporary table with the new data for at least 25 minutes in the current version of the code.
With this diff, it takes about 3 minutes for the refresh to execute.

Test Plan:
Run yb_feature_matview in TestPgRegressFeature

Jenkins: urgent

Reviewers: mihnea

Reviewed By: mihnea

Subscribers: zyu, smishra, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D14596
@fizaaluthra
Copy link
Member Author

Implemented in commit f69988a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) pgcm
Projects
None yet
Development

No branches or pull requests

2 participants