feat: composite primary key -> unique Collection id (Attempt #2) (#2372)

* feat: migration - use unique id for Collections This migration abandons the composite primary key (id, visibility) approach for the project / Collections table and enforces a unique id. Going forward, revisions will be identified by their unique id rather than by the id of their public-facing Collection coupled with visibility=PRIVATE. At the time of this commit, there are no revisions in progress in the corpora_prod database. * fix: prefix revision id comment with seq num Co-authored-by: Kuni Katsuya <kkatsuya@chanzuckerberg.com> * fix: update visibility to PRIVATE when downgrading * feat: add 'revision' relationship to DbCollection This commit also refactors foreign keys to inline * refactor: improve foreign key ref column notations Referencing `id` within DbCollection for `revision_of` by using `project.id` means that `id` doesn't have to be redefined (already in Base class but interpreter otherwise sees the builtin id method) Co-authored-by: Trent Smith <1429913+Bento007@users.noreply.github.com> * fix: add unique constraint to project.revision_of There should only ever be a single revision open at one time for a given public Collection * feat!: backend changes for Collection primary key migration BREAKING_CHANGE: This commit is a temporary stopping point and represents the migration of all backend service code from using the composite key for Collections (`id`, `visibility`) in the `project` table to using a unique `id` as the primary key for every Collection. * fix: move a commit operation plus refactoring Move the commit for Collection revision creation to after the creation of individual Dataset revisions. Remove unnecessary comments and add Session type hint to function definitions. * refactor: revert format for wmg example in corpora-api spec * fix: update unit tests and work in migration * Return revison_of when listing collections * fix tests.unit.processing_container.test_process.TestDatasetProcessing * feat: remove visibility from url's and api requests (#2197) * add revision_off attribute * alter queries and mutations * change query/mutation callsites * revert to passing only attributes and not whole obj * 7 Levels of prop drilling 💀 * retype new file mutation * fix: rename migration * fix: add a cognitive-complexity lint skip * fix: 'collections' path for legacy links * remove private redirects * fix cognitive complexity * handle new revision creation flow * handle collections with new `revision_of` * fix typo in collection link * fix: remove window location manipulation * fix: DatasetUploadStatus import bug fix * properly assign key id * debugging logs * one more log * redirect properly on revision publish * remove non-existant attr * first past of revision/published id links * first past of revision/published id links * remove unneeded revision_of checks * fix diffing * don't redirect if private publish * handle private collection publish * add handled success state * invalidate collection on private publish * refresh on publish * one more ignored collection field for diff * possibly fix edit collection issue * fix: add `id` to ignored fields (#2240) * add logs * do not check collection id for diffs * Revert "add logs" This reverts commit d177ef1. Co-authored-by: Daniel Hegeman <daniel.hegeman@chanzuckerberg.com> Co-authored-by: Trent Smith <1429913+Bento007@users.noreply.github.com> * fix call to test snapshot module in setup_dev_data.sh * incorporate removal of DatasetArtifactType * pkey migration changes in wmg tests * lint cleanup * clean up migration * incorporate cloudfront invalidation test * more cleanup * cleanup, docstring * fix import scheme to prevent circular import error * chore: attempt to fix circular import Unit tests are passing locally but not in Github Actions * fix module access typo * another import fix attempt * Revert "another import fix attempt" This reverts commit ff55991. * Revert "fix module access typo" This reverts commit 70ae87d. * Revert "chore: attempt to fix circular import" This reverts commit 01e712a. * Revert "fix import scheme to prevent circular import error" This reverts commit 70fbcd0. * Move CollectionLink to own file * fix: black version bump to be compatible with click * fix accidental blank lines Co-authored-by: Kuni Katsuya <kkatsuya@chanzuckerberg.com> Co-authored-by: Trent Smith <1429913+Bento007@users.noreply.github.com> Co-authored-by: Trent Smith <trent.smith@chanzuckerberg.com> Co-authored-by: Severiano Badajoz <sbadajoz@chanzuckerberg.com>
chanzuckerberg · Apr 15, 2022 · 3570af7 · 3570af7
1 parent b52dc1a
commit 3570af7
Show file tree

Hide file tree

Showing 53 changed files with 900 additions and 626 deletions.
diff --git a/backend/config/corpora-api.yml b/backend/config/corpora-api.yml
@@ -19,12 +19,12 @@ paths:
         - wmg
       summary: Get WMG cell types (WMG FEATURE PROTOTYPE)
       security:
-        - cxguserCookieLenient: [ ]
-        - { }
+        - cxguserCookieLenient: []
+        - {}
       description: >-
         JSON of all genes
       operationId: backend.corpora.lambdas.api.v1.wmg.get_cell_types
-      parameters: [ ]
+      parameters: []
       responses:
         "200":
           description: OK
@@ -50,12 +50,12 @@ paths:
         - wmg
       summary: Get WMG genes (WMG FEATURE PROTOTYPE)
       security:
-        - cxguserCookieLenient: [ ]
-        - { }
+        - cxguserCookieLenient: []
+        - {}
       description: >-
         JSON of all cell types
       operationId: backend.corpora.lambdas.api.v1.wmg.get_genes
-      parameters: [ ]
+      parameters: []
       responses:
         "200":
           description: OK
@@ -81,18 +81,18 @@ paths:
         - wmg
       summary: Get WMG gene expression for a given gene name (WMG FEATURE PROTOTYPE)
       security:
-        - cxguserCookieLenient: [ ]
-        - { }
+        - cxguserCookieLenient: []
+        - {}
       description: >-
         JSON of gene expression for a given gene
       operationId: backend.corpora.lambdas.api.v1.wmg.get_gene_expression
       parameters:
-      - name: gene_name
-        in: path
-        required: true
-        schema:
-          type: string
-        description: The name of the gene
+        - name: gene_name
+          in: path
+          required: true
+          schema:
+            type: string
+          description: The name of the gene
       responses:
         "200":
           description: OK
@@ -235,7 +235,6 @@ paths:
       operationId: backend.corpora.lambdas.api.v1.collection.get_collection_details
       parameters:
         - $ref: "#/components/parameters/path_collection_uuid"
-        - $ref: "#/components/parameters/query_visibility"
       responses:
         "200":
           description: OK
@@ -338,7 +337,6 @@ paths:
       operationId: backend.corpora.lambdas.api.v1.collection.delete_collection
       parameters:
         - $ref: "#/components/parameters/path_collection_uuid"
-        - $ref: "#/components/parameters/query_visibility"
       responses:
         "204":
           $ref: "#/components/responses/204"
@@ -396,7 +394,7 @@ paths:
       description: >-
         An authenticated user can upload a file from a shared link to replace an existing dataset in their collection.
 
-        The existing dataset must be in an error start or completely uploaded before a replacement can be attempted.
+        The existing dataset must be in an error state or completely uploaded before a replacement can be attempted.
         Once the upload of the replacement dataset is complete, the previous version of the dataset is removed and is
         no longer available for downloading through this collection. The replacement datasets are not publicly available
         until the collection revision is published. The old datasets are available in the public collection until the
@@ -1000,7 +998,7 @@ paths:
       tags:
         - wmg
       operationId: backend.wmg.api.v1.primary_filter_dimensions
-      parameters: [ ]
+      parameters: []
       responses:
         "200":
           description: OK
@@ -1026,7 +1024,7 @@ paths:
       tags:
         - wmg
       operationId: backend.wmg.api.v1.query
-      parameters: [ ]
+      parameters: []
       requestBody:
         content:
           application/json:
@@ -1205,7 +1203,6 @@ paths:
                       ethnicity_terms:
                         $ref: "#/components/schemas/wmg_ontology_term_id_label_list"
 
-
 components:
   schemas:
     user_uuid:
@@ -1362,7 +1359,6 @@ components:
         - name
         - revision
         - collection_id
-        - collection_visibility
       properties:
         id:
           $ref: "#/components/schemas/dataset_uuid"
@@ -1401,8 +1397,6 @@ components:
             $ref: "#/components/schemas/dataset_asset"
         collection_id:
           $ref: "#/components/schemas/collection_uuid"
-        collection_visibility:
-          $ref: "#/components/schemas/visibility"
         is_primary_data:
           $ref: "#/components/schemas/is_primary_data"
         cell_type:
@@ -1513,7 +1507,6 @@ components:
       type: string
       format: uuid
 
-
   parameters:
     path_collection_uuid:
       name: collection_uuid

diff --git a/backend/corpora/common/corpora_orm.py b/backend/corpora/common/corpora_orm.py
@@ -3,12 +3,12 @@
 from datetime import datetime
 from sqlalchemy import (
     Boolean,
+    CheckConstraint,
     Column,
     DateTime,
     Enum,
     Float,
     ForeignKey,
-    ForeignKeyConstraint,
     Integer,
     String,
     UniqueConstraint,
@@ -216,7 +216,7 @@ class DbCollection(Base, AuditMixin, TimestampMixin):
     # the tablename is "project" instead of "collection" to avoid migrating the database
     __tablename__ = "project"
 
-    visibility = Column(Enum(CollectionVisibility), primary_key=True, nullable=False)
+    visibility = Column(Enum(CollectionVisibility), nullable=False)
     owner = Column(StrippedString, nullable=False)
     name = Column(StrippedString)
     description = Column(StrippedString)
@@ -226,12 +226,16 @@ class DbCollection(Base, AuditMixin, TimestampMixin):
     data_submission_policy_version = Column(StrippedString, nullable=True)
     tombstone = Column(Boolean, default=False, nullable=False)
     publisher_metadata = Column(JSON, nullable=True)
+    revision_of = Column(String, ForeignKey("project.id"), nullable=True, unique=True)
 
     # Relationships
+    revision = relationship("DbCollection", cascade="all, delete-orphan", uselist=False)
     links = relationship("DbProjectLink", back_populates="collection", cascade="all, delete-orphan")
     datasets = relationship("DbDataset", back_populates="collection", cascade="all, delete-orphan")
     genesets = relationship("DbGeneset", back_populates="collection", cascade="all, delete-orphan")
 
+    __table_args__ = (CheckConstraint("revision_of IS NULL OR visibility = 'PRIVATE'", "ck_project_private_revision"),)
+
 
 class DbProjectLink(Base, AuditMixin):
     """
@@ -241,21 +245,14 @@ class DbProjectLink(Base, AuditMixin):
     # the tablename is "project_link" instead of "collection_link" to avoid migrating the database
     __tablename__ = "project_link"
 
-    collection_id = Column(String, nullable=False)
-    collection_visibility = Column(Enum(CollectionVisibility), nullable=False)
+    collection_id = Column(String, ForeignKey("project.id"), nullable=False)
     link_name = Column(StrippedString)
     link_url = Column(StrippedString)
     link_type = Column(Enum(CollectionLinkType))
 
     # Relationships
     collection = relationship("DbCollection", uselist=False, back_populates="links")
 
-    # Composite FK
-    __table_args__ = (
-        ForeignKeyConstraint([collection_id, collection_visibility], [DbCollection.id, DbCollection.visibility]),
-        {},
-    )
-
 
 # provide a consistent name
 DbCollectionLink = DbProjectLink
@@ -283,8 +280,7 @@ class DbDataset(Base, AuditMixin, TimestampMixin):
     cell_count = Column(Integer)
     is_valid = Column(Boolean, default=False)
     is_primary_data = Column(Enum(IsPrimaryData))
-    collection_id = Column(String, nullable=False)
-    collection_visibility = Column(Enum(CollectionVisibility), nullable=False)
+    collection_id = Column(String, ForeignKey("project.id"), nullable=False)
     tombstone = Column(Boolean, default=False, nullable=False)
     original_id = Column(String)
     published = Column(Boolean, default=False)
@@ -304,11 +300,7 @@ class DbDataset(Base, AuditMixin, TimestampMixin):
     genesets = relationship("DbGeneset", secondary="geneset_dataset_link", back_populates="datasets")
 
     # Composite FK
-    __table_args__ = (
-        ForeignKeyConstraint([collection_id, collection_visibility], [DbCollection.id, DbCollection.visibility]),
-        UniqueConstraint("collection_id", "curator_tag", name="_dataset__collection_id_curator_tag"),
-        {},
-    )
+    __table_args__ = (UniqueConstraint("collection_id", "curator_tag", name="_dataset__collection_id_curator_tag"),)
 
     def to_dict(self, *args, **kwargs):
         kwargs["remove_attr"] = kwargs.get("remove_attr", []) + ["genesets"]
@@ -326,7 +318,7 @@ class DbDatasetArtifact(Base, AuditMixin):
     __tablename__ = "dataset_artifact"
     __mapper_args__ = {"confirm_deleted_rows": False}
 
-    dataset_id = Column(ForeignKey("dataset.id"), nullable=False)
+    dataset_id = Column(String, ForeignKey("dataset.id"), nullable=False)
     filename = Column(String)
     filetype = Column(Enum(DatasetArtifactFileType))
     user_submitted = Column(Boolean)
@@ -418,7 +410,7 @@ class DbDatasetProcessingStatus(Base, AuditMixin):
 
     __tablename__ = "dataset_processing_status"
 
-    dataset_id = Column(ForeignKey("dataset.id"), nullable=False)
+    dataset_id = Column(String, ForeignKey("dataset.id"), nullable=False)
     upload_status = Column(Enum(UploadStatus))
     upload_progress = Column(Float)
     upload_message = Column(String)
@@ -443,15 +435,11 @@ class DbGeneset(Base, AuditMixin):
     name = Column(String, nullable=False)
     description = Column(String)
     genes = Column(JSONB)
-    collection_id = Column(String, nullable=False)
-    collection_visibility = Column(Enum(CollectionVisibility), nullable=False)
+    collection_id = Column(String, ForeignKey("project.id"), nullable=False)
     collection = relationship("DbCollection", uselist=False, back_populates="genesets")
     datasets = relationship("DbDataset", secondary="geneset_dataset_link", back_populates="genesets")
 
-    __table_args__ = (
-        ForeignKeyConstraint([collection_id, collection_visibility], [DbCollection.id, DbCollection.visibility]),
-        UniqueConstraint("name", "collection_id", "collection_visibility", name="_geneset_name__collection_uc"),
-    )
+    __table_args__ = (UniqueConstraint("name", "collection_id", name="_geneset_name__collection_uc"),)
 
     def to_dict(self, *args, **kwargs):
         kwargs["remove_attr"] = kwargs.get("remove_attr", []) + ["datasets"]
@@ -467,5 +455,5 @@ class DbGenesetDatasetLink(Base, AuditMixin):
 
     __tablename__ = "geneset_dataset_link"
 
-    geneset_id = Column(String, ForeignKey("geneset.id"), index=True)
-    dataset_id = Column(String, ForeignKey("dataset.id"), index=True)
+    geneset_id = Column(String, ForeignKey("geneset.id"), index=True, nullable=False)
+    dataset_id = Column(String, ForeignKey("dataset.id"), index=True, nullable=False)