Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/ted 24 #35

Merged
merged 35 commits into from
Mar 28, 2022
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
9db2652
WIP
CaptainOfHacks Mar 18, 2022
a6f24de
add docs for rmlmapper adapter
CaptainOfHacks Mar 18, 2022
68d227c
Update transform.py
CaptainOfHacks Mar 18, 2022
fe66b46
define mapping_suite_repository
CaptainOfHacks Mar 20, 2022
b378ef6
implement mapping_suite_repository_mongodb
CaptainOfHacks Mar 20, 2022
3773b2d
WIP
CaptainOfHacks Mar 20, 2022
6239ad9
finish MappingSuiteRepositoryInFileSystem
CaptainOfHacks Mar 20, 2022
77a3815
delete test_data_files
CaptainOfHacks Mar 20, 2022
f7e681c
Update test_mapping_suite_repository.py
CaptainOfHacks Mar 20, 2022
370862c
add test_inter_transactions_mapping_suite_repositories
CaptainOfHacks Mar 20, 2022
51482ca
Update mapping_suite_repository.py
CaptainOfHacks Mar 20, 2022
aa8a89a
Update transform.py
CaptainOfHacks Mar 20, 2022
f8386dc
Create notice_transformer.py
CaptainOfHacks Mar 20, 2022
14a1b64
Create test_notice_transformer.py
CaptainOfHacks Mar 20, 2022
3439441
add tests for notice transformer
CaptainOfHacks Mar 20, 2022
60922ea
add make target for init-rml-mapper
CaptainOfHacks Mar 20, 2022
53b9e81
Update Makefile
CaptainOfHacks Mar 20, 2022
8572ea4
Update Makefile
CaptainOfHacks Mar 20, 2022
c833a64
Merge branch 'main' into feature/TED-24
CaptainOfHacks Mar 20, 2022
bfdb972
Update Makefile
CaptainOfHacks Mar 20, 2022
f050bdb
Update __init__.py
CaptainOfHacks Mar 20, 2022
fd64c82
resolve comments
CaptainOfHacks Mar 23, 2022
1907747
Update transform.py
CaptainOfHacks Mar 25, 2022
f694e47
Merge branch 'main' into feature/TED-24
CaptainOfHacks Mar 25, 2022
743dfd4
Update mapping_suite_repository.py
CaptainOfHacks Mar 25, 2022
4373b04
Update transform.py
CaptainOfHacks Mar 25, 2022
b27ae01
Update notice_transformer.py
CaptainOfHacks Mar 25, 2022
51e3db7
Create notice.xml
CaptainOfHacks Mar 25, 2022
b3884fe
Update conftest.py
CaptainOfHacks Mar 25, 2022
d40b64a
Update test_mapping_suite_repository.py
CaptainOfHacks Mar 25, 2022
574c7ba
Update test_notice_transformer.py
CaptainOfHacks Mar 25, 2022
f0e1ed0
Update notice_transformer.py
CaptainOfHacks Mar 25, 2022
bdd6644
Update test_notice_transformer.py
CaptainOfHacks Mar 25, 2022
b404a0b
Update test_notice_transformer.py
CaptainOfHacks Mar 25, 2022
9ef3999
rename package folders
CaptainOfHacks Mar 28, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -101,3 +101,4 @@ infra/traefik/letsencrypt/acme.json
*.log
infra/airflow/logs/scheduler/latest
/.airflow/
.rmlmapper/rmlmapper.jar
11 changes: 10 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ ENV_FILE := .env

PROJECT_PATH = $(shell pwd)
AIRFLOW_INFRA_FOLDER ?= ${PROJECT_PATH}/.airflow
RML_MAPPER_PATH = ${PROJECT_PATH}/.rmlmapper/rmlmapper.jar

#-----------------------------------------------------------------------------
# Dev commands
Expand Down Expand Up @@ -142,8 +143,13 @@ stop-mongo:
@ echo -e "$(BUILD_PRINT)Stopping the Mongo services $(END_BUILD_PRINT)"
@ docker-compose -p ${ENVIRONMENT} --file ./infra/mongo/docker-compose.yml --env-file ${ENV_FILE} down

init-rml-mapper:
@ echo -e "RMLMapper folder initialisation!"
@ wget -c https://github.com/RMLio/rmlmapper-java/releases/download/v5.0.0/rmlmapper-5.0.0-r362-all.jar -P .rmlmapper/
@ mv .rmlmapper/rmlmapper-5.0.0-r362-all.jar .rmlmapper/rmlmapper.jar 2>/dev/null

start-project-services: | start-airflow start-mongo

start-project-services: | start-airflow start-mongo init-rml-mapper
stop-project-services: | stop-airflow stop-mongo

#-----------------------------------------------------------------------------
Expand All @@ -170,6 +176,7 @@ staging-dotenv-file: guard-VAULT_ADDR guard-VAULT_TOKEN vault-installed
@ echo DOMAIN=ted-data.eu >> .env
@ echo ENVIRONMENT=staging >> .env
@ echo SUBDOMAIN=staging. >> .env
@ echo RML_MAPPER_PATH=${RML_MAPPER_PATH} >> .env
@ echo AIRFLOW_INFRA_FOLDER=~/airflow-infra/staging >> .env
@ vault kv get -format="json" ted-staging/airflow | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
@ vault kv get -format="json" ted-staging/mongo-db | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
Expand All @@ -181,6 +188,7 @@ dev-dotenv-file: guard-VAULT_ADDR guard-VAULT_TOKEN vault-installed
@ echo DOMAIN=localhost >> .env
@ echo ENVIRONMENT=dev >> .env
@ echo SUBDOMAIN= >> .env
@ echo RML_MAPPER_PATH=${RML_MAPPER_PATH} >> .env
@ echo AIRFLOW_INFRA_FOLDER=${AIRFLOW_INFRA_FOLDER} >> .env
@ vault kv get -format="json" ted-dev/airflow | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
@ vault kv get -format="json" ted-dev/mongo-db | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
Expand All @@ -193,6 +201,7 @@ prod-dotenv-file: guard-VAULT_ADDR guard-VAULT_TOKEN vault-installed
@ echo DOMAIN=ted-data.eu >> .env
@ echo ENVIRONMENT=prod >> .env
@ echo SUBDOMAIN= >> .env
@ echo RML_MAPPER_PATH=${RML_MAPPER_PATH} >> .env
@ echo AIRFLOW_INFRA_FOLDER=~/airflow-infra/prod >> .env
@ vault kv get -format="json" ted-prod/airflow | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
@ vault kv get -format="json" ted-prod/mongo-db | jq -r ".data.data | keys[] as \$$k | \"\(\$$k)=\(.[\$$k])\"" >> .env
Expand Down
16 changes: 15 additions & 1 deletion ted_sws/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,24 @@ def MONGO_DB_PORT(self) -> int:
return int(VaultAndEnvConfigResolver().config_resolve())


class TedConfigResolver(MongoDBConfig):
@property
def MONGO_DB_AGGREGATES_DATABASE_NAME(self) -> str:
return VaultAndEnvConfigResolver().config_resolve()



class RMLMapperConfig:

@property
def RML_MAPPER_PATH(self) -> str:
return VaultAndEnvConfigResolver().config_resolve()


class TedConfigResolver(MongoDBConfig, RMLMapperConfig):
"""
This class resolve the secrets of the ted-sws project.
"""


config = TedConfigResolver()

291 changes: 291 additions & 0 deletions ted_sws/data_manager/adapters/mapping_suite_repository.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,291 @@
import json
import pathlib
import shutil
from typing import Iterator, List, Optional

from pymongo import MongoClient

from ted_sws import config
from ted_sws.domain.adapters.repository_abc import MappingSuiteRepositoryABC
from ted_sws.domain.model.transform import MappingSuite, FileResource, TransformationRuleSet, SHACLTestSuite, \
SPARQLTestSuite, MetadataConstraints

METADATA_FILE_NAME = "metadata.json"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More reliable constant usage.

TRANSFORM_PACKAGE_NAME = "transform"
MAPPINGS_PACKAGE_NAME = "mappings"
RESOURCES_PACKAGE_NAME = "resources"
VALIDATE_PACKAGE_NAME = "validate"
SHACL_PACKAGE_NAME = "shacl"
SPARQL_PACKAGE_NAME = "sparql"


class MappingSuiteRepositoryMongoDB(MappingSuiteRepositoryABC):
"""
This repository is intended for storing MappingSuite objects in MongoDB.
"""

_collection_name = "mapping_suite_collection"
_database_name = config.MONGO_DB_AGGREGATES_DATABASE_NAME

def __init__(self, mongodb_client: MongoClient):
"""

:param mongodb_client:
:param database_name:
"""
mongodb_client = mongodb_client
notice_db = mongodb_client[self._database_name]
self.collection = notice_db[self._collection_name]

def add(self, mapping_suite: MappingSuite):
"""
This method allows you to add MappingSuite objects to the repository.
:param mapping_suite:
:return:
"""
mapping_suite_dict = mapping_suite.dict()
mapping_suite_dict["_id"] = mapping_suite_dict["identifier"]
self.collection.insert_one(mapping_suite_dict)

def update(self, mapping_suite: MappingSuite):
"""
This method allows you to update MappingSuite objects to the repository
:param mapping_suite:
:return:
"""
mapping_suite_dict = mapping_suite.dict()
mapping_suite_dict["_id"] = mapping_suite_dict["identifier"]
self.collection.update_one({'_id': mapping_suite_dict["_id"]}, {"$set": mapping_suite_dict})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this fails if the _id does not exist in the DB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will not drop, it will not update anything, it will not insert anything


def get(self, reference) -> MappingSuite:
"""
This method allows a MappingSuite to be obtained based on an identification reference.
:param reference:
:return: MappingSuite
"""
result_dict = self.collection.find_one({"identifier": reference})
return MappingSuite(**result_dict) if result_dict else None

def list(self) -> Iterator[MappingSuite]:
"""
This method allows all records to be retrieved from the repository.
:return: list of MappingSuites
"""
for result_dict in self.collection.find():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope .find() returns a generator and not a list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 👍
Returns an instance of
:class:~pymongo.cursor.Cursor corresponding to this query.

yield MappingSuite(**result_dict)


class MappingSuiteRepositoryInFileSystem(MappingSuiteRepositoryABC):
"""
This repository is intended for storing MappingSuite objects in FileSystem.
"""

def __init__(self, repository_path: pathlib.Path):
"""

:param repository_path:
"""
self.repository_path = repository_path
self.repository_path.mkdir(parents=True, exist_ok=True)

def _read_package_metadata(self, package_path: pathlib.Path) -> dict:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this package_path is some sort of (unique) reference in the repository path. In which case, you shall use _id/identifier for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repository_path is unique, in repository can be multiple packages and each package have a package_path

"""
This method allows reading the metadata of a packet.
:param package_path:
:return:
"""
package_metadata_path = package_path / METADATA_FILE_NAME
package_metadata_content = package_metadata_path.read_text(encoding="utf-8")
package_metadata = json.loads(package_metadata_content)
package_metadata['metadata_constraints'] = MetadataConstraints(**package_metadata['metadata_constraints'])
return package_metadata

def _read_transformation_rule_set(self, package_path: pathlib.Path) -> TransformationRuleSet:
"""
This method allows you to read the transformation rules in a package.
:param package_path:
:return:
"""
mappings_path = package_path / TRANSFORM_PACKAGE_NAME / MAPPINGS_PACKAGE_NAME
resources_path = package_path / TRANSFORM_PACKAGE_NAME / RESOURCES_PACKAGE_NAME
resources = self._read_file_resources(path=resources_path)
rml_mapping_rules = self._read_file_resources(path=mappings_path)
return TransformationRuleSet(resources=resources, rml_mapping_rules=rml_mapping_rules)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good


def _read_shacl_test_suites(self, package_path: pathlib.Path) -> List[SHACLTestSuite]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when it comes to SHACL, there shall be a single suite, not a list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to revise architecture, at this point

"""
This method allows you to read shacl test suites from a package.
:param package_path:
:return:
"""
validate_path = package_path / VALIDATE_PACKAGE_NAME
shacl_path = validate_path / SHACL_PACKAGE_NAME
shacl_test_suite_paths = [x for x in shacl_path.iterdir() if x.is_dir()]
return [SHACLTestSuite(shacl_tests=self._read_file_resources(path=shacl_test_suite_path))
for shacl_test_suite_path in shacl_test_suite_paths]

def _read_sparql_test_suites(self, package_path: pathlib.Path) -> List[SPARQLTestSuite]:
"""
This method allows you to read sparql test suites from a package.
:param package_path:
:return:
"""
validate_path = package_path / VALIDATE_PACKAGE_NAME
sparql_path = validate_path / SPARQL_PACKAGE_NAME
sparql_test_suite_paths = [x for x in sparql_path.iterdir() if x.is_dir()]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need to keep an aggregating _id for each SPARQL suite (i.e. the folder name is the id). And this is shall be outside the file name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear what you mean

return [SPARQLTestSuite(sparql_tests=self._read_file_resources(path=sparql_test_suite_path))
for sparql_test_suite_path in sparql_test_suite_paths]

def _write_package_metadata(self, mapping_suite: MappingSuite):
"""
This method creates the metadata of a package based on the metadata in the mapping_suite.
:param mapping_suite:
:return:
"""
package_path = self.repository_path / mapping_suite.identifier
package_path.mkdir(parents=True, exist_ok=True)
metadata_path = package_path / METADATA_FILE_NAME
package_metadata = mapping_suite.dict()
[package_metadata.pop(key, None) for key in
["transformation_rule_set", "shacl_test_suites", "sparql_test_suites"]]
with metadata_path.open("w", encoding="utf-8") as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thinking why not p.open('w').write('some text') but then I realised with context is probably neater although more verbose.

f.write(json.dumps(package_metadata))

def _write_file_resources(self, file_resources: List[FileResource], path: pathlib.Path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

"""
This method allows you to write a list of file-type resources to a specific location.
:param file_resources:
:param path:
:return:
"""
for file_resource in file_resources:
file_resource_path = path / file_resource.file_name
with file_resource_path.open("w", encoding="utf-8") as f:
f.write(file_resource.file_content)

def _read_file_resources(self, path: pathlib.Path) -> List[FileResource]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool

"""
This method reads a list of file-type resources that are in a specific location.
:param path:
:return:
"""
files = [file for file in path.iterdir() if file.is_file()]
return [FileResource(file_name=file.name,
file_content=file.read_text(encoding="utf-8"))
for file in files]

def _write_package_transform_rules(self, mapping_suite: MappingSuite):
"""
This method creates the transformation rules within the package.
:param mapping_suite:
:return:
"""
package_path = self.repository_path / mapping_suite.identifier
transform_path = package_path / TRANSFORM_PACKAGE_NAME
mappings_path = transform_path / MAPPINGS_PACKAGE_NAME
resources_path = transform_path / RESOURCES_PACKAGE_NAME
mappings_path.mkdir(parents=True, exist_ok=True)
resources_path.mkdir(parents=True, exist_ok=True)
self._write_file_resources(file_resources=mapping_suite.transformation_rule_set.rml_mapping_rules,
path=mappings_path
)
self._write_file_resources(file_resources=mapping_suite.transformation_rule_set.resources,
path=resources_path
)

def _write_package_validation_rules(self, mapping_suite: MappingSuite):
"""
This method creates the validation rules within the package.
:param mapping_suite:
:return:
"""
package_path = self.repository_path / mapping_suite.identifier
validate_path = package_path / VALIDATE_PACKAGE_NAME
sparql_path = validate_path / SPARQL_PACKAGE_NAME
shacl_path = validate_path / SHACL_PACKAGE_NAME
sparql_path.mkdir(parents=True, exist_ok=True)
shacl_path.mkdir(parents=True, exist_ok=True)
shacl_test_suites = mapping_suite.shacl_test_suites
shacl_test_suite_path_counter = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a counter like that already raises issues in my mind. Please refer to the comment I provided above: have a single suite of SHACL rules instead of a list of them. (well, maybe you can keep a list after all, but with an _id reference to suite folder, like I explain below)

In the case of SPARQL rules, however, we would have to keep an identifier reference for each suite, which would be the folder name. No counter, however.

for shacl_test_suite in shacl_test_suites:
shacl_test_suite_path = shacl_path / f"shacl_test_suite_{shacl_test_suite_path_counter}"
shacl_test_suite_path.mkdir(parents=True, exist_ok=True)
self._write_file_resources(file_resources=shacl_test_suite.shacl_tests,
path=shacl_test_suite_path
)
shacl_test_suite_path_counter += 1

sparql_test_suites = mapping_suite.sparql_test_suites
for idx, sparql_test_suite in enumerate(sparql_test_suites):
sparql_test_suite_path = sparql_path / f"sparql_test_suite_{idx}"
sparql_test_suite_path.mkdir(parents=True, exist_ok=True)
self._write_file_resources(file_resources=sparql_test_suite.sparql_tests,
path=sparql_test_suite_path
)

def _write_mapping_suite_package(self, mapping_suite: MappingSuite):
"""
This method creates a package based on data from mapping_suite.
:param mapping_suite:
:return:
"""
self._write_package_metadata(mapping_suite=mapping_suite)
self._write_package_transform_rules(mapping_suite=mapping_suite)
self._write_package_validation_rules(mapping_suite=mapping_suite)

def _read_mapping_suite_package(self, mapping_suite_identifier: str) -> Optional[MappingSuite]:
"""
This method reads a package and initializes a MappingSuite object.
:param mapping_suite_identifier:
:return:
"""
package_path = self.repository_path / mapping_suite_identifier
if package_path.is_dir():
package_metadata = self._read_package_metadata(package_path)
package_metadata["transformation_rule_set"] = self._read_transformation_rule_set(package_path)
package_metadata["shacl_test_suites"] = self._read_shacl_test_suites(package_path)
package_metadata["sparql_test_suites"] = self._read_sparql_test_suites(package_path)
return MappingSuite(**package_metadata)
return None

def add(self, mapping_suite: MappingSuite):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we manage the id like in the MongoDB repository?

"""
This method allows you to add MappingSuite objects to the repository.
:param mapping_suite:
:return:
"""
self._write_mapping_suite_package(mapping_suite=mapping_suite)

def update(self, mapping_suite: MappingSuite):
"""
This method allows you to update MappingSuite objects to the repository
:param mapping_suite:
:return:
"""
package_path = self.repository_path / mapping_suite.identifier
if package_path.is_dir():
self._write_mapping_suite_package(mapping_suite=mapping_suite)

def get(self, reference) -> MappingSuite:
"""
This method allows a MappingSuite to be obtained based on an identification reference.
:param reference:
:return: MappingSuite
"""
return self._read_mapping_suite_package(mapping_suite_identifier=reference)

def list(self) -> Iterator[MappingSuite]:
"""
This method allows all records to be retrieved from the repository.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like you have simple and clear descriptions.

:return: list of MappingSuites
"""
package_paths = [x for x in self.repository_path.iterdir() if x.is_dir()]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beautiful!

for package_path in package_paths:
yield self.get(reference=package_path.name)

def clear_repository(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BRUTAL! but useful :)

"""
This method allows you to clean the repository.
:return:
"""
shutil.rmtree(self.repository_path)
8 changes: 5 additions & 3 deletions ted_sws/data_manager/adapters/notice_repository.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
import logging
from typing import Iterator
from pymongo import MongoClient

from ted_sws import config
from ted_sws.domain.adapters.repository_abc import NoticeRepositoryABC
from ted_sws.domain.model.notice import Notice, NoticeStatus

Expand All @@ -13,11 +15,11 @@ class NoticeRepository(NoticeRepositoryABC):
"""

_collection_name = "notice_collection"
_database_name = "notice_db"
_database_name = config.MONGO_DB_AGGREGATES_DATABASE_NAME

def __init__(self, mongodb_client: MongoClient, database_name: str = None):
def __init__(self, mongodb_client: MongoClient):
mongodb_client = mongodb_client
notice_db = mongodb_client[database_name if database_name else self._database_name]
notice_db = mongodb_client[self._database_name]
self.collection = notice_db[self._collection_name]

def add(self, notice: Notice):
Expand Down
Loading