Iqss/7140 google cloud archiver #7292

Merged
Changes from 6 commits
30 changes: 28 additions & 2 deletions doc/sphinx-guides/source/installation/config.rst
@@ -779,7 +779,7 @@ Dataverse may be configured to submit a copy of published Datasets, packaged as

Dataverse offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse web interface.
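The same OAI-ORE map described above can be fetched through the metadata export API. A hedged example (the host and persistent ID are placeholders for your installation's values; the ``OAI_ORE`` exporter name follows the export formats listed in the web interface):

```shell
# Retrieve the JSON-LD serialized OAI-ORE map for a published dataset.
# Replace the host and DOI with real values for your installation.
curl "http://localhost:8080/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.5072/FK2/EXAMPLE"
```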

At present, the DPNSubmitToArchiveCommand and LocalSubmitToArchiveCommand are the only implementations extending the AbstractSubmitToArchiveCommand and using the configurable mechanisms discussed below.
At present, the DPNSubmitToArchiveCommand, LocalSubmitToArchiveCommand, and GoogleCloudSubmitToArchiveCommand are the only implementations extending the AbstractSubmitToArchiveCommand and using the configurable mechanisms discussed below.

.. _Duracloud Configuration:

@@ -827,10 +827,36 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam

\:ArchiverSettings - the archiver class can access required settings including existing Dataverse settings and dynamically defined ones specific to the class. This setting is a comma-separated list of those settings. For example\:

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath``
``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath"``

:BagItLocalPath is the file path that you've set in :ArchiverSettings.
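Putting the pieces together, a complete local-archiver setup might look like the following sketch; the Bag storage path is an assumption, not a required value:

```shell
# Hedged sketch: configure the local archiver end to end.
# /usr/local/dvn/bags is an example path; any writable directory works.
API=http://localhost:8080/api/admin/settings
curl "$API/:ArchiverClassName" -X PUT \
  -d "edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand"
curl "$API/:ArchiverSettings" -X PUT -d ":BagItLocalPath"
curl "$API/:BagItLocalPath" -X PUT -d "/usr/local/dvn/bags"
```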

.. _Google Cloud Configuration:

Google Cloud Configuration
++++++++++++++++++++++++++

The Google Cloud Archiver can send Dataverse Bags to a bucket in Google's cloud, including buckets in the 'Coldline' storage class (cheaper, with slower access).

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.GoogleCloudSubmitToArchiveCommand"``

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":GoogleCloudBucket, :GoogleCloudProject"``

The Google Cloud Archiver defines two custom settings, both of which are required:

\:GoogleCloudBucket - the name of the bucket to use. For example:

``curl http://localhost:8080/api/admin/settings/:GoogleCloudBucket -X PUT -d "qdr-archive"``

\:GoogleCloudProject - the name of the project managing the bucket. For example:

``curl http://localhost:8080/api/admin/settings/:GoogleCloudProject -X PUT -d "qdr-project"``
Member:

When all the BagIt export stuff was initially merged, it didn't dawn on me that we lost our comprehensive list of settings in one place.

I think it would be nice to document :ArchiverSettings in the big list and probably all the various sub-settings like :GoogleCloudBucket, :GoogleCloudProject, and the older ones (:BagItLocalPath, etc.).

From a code perspective, I think I'd also like to see strings like :GoogleCloudBucket, :GoogleCloudProject, :BagItLocalPath, etc. added to the "Key" enum in SettingsServiceBean.java. That way, developers can also have a comprehensive list of all settings.
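A hypothetical sketch of that suggestion, following the ":"-prefixing pattern of the existing "Key" enum in SettingsServiceBean.java (the rest of the class is elided, and the member list here is illustrative):

```java
// Sketch only: archiver sub-settings added to the existing Key enum.
// toString() mirrors the enum's convention of prefixing a colon, so
// Key.GoogleCloudBucket maps to the ":GoogleCloudBucket" setting name.
public enum Key {
    ArchiverClassName,
    ArchiverSettings,
    BagItLocalPath,
    GoogleCloudBucket,
    GoogleCloudProject;

    @Override
    public String toString() {
        return ":" + name();
    }
}
```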

Member Author:

I'll add them to the guides list. :ArchiverSettings is already a Key. For the others, I think earlier discussions questioned whether the archive-specific classes would end up being deployed separately, in which case having their settings in the core would be odd. Since that's not on the roadmap at this point, I could move them temporarily if you think it's worth the effort.


In addition, the Google Cloud Archiver requires that the googlecloudkey.json file for the project be placed in the 'dataverse.files.directory' directory. This file can be created in the Google Cloud Console.
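A hedged sketch of placing the key file; the directory path is an assumption, and should be whatever your dataverse.files.directory JVM option actually points to:

```shell
# Assumption: FILES_DIR matches the dataverse.files.directory JVM option.
FILES_DIR="${FILES_DIR:-/usr/local/dvn/data}"
# googlecloudkey.json is the service-account key created in the
# Google Cloud Console for the project that owns the bucket.
cp googlecloudkey.json "$FILES_DIR/googlecloudkey.json"
```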

.. _Local Path Configuration:


API Call
++++++++

16 changes: 14 additions & 2 deletions pom.xml
@@ -57,7 +57,7 @@
</releases>
</pluginRepository>
</pluginRepositories>
<!--Maven checks for dependendies from these repos in the order shown in the pom.xml
<!--Maven checks for dependencies from these repos in the order shown in the pom.xml
This isn't well documented and seems to change between maven versions -MAD 4.9.4 -->
<repositories>
<repository>
@@ -127,6 +127,13 @@
<artifactId>httpclient</artifactId>
<version>${httpcomponents.client.version}</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-bom</artifactId>
<version>0.115.0-alpha</version>
<type>pom</type>
<scope>import</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers-bom</artifactId>
@@ -137,7 +144,7 @@
</dependencies>
</dependencyManagement>
<!-- Declare any DIRECT dependencies here.
In case the depency is both transitive and direct (e. g. some common lib for logging),
In case the dependency is both transitive and direct (e. g. some common lib for logging),
manage the version above and add the direct dependency here WITHOUT version tag, too.
-->
<!-- TODO: Housekeeping is utterly needed. -->
@@ -581,6 +588,11 @@
<artifactId>opennlp-tools</artifactId>
<version>1.9.1</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>1.97.0</version>
</dependency>


<!-- TESTING DEPENDENCIES -->
@@ -99,7 +99,12 @@ public void run() {
}
}
}).start();

// Have seen Pipe Closed errors for other archivers when used as a workflow without this delay loop
int i = 0;
while (digestInputStream.available() <= 0 && i < 100) {
    Thread.sleep(10);
    i++;
}
String checksum = store.addContent(spaceName, "datacite.xml", digestInputStream, -1l, null, null,
null);
logger.fine("Content: datacite.xml added with checksum: " + checksum);
@@ -133,7 +138,11 @@ public void run() {
}
}
}).start();

i = 0;
// This block uploads from digestInputStream2 (see addContent below), so
// wait on that stream, not the already-consumed digestInputStream.
while (digestInputStream2.available() <= 0 && i < 100) {
    Thread.sleep(10);
    i++;
}
checksum = store.addContent(spaceName, fileName, digestInputStream2, -1l, null, null,
null);
logger.fine("Content: " + fileName + " added with checksum: " + checksum);
@@ -174,6 +183,9 @@ public void run() {
logger.severe(rte.getMessage());
return new Failure("Error in generating datacite.xml file",
"DuraCloud Submission Failure: metadata file not created");
} catch (InterruptedException e) {
    logger.warning(e.getLocalizedMessage());
    // Restore the interrupt flag rather than swallowing the interruption.
    Thread.currentThread().interrupt();
}
} catch (ContentStoreException e) {
logger.warning(e.getMessage());