Merge pull request #7292 from QualitativeDataRepository/IQSS/7140-GoogleCloudArchiver

Iqss/7140 google cloud archiver
kcondon authored Oct 9, 2020
2 parents 059f410 + 3484bb6 commit 62b7fdf
Showing 5 changed files with 340 additions and 6 deletions.
12 changes: 12 additions & 0 deletions doc/release-notes/7140-google-cloud.md
@@ -0,0 +1,12 @@
## Google Cloud Archiver

Dataverse Bags can now be sent to a bucket in Google Cloud, including buckets in the 'Coldline' storage class, which provides less expensive but slower access.

## Use Cases

- As an Administrator I can set up a regular export to Google Cloud so that my users' data is preserved.

## New Settings

:GoogleCloudProject - the name of the project managing the bucket.
:GoogleCloudBucket - the name of the bucket to use.
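
For example, these can be set via the admin settings API (the bucket and project names below are the illustrative values used in the Installation Guide):

```shell
curl http://localhost:8080/api/admin/settings/:GoogleCloudBucket -X PUT -d "qdr-archive"
curl http://localhost:8080/api/admin/settings/:GoogleCloudProject -X PUT -d "qdr-project"
```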
74 changes: 72 additions & 2 deletions doc/sphinx-guides/source/installation/config.rst
@@ -776,14 +776,16 @@ For Google Analytics, the example script at :download:`analytics-code.html </_st

Once this script is running, you can look in the Google Analytics console (Realtime/Events or Behavior/Events) and view events by type and/or the Dataset or File the event involves.

.. _BagIt Export:

BagIt Export
------------

Dataverse may be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_ or alternately to any folder on the local filesystem.

Dataverse offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse web interface.

At present, the DPNSubmitToArchiveCommand and LocalSubmitToArchiveCommand are the only implementations extending the AbstractSubmitToArchiveCommand and using the configurable mechanisms discussed below.
At present, the DPNSubmitToArchiveCommand, LocalSubmitToArchiveCommand, and GoogleCloudSubmitToArchiveCommand are the only implementations extending the AbstractSubmitToArchiveCommand and using the configurable mechanisms discussed below.

.. _Duracloud Configuration:

@@ -831,10 +833,41 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam

\:ArchiverSettings - the archiver class can access required settings including existing Dataverse settings and dynamically defined ones specific to the class. This setting is a comma-separated list of those settings. For example\:

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath``
``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":BagItLocalPath"``

:BagItLocalPath is the file path that you've set in :ArchiverSettings.
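
For example, assuming the LocalSubmitToArchiveCommand is the archiver in use (the directory shown is illustrative and must be writable by the application server):

``curl http://localhost:8080/api/admin/settings/:BagItLocalPath -X PUT -d "/usr/local/payara5/glassfish/domains/domain1/files/archive"``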

.. _Google Cloud Configuration:

Google Cloud Configuration
++++++++++++++++++++++++++

The Google Cloud Archiver can send Dataverse Bags to a bucket in Google's cloud, including buckets in the 'Coldline' storage class (cheaper, with slower access).

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.GoogleCloudSubmitToArchiveCommand"``

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":GoogleCloudBucket, :GoogleCloudProject"``

The Google Cloud Archiver defines two custom settings, both of which are required. The credentials for your account, in the form of a JSON key file, must also be obtained and stored locally (see below).

In order to use the Google Cloud Archiver, you must have a Google account. You will need to create a project and bucket within that account and provide those values in the settings:

\:GoogleCloudBucket - the name of the bucket to use. For example:

``curl http://localhost:8080/api/admin/settings/:GoogleCloudBucket -X PUT -d "qdr-archive"``

\:GoogleCloudProject - the name of the project managing the bucket. For example:

``curl http://localhost:8080/api/admin/settings/:GoogleCloudProject -X PUT -d "qdr-project"``

The Google Cloud Archiver also requires a key file that must be renamed to 'googlecloudkey.json' and placed in the directory identified by your 'dataverse.files.directory' JVM option. This file can be created in the Google Cloud Console. (One method: navigate to your Project 'Settings'/'Service Accounts', create an account, give this account the 'Cloud Storage'/'Storage Admin' role, and, once it is created, use the 'Actions' menu to 'Create Key', selecting the 'JSON' format option. Use this as the 'googlecloudkey.json' file.)

For example:

``cp <your key file> /usr/local/payara5/glassfish/domains/domain1/files/googlecloudkey.json``
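
If the Google Cloud SDK is installed, one optional way to sanity-check the key file and the bucket (using the example bucket name from above) before configuring Dataverse is:

``gcloud auth activate-service-account --key-file=/usr/local/payara5/glassfish/domains/domain1/files/googlecloudkey.json``

``gsutil ls gs://qdr-archive``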

.. _Archiving API Call:

API Call
++++++++

@@ -2124,3 +2157,40 @@ To enable redirects to the zipper installed on the same server as the main Datav
To enable redirects to the zipper on a different server:

``curl -X PUT -d 'https://zipper.example.edu/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl``

:ArchiverClassName
++++++++++++++++++

Dataverse can export archival "Bag" files to an extensible set of storage systems (see :ref:`BagIt Export` above for details about this and for further explanation of the other archiving related settings below).
This setting specifies which storage system to use by identifying the particular Java class that should be run. Current options include DuraCloudSubmitToArchiveCommand, LocalSubmitToArchiveCommand, and GoogleCloudSubmitToArchiveCommand.

``curl -X PUT -d 'LocalSubmitToArchiveCommand' http://localhost:8080/api/admin/settings/:ArchiverClassName``

:ArchiverSettings
+++++++++++++++++

Each Archiver class may have its own custom settings. Along with setting which Archiver class to use, one must use this setting to identify which setting values should be sent to it when it is invoked. The value should be a comma-separated list of setting names.
For example, the LocalSubmitToArchiveCommand only uses the :BagItLocalPath setting. To allow the class to use that setting, this setting must be set as:

``curl -X PUT -d ':BagItLocalPath' http://localhost:8080/api/admin/settings/:ArchiverSettings``

:DuraCloudHost
++++++++++++++
:DuraCloudPort
++++++++++++++
:DuraCloudContext
+++++++++++++++++

These three settings define the host, port, and context used by the DuraCloudSubmitToArchiveCommand. :DuraCloudHost is required. The other settings have default values as noted in the :ref:`Duracloud Configuration` section above.
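
For example, pointing at a hypothetical DuraCloud host (only :DuraCloudHost is required; the port and context fall back to their defaults):

``curl -X PUT -d 'qdr.duracloud.org' http://localhost:8080/api/admin/settings/:DuraCloudHost``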

:BagItLocalPath
+++++++++++++++

This is the local file system path to be used with the LocalSubmitToArchiveCommand class. It is recommended to use an absolute path. See the :ref:`Local Path Configuration` section above.
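
For example (the path shown is illustrative):

``curl -X PUT -d '/usr/local/payara5/glassfish/domains/domain1/files/archive' http://localhost:8080/api/admin/settings/:BagItLocalPath``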

:GoogleCloudBucket
++++++++++++++++++
:GoogleCloudProject
+++++++++++++++++++

These are the bucket and project names to be used with the GoogleCloudSubmitToArchiveCommand class. Further information is in the :ref:`Google Cloud Configuration` section above.
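
For example, reusing the illustrative values from the :ref:`Google Cloud Configuration` section:

``curl -X PUT -d 'qdr-archive' http://localhost:8080/api/admin/settings/:GoogleCloudBucket``

``curl -X PUT -d 'qdr-project' http://localhost:8080/api/admin/settings/:GoogleCloudProject``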
16 changes: 14 additions & 2 deletions pom.xml
@@ -57,7 +57,7 @@
</releases>
</pluginRepository>
</pluginRepositories>
<!--Maven checks for dependendies from these repos in the order shown in the pom.xml
<!--Maven checks for dependencies from these repos in the order shown in the pom.xml
This isn't well documented and seems to change between maven versions -MAD 4.9.4 -->
<repositories>
<repository>
@@ -127,6 +127,13 @@
<artifactId>httpclient</artifactId>
<version>${httpcomponents.client.version}</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-bom</artifactId>
<version>0.115.0-alpha</version>
<type>pom</type>
<scope>import</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers-bom</artifactId>
@@ -137,7 +144,7 @@
</dependencies>
</dependencyManagement>
<!-- Declare any DIRECT dependencies here.
In case the depency is both transitive and direct (e. g. some common lib for logging),
In case the dependency is both transitive and direct (e. g. some common lib for logging),
manage the version above and add the direct dependency here WITHOUT version tag, too.
-->
<!-- TODO: Housekeeping is utterly needed. -->
@@ -576,6 +583,11 @@
<artifactId>opennlp-tools</artifactId>
<version>1.9.1</version>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
<version>1.97.0</version>
</dependency>


<!-- TESTING DEPENDENCIES -->
@@ -99,7 +99,12 @@ public void run() {
}
}
}).start();

// Have seen Pipe Closed errors for other archivers when used as a workflow without this delay loop
int i = 0;
while (digestInputStream.available() <= 0 && i < 100) {
    Thread.sleep(10);
    i++;
}
String checksum = store.addContent(spaceName, "datacite.xml", digestInputStream, -1l, null, null,
null);
logger.fine("Content: datacite.xml added with checksum: " + checksum);
@@ -133,7 +138,11 @@ public void run() {
}
}
}).start();

i = 0;
while (digestInputStream2.available() <= 0 && i < 100) {
    Thread.sleep(10);
    i++;
}
checksum = store.addContent(spaceName, fileName, digestInputStream2, -1l, null, null,
null);
logger.fine("Content: " + fileName + " added with checksum: " + checksum);
@@ -174,6 +183,9 @@ public void run() {
logger.severe(rte.getMessage());
return new Failure("Error in generating datacite.xml file",
"DuraCloud Submission Failure: metadata file not created");
} catch (InterruptedException e) {
logger.warning(e.getLocalizedMessage());
e.printStackTrace();
}
} catch (ContentStoreException e) {
logger.warning(e.getMessage());
