add dataverse.netcdf.geo-extract-s3-direct-upload config #9601
By default, keep S3 direct upload fast. Don't download NetCDF or HDF5
files to try to pull geospatial metadata out of them when S3 direct
upload is configured. If you really want this, add this setting and make
it true.
pdurbin committed May 25, 2023
1 parent 505e8f2 commit 18076ed
Showing 6 changed files with 25 additions and 0 deletions.
4 changes: 4 additions & 0 deletions doc/release-notes/9331-extract-bounding-box.md
@@ -1 +1,5 @@
An attempt will be made to extract a geospatial bounding box (west, south, east, north) from NetCDF and HDF5 files and then insert these values into the geospatial metadata block, if that block is enabled.

The following JVM setting has been added:

- dataverse.netcdf.geo-extract-s3-direct-upload
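For installations that want the extraction back under S3 direct upload, a hedged example of enabling the option (assuming a standard Payara setup, as used elsewhere in the guides):

```
./asadmin create-jvm-options "-Ddataverse.netcdf.geo-extract-s3-direct-upload=true"
```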
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -46,6 +46,7 @@ The following features are disabled when S3 direct upload is enabled.
- Unzipping of zip files. (See :ref:`compressed-files`.)
- Extraction of metadata from FITS files. (See :ref:`fits`.)
- Creation of NcML auxiliary files (See :ref:`netcdf-and-hdf5`.)
- Extraction of a geospatial bounding box from NetCDF and HDF5 files (see :ref:`netcdf-and-hdf5`) unless :ref:`dataverse.netcdf.geo-extract-s3-direct-upload` is set to true.

.. _cors-s3-bucket:

8 changes: 8 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2419,6 +2419,14 @@ Defaults to ``false``.
Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
``DATAVERSE_UI_SHOW_VALIDITY_FILTER``. Will accept ``[tT][rR][uU][eE]|1|[oO][nN]`` as "true" expressions.

.. _dataverse.netcdf.geo-extract-s3-direct-upload:

dataverse.netcdf.geo-extract-s3-direct-upload
+++++++++++++++++++++++++++++++++++++++++++++

This setting exists to keep S3 direct upload lightweight. When S3 direct upload is enabled, extraction of a geospatial bounding box from NetCDF and HDF5 files (see :ref:`netcdf-and-hdf5`) is skipped by default, because in this scenario it requires downloading the file from S3. Set this setting to ``true`` if you want the extraction anyway.

See also :ref:`s3-direct-upload-features-disabled`.
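
Assuming this setting follows the same MicroProfile Config conventions as the ``dataverse.ui.*`` settings documented above (an assumption based on the surrounding docs, not something this commit states), it could also be supplied via the standard environment variable name mapping:

```
export DATAVERSE_NETCDF_GEO_EXTRACT_S3_DIRECT_UPLOAD=true
```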

.. _feature-flags:

1 change: 1 addition & 0 deletions doc/sphinx-guides/source/user/dataset-management.rst
@@ -373,6 +373,7 @@ Please note the following rules regarding these fields:
- If West Longitude and East Longitude are both over 180 (outside the expected -180:180 range), 360 will be subtracted to shift the values from the 0:360 range to the expected -180:180 range.
- If either West Longitude or East Longitude is less than zero while the other is greater than 180 (an indeterminate domain: it is unclear whether the values use the -180:180 or the 0:360 range), metadata will not be extracted. Both longitude rules are sketched in code after this list.
- If the bounding box was successfully populated, the subsequent removal of the NetCDF or HDF5 file from the dataset does not automatically remove the bounding box from the dataset metadata. You must remove the bounding box manually, if desired.
- This feature is disabled if S3 direct upload is enabled (see :ref:`s3-direct-upload-features-disabled`) unless :ref:`dataverse.netcdf.geo-extract-s3-direct-upload` has been set to true.
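
To make the two longitude rules above concrete, here is a minimal Java sketch. The helper is hypothetical, for illustration only, and is not the actual Dataverse implementation:

```java
// Hypothetical sketch of the longitude rules above; not Dataverse's actual code.
public class LongitudeRules {

    // Returns {west, east} normalized to the -180:180 range, or null when the
    // domain is indeterminate and metadata should not be extracted.
    static double[] normalize(double west, double east) {
        if (west > 180 && east > 180) {
            // Both values are in the 0:360 range; shift into -180:180.
            return new double[] { west - 360, east - 360 };
        }
        if ((west < 0 && east > 180) || (east < 0 && west > 180)) {
            // One negative value plus one value over 180: unclear whether the
            // file uses -180:180 or 0:360, so skip extraction.
            return null;
        }
        return new double[] { west, east };
    }

    public static void main(String[] args) {
        // 300:350 (0:360 range) becomes -60:-10 in the -180:180 range.
        double[] shifted = normalize(300, 350);
        System.out.println(shifted[0] + " to " + shifted[1]); // -60.0 to -10.0
        System.out.println(normalize(-10, 200)); // null: indeterminate domain
    }
}
```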

If the bounding box was successfully populated, :ref:`geospatial-search` should be able to find it.

@@ -69,6 +69,7 @@
import edu.harvard.iq.dataverse.ingest.tabulardata.impl.plugins.sav.SAVFileReaderSpi;
import edu.harvard.iq.dataverse.ingest.tabulardata.impl.plugins.por.PORFileReader;
import edu.harvard.iq.dataverse.ingest.tabulardata.impl.plugins.por.PORFileReaderSpi;
import edu.harvard.iq.dataverse.settings.JvmSettings;
import edu.harvard.iq.dataverse.util.*;

import org.apache.commons.io.IOUtils;
@@ -105,6 +106,7 @@
import java.util.ListIterator;
import java.util.logging.Logger;
import java.util.Hashtable;
import java.util.Optional;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.inject.Named;
@@ -1280,6 +1282,11 @@ public boolean extractMetadataFromNetcdf(String tempFileLocation, DataFile dataF
dataFileLocation = localFile.getAbsolutePath();
logger.info("extractMetadataFromNetcdf: file is local. Path: " + dataFileLocation);
} else {
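// With S3 direct upload configured, the file is not local. Check the opt-in
// setting before downloading the file from S3 just to extract geospatial metadata.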
Optional<Boolean> allow = JvmSettings.GEO_EXTRACT_S3_DIRECT_UPLOAD.lookupOptional(Boolean.class);
if (!(allow.isPresent() && allow.get())) {
logger.info("extractMetadataFromNetcdf: skipping because of config is set to not slow down S3 remote upload.");
return false;
}
// Need to create a temporary local file:
tempFile = File.createTempFile("tempFileExtractMetadataNetcdf", ".tmp");
try ( ReadableByteChannel targetFileChannel = (ReadableByteChannel) storageIO.getReadChannel(); FileChannel tempFileChannel = new FileOutputStream(tempFile).getChannel();) {
@@ -118,6 +118,10 @@ public enum JvmSettings {
SCOPE_UI(PREFIX, "ui"),
UI_ALLOW_REVIEW_INCOMPLETE(SCOPE_UI, "allow-review-for-incomplete"),
UI_SHOW_VALIDITY_FILTER(SCOPE_UI, "show-validity-filter"),

// NetCDF SETTINGS
SCOPE_NETCDF(PREFIX, "netcdf"),
GEO_EXTRACT_S3_DIRECT_UPLOAD(SCOPE_NETCDF, "geo-extract-s3-direct-upload"),
;

private static final String SCOPE_SEPARATOR = ".";
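The scoped enum entries above are what produce the dotted setting name: ``SCOPE_NETCDF`` hangs off the ``dataverse`` prefix, and ``GEO_EXTRACT_S3_DIRECT_UPLOAD`` hangs off that scope, yielding ``dataverse.netcdf.geo-extract-s3-direct-upload``. A minimal standalone sketch of that composition pattern (an assumed mechanism for illustration, not the real ``JvmSettings`` code):

```java
// Hypothetical sketch of hierarchical key composition; not JvmSettings itself.
enum SettingKey {
    ROOT(null, "dataverse"),
    NETCDF(ROOT, "netcdf"),
    GEO_EXTRACT_S3_DIRECT_UPLOAD(NETCDF, "geo-extract-s3-direct-upload");

    private final String qualifiedKey;

    SettingKey(SettingKey parentScope, String fragment) {
        // Prepend the parent scope's fully qualified key, if any.
        this.qualifiedKey = (parentScope == null)
                ? fragment
                : parentScope.qualifiedKey + "." + fragment;
    }

    String key() {
        return qualifiedKey;
    }
}

// SettingKey.GEO_EXTRACT_S3_DIRECT_UPLOAD.key()
// -> "dataverse.netcdf.geo-extract-s3-direct-upload"
```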
