Releases: GoogleCloudDataproc/hadoop-connectors

2019-01-30 (GCS 1.9.12, BQ 0.13.12)

30 Jan 22:51

This version has a bug in the implicit directory inference feature and a bug that causes a spike in GCS list requests. Please use version 1.9.15 instead.

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.
  2. Improve GCS IO exception messages.
  3. Reduce latency of GCS IO operations.
  4. Fix bug that could lead to data duplication when reading files with GZIP content encoding (HTTP header Content-Encoding: gzip) that have uncompressed size of more than 2.14 GiB.

BigQuery connector:

  1. POM updates for GCS connector 1.9.12.
  2. Improve exception message for BigQuery job execution errors.
  3. Update all dependencies to latest versions.

2018-12-20 (GCS 1.9.11, BQ 0.13.11)

20 Dec 21:50

Changelog

Cloud Storage connector:

  1. Change the default value of fs.gs.path.encoding to 'uri-path', the new codec introduced in 1.4.5. The old behavior can be restored by setting fs.gs.path.encoding to 'legacy'.

  2. Update all dependencies to latest versions.

  3. Stop using the fs.gs.performance.cache.dir.metadata.prefetch.limit property to prefetch metadata in PerformanceCachingGoogleCloudStorage. Always use a single objects list request instead, because prefetching metadata with multiple list requests (when a directory contains many files) could introduce performance penalties when using the performance cache.

  4. Add an option to lazily initialize GoogleHadoopFileSystem instances:

    fs.gs.lazy.init.enable (default: false)
    
  5. Add ability to unset fs.gs.system.bucket with an empty string value:

    fs.gs.system.bucket=
    
  6. Set default value for fs.gs.working.dir property to /.
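
The properties in this release are ordinary Hadoop configuration entries. As a minimal sketch (the helper `render_hadoop_conf` is hypothetical and not part of the connector), the following renders a few of them in the standard Hadoop configuration XML layout, using only the property names and values mentioned above:

```python
import xml.etree.ElementTree as ET


def render_hadoop_conf(properties):
    """Render a dict of property names/values as Hadoop configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Property names and values come from the changelog entries above.
overrides = {
    "fs.gs.path.encoding": "legacy",   # restore the old path codec
    "fs.gs.lazy.init.enable": "true",  # opt in to lazy initialization
    "fs.gs.system.bucket": "",         # an empty string unsets the bucket
}

xml_text = render_hadoop_conf(overrides)
print(xml_text)
```

The output can be pasted into a site configuration file; whether a given property belongs in core-site.xml or in job-level configuration depends on the deployment.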

BigQuery connector:

  1. POM updates for GCS connector 1.9.11.
  2. Update all dependencies to latest versions.

2018-11-01 (GCS 1.9.10, BQ 0.13.10)

01 Nov 21:58

Changelog

Cloud Storage connector:

  1. Use Hadoop CredentialProvider API to retrieve proxy credentials.
  2. Remove the limit of 1024 compose components from the SYNCABLE_COMPOSITE output stream type.

BigQuery connector:

  1. POM updates for GCS connector 1.9.10.

2018-10-19 (GCS 1.9.9, BQ 0.13.9)

20 Oct 00:00

Changelog

Cloud Storage connector:

  1. Add an option for running flat and regular glob search algorithms in parallel:

    fs.gs.glob.concurrent.enable (default: true)
    

    Returns the result of whichever algorithm finishes first and cancels the other.

  2. Add an option to provide a path to a configuration override file:

    fs.gs.config.override.file (default: null)
    

    The connector overrides its configuration with the values provided in this file. The file should be in XML format and follow the same schema as Hadoop configuration files.
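
The "first to finish wins" behavior described for fs.gs.glob.concurrent.enable can be illustrated with a small sketch. This is not the connector's implementation: `run_first_to_finish`, `flat_search`, and `regular_search` are hypothetical names, and cancellation here is cooperative (each algorithm checks a shared flag), which is an assumption made for the example.

```python
import concurrent.futures
import threading
import time


def run_first_to_finish(algorithms):
    """Run callables concurrently and return the first result to arrive.

    Each callable receives a threading.Event and is expected to return
    early once that event is set (cooperative cancellation).
    """
    cancel = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        futures = [pool.submit(algo, cancel) for algo in algorithms]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        cancel.set()  # tell the slower algorithm to stop
        return next(iter(done)).result()


def flat_search(cancel):
    # Hypothetical stand-in for the slower algorithm in this run.
    for _ in range(200):
        if cancel.is_set():
            return None
        time.sleep(0.01)
    return "flat"


def regular_search(cancel):
    # Hypothetical stand-in that finishes almost immediately.
    time.sleep(0.01)
    return "regular"


winner = run_first_to_finish([flat_search, regular_search])
print(winner)
```

Which algorithm wins in practice depends on the glob pattern and the bucket layout; the sketch only shows the racing and cancellation pattern.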

BigQuery connector:

  1. POM updates for GCS connector 1.9.9.

2018-10-03 (GCS 1.9.8, BQ 0.13.8)

03 Oct 21:06

Changelog

Cloud Storage connector:

  1. Expose FileChecksum in GoogleHadoopFileSystem via property (valid values: NONE, CRC32C, MD5):

    fs.gs.checksum.type (default: NONE)
    

    The CRC32C checksum is compatible with HDFS-13056.

  2. Add support for proxy authentication for both APACHE and JAVA_NET HttpTransport options.

    Proxy authentication is configurable with properties:

    fs.gs.proxy.username (default: null)
    fs.gs.proxy.password (default: null)
    
  3. Update Apache HttpClient to the latest version.
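
For readers unfamiliar with the two checksum types selectable through fs.gs.checksum.type, the sketch below computes both over a small payload. It is an illustration only: the bitwise CRC-32C routine is a textbook implementation (Castagnoli polynomial, reflected form 0x82F63B78), not the connector's code, and it makes no claim about how the connector packages the value into a Hadoop FileChecksum.

```python
import hashlib


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


payload = b"123456789"
print(f"CRC32C: {crc32c(payload):08x}")
print(f"MD5:    {hashlib.md5(payload).hexdigest()}")
```

The bit-by-bit loop is slow and is used here only for clarity; production code would normally rely on a table-driven or hardware-accelerated implementation.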

BigQuery connector:

  1. POM updates for GCS connector 1.9.8.

2018-09-20 (GCS 1.9.7, BQ 0.13.7)

20 Sep 19:36

Changelog

Cloud Storage connector:

  1. Add an option to provide credentials directly in Hadoop Configuration, without having to place a file on every node or associate service accounts with GCE VMs:

    fs.gs.auth.service.account.private.key.id
    fs.gs.auth.service.account.private.key
    
  2. Add an option to specify max bytes rewritten per rewrite request when fs.gs.copy.with.rewrite.enable is set to true:

    fs.gs.rewrite.max.bytes.per.call (default: 536870912)
    

    Even though GCS does not require this parameter for rewrite requests, rewrite requests are flaky without it.
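
To get a feel for the default of fs.gs.rewrite.max.bytes.per.call, the following sketch estimates how many rewrite requests an object of a given size would need. It assumes, for illustration only, that every request rewrites exactly the configured cap; the service may rewrite fewer bytes per call or finish sooner, so treat the result as a rough estimate. `rewrite_calls_needed` is a hypothetical helper.

```python
DEFAULT_MAX_BYTES_PER_CALL = 536870912  # default from the changelog (512 MiB)


def rewrite_calls_needed(object_size, max_bytes_per_call=DEFAULT_MAX_BYTES_PER_CALL):
    """Estimate rewrite requests, assuming each call moves max_bytes_per_call bytes."""
    if max_bytes_per_call <= 0:
        raise ValueError("max_bytes_per_call must be positive")
    if object_size <= 0:
        return 1  # assumption: an empty object still takes one request
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (object_size + max_bytes_per_call - 1) // max_bytes_per_call


five_gib = 5 * 1024**3
print(rewrite_calls_needed(five_gib))
```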

BigQuery connector:

  1. POM updates for GCS connector 1.9.7.

2018-09-20 (GCS 1.6.10, BQ 0.10.11)

20 Sep 17:19

Changelog

Cloud Storage connector:

  1. Add an option to specify max bytes rewritten per rewrite request when fs.gs.copy.with.rewrite.enable is set to true:

    fs.gs.rewrite.max.bytes.per.call (default: 536870912)
    

    Even though GCS does not require this parameter for rewrite requests, rewrite requests are flaky without it.

BigQuery connector:

  1. POM updates for GCS connector 1.6.10.

2018-09-04 (GCS 1.6.9, BQ 0.10.10)

04 Sep 22:40

Changelog

Cloud Storage connector:

  1. Change default values for GCS batch/directory operations properties to improve performance:

    fs.gs.copy.max.requests.per.batch (default: 1 -> 15)
    fs.gs.copy.batch.threads (default: 50 -> 15)
    fs.gs.max.requests.per.batch (default: 25 -> 15)
    fs.gs.batch.threads (default: 25 -> 15)
    
  2. Update all dependencies to latest versions.
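
The interaction between a per-batch request limit and a thread count, as configured by the properties in item 1, can be sketched as follows. How the connector actually schedules its batches is not described here, so this is only an assumed model: requests are split into groups of at most 15, and the groups are sent from a pool of 15 threads. `split_into_batches`, `send_batch`, and `run_batches` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_REQUESTS_PER_BATCH = 15  # new default of fs.gs.max.requests.per.batch
BATCH_THREADS = 15           # new default of fs.gs.batch.threads


def split_into_batches(requests, batch_size=MAX_REQUESTS_PER_BATCH):
    """Split a list of requests into consecutive groups of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def send_batch(batch):
    # Hypothetical stand-in for one batched HTTP call; returns the batch size.
    return len(batch)


def run_batches(requests, batch_size=MAX_REQUESTS_PER_BATCH, threads=BATCH_THREADS):
    """Send all batches from a thread pool and collect results in batch order."""
    batches = split_into_batches(requests, batch_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(send_batch, batches))


sizes = run_batches([f"object-{i}" for i in range(40)])
print(sizes)
```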

BigQuery connector:

  1. POM updates for GCS connector 1.6.9.
  2. Poll BQ jobs in their correct locations.
  3. Update all dependencies to latest versions.

2018-08-30 (GCS 1.9.6, BQ 0.13.6)

31 Aug 01:06

Changelog

Cloud Storage connector:

  1. Change default values for GCS batch/directory operations properties to improve performance:

    fs.gs.copy.max.requests.per.batch (default: 1 -> 15)
    fs.gs.copy.batch.threads (default: 50 -> 15)
    fs.gs.max.requests.per.batch (default: 25 -> 15)
    fs.gs.batch.threads (default: 25 -> 15)
    
  2. Migrate logging to Google Flogger.

    To configure Log4j as a Flogger backend, set the flogger.backend_factory system property to com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance, or to com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance if using the shaded jar.

    For example:

    java -Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance ...
    
  3. Delete read buffer in GoogleHadoopFSInputStream class and remove property that enables it:

    fs.gs.inputstream.internalbuffer.enable (default: false)
    
  4. Disable read buffer in GoogleCloudStorageReadChannel by default because it does not provide significant performance benefits:

    fs.gs.io.buffersize (default: 8388608 -> 0)
    
  5. Add configuration properties for buffers in GoogleHadoopOutputStream:

    fs.gs.outputstream.buffer.size (default: 8388608)
    fs.gs.outputstream.pipe.buffer.size (default: 1048576)
    
  6. Deprecate properties and replace them with new ones:

    fs.gs.io.buffersize -> fs.gs.inputstream.buffer.size (default: 0)
    fs.gs.io.buffersize.write -> fs.gs.outputstream.upload.chunk.size (default: 67108864)
    
  7. Enable fadvise AUTO mode by default:

    fs.gs.inputstream.fadvise (default: SEQUENTIAL -> AUTO)
    
  8. Update all dependencies to latest versions.
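
When upgrading existing configuration, the renames in item 6 amount to a simple key mapping. The sketch below applies that mapping to a plain dictionary of settings. `migrate_properties` is a hypothetical helper, and the choice to let an explicitly set new key take precedence over a deprecated one is an assumption made for this example, not documented connector behavior.

```python
# Old name -> new name, as listed in the changelog.
RENAMED_PROPERTIES = {
    "fs.gs.io.buffersize": "fs.gs.inputstream.buffer.size",
    "fs.gs.io.buffersize.write": "fs.gs.outputstream.upload.chunk.size",
}


def migrate_properties(conf):
    """Return a copy of conf with deprecated keys renamed to their replacements."""
    migrated = {k: v for k, v in conf.items() if k not in RENAMED_PROPERTIES}
    for old, new in RENAMED_PROPERTIES.items():
        # Assumption: a value already set under the new name wins.
        if old in conf and new not in migrated:
            migrated[new] = conf[old]
    return migrated


old_conf = {
    "fs.gs.io.buffersize": "0",
    "fs.gs.io.buffersize.write": "67108864",
    "fs.gs.inputstream.fadvise": "AUTO",
}
new_conf = migrate_properties(old_conf)
print(new_conf)
```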

BigQuery connector:

  1. POM updates for GCS connector 1.9.6.

  2. Migrate logging to Google Flogger.

    To configure Log4j as a Flogger backend, set the flogger.backend_factory system property to com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance, or to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance if using the shaded jar.

    For example:

    java -Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance ...
    
  3. Poll BQ jobs in their correct locations.

  4. Update all dependencies to latest versions.
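
Because the Flogger backend class name differs between the unshaded jar and the shaded GCS and BigQuery jars, launch scripts may want to build the JVM flag programmatically. The sketch below does that using only the class names quoted in this release; `flogger_backend_flag` is a hypothetical helper, not something shipped with the connectors.

```python
UNSHADED_FACTORY = (
    "com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance"
)

# Relocation prefixes used by the shaded jars, as quoted in the changelog.
SHADED_PREFIXES = {
    "gcs": "com.google.cloud.hadoop.repackaged.gcs.",
    "bigquery": "com.google.cloud.hadoop.repackaged.bigquery.",
}


def flogger_backend_flag(shaded_connector=None):
    """Build the -D flag; pass "gcs" or "bigquery" when using a shaded jar."""
    prefix = SHADED_PREFIXES[shaded_connector] if shaded_connector else ""
    return f"-Dflogger.backend_factory={prefix}{UNSHADED_FACTORY}"


print(flogger_backend_flag())
print(flogger_backend_flag("bigquery"))
```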

2018-08-09 (GCS 1.9.5, BQ 0.13.5)

10 Aug 05:19

Changelog

Cloud Storage connector:

  1. Improve compatibility of the build configuration (pom.xml files) with the Maven release plugin.

    Changes the version string from 1.9.5-hadoop2 to hadoop2-1.9.5.

  2. Update Maven plugin versions.

  3. Do not send a batch request when performing operations (rename, delete, copy) on a single object.

  4. Add the fs.gs.performance.cache.dir.metadata.prefetch.limit (default: 1000) configuration property to control the number of metadata objects that PerformanceCachingGoogleCloudStorage prefetches in the same directory.

    To disable metadata prefetching, set the property value to 0.

    To prefetch metadata for all objects in a directory, set the property value to -1.

  5. Add configuration properties to control batching of copy operations separately from other operations:

    fs.gs.copy.max.requests.per.batch (default: 30)
    fs.gs.copy.batch.threads (default: 0)
    
  6. Fix RejectedExecutionException during parallel execution of GCS batch requests.

  7. Change default values for GCS batch/directory operations properties:

    fs.gs.copy.with.rewrite.enable (default: false -> true)
    fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
    fs.gs.copy.batch.threads (default: 0 -> 50)
    fs.gs.max.requests.per.batch (default: 30 -> 25)
    fs.gs.batch.threads (default: 0 -> 25)
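
The three cases of fs.gs.performance.cache.dir.metadata.prefetch.limit described in item 4 (a positive limit, 0 to disable, -1 for everything) can be summarized in a few lines. This is a sketch of the documented semantics only; `metadata_to_prefetch` is a hypothetical function, and rejecting negative values other than -1 is an assumption.

```python
def metadata_to_prefetch(object_names, limit=1000):
    """Pick the objects in one directory whose metadata would be prefetched."""
    if limit == 0:
        return []                  # prefetching disabled
    if limit == -1:
        return list(object_names)  # prefetch metadata for every object
    if limit < -1:
        # Assumption: other negative values are treated as invalid here.
        raise ValueError("limit must be -1, 0, or a positive number")
    return list(object_names[:limit])


names = [f"dir/file-{i}" for i in range(5)]
print(metadata_to_prefetch(names, limit=3))
```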
    

BigQuery connector:

  1. POM updates for GCS connector 1.9.5.

  2. Improve compatibility of the build configuration (pom.xml files) with the Maven release plugin.

    Changes the version string from 0.13.5-hadoop2 to hadoop2-0.13.5.

  3. Update Maven plugin versions.