Failed to output to Redshift due to "Invalid Date Format" #227

Closed
kyoshidajp opened this issue Jan 29, 2018 · 14 comments

@kyoshidajp

Issue Summary

Failed to output CSV data to Redshift: the load fails with org.postgresql.util.PSQLException.

Situation details

Table schema:

CREATE TABLE users
(
   name      varchar(80),
   mail      varchar(65535),
   birthday  date
);

Input CSV: (users.csv)

"sample_user1","sample_user1@example.com","1980-01-01"
"sample_user2","sample_user2@example.com","1000-01-01"

Configuration: (conf/csv_to_redshift_users.yml)

in:
  type: file
  path_prefix: /path/to/users.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    trim_if_not_quoted: true
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    default_timezone: 'Asia/Tokyo'
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: timestamp, format: "%Y-%m-%d"}
out:
  type: redshift
  host: XXXX.ap-northeast-1.redshift.amazonaws.com
  database: XXXX
  user: XXXX
  password: XXXX
  iam_user_name: XXXX
  aws_access_key_id: XXXX
  aws_secret_access_key: XXXX
  s3_bucket: XXXX
  s3_key_prefix: XXXX
  port: XXXX
  table: users
  mode: insert
  options: {loglevel: 1}
  default_timezone: 'Asia/Tokyo'
  column_options:
    name: {type: 'varchar(80)'}
    email: {type: 'varchar(80)'}
    birthday: {type: 'date'}

Run command:

% embulk run -b . -l debug conf/csv_to_redshift_users.yml

Error message:

2018-01-29 20:08:02.139 +0900 [INFO] (0001:cleanup): > 0.01 seconds
org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:375)
        at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:607)
        at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:35)
        at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:391)
        at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:387)
        at org.embulk.spi.Exec.doWith(Exec.java:25)
        at org.embulk.exec.BulkLoader.run(BulkLoader.java:387)
        at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:180)
        at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:331)
        at org.embulk.EmbulkRunner.run(EmbulkRunner.java:174)
        at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:468)
        at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:100)
        at org.embulk.cli.Main.main(Main.java:28)
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.embulk.output.jdbc.AbstractJdbcOutputPlugin$PluginPageOutput.finish(AbstractJdbcOutputPlugin.java:1138)
        at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput.finish(LocalExecutorPlugin.java:497)
        at org.embulk.spi.PageBuilder.finish(PageBuilder.java:264)
        at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:404)
        at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:156)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.runInputTask(LocalExecutorPlugin.java:294)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.access$000(LocalExecutorPlugin.java:212)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:257)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:253)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:645)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:481)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:409)
        at org.embulk.output.redshift.RedshiftOutputConnection.runCopy(RedshiftOutputConnection.java:84)
        at org.embulk.output.redshift.RedshiftCopyBatchInsert$CopyTask.call(RedshiftCopyBatchInsert.java:264)
        at org.embulk.output.redshift.RedshiftCopyBatchInsert$CopyTask.call(RedshiftCopyBatchInsert.java:240)
        ... 4 more

Error: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.

Technical details

My embulk version is v0.8.39.

The error recorded in stl_load_errors, queried via SQL Workbench/J, is the following.

Query:

select raw_field_value, err_code, err_reason
from stl_load_errors
order by starttime desc
limit 1;

Result:

raw_field_value | err_code | err_reason
----------------+----------+-------------------------------------------------
999-12-26       | 1205     | Invalid Date Format - length must be 10 or more

When I change the CSV data as follows, the output succeeds.

OK Pattern1:

"sample_user1","sample_user1@example.com","1980-01-01"
"sample_user2","sample_user2@example.com","2000-01-01"

OK Pattern2:

"sample_user1","sample_user1@example.com","1980-01-01"

OK Pattern3:

"sample_user1","sample_user1@example.com","100-01-01"

Also, the same data can be inserted successfully with SQL via SQL Workbench/J.

insert into users values
('sample_user1', 'sample_user1@example.com', '1980-01-01'),
('sample_user2', 'sample_user2@example.com', '1000-01-01')
@toru-takahashi

toru-takahashi commented Jan 30, 2018

Thank you for the report.

I confirmed that "OK Pattern 3" also doesn't actually work with Embulk 0.8.39.

"sample_user1","sample_user1@example.com","100-01-01"

In your case, your configuration sets skip_header_lines: 1, so that single line is skipped as a header. That's why it didn't return any error.

I also confirmed that OK Pattern 3 doesn't work with Embulk 0.8.20.

One strange point is that "raw_field_value" in stl_load_errors doesn't match the actual value.

"sample_user1","sample_user1@example.com","1000-01-01"

But raw_field_value contains "999-12-26".

It seems there are 2 issues.

1 - The date type in embulk-output-redshift doesn't support the yyy-MM-dd format.
2 - The date type in embulk-output-redshift somehow modifies the date incorrectly.

@hiroyuki-sato
Contributor

Hello, @kyoshidajp

Thank you for the report.
Does this mean Embulk can't insert the date '1000-01-01' correctly?
The issue is probably that Embulk converts a string to an invalid date value if the date is before 1970-01-01.

I confirmed it with this test:
https://github.com/hiroyuki-sato/embulk-support/tree/master/output-jdbc-227-before1970

@dmikurube
Member

While we still need to investigate embulk-output-jdbc, I just confirmed that the Embulk core itself parses and formats 1000-01-01 correctly. I guess it's a problem inside embulk-output-jdbc.

FYI: just as general advice, I'd basically suggest using UTC or numeric time offsets like +09:00, and avoiding region-based time zone IDs such as Asia/Tokyo as far as possible unless they are really, really required. Region-based time zones are not really "stable" or "consistent", and that fragility can confuse people.

Asia/Tokyo is a relatively stable one, but it is confusing especially before 1970. It had a different time offset before around 1888, and it had summer time around 1950.
https://github.com/eggert/tz/blob/f084056070c1ffff964c0b749017cdf9bf175c0c/asia#L1512-L1513
https://github.com/eggert/tz/blob/f084056070c1ffff964c0b749017cdf9bf175c0c/asia#L1467-L1477

As a result,
1888-01-01 00:00:00 Asia/Tokyo = -2587713539 (epoch seconds)
1889-01-01 00:00:00 Asia/Tokyo = -2556090000 (epoch seconds)

Embulk internally represents the TIMESTAMP type as seconds from the epoch (1970-01-01 00:00:00 UTC). The combination of this representation and region-based time zones can easily cause behavior that is unexpected for users. When region-based time zones are truly required, it is basically recommended to keep the date/time values as STRING, not as TIMESTAMP.
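
For illustration, a minimal java.time sketch (an assumption-light example of mine, requiring only a reasonably current tz database) reproduces those two epoch values and shows the offset actually applied:

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TokyoOffsets {
    public static void main(String[] args) {
        final ZoneId tokyo = ZoneId.of("Asia/Tokyo");
        for (final int year : new int[] {1888, 1889}) {
            final ZonedDateTime zoned = LocalDateTime.of(year, 1, 1, 0, 0, 0).atZone(tokyo);
            // Print the epoch seconds and the zone offset the tz database applied.
            System.out.println(zoned.toInstant().getEpochSecond() + " " + zoned.getOffset());
        }
        // Expected output:
        // -2587713539 +09:18:59   (local mean time before ~1888)
        // -2556090000 +09:00
    }
}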

@dmikurube
Member

The code is a bit different in embulk-output-postgresql, but the cause is similar:

appendDelimiter();
cal.setTimeInMillis(v.getEpochSecond() * 1000);
String f = String.format(Locale.ENGLISH, "%02d-%02d-%02d",
        cal.get(Calendar.YEAR),
        cal.get(Calendar.MONTH) + 1,
        cal.get(Calendar.DAY_OF_MONTH));
writer.write(f);

We could reproduce this easily.

import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.ZoneId;
import java.util.Calendar;
import java.util.Locale;
import org.joda.time.DateTimeZone;

public class AncientDays {
    public static void main(String[] args) {
        final ZonedDateTime zoned = ZonedDateTime.of(1000, 1, 1, 0, 0, 0, 0, ZoneId.of("Asia/Tokyo"));
        final Instant instant = zoned.toInstant();
        System.out.println(instant.getEpochSecond());

        final Calendar cal = Calendar.getInstance(DateTimeZone.forID("Asia/Tokyo").toTimeZone(), Locale.ENGLISH);
        cal.setTimeInMillis(instant.getEpochSecond() * 1000L);
        System.out.println(String.format(Locale.ENGLISH, "%02d-%02d-%02d",
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH)));
    }
}

The output is:

-30610257539
999-12-26

As @hito4t mentioned, the root cause looks like a problem between the Gregorian calendar and the Julian calendar. There actually is no "consistent" method to handle old dates before 1582; we need to specify everywhere how they are handled. Please note that dates before 1582 can be handled differently.
(See: https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%AC%E3%82%B4%E3%83%AA%E3%82%AA%E6%9A%A6#%E3%83%A6%E3%83%AA%E3%82%A6%E3%82%B9%E6%9A%A6%E3%81%AB%E3%82%88%E3%82%8B%E3%81%9A%E3%82%8C, in Japanese.)

But here, the problem is not only the different dates, but the Exception. I guess the direct cause of the Exception may be that the lengths of the strings differ (original: 1000-01-01, 10 chars; applied: 999-12-26, 9 chars). Do you have any ideas on this, @hito4t?
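
For illustration, a small sketch (my own, not plugin code; it reuses Asia/Tokyo and the epoch value printed by the snippet above) checks both points: "%02d" only guarantees a minimum width of 2, so a 3-digit year produces a 9-character date string, and forcing java.util.GregorianCalendar onto the proleptic Gregorian calendar (which java.time uses) recovers 1000-01-01:

import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class LengthAndCutover {
    public static void main(String[] args) {
        // 1) "%02d" pads to a minimum width of 2 only; year 999 keeps 3 digits,
        //    so the result is 9 characters ("length must be 10 or more").
        System.out.println(String.format(Locale.ENGLISH, "%02d-%02d-%02d", 999, 12, 26));

        // 2) GregorianCalendar switches to the Julian calendar before 1582-10-15
        //    by default; moving the cutover to the far past makes it proleptic
        //    Gregorian, matching java.time.
        final GregorianCalendar cal =
                new GregorianCalendar(TimeZone.getTimeZone("Asia/Tokyo"), Locale.ENGLISH);
        cal.setGregorianChange(new Date(Long.MIN_VALUE));
        cal.setTimeInMillis(-30610257539L * 1000L);  // 1000-01-01 00:00:00 Asia/Tokyo
        System.out.println(String.format(Locale.ENGLISH, "%04d-%02d-%02d",
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH)));
        // Expected output:
        // 999-12-26
        // 1000-01-01
    }
}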

@hito4t
Contributor

hito4t commented Jan 30, 2018

@dmikurube
First, 1000-01-01 will be parsed as an Embulk Timestamp value.
embulk-output-jdbc will convert it to java.sql.Date (= 999-12-26) before inserting into the DB.
It seems that Redshift can't parse the yyy-MM-dd format.

embulk-output-redshift might be able to insert 1000-01-01 if the column were defined as a string value:

    - {name: birthday, type: string}
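
For reference, the conversion point is easy to see with a minimal sketch (an illustration only; it assumes the epoch value from the earlier snippet, and sets the default time zone to Asia/Tokyo because java.sql.Date#toString() uses the default time zone):

public class SqlDateDemo {
    public static void main(String[] args) {
        // java.sql.Date#toString() renders the instant in the default time zone,
        // using the hybrid Julian/Gregorian calendar of java.util.Date.
        java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("Asia/Tokyo"));
        // Epoch seconds of 1000-01-01 00:00:00 Asia/Tokyo (proleptic Gregorian).
        final java.sql.Date sqlDate = new java.sql.Date(-30610257539L * 1000L);
        System.out.println(sqlDate);  // prints 0999-12-26
    }
}

Note that toString() pads the year to four digits here; the 9-character 999-12-26 string presumably comes from plugin-side formatting, as in the embulk-output-postgresql snippet above.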

@dmikurube
Member

@hito4t Thanks. Sounds correct, but I was not sure why 100-01-01 was working as reported... Anyway, type: string could be a good workaround.

@hiroyuki-sato
Contributor

@dmikurube @hito4t FYI

With PostgreSQL, I needed to set column_options explicitly, like the following.

in:
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: string }
out:
  type: postgresql
  column_options:
    birthday: { type: date, value_type: string }
#    birthday: { type: date, value_type: date } <-- Can't insert DATE with this config.
\d users
              Table "public.users"
  Column  |           Type           | Modifiers
----------+--------------------------+-----------
 name     | character varying(80)    |
 mail     | character varying(65535) |
 birthday | date                     |

@hiroyuki-sato
Contributor

@kyoshidajp

Can you try this configuration? It worked well with my PostgreSQL database.

in:
  type: file
  path_prefix: /path/to/users.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    trim_if_not_quoted: true
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    default_timezone: 'Asia/Tokyo'
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: string } # <-- Change here
out:
  type: redshift
  host: XXXX.ap-northeast-1.redshift.amazonaws.com
  database: XXXX
  user: XXXX
  password: XXXX
  iam_user_name: XXXX
  aws_access_key_id: XXXX
  aws_secret_access_key: XXXX
  s3_bucket: XXXX
  s3_key_prefix: XXXX
  port: XXXX
  table: users
  mode: insert
  options: {loglevel: 1}
  default_timezone: 'Asia/Tokyo'
  column_options:
    name: {type: 'varchar(80)'}
    email: {type: 'varchar(80)'}
    birthday: { type: date, value_type: string } #<-- change here

@kyoshidajp
Author

Thanks, everyone 🙇

@hiroyuki-sato Could you give me some time to run it? I will probably be able to try it in 2-3 hours.

@kyoshidajp
Author

@hiroyuki-sato Sorry for the late reply.

It succeeded using your configuration.

@hiroyuki-sato
Contributor

Thank you for reporting the result.
Good news!

Please close this issue when you have time.

@kyoshidajp
Author

Thanks so much!

@hito4t
Contributor

hito4t commented Feb 1, 2018

I've created a new issue: #229.
