Failed to output to Redshift due to "Invalid Date Format" #227

Closed
kyoshidajp opened this issue Jan 29, 2018 · 14 comments

@kyoshidajp

Issue Summary

Failed to output CSV data to Redshift: the load fails with org.postgresql.util.PSQLException.

Situation details

Table schema:

CREATE TABLE users
(
   name      varchar(80),
   mail      varchar(65535),
   birthday  date
);

Input CSV: (users.csv)

"sample_user1","sample_user1@example.com","1980-01-01"
"sample_user2","sample_user2@example.com","1000-01-01"

Configuration: (conf/csv_to_redshift_users.yml)

in:
  type: file
  path_prefix: /path/to/users.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    trim_if_not_quoted: true
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    default_timezone: 'Asia/Tokyo'
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: timestamp, format: "%Y-%m-%d"}
out:
  type: redshift
  host: XXXX.ap-northeast-1.redshift.amazonaws.com
  database: XXXX
  user: XXXX
  password: XXXX
  iam_user_name: XXXX
  aws_access_key_id: XXXX
  aws_secret_access_key: XXXX
  s3_bucket: XXXX
  s3_key_prefix: XXXX
  port: XXXX
  table: users
  mode: insert
  options: {loglevel: 1}
  default_timezone: 'Asia/Tokyo'
  column_options:
    name: {type: 'varchar(80)'}
    email: {type: 'varchar(80)'}
    birthday: {type: 'date'}

Run command:

% embulk run -b . -l debug conf/csv_to_redshift_users.yml

Error message:

2018-01-29 20:08:02.139 +0900 [INFO] (0001:cleanup): > 0.01 seconds
org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:375)
        at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:607)
        at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:35)
        at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:391)
        at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:387)
        at org.embulk.spi.Exec.doWith(Exec.java:25)
        at org.embulk.exec.BulkLoader.run(BulkLoader.java:387)
        at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:180)
        at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:331)
        at org.embulk.EmbulkRunner.run(EmbulkRunner.java:174)
        at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:468)
        at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:100)
        at org.embulk.cli.Main.main(Main.java:28)
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.embulk.output.jdbc.AbstractJdbcOutputPlugin$PluginPageOutput.finish(AbstractJdbcOutputPlugin.java:1138)
        at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput.finish(LocalExecutorPlugin.java:497)
        at org.embulk.spi.PageBuilder.finish(PageBuilder.java:264)
        at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:404)
        at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:156)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.runInputTask(LocalExecutorPlugin.java:294)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.access$000(LocalExecutorPlugin.java:212)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:257)
        at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:253)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:645)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:481)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:409)
        at org.embulk.output.redshift.RedshiftOutputConnection.runCopy(RedshiftOutputConnection.java:84)
        at org.embulk.output.redshift.RedshiftCopyBatchInsert$CopyTask.call(RedshiftCopyBatchInsert.java:264)
        at org.embulk.output.redshift.RedshiftCopyBatchInsert$CopyTask.call(RedshiftCopyBatchInsert.java:240)
        ... 4 more

Error: java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: Load into table 'users_00000161419a2426_embulk000' failed.  Check 'stl_load_errors' system table for details.

Technical details

My embulk version is v0.8.39.

The error recorded in stl_load_errors, queried via SQL Workbench/J, is the following.

Query:

select raw_field_value, err_code, err_reason
from stl_load_errors
order by starttime desc
limit 1;

Result:

raw_field_value | err_code | err_reason
----------------+----------+-------------------------------------------------
999-12-26       | 1205     | Invalid Date Format - length must be 10 or more

When I change the CSV data as follows, the output succeeds.

OK Pattern1:

"sample_user1","sample_user1@example.com","1980-01-01"
"sample_user2","sample_user2@example.com","2000-01-01"

OK Pattern2:

"sample_user1","sample_user1@example.com","1980-01-01"

OK Pattern3:

"sample_user1","sample_user1@example.com","100-01-01"

Also, the same data can be inserted successfully with SQL via SQL Workbench/J.

insert into users values
('sample_user1', 'sample_user1@example.com', '1980-01-01'),
('sample_user2', 'sample_user2@example.com', '1000-01-01')
@toru-takahashi

toru-takahashi commented Jan 30, 2018

Thank you for the report.

I confirmed that "OK Pattern 3" also doesn't actually work with Embulk 0.8.39.

"sample_user1","sample_user1@example.com","100-01-01"

In your case, your configuration sets skip_header_lines: 1, so that single line is skipped as a header. That's why it didn't return any error.

I also confirmed that OK Pattern 3 doesn't work with Embulk 0.8.20.

One strange point is that "raw_field_value" in stl_load_errors doesn't match the actual value.

"sample_user1","sample_user1@example.com","1000-01-01"

But raw_field_value contains "999-12-26".

It seems there are 2 issues.

1 - The date type in embulk-output-redshift doesn't support the yyy-MM-dd format.
2 - The date type in embulk-output-redshift somehow modifies the date incorrectly.

@hiroyuki-sato
Contributor

Hello, @kyoshidajp

Thank you for the report.
Does this mean Embulk can't insert the date '1000-01-01' correctly?
The issue is probably that Embulk converts a string to an invalid date value if the date is before 1970-01-01.

I confirmed it with this test:
https://github.com/hiroyuki-sato/embulk-support/tree/master/output-jdbc-227-before1970

@dmikurube
Member

While we still need to investigate embulk-output-jdbc, I just confirmed that the Embulk core itself parses and formats 1000-01-01 correctly. I guess it's a problem inside embulk-output-jdbc.

FYI: just as general advice, I'd basically suggest using UTC or numeric time offsets like +09:00, and avoiding region-based time zone IDs such as Asia/Tokyo as far as possible unless they are really, really required. Region-based time zones are not really "stable" or "consistent", and that fragility can confuse people.

Asia/Tokyo is a relatively stable one, but it is confusing especially before 1970. It had a different time offset before around 1888, and it had summer time around 1950.
https://github.com/eggert/tz/blob/f084056070c1ffff964c0b749017cdf9bf175c0c/asia#L1512-L1513
https://github.com/eggert/tz/blob/f084056070c1ffff964c0b749017cdf9bf175c0c/asia#L1467-L1477

As a result,
1888-01-01 00:00:00 Asia/Tokyo = -2587713539 (epoch seconds)
1889-01-01 00:00:00 Asia/Tokyo = -2556090000 (epoch seconds)

Embulk internally represents the TIMESTAMP type as seconds from the epoch (1970-01-01 00:00:00 UTC). The combination of this representation and region-based time zones can easily cause behavior that is unexpected for users. When region-based time zones are truly required, it is basically recommended to keep the date/time values as STRING, not as TIMESTAMP.
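
For illustration, a minimal java.time sketch (an assumption-light example of mine, requiring only a reasonably current tz database) reproduces those two epoch values and shows the offset actually applied:

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TokyoOffsets {
    public static void main(String[] args) {
        final ZoneId tokyo = ZoneId.of("Asia/Tokyo");
        for (final int year : new int[] {1888, 1889}) {
            final ZonedDateTime zoned = LocalDateTime.of(year, 1, 1, 0, 0, 0).atZone(tokyo);
            // Print the epoch seconds and the zone offset the tz database applied.
            System.out.println(zoned.toInstant().getEpochSecond() + " " + zoned.getOffset());
        }
        // Expected output:
        // -2587713539 +09:18:59   (local mean time before ~1888)
        // -2556090000 +09:00
    }
}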

@dmikurube
Member

The code is a bit different in embulk-output-postgresql, but the cause is similar:

appendDelimiter();
cal.setTimeInMillis(v.getEpochSecond() * 1000);
String f = String.format(Locale.ENGLISH, "%02d-%02d-%02d",
        cal.get(Calendar.YEAR),
        cal.get(Calendar.MONTH) + 1,
        cal.get(Calendar.DAY_OF_MONTH));
writer.write(f);

We could reproduce this easily.

import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.ZoneId;
import java.util.Calendar;
import java.util.Locale;
import org.joda.time.DateTimeZone;

public class AncientDays {
    public static void main(String[] args) {
        final ZonedDateTime zoned = ZonedDateTime.of(1000, 1, 1, 0, 0, 0, 0, ZoneId.of("Asia/Tokyo"));
        final Instant instant = zoned.toInstant();
        System.out.println(instant.getEpochSecond());

        final Calendar cal = Calendar.getInstance(DateTimeZone.forID("Asia/Tokyo").toTimeZone(), Locale.ENGLISH);
        cal.setTimeInMillis(instant.getEpochSecond() * 1000L);
        System.out.println(String.format(Locale.ENGLISH, "%02d-%02d-%02d",
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH)));
    }
}

The output is:

-30610257539
999-12-26

As @hito4t mentioned, the root cause looks like a problem between the Gregorian calendar and the Julian calendar. There actually is no "consistent" method to handle old dates before 1582; we need to specify everywhere how they are handled. Please note that dates before 1582 can be handled differently.
(See: https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%AC%E3%82%B4%E3%83%AA%E3%82%AA%E6%9A%A6#%E3%83%A6%E3%83%AA%E3%82%A6%E3%82%B9%E6%9A%A6%E3%81%AB%E3%82%88%E3%82%8B%E3%81%9A%E3%82%8C, in Japanese.)

But here, the problem is not only the different dates, but the Exception. I guess the direct cause of the Exception may be that the lengths of the strings differ (original: 1000-01-01, 10 chars; applied: 999-12-26, 9 chars). Do you have any ideas on this, @hito4t?
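
For illustration, a small sketch (my own, not plugin code; it reuses Asia/Tokyo and the epoch value printed by the snippet above) checks both points: "%02d" only guarantees a minimum width of 2, so a 3-digit year produces a 9-character date string, and forcing java.util.GregorianCalendar onto the proleptic Gregorian calendar (which java.time uses) recovers 1000-01-01:

import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class LengthAndCutover {
    public static void main(String[] args) {
        // 1) "%02d" pads to a minimum width of 2 only; year 999 keeps 3 digits,
        //    so the result is 9 characters ("length must be 10 or more").
        System.out.println(String.format(Locale.ENGLISH, "%02d-%02d-%02d", 999, 12, 26));

        // 2) GregorianCalendar switches to the Julian calendar before 1582-10-15
        //    by default; moving the cutover to the far past makes it proleptic
        //    Gregorian, matching java.time.
        final GregorianCalendar cal =
                new GregorianCalendar(TimeZone.getTimeZone("Asia/Tokyo"), Locale.ENGLISH);
        cal.setGregorianChange(new Date(Long.MIN_VALUE));
        cal.setTimeInMillis(-30610257539L * 1000L);  // 1000-01-01 00:00:00 Asia/Tokyo
        System.out.println(String.format(Locale.ENGLISH, "%04d-%02d-%02d",
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1,
                cal.get(Calendar.DAY_OF_MONTH)));
        // Expected output:
        // 999-12-26
        // 1000-01-01
    }
}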

@hito4t
Contributor

hito4t commented Jan 30, 2018

@dmikurube
First, 1000-01-01 will be parsed as an Embulk Timestamp value.
embulk-output-jdbc will convert it to java.sql.Date (= 999-12-26) before inserting into the DB.
It seems that Redshift can't parse the yyy-MM-dd format.

embulk-output-redshift might be able to insert 1000-01-01 if the column were defined as a string value:

    - {name: birthday, type: string}
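
For reference, the conversion point is easy to see with a minimal sketch (an illustration only; it assumes the epoch value from the earlier snippet, and sets the default time zone to Asia/Tokyo because java.sql.Date#toString() uses the default time zone):

public class SqlDateDemo {
    public static void main(String[] args) {
        // java.sql.Date#toString() renders the instant in the default time zone,
        // using the hybrid Julian/Gregorian calendar of java.util.Date.
        java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("Asia/Tokyo"));
        // Epoch seconds of 1000-01-01 00:00:00 Asia/Tokyo (proleptic Gregorian).
        final java.sql.Date sqlDate = new java.sql.Date(-30610257539L * 1000L);
        System.out.println(sqlDate);  // prints 0999-12-26
    }
}

Note that toString() pads the year to four digits here; the 9-character 999-12-26 string presumably comes from plugin-side formatting, as in the embulk-output-postgresql snippet above.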

@dmikurube
Member

@hito4t Thanks. Sounds correct, but I was not sure why 100-01-01 was working as reported... Anyway, type: string could be a good workaround.

@hiroyuki-sato
Contributor

@dmikurube @hito4t FYI

With PostgreSQL, I needed to set column_options explicitly, like the following.

in:
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: string }
out:
  type: postgresql
  column_options:
    birthday: { type: date, value_type: string }
#    birthday: { type: date, value_type: date } <-- Can't insert DATE with this config.
\d users
              Table "public.users"
  Column  |           Type           | Modifiers
----------+--------------------------+-----------
 name     | character varying(80)    |
 mail     | character varying(65535) |
 birthday | date                     |

@hiroyuki-sato
Contributor

@kyoshidajp

Can you try this configuration? It worked well with my PostgreSQL database.

in:
  type: file
  path_prefix: /path/to/users.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    trim_if_not_quoted: true
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    default_timezone: 'Asia/Tokyo'
    columns:
    - {name: name, type: string}
    - {name: mail, type: string}
    - {name: birthday, type: string } # <-- Change here
out:
  type: redshift
  host: XXXX.ap-northeast-1.redshift.amazonaws.com
  database: XXXX
  user: XXXX
  password: XXXX
  iam_user_name: XXXX
  aws_access_key_id: XXXX
  aws_secret_access_key: XXXX
  s3_bucket: XXXX
  s3_key_prefix: XXXX
  port: XXXX
  table: users
  mode: insert
  options: {loglevel: 1}
  default_timezone: 'Asia/Tokyo'
  column_options:
    name: {type: 'varchar(80)'}
    email: {type: 'varchar(80)'}
    birthday: { type: date, value_type: string } #<-- change here

@kyoshidajp
Author

Thanks, everyone 🙇

@hiroyuki-sato Could you give me some time to run it? I will probably be able to try it in 2-3 hours.

@kyoshidajp
Author

@hiroyuki-sato Sorry for the late reply.

It succeeded using your configuration.

@hiroyuki-sato
Contributor

Thank you for reporting the result.
Good news!

Please close this issue when you have time.

@kyoshidajp
Author

Thanks so much!

@hito4t
Contributor

hito4t commented Feb 1, 2018

I've created a new issue: #229.
