Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44214: [C++] JsonExtensionType equality check ignores storage type #44215

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

rok
Copy link
Member

@rok rok commented Sep 24, 2024

Rationale for this change

As noted in #13901 (review):

bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
  return other.extension_name() == this->extension_name();
}

This equality check does not take into account the storage type, but only the name.
As a consequence, a JsonExtensionType type will be seen as equal to JsonExtensionType<large_string>.

What changes are included in this PR?

This change introduces storage equality check into JsonExtensionType equality check.

This also fixes a storage type check in JsonExtensionType::Make.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@rok rok added this to the 18.0.0 milestone Sep 24, 2024
@rok rok linked an issue Sep 24, 2024 that may be closed by this pull request
Copy link

⚠️ GitHub issue #44214 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/extension/json.h Outdated Show resolved Hide resolved
@@ -763,7 +763,7 @@ TEST_F(TestConvertParquetSchema, ParquetSchemaArrowExtensions) {
props.set_arrow_extensions_enabled(true);
auto arrow_schema = ::arrow::schema(
{::arrow::field("json_1", ::arrow::extension::json(), true),
::arrow::field("json_2", ::arrow::extension::json(::arrow::large_utf8()),
::arrow::field("json_2", ::arrow::extension::json(::arrow::utf8()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making the test easier, can you keep large_utf8 and ensure it passes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reconstruct storage type if it is not stored? In this case we know data is of JSON logical type, but we don't know what was it's storage type at write time. Am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(There was a test where arrow schema was available and wasn't used at read time. I've addressed that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reconstruct storage type if it is not stored? In this case we know data is of JSON logical type, but we don't know what was it's storage type at write time. Am I missing something?

No, I'm asking you to write the test so that it reflects that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this? e522efb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another assertion to justify changing arrow_schema https://github.com/apache/arrow/pull/44215/files/94006f92085377fc4b72c2a2da844e36fb4d86d3..77c964b04972fc08f04957190866fb475b453c3c.
Let me know if something else should be changed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 24, 2024
@rok rok requested a review from pitrou September 24, 2024 20:33
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 25, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 25, 2024
if (ex_type.storage_type()->Equals(*inferred->field->type())) {
if (ex_type.storage_type()->Equals(*inferred->field->type()) ||
(ex_type.extension_name() == "arrow.json" &&
!ex_type.storage_type()->Equals(*inferred->field->type()))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "types are not equal" check is superfluous? Because if they were equal, the first part of the if check would have passed and we wouldn't get here?

I assume that in case of JSON extension type, we either just have to assume that the data will correctly use string, or we check explicitly that the inferred type is string type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is superfluous indeed. Changed as per your suggestion: 3364c0f

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 25, 2024
return std::make_shared<JsonExtensionType>(storage_type);
}

std::shared_ptr<DataType> json(std::shared_ptr<DataType> storage_type) {
return JsonExtensionType::Make(storage_type).ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return JsonExtensionType::Make(storage_type).ValueOrDie();
return JsonExtensionType::Make(std::move(storage_type)).ValueOrDie();

Also, please ensure that JsonExtensionType::Make takes a std::shared_ptr<DataType>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 25, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 25, 2024
@rok rok requested a review from pitrou September 25, 2024 16:55
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining nits

cpp/src/arrow/extension/json.cc Outdated Show resolved Hide resolved
cpp/src/arrow/extension/json.cc Outdated Show resolved Hide resolved
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@rok
Copy link
Member Author

rok commented Sep 25, 2024

Remaining nits

@pitrou added your changes.

@rok rok requested a review from pitrou September 25, 2024 17:27
return std::make_shared<JsonExtensionType>(storage_type);
Result<std::shared_ptr<DataType>> JsonExtensionType::Make(
std::shared_ptr<DataType> storage_type) {
if (storage_type->id() != Type::STRING && storage_type->id() != Type::STRING_VIEW &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the following static function to JsonExtensionType to remove repeated code here and in cpp/src/parquet/arrow/schema.cc below?

static bool IsSupportedStorageType(Type::type type_id) {
  return type_id == Type::STRING || type_id == Type::STRING_VIEW || type_id == Type::LARGE_STRING;
}

Copy link
Member Author

@rok rok Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as suggested: 6dff127

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 26, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] JsonExtensionType equality check ignores storage type
4 participants