[BUG] get_json_object parsing values. #10196

Closed
mlahir1 opened this issue Feb 2, 2022 · 10 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

@mlahir1

mlahir1 commented Feb 2, 2022

When a field is not present in a nested JSON object, get_json_object doesn't return an empty list or NA for it; it just skips the field.
This makes it impossible to map the extracted values back to the right rows afterwards.

>>> import cudf
>>> df = cudf.DataFrame()
>>> x = '''\
{"tup":\
    [\
        {"id":"1","array":[1,2]},\
        {"id":"2"},\
        {"id":"3","array":[3,4]},\
        {"id":"4"}\
]}\
'''
>>> df['b'] = cudf.Series([x])
>>> df['id'] = df.b.str.get_json_object('$.tup[*].id')
>>> df['c'] = df.b.str.get_json_object('$.tup[*].array')
>>> df.id
0    ["1","2","3","4"]
Name: id, dtype: object
>>> df.c
0    [[1,2],[3,4]]
Name: c, dtype: object
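
The misalignment can be reproduced without a GPU. The following is a minimal pure-Python sketch using only the standard `json` module; `wildcard_skip_missing` is a hypothetical helper that imitates the reported behaviour of `[*]` (skipping elements that lack the field) and is not part of cudf.

```python
import json

doc = json.loads(
    '{"tup":[{"id":"1","array":[1,2]},{"id":"2"},'
    '{"id":"3","array":[3,4]},{"id":"4"}]}'
)

def wildcard_skip_missing(items, field):
    """Hypothetical stand-in for '$.tup[*].<field>': elements without
    the field are silently dropped from the result."""
    return [item[field] for item in items if field in item]

ids = wildcard_skip_missing(doc["tup"], "id")
arrays = wildcard_skip_missing(doc["tup"], "array")

print(ids)     # ['1', '2', '3', '4']
print(arrays)  # [[1, 2], [3, 4]]

# Pairing by position now attaches [3, 4] to id "2" instead of id "3".
pairs = list(zip(ids, arrays))
print(pairs)   # [('1', [1, 2]), ('2', [3, 4])]
```

Because the two results have different lengths, there is no way to tell from the output alone which ids the arrays belong to.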
@mlahir1 mlahir1 added Needs Triage Need team to review and classify bug Something isn't working labels Feb 2, 2022
@GregoryKimball GregoryKimball added feature request New feature or request and removed bug Something isn't working labels Feb 9, 2022
@GregoryKimball
Contributor

Hello @nvdbaranec would you please comment on this request and the scope of adding behavioral flags to get_json_object?

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@nvdbaranec
Contributor

Still relevant.

@GregoryKimball
Contributor

Would you please clarify the use case? As I understand it, the use case reads a text file containing JSON records, calls get_json_object to return strings of comma-separated values, calls .str.split(',') to create a list<str> column, and finally explodes the list column into a table of string columns. Given this use case, would this be the preferred output?

>>> df.at[0, 'c']
'[[1,2],null,[3,4],null]'

Would this issue be resolved by "Support nested types in JSON reader" (#8827)?

@mlahir1
Author

mlahir1 commented Mar 28, 2022

Hello @GregoryKimball ,

If we could parse the column as JSON strings, that would make other jobs easier for us too.

Our use case:
We have ORC files with one column holding a JSON string. The JSON is complex and nested and contains lists; some values are present in some rows and missing in others. We are trying to extract various fields and explode them to flatten the data. The example above is just one instance of this. When an item is not present, get_json_object doesn't return a null, so the records end up mismatched once exploded.

If we are able to parse the string as a JSON object, I believe this would solve our use case, as long as it returns null when a field/value is not present.

If you have any further questions or want to discuss this further, please let me know.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 5, 2022
@GregoryKimball
Contributor

GregoryKimball commented Apr 7, 2022

Thanks @mlahir1 for your message. I'd like to propose a different JsonPath approach. I recommend iterating through your objects with JsonPath [0...x] instead of [*].

import cudf
x = '''\
{"tup":\
    [\
        {"id":"1","array":[1,2]},\
        {"id":"2"},\
        {"id":"3","array":[3,4]},\
        {"id":"4"}\
]}\
'''
s = cudf.Series([x])

df = cudf.DataFrame()
for i in range(4):
    df[f'{i}-array'] = s.str.get_json_object(f'$.tup[{i}].array')

Now the contents of tup will be spread across multiple columns of df. For nested values you can invoke get_json_object again on any of these child columns.

    0-array 1-array 2-array 3-array
0   [1,2]    None   [3,4]    None
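
To see why indexing keeps positions intact, here is a small CPU-only sketch of the same lookup with the standard `json` module. `get_at` is a hypothetical helper written for illustration, not a cudf function; it returns `None` when the element or field is absent, mirroring the `None` cells above.

```python
import json

x = ('{"tup":[{"id":"1","array":[1,2]},{"id":"2"},'
     '{"id":"3","array":[3,4]},{"id":"4"}]}')

def get_at(json_str, index, field):
    """Hypothetical emulation of '$.tup[index].field': return the JSON
    text of the value, or None if the element or field is missing."""
    items = json.loads(json_str)["tup"]
    if index >= len(items) or field not in items[index]:
        return None
    # Compact separators so the text matches the style shown above.
    return json.dumps(items[index][field], separators=(",", ":"))

columns = {f"{i}-array": get_at(x, i, "array") for i in range(4)}
print(columns)
# {'0-array': '[1,2]', '1-array': None, '2-array': '[3,4]', '3-array': None}
```

Each index gets its own slot, so a missing field shows up as an explicit gap. The trade-off is that the number of elements to iterate over has to be known or bounded in advance.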

@github-actions

github-actions bot commented May 7, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@nvdbaranec
Contributor

Still relevant.

@MikeChenfu

Hi @GregoryKimball, df[f'{i}-array'] = s.str.get_json_object(f'$.tup[{i}].array') is an interesting solution. How does it perform with a large number of items? I checked the json_path.cu code and found a while loop that fetches each targeted item, so it looks like a sequential process. Please correct me if I am wrong.
https://github.com/rapidsai/cudf/blob/branch-22.06/cpp/src/strings/json/json_path.cu#L728-L837

@nvdbaranec
Contributor

The processing of the json strings is serial within each thread, but it is parallelized at the row level. So we have 1 thread per row in the input.

rapids-bot bot pushed a commit that referenced this issue Jun 21, 2022
Addresses: #10196 

Previously, `get_json_object()` ignored fields in a JsonPath expression that are missing in the json string. This PR adds the option to return these missing fields as null instead.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)
  - David Wendt (https://github.com/davidwendt)

URL: #10970
rapids-bot bot pushed a commit that referenced this issue Jul 11, 2022
This PR exposes `get_json_object_options` to the Python API. Addresses #10196

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Paul Taylor (https://github.com/trxcllnt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11180
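
The semantics of the new option can be sketched on the CPU as follows. This is an emulation with the standard `json` module, not the libcudf implementation; the `wildcard` helper is hypothetical. The flag name `missing_fields_as_nulls` follows what I believe the cudf Python keyword is on `Series.str.get_json_object`, but treat that name as an assumption and check the documentation for your cudf version.

```python
import json

doc = json.loads(
    '{"tup":[{"id":"1","array":[1,2]},{"id":"2"},'
    '{"id":"3","array":[3,4]},{"id":"4"}]}'
)

def wildcard(items, field, missing_fields_as_nulls=False):
    """Hypothetical emulation of '$.tup[*].<field>'.

    With the flag off, elements lacking the field are skipped (the old
    behaviour). With it on, they contribute an explicit None."""
    out = []
    for item in items:
        if field in item:
            out.append(item[field])
        elif missing_fields_as_nulls:
            out.append(None)
    return out

ids = wildcard(doc["tup"], "id", missing_fields_as_nulls=True)
arrays = wildcard(doc["tup"], "array", missing_fields_as_nulls=True)

# Serialized form, as a JSON string result would look.
print(json.dumps(arrays, separators=(",", ":")))  # [[1,2],null,[3,4],null]

# Both lists now have one entry per element, so positional pairing is safe.
rows = list(zip(ids, arrays))
print(rows)  # [('1', [1, 2]), ('2', None), ('3', [3, 4]), ('4', None)]
```

With nulls kept in place, the id and array results have equal length, which is what the explode-and-flatten use case described earlier in the thread needs.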