Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fromjson() to support reading from stdin #667

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yaniv-aknin
Copy link

Trivial PR to support reading from stdin (source=None) in fromjson().

Two things you may want me to do before merging -

  1. I didn't add tests because there are no tests I could find for the use of source=None in any io module (e.g., also not in csv etc). Generally, perhaps I missed them but I didn't see tests that fork/exec the bin/petl executable at all...?
  2. Add source=None to other modules. I ran git grep 'def from' | grep source | grep -v source=None and found a few more that miss it, let me know if you'd like me to add it to any/all of them.
    Modules which appear to be missing source=None: avro, bcolz, pytables (maybe), xml

Checklist

Use this checklist to ensure the quality of pull requests that include new code and/or make changes to existing code.

  • Source Code guidelines:
    • Includes unit tests
    • New functions have docstrings with examples that can be run with doctest
    • New functions are included in API docs
    • Docstrings include notes for any changes to API or behavior
    • All changes are documented in docs/changes.rst
  • Versioning and history tracking guidelines:
    • Using atomic commits whenever possible
    • Commits are reversible whenever possible
    • There are no incomplete changes in the pull request
    • There is no accidental garbage added to the source code
  • Testing guidelines:
    • Tested locally using tox / pytest
    • Rebased to master branch and tested before sending the PR
    • Automated testing passes (see CI)
    • Unit test coverage has not decreased (see Coveralls)
  • State of these changes is:
    • Just a proof of concept
    • Work in progress / Further changes needed
    • Ready to review
    • Ready to merge

@coveralls
Copy link

coveralls commented Apr 25, 2024

Pull Request Test Coverage Report for Build 9050397434

Details

  • 11 of 11 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 91.105%

Totals Coverage Status
Change from base Build 8715866607: 0.03%
Covered Lines: 13366
Relevant Lines: 14671

💛 - Coveralls

@@ -16,7 +16,7 @@
from petl.util.base import data, Table, dicts as _dicts, iterpeek


def fromjson(source, *args, **kwargs):
def fromjson(source=None, *args, **kwargs):

Check warning

Code scanning / Prospector (reported by Codacy)

Keyword argument before variable positional arguments list in the definition of fromjson function (keyword-arg-before-vararg) Warning

Keyword argument before variable positional arguments list in the definition of fromjson function (keyword-arg-before-vararg)
Copy link
Member

@juarezr juarezr Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaniv-aknin

Code scanning / Prospector (reported by Codacy)

Keyword argument before variable positional arguments list in the definition of fromjson function (keyword-arg-before-vararg)

This wouldn't break existing code?

Maybe read_source_from_arg can be taught to read even when stdin is passed as argument as in:

    table1 = etl.fromjson(sys.stdin, header=["foo", "bar"])
    print(table1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but I don't think it should break existing code.

All these assertions pass -

>>> def f(s, *a, **kw):
...     return (s, a, kw)
... 
>>> def fn(s=None, *a, **kw):
...     return (s, a, kw)
... 
>>> assert f("hello") == fn("hello")
>>> assert f(None) == fn(None)
>>> assert f("hello", 1, 2) == fn("hello", 1, 2)
>>> assert f("hello", k=1) == fn("hello", k=1)
>>> assert f(s="hello") == fn(s="hello")
>>> assert f("hello", 1, 2, k=1, j=2) == fn("hello", 1, 2, k=1, j=2)

re. read_source_from_arg() - my interest is in the petl executable, i.e., I would love for syntax like this to work in shell: echo '{"a":2, "b":4}' | petl 'fromjson(lines=True)'.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that:

  1. Using petl as a command line tool is an interesting use case and we should give it some love by:
    a. Adding more examples in the documentation.
    b. Mention this usage pattern in the repository Readme/Frontage
  2. Currently fromjson() source argument is not consistent with other functions like:
    a. tojson() and tojsonarrays()
    b. fromcsv()
  3. We value not breaking existing code.
    a. But I haven't had the time to check if this is the case yet.
    b. If not it will be a good change for API consistency
    c. I would love to hear more opinions about this change.

petl/io/json.py Dismissed Show dismissed Hide dismissed
petl/io/json.py Dismissed Show dismissed Hide dismissed
@juarezr juarezr requested review from arturponinski and a team April 28, 2024 20:05
@juarezr
Copy link
Member

juarezr commented Apr 28, 2024

Adding to the investigation:

The fromjson() parameters args and kwargs are forwarded to the json library call:

def fromjson(source, *args, **kwargs):
    """..."""
   ...

...
dicts = json.load(f, *self.args, **self.kwargs)
...

@juarezr
Copy link
Member

juarezr commented Apr 28, 2024

But the json library calls used take only one positional argument followed by others keyword arguments:

 def json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    ...

def  json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
   ...

It looks like the args parameter is simply ignored by json.load and json.loads calls.

Now I'm wondering if the function syntax shouldn't be:

def fromjson(source=None, *, **kwargs):
    """..."""
   ...

 

@juarezr
Copy link
Member

juarezr commented May 4, 2024

@yaniv-aknin,

After looking at the source code, I've concluded that your proposal looks good.

Also, some considerations:

  • It seems like some functions are not ready to use stdin or even can't work with stdin at all. For example:
  • As fromjson receives, but doesn't use the args parameter, maybe it would be interesting to change the function syntax/signature as described above.
  • This would be a good case for adding some test cases to avoid future regressions as the test suite is the major guarantee of stability for petl. The test cases will ensure that petl executable won't stop working after new features are merged.

So, what are your plans for this PR?

@yaniv-aknin
Copy link
Author

Thanks @juarezr , this sounds good.

  1. I'd be happy to add source=None to more cases, where it'd work easily. Where it'd take a more substantial refactoring (or even just infeasible), I'd leave it unchanged.

  2. I'm mildly indifferent about changing the signature, but don't mind doing it of course. If (sources=None, *, **kwargs) looks good to you, great, I'll do that.

  3. I'm also happy to add some tests that fork/exec petl, and in particular will test the "pipe from stdin" case.
    I propose to -

    • Create a new test module petl/test/test_script.py
    • Have it find bin/petl in the package directory and fork/exec it using the subprocess module, testing results
    • Add a few simple test cases around that
    • Make the above pass in petl's supported environments (2.7, 3.6-3.12)

Does that sound reasonable and I can update the PR?

@juarezr
Copy link
Member

juarezr commented May 6, 2024

After checking it, I've found that the change in signature won't work:

>>> def fromjson(source=None, *, **kwargs):
...    print("source:", source, "kwargs:", kwargs) 
... 
  File "<stdin>", line 1
SyntaxError: named arguments must follow bare *

So we should keep the args argument as it is currently, and proceed.

Eventually, we could remove the self.args property and the forwarding to json.load functions as it is pointless in:

dicts = json.load(f, *self.args, **self.kwargs)

But it shouldn't be mandatory.

 - adds a simple test invoking the petl executable
 - installs the package in CI so the executable is available.
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
""", shell=True, check=True, capture_output=True)
assert result.stdout == b'foo\r\na\r\n'

def test_json_stdin():

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing function or method docstring Warning test

Missing function or method docstring
@@ -0,0 +1,22 @@
from __future__ import print_function, division, absolute_import

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing module docstring Warning test

Missing module docstring

import subprocess

def test_executable():

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing function or method docstring Warning test

Missing function or method docstring

import subprocess

def test_executable():

Check warning

Code scanning / Pylint (reported by Codacy)

Missing function docstring Warning test

Missing function docstring
@@ -0,0 +1,22 @@
from __future__ import print_function, division, absolute_import

Check warning

Code scanning / Pylint (reported by Codacy)

Missing module docstring Warning test

Missing module docstring
""", shell=True, check=True, capture_output=True)
assert result.stdout == b'foo\r\na\r\n'

def test_json_stdin():

Check warning

Code scanning / Pylint (reported by Codacy)

Missing function docstring Warning test

Missing function docstring
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
@@ -0,0 +1,22 @@
from __future__ import print_function, division, absolute_import

import subprocess

Check warning

Code scanning / Bandit (reported by Codacy)

Consider possible security implications associated with the subprocess module. Warning test

Consider possible security implications associated with the subprocess module.
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
petl/test/test_executable.py Fixed Show fixed Hide fixed
( echo '{"foo": "a", "bar": "b"}' ; echo '{"foo": "c", "bar": "d"}' ) |
petl 'fromjson(lines=True).tocsv()'
""", shell=True)
assert result == b'foo,bar\r\na,b\r\nc,d\r\n'

Check notice

Code scanning / Semgrep (reported by Codacy)

The application was found using assert in non-test code. Note test

The application was found using assert in non-test code.
(echo foo,bar ; echo a,b; echo c,d) |
petl 'fromcsv().cut("foo").head(1).tocsv()'
""", shell=True)
assert result == b'foo\r\na\r\n'

Check notice

Code scanning / Semgrep (reported by Codacy)

The application was found using assert in non-test code. Note test

The application was found using assert in non-test code.
echo '[{"foo": "a", "bar": "b"}]' |
petl 'fromjson().tocsv()'
""", shell=True)
assert result == b'foo,bar\r\na,b\r\n'

Check notice

Code scanning / Semgrep (reported by Codacy)

The application was found using assert in non-test code. Note test

The application was found using assert in non-test code.
assert result == b'foo\r\na\r\n'

def test_json_stdin():
result = subprocess.check_output("""

Check warning

Code scanning / Bandit (reported by Codacy)

Starting a process with a partial executable path Warning test

Starting a process with a partial executable path
import subprocess

def test_executable():
result = subprocess.check_output("""

Check warning

Code scanning / Bandit (reported by Codacy)

Starting a process with a partial executable path Warning test

Starting a process with a partial executable path
echo '[{"foo": "a", "bar": "b"}]' |
petl 'fromjson().tocsv()'
""", shell=True)
assert result == b'foo,bar\r\na,b\r\n'

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Warning test

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
result = subprocess.check_output("""
echo '[{"foo": "a", "bar": "b"}]' |
petl 'fromjson().tocsv()'
""", shell=True)

Check failure

Code scanning / Bandit (reported by Codacy)

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell Error test

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell
petl 'fromjson().tocsv()'
""", shell=True)
assert result == b'foo,bar\r\na,b\r\n'
result = subprocess.check_output("""

Check warning

Code scanning / Bandit (reported by Codacy)

Starting a process with a partial executable path Warning test

Starting a process with a partial executable path
(echo foo,bar ; echo a,b; echo c,d) |
petl 'fromcsv().cut("foo").head(1).tocsv()'
""", shell=True)
assert result == b'foo\r\na\r\n'

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Warning test

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
( echo '{"foo": "a", "bar": "b"}' ; echo '{"foo": "c", "bar": "d"}' ) |
petl 'fromjson(lines=True).tocsv()'
""", shell=True)
assert result == b'foo,bar\r\na,b\r\nc,d\r\n'

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Warning test

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
result = subprocess.check_output("""
(echo foo,bar ; echo a,b; echo c,d) |
petl 'fromcsv().cut("foo").head(1).tocsv()'
""", shell=True)

Check failure

Code scanning / Bandit (reported by Codacy)

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell Error test

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell
result = subprocess.check_output("""
( echo '{"foo": "a", "bar": "b"}' ; echo '{"foo": "c", "bar": "d"}' ) |
petl 'fromjson(lines=True).tocsv()'
""", shell=True)

Check failure

Code scanning / Bandit (reported by Codacy)

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell Error test

subprocess call with shell=True seems safe, but may be changed in the future, consider rewriting without shell
@yaniv-aknin
Copy link
Author

I've made some progress, but not all checks pass yet.

  • Python2.7 -- probably my fault, I need to look further
  • Windows tests -- not sure, if I can sort out a windows test environment I will do it
  • macOS tests -- seem broken unrelated to me
  • security test complaining about shell=True -- I disagree with the check, I am literally testing shell behaviour; I could do this differently, but this is short readable and (in test context) perfectly safe
  • No docstring - don't think they're needed for something like this...?

I'll try to look into this again next week, but admit I'm beginning to question how much time I can put into this; small fix, lots of overhead 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants