Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sanitize python env and subprocess i/o #16736

Merged
merged 12 commits into from
Jun 13, 2022
Merged

Conversation

pmp-p
Copy link
Contributor

@pmp-p pmp-p commented Apr 15, 2022

While working with python 3.11 wasm and pip modules, emcc is often called as subprocess from a tweaked python.

i encountered some troubles, emcc.py is inheriting some unwanted values from parent shell
to name a few _PYTHON_SYSCONFIGDATA_NAME and PYTHONHOME / PYTHONPATH

adding -E will prevent PYTHON* to interfere, and the unset _PYTHON_SYSCONFIGDATA_NAME is very unlikely to be used anywhere except in cpython modules cross compiling.

@sbc100
Copy link
Collaborator

sbc100 commented Jun 8, 2022

I'm a little worried about removing these from the environment since you can imagine some users wanting to set PYTHONPATH to point to a location where they have install some python modules (for example the ones in requirements-dev.txt).. but I suppose it might be good default and advanced users can bypass the launcher scripts maybe?

@pmp-p
Copy link
Contributor Author

pmp-p commented Jun 8, 2022

but I suppose it might be good default and advanced users can bypass the launcher scripts maybe?

indeed it absolutely does not cover all cases, it's just sane default to provide basic module compiling for CPython 3.11+

for building pygame on wasm with https://github.com/pygame-web/python-wasm-sdk , emcc em++ and cc have finally turned to something more complicated and not easily relocatable like :

#!/bin/bash

unset _PYTHON_SYSCONFIGDATA_NAME
unset PYTHONHOME
unset PYTHONPATH

SHARED=""
IS_SHARED=false

for arg do
    shift

    # that is for some very bad setup.py behaviour regarding cross compiling. should not be needed ..
    [ "\$arg" = "-I/usr/include" ] && continue
    [ "\$arg" = "-I/usr/include/SDL2" ] && continue
    [ "\$arg" = "-L/usr/lib64" ]	&& continue
    [ "\$arg" = "-L/usr/lib" ]   && continue

    if [ "\$arg" = "-shared" ]
    then
        IS_SHARED=true
        SHARED="$SHARED -sSIDE_MODULE"
    fi

    if echo "\$arg"|grep -q wasm32-emscripten.so\$
    then
        IS_SHARED=true
        SHARED="$SHARED -shared -sSIDE_MODULE"
    fi
    set -- "\$@" "\$arg"
done

if $IS_SHARED
then
    $SYS_PYTHON -E $0.py $SHARED -L/opt/python-wasm-sdk/devices/emsdk/usr/lib "$@"
else
    $SYS_PYTHON -E $0.py $SHARED -I/opt/python-wasm-sdk/devices/emsdk/usr/include "$@"
fi

where SYS_PYTHON is set to stable/system python ( should be f"python{PYMAJOR}.{PYMINOR-1}" when building alpha/beta and "new" python / python3 ( 3.11 here) are in the path while building modules )
via

for py in 10 9 8 7
do
    if command -v python3.${py} >/dev/null
    then
        export SYS_PYTHON=$(command -v python3.${py})
        break
    fi
done

SYS_PYTHON was added because EMSDK_PYTHON is cleared by emsdk_env.sh ( why ? )

also emsdk/upstream/emscripten/system/bin does not seem in the PATH so sdl2-config from host often gets grabbed instead of the emsdk sysroot one / target install prefix ( here /opt/python-wasm-sdk/devices/emsdk/usr )

@hoodmane
Copy link
Contributor

hoodmane commented Jun 9, 2022

I think this change would break Pyodide's package compilation. We could probably find a workaround though.

@hoodmane
Copy link
Contributor

hoodmane commented Jun 9, 2022

In particular, we need to explicitly control _PYTHON_SYSCONFIGDATA_NAME and other similar environment variables to set up the cross compilation environment. I think you should explicitly set up the environment you need and then emcc should pass it along. I think this is the way that environment variables normally work.

@sbc100
Copy link
Collaborator

sbc100 commented Jun 9, 2022

In particular, we need to explicitly control _PYTHON_SYSCONFIGDATA_NAME and other similar environment variables to set up the cross compilation environment. I think you should explicitly set up the environment you need and then emcc should pass it along. I think this is the way that environment variables normally work.

I don't think this change stops you from using _PYTHON_SYSCONFIGDATA_NAME does it? It just ignore it while emcc is running.. it doesn't effect processes outside of the compiler itself.

You don't need that environment variable to be set within the compiler/emcc do you?

emconfigure Show resolved Hide resolved
tools/shared.py Show resolved Hide resolved
@hoodmane
Copy link
Contributor

hoodmane commented Jun 9, 2022

You don't need that environment variable to be set within the compiler/emcc do you?

No, I guess not, we don't need them inside of the compiler wrapper. I guess this change actually makes a lot of sense because it defends emcc.py against the custom cross compilation Python environment set up outside of it.

em++ Outdated
@@ -11,6 +11,10 @@
# To make modifications to this file, edit `tools/run_python_compiler.sh` and
# then run `tools/create_entry_points.py`

unset _PYTHON_SYSCONFIGDATA_NAME
unset PYTHONHOME
unset PYTHONPATH
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need these two lines if you are using -E below?

Copy link
Contributor Author

@pmp-p pmp-p Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think -E does not clear the env it just disregard the values, so better safe than sorry later as these 2 are really critical if another python child process is launched without -E

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then in that case can you remove the -E? There is no point in adding it if we have another method of doing that same thing and we think that we cannot inject -E into all the right place. That would be -E as partial, and redundant solution here, right?

Maybe add a comment to that effect. Something like "Unset these environment variables explicitly, rather than using python -E"

Copy link
Contributor Author

@pmp-p pmp-p Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-E covers (a lot!) of other variables that should not be impacted for emsdk, the two unset are really usefull failsafe for subprocesses where I can't be sure -E will be set ( i did not read all emsdk source code).

tldr :

  1. -E ensure fully proper operation of emsdk base tool called in first place.

  2. the unset HOME/PATH ensure minimal proper operation of anything that could spawn as subprocesses of that emcc tool over those the shell script has no direct control.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what else -E covers? I guess the doc is quite open ended:

       -E     Ignore environment variables like PYTHONPATH and PYTHONHOME that modify the  behavior
              of the interpreter.

There are indeed a couple of places where emscripten spawns internal stuff using sys.interpreter. With this change, those would run in partially-clean environment rather then that fully cleaned -E one? It seems odd that -E would not also cover python sub-processes. It also seems odd that we would consider partially-cleaned OK in some circumstances but not OK in others.

One example of emcc launching a python subprocess:

emscripten/tools/building.py

Lines 1435 to 1440 in 4549d9d

sourcemap_cmd = [PYTHON, path_from_root('tools/wasm-sourcemap.py'),
wasm_file,
'--dwarfdump=' + LLVM_DWARFDUMP,
'-o', map_file,
'--basepath=' + base_path]
check_call(sourcemap_cmd)

Copy link
Contributor Author

@pmp-p pmp-p Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually all system tools on good distribution should already set -E but again ... can't be 100% sure ( it was worst in the past there are probably still remnants or obsolete systems that cannot upgrade for some reason ) so it's better emconfigure sanitize by default instead of expecting all children to do it on their own.

PYTHONPATH or PYTHONHOME should never influence anything under configure/make/cmake ( or any system tool ! ) ever and they should use a stable system python.

When the (auto)tools need to configure for/with another python they (should) provide a parameter to designate the targeted interpreter/tool for run or getting config data. This is where a given shell script wrapper should clear env then set PYTHON*, PYTHONPATH=/required/by/mytool and run the target correctly ( exactly like the "virtualenv" in python world does ) .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But PYTHONPATH=/required/by/mytool ./configure does work today for cases where configure run some python script as part of configuring.

This change would break that no? Same for PYTHONPATH=/required/by/mytool emmake.

It doesnt seem like it should be our job to sanitize the environment for all our subprocesses, especially in the case of emmake/emcmake/emconfigure where the subprocess are use controlled and the user has a reasonable expectation that the environment they set will be passed to the children of make/cmake/configure.

So... I suggest we start with just using -E.. since that is clearly correct, and revisit that if it turns out that doesn't solve your issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we should add -E to the windows .bat files for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But PYTHONPATH=/required/by/mytool ./configure does work today for cases where configure run some python script as part of configuring.

imho using PYTHON* that way is bad because configure is not a python script.

PYTHON* is for the end of the chain : the tool py script itself. not for influencing autotools that is thousands of lines that run thousands of tools .

revisit that if it turns out that doesn't solve your issue

well actually it's not specifically my issue, it's more the various hard to solve potential issues that will come of thousands of modules around.

That we may hope to cross compile to wasm without patching emsdk ( pyodide still do that and iirc even rebuild it fully) and without having thousands of PR to sanitize subprocesses or helper tools we have not yet heard of.

That in the case python-wasm-sdk come to interest more than just pygame ( less than 3minutes to build the wasm python module without docker, i'm sure it will get traction ! )

And maybe we should add -E to the windows .bat files for consistency?

i don't know anymore how system python runs on windows or mac systems (since 2.7), i think you should but i'm certainly not the one to ask if you can safely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho using PYTHON* that way is bad because configure is not a python script.

It might be in-advised but that doesn't mean we should break anyone who want to use it.

well actually it's not specifically my issue, it's more the various hard to solve potential issues that will come of thousands of modules around.

That we may hope to cross compile to wasm without patching emsdk ( pyodide still do that and iirc even rebuild it fully) and without having thousands of PR to sanitize subprocesses or helper tools we have not yet heard of.

I think that number of python subprocess that get run by emcc should be very limited, we should be able to keep track of them and ensure they all get run with -E.

i don't know anymore how system python runs on windows or mac systems (since 2.7), i think you should but i'm certainly not the one to ask if you can safely.

I think we should add -E everywhere (including on windows) for consistency. It could be confusing otherwise. We have a fair amount of CI testing to ensure it works, so you don't need to worry about locally testing it.

emranlib Outdated Show resolved Hide resolved
@pmp-p
Copy link
Contributor Author

pmp-p commented Jun 10, 2022

@hoodmane

. I think you should explicitly set up the environment you need and then emcc should pass it along. I think this is the way that environment variables normally work.

no. setup.py should naturally call the python in the PATH ( like in venv) that should be tweaked/wrapped for cross compiling and handling all funny stuff that happens with setup.py, while all emsdk tools should use EMSDK_PYTHON ( which most likely is to be the stable system python ( please without devel headers, setup.py craves for them ! ))

it's especially annoying to have

build module (eg pymunk) =>
    python(clang+wasm sysconfig) => setup.py
        setup.py => ./configure some C lib (eg libchipmunk)
            ./configure => emsdk and tools that fail by catching _PYTHON_SYSCONFIGDATA_NAME and PYTHON*
        setup.py fails because lib build failed        

or to have to use the replay system from pyodide that use too much heuristic and is not always 32 bits aware, and did i mention docker ?

@sbc100
Copy link
Collaborator

sbc100 commented Jun 10, 2022

Ok, it sounds reasonable to leave _PYTHON_SYSCONFIGDATA_NAME in there then.. and have -E take a care of the rest in a less subprocess-invasive way.

Perhaps add a comment such as: For some reason -E doesn't ignore _PYTHON_SYSCONFIGDATA_NAME like it does all the other python environment variables. (I wonder if its worth filing a an upstream bug about that?)

@pmp-p
Copy link
Contributor Author

pmp-p commented Jun 10, 2022

(I wonder if its worth filing a an upstream bug about that?)

i think @tiran could say more on that he loves cross compiling ;)

is check_call() is really the only other place -E is required ?

do unset PYTHONHOME and unset PYTHONPATH go away ?

@sbc100
Copy link
Collaborator

sbc100 commented Jun 10, 2022

(I wonder if its worth filing a an upstream bug about that?)

i think @tiran could say more on that he loves cross compiling ;)

is check_call() is really the only other place -E is required ?

I think it should be anywhere we use shared.PYTHON to run sub-processes. And I think we should probably avoid ever doing that and instead use the wrapper scripts instead.

do unset PYTHONHOME and unset PYTHONPATH go away ?

Yes please.

sbc100 added a commit that referenced this pull request Jun 10, 2022
This was nothing more than an alias for sys.executable.  While this
is used a fair amount in testing, emscripten itself should not be
forking python subprocesses itself, but instead using the launcher
scripts.

There is one place in emscripten where we do fork a python subprocess
directly (when we run tools/wasm-sourcemap.py) and I'm not sure its
worth adding a wrapper for that minor tool.  For that one case I've
added `-E` to match the changes we are planning to the launcher
scripts: #16736
sbc100 added a commit that referenced this pull request Jun 11, 2022
This was nothing more than an alias for sys.executable.  While this
is used a fair amount in testing, emscripten itself should not be
forking python subprocesses itself, but instead using the launcher
scripts.

There is one place in emscripten where we do fork a python subprocess
directly (when we run tools/wasm-sourcemap.py) and I'm not sure its
worth adding a wrapper for that minor tool.  For that one case I've
added `-E` to match the changes we are planning to the launcher
scripts: #16736
@sbc100
Copy link
Collaborator

sbc100 commented Jun 11, 2022

Do you mind if I re-title this PR something like Run python with -E to ignore incoming PYTHONPATH/etc environment variables when I land it?

@pmp-p
Copy link
Contributor Author

pmp-p commented Jun 11, 2022

please do as you wish

@sbc100 sbc100 merged commit e0c39d9 into emscripten-core:main Jun 13, 2022
sbc100 added a commit that referenced this pull request Jun 13, 2022
This was nothing more than an alias for sys.executable.  While this
is used a fair amount in testing, emscripten itself should not be
forking python subprocesses itself, but instead using the launcher
scripts.

There is one place in emscripten where we do fork a python subprocess
directly (when we run tools/wasm-sourcemap.py) and I'm not sure its
worth adding a wrapper for that minor tool.  For that one case I've
added `-E` to match the changes we are planning to the launcher
scripts: #16736
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants