Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent usage of TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR, char, etc #655

Closed
mrjefftang opened this issue Jul 14, 2015 · 16 comments
Closed
Labels

Comments

@mrjefftang
Copy link
Collaborator

The Windows section uses all sorts of different string types. It's not clear to me without delving through it thoroughly to see if these types are the correct usage.

This inconsistency is likely to lead to some unicode issues like that seen in #652 . There may be parts of the psutil API for Windows which may not function correctly when given a multibyte character string.

@giampaolo
Copy link
Owner

Yes, there surely are different problems with unicode handling.
I tried to do some research. This is what I determined.

Issues

net_if_stats()

fixed by 064e65e

Test: rename a NIC by using a non-ASCII character ("è") from the control panel.

Python 2 - issue: we should return unicode instead of str.

C:\>C:\python27\python -c "import psutil;  print(psutil.net_if_stats().keys())"
[..., 'Local Area Connection\xe8']

Python 3 - issue: exception.

C:\>C:\python34\python -c "import psutil;  print(psutil.net_if_stats().keys())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\python34\lib\site-packages\psutil-3.1.0-py3.4-win32.egg\psutil\__init__.py", line 1801, in net_if_stats
    return _psplatform.net_if_stats()
  File "C:\python34\lib\site-packages\psutil-3.1.0-py3.4-win32.egg\psutil\_pswindows.py", line 217, in net_if_stats
    ret = cext.net_if_stats()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 21: unexpected end of data

net_io_counters()

Test: rename a NIC by using a non-ASCII character ("è") from the control panel.

Cmdline: C:\pythonXX\python -c "import psutil; print(psutil.net_io_counters(pernic=True).keys())

Result: everything cool on python 2 and 3.

users()

Test: create a user with a non-ASCII character ("è").

Cmdline: C:\PythonXX\python -c "import psutil; print(psutil.users())"

Result: everything cool on python 2 and 3.

Process.username()

Test: create a user with a non-ASCII character ("è").

Cmdline: C:\python27\python -c "import psutil; print(psutil.Process().username())"

Result: everything cool on python 2 and 3.

Process.cwd()

Test: create a directory with a non-ASCII character ("è").

Python 2 - issue: should return unicode:

C:\Users\fooè>C:\Python27\python -c "import psutil; print(psutil.Process().cwd())"
C:\Users\fooè
C:\Users\fooè>C:\Python27\python -c "import psutil; print(repr(psutil.Process().
cwd()))"
'C:\\Users\\foo\xc3\xa8'

Python 3 - everything's cool.

C:\Users\fooè>C:\Python34\python -c "import psutil; print(psutil.Process().
cwd())"
C:\\Users\\fooè

Process.exe()

Test - create an executable with a non-ASCII character.

Cmdline - C:\Python27\python -c "import psutil; print(psutil.Process(3020).exe())" C:\Users\fooè.exe

Result - everything's cool on python 2 and 3.

Process.name()

Fixed in 0847640

Same as above, all cool on Python 2.7 and 3.4.
Fallback proc_name does not return unicode though.

Process.cmdline()

Python 2 - should return str instead of unicode.

Python 3 - everything's cool.

Process.open_files()

Python 2 - should return str instead of unicode.

C:\cygwin64\home\user\fooè>C:\python27\python -c "import psutil; f = open('fooè', 'w'); print(psutil.Process().open_files())"
[popenfile(path=u'C:\\cygwin64\\home\\user\\foo\xe8\\foo\xe8', fd=-1)]

Python 3 - everything's cool.

C:\cygwin64\home\user\fooè>C:\python34\python -c "import psutil; f = open('fooè', 'w'); print(psutil.Process().open_files())"
[popenfile(path='C:\\cygwin64\\home\\user\\fooè\\fooè', fd=-1)]

@giampaolo
Copy link
Owner

I am going to create separate tickets for each one of these issues.

@mrjefftang
Copy link
Collaborator Author

That sounds like there's 2 issues.

1 - Inconsistent string handling in C (the intial problem).
2 - Inconsistent return values of PyString and PyUnicode.

I think there should be a decision on the API to consistently return the same type for both Python2 and Python3. It'll probably be easiest and simplest to always return PyUnicode though there may be API breaking changes.

My main concern on this ticket was if we are using char string types internally in C (and using ascii API calls instead of unicode API calls) and returning PyUnicode since we would be losing information.

@giampaolo
Copy link
Owner

Yes, I agree that at least on Windows we should always return unicode. In general UNIX systems have more "discipline", meaning it's unlikely we'll bump into an exe, user or network interface with non-ASCII chars, so returning str on Python 2 might be OK. The same cannot be told for Windows.

As for UNIX we can think about returning unicode for Process.cwd() only (but this is a different issue).

2 - Inconsistent return values of PyString and PyUnicode.

Exactly. We should be consistent and try to use PyUnicode everywhere.
Also there's a side issue: there are certain Windows APIs which come with two versions: one supporting unicode, another one not supporting it. Can't remember which ones exactly though.

My main concern on this ticket was if we are using char string types internally in C

If you think the two things should be treated differently feel free to address the TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR issue first, then we'll get to using PyUnicode everywhere.
In the meantime I can write a set of tests similarl to the ones I pasted above and we can use that as a reference of what we should expect and consider the work "done" when all of them "pass".
The hard part will be figuring out how to rename NICs and create users (wmic lib should be able to do that).

@mrjefftang
Copy link
Collaborator Author

I think the errors / discrepancies we see are due to a combination of the two issues.

I don't know if they should be treated differently as fixing the error requires fixing both issues.

Once we have a test suite, we can go through each psutil API call and fix them up as appropriate.

@giampaolo
Copy link
Owner

I'm gonna create a branch later today (or tomorrow) with a couple of preliminary tests.
If you want I can give you write access so that we can work on the same branch.
Just do not push anything on master and always work on a branch.

@mrjefftang
Copy link
Collaborator Author

Sounds good.

Also I'm not sure if it makes sense for the behavior on Windows and Linux to be different.

Having an API which returns different types based on the OS seems wrong, especially if you are using psutil for a cross platform application.

At the same time it sucks for one OS to pay the overhead due to another OS misbehaving.

I think there just needs to be a decision made to always return PyString or PyUnicode. In my mind, it makes more sense to always return PyUnicode since it's the way forward in Python3. It should make things cleaner internally within psutil without all of the #ifdef checking between Python2 and Python3.

@mrjefftang
Copy link
Collaborator Author

@ddaeschler
Copy link
Contributor

If you have determined a way forward on this I'd love to help patch it up.

giampaolo added a commit that referenced this issue Aug 1, 2015
@giampaolo
Copy link
Owner

Sorry for the delay. It's been a crazy week.
So, I created a separate branch:
https://github.com/giampaolo/psutil/compare/655-windows-unicode
...and added some preliminary unit tests for all process methods returning strings except for username():
21da6e7
As expected, they break on Python 2. On Python 3 all is good except for process name().
I would say let's start working on these first (name, exe, cmdline, cwd, open_files) and have them all return unicode on python 2.
It may be possible that despite tests are correct on python 3 they may need some adjustments on python 2, but that can be figured out after psutil returns the correct type and data.

@mrjefftang I gave you write access.

@giampaolo
Copy link
Owner

In Python 3 this is the only thing (process related) which is plain broken and should be fixed independently from the str/unicode unification:

======================================================================
ERROR: test_proc_name (__main__.TestUnicode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test\_windows.py", line 483, in test_proc_name
    self.assertEqual(psutil._psplatform.cext.proc_name(subp.pid),
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 7: invalid continuation byte
----------------------------------------------------------------------

@mrjefftang
Copy link
Collaborator Author

Havent tested it yet but #670 should take care of the net_* unicode issues identified here.

@giampaolo
Copy link
Owner

I fixed net_if_stats as of 064e65e

@giampaolo
Copy link
Owner

Process.name fallback method now returns unicode instead of str as of 0847640

@giampaolo
Copy link
Owner

Thinking back about the approach we decided to use here I take back the proposal of returning unicode all the time on Python 2. It's a too disruptive change and may break a lot of code out there. I think a better approach is to return str in case of ASCII strings, else unicode.

@giampaolo
Copy link
Owner

I changed my mind once again: unexpectedly returning a different type (str or unicode) depending on the string we're dealing with is not a consistent API and may lead to unpleasant user code like if isinstance(s, 'unicode'): .... Also, that would affect Windows users only, which is even worse.
I decided that on Python 2 we will always return a string (str type) encoded by using sys.getfilesystemencoding() codec and I will make this clear in the doc.
It must be noted that this is the behavior we're already using on Linux / Python 2 and other platforms as well (since we're not returning unicode anywhere) so what I'm I'm trying to do here is to be consistent with the established behavior in a cross-platform fashion.
It must be noted that this issue does not exist on Python 3 since Python 3 fixed unicode handling once and for all.
The relevant PR which I've just merged is: #676

@mrjefftang : I realize I've improperly been using this issue as the original subject was Inconsistent usage of TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR. Feel free to open a new one in which we will address that problem.

giampaolo added a commit that referenced this issue Apr 28, 2017
…ken on Python 2 so disable tests which check for exact path match
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants