fit_class_random_forest: max_variables parameter could additionally be of type string #355

Closed
JeroenVerstraelen opened this issue Mar 23, 2022 · 4 comments · Fixed by #351

@JeroenVerstraelen

JeroenVerstraelen commented Mar 23, 2022

Describe the issue:
In Pyspark and Sklearn max_variables appears to coincide with these parameters:
Pyspark: featureSubsetStrategy
Sklearn: max_features

Spark requires one of these strings [“auto”, “all”, “sqrt”, “log2”, “onethird”], while sklearn requires one of these strings [“auto”, “sqrt”, “log2”], an int or a float.

An integer seems like the best general type to support all libraries, but it might be beneficial for the user if we also support string types. That way they don't have to calculate e.g. the sqrt or log2 themselves. Is this possible in openEO?

@m-mohr m-mohr added this to the 1.3.0 milestone Mar 23, 2022
@m-mohr m-mohr self-assigned this Mar 23, 2022
@m-mohr
Member

m-mohr commented Mar 23, 2022

So there's no way to implement the numerical behavior, as currently defined, in pyspark? I guess I need to do another crosswalk and check the commonly used options so that we can find a good subset that everyone can implement.

@JeroenVerstraelen
Author

As far as I can see there is no support for numerical values in pyspark. We plan to automatically convert the provided integer to one of the possible strings in pyspark. If there is no associated string then we will return an invalid parameter error.
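
A minimal sketch of the conversion described above, for illustration only. The function name `strategy_for_count` and the rounding rules (ceiling for `sqrt`, `log2` and `onethird`) are assumptions and are not taken from this thread. A real backend should check how its Spark version actually resolves each string.

```python
import math


def strategy_for_count(max_variables: int, num_features: int) -> str:
    """Map an integer max_variables to a featureSubsetStrategy string.

    Hypothetical helper: the rounding used for each strategy is an
    assumption made for this sketch.
    """
    if num_features < 1:
        raise ValueError("num_features must be at least 1")

    # Number of variables each string is assumed to stand for.
    candidates = {
        "all": num_features,
        "sqrt": math.ceil(math.sqrt(num_features)),
        "log2": max(1, math.ceil(math.log2(num_features))),
        "onethird": math.ceil(num_features / 3),
    }

    # Return the first string whose count matches the requested integer.
    for name, count in candidates.items():
        if count == max_variables:
            return name

    # No associated string: report an invalid parameter.
    raise ValueError(
        f"max_variables={max_variables} has no matching strategy "
        f"for {num_features} features"
    )
```

With 16 features, `strategy_for_count(6, 16)` resolves to `"onethird"`, while `strategy_for_count(5, 16)` raises because none of the strings yields 5 variables.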

@m-mohr
Member

m-mohr commented Mar 23, 2022

| Library | Parameter | Pre-defined string options | Float for fraction | Integer for number of vars |
| --- | --- | --- | --- | --- |
| Pyspark (Py) | featureSubsetStrategy | auto, all, sqrt, log2, onethird | No | No |
| Sklearn (Py) | max_features | auto, sqrt, log2 | Yes | Yes |
| ranger (R) | mtry | sqrt | No | Yes |
| randomForest (R) | mtry | sqrt (for classification), onethird (for regression) | Yes | No |
| RandomForests (Fortran) | mtry0 | sqrt | No | Yes |
| Vigra (C++) | features_per_node | all, sqrt, log | Yes | Yes |

In openEO we can't distinguish between integers and floats, so we can't allow both float and int separately. Given the table above, I'd propose adding string values (all, sqrt, log2, onethird) and ints for the number of vars, but NOT providing a default value.
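
To illustrate the proposal, here is a rough sketch of how a backend that only accepts integers might resolve such a parameter. The function name `resolve_max_variables`, the ceiling-based rounding, and the range check are assumptions made for this example and are not part of the openEO specification.

```python
import math


def resolve_max_variables(max_variables, num_features: int) -> int:
    """Resolve a string or integer max_variables to a variable count.

    Hypothetical helper: rounding and validation rules are assumptions.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_variables, bool):
        raise TypeError("max_variables must be a string or an integer")

    if isinstance(max_variables, int):
        if not 1 <= max_variables <= num_features:
            raise ValueError("max_variables must be between 1 and num_features")
        return max_variables

    if isinstance(max_variables, str):
        # The four string values proposed above.
        rules = {
            "all": lambda n: n,
            "sqrt": lambda n: math.ceil(math.sqrt(n)),
            "log2": lambda n: max(1, math.ceil(math.log2(n))),
            "onethird": lambda n: math.ceil(n / 3),
        }
        if max_variables not in rules:
            raise ValueError(f"unsupported max_variables string: {max_variables!r}")
        return rules[max_variables](num_features)

    # No default value: the caller always has to pass something explicit.
    raise TypeError("max_variables must be a string or an integer")
```

Because there is no default, a missing value ends up as a type error instead of a silently chosen strategy.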

@m-mohr
Member

m-mohr commented Mar 23, 2022

@JeroenVerstraelen Please review PR #351


2 participants