
PySpark StandardScaler conversion fails #540

Closed
alvaro-budria opened this issue Apr 8, 2022 · 1 comment

alvaro-budria commented Apr 8, 2022

While trying to convert a PySpark StandardScaler to ONNX, I came across an error related to the data structure in which the scaling/centering values are stored.
The issue is that PySpark keeps these values in a pyspark.ml.linalg.DenseVector, which the make_attribute function in onnx/helper.py cannot handle.
A possible solution is to check the type inside helper.py and cast the input from DenseVector to a Python list, but that would add a dependency on pyspark.
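As an illustration, the pyspark dependency could be avoided by duck-typing the check instead: any vector-like object exposing toArray() can be flattened to a plain Python list before it reaches make_attribute. This is only a sketch of that idea; the helper name to_attribute_list and the FakeDenseVector stand-in are hypothetical, not part of onnx, onnxmltools, or pyspark:

```python
import numpy as np

def to_attribute_list(value):
    # Hypothetical helper: flatten vector-like objects (e.g. a
    # pyspark.ml.linalg.DenseVector) into a plain Python list of floats
    # without importing pyspark. Anything exposing toArray() is handled
    # via numpy; lists, tuples, and ndarrays pass through normalized.
    if hasattr(value, "toArray"):
        return np.asarray(value.toArray(), dtype=float).tolist()
    if isinstance(value, (list, tuple, np.ndarray)):
        return np.asarray(value, dtype=float).tolist()
    return value

class FakeDenseVector:
    # Stand-in for pyspark.ml.linalg.DenseVector in this sketch,
    # mimicking only its toArray() method.
    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def toArray(self):
        return self._values

scale = to_attribute_list(FakeDenseVector([1.4775, 3.005, 4.5375]))
print(scale)  # a plain Python list, acceptable to onnx.helper.make_attribute
```

Because the check is attribute-based rather than isinstance-based, helper.py (or the sparkml converter) would not need to import pyspark at all.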

Code to reproduce the error:

from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

from onnxconverter_common.data_types import FloatTensorType
from onnxmltools.convert import convert_sparkml
from onnx.defs import onnx_opset_version
from onnxconverter_common.onnx_ex import DEFAULT_OPSET_NUMBER
TARGET_OPSET = min(DEFAULT_OPSET_NUMBER, onnx_opset_version())

import numpy as np


spark = SparkSession.builder.master("local") \
                    .appName("onnxConversion") \
                    .getOrCreate()

# sample data
arr = np.array(
    [
        [1.,    2.02,   3.1],
        [1.01,  2.,     3.05],
        [2.,    3.99,   5.9],
        [1.9,   4.01,   6.1]
    ]
)
rdd = spark.sparkContext.parallelize(arr)
rdd = rdd.map(lambda x: [float(i) for i in x])
data = rdd.toDF(['0', '1', '2'])

# Assemble all feature variables with a VectorAssembler
assembler = VectorAssembler(inputCols=['0', '1', '2'], outputCol='features')
stages = [assembler]

scaler = StandardScaler(
            inputCol='features',
            outputCol='scaledFeatures',
            withStd=True,
            withMean=True,
        )
pipeline = Pipeline(stages=stages + [scaler])
model = pipeline.fit(data)

# convert to ONNX
initial_types = [ (col, FloatTensorType([None, 1])) for col in ['0', '1', '2'] ]
onx = convert_sparkml(
    model,
    'myPCA',
    initial_types,
    spark_session=spark,
    target_opset=TARGET_OPSET
)

And the error message:

22/04/08 13:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py:749: UserWarning: Some input names are not compliant with ONNX naming convention: ['0', '1', '2']
warnings.warn('Some input names are not compliant with ONNX naming convention: %s' % invalid_name)
Traceback (most recent call last):
  File "PCA_PR.py", line 54, in <module>
    target_opset=TARGET_OPSET
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/main.py", line 167, in convert_sparkml
    custom_conversion_functions, custom_shape_calculators, spark_session)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/convert.py", line 71, in convert
    onnx_model = convert_topology(topology, name, doc_string, target_opset, targeted_onnx)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py", line 776, in convert_topology
    get_converter(operator.type)(scope, operator, container)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/operator_converters/scaler.py", line 40, in convert_sparkml_scaler
    container.add_node(op_type, input_name, operator.output_full_names, op_domain='ai.onnx.ml', **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/container.py", line 171, in add_node
    node = helper.make_node(op_type, inputs, outputs, **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 118, in make_node
    for key, value in sorted(kwargs.items())
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 119, in <genexpr>
    if value is not None)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 447, in make_attribute
    'value "{}" is not valid attribute data type.'.format(value))
TypeError: value "[1.4775,3.005,4.5375]" is not valid attribute data type.

memoryz (Contributor) commented Jun 6, 2022

@xadupre this issue can be closed.

@xadupre xadupre closed this as completed Jun 6, 2022