
PySpark StandardScaler conversion fails #540

Closed
alvaro-budria opened this issue Apr 8, 2022 · 1 comment

alvaro-budria commented Apr 8, 2022

While trying to convert a PySpark StandardScaler to ONNX, I came across an error related to the data structure in which the scaling/centering values are stored.
The issue is that PySpark keeps these values in a pyspark.ml.linalg.DenseVector, which the make_attribute function in onnx/helper.py cannot handle.
A possible solution is to check the type inside helper.py and cast the input from DenseVector to a Python list, but that would add a dependency on pyspark.
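As an illustration, the pyspark dependency could be avoided by duck-typing the check instead: any vector-like object exposing toArray() can be flattened to a plain Python list before it reaches make_attribute. This is only a sketch of that idea; the helper name to_attribute_list and the FakeDenseVector stand-in are hypothetical, not part of onnx, onnxmltools, or pyspark:

```python
import numpy as np

def to_attribute_list(value):
    # Hypothetical helper: flatten vector-like objects (e.g. a
    # pyspark.ml.linalg.DenseVector) into a plain Python list of floats
    # without importing pyspark. Anything exposing toArray() is handled
    # via numpy; lists, tuples, and ndarrays pass through normalized.
    if hasattr(value, "toArray"):
        return np.asarray(value.toArray(), dtype=float).tolist()
    if isinstance(value, (list, tuple, np.ndarray)):
        return np.asarray(value, dtype=float).tolist()
    return value

class FakeDenseVector:
    # Stand-in for pyspark.ml.linalg.DenseVector in this sketch,
    # mimicking only its toArray() method.
    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def toArray(self):
        return self._values

scale = to_attribute_list(FakeDenseVector([1.4775, 3.005, 4.5375]))
print(scale)  # a plain Python list, acceptable to onnx.helper.make_attribute
```

Because the check is attribute-based rather than isinstance-based, helper.py (or the sparkml converter) would not need to import pyspark at all.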

Code to reproduce the error:

from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

from onnxconverter_common.data_types import FloatTensorType
from onnxmltools.convert import convert_sparkml
from onnx.defs import onnx_opset_version
from onnxconverter_common.onnx_ex import DEFAULT_OPSET_NUMBER
TARGET_OPSET = min(DEFAULT_OPSET_NUMBER, onnx_opset_version())

import numpy as np


spark = SparkSession.builder.master("local") \
                    .appName("onnxConversion") \
                    .getOrCreate()

# sample data
arr = np.array(
    [
        [1.,    2.02,   3.1],
        [1.01,  2.,     3.05],
        [2.,    3.99,   5.9],
        [1.9,   4.01,   6.1]
    ]
)
rdd = spark.sparkContext.parallelize(arr)
rdd = rdd.map(lambda x: [float(i) for i in x])
data = rdd.toDF(['0', '1', '2'])

# Assemble all feature variables with a VectorAssembler
assembler = VectorAssembler(inputCols=['0', '1', '2'], outputCol='features')
stages = [assembler]

scaler = StandardScaler(
            inputCol='features',
            outputCol='scaledFeatures',
            withStd=True,
            withMean=True,
        )
pipeline = Pipeline(stages=stages + [scaler])
model = pipeline.fit(data)

# convert to ONNX
initial_types = [ (col, FloatTensorType([None, 1])) for col in ['0', '1', '2'] ]
onx = convert_sparkml(
    model,
    'myPCA',
    initial_types,
    spark_session=spark,
    target_opset=TARGET_OPSET
)

And the error message:

22/04/08 13:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py:749: UserWarning: Some input names are not compliant with ONNX naming convention: ['0', '1', '2']
warnings.warn('Some input names are not compliant with ONNX naming convention: %s' % invalid_name)
Traceback (most recent call last):
  File "PCA_PR.py", line 54, in <module>
    target_opset=TARGET_OPSET
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/main.py", line 167, in convert_sparkml
    custom_conversion_functions, custom_shape_calculators, spark_session)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/convert.py", line 71, in convert
    onnx_model = convert_topology(topology, name, doc_string, target_opset, targeted_onnx)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py", line 776, in convert_topology
    get_converter(operator.type)(scope, operator, container)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/operator_converters/scaler.py", line 40, in convert_sparkml_scaler
    container.add_node(op_type, input_name, operator.output_full_names, op_domain='ai.onnx.ml', **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/container.py", line 171, in add_node
    node = helper.make_node(op_type, inputs, outputs, **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 118, in make_node
    for key, value in sorted(kwargs.items())
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 119, in <genexpr>
    if value is not None)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 447, in make_attribute
    'value "{}" is not valid attribute data type.'.format(value))
TypeError: value "[1.4775,3.005,4.5375]" is not valid attribute data type.

memoryz (Contributor) commented Jun 6, 2022

@xadupre this issue can be closed.

@xadupre xadupre closed this as completed Jun 6, 2022