While trying to convert a PySpark StandardScaler to ONNX, I ran into an error caused by the data structure in which the scaling/centering values are stored.
The issue is that PySpark keeps these values in a pyspark.ml.linalg.DenseVector, which the function make_attribute inside onnx/helper.py cannot handle.
A possible solution is to check the type inside helper.py and cast the input from DenseVector to a Python list, but this would add a dependency on pyspark.
Code to reproduce the error:
```python
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from onnxconverter_common.data_types import FloatTensorType
from onnxconverter_common.onnx_ex import DEFAULT_OPSET_NUMBER
from onnxmltools.convert import convert_sparkml
from onnx.defs import onnx_opset_version
import numpy as np

TARGET_OPSET = min(DEFAULT_OPSET_NUMBER, onnx_opset_version())

spark = SparkSession.builder.master("local") \
    .appName("onnxConversion") \
    .getOrCreate()

# sample data
arr = np.array(
    [
        [1., 2.02, 3.1],
        [1.01, 2., 3.05],
        [2., 3.99, 5.9],
        [1.9, 4.01, 6.1],
    ]
)
rdd = spark.sparkContext.parallelize(arr)
rdd = rdd.map(lambda x: [float(i) for i in x])
data = rdd.toDF(['0', '1', '2'])

# Assemble all feature variables with a VectorAssembler
assembler = VectorAssembler(inputCols=['0', '1', '2'], outputCol='features')
scaler = StandardScaler(
    inputCol='features',
    outputCol='scaledFeatures',
    withStd=True,
    withMean=True,
)
pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(data)

# convert to ONNX
initial_types = [(col, FloatTensorType([None, 1])) for col in ['0', '1', '2']]
onx = convert_sparkml(
    model,
    'myPCA',
    initial_types,
    spark_session=spark,
    target_opset=TARGET_OPSET,
)
```
And the error message:
```
22/04/08 13:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py:749: UserWarning: Some input names are not compliant with ONNX naming convention: ['0', '1', '2']
  warnings.warn('Some input names are not compliant with ONNX naming convention: %s' % invalid_name)
Traceback (most recent call last):
  File "PCA_PR.py", line 54, in <module>
    target_opset=TARGET_OPSET
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/main.py", line 167, in convert_sparkml
    custom_conversion_functions, custom_shape_calculators, spark_session)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/convert.py", line 71, in convert
    onnx_model = convert_topology(topology, name, doc_string, target_opset, targeted_onnx)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/topology.py", line 776, in convert_topology
    get_converter(operator.type)(scope, operator, container)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxmltools/convert/sparkml/operator_converters/scaler.py", line 40, in convert_sparkml_scaler
    container.add_node(op_type, input_name, operator.output_full_names, op_domain='ai.onnx.ml', **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnxconverter_common/container.py", line 171, in add_node
    node = helper.make_node(op_type, inputs, outputs, **attrs)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 118, in make_node
    for key, value in sorted(kwargs.items())
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 119, in <genexpr>
    if value is not None)
  File "/Users/alvarobudria/anaconda3/envs/py37/lib/python3.7/site-packages/onnx/helper.py", line 447, in make_attribute
    'value "{}" is not valid attribute data type.'.format(value))
TypeError: value "[1.4775,3.005,4.5375]" is not valid attribute data type.
```
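For context, make_attribute infers the attribute type from the Python type of the value: floats, ints, strings, bytes (and flat lists of those) are recognized, and anything else falls through to the TypeError above. A simplified stand-in for that dispatch (not the actual onnx code; a DenseVector matches neither branch, so it hits the error path even though it prints like a list):

```python
def make_attribute_sketch(value):
    # Simplified stand-in for the type dispatch in onnx.helper.make_attribute.
    scalar_types = (float, int, str, bytes)
    if isinstance(value, scalar_types):
        return ("scalar", value)
    if isinstance(value, (list, tuple)) and all(isinstance(v, scalar_types) for v in value):
        return ("list", list(value))
    # A pyspark.ml.linalg.DenseVector matches neither branch and ends up here,
    # which is exactly the TypeError shown in the traceback.
    raise TypeError('value "{}" is not valid attribute data type.'.format(value))
```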