[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813

Open
wants to merge 28 commits into
base: master
245c869
First commit on DRIVERS-2926-BSON-Binary-Vectors
caseyclements Aug 22, 2024
031cd8c
Turns dtype into enum. Adds handling of padding, __eq__. Removal of n…
caseyclements Aug 23, 2024
8d4e8a2
Added docstring and comments
caseyclements Aug 23, 2024
2df0d6b
Changed order of BinaryVector and Binary in bson._ENCODERS to get tes…
caseyclements Aug 23, 2024
315a115
Changed order of BinaryVector and Binary in bson._ENCODERS to get tes…
caseyclements Aug 23, 2024
d74314d
json_util dumps/loads of BinaryVector
caseyclements Aug 23, 2024
27f13c8
Added bson_corpus tests. Needs more, and review of json_util
caseyclements Aug 24, 2024
263f8c7
Removed BinaryVector as separate class. Instead, Binary includes as_v…
caseyclements Sep 12, 2024
f8bcdef
Stop setting _USD_C to False
caseyclements Sep 13, 2024
5435785
mypy fixes
caseyclements Sep 13, 2024
5c4d152
Removed stub vector.json for bson_corpus tests
caseyclements Sep 13, 2024
f86d040
More tests
caseyclements Sep 13, 2024
adcb945
Added description of subtype 9 to bson.Binary docstring
caseyclements Sep 14, 2024
7986cc5
Addressed comments in docstrings.
caseyclements Sep 16, 2024
26b8398
Eased string comparison of exception in xfail in test_bson
caseyclements Sep 16, 2024
28de28a
Updates to docstrings of BinaryVector and BinaryVectorDtype
caseyclements Sep 17, 2024
68235b8
Simplified expected exeption case. Will be refactored with yaml anyway..
caseyclements Sep 17, 2024
e2a1a3c
Added draft of test runner
caseyclements Sep 18, 2024
bf9758a
Added test cases: padding, and overflow
caseyclements Sep 19, 2024
e1590aa
Merge branch 'master' into DRIVERS-2926-BSON-Binary-Vectors
caseyclements Sep 19, 2024
c4c7af7
Cast Path to str
caseyclements Sep 19, 2024
de5a245
Simplified as_vector API
caseyclements Sep 20, 2024
43bcce4
Added test case: list of floats with dtype int8 raises exception
caseyclements Sep 20, 2024
41ee0bb
Set default padding to 0 in test runner
caseyclements Sep 20, 2024
9d52aeb
Updated test_bson for new as_vector API
caseyclements Sep 20, 2024
0d34464
Updated resync-specs.sh to include bson-binary-vector
caseyclements Sep 20, 2024
1d49656
Updated resync-specs.sh and test cases
caseyclements Sep 20, 2024
2af0ca4
Broke tests into 3 files by dtype
caseyclements Sep 20, 2024
87 changes: 87 additions & 0 deletions test/bson_binary_vector/vector-test-cases.json
@@ -0,0 +1,87 @@
{
"description": "Basic Tests of Binary Vectors, subtype 9",
"test_key": "vector",
"tests": [
Review comment:
Let's follow the rest of the BSON corpus and name this field "valid".

Reply (Contributor Author):
The format now follows one of the standards I saw.

{
"description": "Simple Vector INT8",
"valid": true,
Review comment:
Since the invalid tests are separated in a different array, you can probably remove this key.

Reply (Contributor Author):
This one's out of date.

"vector": [127, 7],
"dtype_hex": "0x03",
Review comment:
I'm not clear why the expected hex value of dtype needs to be specified separately from the expected BSON encoding of the vector.

Reply (Contributor Author):
We need the dtype in addition to the vector/numbers to encode the data.

Reply (Contributor Author):
We need to distinguish, for example, between int8, packed_bit, or even float32.
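
To illustrate the reply above, here is a minimal standard-library sketch of how the same numbers yield different payloads depending on dtype. `encode_vector_payload` is a hypothetical helper, not part of `bson`; the dtype bytes are taken from the `dtype_hex` fields in this file, and the payload layout (dtype byte, padding byte, then data) is inferred from the `canonical_bson` strings.

```python
import struct

# dtype bytes as listed in the dtype_hex fields of the test cases
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


def encode_vector_payload(values, dtype, padding=0):
    """Hypothetical helper: dtype byte + padding byte + little-endian data."""
    header = bytes([dtype, padding])
    if dtype == INT8:
        body = struct.pack(f"<{len(values)}b", *values)  # signed bytes
    elif dtype == PACKED_BIT:
        body = struct.pack(f"<{len(values)}B", *values)  # unsigned bytes of packed bits
    elif dtype == FLOAT32:
        body = struct.pack(f"<{len(values)}f", *values)  # 4-byte floats
    else:
        raise ValueError(f"unknown dtype: {dtype:#04x}")
    return header + body


int8_payload = encode_vector_payload([127, 7], INT8)
packed_payload = encode_vector_payload([127, 7], PACKED_BIT)
float_payload = encode_vector_payload([127.0, 7.0], FLOAT32)

print(int8_payload.hex().upper())    # 03007F07
print(packed_payload.hex().upper())  # 10007F07
print(float_payload.hex().upper())   # 27000000FE420000E040
```

The three hex strings match the tails of the corresponding `canonical_bson` values, which is why the vector alone is not enough to pick an encoding.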

"dtype_alias": "INT8",
"padding": 0,
Review comment:
I'm not clear why padding needs to be specified as an expectation. Isn't that just part of the encoding? Maybe if you add some tests where padding is non-zero it will be clearer.

Reply (Contributor Author):
It's not required. There are now tests that include non-zero padding (both invalid and valid ones).
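
For readers unsure what non-zero padding means for PACKED_BIT, a rough sketch follows. It assumes bits are packed most-significant-bit first and that `padding` counts the low-order bits of the final byte to ignore; that is my reading of the subtype 9 design and is not stated in this diff. `unpack_bits` is a hypothetical helper, not the `as_vector` API.

```python
from __future__ import annotations


def unpack_bits(data: bytes, padding: int = 0) -> list[int]:
    """Expand packed bytes into a list of 0/1 values, dropping `padding` trailing bits."""
    if not 0 <= padding <= 7:
        raise ValueError("padding must be between 0 and 7")
    # Most significant bit of each byte comes first (assumption, see lead-in).
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return bits[: len(bits) - padding]


full = unpack_bits(bytes([0x7F, 0x07]))                # 16 bits, nothing ignored
trimmed = unpack_bits(bytes([0x7F, 0x08]), padding=3)  # 13 bits, last 3 ignored
print(len(full), len(trimmed))
```

Under this reading, the padding value is needed to recover the logical length of the vector, since the byte count alone only fixes it to a multiple of 8.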

"canonical_bson": "1600000005766563746F7200040000000903007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}"
Review comment:
We don't need to test JSON encoding here since that is well-tested elsewhere in the bson corpus.
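
The point above can be checked by hand: the Extended JSON form carries the same payload bytes as the BSON form, only base64-encoded. The sketch below uses the standard library and the strings from the "Simple Vector INT8" case; the offsets assume a document holding a single binary element, as in these test cases.

```python
import base64
import binascii
import json
import struct

canonical_bson = "1600000005766563746F7200040000000903007F0700"
canonical_extjson = '{"vector": {"$binary": {"base64": "AwB/Bw==", "subType": "09"}}}'

raw = binascii.unhexlify(canonical_bson)

# Layout: int32 document length, element type 0x05 (binary), key as a C string,
# int32 payload length, subtype byte, payload, and a trailing 0x00 terminator.
(doc_len,) = struct.unpack_from("<i", raw, 0)
key_end = raw.index(b"\x00", 5)                 # end of the key "vector"
(payload_len,) = struct.unpack_from("<i", raw, key_end + 1)
subtype = raw[key_end + 5]
payload = raw[key_end + 6 : key_end + 6 + payload_len]

binary = json.loads(canonical_extjson)["vector"]["$binary"]
print(base64.b64encode(payload).decode() == binary["base64"])
print(f"{subtype:02x}" == binary["subType"])
```

Both comparisons hold for this case, which supports dropping the JSON assertions from the vector tests if the corpus already covers `$binary`.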

},
{
"description": "Simple Vector FLOAT32",
"valid": true,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}"
},
{
"description": "Simple Vector PACKED_BIT",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000910007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector INT8",
"valid": true,
"vector": [],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009030000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector FLOAT32",
"valid": true,
"vector": [],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009270000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector PACKED_BIT",
"valid": true,
"vector": [],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009100000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}"
},
{
"description": "Infinity Vector FLOAT32",
"valid": true,
"vector": ["-inf", 0.0, "inf"],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}"
}
],
"invalid": [
Review comment:
Will there be different categories of invalid vectors? Note that in the BSON corpus we have both "decodeErrors" and "parseErrors".

{
"description": "Overflow Vector INT8",
"valid": false,
"vector": [256],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
}
]
}
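
The "Overflow Vector INT8" case pairs with the `assertRaises(struct.error)` check in the test runner. As a hedged illustration, assuming the encoder packs INT8 values with `struct` format `b` (suggested by the expected exception type, not confirmed in this diff), the standard library alone reproduces the failure. `try_pack_int8` is a hypothetical helper.

```python
import struct


def try_pack_int8(values):
    """Return the packed bytes, or the struct.error raised while packing."""
    try:
        return struct.pack(f"<{len(values)}b", *values)
    except struct.error as exc:
        return exc


ok = try_pack_int8([127, 7])         # fits in signed bytes
overflow_error = try_pack_int8([256])  # outside -128..127
float_error = try_pack_int8([1.5])     # non-integer for an integer format

print(type(overflow_error).__name__, type(float_error).__name__)
```

Both failures surface as `struct.error`, so a single "invalid" array can cover them; whether separate categories are worthwhile depends on whether drivers are expected to distinguish them.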

113 changes: 113 additions & 0 deletions test/test_bson_binary_vector.py
@@ -0,0 +1,113 @@
# Copyright 2024-present MongoDB, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import annotations

import binascii
import codecs
import functools
import glob
import json
import os
import struct
import sys
from decimal import DecimalException
from pathlib import Path
from test import unittest

from bson import decode, encode, json_util
from bson.binary import Binary, BinaryVectorDtype

_TEST_PATH = Path(__file__).parent / "bson_binary_vector"


class TestBSONBinaryVector(unittest.TestCase):
    """Runs Binary Vector subtype tests.

    Follows the style of the BSON corpus specification tests.
    Tests are automatically generated on import
    from json files in _TEST_PATH via `create_tests`.
    The actual tests are defined in the inner function `run_test`
    of the test generator `create_test`."""


def create_test(case_spec):
    """Create standard test given specification in json.

    We use the naming convention expected (exp) and observed (obs)
    to differentiate what is in the json (expected or suffix _exp)
    from what is produced by the API (observed or suffix _obs)
    """
    test_key = case_spec.get("test_key")

    def run_test(self):
        for test_case in case_spec.get("tests", []):
            description = test_case["description"]
            vector_exp = test_case["vector"]
            dtype_hex_exp = test_case["dtype_hex"]
            dtype_alias_exp = test_case.get("dtype_alias")
            padding_exp = test_case.get("padding")
            canonical_bson_exp = test_case["canonical_bson"]
            canonical_extjson_exp = test_case["canonical_extjson"]
            # Convert dtype hex string into bytes
            dtype_exp = BinaryVectorDtype(int(dtype_hex_exp, 16).to_bytes(1, byteorder="little"))

            if test_case["valid"]:
                # Convert bson string to bytes
                cB_exp = binascii.unhexlify(canonical_bson_exp.encode("utf8"))
                decoded_doc = decode(cB_exp)
                binary_obs = decoded_doc[test_key]
                # Handle special float cases like '-inf'
                if dtype_exp in [BinaryVectorDtype.FLOAT32]:
                    vector_exp = [float(x) for x in vector_exp]

                # Test round-tripping canonical bson.
                self.assertEqual(encode(decoded_doc), cB_exp, description)

                # Test BSON to Binary Vector
                vector_obs = binary_obs.as_vector()
                self.assertEqual(vector_obs.dtype, dtype_exp)
                if dtype_alias_exp:
                    self.assertEqual(vector_obs.dtype, BinaryVectorDtype[dtype_alias_exp])
                self.assertEqual(vector_obs.data, vector_exp)
                self.assertEqual(vector_obs.padding, padding_exp)

                # Test Binary Vector to BSON
                vector_exp = Binary.from_vector(vector_exp, dtype_exp, padding_exp)
                cB_obs = binascii.hexlify(encode({test_key: vector_exp})).decode().upper()
                self.assertEqual(cB_obs, canonical_bson_exp)

                # Test JSON
                self.assertEqual(json_util.loads(canonical_extjson_exp), decoded_doc)
                self.assertEqual(json_util.dumps(decoded_doc), canonical_extjson_exp)

            else:
                with self.assertRaises(struct.error):
                    Binary.from_vector(vector_exp, dtype_exp)

    return run_test


def create_tests():
    for filename in _TEST_PATH.glob("*.json"):
        with codecs.open(filename, encoding="utf-8") as test_file:
            test_method = create_test(json.load(test_file))
        setattr(TestBSONBinaryVector, "test_" + filename.stem, test_method)


create_tests()


if __name__ == "__main__":
    unittest.main()
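
The generate-on-import pattern used by `create_test` and `create_tests` can be seen in isolation in the sketch below. It is a simplified stand-in, not the PR's runner: `GeneratedTests`, `SPECS`, and the arithmetic checks are hypothetical, and the specs are inline dicts rather than JSON files.

```python
import unittest


class GeneratedTests(unittest.TestCase):
    """Empty on purpose; test methods are attached below."""


def create_test(case_spec):
    """Return a test method that checks every case in one spec."""

    def run_test(self):
        for case in case_spec["tests"]:
            # The description doubles as the failure message.
            self.assertEqual(sum(case["values"]), case["total"], case["description"])

    return run_test


# Hypothetical specs standing in for the JSON files under _TEST_PATH.
SPECS = {
    "small": {"tests": [{"description": "two ints", "values": [1, 2], "total": 3}]},
    "empty": {"tests": [{"description": "no values", "values": [], "total": 0}]},
}

# One generated method per spec, named after the spec like "test_" + filename.stem.
for name, spec in SPECS.items():
    setattr(GeneratedTests, "test_" + name, create_test(spec))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(GeneratedTests)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, result.wasSuccessful())
```

One consequence of this design is that a failure is reported per file rather than per case, so the description passed to the assertions is what identifies the failing case.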