Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

caseyclements
Copy link
Contributor

@caseyclements caseyclements commented Aug 24, 2024

Adds support for new subtype of Binary BSON subtype to be used for densely-packed 1d arrays (vectors).
Corresponding changes are in specifications repo PR #1658

bson/vector.py Outdated Show resolved Hide resolved
bson/vector.py Outdated Show resolved Hide resolved
bson/__init__.py Outdated Show resolved Hide resolved
bson/vector.py Outdated Show resolved Hide resolved
bson/binary.py Outdated Show resolved Hide resolved
bson/binary.py Outdated Show resolved Hide resolved
@caseyclements caseyclements marked this pull request as ready for review September 19, 2024 17:25
@caseyclements
Copy link
Contributor Author

Hi @shane. Do you still have changes that haven't been addressed?

"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000903007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to test JSON encoding here since that is well-tested elsewhere in the bson corpus

"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear why padding needs to be specified as an expectation. Isn't that just part of the encoding? Maybe if you add some tests where padding is non-zero it will be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not required. There are now tests that include non-zero padding (both invlaid and valid ones).

"description": "Simple Vector INT8",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x03",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear why expected hex value of dtype needs to be specified separately from the expected BSON encoding of the vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the dtype in addition to the vector/numbers to encode the data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to distinguish, for example, between int8, packed_bit, or even float32.

"tests": [
{
"description": "Simple Vector INT8",
"valid": true,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the invalid tests are separated in a different array, you can probably remove this key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's out of date.

{
"description": "Basic Tests of Binary Vectors, subtype 9",
"test_key": "vector",
"tests": [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow the rest of the BSON corpus and name this field "valid".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format now follows one of the standards I saw.

"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}"
}
],
"invalid": [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be different categories of invalid vectors? Note that in the BSON corpus we have both "decodeErrors" and "parseErrors".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants