Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Added SkipMissing wrapper metric #261

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented Dec 29, 2023

A potential solution to #11

Pros:

  • Logic is isolated to a few dozen lines in 1 file
  • Independent of distance metric types
  • Seems like a pretty common approach elsewhere (ex scikit learn)
  • skipmissing already exists in base, so hopefully this approach is relatively intuitive

Cons:

  • Performance costs (ie: generates a mask and views on each call)
  • Sufficiently different from Base.skipmissing that we shouldn't export and could cause confusion?
  • No scaling factor to avoid underestimating distances
    • scipy includes a ntotal / ncomplete factor inside the sqrt for euclidean distances

Alternatives:

  1. Have missing data versions of specific metrics
  2. Relax type restrictions to allow callables which act like metrics? (ex NearestNeighbors.jl)

"""
Exclude any missing indices from being included the wrappped distance metric.
"""
struct SkipMissing{D<:PreMetric} <: PreMetric
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a benefit to have specific subtypes here? All the metrics I tested seemed to work.

# what the parameters mean.
# NTOE: We call disallowmissings to avoid downstream type promotion issues.
if dist.d isa UnionMetrics
params = parameters(dist.d)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, parameters is only defined for UnionMetrics?

isnothing(params) ? params : view(params, mask),
)
else
return dist.d(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a safer fallback?

@test_throws MethodError pairwise(dist, A, dims=2)
end

# Handle weights
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're just testing that skipmissing returns the same results as deleting indices missing in either vector.

@test D(a, b) == dist([1, 5], [6, 10])
end

@test colwise(D, A, B)[3:4] ≈ colwise(dist, disallowmissing(A[:, 3:4]), disallowmissing(B[:, 3:4]))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double check results are the same for non-missing columns.


M = pairwise(D, A, dims=2)
@test size(M) == (4, 4)
@test !any(ismissing, M)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could manually check the calculations here, but I figured a few smoke tests that it doesn't error or return missings was good enough?

@rofinn
Copy link
Member Author

rofinn commented Dec 29, 2023

Hmm, looks like some of the metrics throw different errors on different Julia versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant