
[Frontend][Tensorflow] Add unique operator #7441

Merged: 16 commits, Feb 26, 2021

Conversation

ymwangg (Contributor) commented Feb 11, 2021

This PR adds the tensorflow unique operator as described in https://www.tensorflow.org/api_docs/python/tf/unique.

I'm not sure I follow the best practices. Comments and suggestions are welcome. @yongwww @kevinthesun @codeislife99

ymwangg changed the title from "[Relay][Tensorflow] Add unique operator" to "[Frontend][Tensorflow] Add unique operator" (Feb 11, 2021)
masahi self-assigned this (Feb 11, 2021)
masahi (Member) commented Feb 11, 2021

Thanks, I was planning to work on unique next week, happy to collaborate.

I can add a TIR unique impl for both CPU and GPU later. We can add the relay boilerplate, a temporary impl in C++, and tests in this PR.

ymwangg (Contributor, Author) commented Feb 11, 2021

> Thanks, I was planning to work on unique next week, happy to collaborate.
>
> I can add a TIR unique impl for both CPU and GPU later. We can add the relay boilerplate, a temporary impl in C++, and tests in this PR.

That would be great!

masahi (Member) commented Feb 11, 2021

@ymwangg For a general op like unique, we should follow the NumPy API rather than being too specific to TF. PyTorch unique should be supported by the same API. Framework-specific details should go into the frontend.

NumPy and PyTorch support a dim argument to do unique on multidimensional input, but I don't think it's a good idea. So restricting to 1D, at least for the first implementation, sounds good to me.

We can implement unique via sorting and cumsum (without a hash table). Implemented this way, the same code works on both CPU and GPU. That's what I'm planning to do, but if you feel brave, you can try it in this PR 🙂 It is likely not going to be faster than a hash-table-based implementation, since it requires multiple passes over the input, but it could be useful when the hash-based impl cannot be used for some reason.

ymwangg (Contributor, Author) commented Feb 17, 2021

@masahi Thanks for your comment.
Here's the algorithm that I came up with based on your suggestions.

# topi
def unique(data, data_sorted, data_argsorted):
    """Return unique values (in sorted order), counts, first-occurrence
    indices, inverse indices, and the number of unique elements."""
    output = [0] * len(data)
    count = [0] * len(data)
    first_occurrence = [len(data)] * len(data)
    inverse_indices = [0] * len(data)
    num_unique = 0
    # ir_builder
    for i in range(len(data)):
        if i == 0 or data_sorted[i] != data_sorted[i - 1]:
            num_unique += 1
            output[num_unique - 1] = data_sorted[i]
        # track the smallest original index in this group (argsort need not be stable)
        first_occurrence[num_unique - 1] = min(first_occurrence[num_unique - 1], data_argsorted[i])
        count[num_unique - 1] += 1
        inverse_indices[data_argsorted[i]] = num_unique - 1
    return output, count, first_occurrence, inverse_indices, num_unique

# tf front end
import numpy as np

def tf_unique(data):
    output, count, first_occurrence, inverse_indices, num_unique = unique(
        data, np.sort(data), np.argsort(data))
    sorted_occurrence_indices = np.argsort(first_occurrence)  # relay.argsort
    new_output = [output[sorted_occurrence_indices[i]] for i in range(num_unique)]  # relay.take
    index_converter = np.argsort(sorted_occurrence_indices)  # relay.argsort
    new_inverse_indices = [index_converter[i] for i in inverse_indices]  # relay.take
    return new_output, new_inverse_indices

It defines a topi function that is similar to np.unique but requires the sorted data and the argsort of the data. In the frontend, it needs to do argsort twice if we want to keep the unique elements in the order of their first occurrence.
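The double-argsort reordering can be checked in isolation: starting from the sorted-order outputs of the topi function, two argsorts put the unique values back in first-occurrence order and remap the inverse indices. A hand-worked NumPy sketch for data = [4, 1, 4, 2, 1] (inputs below are precomputed by hand, not produced by the PR's code):

```python
import numpy as np

# sorted-order results for data = [4, 1, 4, 2, 1], as the topi op would return
output = np.array([1, 2, 4])                  # unique values in sorted order
first_occurrence = np.array([1, 3, 0])        # first index of each value in data
inverse_indices = np.array([2, 0, 2, 1, 0])   # data[i] -> slot in `output`

perm = np.argsort(first_occurrence)              # relay.argsort
new_output = output[perm]                        # relay.take
index_converter = np.argsort(perm)               # relay.argsort
new_inverse = index_converter[inverse_indices]   # relay.take
```

This yields new_output = [4, 1, 2] and new_inverse = [0, 1, 0, 2, 1], matching tf.unique on the same input.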

Does this look good to you?

masahi (Member) commented Feb 17, 2021

It can be a lot simpler than that. Unique is basically sort + adjacent difference + exclusive scan. If you don't understand that statement, the following example should help. We have exclusive scan for CPU (cumsum op with exclusive=True), and GPU (see #7303).

If we implement unique this way, the same code runs on both CPU and GPU.

import numpy as np


def exclusive_scan(arr):
    return np.cumsum(arr) - arr


inp = np.random.randint(0, 10, size=(15,))
argsort_indices = np.argsort(inp)
sorted_inp = np.array([inp[i] for i in argsort_indices])
print("sorted input:", sorted_inp)

adj_diff = np.concatenate([[1],  np.diff(sorted_inp)])
print("adjacent difference:", adj_diff)

non_zero = adj_diff != 0
ex_scan = exclusive_scan(non_zero)
print("exclusive scan:", ex_scan)

unique = np.zeros(inp.shape[0], dtype=int)  # np.int is deprecated; use the builtin int

for i in range(inp.shape[0]):
    if non_zero[i] != 0:
        unique[ex_scan[i]] = inp[argsort_indices[i]]

print("num unique element:", ex_scan[-1] + 1)
print("unique:", unique)

Output:

sorted input: [0 0 0 4 5 5 6 6 6 6 6 7 8 8 9]
adjacent difference: [1 0 0 4 1 0 1 0 0 0 0 1 1 0 1]
exclusive scan: [0 1 1 1 2 3 3 4 4 4 4 4 5 6 6]
num unique element: 7
unique: [0 4 5 6 7 8 9 0 0 0 0 0 0 0 0]

codeislife99 (Contributor) commented

Hey @masahi, can your example be extended to provide counts as well? https://www.tensorflow.org/api_docs/python/tf/unique_with_counts

masahi (Member) commented Feb 17, 2021

Yes, it's possible but a bit complicated. PyTorch also has return_counts option https://pytorch.org/docs/stable/generated/torch.unique.html

I think for the first PR, not all options need to be implemented. We can follow up later.

I'm using PyTorch GPU impl as reference, see for example below on how they support count
https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68

codeislife99 (Contributor) commented

I see. I was interested in the counts option, and you might be as well, because SparseSegmentSqrtN and other variants like SparseLengthSum / EmbeddingBag can be written as a combination of take, expand, repeat, scatter_add, and unique_count. So I was wondering whether we could do this either in this PR or in a parallel PR (and later merge them).
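As a rough illustration of that kind of decomposition, here is a plain-NumPy sketch (shapes and names are mine, not from the PR) of an EmbeddingBag-style segment sum phrased as take followed by scatter_add:

```python
import numpy as np

emb = np.arange(12.0).reshape(4, 3)   # 4 embeddings of dim 3
indices = np.array([0, 2, 2, 1])      # embedding lookups
segments = np.array([0, 0, 1, 1])     # which output bag each lookup belongs to

rows = emb[indices]                   # take
out = np.zeros((2, 3))
np.add.at(out, segments, rows)        # scatter_add: sum rows into their bags
```

Here bag 0 sums embeddings 0 and 2, and bag 1 sums embeddings 2 and 1.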

ymwangg (Contributor, Author) commented Feb 17, 2021

@masahi Thanks for the explanation and it is very helpful!
It looks like the main thing we need to do is to implement a topi.adjacent_difference op similar to thrust::adjacent_difference. And in the frontend, we do something like:

sorted_data = relay.sort(data)
argsort_indices = relay.argsort(data)
adj_diff = relay.adjacent_difference(sorted_data, first_value=0, op="not_equal")
# inclusive scan of the 0/1 boundary mask gives each sorted element's unique-slot id
inc_scan = relay.cumsum(adj_diff, exclusive=False)
inverse_indices = relay.scatter(data, argsort_indices, inc_scan)
unique = relay.scatter(data, inc_scan, sorted_data)
num_unique = relay.take(inc_scan, [-1]) + 1
unique_sliced = relay.strided_slice(unique, [0], num_unique, slice_mode="size")
return unique_sliced, inverse_indices

I saw PyTorch uses thrust::unique to get the unique array. I think we can use relay.scatter to do the same thing.

To support counting, it looks like we need to implement a topi.unique_by_key op similar to thrust::unique_by_key. I think maybe we can do it in a different PR and focus on adjacent_difference in this PR.
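This composition can be sanity-checked with a plain-NumPy mock (the relay ops replaced by NumPy equivalents; one caveat: the scan over the difference mask must be inclusive with first_value=0, so that every sorted element gets its unique-slot id and the scatters write to the right places):

```python
import numpy as np

data = np.array([9, 3, 9, 1, 3, 1, 1])
argsort_indices = np.argsort(data, kind="stable")   # relay.argsort
sorted_data = data[argsort_indices]                 # relay.sort
# adjacent_difference(sorted_data, first_value=0, op="not_equal")
adj_diff = np.concatenate([[0], (sorted_data[1:] != sorted_data[:-1]).astype(int)])
inc_scan = np.cumsum(adj_diff)                      # relay.cumsum (inclusive)
# scatter slot ids back to original positions -> inverse indices
inverse_indices = np.empty_like(data)
inverse_indices[argsort_indices] = inc_scan         # relay.scatter
# scatter sorted values into their unique slots (equal values overwrite harmlessly)
unique = np.zeros_like(data)
unique[inc_scan] = sorted_data                      # relay.scatter
num_unique = inc_scan[-1] + 1
unique_sliced = unique[:num_unique]                 # relay.strided_slice
```

For this input the mock gives unique_sliced = [1, 3, 9] and inverse_indices = [2, 1, 2, 0, 1, 0, 0].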

masahi (Member) commented Feb 17, 2021

For your first implementation, a combination-based approach is OK. But unique is important enough that I think it deserves its own operator. Also, implementing it directly in ir_builder will likely be faster, and supporting other options will be easier if we write it in ir_builder.

So use ir_builder if you are comfortable with it; otherwise a combination of relay ops is fine. Performance and support for more options can be done later (by me).

Don't worry about unique_by_key. Last time I checked the PyTorch implementation, I concluded that we can do everything PyTorch does via ir_builder.

masahi (Member) commented Feb 20, 2021

Looks good 👍 GPU is not supported, right?

Review threads on python/tvm/topi/unique.py (outdated, resolved)
masahi (Member) commented Feb 20, 2021

Can you also add a pytorch frontend? Not all options need to be supported. It's likely the same as the tf conversion.

ymwangg (Contributor, Author) commented Feb 20, 2021

@masahi Yeah, I only added the CPU version in this PR. I'm not very familiar with GPU IR yet, but I can do it later. If the overall structure looks good, I can add unique_with_counts, since their implementations are very similar.

I'll add the pytorch frontend in this PR.

masahi (Member) commented Feb 20, 2021

I can do the GPU version; it will likely require ir_builder. But if you want to do GPU as well, let me know, you can certainly do it. The idea is identical to the CPU version, just with different parallelization.

If unique_with_counts can be supported by adding another option to unique, that sounds good. We shouldn't add relay.unique_with_counts or topi.unique_with_counts.

ymwangg (Contributor, Author) commented Feb 20, 2021

@masahi I added the return_counts option for the topi.unique operator. I also added pytorch frontend. Interestingly, it looks like pytorch returns unique elements in random order when sorted=False.

I'll work on the GPU version of unique next week.

ymwangg (Contributor, Author) commented Feb 23, 2021

@masahi I added the GPU version and it's ready for review.

Review thread on python/tvm/topi/unique.py (outdated, resolved)
masahi (Member) commented Feb 23, 2021

@ymwangg @codeislife99 I found a neat trick PyTorch uses for count. https://github.com/pytorch/pytorch/blob/22a34bcf4e5eaa348f0117c414c3dd760ec64b13/aten/src/ATen/native/cuda/Unique.cu#L60-L68

Basically, after you get the exclusive scan, instead of copying from the original input, you copy from an index array [0, 1, 2, ...]. This gives you something like [0, 2, 5], and taking adjacent differences of it directly gives the counts. Does this make sense? It should be much faster than using atomics.
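A minimal NumPy sketch of the trick, reusing the boundary-mask / exclusive-scan setup from the earlier example (variable names here are illustrative, not from the PR):

```python
import numpy as np

sorted_inp = np.array([0, 0, 0, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8, 8, 9])
n = len(sorted_inp)
# 1 wherever a new unique value starts
is_new = np.concatenate([[1], (np.diff(sorted_inp) != 0).astype(int)])
ex_scan = np.cumsum(is_new) - is_new     # exclusive scan of the boundary mask
num_unique = ex_scan[-1] + 1

# the trick: at each boundary, copy the *position* i rather than the input value
starts = np.zeros(num_unique, dtype=int)
for i in range(n):
    if is_new[i]:
        starts[ex_scan[i]] = i

# adjacent difference of the start positions (with n appended) gives the counts
counts = np.diff(np.concatenate([starts, [n]]))
```

For this input, starts comes out as [0, 3, 4, 6, 11, 12, 14] and counts as [3, 1, 2, 5, 1, 2, 1], with no atomic additions needed.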

ymwangg (Contributor, Author) commented Feb 24, 2021

@masahi thanks. I'll try using arange and adjacent_difference to compute the counts rather than counting by adding.

masahi (Member) left a review comment

LGTM 👍

masahi merged commit 09b0c8e into apache:main, Feb 26, 2021
masahi (Member) commented Feb 26, 2021

Thanks @ymwangg @codeislife99, this is really great work!

ymwangg (Contributor, Author) commented Feb 26, 2021

@masahi thanks for making this such an interesting project!

Lokiiiiii pushed a commit to Lokiiiiii/tvm that referenced this pull request Mar 2, 2021
* Initial commit of the unique operator

Add unit tests for unique operator

* Add tensorflow unique op

* Refactor unique to use sort-based algorithm

* Change relay.unique test to run only on cpu

* Change topi.unique test to run only on cpu

* Change range to parallel for parallelizable loops

* Add return_counts option for relay.unique and topi.unique, add pytorch frontend

* Fix pylint

* Patch pytorch frontend

* Initial support of topi.cuda.unique

* Refactor to use ir_builder directly

* Modularize adjacent difference

* Refactor to simplify

* Fix typo

* Combine _unique and _unique_with_counts

* Reuse indices_ptr to remove arange_ptr

Co-authored-by: Yanming Wang <yanmwang@amazon.com>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2021
(same commit series as above)