
fix: Replace multiprocessing tokenization with batched fast tokenization #3089

Merged
merged 4 commits into deepset-ai:main from fast_token on Aug 31, 2022
Conversation

@vblagoje (Member) commented Aug 23, 2022

Related Issues

Proposed Changes:

  • Remove the multiprocessing code in data_silo.py and replace it with large-batch fast tokenization
  • Remove all deprecated tokenization method invocations (batch_encode_plus, encode_plus)

How did you test it?

No test yet

## Todo
This PR is still mostly a draft; I need to keep the CI churning.

@vblagoje vblagoje requested a review from a team as a code owner August 23, 2022 14:12
@vblagoje vblagoje requested review from julian-risch and removed request for a team August 23, 2022 14:12
@tholor (Member) commented Aug 23, 2022

I remember assessing the full switch to fast tokenizers ~1-1.5 years ago. Back then the blocker was that not all popular model architectures (i.e., their tokenizers) were supported by fast tokenizers. Could you please verify that this is now the case for the most common ones we see being used by our community (telemetry might help here)? roberta, electra, minilm, t5 are the first ones that pop into my mind, but there are many more...

@vblagoje (Member Author):

Adding it to my todo list for tomorrow.

@vblagoje (Member Author) commented Aug 24, 2022

The most up-to-date list is available in the transformers docs.
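To double-check the specific architectures mentioned above, one can also load them with AutoTokenizer and inspect the is_fast flag; a minimal sketch (the checkpoints below are only illustrative examples):

```python
from transformers import AutoTokenizer

# Illustrative checkpoints for some of the architectures mentioned above
for model_name in ["roberta-base", "google/electra-small-discriminator", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # is_fast is True when a Rust-backed (tokenizers) implementation was loaded
    print(model_name, "fast tokenizer:", tokenizer.is_fast)
```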

batch_size = self.max_multiprocessing_chunksize
for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    indices = [i for i in range(len(processing_batch))]  # indices is a remnant from multiprocessing era
Review comment (Contributor):

I think you can also use

indices = list(range(len(processing_batch)))

which is a bit faster than the list comprehension for this use case.
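A quick micro-benchmark of the two forms looks roughly like this (a rough sketch, not part of the PR; 512 just mirrors the processing batch size):

```python
import timeit

n = 512  # mirrors the processing batch size

t_comprehension = timeit.timeit(lambda: [i for i in range(n)], number=100_000)
t_list_range = timeit.timeit(lambda: list(range(n)), number=100_000)

print(f"comprehension: {t_comprehension:.3f}s  list(range(...)): {t_list_range:.3f}s")
```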

@julian-risch julian-risch changed the title Replace multiprocessing tokenization with batched fast tokenization fix: Replace multiprocessing tokenization with batched fast tokenization Aug 25, 2022
@vblagoje (Member Author):

@sjrl I updated the branch a bit more and completed some rudimentary performance measurements. I found the old tokenizer batch_encode_plus method to be twice as slow as the default call on the tokenizer. We can remove batch_encode_plus in a separate PR, but I added it here to ensure CI is green. As we are not only doing tokenization but also some "basketization" of the data, the old multiprocessing approach is still faster (especially because we already used fast tokenizers), but I don't think the slowdown is as dramatic as we feared. I'll do more measurements on larger bodies of text. So far it looks good.
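The comparison I ran was essentially of this shape (a simplified sketch, not the exact benchmark script; the checkpoint and texts are placeholders):

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint
texts = ["Haystack replaces multiprocessing tokenization with batched fast tokenization."] * 10_000

start = time.perf_counter()
tokenizer(texts, truncation=True, max_length=256)  # default call on the tokenizer
t_call = time.perf_counter() - start

start = time.perf_counter()
tokenizer.batch_encode_plus(texts, truncation=True, max_length=256)  # deprecated method
t_bep = time.perf_counter() - start

print(f"__call__: {t_call:.2f}s  batch_encode_plus: {t_bep:.2f}s")
```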

@vblagoje (Member Author):

As @sjrl is currently away, would you please take a look at this one, @julian-risch?

@sjrl (Contributor) commented Aug 30, 2022

@vblagoje I'm back today, but having @julian-risch's eyes on this as well would be great! It looks good to me. Do you have some final timing comparisons of the preprocessing before and after the changes?

@julian-risch (Member) left a comment:

Looks very good to me and nice to see a small optimization regarding array copying (np.asarray) as well. 👍 There is one # TODO remove indices comment that needs further explanation though. Do you plan to address it before merging?

@@ -41,7 +37,7 @@ def __init__(
     eval_batch_size: Optional[int] = None,
     distributed: bool = False,
     automatic_loading: bool = True,
-    max_multiprocessing_chunksize: int = 2000,
+    max_multiprocessing_chunksize: int = 512,
Review comment (Member):

Having a power of two makes sense 👍 Any intuition on why 512? 2048 would have been closer to the previous number.

Review comment (Member):

Because this parameter corresponds to the batch size now and needs to work with a single process, right?

Review comment (Member):

In a future release we should rename this parameter to batch_size as there is no multiprocessing anymore. A separate PR could add a deprecation warning and a new batch_size parameter. We could then support both parameter names for some time until we remove max_multiprocessing_chunksize completely.
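Something along these lines could work (a sketch only; DataSiloLike is a hypothetical stand-in, not the real DataSilo):

```python
import warnings
from typing import Optional


class DataSiloLike:  # hypothetical stand-in for the real DataSilo
    def __init__(
        self,
        batch_size: Optional[int] = None,
        max_multiprocessing_chunksize: Optional[int] = None,
    ):
        if max_multiprocessing_chunksize is not None:
            warnings.warn(
                "max_multiprocessing_chunksize is deprecated, use batch_size instead.",
                DeprecationWarning,
            )
            # Honor the old name only if the new one was not given explicitly
            if batch_size is None:
                batch_size = max_multiprocessing_chunksize
        self.batch_size = batch_size if batch_size is not None else 512
```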

Review comment (Member Author):

Good point. Here is how I chose 512: this value translates directly into the length of the list of str segments passed to the tokenizer in a single call. The HF documentation says that to get any benefit from multithreading in the underlying tokenizers we should pass large batches, so 512 seemed reasonable given that the default batching size for tokenization in HF datasets is 1000. We can increase it if we see a significant effect. In summary, we can certainly play with this value a bit more.
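For illustration, the batching now boils down to roughly this shape (a simplified sketch; the real code goes through the Processor in data_silo.py, and the checkpoint and texts below are placeholders):

```python
from tqdm import tqdm
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint
texts = ["some document text"] * 5_000  # placeholder corpus
batch_size = 512  # i.e. max_multiprocessing_chunksize

encoded_batches = []
for i in tqdm(range(0, len(texts), batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    batch = texts[i : i + batch_size]
    # One call per chunk on the fast tokenizer; the Rust backend parallelizes internally
    encoded_batches.append(tokenizer(batch, truncation=True, max_length=256))
```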

f"Got ya {num_cpus_used} parallel workers to convert {num_dicts} dictionaries "
f"to pytorch datasets (chunksize = {multiprocessing_chunk_size})..."
)
log_ascii_workers(num_cpus_used, logger)
Review comment (Member):

If we remove multiprocessing from inference as well in a separate PR, we could then remove the implementation of log_ascii_workers completely from haystack/modeling/utils.py.

for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
        dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
Review comment (Member):

There is a # TODO remove indices here. Could you please explain what this is about?

Review comment (Member Author):

Right, multiprocessing relied on it: "indices used during multiprocessing so that IDs assigned to our baskets are unique". It is still in the Processor signature, but we can remove it in the future!

    texts, return_offsets_mapping=True, return_special_tokens_mask=True, add_special_tokens=False, verbose=False
)

# Extract relevant data
tokenids_batch = tokenized_docs_batch["input_ids"]
offsets_batch = []
for o in tokenized_docs_batch["offset_mapping"]:
-    offsets_batch.append(np.array([x[0] for x in o]))
+    offsets_batch.append(np.asarray([x[0] for x in o], dtype="int16"))
Review comment (Member):

Looks like a nice small speed optimization. 👍

Review comment (Member):

There are some np.array calls in the document stores (faiss, milvus, inmemory, ...) that we could maybe also change to np.asarray, if the arrays don't need to be copied. If the speedup is significant here, we should optimize the document stores in a separate PR. It's about arrays of document ids and embeddings there.

Review comment (Member Author):

Unfortunately, I didn't detect a significant speedup from the use of np.asarray.
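That matches what np.asarray can actually save: it only skips the copy when the input is already an ndarray of the requested dtype, and the offsets here come from a plain Python list, so an allocation happens either way. A small illustration (not PR code):

```python
import numpy as np

embeddings = np.random.rand(1000, 768).astype(np.float32)

copied = np.array(embeddings)     # always makes a copy
aliased = np.asarray(embeddings)  # reuses the existing array, no copy

print(copied is embeddings)   # False
print(aliased is embeddings)  # True

offsets = [0, 5, 11, 18]  # a plain Python list, as in the tokenizer output
print(np.asarray(offsets, dtype="int16"))  # a new array is allocated regardless
```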

@vblagoje (Member Author):

> @vblagoje I'm back today, but having @julian-risch's eyes on this as well would be great! It looks good to me. Do you have some final timing comparisons of the preprocessing before and after the changes?

You are right @sjrl - we should have this. I'll prepare them soon.

@vblagoje (Member Author) commented Aug 30, 2022

OK @sjrl and @julian-risch, I roughly benchmarked tokenizing the SQuAD train set. The old multi-process approach is approximately twice as fast as the current single-process approach. I tried fine-tuning the number of processes (max_processes) for the old approach, but the best result I could get was 11 sec for tokenizing the train dataset. With the new single-process approach, the train dataset tokenization took 23 sec. Should we try tokenizing some bigger datasets? It would be cool if you could give it a quick spin as well @sjrl - just to make sure.
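The measurement was roughly of this shape, in case anyone wants to reproduce it (a sketch using the SQuAD train split from the datasets library and a placeholder checkpoint, not the exact script I used):

```python
import time

from datasets import load_dataset
from transformers import AutoTokenizer

contexts = load_dataset("squad", split="train")["context"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint

batch_size = 512  # the new max_multiprocessing_chunksize default
start = time.perf_counter()
for i in range(0, len(contexts), batch_size):
    tokenizer(contexts[i : i + batch_size], truncation=True, max_length=384)
print(f"single-process batched tokenization: {time.perf_counter() - start:.1f}s")
```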

@vblagoje vblagoje merged commit 66f3f42 into deepset-ai:main Aug 31, 2022
@vblagoje vblagoje mentioned this pull request Sep 27, 2022
@vblagoje vblagoje deleted the fast_token branch February 28, 2023 12:08