
fix: Replace multiprocessing tokenization with batched fast tokenization #3089

Merged
merged 4 commits into deepset-ai:main from fast_token on Aug 31, 2022
Conversation

@vblagoje (Member) commented Aug 23, 2022

Related Issues

Proposed Changes:

  • Remove the multiprocessing code in data_silo.py and replace it with large-batch fast tokenization
  • Remove all deprecated tokenization method invocations (batch_encode_plus, encode_plus)

How did you test it?

No test yet

## Todo
This PR is still mostly a draft; I need to keep the CI churning.

@vblagoje vblagoje requested a review from a team as a code owner August 23, 2022 14:12
@vblagoje vblagoje requested review from julian-risch and removed request for a team August 23, 2022 14:12
@tholor (Member) commented Aug 23, 2022

I remember assessing the full switch to fast tokenizers ~1-1.5 years ago. Back then the blocker was that not all popular model architectures (i.e., their tokenizers) were supported by fast tokenizers. Could you please verify that this is now the case for the most common ones we see being used by our community (telemetry might help here)? roberta, electra, minilm, t5 are the first ones that pop into my mind, but there are many more...

@vblagoje (Member Author):

Adding it to my todo list for tomorrow.

@vblagoje (Member Author) commented Aug 24, 2022

The most up-to-date list is available in the transformers docs.
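To double-check the specific architectures mentioned above, one can also load them with AutoTokenizer and inspect the is_fast flag; a minimal sketch (the checkpoints below are only illustrative examples):

```python
from transformers import AutoTokenizer

# Illustrative checkpoints for some of the architectures mentioned above
for model_name in ["roberta-base", "google/electra-small-discriminator", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # is_fast is True when a Rust-backed (tokenizers) implementation was loaded
    print(model_name, "fast tokenizer:", tokenizer.is_fast)
```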

batch_size = self.max_multiprocessing_chunksize
for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    indices = [i for i in range(len(processing_batch))]  # indices is a remnant from multiprocessing era
Review comment (Contributor):

I think you can also use

indices = list(range(len(processing_batch)))

which is a bit faster than the list comprehension for this use case.
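A quick micro-benchmark of the two forms looks roughly like this (a rough sketch, not part of the PR; 512 just mirrors the processing batch size):

```python
import timeit

n = 512  # mirrors the processing batch size

t_comprehension = timeit.timeit(lambda: [i for i in range(n)], number=100_000)
t_list_range = timeit.timeit(lambda: list(range(n)), number=100_000)

print(f"comprehension: {t_comprehension:.3f}s  list(range(...)): {t_list_range:.3f}s")
```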

@julian-risch julian-risch changed the title Replace multiprocessing tokenization with batched fast tokenization fix: Replace multiprocessing tokenization with batched fast tokenization Aug 25, 2022
@vblagoje (Member Author):

@sjrl I updated the branch a bit more and completed some rudimentary performance measurements. I found the old tokenizer batch_encode_plus method to be twice as slow as the default call on the tokenizer. We can remove batch_encode_plus in a separate PR, but I added it here to ensure CI is green. As we are not only doing tokenization but also some "basketization" of the data, the old multiprocessing approach is still faster (especially because we already used fast tokenizers), but I don't think the slowdown is as dramatic as we feared. I'll do more measurements on larger bodies of text. So far it looks good.
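The comparison I ran was essentially of this shape (a simplified sketch, not the exact benchmark script; the checkpoint and texts are placeholders):

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint
texts = ["Haystack replaces multiprocessing tokenization with batched fast tokenization."] * 10_000

start = time.perf_counter()
tokenizer(texts, truncation=True, max_length=256)  # default call on the tokenizer
t_call = time.perf_counter() - start

start = time.perf_counter()
tokenizer.batch_encode_plus(texts, truncation=True, max_length=256)  # deprecated method
t_bep = time.perf_counter() - start

print(f"__call__: {t_call:.2f}s  batch_encode_plus: {t_bep:.2f}s")
```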

@vblagoje (Member Author):

As @sjrl is currently away, would you please take a look at this one, @julian-risch?

@sjrl (Contributor) commented Aug 30, 2022

@vblagoje I'm back today, but having @julian-risch's eyes on this as well would be great! It looks good to me. Do you have some final timing comparisons of the preprocessing before and after the changes?

@julian-risch (Member) left a comment:

Looks very good to me and nice to see a small optimization regarding array copying (np.asarray) as well. 👍 There is one # TODO remove indices comment that needs further explanation though. Do you plan to address it before merging?

@@ -41,7 +37,7 @@ def __init__(
     eval_batch_size: Optional[int] = None,
     distributed: bool = False,
     automatic_loading: bool = True,
-    max_multiprocessing_chunksize: int = 2000,
+    max_multiprocessing_chunksize: int = 512,
Review comment (Member):

Having a power of two makes sense 👍 Any intuition on why 512? 2048 would have been closer to the previous number.

Review comment (Member):

Because this parameter corresponds to the batch size now and needs to work with a single process, right?

Review comment (Member):

In a future release we should rename this parameter to batch_size as there is no multiprocessing anymore. A separate PR could add a deprecation warning and a new batch_size parameter. We could then support both parameter names for some time until we remove max_multiprocessing_chunksize completely.
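Something along these lines could work (a sketch only; DataSiloLike is a hypothetical stand-in, not the real DataSilo):

```python
import warnings
from typing import Optional


class DataSiloLike:  # hypothetical stand-in for the real DataSilo
    def __init__(
        self,
        batch_size: Optional[int] = None,
        max_multiprocessing_chunksize: Optional[int] = None,
    ):
        if max_multiprocessing_chunksize is not None:
            warnings.warn(
                "max_multiprocessing_chunksize is deprecated, use batch_size instead.",
                DeprecationWarning,
            )
            # Honor the old name only if the new one was not given explicitly
            if batch_size is None:
                batch_size = max_multiprocessing_chunksize
        self.batch_size = batch_size if batch_size is not None else 512
```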

Review comment (Member Author):

Good point. Here is how I chose 512: this value translates directly into the length of the list of str segments passed to the tokenizer in a single call. The HF documentation says that to get any benefit from multithreading in the underlying tokenizers we should pass large batches, so 512 seemed reasonable given that the default batching size for tokenization in HF datasets is 1000. We can increase it if we see a significant effect. In summary, we can certainly play with this value a bit more.
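For illustration, the batching now boils down to roughly this shape (a simplified sketch; the real code goes through the Processor in data_silo.py, and the checkpoint and texts below are placeholders):

```python
from tqdm import tqdm
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint
texts = ["some document text"] * 5_000  # placeholder corpus
batch_size = 512  # i.e. max_multiprocessing_chunksize

encoded_batches = []
for i in tqdm(range(0, len(texts), batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    batch = texts[i : i + batch_size]
    # One call per chunk on the fast tokenizer; the Rust backend parallelizes internally
    encoded_batches.append(tokenizer(batch, truncation=True, max_length=256))
```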

f"Got ya {num_cpus_used} parallel workers to convert {num_dicts} dictionaries "
f"to pytorch datasets (chunksize = {multiprocessing_chunk_size})..."
)
log_ascii_workers(num_cpus_used, logger)
Review comment (Member):

If we remove multiprocessing from inference as well in a separate PR, we could then remove the implementation of log_ascii_workers completely from haystack/modeling/utils.py.

for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
        dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
Review comment (Member):

There is a # TODO remove indices here. Could you please explain what this is about?

Review comment (Member Author):

Right, multiprocessing relied on it: "indices used during multiprocessing so that IDs assigned to our baskets are unique". It is still in the Processor signature, but we can remove it in the future!

    texts, return_offsets_mapping=True, return_special_tokens_mask=True, add_special_tokens=False, verbose=False
)

# Extract relevant data
tokenids_batch = tokenized_docs_batch["input_ids"]
offsets_batch = []
for o in tokenized_docs_batch["offset_mapping"]:
-    offsets_batch.append(np.array([x[0] for x in o]))
+    offsets_batch.append(np.asarray([x[0] for x in o], dtype="int16"))
Review comment (Member):

Looks like a nice small speed optimization. 👍

Review comment (Member):

There are some np.array calls in the document stores (faiss, milvus, inmemory, ...) that we could maybe also change to np.asarray, if the arrays don't need to be copied. If the speedup is significant here, we should optimize the document stores in a separate PR. It's about arrays of document ids and embeddings there.

Review comment (Member Author):

Unfortunately, I didn't detect a significant speedup from the use of np.asarray.
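That matches what np.asarray can actually save: it only skips the copy when the input is already an ndarray of the requested dtype, and the offsets here come from a plain Python list, so an allocation happens either way. A small illustration (not PR code):

```python
import numpy as np

embeddings = np.random.rand(1000, 768).astype(np.float32)

copied = np.array(embeddings)     # always makes a copy
aliased = np.asarray(embeddings)  # reuses the existing array, no copy

print(copied is embeddings)   # False
print(aliased is embeddings)  # True

offsets = [0, 5, 11, 18]  # a plain Python list, as in the tokenizer output
print(np.asarray(offsets, dtype="int16"))  # a new array is allocated regardless
```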

@vblagoje (Member Author):

> @vblagoje I'm back today, but having @julian-risch's eyes on this as well would be great! It looks good to me. Do you have some final timing comparisons of the preprocessing before and after the changes?

You are right @sjrl - we should have this. I'll prepare them soon.

@vblagoje (Member Author) commented Aug 30, 2022

OK @sjrl and @julian-risch, I roughly benchmarked tokenizing the SQuAD train set. The old multi-process approach is approximately twice as fast as the current single-process approach. I tried fine-tuning the number of processes (max_processes) for the old approach, but the best result I could get was 11 sec for tokenizing the train dataset. With the new single-process approach, the train dataset tokenization took 23 sec. Should we try tokenizing some bigger datasets? It would be cool if you could give it a quick spin as well @sjrl - just to make sure.
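The measurement was roughly of this shape, in case anyone wants to reproduce it (a sketch using the SQuAD train split from the datasets library and a placeholder checkpoint, not the exact script I used):

```python
import time

from datasets import load_dataset
from transformers import AutoTokenizer

contexts = load_dataset("squad", split="train")["context"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)  # placeholder checkpoint

batch_size = 512  # the new max_multiprocessing_chunksize default
start = time.perf_counter()
for i in range(0, len(contexts), batch_size):
    tokenizer(contexts[i : i + batch_size], truncation=True, max_length=384)
print(f"single-process batched tokenization: {time.perf_counter() - start:.1f}s")
```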

@vblagoje vblagoje merged commit 66f3f42 into deepset-ai:main Aug 31, 2022
@vblagoje vblagoje mentioned this pull request Sep 27, 2022
@vblagoje vblagoje deleted the fast_token branch February 28, 2023 12:08