diff --git a/notebooks/document-chunking/Makefile b/notebooks/document-chunking/Makefile index bcd601f2..8704a788 100644 --- a/notebooks/document-chunking/Makefile +++ b/notebooks/document-chunking/Makefile @@ -1,5 +1,6 @@ NBTEST = ../../bin/nbtest NOTEBOOKS = \ + tokenization.ipynb \ with-index-pipelines.ipynb \ with-langchain-splitters.ipynb diff --git a/notebooks/document-chunking/tokenization.ipynb b/notebooks/document-chunking/tokenization.ipynb new file mode 100644 index 00000000..cab24aa1 --- /dev/null +++ b/notebooks/document-chunking/tokenization.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "s49gpkvZ7q53" + }, + "source": [ + "# Tokenization for Semantic Search (ELSER and E5)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/tokenization.ipynb)\n", + "\n", + "Elasticsearch offers [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a way that takes the text's meaning into account. Part of the semantic search process is breaking up texts into tokens (both for documents and for queries). Tokens are commonly thought of as words, but this is not completely accurate. Different semantic models use different concepts of tokens. Many treat punctuation separately and some break up compound words. For example ELSER (our English language model) uses the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.\n", + "\n", + "For users of Elasticsearch it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n", + "\n", + "Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "gaTFHLJC-Mgi" + }, + "source": [ + "# Install packages\n", + "\n", + "As stated above, ELSER uses [BERT](https://huggingface.co/blog/bert-101)'s tokenizer internally. Here we install the `transformers` package that gives us an interface to this tokenizer. (We install the `tabulate` packge to be able to print a nice table for comparison later on.)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "K9Q1p2C9-wce", + "outputId": "204d5aee-571e-4363-be6e-f87d058f2d29" + }, + "outputs": [], + "source": [ + "!pip install -qU tabulate transformers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we import everything we need. You can ignore a potential warning on models not being available because we only need the tokenizer here." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from urllib.request import urlopen\n", + "\n", + "from tabulate import tabulate\n", + "from transformers import AutoTokenizer, BertTokenizer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Define tokenizers\n", + "\n", + "Now we are ready to initialize the BERT tokenizer that ELSER uses and the E5 tokenizer for the multilingual semantic search. We also define a whitespace tokenizer in order to compare the naive version on creating tokens to the two tokenizers." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", + "e5_tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')\n", + "\n", + "def whitespace_tokenize(text):\n", + " return text.split()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load example data\n", + "\n", + "Download the movies example data that is also used in the other search examples." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json\"\n", + "response = urlopen(url)\n", + "movies = json.load(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compare token counts\n", + "\n", + "Compare the token counts of the different tokenization methods for the descriptions of the movies." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " whitespace BERT E5 text\n", + "------------ ------ ---- -----------------------------------------------------------------------------------\n", + " 16 21 30 An organized crime dynasty's aging patriarch transfers control of his clandestin...\n", + " 19 25 32 Two imprisoned men bond over a number of years, finding solace and eventual rede...\n", + " 20 25 34 Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven ...\n", + " 20 33 36 An insomniac office worker and a devil-may-care soapmaker form an underground fi...\n", + " 22 28 27 An undercover cop and a mole in the police attempt to identify each other while ...\n", + " 23 26 31 A computer hacker learns from mysterious rebels about the true nature of his rea...\n", + " 26 36 42 A thief who steals corporate secrets through the use of dream-sharing technology...\n", + " 27 36 42 The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of din...\n", + " 27 40 48 A young F.B.I. cadet must receive the help of an incarcerated and manipulative c...\n", + " 30 35 43 A sole survivor tells of the twisty events leading up to a horrific gun battle o...\n", + " 33 39 44 When the menace known as the Joker wreaks havoc and chaos on the people of Gotha...\n", + " 33 40 44 The story of Henry Hill and his life in the mob, covering his relationship with ...\n" + ] + } + ], + "source": [ + "def count_tokens(text):\n", + " whitespace_tokens = len(whitespace_tokenize(text))\n", + " bert_tokens = len(bert_tokenizer.encode(text))\n", + " e5_tokens = len(e5_tokenizer.encode(text))\n", + " return [whitespace_tokens, bert_tokens, e5_tokens, f\"{text[:80]}...\"]\n", + "\n", + "counts = [count_tokens(movie[\"plot\"]) for movie in movies]\n", + "\n", + "print(tabulate(sorted(counts), [\"whitespace\", \"BERT\", \"E5\", \"text\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that both the BERT the E5 tokenizers yields more tokens in every example, in some cases even twice as many. Why is that? Let's look at an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\n", + "\n", + "['[CLS]', 'the', 'lives', 'of', 'two', 'mob', 'hit', '##men', ',', 'a', 'boxer', ',', 'a', 'gangster', 'and', 'his', 'wife', ',', 'and', 'a', 'pair', 'of', 'diner', 'bandits', 'inter', '##t', '##wine', 'in', 'four', 'tales', 'of', 'violence', 'and', 'redemption', '.', '[SEP]']\n" + ] + } + ], + "source": [ + "example_movie = movies[0][\"plot\"]\n", + "print(example_movie)\n", + "print()\n", + "\n", + "movie_tokens = bert_tokenizer.encode(example_movie)\n", + "print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can observe:\n", + "- There are special tokens `[CLS]` and `[SEP]` to model the the beginning and end of the text. These two extra tokens will become relevant below.\n", + "- Punctuations are their own tokens.\n", + "- Compound words are split into two tokens, for example `hitmen` becomes `hit` and `##men`.\n", + "\n", + "Given this behavior, it is easy to see how longer texts yield lots of tokens and can quickly get beyond the 512 tokens limitation mentioned above." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Handling long texts\n", + "\n", + "We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.\n", + "\n", + "Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks of 512 tokens and feed the chunks to Elasticsearch separately. Actually, we need to use a limit of 510 to leave space for the two special tokens (`[CLS]` and `[SEP]`) that we saw.\n", + "\n", + "Furthermore, in practice we often see improved performance when using overlapping chunks. With ELSER, we recommend 50% token overlap (i.e. a 255 token stride)." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "SEMANTIC_SEARCH_TOKEN_LIMIT = 510 # 512 minus space for the 2 special tokens\n", + "\n", + "def chunk(tokens, chunk_size=SEMANTIC_SEARCH_TOKEN_LIMIT):\n", + " step_size = round(chunk_size * .5) # 50% token overlap between chunks is recommended for ELSER\n", + "\n", + " for i in range(0, len(tokens), step_size):\n", + " yield tokens[i:i+chunk_size]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we load a longer example text:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/book_summaries_1000_chunked.json\"\n", + "response = urlopen(url)\n", + "book_summaries = json.load(response)\n", + "\n", + "long_text = book_summaries[0][\"synopsis\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we tokenize the long text, exclude the special tokens at beginning and end, create chunks of size 510 tokens and map the tokens back to texts." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Token indices sequence length is longer than the specified maximum sequence length for this model (1242 > 512). Running this sequence through the model will result in indexing errors\n" + ] + }, + { + "data": { + "text/plain": [ + "['old major, the old boar on the manor farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song,\\'beasts of england \\'. when major dies, two young pigs, snowball and napoleon, assume command and turn his dream into a philosophy. the animals revolt and drive the drunken and irresponsible mr jones from the farm, renaming it \" animal farm \". they adopt seven commandments of animal - ism, the most important of which is, \" all animals are equal \". snowball attempts to teach the animals reading and writing ; food is plentiful, and the farm runs smoothly. the pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. napoleon takes the pups from the farm dogs and trains them privately. napoleon and snowball struggle for leadership. when snowball announces his plans to build a windmill, napoleon has his dogs chase snowball away and declares himself leader. napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs, who will run the farm. using a young pig named squealer as a \" mouthpiece \", napoleon claims credit for the windmill idea. the animals work harder with the promise of easier lives with the windmill. after a violent storm, the animals find the windmill annihilated. napoleon and squealer convince the animals that snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. once snowball becomes a scapegoat, napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. he and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising snowball and glorifying napoleon. squealer justifies every statement napoleon makes, even the pigs\\'alteration of the seven commandments of animalism to benefit themselves.\\'beasts of england\\'is replaced by an anthem glorifying napoleon, who appears to be adopting the lifestyle of a man. the animals remain convinced that they are better off than they were when under mr jones. squealer abuses the animals\\'poor memories and invents numbers to show their improvement. mr frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. though the animals win the battle, they do so at great',\n", + " 'windmill idea. the animals work harder with the promise of easier lives with the windmill. after a violent storm, the animals find the windmill annihilated. napoleon and squealer convince the animals that snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. once snowball becomes a scapegoat, napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. he and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising snowball and glorifying napoleon. squealer justifies every statement napoleon makes, even the pigs\\'alteration of the seven commandments of animalism to benefit themselves.\\'beasts of england\\'is replaced by an anthem glorifying napoleon, who appears to be adopting the lifestyle of a man. the animals remain convinced that they are better off than they were when under mr jones. squealer abuses the animals\\'poor memories and invents numbers to show their improvement. mr frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. though the animals win the battle, they do so at great cost, as many, including boxer the workhorse, are wounded. despite his injuries, boxer continues working harder and harder, until he collapses while working on the windmill. napoleon sends for a van to take boxer to the veterinary surgeon\\'s, explaining that better care can be given there. benjamin, the cynical donkey, who \" could read as well as any pig \", notices that the van belongs to a knacker, and attempts to mount a rescue ; but the animals\\'attempts are futile. squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. he recounts a tale of boxer\\'s death in the hands of the best medical care. years pass, and the pigs learn to walk upright, carry whips and wear clothes. the seven commandments are reduced to a single phrase : \" all animals are equal, but some animals are more equal than others \". napoleon holds a dinner party for the pigs and the humans of the area, who congratulate napoleon on having the hardest - working but least fed animals in the country. napoleon announces an alliance with the humans, against the labouring classes of both \" worlds \". he abolishes practices and traditions related to the revolution,',\n", + " 'cost, as many, including boxer the workhorse, are wounded. despite his injuries, boxer continues working harder and harder, until he collapses while working on the windmill. napoleon sends for a van to take boxer to the veterinary surgeon\\'s, explaining that better care can be given there. benjamin, the cynical donkey, who \" could read as well as any pig \", notices that the van belongs to a knacker, and attempts to mount a rescue ; but the animals\\'attempts are futile. squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. he recounts a tale of boxer\\'s death in the hands of the best medical care. years pass, and the pigs learn to walk upright, carry whips and wear clothes. the seven commandments are reduced to a single phrase : \" all animals are equal, but some animals are more equal than others \". napoleon holds a dinner party for the pigs and the humans of the area, who congratulate napoleon on having the hardest - working but least fed animals in the country. napoleon announces an alliance with the humans, against the labouring classes of both \" worlds \". he abolishes practices and traditions related to the revolution, and changes the name of the farm to \" the manor farm \". the animals, overhearing the conversation, notice that the faces of the pigs have begun changing. during a poker match, an argument breaks out between napoleon and mr pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. the pigs snowball, napoleon, and squealer adapt old major\\'s ideas into an actual philosophy, which they formally name animalism. soon after, napoleon and squealer indulge in the vices of humans ( drinking alcohol, sleeping in beds, trading ). squealer is employed to alter the seven commandments to account for this humanisation, an allusion to the soviet government\\'s revising of history in order to exercise control of the people\\'s beliefs about themselves and their society. the original commandments are : # whatever goes upon two legs is an enemy. # whatever goes upon four legs, or has wings, is a friend. # no animal shall wear clothes. # no animal shall sleep in a bed. # no animal shall drink alcohol. # no animal shall kill any other animal. # all animals are equal.',\n", + " 'and changes the name of the farm to \" the manor farm \". the animals, overhearing the conversation, notice that the faces of the pigs have begun changing. during a poker match, an argument breaks out between napoleon and mr pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. the pigs snowball, napoleon, and squealer adapt old major\\'s ideas into an actual philosophy, which they formally name animalism. soon after, napoleon and squealer indulge in the vices of humans ( drinking alcohol, sleeping in beds, trading ). squealer is employed to alter the seven commandments to account for this humanisation, an allusion to the soviet government\\'s revising of history in order to exercise control of the people\\'s beliefs about themselves and their society. the original commandments are : # whatever goes upon two legs is an enemy. # whatever goes upon four legs, or has wings, is a friend. # no animal shall wear clothes. # no animal shall sleep in a bed. # no animal shall drink alcohol. # no animal shall kill any other animal. # all animals are equal. later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.',\n", + " 'later, napoleon and his pigs secretly revise some commandments to clear them of accusations of law - breaking ( such as \" no animal shall drink alcohol \" having \" to excess \" appended to it and \" no animal shall sleep in a bed \" with \" with sheets \" added to it ). the changed commandments are as follows, with the changes bolded : * 4 no animal shall sleep in a bed with sheets. * 5 no animal shall drink alcohol to excess. * 6 no animal shall kill any other animal without cause. eventually these are replaced with the maxims, \" all animals are equal, but some animals are more equal than others \", and \" four legs good, two legs better! \" as the pigs become more human. this is an ironic twist to the original purpose of the seven commandments, which were supposed to keep order within animal farm by uniting the animals together against the humans, and prevent animals from following the humans\\'evil habits. through the revision of the commandments, orwell demonstrates how simply political dogma can be turned into malleable propaganda.']" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tokens = bert_tokenizer.encode(long_text)[1:-1] # exclude special tokens at the beginning and end\n", + "chunked = [\n", + " bert_tokenizer.decode(tokens_chunk)\n", + " for tokens_chunk in chunk(tokens)\n", + "]\n", + "chunked" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "And there we go. Now these chunks can be indexed together on the same document in a [nested field](https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html) and we can be sure the semantic search model considers our whole text." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}