Releases: deepset-ai/haystack
v2.6.1-rc1
Release Notes
v2.6.1-rc1
Bug Fixes
- Revert a change to PyPDFConverter that broke the deserialization of pre-2.6.0 YAMLs.
v2.6.0
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- Support for the legacy filter syntax and operators (e.g., `"$and"`, `"$or"`, `"$eq"`, `"$lt"`, etc.), which originated in Haystack v1, has been fully removed. Users must now use the new filter syntax. See the docs for more details.
🚀 New Features
- Added a new component, `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain (NDCG), an evaluation metric useful when there are multiple ground-truth relevant documents and the order in which they are retrieved is important.
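  For intuition, NDCG can be sketched in a few lines of plain Python. This is an illustrative binary-relevance version, not the `DocumentNDCGEvaluator` internals, which may handle graded relevance differently:

  ```python
  import math

  # Binary-relevance NDCG sketch: `retrieved` is an ordered list of document
  # ids, `relevant` is the set of ground-truth relevant ids.
  def ndcg(retrieved, relevant):
      # Discounted cumulative gain: each hit is discounted by log2(rank + 1),
      # with ranks starting at 1 (hence rank + 2 for a 0-based enumerate).
      dcg = sum(1.0 / math.log2(rank + 2)
                for rank, doc_id in enumerate(retrieved) if doc_id in relevant)
      # Ideal DCG: all relevant documents ranked first.
      idcg = sum(1.0 / math.log2(rank + 2)
                 for rank in range(min(len(relevant), len(retrieved))))
      return dcg / idcg if idcg else 0.0

  print(ndcg(["a", "b", "c"], {"a", "b"}))  # perfect ordering -> 1.0
  ```

  A retrieval that ranks a relevant document lower scores below 1.0, which is exactly why NDCG is useful when order matters.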
- Added a new `CSVToDocument` component. It loads the file as a bytes object and adds the decoded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Added support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialize `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function`.
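  A minimal sketch of such a function (the `---` delimiter is an illustrative choice, and the commented-out usage assumes `haystack-ai` is installed):

  ```python
  # A custom splitting function must take a string and return the split units.
  # Here we split on a "---" delimiter and drop empty parts.
  def split_on_delimiter(text: str) -> list[str]:
      return [part for part in text.split("---") if part.strip()]

  print(split_on_delimiter("intro---body---outro"))  # ['intro', 'body', 'outro']

  # Hypothetical usage with DocumentSplitter (requires haystack-ai):
  # from haystack.components.preprocessors import DocumentSplitter
  # splitter = DocumentSplitter(split_by="function", splitting_function=split_on_delimiter)
  ```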
- Added a new `JSONConverter` component to convert JSON files to documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts.

  ```python
  import json

  from haystack.components.converters import JSONConverter
  from haystack.dataclasses import ByteStream

  data = {
      "laureates": [
          {
              "firstname": "Enrico",
              "surname": "Fermi",
              "motivation": "for his demonstrations of the existence of new radioactive elements produced "
              "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
          },
          {
              "firstname": "Rita",
              "surname": "Levi-Montalcini",
              "motivation": "for their discoveries of growth factors",
          },
      ],
  }

  source = ByteStream.from_string(json.dumps(data))
  converter = JSONConverter(
      jq_schema=".laureates[]",
      content_key="motivation",
      extra_meta_fields=["firstname", "surname"],
  )
  results = converter.run(sources=[source])
  documents = results["documents"]

  print(documents[0].content)
  # 'for his demonstrations of the existence of new radioactive elements produced
  # by neutron irradiation, and for his related discovery of nuclear reactions
  # brought about by slow neutrons'

  print(documents[0].meta)
  # {'firstname': 'Enrico', 'surname': 'Fermi'}

  print(documents[1].content)
  # 'for their discoveries of growth factors'

  print(documents[1].meta)
  # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
  ```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. It allows fine-grained control over splitting documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updated `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that a `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
⚡️ Enhancement Notes
- Adapted how `ChatPromptBuilder` creates `ChatMessage`s. Messages are deep copied to ensure all meta fields are copied correctly.
- Exposed `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Added an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Added the ability to insert the current date inside a `PromptBuilder` template using the following syntax:
  - `{% now 'UTC' %}`: get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: display only the hour after adding two hours to the current date in the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: change the date format to AM/PM for the GMT-4 timezone.

  If no date format is provided, the default is `%Y-%m-%d %H:%M:%S`. Refer to the tz database for a list of timezone names.
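  Outside of templates, the default format corresponds to a plain `strftime` call; roughly what `{% now 'UTC' + 'hours=2' %}` would render (an illustration in plain Python, not the `PromptBuilder` implementation):

  ```python
  from datetime import datetime, timedelta, timezone

  # Default format used when no explicit format string is given.
  DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"

  # Roughly what {% now 'UTC' + 'hours=2' %} renders: UTC now, plus two hours.
  stamp = (datetime.now(timezone.utc) + timedelta(hours=2)).strftime(DEFAULT_FORMAT)
  print(stamp)  # e.g. '2024-09-30 14:05:11'
  ```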
- Added a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Added a new `GreedyVariadic` input type. It behaves like the `Variadic` input type in that it can be connected to multiple output sockets, but the Pipeline runs the component as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument of the `@component` decorator. If you had a component with a `Variadic` input type and `@component(is_greedy=True)`, change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Added a new Pipeline init argument, `max_runs_per_component`. It behaves identically to the existing `max_loops_allowed` argument but is more descriptive of its actual effect.
- Added a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- Added batching at inference time to the `TransformersSimilarityRanker` to help prevent out-of-memory errors when ranking large numbers of documents.
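  The batching idea itself is simple; a generic sketch (illustrative, not the ranker's internals) that scores items one fixed-size chunk at a time so only a single batch needs to be resident in memory:

  ```python
  # Yield fixed-size chunks of a list so a model only sees one batch at a time.
  def batched(items, batch_size):
      for start in range(0, len(items), batch_size):
          yield items[start : start + batch_size]

  # Hypothetical scoring loop: score_fn stands in for a model forward pass.
  def rank_in_batches(documents, score_fn, batch_size=16):
      scores = []
      for batch in batched(documents, batch_size):
          scores.extend(score_fn(batch))  # only `batch_size` docs processed at once
      return scores

  print(rank_in_batches(["a", "bb", "ccc"], lambda b: [len(d) for d in b], batch_size=2))
  # [1, 2, 3]
  ```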
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a component to itself when calling `Pipeline.connect()` is deprecated and will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Fixed the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Added constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not accept a variadic keyword argument.
- Prevented `set_output_types` from being called when the `output_types` decorator is used.
- Updated the `CHAT_WITH_WEBSITE` pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fixed a Pipeline visualization issue caused by changes in the new release of Mermaid.
- Fixed the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fixed the Pipeline not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
- Made the `from_dict` method of `PyPDFToDocument` more robust to cases where the converter is not provided in the dictionary.
v2.6.0-rc3
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- Support for the legacy filter syntax and operators (e.g., `"$and"`, `"$or"`, `"$eq"`, `"$lt"`, etc.), which originated in Haystack v1, has been fully removed. Users must now use the new filter syntax. See the docs for more details.
🚀 New Features
- Added a new component, `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain (NDCG), an evaluation metric useful when there are multiple ground-truth relevant documents and the order in which they are retrieved is important.
- Added a new `CSVToDocument` component. It loads the file as a bytes object and adds the decoded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Added support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialize `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function`.
- Added a new `JSONConverter` component to convert JSON files to documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts.

  ```python
  import json

  from haystack.components.converters import JSONConverter
  from haystack.dataclasses import ByteStream

  data = {
      "laureates": [
          {
              "firstname": "Enrico",
              "surname": "Fermi",
              "motivation": "for his demonstrations of the existence of new radioactive elements produced "
              "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
          },
          {
              "firstname": "Rita",
              "surname": "Levi-Montalcini",
              "motivation": "for their discoveries of growth factors",
          },
      ],
  }

  source = ByteStream.from_string(json.dumps(data))
  converter = JSONConverter(
      jq_schema=".laureates[]",
      content_key="motivation",
      extra_meta_fields=["firstname", "surname"],
  )
  results = converter.run(sources=[source])
  documents = results["documents"]

  print(documents[0].content)
  # 'for his demonstrations of the existence of new radioactive elements produced
  # by neutron irradiation, and for his related discovery of nuclear reactions
  # brought about by slow neutrons'

  print(documents[0].meta)
  # {'firstname': 'Enrico', 'surname': 'Fermi'}

  print(documents[1].content)
  # 'for their discoveries of growth factors'

  print(documents[1].meta)
  # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
  ```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. It allows fine-grained control over splitting documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updated `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that a `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
⚡️ Enhancement Notes
- Adapted how `ChatPromptBuilder` creates `ChatMessage`s. Messages are deep copied to ensure all meta fields are copied correctly.
- Exposed `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Added an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Added the ability to insert the current date inside a `PromptBuilder` template using the following syntax:
  - `{% now 'UTC' %}`: get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: display only the hour after adding two hours to the current date in the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: change the date format to AM/PM for the GMT-4 timezone.

  If no date format is provided, the default is `%Y-%m-%d %H:%M:%S`. Refer to the tz database for a list of timezone names.
- Added a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Added a new `GreedyVariadic` input type. It behaves like the `Variadic` input type in that it can be connected to multiple output sockets, but the Pipeline runs the component as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument of the `@component` decorator. If you had a component with a `Variadic` input type and `@component(is_greedy=True)`, change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Added a new Pipeline init argument, `max_runs_per_component`. It behaves identically to the existing `max_loops_allowed` argument but is more descriptive of its actual effect.
- Added a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- Added batching at inference time to the `TransformersSimilarityRanker` to help prevent out-of-memory errors when ranking large numbers of documents.
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a component to itself when calling `Pipeline.connect()` is deprecated and will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Fixed the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Added constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not accept a variadic keyword argument.
- Prevented `set_output_types` from being called when the `output_types` decorator is used.
- Updated the `CHAT_WITH_WEBSITE` pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fixed a Pipeline visualization issue caused by changes in the new release of Mermaid.
- Fixed the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fixed the Pipeline not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
- Made the `from_dict` method of `PyPDFToDocument` more robust to cases where the converter is not provided in the dictionary.
v2.6.0-rc2
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- The legacy filter syntax support has been completely removed. Users need to use the new filter syntax. See the docs for more details.
🚀 New Features
- Added a new `CSVToDocument` component. It loads the file as a bytes object and adds the decoded string as a new document that can be used for further processing by the `DocumentSplitter`.
- Added support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialize `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function`.
- Added a new `JSONConverter` component to convert JSON files to documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts.

  ```python
  import json

  from haystack.components.converters import JSONConverter
  from haystack.dataclasses import ByteStream

  data = {
      "laureates": [
          {
              "firstname": "Enrico",
              "surname": "Fermi",
              "motivation": "for his demonstrations of the existence of new radioactive elements produced "
              "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
          },
          {
              "firstname": "Rita",
              "surname": "Levi-Montalcini",
              "motivation": "for their discoveries of growth factors",
          },
      ],
  }

  source = ByteStream.from_string(json.dumps(data))
  converter = JSONConverter(
      jq_schema=".laureates[]",
      content_key="motivation",
      extra_meta_fields=["firstname", "surname"],
  )
  results = converter.run(sources=[source])
  documents = results["documents"]

  print(documents[0].content)
  # 'for his demonstrations of the existence of new radioactive elements produced
  # by neutron irradiation, and for his related discovery of nuclear reactions
  # brought about by slow neutrons'

  print(documents[0].meta)
  # {'firstname': 'Enrico', 'surname': 'Fermi'}

  print(documents[1].content)
  # 'for their discoveries of growth factors'

  print(documents[1].meta)
  # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
  ```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. It allows fine-grained control over splitting documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updated `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that a `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
⚡️ Enhancement Notes
- Adapted how `ChatPromptBuilder` creates `ChatMessage`s. Messages are deep copied to ensure all meta fields are copied correctly.
- Exposed `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Added an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Added the ability to insert the current date inside a `PromptBuilder` template using the following syntax:
  - `{% now 'UTC' %}`: get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: display only the hour after adding two hours to the current date in the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: change the date format to AM/PM for the GMT-4 timezone.

  If no date format is provided, the default is `%Y-%m-%d %H:%M:%S`. Refer to the tz database for a list of timezone names.
- Added a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Added a new `GreedyVariadic` input type. It behaves like the `Variadic` input type in that it can be connected to multiple output sockets, but the Pipeline runs the component as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument of the `@component` decorator. If you had a component with a `Variadic` input type and `@component(is_greedy=True)`, change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Added a new Pipeline init argument, `max_runs_per_component`. It behaves identically to the existing `max_loops_allowed` argument but is more descriptive of its actual effect.
- Added a new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- Added batching at inference time to the `TransformersSimilarityRanker` to help prevent out-of-memory errors when ranking large numbers of documents.
⚠️ Deprecation Notes
- The Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead.
- Connecting a component to itself when calling `Pipeline.connect()` is deprecated and will raise an error from version 2.7.0 onwards.
- The Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
🐛 Bug Fixes
- Added constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not accept a variadic keyword argument.
- Prevented `set_output_types` from being called when the `output_types` decorator is used.
- Updated the `CHAT_WITH_WEBSITE` pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fixed a Pipeline visualization issue caused by changes in the new release of Mermaid.
- Fixed the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fixed the Pipeline not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
- Made the `from_dict` method of `PyPDFToDocument` more robust to cases where the converter is not provided in the dictionary.
v2.5.1
Release Notes
⚡️ Enhancement Notes
- Added a `default_headers` init argument to `AzureOpenAIGenerator` and `AzureOpenAIChatGenerator`.
🐛 Bug Fixes
- Fixed the Pipeline visualization issue caused by changes in the new release of Mermaid.
- Fixed `Pipeline` not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
v2.5.1-rc2
Release Notes
⚡️ Enhancement Notes
- Added a `default_headers` init argument to `AzureOpenAIGenerator` and `AzureOpenAIChatGenerator`.
🐛 Bug Fixes
- Fixed the Pipeline visualization issue caused by changes in the new release of Mermaid.
- Fixed `Pipeline` not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
v2.5.1-rc1
Release Notes
⚡️ Enhancement Notes
- Added a `default_headers` init argument to `AzureOpenAIGenerator` and `AzureOpenAIChatGenerator`.
🐛 Bug Fixes
- Fixed `Pipeline` not running components with Variadic input when it received inputs from only a subset of their senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this caused an `AttributeError`.
v2.5.0
Release Notes
⬆️ Upgrade Notes
- Removed the `ChatMessage.to_openai_format` method. Use `haystack.components.generators.openai_utils._convert_message_to_openai_format` instead.
- Removed the unused `debug` parameter from the `Pipeline.run` method.
- Removed the deprecated `SentenceWindowRetrieval`. Use `SentenceWindowRetriever` instead.
🚀 New Features
- Added the `unsafe` argument to enable behavior that could lead to remote code execution in `ConditionalRouter` and `OutputAdapter`. By default, unsafe behavior is disabled, and users must explicitly set `unsafe=True` to enable it. When unsafe behavior is enabled, types such as `ChatMessage`, `Document`, and `Answer` can be used as output types. We recommend enabling unsafe behavior only when the Jinja template source is trusted. For more information, see the documentation for `ConditionalRouter` and `OutputAdapter`.
⚡️ Enhancement Notes
- Adapted how `ChatPromptBuilder` creates `ChatMessage`s. Messages are deep copied to ensure all meta fields are copied correctly.
- Added the `min_top_k` parameter to the `TopPSampler`. It sets the minimum number of documents to return when the top-p sampling algorithm selects fewer documents than desired; documents with the next highest scores are added to meet the minimum. This is useful for guaranteeing that a set number of documents always pass through while still letting the top-p algorithm decide, based on scores, whether more documents should be sent.
- Introduced a utility function to deserialize a generic Document Store from the `init_parameters` of a serialized component.
- Refactored `deserialize_document_store_in_init_parameters` to clarify that the function operates in place and does not return a value.
- The `SentenceWindowRetriever` now returns `context_documents` as well as `context_windows` for each `Document` in `retrieved_documents`. This allows you to get a list of documents from within the context window for each retrieved document.
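The `min_top_k` floor described above can be sketched in plain Python. This is illustrative only; the actual `TopPSampler` works on softmax-normalized similarity scores and differs in detail:

```python
# Select indices by top-p (nucleus) selection over already-normalized scores,
# then pad with the next-best indices to guarantee at least `min_top_k` results.
def top_p_select(scores, top_p, min_top_k):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in ranked:
        selected.append(i)
        cumulative += scores[i]
        if cumulative >= top_p:
            break
    # min_top_k floor: add next-highest-scoring documents until the minimum is met.
    while len(selected) < min_top_k and len(selected) < len(ranked):
        selected.append(ranked[len(selected)])
    return selected

print(top_p_select([0.9, 0.05, 0.03, 0.02], top_p=0.8, min_top_k=2))  # [0, 1]
```

Without the floor, a single dominant score would let only one document through; the floor pads the selection with the next-best document.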
⚠️ Deprecation Notes
- The default model for `OpenAIGenerator` and `OpenAIChatGenerator`, previously `gpt-3.5-turbo`, will be replaced by `gpt-4o-mini`.
🐛 Bug Fixes
- Fixed an issue where page breaks were not being extracted from DOCX files.
- Used a forward reference for the `Paragraph` class in the `DOCXToDocument` converter to prevent import errors.
- The metadata produced by the `DOCXToDocument` component is now JSON serializable. Previously, it contained `datetime` objects automatically extracted from DOCX files, which are not JSON serializable. These `datetime` objects are now converted to strings.
- Starting from `haystack-ai==2.4.0`, Haystack is compatible with `sentence-transformers>=3.0.0`; earlier versions of `sentence-transformers` are not supported. We have updated the test dependencies and LazyImport messages to reflect this change.
- For components that support multiple Document Stores, the specific `from_dict` class method is now prioritized for deserialization when available; otherwise, deserialization falls back to the generic `default_from_dict` method. This impacts the following generic components: `CacheChecker`, `DocumentWriter`, `FilterRetriever`, and `SentenceWindowRetriever`.
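The `datetime`-to-string conversion for DOCX metadata amounts to something like this sketch (the field names are illustrative, not the exact DOCX metadata keys, and the ISO format is an assumption about the string representation):

```python
import json
from datetime import datetime

# Example metadata with a non-JSON-serializable datetime value.
meta = {"author": "A. Author", "created": datetime(2024, 9, 1, 12, 30)}

# Convert datetime values to ISO strings so json.dumps succeeds.
serializable = {
    key: value.isoformat() if isinstance(value, datetime) else value
    for key, value in meta.items()
}
print(json.dumps(serializable))
# {"author": "A. Author", "created": "2024-09-01T12:30:00"}
```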
v2.5.0-rc3
Release Notes
Enhancement Notes
- Adapts how ChatPromptBuilder creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.
v2.5.0-rc2
Release Notes
Upgrade Notes
- Remove ChatMessage.to_openai_format method. Use haystack.components.generators.openai_utils._convert_message_to_openai_format instead.
- Remove unused debug parameter from Pipeline.run method.
- Removed the deprecated SentenceWindowRetrieval, replaced by SentenceWindowRetriever.
New Features
- Add unsafe argument to enable behaviour that could lead to remote code execution in ConditionalRouter and OutputAdapter. By default unsafe behaviour is not enabled, the user must set it explicitly to True. This means that user types like ChatMessage, Document, and Answer can be used as output types when unsafe is True. We recommend using unsafe behaviour only when the Jinja templates source is trusted. For more info see the documentation for ConditionalRouter and OutputAdapter
Enhancement Notes
- The parameter min_top_k is added to the TopPSampler which sets the minimum number of documents to be returned when the top-p sampling algorithm results in fewer documents being selected. The documents with the next highest scores are added to the selection. This is useful when we want to guarantee a set number of documents will always be passed on, but allow the Top-P algorithm to still determine if more documents should be sent based on document score.
- Introduce a utility function to deserialize a generic Document Store from the init_parameters of a serialized component.
- Refactor deserialize_document_store_in_init_parameters so that the new function name indicates that the operation occurs in place, with no return value.
- The SentenceWindowRetriever now has an extra output key containing all the documents belonging to the context window.
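The windowing itself is easy to picture; a sketch of how a context window around one retrieved split might be gathered (illustrative, not the retriever's internals):

```python
# Given an ordered list of sentence splits and the index of the retrieved
# split, return the splits inside a window of `window_size` on each side.
def context_window(splits, index, window_size=1):
    start = max(0, index - window_size)
    end = min(len(splits), index + window_size + 1)
    return splits[start:end]

print(context_window(["s0", "s1", "s2", "s3", "s4"], index=2, window_size=1))
# ['s1', 's2', 's3']
```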
Deprecation Notes
- SentenceWindowRetrieval is deprecated and will be removed in the future. Use SentenceWindowRetriever instead.
- 'gpt-3.5-turbo', the default model for the OpenAIGenerator and OpenAIChatGenerator, will be replaced by 'gpt-4o-mini'.
Bug Fixes
- Fixed an issue where page breaks were not being extracted from DOCX files.
- Use a forward reference for the Paragraph class in the DOCXToDocument converter to prevent import errors.
- The metadata produced by DOCXToDocument component is now JSON serializable. Previously, it contained datetime objects automatically extracted from DOCX files, which are not JSON serializable. Now, the datetime objects are converted to strings.
- Starting from haystack-ai==2.4.0, Haystack is compatible with sentence-transformers>=3.0.0; earlier versions of sentence-transformers are not supported. We are updating the test dependency and the LazyImport messages to reflect that.
- For components that support multiple Document Stores, prioritize using the specific from_dict class method for deserialization when available. Otherwise, fall back to the generic default_from_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.