PIP-377: Automatic retry for failed acknowledgements

Motivation

Apache Pulsar currently has known gaps in acknowledgement (ack) handling, particularly in scenarios that require key-ordered message processing (Failover, Exclusive, or Key_Shared subscriptions) and during broker restarts or topic unloads triggered by Pulsar load balancing. In these scenarios, acknowledgements can be lost. Lost acknowledgements result in stuck consumers, because of the key-ordered message delivery rules, and in additional message duplication, both of which affect the reliability and end-to-end latency of message processing. The intention is to have a solution that doesn't require enabling Pulsar transactions.

Pulsar's default mode is at-least-once messaging, so duplicates are acceptable, but lost acknowledgements cause unnecessary duplicate messages. In key-ordered message processing with Key_Shared subscriptions, a lost acknowledgement also stops delivery of further messages that have the same key as the message whose acknowledgement was lost.

These situations currently cause unnecessary disruptions to applications that use Key_Shared processing: manual intervention or automated monitoring is needed to detect stuck consumers and to recover by restarting individual consumers.

Alternative solution considerations

One of the primary motivations for adding automatic retries for failed acknowledgements is to enhance the reliability of key-ordered message processing with Key_Shared subscriptions. However, it may be possible to improve the current situation without implementing the proposed solution. A deeper analysis should first be conducted to reproduce the current issues and identify their root cause. If the root cause necessitates automatic retries for failed acknowledgements, this proposal's priority should be increased. Otherwise, alternative solutions should be prioritized before considering automatic retries.

Detailed Design

This proposal aims to address these issues by enhancing the existing "ack receipt" feature with an automatic retry mechanism for failed acknowledgements. The solution is built upon the existing "ack receipt" feature at the binary protocol level. Users do not need to configure the "ack receipt" feature explicitly when autoRetryAcknowledgement is enabled. The gaps in the current "ack receipt" feature need to be addressed to achieve the desired outcome, for example issue #23261 ("When ack receipts are enabled, no response is sent to the client if the topic has been unloaded or is being transferred").

Public API

The following new methods will be added to the ConsumerBuilder interface:

    /**
     * Enable or disable automatic retry for failed acknowledgements.
     *
     * @param autoRetryAcknowledgement whether to automatically retry failed acknowledgements
     * @return the consumer builder instance
     */
    ConsumerBuilder<T> autoRetryAcknowledgement(boolean autoRetryAcknowledgement);

    /**
     * Overrides the default maximum number of retry attempts for a failed acknowledgement
     * when autoRetryAcknowledgement is enabled.
     *
     * @param maxAckRetries the maximum number of retry attempts
     * @return the consumer builder instance
     */
    ConsumerBuilder<T> maxAcknowledgementRetries(int maxAckRetries);

    /**
     * Overrides the default retry delay backoff for acknowledgement retries.
     * This is used when autoRetryAcknowledgement is enabled.
     *
     * @param ackRetryBackoff the backoff strategy to use for retries
     * @return the consumer builder instance
     */
    ConsumerBuilder<T> autoRetryAcknowledgementBackoff(RedeliveryBackoff ackRetryBackoff);

This example applies to the Pulsar Java client. Other clients can implement similar changes for adding the autoRetryAcknowledgement mode.
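As a rough illustration of how the three proposed options might be combined, here is a minimal sketch. Because these methods do not exist in the Pulsar client yet, the snippet declares small stand-in versions of `ConsumerBuilder` and `RedeliveryBackoff` so that it compiles on its own. `RecordingBuilder`, `AckRetryConfigExample`, and the chosen values (5 retries, a delay that starts at 100 ms and doubles up to 10 s) are assumptions made for the example, not part of the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for org.apache.pulsar.client.api.RedeliveryBackoff, reduced to one method.
interface RedeliveryBackoff {
    long next(int redeliveryCount); // delay in milliseconds
}

// Stand-in that exposes only the three methods proposed above; not the real interface.
interface ConsumerBuilder<T> {
    ConsumerBuilder<T> autoRetryAcknowledgement(boolean autoRetryAcknowledgement);
    ConsumerBuilder<T> maxAcknowledgementRetries(int maxAckRetries);
    ConsumerBuilder<T> autoRetryAcknowledgementBackoff(RedeliveryBackoff ackRetryBackoff);
}

// Hypothetical builder that only records the settings it receives.
final class RecordingBuilder<T> implements ConsumerBuilder<T> {
    final Map<String, Object> settings = new LinkedHashMap<>();

    @Override
    public ConsumerBuilder<T> autoRetryAcknowledgement(boolean autoRetryAcknowledgement) {
        settings.put("autoRetryAcknowledgement", autoRetryAcknowledgement);
        return this;
    }

    @Override
    public ConsumerBuilder<T> maxAcknowledgementRetries(int maxAckRetries) {
        settings.put("maxAcknowledgementRetries", maxAckRetries);
        return this;
    }

    @Override
    public ConsumerBuilder<T> autoRetryAcknowledgementBackoff(RedeliveryBackoff ackRetryBackoff) {
        settings.put("autoRetryAcknowledgementBackoff", ackRetryBackoff);
        return this;
    }
}

final class AckRetryConfigExample {
    // Exponential backoff: minDelayMs doubled per retry, capped at maxDelayMs.
    static RedeliveryBackoff exponential(long minDelayMs, long maxDelayMs) {
        return count -> Math.min(maxDelayMs, minDelayMs << Math.min(Math.max(count, 0), 20));
    }

    // Opt in to automatic ack retries and override both defaults (illustrative values).
    static <T> ConsumerBuilder<T> configure(ConsumerBuilder<T> builder) {
        return builder
                .autoRetryAcknowledgement(true)
                .maxAcknowledgementRetries(5)
                .autoRetryAcknowledgementBackoff(exponential(100, 10_000));
    }

    public static void main(String[] args) {
        RecordingBuilder<String> builder = new RecordingBuilder<>();
        configure(builder);
        System.out.println(builder.settings.keySet());
    }
}
```

In a real client, the same chain would presumably be applied to the builder returned by the Pulsar client before subscribing. Only the first call is required to opt in; the other two override defaults.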

Proposed Changes

  • Implement a new autoRetryAcknowledgement mode for Pulsar clients where acknowledgements that fail (due to broker restarts, topic unloads, Pulsar load balancing, or other issues) are automatically retried by the client.

  • Modify the ServerCnx class to send failure responses for discarded acknowledgements when ack receipts are enabled to fix issue #23261.

  • Implement a new component in the client library to manage automatic retries of failed acknowledgements.

  • When autoRetryAcknowledgement is enabled, the "ack receipt" feature is used under the hood. One difference is that the .acknowledge method should remain asynchronous, with the retries happening in the background. The existing "ack receipt" feature makes .acknowledge synchronous, which is not the desired behavior for many applications because every acknowledgement then waits for a server round-trip, which hurts performance.

  • When both autoRetryAcknowledgement and "ack receipt" are enabled, the existing "ack receipt" behavior of synchronous acks will be used. The .acknowledge method will return only after the ack has succeeded or has failed after all retry attempts. Similarly, the future returned by the .acknowledgeAsync method will complete only after the automatic retries have completed.

  • Update the ConsumerBuilder interface to include options for configuring automatic ack retries. This applies to the Java client. Other clients could implement similar changes.

  • Implement additional client-side metrics to track failed acknowledgements, retry attempts, and success rates.

  • Update relevant documentation to reflect the new feature and its proper usage.
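The client-side retry component and the metrics described in the list above could look roughly like the following sketch. It is not the proposed implementation. `AckRetrySketch`, its counters, and the `failingFirst` helper that simulates a broker rejecting acks are hypothetical names, and the ack send is modeled as a plain `Supplier<CompletableFuture<Void>>` instead of a real binary-protocol command with an ack receipt.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntToLongFunction;
import java.util.function.Supplier;

// Hypothetical retry component; none of these names are Pulsar APIs.
final class AckRetrySketch {
    // Plain counters standing in for the proposed client-side metrics.
    final AtomicInteger failedAttempts = new AtomicInteger();
    final AtomicInteger retries = new AtomicInteger();
    final AtomicInteger succeeded = new AtomicInteger();

    private final ScheduledExecutorService scheduler;
    private final int maxRetries;
    private final IntToLongFunction backoffMs; // retry index -> delay in milliseconds

    AckRetrySketch(ScheduledExecutorService scheduler, int maxRetries, IntToLongFunction backoffMs) {
        this.scheduler = scheduler;
        this.maxRetries = maxRetries;
        this.backoffMs = backoffMs;
    }

    // Returns immediately. The future completes when an attempt succeeds,
    // or fails once all retry attempts are exhausted.
    CompletableFuture<Void> acknowledge(Supplier<CompletableFuture<Void>> sendAck) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(sendAck, 0, result);
        return result;
    }

    private void attempt(Supplier<CompletableFuture<Void>> sendAck, int retryCount,
                         CompletableFuture<Void> result) {
        CompletableFuture<Void> sent;
        try {
            sent = sendAck.get();
        } catch (RuntimeException e) {
            sent = new CompletableFuture<>();
            sent.completeExceptionally(e);
        }
        sent.whenComplete((ok, error) -> {
            if (error == null) {
                succeeded.incrementAndGet();
                result.complete(null);
                return;
            }
            failedAttempts.incrementAndGet();
            if (retryCount >= maxRetries) {
                result.completeExceptionally(error);
                return;
            }
            retries.incrementAndGet();
            // Retry in the background after the backoff delay.
            scheduler.schedule(() -> attempt(sendAck, retryCount + 1, result),
                    backoffMs.applyAsLong(retryCount), TimeUnit.MILLISECONDS);
        });
    }

    // Simulated ack sender whose first `failures` calls fail, e.g. during a topic unload.
    static Supplier<CompletableFuture<Void>> failingFirst(int failures) {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            CompletableFuture<Void> f = new CompletableFuture<>();
            if (calls.incrementAndGet() <= failures) {
                f.completeExceptionally(new IllegalStateException("topic unloaded"));
            } else {
                f.complete(null);
            }
            return f;
        };
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AckRetrySketch retrier = new AckRetrySketch(scheduler, 3, n -> 10L << n);
        retrier.acknowledge(failingFirst(2)).join(); // blocks only for this demo
        System.out.println("retries=" + retrier.retries.get());
        scheduler.shutdown();
    }
}
```

Under these assumptions, the asynchronous mode would return from .acknowledge without waiting on the future, while the combined mode with "ack receipt" enabled would block on it, as `join()` does in the demo.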

Compatibility, Deprecation, and Migration Plan

This feature will be opt-in. It doesn't introduce backwards compatibility issues with existing implementations. Clients not utilizing the new automatic retry option will continue to function as before. No deprecation or migration is required for existing users.

Test Plan

Comprehensive testing will include:

  1. Unit tests for the new retry mechanism.
  2. Integration tests simulating various failure scenarios (broker restarts, topic unloads, network issues).
  3. Performance tests to ensure the retry mechanism does not introduce significant overhead.

Links