[KAFKA-17831] Transaction coordinators returning COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted after network instability - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.6.1, 3.7.1
Fix Version/s: None
Component/s: core
Labels: None

Description

After a heavy network outage/instability, our brokers ended up in a state in which some producers could no longer perform transactions: the brokers kept responding to those producers with `COORDINATOR_LOAD_IN_PROGRESS`. We saw corresponding DEBUG logs on the brokers:

DEBUG [TransactionCoordinator id=11] Returning COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) [data-plane-kafka-request-handler-5]

This did not affect all transactions, only a subset of transactional ids that hash to the same `__transaction_state` partition and are therefore handled by the same transaction coordinator (the leader of that partition). The first time, we resolved it by reassigning partition leaders for the transaction topic; the second time, by simply restarting brokers.
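As an illustration of why only some transactional ids were affected, here is a minimal Java sketch of how an id is mapped to a `__transaction_state` partition. It assumes the mapping is the non-negative `String.hashCode()` of the id modulo the partition count of the transaction topic. The class and method names are hypothetical, and 50 is an assumption based on the default of `transaction.state.log.num.partitions`; a cluster may be configured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kafka: groups transactional ids by the
// __transaction_state partition they are assumed to map to.
final class TxnPartitionSketch {
    // Assumption: default value of transaction.state.log.num.partitions.
    static final int ASSUMED_PARTITION_COUNT = 50;

    // Non-negative hash of the id, modulo the number of partitions.
    // Integer.MIN_VALUE has no positive counterpart, so it is mapped to 0.
    static int partitionFor(String transactionalId, int numPartitions) {
        int hash = transactionalId.hashCode();
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % numPartitions;
    }

    // Returns the ids that would share the given partition, and therefore the
    // same transaction coordinator.
    static List<String> idsOnPartition(List<String> ids, int partition, int numPartitions) {
        List<String> result = new ArrayList<>();
        for (String id : ids) {
            if (partitionFor(id, numPartitions) == partition) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Usage: pass transactional ids as arguments to see their partitions.
        for (String id : args) {
            System.out.println(id + " -> __transaction_state-"
                    + partitionFor(id, ASSUMED_PARTITION_COUNT));
        }
    }
}
```

Running this over the transactional ids of the failing producers should show whether they all land on the same partition, which is what the symptoms above suggest.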

This led us to believe that the cause has to be some kind of dirty in-memory state that transaction coordinators hold for a `__transaction_state` partition. We found two cases (#1, #2) in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. In both cases `loadingPartitions` contains state signalling that the TransactionStateManager is still busy initializing transactional data for that `__transaction_state` partition.
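The suspected symptom can be illustrated with a toy model. This is not Kafka's code and it does not establish how a leftover entry would arise in a real broker; it only shows that if load bookkeeping is keyed by partition and epoch while requests are checked per partition, a single entry that is never removed keeps the whole partition answering `COORDINATOR_LOAD_IN_PROGRESS`. All names below are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (hypothetical, not Kafka code) of "loading" bookkeeping for
// transaction-state partitions.
final class LoadingPartitionsSketch {
    // partition id -> coordinator epochs for which a load is still marked as running
    private final Map<Integer, Set<Integer>> loadingEpochsByPartition = new HashMap<>();

    // Marks a load as started for the given partition and epoch.
    synchronized void beginLoad(int partition, int epoch) {
        loadingEpochsByPartition
                .computeIfAbsent(partition, p -> new HashSet<>())
                .add(epoch);
    }

    // Clears only the entry for exactly this partition and epoch.
    synchronized void finishLoad(int partition, int epoch) {
        Set<Integer> epochs = loadingEpochsByPartition.get(partition);
        if (epochs == null) {
            return;
        }
        epochs.remove(epoch);
        if (epochs.isEmpty()) {
            loadingEpochsByPartition.remove(partition);
        }
    }

    // Requests are rejected while any entry exists for the partition,
    // regardless of which epoch that entry belongs to.
    synchronized String errorFor(int partition) {
        return loadingEpochsByPartition.containsKey(partition)
                ? "COORDINATOR_LOAD_IN_PROGRESS"
                : "NONE";
    }
}
```

In this model, calling `beginLoad(33, 239)`, `beginLoad(33, 240)` and then only `finishLoad(33, 240)` leaves partition 33 rejecting requests until the entry for epoch 239 is cleared, or until the in-memory state is discarded, which is what a broker restart would do.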

We believe that the network outage caused partition leadership to shift continuously between replicas, and that this somehow led to outdated entries in `loadingPartitions` that were never cleaned up. I had a look at the method where `loadingPartitions` is updated and cleaned, but wasn't able to identify a case in which the cleanup could fail. The logs showed that, at least after the network outage, the affected partitions seem to have been loaded successfully, but the coordinator continued to return `COORDINATOR_LOAD_IN_PROGRESS`:

...
INFO [Transaction State Manager 11]: Loading transaction metadata from __transaction_state-33 at epoch 240 (kafka.coordinator.transaction.TransactionStateManager) [transaction-log-manager-0]
...
INFO [Transaction State Manager 11]: Finished loading 1 transaction metadata from __transaction_state-33 in 511 milliseconds, of which 1 milliseconds was spent in the scheduler. (kafka.coordinator.transaction.TransactionStateManager) [transaction-log-manager-0]
...
INFO [Transaction State Manager 11]: Completed loading transaction metadata from __transaction_state-33 for coordinator epoch 240 (kafka.coordinator.transaction.TransactionStateManager) [transaction-log-manager-0]
...
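To check broker logs for loads that started but never logged completion, a small scanner along the following lines could help. It is a sketch: the two patterns are taken from the INFO lines quoted above, the class and method names are hypothetical, and it assumes the log lines are read in chronological order.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scanner: reports "partition@epoch" pairs that have a
// "Loading ..." line without a matching "Completed loading ..." line.
final class UnfinishedLoadScan {
    private static final Pattern STARTED = Pattern.compile(
            "Loading transaction metadata from __transaction_state-(\\d+) at epoch (\\d+)");
    private static final Pattern COMPLETED = Pattern.compile(
            "Completed loading transaction metadata from __transaction_state-(\\d+)"
                    + " for coordinator epoch (\\d+)");

    static Set<String> unfinishedLoads(List<String> logLines) {
        Set<String> open = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher started = STARTED.matcher(line);
            if (started.find()) {
                // Remember the load until its completion line is seen.
                open.add(started.group(1) + "@" + started.group(2));
                continue;
            }
            Matcher completed = COMPLETED.matcher(line);
            if (completed.find()) {
                open.remove(completed.group(1) + "@" + completed.group(2));
            }
        }
        return open;
    }
}
```

For the excerpt above the result would be empty, since the load of `__transaction_state-33` at epoch 240 has both a start and a completion line. Any pair it does report is only a candidate for further inspection, not proof of a stale `loadingPartitions` entry.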

People

Assignee: Unassigned
Reporter: Kay Hartmann
Votes: 0
Watchers: 4

Dates

Created: 18/Oct/24 10:21
Updated: 21/Oct/24 08:49