Kafka / KAFKA-15147

Measure pending and outstanding Remote Segment operations

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.7.0
    • Component/s: core

    Description

      https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage

      KAFKA-15833: RemoteCopyLagBytes

      KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, RemoteDeleteLagSegments

      KAFKA-16013: ExpiresPerSec

      KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, RemoteLogMetadataCount

      KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec

      ====

      Remote Log Segment operations (copy/delete) are executed by the Remote Storage Manager and recorded by the Remote Log Metadata Manager (e.g. the default TopicBasedRLMM writes state changes on remote log segments to an internal Kafka topic).

      As executions run, fail, and retry, it is important to know how many operations are pending and outstanding over time so that operators can be alerted.

      The number of pending operations alone is not enough to alert on, as the value can oscillate close to zero. An additional condition (running time > threshold) needs to apply for an operation to be considered outstanding.

      Proposal:

      RemoteLogManager could be extended with 2 concurrent maps (pendingSegmentCopies, pendingSegmentDeletes) of type `Map[Uuid, Long]` that record, per segmentId, the time when the operation started. Based on these, 2 metrics could be exposed per operation:

      • pendingSegmentCopies: gauge of the size of the pendingSegmentCopies map
      • outstandingSegmentCopies: gauge computed by looping over the pending ops and counting those where now - startedTime > timeout (maybe at debug level?)
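
To make the idea concrete, here is a rough sketch of the bookkeeping for one operation type. It is only an illustration under assumptions: the class name `PendingSegmentOps` and its methods are hypothetical and not part of RemoteLogManager, `java.util.UUID` stands in for Kafka's `Uuid`, and the wiring of the two values into actual gauges is left out.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not existing Kafka code: remembers when each segment
// operation (copy or delete) started. java.util.UUID is an assumption used
// here in place of Kafka's Uuid so that the sketch has no dependencies.
class PendingSegmentOps {
    private final ConcurrentMap<UUID, Long> startedAtMs = new ConcurrentHashMap<>();

    // Record the start time once; a retry of the same segment keeps the
    // original timestamp, so the running time covers all attempts.
    void started(UUID segmentId, long nowMs) {
        startedAtMs.putIfAbsent(segmentId, nowMs);
    }

    // Call when the operation completes, whether it succeeded or was abandoned.
    void finished(UUID segmentId) {
        startedAtMs.remove(segmentId);
    }

    // Value for the "pending" gauge: number of operations in flight.
    int pending() {
        return startedAtMs.size();
    }

    // Value for the "outstanding" gauge: pending operations that have been
    // running for longer than the timeout.
    long outstanding(long nowMs, long timeoutMs) {
        return startedAtMs.values().stream()
                .filter(start -> nowMs - start > timeoutMs)
                .count();
    }
}
```

One instance would be kept for copies and one for deletes. Passing `nowMs` in explicitly, rather than reading the clock inside the class, keeps the outstanding count easy to test.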

      Is this a valuable metric to add to Tiered Storage, or is it better solved in a custom RLMM implementation?

      Also, does it require a KIP?

      Thanks!

          People

            Assignee: christo_lolov Christo Lolov
            Reporter: jeqo Jorge Esteban Quilcate Otoya