Umbrella Issue: Performance Counters #12882

Open
43 tasks
miacim opened this issue Sep 19, 2024 · 0 comments
Labels: perf (for issues tracking performance problems/improvements), performance monitoring (feature or bug related to performance monitoring)


Introduction

Performance monitoring is essential for optimizing system performance. This umbrella issue focuses on implementing a performance counters system that encompasses both hardware and software counters, accessible from kernel and user-space environments. The system aims to provide deep insights into the accelerator's operation, and to facilitate detailed performance analysis and debugging.

Note: Individual issues for each task will be opened and linked to this umbrella issue.

Hardware Counters

The hardware counters are exposed through debug registers and can be read directly from the RISC cores. Five physical hardware units provide access to a number of different counters. These counters report on various aspects of the accelerator's performance, such as Floating Point Unit (FPU) utilization, Level 1 (L1) cache behavior, and other critical metrics.

Software Counters

In addition to hardware counters, software counters are needed to measure the duration of specific functions or events within the firmware running on the RISC cores. Measurements of interest include:

  • First-unpack to last-pack
  • Unpack/math/pack runtime
  • Time between previous kernel stop and new kernel start
  • Default target (middle input) versus first, middle, last inputs
  • Wait times for tiles and free tiles
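
As an illustration of the first item in the list above, here is a minimal host-side sketch of a first-unpack to last-pack measurement. `FakeClock` and `SpanCounter` are hypothetical names used only for this example; a real firmware counter would read a cycle register instead of a simulated clock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cycle source. On real hardware this would read a cycle
// register; here it is a plain variable so the sketch runs anywhere.
struct FakeClock {
    std::uint64_t cycles = 0;
    std::uint64_t now() const { return cycles; }
    void advance(std::uint64_t n) { cycles += n; }
};

// Tracks the span from the first "begin" event to the last "end" event,
// for example first-unpack to last-pack.
class SpanCounter {
public:
    void on_begin(std::uint64_t t) {
        if (!started_) {  // only the first begin event is remembered
            first_begin_ = t;
            started_ = true;
        }
    }
    void on_end(std::uint64_t t) {
        assert(started_ && "end event seen before any begin event");
        last_end_ = t;  // every end event overwrites the previous one
        ended_ = true;
    }
    // Cycles from the first begin to the last end; 0 if the span is incomplete.
    std::uint64_t span() const {
        return (started_ && ended_) ? last_end_ - first_begin_ : 0;
    }

private:
    std::uint64_t first_begin_ = 0;
    std::uint64_t last_end_ = 0;
    bool started_ = false;
    bool ended_ = false;
};

// Simulates three tiles, each taking 10 cycles of unpack/math and 5 of pack.
std::uint64_t demo_first_unpack_to_last_pack() {
    FakeClock clk;
    SpanCounter counter;
    for (int tile = 0; tile < 3; ++tile) {
        counter.on_begin(clk.now());  // unpack starts
        clk.advance(10);              // unpack + math
        clk.advance(5);               // pack
        counter.on_end(clk.now());    // pack finished
    }
    return counter.span();
}
```

The same begin/end pattern could cover the gap between a previous kernel stop and a new kernel start by treating the stop as the begin event and the start as the end event.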

Data Collection Considerations

Data collection strategies must balance the need for detailed information with performance overhead. Options include collecting all values for specific counters, sampling every nth value, or summarizing data. Enabling software counters affects firmware performance, so it's crucial to enable them judiciously, possibly through compile-time options.
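
The three collection options mentioned above (all values, every nth value, summarized data) can be sketched as a single collector with a selectable mode. This is a host-side illustration only: `SamplingMode`, `Summary`, and `Collector` are hypothetical names, and `std::vector` stands in for whatever fixed-size buffer the firmware would actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

enum class SamplingMode { All, EveryNth, Summary };

// Running statistics kept instead of individual samples.
struct Summary {
    std::uint64_t count = 0;
    std::uint64_t sum = 0;
    std::uint64_t min_value = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max_value = 0;
};

class Collector {
public:
    explicit Collector(SamplingMode mode, std::size_t n = 1) : mode_(mode), n_(n) {
        assert(n_ > 0 && "sampling interval must be positive");
    }

    void record(std::uint64_t v) {
        const std::uint64_t index = seen_++;  // 0-based index of this value
        switch (mode_) {
            case SamplingMode::All:
                samples_.push_back(v);
                break;
            case SamplingMode::EveryNth:
                // Keeps the 1st, (n+1)th, (2n+1)th, ... value.
                if (index % n_ == 0) samples_.push_back(v);
                break;
            case SamplingMode::Summary:
                ++summary_.count;
                summary_.sum += v;
                if (v < summary_.min_value) summary_.min_value = v;
                if (v > summary_.max_value) summary_.max_value = v;
                break;
        }
    }

    const std::vector<std::uint64_t>& samples() const { return samples_; }
    const Summary& summary() const { return summary_; }

private:
    SamplingMode mode_;
    std::size_t n_;
    std::uint64_t seen_ = 0;
    std::vector<std::uint64_t> samples_;
    Summary summary_;
};

// Feeds the values 1..10 into a collector using the given mode.
Collector collect_one_to_ten(SamplingMode mode, std::size_t n = 1) {
    Collector c(mode, n);
    for (std::uint64_t v = 1; v <= 10; ++v) c.record(v);
    return c;
}
```

Summary mode uses constant memory regardless of how many values are recorded, which is the trade-off that matters most when buffer space is limited.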

Objective

The goal is to develop a robust performance counters system that integrates both hardware and software counters, provides efficient data collection and storage mechanisms, and allows for detailed performance analysis without significantly impacting system performance.

Milestone 1: Establishing the Hardware Performance Counters Foundation

Objective: Lay the groundwork by defining all hardware performance counters and developing the kernel-level API, accompanied by comprehensive documentation and initial testing. Establishing a solid foundation with well-documented hardware counters and a robust kernel-level API is crucial before proceeding to more complex aspects of the project. Initial testing ensures that the core components function correctly, reducing potential issues later on.

  • Definition and Documentation of Hardware Counter Types: Define and document all hardware performance counters, including their usage across different architectures and RISC cores.

    • Enumerate Existing Counters: List and define all existing hardware performance counters.
    • Comprehensive Documentation: Write a detailed README explaining the functionality, usage, and access methods for each counter.
    • Header Files Generation: Produce header files containing constants and definitions for each supported architecture.
  • Kernel-Level API Development for Hardware Performance Counters: Develop a kernel-level API to interact with hardware performance counters effectively, accessible from the RISC cores via debug registers.

    • Implement init_counter(): Initialize specific hardware performance counters.
    • Implement start_counter(): Start an individual hardware performance counter.
    • Implement stop_counter(): Stop an individual hardware performance counter.
    • Implement reset_counter(): Reset counters to their default state.
    • Implement read_counter_value(): Retrieve the current value of a hardware performance counter.
    • Implement read_counter_cycles(): Obtain the number of cycles counted.
    • Implement start_all_counters(): Start all available hardware performance counters simultaneously.
    • Implement stop_all_counters(): Stop all active hardware performance counters simultaneously.
    • Document API: Write a detailed README explaining the API.
  • Testing (Initial Phase)

    • Kernel API Testing: Write unit tests for each kernel API function (init_counter, start_counter, stop_counter, etc.).
    • Hardware Testing: Verify that counters work and have meaningful values.
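
To make the shape of the kernel-level API more concrete, here is a rough sketch of how the functions listed above might fit together. Only the function names come from this issue; the signatures, the `CounterId` values, the `SimCounter` struct, and the `sim_tick()` helper are assumptions made for illustration. The counters are simulated in memory so the sketch runs on a host, whereas the real implementation would read and write debug registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical counter identifiers; the real list is defined in Milestone 1.
enum class CounterId : std::size_t { Fpu = 0, L1 = 1, Count = 2 };

// Stand-in for one hardware counter. Real counters sit behind debug registers.
struct SimCounter {
    bool initialized = false;
    bool running = false;
    std::uint64_t value = 0;   // events counted while running
    std::uint64_t cycles = 0;  // cycles elapsed while running
};

static std::array<SimCounter, static_cast<std::size_t>(CounterId::Count)> g_counters{};

static SimCounter& at(CounterId id) {
    const auto i = static_cast<std::size_t>(id);
    assert(i < g_counters.size());
    return g_counters[i];
}

void init_counter(CounterId id) {
    at(id) = SimCounter{};
    at(id).initialized = true;
}
void start_counter(CounterId id) {
    assert(at(id).initialized && "counter must be initialized first");
    at(id).running = true;
}
void stop_counter(CounterId id) { at(id).running = false; }
void reset_counter(CounterId id) {
    // Assumed meaning of "default state": stopped, with zeroed totals.
    at(id).running = false;
    at(id).value = 0;
    at(id).cycles = 0;
}
std::uint64_t read_counter_value(CounterId id) { return at(id).value; }
std::uint64_t read_counter_cycles(CounterId id) { return at(id).cycles; }
void start_all_counters() {
    for (auto& c : g_counters) if (c.initialized) c.running = true;
}
void stop_all_counters() {
    for (auto& c : g_counters) c.running = false;
}

// Simulation-only helper: pretends the hardware saw `events` over `cycles`.
void sim_tick(CounterId id, std::uint64_t events, std::uint64_t cycles) {
    if (at(id).running) {
        at(id).value += events;
        at(id).cycles += cycles;
    }
}

// Example: derive a utilization percentage from value and cycles.
std::uint64_t demo_fpu_utilization_percent() {
    init_counter(CounterId::Fpu);
    start_counter(CounterId::Fpu);
    sim_tick(CounterId::Fpu, 30, 100);  // 30 busy events over 100 cycles
    stop_counter(CounterId::Fpu);
    sim_tick(CounterId::Fpu, 50, 100);  // ignored: the counter is stopped
    return 100 * read_counter_value(CounterId::Fpu) /
           read_counter_cycles(CounterId::Fpu);
}
```

Pairing `read_counter_value()` with `read_counter_cycles()` is what allows a raw count to be turned into a rate or utilization figure, which is presumably why both are listed as separate functions.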

Milestone 2: Development of Data Collection Strategies and Software Performance Counters

Objective: Formulate efficient data collection strategies and implement software performance counters within firmware, addressing challenges related to limited shared memory and data transfer mechanisms.

  • Data Collection Strategies: Formulate strategies for collecting performance data efficiently, balancing detail and performance overhead.

    • Functional Spec: Write a functional specification for data collection that covers buffers, data sampling options, real-time collection, and post-execution collection. Define an abstract performance counter.
      • Review existing data collection strategies on TT-Metal and TT-BudaBackend
    • Define L1 buffers for Data Collection
    • Data Sampling Options: Implement collecting all values, every nth value, or summarized data for specific counters.
    • Real-Time Collection: Implement strategies for real-time data collection such as per-event or per-operation tracking within firmware.
    • Post-Execution Collection: Develop methods for aggregating and analyzing performance data after execution.
    • Data Transfer Mechanisms: Implement data transfer from L1 to DRAM and from DRAM to host applications.
  • Software Performance Counters

    • Software Performance Counter Template: Each counter will have a unique identifier, measure cycle duration, and provide a counter value based on the type of counter. The template will enable unified collection across all counters.
    • Define counters: Implement software counters to measure events like first-unpack to last-pack times, unpack/math/pack runtimes, and wait times for tiles.
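
One possible shape for the software counter template and its L1 buffer is sketched below. All names here (`CounterType`, `CounterRecord`, `RecordBuffer`, `SoftwareCounter`) are hypothetical, and the drop-when-full policy is just one option the functional spec might choose. The fixed-capacity buffer stands in for an L1 region of limited size.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical counter types; the real set would come from the functional spec.
enum class CounterType : std::uint8_t { Duration, WaitForTile };

// One uniform record per measurement, so every counter is collected the same way.
struct CounterRecord {
    std::uint16_t id;           // unique counter identifier
    CounterType type;
    std::uint64_t start_cycle;
    std::uint64_t end_cycle;
    std::uint64_t duration() const { return end_cycle - start_cycle; }
};

// Fixed-capacity buffer. When full, new records are dropped and counted.
template <std::size_t Capacity>
class RecordBuffer {
public:
    bool push(const CounterRecord& r) {
        if (size_ == Capacity) {
            ++dropped_;
            return false;
        }
        records_[size_++] = r;
        return true;
    }
    std::size_t size() const { return size_; }
    std::size_t dropped() const { return dropped_; }
    const CounterRecord& operator[](std::size_t i) const {
        assert(i < size_);
        return records_[i];
    }

private:
    std::array<CounterRecord, Capacity> records_{};
    std::size_t size_ = 0;
    std::size_t dropped_ = 0;
};

// A software counter that writes one record per start/stop pair.
template <typename Buffer>
class SoftwareCounter {
public:
    SoftwareCounter(std::uint16_t id, CounterType type, Buffer& buffer)
        : id_(id), type_(type), buffer_(buffer) {}

    void start(std::uint64_t now) {
        start_ = now;
        active_ = true;
    }
    void stop(std::uint64_t now) {
        assert(active_ && "stop() called without a matching start()");
        buffer_.push({id_, type_, start_, now});
        active_ = false;
    }

private:
    std::uint16_t id_;
    CounterType type_;
    Buffer& buffer_;
    std::uint64_t start_ = 0;
    bool active_ = false;
};

struct DemoResult {
    std::size_t stored;
    std::size_t dropped;
    std::uint64_t total_cycles;
};

// Three measurements into a two-slot buffer: the third one is dropped.
DemoResult demo_software_counter() {
    RecordBuffer<2> buffer;
    SoftwareCounter<RecordBuffer<2>> wait_counter(7, CounterType::WaitForTile, buffer);
    wait_counter.start(0);
    wait_counter.stop(10);
    wait_counter.start(20);
    wait_counter.stop(35);
    wait_counter.start(40);
    wait_counter.stop(50);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < buffer.size(); ++i) total += buffer[i].duration();
    return {buffer.size(), buffer.dropped(), total};
}
```

Tracking the number of dropped records makes buffer overflow visible to whatever later reads the data, rather than silently truncating the measurement.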

Milestone 3: Comprehensive Testing and Integration with Existing Toolchains

  • Testing (Advanced Phase)

    • Stress Testing: Execute stress tests under high-load conditions to assess system stability and counter reliability.
    • Validate Data Accuracy: Ensure that both hardware and software counters provide accurate and meaningful data in various system contexts.
  • Integration with Existing Toolchains:

    • DPRINT: Integrate performance counters with DPRINT for enhanced debug-level logging within firmware and kernel.
    • WATCHER: Integrate with WATCHER for real-time monitoring and observation of performance metrics.
    • TRACY/PROFILER: Integrate with TRACY for advanced performance analysis and visualization, including data from both hardware and software counters.

Milestone 4: System Optimization and Documentation Updates

Objective: Optimize the performance counters system to minimize overhead, implement compile-time options for flexibility, and update documentation to reflect best practices and performance considerations.

  • Optimizations: Analyze and optimize the performance counters system to minimize overhead and maximize efficiency.
    • Overhead Measurement: Quantify the performance overhead introduced by both hardware and software counters.
    • API and Firmware Optimization: Refine API functions and firmware code to reduce latency and resource consumption.
    • Conditional Features: Implement compile-time options to enable or disable counters to control performance impact.
  • Documentation Updates
    • Update existing documentation to include optimized practices.
    • Provide guidelines on enabling/disabling counters and the impact on performance.
    • Include performance considerations and best practices in the README and API documentation.
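
The compile-time enable/disable option could be as simple as a preprocessor switch. The sketch below assumes a hypothetical `PERF_COUNTERS_ENABLED` flag and a hypothetical `PERF_COUNT_EVENT()` macro; neither name comes from an existing codebase.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical build flag; would normally be set by the build system,
// e.g. -DPERF_COUNTERS_ENABLED=0 to compile the counters out.
#ifndef PERF_COUNTERS_ENABLED
#define PERF_COUNTERS_ENABLED 1
#endif

static std::uint64_t g_event_count = 0;

#if PERF_COUNTERS_ENABLED
// Enabled: each call site increments the software counter.
#define PERF_COUNT_EVENT() (++g_event_count)
#else
// Disabled: the call site expands to a no-op, so the counter adds no code.
#define PERF_COUNT_EVENT() ((void)0)
#endif

// Example workload with an instrumented loop body.
int process_tiles(int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        PERF_COUNT_EVENT();
        acc += i;
    }
    return acc;
}
```

With the flag set to 0 the instrumented function produces the same result while the counting code is removed entirely, which is the property the overhead measurements in this milestone would need to confirm.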

fyi @ttmtrajkovic

miacim added the "performance monitoring" and "perf" labels on Sep 19, 2024
miacim self-assigned this on Sep 19, 2024