3.1.2-Tools

Performance Analysis: Counters and Timers

1. Performance Counters: PAPI

Performance counters are processor features that record hardware events useful for performance analysis. Raw counts, and metrics derived from them, include:

  • Number of cache misses

  • Cache hit ratio for locality insight

  • Ratio of floating point instructions for assessing floating point intensity
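Ratios like these are not read from the hardware directly; they are computed from pairs of raw counts. A minimal sketch, assuming the counts have already been collected (the function names are hypothetical helpers for illustration, not part of any library):

```c
#include <assert.h>

/* Fraction of cache accesses that hit, given raw access and miss counts.
   Hypothetical helper for illustration only. */
double cache_hit_ratio(long long accesses, long long misses) {
    assert(accesses > 0 && misses >= 0 && misses <= accesses);
    return (double)(accesses - misses) / (double)accesses;
}

/* Fraction of all executed instructions that are floating point.
   Hypothetical helper for illustration only. */
double fp_intensity(long long fp_instructions, long long total_instructions) {
    assert(total_instructions > 0 && fp_instructions >= 0);
    return (double)fp_instructions / (double)total_instructions;
}
```

For example, 1,000,000 accesses with 50,000 misses give a hit ratio of 0.95.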

Challenge:

Each processor family exposes its counters through a different, vendor-specific mechanism, so code written directly against one of them is not portable.

Solution:

PAPI (Performance Application Programming Interface) standardizes access to performance counters across various processors. It references events by name and organizes them into EventSets for systematic sampling. PAPI enables multiplexing events when counters are limited and supports statistical sampling using both software and hardware methods.

Reference for Further Reading: PAPI Documentation

2. How to Use PAPI

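  • Initialization: Use PAPI_library_init(PAPI_VER_CURRENT) to initialize; on success it returns PAPI_VER_CURRENT.

  • Create Event Set: Use PAPI_create_eventset(&EventSet), with EventSet set to PAPI_NULL beforehand.

  • Adding Events: Add events with PAPI_add_event(EventSet, PAPI_TOT_INS).

  • Counter Operations:

    • Start counting with PAPI_start(EventSet).

    • Read values with PAPI_read(EventSet, values), where values is an array of long long with one entry per event.

    • Stop counting with PAPI_stop(EventSet, values).

Always check function return values for success: apart from PAPI_library_init, these calls return PAPI_OK when they succeed.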

3. Example Program in C Using PAPI

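#include <stdio.h>  
#include <stdlib.h>  
#include <papi.h>  

/* Note: PAPI_start_counters and PAPI_read_counters belong to the classic  
   high-level API, which was removed in PAPI 6.0. With newer releases, use  
   the EventSet calls listed under "How to Use PAPI" instead. */  
int main(void) {  
  /* Events: L2 total cache misses and total instructions completed. */  
  int events[2] = {PAPI_L2_TCM, PAPI_TOT_INS}, ret;  
  long long values[2];  
  ret = PAPI_library_init(PAPI_VER_CURRENT);  
  if (ret != PAPI_VER_CURRENT) {  
    fprintf(stderr, "PAPI library init error!\n");  
    exit(1);  
  }  
  if ((ret = PAPI_start_counters(events, 2)) != PAPI_OK) {  
    fprintf(stderr, "PAPI failed to start counters: %s\n", PAPI_strerror(ret));  
    exit(1);  
  }  
  /* … computation… */  
  if ((ret = PAPI_read_counters(values, 2)) != PAPI_OK) {  
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret));  
    exit(1);  
  }  
  printf("L2 cache misses: %lld, instructions: %lld\n", values[0], values[1]);  
  return 0;  
}  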

4. Timers

Timers are critical for performance measurement. Consider the following:

  • Select timers that minimize overhead while ensuring high resolution.

  • Be aware that cycle counts and elapsed real time are not related by a fixed factor: turbo boost and Dynamic Voltage and Frequency Scaling (DVFS) change the clock frequency while the program runs.

  • Test and evaluate timer overhead and resolution by invoking timers in a loop repeatedly.
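The points above can be sketched with the POSIX clock_gettime call. This is only one possible timer choice, picked here as an assumption because it reports nanoseconds; the helper names (now_ns, timer_overhead_ns, reported_resolution_ns) are made up for this illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime in strict C modes */
#endif
#include <assert.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds, or -1 on failure. */
static long long now_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Average cost of one timer call, estimated by timing n back-to-back calls. */
double timer_overhead_ns(int n) {
    assert(n > 0);
    long long t0 = now_ns();
    for (int i = 0; i < n; i++)
        (void)now_ns();
    long long t1 = now_ns();
    return (double)(t1 - t0) / n;
}

/* Resolution the system reports for CLOCK_MONOTONIC, in nanoseconds,
   or -1 on failure. The effective resolution may be coarser. */
long long reported_resolution_ns(void) {
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

A measured interval is only trustworthy when it is much longer than both numbers, so very short code regions are usually timed over many repetitions.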

5. Example Code for Timing Using gettimeofday

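#include <stdio.h>  
#include <sys/time.h>  

#define N 1000000  /* number of timer calls; value chosen for illustration */  

static double times[N];  /* static: too large for the stack */  

/* Wall-clock time in seconds, with microsecond granularity. */  
double get_clock(void) {  
  struct timeval tv;  
  int ok;  
  ok = gettimeofday(&tv, NULL);  
  if (ok < 0) {  
    fprintf(stderr, "gettimeofday error\n");  
  }  
  return (tv.tv_sec * 1.0 + tv.tv_usec * 1.0E-6);  
}

int main(void) {  
  double t0, t1;  
  int i;  
  t0 = get_clock();  
  for (i = 0; i < N; i++)  
    times[i] = get_clock();  
  t1 = get_clock();  
  printf("time per call: %f ns\n", (1000000000.0 * (t1 - t0) / N));  
  return 0;  
}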

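The loop calls get_clock() N times back to back and divides the elapsed time by N, which gives the average cost of one timer call in nanoseconds. Because gettimeofday reports microseconds, its resolution cannot be finer than 1 µs; the readings stored in times[] show the smallest step that was actually observed.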
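The readings stored in times[] can also be used to estimate the timer's effective resolution. A minimal sketch, assuming the samples are in increasing order as produced by the loop above; min_positive_step is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest positive difference between consecutive samples, in the same
   unit as the samples. Returns 0.0 if no sample differs from its
   predecessor. Hypothetical helper for illustration only. */
double min_positive_step(const double *times, size_t n) {
    assert(times != NULL);
    double best = 0.0;
    for (size_t i = 1; i < n; i++) {
        double d = times[i] - times[i - 1];
        if (d > 0.0 && (best == 0.0 || d < best))
            best = d;
    }
    return best;
}
```

Calling min_positive_step(times, N) after the timing loop returns the step in seconds. A result well above the average time per call suggests that several consecutive calls returned the same reading, so the timer is coarser than its call overhead.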