3.1.2-Tools
Performance Analysis: Counters and Timers
1. Performance Counters: PAPI
Performance counters are hardware registers in the processor that count events relevant to performance analysis. Raw counts, and the metrics derived from them, include:
Number of cache misses
Cache hit ratio for locality insight
Ratio of floating point instructions for assessing floating point intensity
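The two ratios in this list are not read from the hardware directly; they are computed from raw event counts. A minimal sketch of that arithmetic, assuming the counts have already been collected. The function names are hypothetical and are not part of PAPI:

```c
#include <assert.h>

/* Hypothetical helper: fraction of cache accesses that were hits.
   Returns 0.0 when no accesses were counted, to avoid dividing by zero. */
double cache_hit_ratio(long long accesses, long long misses) {
    if (accesses <= 0)
        return 0.0;
    assert(misses >= 0 && misses <= accesses);
    return (double)(accesses - misses) / (double)accesses;
}

/* Hypothetical helper: fraction of all retired instructions that were
   floating point instructions. Returns 0.0 when no instructions were counted. */
double fp_instruction_ratio(long long fp_ins, long long total_ins) {
    if (total_ins <= 0)
        return 0.0;
    assert(fp_ins >= 0 && fp_ins <= total_ins);
    return (double)fp_ins / (double)total_ins;
}
```

For instance, 1000 cache accesses with 50 misses give a hit ratio of about 0.95, and 250 floating point instructions out of 1000 give a floating point ratio of 0.25.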
Challenge:
Each processor family exposes its counters through a different mechanism, so code written against one processor's counters does not carry over to another.
Solution:
PAPI (Performance Application Programming Interface) standardizes access to performance counters across various processors. It references events by name and organizes them into EventSets for systematic sampling. PAPI enables multiplexing events when counters are limited and supports statistical sampling using both software and hardware methods.
Reference for Further Reading: PAPI Documentation
2. How to Use PAPI
Initialization: Use PAPI_library_init(PAPI_VER_CURRENT) to initialize.
Create Event Set: Use PAPI_create_eventset(&EventSet).
Adding Events: Add events with PAPI_add_event(EventSet, PAPI_TOT_INS).
Counter Operations:
Start counting with PAPI_start(EventSet).
Read values with PAPI_read(EventSet, values).
Stop counting with PAPI_stop(EventSet, values).
Always check function return values for success.
3. Example Program in C Using PAPI
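#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    /* Count L2 total cache misses and total instructions. */
    int events[2] = {PAPI_L2_TCM, PAPI_TOT_INS}, ret;
    long long values[2];

    /* On success, PAPI_library_init returns the version it was given. */
    ret = PAPI_library_init(PAPI_VER_CURRENT);
    if (ret != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }
    /* PAPI_start_counters and PAPI_read_counters belong to PAPI's classic
       high-level API, which was dropped in PAPI 6.0, so this example needs
       an older release. The calls in section 2 are the low-level API. */
    if ((ret = PAPI_start_counters(events, 2)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to start counters: %s\n", PAPI_strerror(ret));
        exit(1);
    }
    /* … computation… */
    if ((ret = PAPI_read_counters(values, 2)) != PAPI_OK) {
        fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret));
        exit(1);
    }
    /* values[] holds the counts in the same order as events[]. */
    printf("L2 misses: %lld, instructions: %lld\n", values[0], values[1]);
    return 0;
}
4. Timers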
Timers are critical for performance measurement. Consider the following:
Select timers that minimize overhead while ensuring high resolution.
Be aware that clock ticks and real time can differ due to turbo boost and Dynamic Voltage and Frequency Scaling (DVFS).
Test and evaluate timer overhead and resolution by invoking timers in a loop repeatedly.
5. Example Code for Timing Using gettimeofday
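#include <stdio.h>
#include <sys/time.h>

#define N 1000000   /* number of timer calls; an illustrative value */

static double times[N];

/* Return the current wall-clock time in seconds. */
double get_clock(void) {
    struct timeval tv;
    int ok;
    ok = gettimeofday(&tv, NULL);
    if (ok < 0) {
        fprintf(stderr, "gettimeofday error\n");
    }
    return (tv.tv_sec * 1.0 + tv.tv_usec * 1.0E-6);
}

int main(void) {
    double t0, t1;
    int i;
    t0 = get_clock();
    for (i = 0; i < N; i++)
        times[i] = get_clock();
    t1 = get_clock();
    printf("time per call: %f ns\n", (1000000000.0 * (t1 - t0) / N));
    return 0;
}
The loop calls get_clock() N times and divides the total elapsed time by N, which gives the average overhead of one timer call in nanoseconds. The timestamps stored in times[] can also be inspected to estimate the timer's resolution, keeping in mind that gettimeofday reports microseconds at best.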
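The same loop technique can estimate resolution as well as overhead. The sketch below is an illustration rather than part of the original example: it swaps gettimeofday for the POSIX call clock_gettime with CLOCK_MONOTONIC, which reports nanoseconds, and all function names are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Current monotonic time in integer nanoseconds. */
long long now_ns(void) {
    struct timespec ts;
    int ok = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(ok == 0);
    (void)ok;
    return (long long)ts.tv_sec * 1000000000LL + (long long)ts.tv_nsec;
}

/* Average cost of one timer call in nanoseconds, over n back-to-back calls. */
double timer_overhead_ns(int n) {
    long long t0 = now_ns();
    for (int i = 0; i < n; i++)
        (void)now_ns();
    long long t1 = now_ns();
    return (double)(t1 - t0) / (double)n;
}

/* Smallest nonzero step seen between consecutive readings, in nanoseconds.
   Returns 0 if the clock never advanced during the n readings. */
long long timer_resolution_ns(int n) {
    long long best = 0;
    long long prev = now_ns();
    for (int i = 0; i < n; i++) {
        long long cur = now_ns();
        long long step = cur - prev;
        if (step > 0 && (best == 0 || step < best))
            best = step;
        prev = cur;
    }
    return best;
}
```

The smallest observed step is only an upper bound on the true resolution, because it can never be smaller than the cost of the timer call itself. Comparing it with the result of timer_overhead_ns shows which of the two limits a given measurement.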