
CMPSC 311 – Notes on Profiling, gprof, and Amdahl’s Law

Context & Scope of Lecture

  • Course: CMPSC 311 – Introduction to Systems Programming
  • Topic focus: Introduction to Profiling & Optimization
  • Lecturer: Prof. Suman Saha (slides adapted from Profs. Patrick McDaniel and Abutalib Aghayev)
  • Primary goals of the lecture
    • Understand why real-world programs under-perform.
    • Learn systematic, measurement-based approaches to optimization.
    • Master the basic profiling tool gprof.
    • Apply Amdahl’s Law to reason about attainable speedups.

Program Performance Problems

  • Programs "run only as well as the code you write".
  • Two broad failure modes of non-ideal code:
    • Correctness: crashes, wrong output, undefined behaviour.
    • Performance: laggy UI, jitter, sluggish data processing, poor interactivity.
  • Insight: Good systems programming is not just about making it work but making it work fast enough for realistic workloads.

Optimization: Definition & Professional Relevance

  • Optimization – post-development phase where code is altered without changing externally observable behaviour, in order to remove inefficiencies.
    • Techniques include: algorithm substitution, control-flow restructuring, data-structure redesign, refactoring for locality or concurrency.
  • Career notes (industry realities):
    • 1. Mastering optimization distinguishes novice from professional.
    • 2. School projects rarely include an explicit "polish & optimize" phase.
    • 3. A large fraction of professional life: inheriting existing (often messy) codebases and accelerating them without altering semantics.

Premature Optimization – Two "Rules"

  1. "Don’t do it." (absolutely avoid before correctness/clarity)
  2. "Don’t do it yet." (even experts must first obtain empirical evidence)
  • Motto: Measure twice, optimize once.
  • Rationale: Making a correct program fast is easier than making a fast program correct; careless speed hacks embed subtle bugs.

Illustrative Inefficiency Example (complete C program)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Naive product: repeated addition, O(a) time. */
uint64_t prod_one(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (uint64_t i = 0; i < a; ++i)
        out += b;
    return out;
}

/* Direct product: a single multiply, O(1) time. */
uint64_t prod_two(uint64_t a, uint64_t b) {
    return a * b;
}

int main(void) {
    uint64_t a = 10000000000ULL, b = 100000000ULL, sum = 0;
    clock_t start;
    double elapsed;

    start = clock();
    sum = prod_one(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);

    start = clock();
    sum = prod_two(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);
    return 0;
}
  • Compile-time experiments: build with gcc -O0, -O1, and -O2 to witness compiler auto-optimization (at higher optimization levels GCC can even collapse prod_one's repeated-addition loop into a single multiply).
  • Teaches two lessons:
    • Gross algorithmic inefficiencies can dwarf low-level tweaks.
    • The compiler's optimizer already applies many micro-optimizations; a higher-level algorithmic refactor often yields orders-of-magnitude gains.

Profiling: Purpose & Conceptual Model

  • Debugging pinpoints incorrectness; profiling pinpoints inefficiency.
  • Mechanism: Execute an instrumented binary that periodically samples or counts events → yields timing statistics per code region.
  • Central question: Where does the program spend its time?
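  • To make the mechanism concrete, here is a minimal, illustrative C sketch of statistical sampling (not gprof's actual implementation): a periodic SIGPROF timer interrupts the running program, and each tick counts as one sample. A real profiler would additionally record which function was interrupted.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks = 0;

/* Each SIGPROF delivery is one "sample" of consumed CPU time. */
static void on_profile_tick(int sig) {
    (void)sig;
    ++ticks;
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_profile_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 10 ms of CPU time, like a profiler's sampling clock. */
    struct itimerval it = { .it_interval = {0, 10000}, .it_value = {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile uint64_t x = 0;                      /* busy work to sample */
    for (uint64_t i = 0; i < 400000000ULL; ++i)
        x += i;

    printf("collected %d samples (~%.2f s of CPU time)\n", (int)ticks, ticks * 0.010);
    return 0;
}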

The gprof Toolchain

  • gprof: GNU profiler producing non-interactive textual reports.
    • Measures: total run time, per-function self time, call relationships, frequency counts.
  • Usage workflow:
    1. Compile with profiling instrumentation:
      gcc -pg profiling.c -o profiling
    2. Run instrumented program → generates gmon.out.
      ./profiling
    3. Analyze: gprof profiling | less
    4. Inspect flat profile + call graph.
    5. Optimize hotspots, re-compile with -pg, jump to step 2. (iterate)

Flat Profile Example

  %    cumulative    self
 time    seconds    seconds    calls   s/call   name
101.31      21.87     21.87        1    21.87   prod_one
  0.00      21.87      0.00        1     0.00   prod_two
  • Column meanings:
    % time – fraction of total run time spent inside the function (sampling granularity can push the figure slightly above 100 %, as here).
    cumulative seconds – running total of self time up to and including this line.
    self seconds – exclusive time spent in this function (the sort key).
    calls & s/call – invocation count and average self time per call.

Call-Graph Fragment

index  % time    self   children   called   name
[1]     100.0   21.87       0.00        1   prod_one [1]
-----------------------------------------------
  (call chain: <spontaneous> → main → prod_one)
  • Visualises calling relationships & splits time between self and children portions.

Optimization Strategy Guided by Profile Data

  • Spend effort where it pays: modules with maximal measured run-time proportion.
  • Example pie chart in slide: Module A 45 %, B 20 %, C 20 %, D 10 %, E 5 %.
    → Work on A first; even infinite speedups on E cannot exceed 5 % global benefit.
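    → Upper bound via Amdahl's Law (next section): even an infinite speedup of module E (s \to \infty, k = 0.05) caps the overall gain at \frac{1}{1-k} = \frac{1}{0.95} \approx 1.053\times.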

Amdahl’s Law – Theoretical Limit on Speedup

  • Purpose: Quantitatively bound overall improvement from accelerating only a subset.
  • Parameters:
    • k – fraction of total execution time originally spent in the part(s) you will speed up.
    • s – speedup factor achieved for that part: s = \frac{t_{\text{old}}}{t_{\text{new}}}.
  • Overall program speedup T: T = \frac{1}{(1 - k) + \frac{k}{s}}
    • Intuition:
      • (1 - k) stays unchanged.
      • \frac{k}{s} is the scaled contribution of the optimized region.
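  • A quick sanity check in code: the sketch below evaluates the formula directly (the helper name amdahl_speedup is illustrative, not from the slides).

#include <stdio.h>

/* Overall speedup T = 1 / ((1 - k) + k/s), per Amdahl's Law.
   k: fraction of run time being sped up; s: local speedup factor.
   (amdahl_speedup is a hypothetical name used for illustration.) */
static double amdahl_speedup(double k, double s) {
    return 1.0 / ((1.0 - k) + k / s);
}

int main(void) {
    /* Worked Example 1 below: k = 0.45, s = 15 -> prints T = 1.724 */
    printf("T = %.3f\n", amdahl_speedup(0.45, 15.0));
    return 0;
}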

Worked Example 1 (given in slides)

  • Module A = 45 % of run time; improved from 750\,\text{ms} \rightarrow 50\,\text{ms}.
  • k = 0.45, s = 750 / 50 = 15.
  • Speedup:
    T = \frac{1}{0.55 + \frac{0.45}{15}} = \frac{1}{0.55 + 0.03} = \frac{1}{0.58} \approx 1.724.
  • Interpretation: Even a 15× local boost yields <2× global acceleration due to the unoptimized 55 %.

Worked Example 2 – Parallelization with Overhead

  • Legacy app single-threaded. Profile shows 70 % can be parallelized. Overhead = 5 % absolute.
  • Case A: run on 5 cores.
    • k = 0.70,\; s = 5,\; \text{overhead} = 0.05
    • New time fraction:
    T_{\text{new}} = (1-k) + \text{overhead} + \frac{k}{s} = 0.30 + 0.05 + \frac{0.70}{5} = 0.30 + 0.05 + 0.14 = 0.49
    • Overall speedup: \frac{1}{0.49} \approx 2.041\times.
  • Case B: run on 10 cores.
    • T_{\text{new}} = 0.30 + 0.05 + \frac{0.70}{10} = 0.30 + 0.05 + 0.07 = 0.42
    • Speedup: \approx 2.381\times.
  • Lesson: Diminishing returns; the sequential 30 % plus the 5 % overhead form a hard ceiling of \frac{1}{0.30 + 0.05} \approx 2.857\times, no matter how many cores are added.
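  • The same check extends to the overhead variant (again a minimal sketch; amdahl_with_overhead is a hypothetical name, and overhead is modeled as an absolute fraction added to the new run time, as in the slides):

#include <stdio.h>

/* New run-time fraction: (1 - k) + overhead + k/s; the return value
   is the overall speedup. (Hypothetical helper, for illustration.) */
static double amdahl_with_overhead(double k, double s, double overhead) {
    return 1.0 / ((1.0 - k) + overhead + k / s);
}

int main(void) {
    printf("5 cores:  %.3fx\n", amdahl_with_overhead(0.70,  5.0, 0.05));  /* ~2.041x */
    printf("10 cores: %.3fx\n", amdahl_with_overhead(0.70, 10.0, 0.05));  /* ~2.381x */
    return 0;
}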

Practical & Ethical Implications

  • Maintainability vs. Speed: Aggressive micro-optimizations can obfuscate code; always document and prefer readability unless perf critical.
  • Energy & Cost: Faster code often consumes less CPU time → lower power bills & environmental impact.
  • User Experience: Responsive interfaces drive adoption; optimizing latency can be ethically important for accessibility.

Summary Checklist for Effective Optimization

  • [ ] Verify correctness first; write exhaustive tests.
  • [ ] Instrument with -pg (or other profiler, e.g., perf, valgrind --tool=callgrind).
  • [ ] Locate top-k hotspots via flat profile and call graph.
  • [ ] Evaluate algorithmic alternatives before low-level tweaks.
  • [ ] Use Amdahl’s Law to estimate max gains; avoid fruitless labour.
  • [ ] Iterate: measure → change → re-measure.
  • [ ] Keep clean, readable, and well-commented code; future you (or teammates) will revisit it.