
CMPSC 311 – Notes on Profiling, gprof, and Amdahl’s Law

Context & Scope of Lecture

  • Course: CMPSC 311 – Introduction to Systems Programming
  • Topic focus: Introduction to Profiling & Optimization
  • Lecturer: Prof. Suman Saha (slides adapted from Profs. Patrick McDaniel and Abutalib Aghayev)
  • Primary goals of the lecture
    • Understand why real-world programs under-perform.
    • Learn systematic, measurement-based approaches to optimization.
    • Master the basic profiling tool gprof.
    • Apply Amdahl’s Law to reason about attainable speedups.

Program Performance Problems

  • Programs "run only as well as the code you write".
  • Two broad failure modes of non-ideal code:
    • Correctness: crashes, wrong output, undefined behaviour.
    • Performance: laggy UI, jitter, sluggish data processing, poor interactivity.
  • Insight: Good systems programming is not just about making it work but making it work fast enough for realistic workloads.

Optimization: Definition & Professional Relevance

  • Optimization – post-development phase where code is altered without changing externally observable behaviour, in order to remove inefficiencies.
    • Techniques include: algorithm substitution, control-flow restructuring, data-structure redesign, refactoring for locality or concurrency.
  • Career notes (industry realities):
    • 1. Mastering optimization distinguishes novice from professional.
    • 2. School projects rarely include an explicit "polish & optimize" phase.
    • 3. A large fraction of professional life: inheriting existing (often messy) codebases and accelerating them without altering semantics.

Premature Optimization – Two "Rules"

  1. "Don’t do it." (absolutely avoid before correctness/clarity)
  2. "Don’t do it yet." (even experts must first obtain empirical evidence)
  • Motto: Measure twice, optimize once.
  • Rationale: Making a correct program fast is easier than making a fast program correct; careless speed hacks embed subtle bugs.

Illustrative Inefficiency Example (complete C program)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Naive product: repeated addition, O(a) time. */
uint64_t prod_one(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (uint64_t i = 0; i < a; ++i)
        out += b;
    return out;
}

/* Direct product: a single multiply, O(1) time. */
uint64_t prod_two(uint64_t a, uint64_t b) {
    return a * b;
}

int main(void) {
    uint64_t a = 10000000000ULL, b = 100000000ULL, sum = 0;
    clock_t start;
    double elapsed;

    start = clock();
    sum = prod_one(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);

    start = clock();
    sum = prod_two(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);
    return 0;
}
  • Compile-time experiments: build with gcc -O0, -O1, and -O2 to witness compiler auto-optimization (at higher optimization levels GCC can even collapse prod_one's repeated-addition loop into a single multiply).
  • Teaches two lessons:
    • Gross algorithmic inefficiencies can dwarf low-level tweaks.
    • The compiler's optimizer already applies many micro-optimizations; a higher-level algorithmic refactor often yields orders-of-magnitude gains.

Profiling: Purpose & Conceptual Model

  • Debugging pinpoints incorrectness; profiling pinpoints inefficiency.
  • Mechanism: Execute an instrumented binary that periodically samples or counts events → yields timing statistics per code region.
  • Central question: Where does the program spend its time?
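  • To make the mechanism concrete, here is a minimal, illustrative C sketch of statistical sampling (not gprof's actual implementation): a periodic SIGPROF timer interrupts the running program, and each tick counts as one sample. A real profiler would additionally record which function was interrupted.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks = 0;

/* Each SIGPROF delivery is one "sample" of consumed CPU time. */
static void on_profile_tick(int sig) {
    (void)sig;
    ++ticks;
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_profile_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 10 ms of CPU time, like a profiler's sampling clock. */
    struct itimerval it = { .it_interval = {0, 10000}, .it_value = {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile uint64_t x = 0;                      /* busy work to sample */
    for (uint64_t i = 0; i < 400000000ULL; ++i)
        x += i;

    printf("collected %d samples (~%.2f s of CPU time)\n", (int)ticks, ticks * 0.010);
    return 0;
}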

The gprof Toolchain

  • gprof: GNU profiler producing non-interactive textual reports.
    • Measures: total run time, per-function self time, call relationships, frequency counts.
  • Usage workflow:
    1. Compile with profiling instrumentation:
      gcc -pg profiling.c -o profiling
    2. Run instrumented program → generates gmon.out.
      ./profiling
    3. Analyze: gprof profiling | less
    4. Inspect flat profile + call graph.
    5. Optimize hotspots, re-compile with -pg, jump to step 2. (iterate)

Flat Profile Example

  %    cumulative    self
 time    seconds    seconds    calls   s/call   name
101.31      21.87     21.87        1    21.87   prod_one
  0.00      21.87      0.00        1     0.00   prod_two
  • Column meanings:
    % time – fraction of total run time spent inside the function (sampling granularity can push the figure slightly above 100 %, as here).
    cumulative seconds – running total of self time up to and including this line.
    self seconds – exclusive time spent in this function (the sort key).
    calls & s/call – invocation count and average self time per call.

Call-Graph Fragment

index  % time    self   children   called   name
[1]     100.0   21.87       0.00        1   prod_one [1]
-----------------------------------------------
  (call chain: <spontaneous> → main → prod_one)
  • Visualises calling relationships & splits time between self and children portions.

Optimization Strategy Guided by Profile Data

  • Spend effort where it pays: modules with maximal measured run-time proportion.
  • Example pie chart in slide: Module A 45 %, B 20 %, C 20 %, D 10 %, E 5 %.
    → Work on A first; even infinite speedups on E cannot exceed 5 % global benefit.
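    → Upper bound via Amdahl's Law (next section): even an infinite speedup of module E (s \to \infty, k = 0.05) caps the overall gain at \frac{1}{1-k} = \frac{1}{0.95} \approx 1.053\times.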

Amdahl’s Law – Theoretical Limit on Speedup

  • Purpose: Quantitatively bound overall improvement from accelerating only a subset.
  • Parameters:
    • k – fraction of total execution time originally spent in the part(s) you will speed up.
    • s – speedup factor achieved for that part: s = \frac{t_{\text{old}}}{t_{\text{new}}}.
  • Overall program speedup T: T = \frac{1}{(1 - k) + \frac{k}{s}}
    • Intuition:
      • (1 - k) stays unchanged.
      • \frac{k}{s} is the scaled contribution of the optimized region.
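  • A quick sanity check in code: the sketch below evaluates the formula directly (the helper name amdahl_speedup is illustrative, not from the slides).

#include <stdio.h>

/* Overall speedup T = 1 / ((1 - k) + k/s), per Amdahl's Law.
   k: fraction of run time being sped up; s: local speedup factor.
   (amdahl_speedup is a hypothetical name used for illustration.) */
static double amdahl_speedup(double k, double s) {
    return 1.0 / ((1.0 - k) + k / s);
}

int main(void) {
    /* Worked Example 1 below: k = 0.45, s = 15 -> prints T = 1.724 */
    printf("T = %.3f\n", amdahl_speedup(0.45, 15.0));
    return 0;
}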

Worked Example 1 (given in slides)

  • Module A = 45 % of run time; improved from 750\,\text{ms} \rightarrow 50\,\text{ms}.
  • k = 0.45, s = 750 / 50 = 15.
  • Speedup:
    T = \frac{1}{0.55 + \frac{0.45}{15}} = \frac{1}{0.55 + 0.03} = \frac{1}{0.58} \approx 1.724.
  • Interpretation: Even a 15× local boost yields <2× global acceleration due to the unoptimized 55 %.

Worked Example 2 – Parallelization with Overhead

  • Legacy app single-threaded. Profile shows 70 % can be parallelized. Overhead = 5 % absolute.
  • Case A: run on 5 cores.
    • k = 0.70,\; s = 5,\; \text{overhead} = 0.05
    • New time fraction:
    T_{\text{new}} = (1-k) + \text{overhead} + \frac{k}{s} = 0.30 + 0.05 + \frac{0.70}{5} = 0.30 + 0.05 + 0.14 = 0.49
    • Overall speedup: \frac{1}{0.49} \approx 2.041\times.
  • Case B: run on 10 cores.
    • T_{\text{new}} = 0.30 + 0.05 + \frac{0.70}{10} = 0.30 + 0.05 + 0.07 = 0.42
    • Speedup: \approx 2.381\times.
  • Lesson: Diminishing returns; the sequential 30 % plus the 5 % overhead form a hard ceiling of \frac{1}{0.30 + 0.05} \approx 2.857\times, no matter how many cores are added.
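  • The same check extends to the overhead variant (again a minimal sketch; amdahl_with_overhead is a hypothetical name, and overhead is modeled as an absolute fraction added to the new run time, as in the slides):

#include <stdio.h>

/* New run-time fraction: (1 - k) + overhead + k/s; the return value
   is the overall speedup. (Hypothetical helper, for illustration.) */
static double amdahl_with_overhead(double k, double s, double overhead) {
    return 1.0 / ((1.0 - k) + overhead + k / s);
}

int main(void) {
    printf("5 cores:  %.3fx\n", amdahl_with_overhead(0.70,  5.0, 0.05));  /* ~2.041x */
    printf("10 cores: %.3fx\n", amdahl_with_overhead(0.70, 10.0, 0.05));  /* ~2.381x */
    return 0;
}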

Practical & Ethical Implications

  • Maintainability vs. Speed: Aggressive micro-optimizations can obfuscate code; always document and prefer readability unless perf critical.
  • Energy & Cost: Faster code often consumes less CPU time → lower power bills & environmental impact.
  • User Experience: Responsive interfaces drive adoption; optimizing latency can be ethically important for accessibility.

Summary Checklist for Effective Optimization

  • [ ] Verify correctness first; write exhaustive tests.
  • [ ] Instrument with -pg (or other profiler, e.g., perf, valgrind --tool=callgrind).
  • [ ] Locate top-k hotspots via flat profile and call graph.
  • [ ] Evaluate algorithmic alternatives before low-level tweaks.
  • [ ] Use Amdahl’s Law to estimate max gains; avoid fruitless labour.
  • [ ] Iterate: measure → change → re-measure.
  • [ ] Keep clean, readable, and well-commented code; future you (or teammates) will revisit it.