Context & Scope of Lecture
- Course: CMPSC 311 – Introduction to Systems Programming
- Topic focus: Introduction to Profiling & Optimization
- Lecturer: Prof. Suman Saha (slides adapted from Profs. Patrick McDaniel and Abutalib Aghayev)
- Primary goals of the lecture
- Understand why real-world programs under-perform.
- Learn systematic, measurement-based approaches to optimization.
- Master the basic profiling tool gprof.
- Apply Amdahl’s Law to reason about attainable speedups.
- Programs "run only as well as the code you write".
- Two broad failure modes of non-ideal code:
  • Correctness: crashes, wrong output, undefined behaviour.
  • Performance: laggy UI, jitter, sluggish data processing, poor interactivity.
- Insight: Good systems programming is not just about making it work but making it work fast enough for realistic workloads.
Optimization: Definition & Professional Relevance
- Optimization – post-development phase where code is altered without changing externally observable behaviour, in order to remove inefficiencies.
  • Techniques include: algorithm substitution, control-flow restructuring, data-structure redesign, refactoring for locality or concurrency.
- Career notes (industry realities):
  1. Mastering optimization distinguishes novice from professional.
  2. School projects rarely include an explicit "polish & optimize" phase.
  3. A large fraction of professional life: inheriting existing (often messy) codebases and accelerating them without altering semantics.
Premature Optimization – Two "Rules"
- "Don’t do it." (absolutely avoid before correctness/clarity)
- "Don’t do it yet." (even experts must first obtain empirical evidence)
- Motto: Measure twice, optimize once.
- Rationale: Making a correct program fast is easier than making a fast program correct; careless speed hacks embed subtle bugs.
Illustrative Inefficiency Example (short C program)
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>

/* O(a) multiplication: repeated addition, one loop iteration per unit of a */
uint64_t prod_one(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (uint64_t i = 0; i < a; ++i)
        out += b;
    return out;
}

/* O(1) multiplication: a single hardware multiply */
uint64_t prod_two(uint64_t a, uint64_t b) {
    return a * b;
}

int main(void) {
    uint64_t a = 10000000000ULL, b = 100000000ULL, sum = 0;
    clock_t start;
    double elapsed;

    start = clock();
    sum = prod_one(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);

    start = clock();
    sum = prod_two(a, b);
    elapsed = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("%" PRIu64 " * %" PRIu64 " = %" PRIu64 " (took %.3f seconds)\n",
           a, b, sum, elapsed);
    return 0;
}
- Compile-time experiments: build with gcc -O0, -O1, and -O2 to witness compiler auto-optimization (at higher levels, GCC may even collapse prod_one's loop into a single multiply).
- Teaches two lessons:
• Gross algorithmic inefficiencies can dwarf low-level tweaks.
• The compiler’s optimizer already applies many micro-optimizations; higher-level algorithmic refactor often yields orders-of-magnitude gains.
Profiling: Purpose & Conceptual Model
- Debugging pinpoints incorrectness; profiling pinpoints inefficiency.
- Mechanism: Execute an instrumented binary that periodically samples or counts events → yields timing statistics per code region (a toy sampling sketch follows at the end of this section).
- Central question: Where does the program spend its time?
- gprof: GNU profiler producing non-interactive textual reports.
- Measures: total run time, per-function self time, call relationships, frequency counts.
- Usage workflow:
  1. Compile with profiling instrumentation: gcc -pg profiling.c -o profiling
  2. Run the instrumented program, which generates gmon.out: ./profiling
  3. Analyze: gprof profiling | less
  4. Inspect the flat profile + call graph.
  5. Optimize hotspots, re-compile with -pg, and jump back to step 2 (iterate).
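The "periodically samples" mechanism above can be made concrete with a toy CPU-time sampler. This is a minimal, hypothetical sketch using a SIGPROF interval timer, not how gprof is actually implemented (gprof pairs sampling with compiler-inserted call counting via -pg):
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Toy sampler (illustrative only): a SIGPROF timer fires periodically
 * while the program burns CPU time, and the handler counts ticks. */
static volatile sig_atomic_t ticks = 0;

static void on_prof(int sig) { (void)sig; ticks++; }

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_prof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* 10 ms sampling interval, measured in CPU time like a profiler's clock */
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile uint64_t out = 0;
    for (uint64_t i = 0; i < 500000000ULL; ++i)  /* busy loop = "hot spot" */
        out += 1;

    printf("samples observed: %d (each ~10 ms of CPU time)\n", (int)ticks);
    return 0;
}
A real profiler would additionally record the interrupted program counter on each tick and map it back to a function; that per-function tally is what the flat profile below reports.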
Flat Profile Example
  %        cumulative    self                self
 time       seconds     seconds    calls    s/call   name
101.31        21.87       21.87        1     21.87   prod_one
  0.00        21.87        0.00        1      0.00   prod_two
- Column meanings:
  • % time – fraction of total run time spent inside the function (sampling granularity can push it slightly above 100 %, as with prod_one here).
  • cumulative seconds – running total of self time up to and including this line.
  • self seconds – exclusive time for this function (the sort key).
  • calls & s/call – invocation count and average self-time per call.
Call-Graph Fragment
index   % time    self   children   called   name
[1]      100.0   21.87       0.00        1   prod_one [1]
-----------------------------------------------
- Call chain recorded by the graph: <spontaneous> → main → prod_one.
- Visualises calling relationships & splits each entry’s time between self and children portions.
Optimization Strategy Guided by Profile Data
- Spend effort where it pays: modules with maximal measured run-time proportion.
- Example pie chart in slide: Module A 45 %, B 20 %, C 20 %, D 10 %, E 5 %.
→ Work on A first; even infinite speedups on E cannot exceed 5 % global benefit.
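To make that bound concrete (previewing the formula formalised in the next section): even if Module E's 5 % share were reduced to zero, the remaining 95 % still runs, so the best attainable overall speedup is
T_{\max} = \frac{1}{1 - 0.05} \approx 1.053
i.e. barely a 5 % improvement, no matter how hard E is optimized.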
Amdahl’s Law – Theoretical Limit on Speedup
- Purpose: Quantitatively bound overall improvement from accelerating only a subset.
- Parameters:
• k – fraction of total execution time originally spent in the part(s) you will speed up.
  • s – speedup factor achieved for that part: s = \frac{t_{\text{old}}}{t_{\text{new}}}.
- Overall program speedup T:
T = \frac{1}{(1 - k) + \frac{k}{s}}
- Intuition:
• (1 - k) stays unchanged.
• \frac{k}{s} is the scaled contribution of the optimized region.
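The formula is easy to sanity-check numerically. Below is a minimal sketch (not from the slides; the helper name amdahl is my own) that evaluates the law and shows the 1/(1-k) ceiling as s grows:
#include <stdio.h>

/* Hypothetical helper: overall speedup predicted by Amdahl's Law
 * for fraction k of run time sped up by a local factor s. */
static double amdahl(double k, double s) {
    return 1.0 / ((1.0 - k) + k / s);
}

int main(void) {
    /* k = 0.45, s = 15 reproduces Worked Example 1 below (~1.724x). */
    printf("k=0.45, s=15  -> %.3fx\n", amdahl(0.45, 15.0));
    /* As s -> infinity, the speedup is capped at 1/(1-k) ~= 1.818x here. */
    printf("k=0.45, s=1e9 -> %.3fx\n", amdahl(0.45, 1e9));
    return 0;
}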
Worked Example 1 (given in slides)
- Module A = 45 % of run time; improved from 750\,\text{ms} \rightarrow 50\,\text{ms}.
- k = 0.45, s = 750 / 50 = 15.
- Speedup:
  T = \frac{1}{0.55 + \frac{0.45}{15}} = \frac{1}{0.55 + 0.03} = \frac{1}{0.58} \approx 1.724
- Interpretation: Even a 15× local boost yields < 2× global acceleration, because the unoptimized 55 % still dominates.
Worked Example 2 – Parallelization with Overhead
- A legacy app is single-threaded. Profiling shows 70 % of its run time can be parallelized; parallelization adds an absolute overhead of 5 % of the original run time.
- Case A: run on 5 cores.
• k = 0.70,\; s = 5,\; \text{overhead} = 0.05
• New time fraction:
T_{\text{new}} = (1-k) + \text{overhead} + \frac{k}{s} = 0.30 + 0.05 + \frac{0.70}{5} = 0.30 + 0.05 + 0.14 = 0.49
  • Overall speedup: \frac{1}{0.49} \approx 2.041\times.
- Case B: run on 10 cores.
  • T_{\text{new}} = 0.30 + 0.05 + \frac{0.70}{10} = 0.30 + 0.05 + 0.07 = 0.42
  • Speedup: \frac{1}{0.42} \approx 2.381\times.
- Lesson: Diminishing returns; the overhead and the sequential 30 % form a hard ceiling.
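Mirroring the arithmetic above, the earlier sketch extends naturally to an absolute overhead term (again a hypothetical helper, not from the slides):
#include <stdio.h>

/* Hypothetical helper: Amdahl's Law with an absolute overhead fraction o
 * added to the normalized new run time: T_new = (1 - k) + o + k/s. */
static double amdahl_overhead(double k, double s, double o) {
    return 1.0 / ((1.0 - k) + o + k / s);
}

int main(void) {
    printf("5 cores : %.3fx\n", amdahl_overhead(0.70,  5.0, 0.05));  /* ~2.041x */
    printf("10 cores: %.3fx\n", amdahl_overhead(0.70, 10.0, 0.05));  /* ~2.381x */
    /* Ceiling as core count grows: 1 / (0.30 + 0.05) ~= 2.857x */
    printf("limit   : %.3fx\n", amdahl_overhead(0.70, 1e9,  0.05));
    return 0;
}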
Practical & Ethical Implications
- Maintainability vs. Speed: Aggressive micro-optimizations can obfuscate code; document them carefully and prefer readability unless performance is truly critical.
- Energy & Cost: Faster code often consumes less CPU time → lower power bills & environmental impact.
- User Experience: Reactive interfaces drive adoption; optimizing latency can be ethically important for accessibility.
Summary Checklist for Effective Optimization
- [ ] Verify correctness first; write exhaustive tests.
- [ ] Instrument with -pg (or another profiler, e.g., perf or valgrind --tool=callgrind).
- [ ] Locate top-k hotspots via flat profile and call graph.
- [ ] Evaluate algorithmic alternatives before low-level tweaks.
- [ ] Use Amdahl’s Law to estimate max gains; avoid fruitless labour.
- [ ] Iterate: measure → change → re-measure.
- [ ] Keep clean, readable, and well-commented code; future you (or teammates) will revisit it.