The fundamental block diagram of the H.264 coding scheme resembles that of earlier standards but incorporates a number of features specific to H.264. Within this framework, inter and intra pictures are the crucial components that facilitate effective encoding and compression.
Inter pictures are created by subtracting a motion-compensated prediction from the original images, thus exploiting the temporal redundancy present in the video data. The resulting residuals, the discrepancies between an image and its prediction, are transformed into the frequency domain to improve coding efficiency. The transform coefficients are then scanned, quantized, and encoded with variable-length codes, compressing the bitstream. A local decoder reconstructs each picture so that it can serve as a reference for future predictions.
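As a rough illustration of this loop, the following sketch uses toy stand-ins for the quantizer (the real pipeline also transforms, scans, and entropy-codes the residual, and all names here are my own, not from the standard). The point is the structure: the encoder embeds a local decoder so that references are built from the same reconstruction the real decoder will produce.

```python
# Schematic sketch of the inter-picture coding loop: predict, take the
# residual, quantize it, then locally decode to form the reference.
# (Toy scalar quantizer; hypothetical names for illustration only.)

def encode_block(current, prediction, qstep=4):
    residual = [c - p for c, p in zip(current, prediction)]
    labels = [int(r / qstep) for r in residual]  # crude quantization
    # Local decoder: dequantize and add the prediction back, exactly
    # as the real decoder will, so future predictions stay in sync.
    reconstruction = [q * qstep + p for q, p in zip(labels, prediction)]
    return labels, reconstruction

labels, recon = encode_block([16, 12, 8, 9], [8, 8, 8, 8])
```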
In contrast, intra pictures are encoded independently, serving as their own reference points without relying on prior images. While the overall structure is akin to previous standards, the techniques and specifications employed in H.264 are notably different. This note will elaborate on the structural elements, decorrelation strategies for inter frames involving motion-compensated prediction and transformation of prediction errors, as well as approaches used for intra frames, including intra prediction modes and their corresponding transforms. Additionally, various binary coding options available in the standard will be examined.
The macroblock structure retains consistency with prior standards and consists of four 8 × 8 luminance blocks alongside two 8 × 8 chrominance blocks. A significant advancement in H.264 is the ability to partition a macroblock into blocks of size 16 × 8, 8 × 16, or 8 × 8, and to further subdivide each 8 × 8 block into sub-macroblock partitions of 8 × 4, 4 × 8, or 4 × 4, enabling motion-compensated prediction that captures finer video detail often missed by earlier methods. In field mode, H.264 groups 16 × 8 blocks from each field to generate 16 × 16 macroblocks, optimizing the organization of picture data.
The motion-compensated prediction in H.264 uses a tree-structured scheme built on these macroblock partitions. A key challenge in motion compensation lies in selecting the appropriate block size and shape for effective prediction. Different sections of a video scene typically exhibit different movement patterns; the ability to use smaller blocks therefore improves the system's capacity to track diverse movements, yielding better prediction accuracy and lower bit rates. However, while smaller blocks improve prediction, they also require encoding and transmitting additional motion vectors, which consumes bits; in many cases the bits spent on motion vectors make up a substantial portion of the overall bitstream.
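To make the vector-count trade-off concrete, here is a small sketch (the names are my own, not from the standard) counting how many motion vectors a uniform partition choice implies for one 16 × 16 macroblock:

```python
# Each partition of a macroblock carries its own motion vector, so
# finer partitions mean more vectors to encode and transmit.
MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]  # within an 8x8 block

def vectors_per_macroblock(width, height):
    """Number of motion vectors if the whole 16x16 macroblock is
    split uniformly into (width x height) partitions."""
    return (16 // width) * (16 // height)

# vectors_per_macroblock(16, 16) -> 1
# vectors_per_macroblock(4, 4)   -> 16
```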
The H.264 algorithm balances these trade-offs by deploying smaller block sizes in dynamic areas and larger sizes in static regions. Prediction accuracy is further improved by quarter-pixel-accuracy motion compensation, in which values between neighboring pixels are obtained by interpolation. A deblocking filter applied to block edges yields cleaner transitions in the reconstructed video. The algorithm can search up to 32 previously coded pictures to find the best matching block for motion compensation, enhancing overall prediction effectiveness. Reference picture selection occurs at the level of the macroblock partition, so all sub-macroblock partitions within a partition derive from the same reference picture. As in the H.263 standard, H.264 encodes motion vectors differentially, predicting the current motion vector from the median of neighboring vectors; the exact prediction rule depends on the partition size in use, such as 16 × 16, 16 × 8, or 8 × 16.
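The median rule for the common case can be sketched as follows. This is a simplified illustration with hypothetical function names: the standard adds neighbor-availability rules and special directional cases for 16 × 8 and 8 × 16 partitions, which are omitted here.

```python
# Component-wise median prediction of a motion vector from its three
# neighbours: A (left), B (above), C (above-right). Only the
# difference between the actual vector and this prediction is coded.

def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_mv(mv_a, mv_b, mv_c):
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))

mv_actual = (5, -2)
pred = predict_mv((4, -1), (6, -3), (5, 0))
mvd = (mv_actual[0] - pred[0], mv_actual[1] - pred[1])  # what gets coded
```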
For B pictures, which use bi-prediction, two motion vectors are allowed for each macroblock or sub-macroblock partition, and each pixel is predicted as a weighted average of the two corresponding prediction pixels. Notably, H.264 introduces a Pskip macroblock type that uses simple 16 × 16 motion compensation and transmits no prediction error. This option proves advantageous for regions with minimal motion, as well as for scenes characterized by slow panning.
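The default bi-prediction average can be sketched as below. This is the unweighted case with equal weights and rounding; H.264's explicit weighted prediction, with transmitted weights and offsets, is omitted, and the function name is my own.

```python
# B-picture bi-prediction: each sample is the rounded average of the
# two motion-compensated prediction blocks.

def bipredict(pred0, pred1):
    return [[(p0 + p1 + 1) >> 1 for p0, p1 in zip(r0, r1)]
            for r0, r1 in zip(pred0, pred1)]

out = bipredict([[100, 102], [104, 106]],
                [[101, 101], [103, 109]])
# out == [[101, 102], [104, 108]]
```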
The transform used in H.264 is distinct from its predecessors, replacing the traditional 8 × 8 DCT with a more efficient 4 × 4 integer DCT-like transform. The transformation matrix is

H = [ 1  1  1  1;
      2  1 -1 -2;
      1 -1 -1  1;
      1 -2  2 -1 ]

allowing for straightforward execution using addition and shifting operations, which simplifies resource demands during encoding and decoding. Here, multiplication by 2 corresponds to a left shift, while division by 2 equates to a right shift. Although these operations yield computational efficiencies, they may introduce normalization discrepancies, which H.264 addresses with scale factors applied during the quantization process.
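The shift-and-add structure can be made concrete with a small sketch of the forward core transform Y = H X H^T. The outputs are unscaled: as noted above, normalization is folded into the quantizer scale factors, so the values here carry the transform's gain. Function names are my own.

```python
# 4x4 H.264-style forward core transform using only additions,
# subtractions, and one-bit shifts (multiplication by 2 = left shift).

def transform_1d(x):
    s0, s1 = x[0] + x[3], x[1] + x[2]   # butterflies
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [s0 + s1,                     # row 1 of H:  1  1  1  1
            (d0 << 1) + d1,              # row 2 of H:  2  1 -1 -2
            s0 - s1,                     # row 3 of H:  1 -1 -1  1
            d0 - (d1 << 1)]              # row 4 of H:  1 -2  2 -1

def forward_4x4(block):
    rows = [transform_1d(r) for r in block]               # X H^T
    cols = [transform_1d([rows[i][j] for i in range(4)])  # H (X H^T)
            for j in range(4)]
    return [[cols[j][i] for j in range(4)] for i in range(4)]

# A flat 4x4 block of ones yields a single DC coefficient of 16.
```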
Opting for a smaller integer transform presents several advantages. Its integer nature simplifies implementation and prevents the error accumulation that inverse-transform mismatches can otherwise cause. Smaller blocks also represent low-activity regions more precisely, since the values within a 4 × 4 block tend to vary little. And where sharp transitions occur inside a block, a frequent source of ringing in larger transforms, the reduced size confines the ringing to a small neighborhood instead of spreading artifacts across a wide range of pixels.
H.264 supports two methods for binary coding. The first employs exponential Golomb codes to encode various parameters, in conjunction with a context-adaptive variable-length code (CAVLC) for encoding the quantizer labels. The second binarizes the values and then applies a context-adaptive binary arithmetic code (CABAC). The exponential Golomb code for a non-negative integer x is constructed as follows: let M = ⌊log2(x + 1)⌋; the code consists of M zeros followed by the (M + 1)-bit natural binary representation of x + 1. Because that binary representation begins with a 1, the prefix is exactly the unary code for M (the unary code for a number n comprises n zeros followed by a 1). For x = 0, M = 0 and the exponential Golomb code is simply 1.
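This construction takes only a few lines to sketch (the function name is hypothetical; this is the unsigned code described above):

```python
# Unsigned exponential Golomb code: M = floor(log2(x + 1)) zeros
# followed by the (M + 1)-bit binary representation of x + 1.

def exp_golomb(x):
    assert x >= 0
    bits = bin(x + 1)[2:]      # natural binary of x + 1, leading bit 1
    m = len(bits) - 1          # floor(log2(x + 1))
    return "0" * m + bits

# exp_golomb(0) -> "1"
# exp_golomb(3) -> "00100"
```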
In conclusion, the H.264 standard exemplifies an advanced approach to video coding, skillfully integrating efficiency and fidelity through its innovative architecture and binary coding methodologies. Its ability to adapt to various video sequences makes it a powerful tool in modern video compression.