Improvement of CTU Split Mode Decision in H.265 by Machine Learning, Part 1

This is part 1 of my research project. It mainly covers background knowledge on video coding.

A typical video file contains pictures, audio, and metadata, and each of these components can be compressed. Video compression reduces file size, which allows bandwidth and storage to be used more efficiently.

A video is simply a sequence of pictures, or frames. When these frames are displayed fast enough (commonly 24, 25, or 30 frames per second), the viewer has the illusion of moving objects.

Video data has two major kinds of redundancy. Spatial redundancy means that pixels that are close to each other within a frame typically have similar values. Temporal redundancy means that neighboring frames are typically very similar. An encoder can exploit both properties to compress the video.
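To make the two kinds of redundancy concrete, here is a minimal sketch in plain Python. The 4x4 frames and their pixel values are invented for illustration, and the helper names are hypothetical. The sketch only shows that differences between neighboring pixels, and between neighboring frames, tend to be small or zero, which is what makes them cheap to encode.

```python
# Toy 4x4 grayscale frame (values 0-255); the numbers are invented for illustration.
frame_a = [
    [100, 101, 102, 103],
    [100, 101, 102, 103],
    [99, 100, 101, 102],
    [99, 100, 101, 102],
]
# The next frame: identical except for one pixel that changed.
frame_b = [row[:] for row in frame_a]
frame_b[2][1] = 140


def horizontal_differences(frame):
    """Spatial view: difference between each pixel and its left neighbor."""
    return [[row[x] - row[x - 1] for x in range(1, len(row))] for row in frame]


def frame_difference(current, previous):
    """Temporal view: per-pixel difference between two consecutive frames."""
    return [
        [c - p for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


spatial = horizontal_differences(frame_a)
temporal = frame_difference(frame_b, frame_a)
changed_pixels = sum(1 for row in temporal for v in row if v != 0)

print(spatial[0])      # small values: horizontal neighbors are similar
print(changed_pixels)  # how many pixels differ between the two frames
```

In this toy data only one of the sixteen pixels changes between frames, so storing the difference instead of the full second frame would need far less information.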

Video compression is the process of converting digital video into a format that takes up less space when it is stored or transmitted. It has two parts, an encoder and a decoder, which together are usually called a codec. The encoder converts video into a compressed format, specifically a bitstream, and the decoder converts the compressed video back into an uncompressed format.
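The encoder/decoder pairing can be sketched with a toy codec. This is not how H.264 or H.265 work internally: it is a lossless stand-in that stores the first frame, then per-pixel differences to the previous frame, and hands the result to Python's standard zlib module. The function names, the frame layout (flat bytes objects), and the test frames are all assumptions made for the example.

```python
import zlib


def encode(frames):
    """Toy encoder: first frame as is, then per-pixel deltas, then zlib."""
    payload = bytearray(frames[0])
    for previous, current in zip(frames, frames[1:]):
        # Modulo 256 keeps each delta in one byte and stays reversible.
        payload.extend((c - p) % 256 for p, c in zip(previous, current))
    return zlib.compress(bytes(payload))


def decode(bitstream, frame_size):
    """Toy decoder: undo zlib, then rebuild each frame from the previous one."""
    payload = zlib.decompress(bitstream)
    frames = [bytes(payload[:frame_size])]
    for start in range(frame_size, len(payload), frame_size):
        deltas = payload[start:start + frame_size]
        previous = frames[-1]
        frames.append(bytes((p + d) % 256 for p, d in zip(previous, deltas)))
    return frames


# Ten invented 64-pixel frames that brighten slowly from frame to frame.
frames = [bytes([100 + i] * 64) for i in range(10)]
bitstream = encode(frames)
decoded = decode(bitstream, frame_size=64)

raw_size = sum(len(f) for f in frames)
print(raw_size, len(bitstream))  # raw bytes vs. compressed bytes
print(decoded == frames)         # the round trip is lossless in this toy
```

Real video codecs are usually lossy and far more elaborate, but the shape is the same: the encoder produces a bitstream, and the decoder reconstructs frames from it.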

The same encode/decode flow applies to every generation of video coding standard.

Some important concepts from the previous video coding standard, H.264, are carried over to the next generation.

A video is composed of a sequence of frames, and each frame is divided into macroblocks. The encoder processes a frame in units of a macroblock, whose size is fixed at 16 by 16. Within a macroblock, Y denotes the luma component, meaning brightness, and the other components denote chroma, meaning colorfulness.
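The luma/chroma separation can be illustrated with a small conversion from RGB. The sketch below assumes the full-range BT.601 weights used by JPEG/JFIF; the article does not say which conversion matrix is meant, so treat the coefficients and the function name as assumptions for illustration.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr), assuming full-range BT.601."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        # Round to the nearest integer and keep the result in the 8-bit range.
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)


# Two grays of different brightness: Y changes, Cb and Cr stay at the neutral 128.
print(rgb_to_ycbcr(50, 50, 50))
print(rgb_to_ycbcr(255, 255, 255))
```

Colorless pixels differ only in Y, which is why brightness and color can be stored and processed as separate components.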

The next generation, H.265 (HEVC), inherits the important concept of macroblocks and improves on it by introducing the coding tree unit (CTU), which replaces the macroblock. It also increases the number of intra-picture prediction modes. H.265 is designed to handle the growing volume of UHD video and to tackle the core problems of storage and transmission.

At a high level, HEVC divides each picture into CTUs. All the CTUs in a video sequence have the same size: 64 × 64, 32 × 32, or 16 × 16. A CTU usually consists of three blocks, one luma (Y) block, which carries brightness, and two chroma blocks (Cb and Cr), which carry colorfulness, together with the related syntax elements.
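As a quick piece of arithmetic, the sketch below counts how many CTUs are needed to cover a picture for each of the three sizes. The 1920 × 1080 picture size and the function name are assumptions chosen for the example. The counts round up because the CTU grid has to cover the whole picture even when its dimensions are not multiples of the CTU size.

```python
import math

CTU_SIZES = (64, 32, 16)  # the three sizes listed in the text


def ctu_grid(width, height, ctu_size):
    """Return (columns, rows, total) of CTUs needed to cover the picture."""
    cols = math.ceil(width / ctu_size)
    rows = math.ceil(height / ctu_size)
    return cols, rows, cols * rows


for size in CTU_SIZES:
    cols, rows, total = ctu_grid(1920, 1080, size)
    print(f"{size}x{size}: {cols} x {rows} = {total} CTUs")
```

With 16 × 16 CTUs the grid matches the H.264 macroblock grid, while 64 × 64 CTUs cover the same picture with sixteen times fewer units.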

Each of these blocks is called a coding tree block (CTB). The luma CTB has the same size as the CTU (with chroma subsampling such as 4:2:0, the chroma CTBs are smaller), but at this stage a CTB may still be too big for deciding between inter-picture prediction and intra-picture prediction.

Therefore, each CTB can be split in its own way into multiple coding blocks (CBs), and each CB becomes the decision point for the prediction type. Some CTBs are split into 16 × 16 CBs, while others are split into 8 × 8 CBs.

As mentioned, the CTU plays the role that the macroblock played in the previous standard. More importantly, it is more flexible than the fixed-size macroblock: each CTU can be divided into 4 square coding units (CUs), and each CU can be recursively divided into 4 sub-CUs following a quad-tree structure. The CUs do not need to have the same size, so a single CTU can contain CUs of various sizes.
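The recursive quad-tree split can be sketched as follows. The split criterion here, a variance threshold on pixel values, is only a stand-in chosen for illustration. Real encoders decide splits by comparing rate-distortion costs, and that decision is what this project aims to improve. The function names, the threshold, and the test CTU are assumptions. The minimum CU size of 8 matches the smallest size mentioned above.

```python
def variance(block):
    """Population variance of all pixel values in a square block."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_ctu(block, x=0, y=0, min_size=8, threshold=100.0):
    """Recursively split a square block; return leaf CUs as (x, y, size)."""
    size = len(block)
    # Stop at the minimum CU size, or when the block is flat enough
    # (hypothetical criterion, not the rate-distortion test of a real encoder).
    if size <= min_size or variance(block) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = [row[dx:dx + half] for row in block[dy:dy + half]]
            leaves.extend(split_ctu(sub, x + dx, y + dy, min_size, threshold))
    return leaves


# Invented 64x64 CTU: flat background with a bright 8x8 patch in the top-left corner.
ctu = [[250 if (px < 8 and py < 8) else 50 for px in range(64)] for py in range(64)]
cus = split_ctu(ctu)

for cu in cus:
    print(cu)  # (x, y, size) of each leaf CU
```

In this toy CTU the flat regions stay as large CUs, and only the quadrant containing the detail keeps splitting down to 8 × 8, so one CTU ends up holding CUs of three different sizes.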


Main reference:

HEVC – What are CTU, CU, CTB, CB, PB, and TB?


Author: Wenchan

Eager to explore all the interesting IT fields in full stack track, data integration and data science
