There are two international organizations that develop video codec standards. One is the ITU-T, which produced H.261, H.263, H.263+ and others; the other is the International Organization for Standardization (ISO), which produced MPEG-1, MPEG-2, MPEG-4 and others. H.264 is a digital video coding standard developed by the Joint Video Team (JVT), which the two organizations established together, so it is both ITU-T H.264 and Part 10 of ISO/IEC MPEG-4, Advanced Video Coding (AVC). MPEG-4 AVC, MPEG-4 Part 10 and ISO/IEC 14496-10 therefore all refer to H.264.
1. Introduction
H.264, also known as MPEG-4 Part 10, is a high-compression digital video codec standard proposed by the Joint Video Team (JVT), which was formed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The standard is commonly referred to as H.264/AVC (or AVC/H.264, H.264/MPEG-4 AVC, or MPEG-4/H.264 AVC).
2. H.264 coding hierarchy
H.264 is divided into a video coding layer (VCL) and a network abstraction layer (NAL). The VCL covers the coding itself: the compression algorithms and their implementation on a specific hardware platform. The NAL is responsible for formatting and packetizing the coded data so that it can be transmitted over various channels and stored on various media.
Concepts involved in video coding layer:
Video compression technology:
- Intra prediction compression solves the problem of spatial data redundancy.
- Inter prediction compression (motion estimation and compensation) solves the problem of time-domain data redundancy.
- Integer discrete cosine transform (DCT) converts spatially correlated data into largely uncorrelated frequency-domain coefficients, which are then quantized.
- CABAC (context-adaptive binary arithmetic coding) provides lossless entropy compression.
- I frame: key frame, using intra frame compression technology.
- P frame: forward reference frame. During compression, only the previously processed frames are referenced. Inter frame compression technology is adopted.
- B frame: bidirectional reference frame. During compression, it refers to both the previous frame and the subsequent frame. Inter frame compression technology is adopted.
- GOP (group of pictures): the image sequence between two I frames; each image sequence contains only one I frame.
Concepts involved in network abstraction layer:
- SODB: String Of Data Bits, the raw data produced by VCL coding. Its length is not necessarily a multiple of 8 bits, so it has to be padded.
- RBSP: Raw Byte Sequence Payload, the SODB followed by rbsp_trailing_bits (a single 1 bit, then 0 bits until the data is byte-aligned).
- Emulation Prevention Bytes: the byte 0x03, inserted so that the payload can never imitate a start code. When two consecutive 0x00 bytes are followed by a byte in the range 0x00 to 0x03, a 0x03 byte is inserted before that byte (e.g. 0x000001 => 0x00000301). The decoder removes the 0x03 again, an operation also known as shelling.
- EBSP: Encapsulated Byte Sequence Payload, the RBSP with emulation prevention bytes (0x03) added.
- NALU: a NAL unit, consisting of a NALU header and a NALU body (generally RBSP).
- NALU Start Codes: since a NALU carries no size or length information, NALUs cannot simply be concatenated into a stream, because the receiver could not tell where each one starts and ends. 0x000001 or 0x00000001 is therefore placed before each NALU as a start code. The four-byte start code is usually used for NALUs such as SPS, PPS and IDR, and the three-byte start code for the others.

    // logical relationship
    SODB + rbsp_trailing_bits = RBSP
    mixin(RBSP, 0x03) = EBSP
    NALU header + NALU body (RBSP or EBSP) = NALU
    NALU Start Codes + NALU + NALU Start Codes + NALU + ... = H.264 Bits Stream // Annex-B

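The SODB, RBSP and EBSP relationships can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the SODB is represented as a string of '0'/'1' characters for readability, and the rule that 0x03 is inserted only when the byte after two zero bytes is 0x00 to 0x03 is taken from the H.264 specification rather than from the text above.

```python
def sodb_to_rbsp(bits: str) -> bytes:
    """Append rbsp_trailing_bits to a SODB given as a string of '0'/'1'."""
    bits += "1"                     # stop bit: a single 1
    bits += "0" * (-len(bits) % 8)  # zero bits up to the next byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def rbsp_to_ebsp(rbsp: bytes) -> bytes:
    """Insert 0x03 after two zero bytes when the next byte is 0x00..0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just written
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)        # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


def ebsp_to_rbsp(ebsp: bytes) -> bytes:
    """Remove every 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in ebsp:
        if zeros >= 2 and b == 0x03:
            zeros = 0               # drop the emulation prevention byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


if __name__ == "__main__":
    print(sodb_to_rbsp("101").hex())
    print(rbsp_to_ebsp(bytes([0x00, 0x00, 0x01])).hex())
```

Working on whole bytes keeps the escaping logic easy to follow; a real encoder would usually write bits through a bit writer instead of a string.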
3. Video coding
TODO
4. NAL unit structure
The VCL is specified to represent the content of the video data efficiently. The NAL is specified to format that data and provide header information in a form suitable for storage media or for transmission over a variety of communication channels. All data is carried in NAL units, and each NAL unit contains an integer number of bytes. The NAL unit defines a generic format that suits both packet-oriented and bitstream-oriented systems. NAL units have the same format for packet transmission and for byte streams, except that in the byte stream format each NAL unit can be preceded by a start code prefix and additional padding bytes.
The NAL unit (hereinafter referred to as NALU) includes the NAL header and the NAL body, in which the NAL header occupies one byte, and the structure is as follows:

    +---------------+
    |0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+
    |F|NRI|  Type   |
    +---------------+

- forbidden_zero_bit (F): 1 bit, must be 0;
- nal_ref_idc (NRI): 2 bits, indicating the importance of the current NALU. The larger the value, the more important the NALU. When the decoder is unable to decode everything, it can discard NALUs whose importance is 0;
- nal_unit_type (Type): 5 bits, the type of the NALU, so the range is 0 ~ 31.
NALU types fall into two main categories: types 1 ~ 5 are called VCL NALUs (the coded video data), and the rest are called non-VCL NALUs (configuration and auxiliary information). The NALU type codes are as follows:
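The three header fields can be pulled out of the header byte with shifts and masks. The sketch below is illustrative only; `NalHeader`, `parse_nal_header` and `build_nal_header` are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # 1 bit, expected to be 0
    nal_ref_idc: int         # 2 bits, importance of the NALU
    nal_unit_type: int       # 5 bits, 0..31


def parse_nal_header(byte: int) -> NalHeader:
    """Split the one-byte NAL header into its F, NRI and Type fields."""
    return NalHeader(
        forbidden_zero_bit=(byte >> 7) & 0x01,
        nal_ref_idc=(byte >> 5) & 0x03,
        nal_unit_type=byte & 0x1F,
    )


def build_nal_header(nal_ref_idc: int, nal_unit_type: int) -> int:
    """Pack NRI and Type back into a header byte (F is always 0 here)."""
    if not 0 <= nal_ref_idc <= 3 or not 0 <= nal_unit_type <= 31:
        raise ValueError("field out of range")
    return (nal_ref_idc << 5) | nal_unit_type


if __name__ == "__main__":
    for value in (0x65, 0x67, 0x68, 0x06):
        print(hex(value), parse_nal_header(value))
```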
Type | Description | Layer | NRI |
---|---|---|---|
0 | Unspecified | non-VCL | |
1 | Coded slice of a non-IDR picture | VCL | |
2 | Coded slice data partition A | VCL | |
3 | Coded slice data partition B | VCL | |
4 | Coded slice data partition C | VCL | |
5 | Coded slice of an IDR picture | VCL | Not 0 |
6 | Supplemental enhancement information (SEI) | non-VCL | 0 |
7 | Sequence parameter set (SPS) | non-VCL | Not 0 |
8 | Picture parameter set (PPS) | non-VCL | Not 0 |
9 | Access unit delimiter | non-VCL | 0 |
10 | End of sequence | non-VCL | 0 |
11 | End of stream | non-VCL | 0 |
12 | Filler data | non-VCL | 0 |
13 | Sequence parameter set extension | non-VCL | 0 |
14..18 | Reserved | non-VCL | 0 |
19 | Coded slice of an auxiliary coded picture without partitioning | non-VCL | 0 |
20..23 | Reserved | non-VCL | 0 |
24..31 | Unspecified | non-VCL | 0 |
The NALU types encountered most often are IDR slice, SPS, PPS and SEI. Their NAL header bytes are commonly 0x65, 0x67, 0x68 and 0x06 respectively. A concrete file makes this visible:
- Extract the H.264 stream from a local MP4 file with ffmpeg: ffmpeg -i xx.mp4 xx.h264;
- The resulting xx.h264 is an Annex-B byte stream whose NALU bodies are RBSP (or EBSP) data;
- In a hex dump of the file, the leading 0x00000001 is the start code, the next byte is the NALU header (0x67 indicates SPS), and the bytes that follow are the NALU body (which in this example includes two emulation prevention bytes).
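A stream such as xx.h264 can be split back into NALUs by scanning for start codes. The following is a rough sketch under a few assumptions: `split_annexb` is a hypothetical helper name, the whole stream fits in memory, and any 0x00 bytes directly before a start code are treated as part of a four-byte start code or as padding (a NALU body does not end in 0x00 because of the trailing stop bit).

```python
from typing import List

START_CODE = b"\x00\x00\x01"  # the 4-byte form is this prefix after one more 0x00


def split_annexb(stream: bytes) -> List[bytes]:
    """Return the NALUs (header byte + body) with start codes removed."""
    nalus = []
    pos = stream.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        # Strip zero bytes that belong to the next start code or to padding.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        pos = nxt
    return nalus


if __name__ == "__main__":
    # Hand-made sample data, not taken from a real file.
    sample = (
        b"\x00\x00\x00\x01\x67\x64\x00\x1f"
        b"\x00\x00\x00\x01\x68\xee\x3c\x80"
        b"\x00\x00\x01\x65\x88\x84"
    )
    for nalu in split_annexb(sample):
        print("type", nalu[0] & 0x1F, "length", len(nalu))
```

Because emulation prevention guarantees that 0x000001 never appears inside a NALU body, a plain byte search is enough to find the boundaries.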
5. RTP based transmission structure of H.264

    +----------------------+
    |      RTP Packet      |
    +--------+-------------+
    | Header |   Payload   |
    +--------+-------------+
                ⬇️
    +----------------------+
    |     RTP Payload      |
    +----------------+-----+
    | Payload Header | ... |
    +----------------+-----+
            ⬇️
    +-------------------------------+
    |        Payload Header         |
    +-------------------------------+
    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+
    | F |  NRI  |       Type        |
    +-------------------------------+

RTP (Real-time Transport Protocol) is a protocol for real-time transmission. For carrying an H.264 stream over RTP, three different payload structures are defined. The receiver identifies the payload structure from the first byte of the RTP payload (the payload header). The payload header has the same format as the NALU header: the same structure and the same meaning for each field. Type values 24 ~ 31 are used for RTP packetization and are defined as follows:
Type | Packet type in the RTP payload | Payload structure |
---|---|---|
0 | Undefined | |
1~23 | Single NAL unit packet | Single NAL unit packet |
24 | STAP-A, single-time aggregation packet | Aggregation packet |
25 | STAP-B, single-time aggregation packet | Aggregation packet |
26 | MTAP16, multi-time aggregation packet | Aggregation packet |
27 | MTAP24, multi-time aggregation packet | Aggregation packet |
28 | FU-A, fragmentation unit | Fragmentation unit |
29 | FU-B, fragmentation unit | Fragmentation unit |
30~31 | Undefined | |
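A receiver might classify an incoming payload from its first byte roughly as follows. This is only a sketch: `payload_structure` is a hypothetical name, the returned strings are made up for this example, and type 27 is labelled MTAP24 in line with RFC 6184.

```python
AGGREGATION = {24: "STAP-A", 25: "STAP-B", 26: "MTAP16", 27: "MTAP24"}
FRAGMENTATION = {28: "FU-A", 29: "FU-B"}


def payload_structure(payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its first byte."""
    if not payload:
        raise ValueError("empty RTP payload")
    nal_type = payload[0] & 0x1F  # low 5 bits of the payload header
    if 1 <= nal_type <= 23:
        return "single NAL unit packet"
    if nal_type in AGGREGATION:
        return "aggregation packet (" + AGGREGATION[nal_type] + ")"
    if nal_type in FRAGMENTATION:
        return "fragmentation unit (" + FRAGMENTATION[nal_type] + ")"
    return "undefined"


if __name__ == "__main__":
    for first_byte in (0x67, 0x78, 0x7C, 0x60):
        print(hex(first_byte), payload_structure(bytes([first_byte])))
```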
The three payload structures are:
5.1 Single NAL Unit Packet
Single NAL unit packet: each NALU is encapsulated on its own in one RTP packet. The RTP payload structure is as follows:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |F|NRI|  Type   |                                               |
    +-+-+-+-+-+-+-+-+                                               |
    |                                                               |
    |                Bytes 2..n of a single NAL unit                |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

RTP payload format for single NAL unit packet
5.2 Aggregation Packets
Aggregation packet: when several NALUs in the H.264 stream are particularly small, multiple NALUs can be packed into one RTP packet. The RTP payload structure is as follows:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |F|NRI|  Type   |                                               |
    +-+-+-+-+-+-+-+-+                                               |
    |                                                               |
    |                 one or more aggregation units                 |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

RTP payload format for aggregation packets
Aggregation packets are divided into two categories:
Single-time aggregation packet (STAP): aggregates NAL units that share the same NALU time. Two types of STAP are defined. STAP-A carries no DON (decoding order number); each NALU is preceded by a two-byte NALU size. STAP-B includes a DON: a two-byte DON field is added to the STAP-A layout. For example:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                          RTP Header                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         NALU 1 Data                           |
    :                                                               :
    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |               | NALU 2 Size                   | NALU 2 HDR    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         NALU 2 Data                           |
    :                                                               :
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

An example of an RTP packet including an STAP-A and two single-time aggregation units
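
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                          RTP Header                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |STAP-B NAL HDR | DON                           | NALU 1 Size   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | NALU 1 Size   | NALU 1 HDR    | NALU 1 Data                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
    :                                                               :
    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |               | NALU 2 Size                   | NALU 2 HDR    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         NALU 2 Data                           |
    :                                                               :
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

An example of an RTP packet including an STAP-B and two single-time aggregation units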
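The STAP-A layout, a one-byte header followed by (two-byte size, NALU) pairs, can be sketched as below. The helper names are hypothetical. Setting the header's F bit when any aggregated NALU has it set, and its NRI to the maximum NRI among them, follows RFC 6184 and is not stated in the text above.

```python
import struct
from typing import List

STAP_A = 24


def build_stap_a(nalus: List[bytes]) -> bytes:
    """Pack several small NALUs into one STAP-A RTP payload."""
    if not nalus or any(len(n) == 0 for n in nalus):
        raise ValueError("need at least one non-empty NALU")
    f_bit = max(n[0] & 0x80 for n in nalus)  # set if any NALU has F set
    nri = max(n[0] & 0x60 for n in nalus)    # highest importance wins
    out = bytearray([f_bit | nri | STAP_A])
    for nalu in nalus:
        out += struct.pack("!H", len(nalu))  # 16-bit big-endian NALU size
        out += nalu
    return bytes(out)


def parse_stap_a(payload: bytes) -> List[bytes]:
    """Recover the NALUs carried in a STAP-A RTP payload."""
    if not payload or payload[0] & 0x1F != STAP_A:
        raise ValueError("not a STAP-A payload")
    nalus = []
    pos = 1
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if size == 0 or pos + size > len(payload):
            raise ValueError("truncated aggregation unit")
        nalus.append(payload[pos:pos + size])
        pos += size
    return nalus


if __name__ == "__main__":
    sps = bytes([0x67, 0x42, 0x00, 0x1F])  # made-up SPS bytes
    pps = bytes([0x68, 0xCE, 0x3C, 0x80])  # made-up PPS bytes
    packet = build_stap_a([sps, pps])
    print(packet.hex(), parse_stap_a(packet) == [sps, pps])
```

A STAP-B variant would differ only by the two-byte DON inserted right after the header byte.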
Multi-time aggregation packets (MTAPs): aggregate NAL units with different NALU times. The payload consists of a 16-bit DONB (decoding order number base) and one or more multi-time aggregation units, as shown in Figure 1. MTAPs come in two forms, MTAP16 (Figure 2) and MTAP24 (Figure 3). Their structures are similar; the difference is that in MTAP16 the DOND (decoding order number difference) is followed by a 16-bit timestamp offset (TS offset), while in MTAP24 the offset is 24 bits.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    :  decoding order number base   |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
    |                                                               |
    |                 multi-time aggregation units                  |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1 NAL unit payload format for MTAPs

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    :        NAL unit size          |      DOND     |  TS offset    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  TS offset    |                                               |
    +-+-+-+-+-+-+-+-+              NAL unit                         |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2 MTAP16 multi-time aggregation unit format

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    :        NAL unit size          |      DOND     |  TS offset    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |         TS offset             |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
    |                              NAL unit                         |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3 MTAP24 multi-time aggregation unit format
5.3 Fragmentation Units (FUs)
Fragmentation unit: when a NALU in the H.264 stream is longer than the MTU allows, it has to be fragmented. FUs are divided into two types, FU-A (Figure 5) and FU-B (Figure 6); the difference is that FU-B carries a DON. The FU indicator has the same structure as the NALU header, with a type of 28 or 29. The layout of the FU header is shown in Figure 7: the first bit S (start) indicates whether this is the first fragment, E (end) indicates whether it is the last fragment, R (reserved) must be 0, and the 5-bit type carries the type of the fragmented NALU.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | FU indicator  |   FU header   |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
    |                                                               |
    |                         FU payload                            |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 5 RTP payload format for FU-A

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | FU indicator  |   FU header   |               DON             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    |                         FU payload                            |
    |                                                               |
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               :...OPTIONAL RTP padding        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 6 RTP payload format for FU-B

    +---------------+
    |0|1|2|3|4|5|6|7|
    +-+-+-+-+-+-+-+-+
    |S|E|R|  Type   |
    +---------------+

Figure 7 FU header structure
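FU-A fragmentation and reassembly can be sketched as follows. The function names and the `max_payload` parameter (the largest RTP payload the sender is willing to emit) are assumptions made for this example. Dropping the original NALU header byte from the fragment data and rebuilding it from the FU indicator and FU header follows RFC 6184. The sketch assumes fragments arrive complete and in order; a real receiver would also check RTP sequence numbers.

```python
from typing import List

FU_A = 28


def fragment_fu_a(nalu: bytes, max_payload: int) -> List[bytes]:
    """Split one NALU into FU-A payloads of at most max_payload bytes."""
    if max_payload < 3:
        raise ValueError("max_payload must leave room for at least one data byte")
    if len(nalu) <= max_payload:
        raise ValueError("NALU fits in a single NAL unit packet")
    indicator = (nalu[0] & 0xE0) | FU_A   # F and NRI copied, Type = 28
    nal_type = nalu[0] & 0x1F             # original type goes into the FU header
    body = nalu[1:]                       # the NALU header byte is not repeated
    step = max_payload - 2                # room left after indicator + FU header
    chunks = [body[i:i + step] for i in range(0, len(body), step)]
    payloads = []
    for index, chunk in enumerate(chunks):
        start = 0x80 if index == 0 else 0x00              # S bit
        end = 0x40 if index == len(chunks) - 1 else 0x00  # E bit
        payloads.append(bytes([indicator, start | end | nal_type]) + chunk)
    return payloads


def reassemble_fu_a(payloads: List[bytes]) -> bytes:
    """Rebuild the original NALU from its ordered, complete FU-A payloads."""
    if not payloads or any(len(p) < 2 or p[0] & 0x1F != FU_A for p in payloads):
        raise ValueError("not a list of FU-A payloads")
    if not payloads[0][1] & 0x80 or not payloads[-1][1] & 0x40:
        raise ValueError("missing start or end fragment")
    header = (payloads[0][0] & 0xE0) | (payloads[0][1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)


if __name__ == "__main__":
    nalu = bytes([0x65]) + bytes(range(10))  # made-up IDR slice data
    fragments = fragment_fu_a(nalu, max_payload=6)
    print([f.hex() for f in fragments])
    print(reassemble_fu_a(fragments) == nalu)
```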