Week 13 Project Stage 2

Hello everyone, welcome to the week 13 project blog, this is the second phase of the SPO600(Software Portability and Optimization) project, click here to check phase 1. I this blog we are going to pick an open-source software and locate the SIMD code from the source code of the software and then determine the SIMD code usage in a certain program.

Instruction

For my blog, I choose FFmpeg. FFmpeg is a cross-platform framework to record, convert, and stream video and audio, it provides an easy way to convert video and audio to other formats.

FFmpeg is a very active open-source package with daily updates and debugging on Github. This is a short log of the recent updates of FFmpeg below, or you can click here to catch more updates history.

Single Instruction, Multiple Data Units

SIMD(Single Instruction, Multiple Data) is a kind of high-performance embedded computing. The concept of SIMD is a single instruction that does one operation but we could improve it to process multiple data in parallel. To get more information about SIMD, you can also check my last blog for week 11.

SIMD Usage

To get the source code of FFmpeg, you can simply go to FFmeg’s github and then clone the git onto AArch64 or x86_64 system. Let's take a look at the code snippet I found from /FFmpeg/libavcodec/neon/mpegvideo.c file.
This SIMD implementation is only working on AArch64 systems:

static void inline ff_dct_unquantize_h263_neon(int qscale, int qadd, int nCoeffs,
                                               int16_t *block)
{
    int16x8_t q0s16, q2s16, q3s16, q8s16, q10s16, q11s16, q13s16;
    int16x8_t q14s16, q15s16, qzs16;
    int16x4_t d0s16, d2s16, d3s16, dzs16;
    uint16x8_t q1u16, q9u16;
    uint16x4_t d1u16;

    dzs16 = vdup_n_s16(0);
    qzs16 = vdupq_n_s16(0);

    q15s16 = vdupq_n_s16(qscale << 1);
    q14s16 = vdupq_n_s16(qadd);
    q13s16 = vnegq_s16(q14s16);

    if (nCoeffs > 4) {
        for (; nCoeffs > 8; nCoeffs -= 16, block += 16) {
            q0s16 = vld1q_s16(block);
            q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));
            q8s16 = vld1q_s16(block + 8);
            q1u16 = vceqq_s16(q0s16, qzs16);
            q2s16 = vmulq_s16(q0s16, q15s16);
            q11s16 = vreinterpretq_s16_u16(vcltq_s16(q8s16, qzs16));
            q10s16 = vmulq_s16(q8s16, q15s16);
            q3s16 = vbslq_s16(vreinterpretq_u16_s16(q3s16), q13s16, q14s16);
            q11s16 = vbslq_s16(vreinterpretq_u16_s16(q11s16), q13s16, q14s16);
            q2s16 = vaddq_s16(q2s16, q3s16);
            q9u16 = vceqq_s16(q8s16, qzs16);
            q10s16 = vaddq_s16(q10s16, q11s16);
            q0s16 = vbslq_s16(q1u16, q0s16, q2s16);
            q8s16 = vbslq_s16(q9u16, q8s16, q10s16);
            vst1q_s16(block, q0s16);
            vst1q_s16(block + 8, q8s16);
        }
    }
    if (nCoeffs <= 0)
        return;

    d0s16 = vld1_s16(block);
    d3s16 = vreinterpret_s16_u16(vclt_s16(d0s16, dzs16));
    d1u16 = vceq_s16(d0s16, dzs16);
    d2s16 = vmul_s16(d0s16, vget_high_s16(q15s16));
    d3s16 = vbsl_s16(vreinterpret_u16_s16(d3s16),
                     vget_high_s16(q13s16), vget_high_s16(q14s16));
    d2s16 = vadd_s16(d2s16, d3s16);
    d0s16 = vbsl_s16(d1u16, d0s16, d2s16);
    vst1_s16(block, d0s16);
}

This code snippet is for AArch64 systems, it is using the Neon instruction set at the beginning of this code. It defines a couple of registers as integer or unsigned integer types.

int16x8_t q0s16, q2s16, q3s16, q8s16, q10s16, q11s16, q13s16;
    int16x8_t q14s16, q15s16, qzs16;
    int16x4_t d0s16, d2s16, d3s16, dzs16;
    uint16x8_t q1u16, q9u16;
    uint16x4_t d1u16;

Next, load the values to the registers that we discussed before.

dzs16 = vdup_n_s16(0);
    qzs16 = vdupq_n_s16(0);

    q15s16 = vdupq_n_s16(qscale << 1);
    q14s16 = vdupq_n_s16(qadd);
    q13s16 = vnegq_s16(q14s16);

Afterward, SIMD is being used from here:

if (nCoeffs > 4) {
        for (; nCoeffs > 8; nCoeffs -= 16, block += 16) {
            q0s16 = vld1q_s16(block);
            q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));
            q8s16 = vld1q_s16(block + 8);
            q1u16 = vceqq_s16(q0s16, qzs16);
            q2s16 = vmulq_s16(q0s16, q15s16);
            q11s16 = vreinterpretq_s16_u16(vcltq_s16(q8s16, qzs16));
            q10s16 = vmulq_s16(q8s16, q15s16);
            q3s16 = vbslq_s16(vreinterpretq_u16_s16(q3s16), q13s16, q14s16);
            q11s16 = vbslq_s16(vreinterpretq_u16_s16(q11s16), q13s16, q14s16);
            q2s16 = vaddq_s16(q2s16, q3s16);
            q9u16 = vceqq_s16(q8s16, qzs16);
            q10s16 = vaddq_s16(q10s16, q11s16);
            q0s16 = vbslq_s16(q1u16, q0s16, q2s16);
            q8s16 = vbslq_s16(q9u16, q8s16, q10s16);
            vst1q_s16(block, q0s16);
            vst1q_s16(block + 8, q8s16);
        }
    }

This part of the code snippet is using if-condition and for-loop to load values, multiply two values that were stored in the registers. For example,

q0s16 = vld1q_s16(block); //load
q2s16 = vmulq_s16(q0s16, q15s16); //pub unsafe fn vmulq_s16(a: int16x8_t, b: int16x8_t) -> int16x8_t

Looking at this way the SIMD is being used in this part of code, we could find that the intrinsic is also being used:

q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));

Looking at the inside part first, vcltq_s16 is a method to compare each vector element in the first register q0s16 with the corresponding vector element in the second register qzs16 and if the first signed integer value is greater than the second signed integer value, sets every bit of the corresponding vector element into the destination register to 1 otherwise sets every bit of the corresponding vector element into the destination register to 0. And the vreinterpretq_s16_u16 method is a reinterpret-cast operation as an outer function for the intrinsics.
After calculation, use loop condition swap algorithm to store the result to the register:

if (nCoeffs <= 0)
        return;

    d0s16 = vld1_s16(block);
    d3s16 = vreinterpret_s16_u16(vclt_s16(d0s16, dzs16));
    d1u16 = vceq_s16(d0s16, dzs16);
    d2s16 = vmul_s16(d0s16, vget_high_s16(q15s16));
    d3s16 = vbsl_s16(vreinterpret_u16_s16(d3s16),
                     vget_high_s16(q13s16), vget_high_s16(q14s16));
    d2s16 = vadd_s16(d2s16, d3s16);
    d0s16 = vbsl_s16(d1u16, d0s16, d2s16);
    vst1_s16(block, d0s16);

After my examination, the SIMD implementation for mpegvideo.c file is for compression and decompression. It is selected during compile-time and then calculates tons of elements without macros in different vectors at the same time and also puts all of the SIMD methods into a for loop to execute. The ff_dct_unquantize_h263_neon function saved a lot of time for the package to calculate.

Looking at the SIMD implementation, FFmpeg explicitly creates a folder for the Neon instruction set:

In the Neon directory we got this:

We could see that the SIMD implementation for this part of the functionality is separated from other SIMD implementations.

The SIMD is also working on x86 systems. I found the code snippet from /FFmpeg/libavcodec/x86/lpc.c :

static void lpc_compute_autocorr_sse2(const double *data, int len, int lag,
                                      double *autoc)
{
    int j;

    if((x86_reg)data & 15)
        data++;

    for(j=0; j<lag; j+=2){
        x86_reg i = -len*sizeof(double);
        if(j == lag-2) {
            __asm__ volatile(
                "movsd    "MANGLE(pd_1)", %%xmm0    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm1    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm2    \n\t"
                "1:                                 \n\t"
                "movapd   (%2,%0), %%xmm3           \n\t"
                "movupd -8(%3,%0), %%xmm4           \n\t"
                "movapd   (%3,%0), %%xmm5           \n\t"
                "mulpd     %%xmm3, %%xmm4           \n\t"
                "mulpd     %%xmm3, %%xmm5           \n\t"
                "mulpd -16(%3,%0), %%xmm3           \n\t"
                "addpd     %%xmm4, %%xmm1           \n\t"
                "addpd     %%xmm5, %%xmm0           \n\t"
                "addpd     %%xmm3, %%xmm2           \n\t"
                "add       $16,    %0               \n\t"
                "jl 1b                              \n\t"
                "movhlps   %%xmm0, %%xmm3           \n\t"
                "movhlps   %%xmm1, %%xmm4           \n\t"
                "movhlps   %%xmm2, %%xmm5           \n\t"
                "addsd     %%xmm3, %%xmm0           \n\t"
                "addsd     %%xmm4, %%xmm1           \n\t"
                "addsd     %%xmm5, %%xmm2           \n\t"
                "movsd     %%xmm0,   (%1)           \n\t"
                "movsd     %%xmm1,  8(%1)           \n\t"
                "movsd     %%xmm2, 16(%1)           \n\t"
                :"+&r"(i)
                :"r"(autoc+j), "r"(data+len), "r"(data+len-j)
                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
                :"memory"
            );
        } else {
            __asm__ volatile(
                "movsd    "MANGLE(pd_1)", %%xmm0    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm1    \n\t"
                "1:                                 \n\t"
                "movapd   (%3,%0), %%xmm3           \n\t"
                "movupd -8(%4,%0), %%xmm4           \n\t"
                "mulpd     %%xmm3, %%xmm4           \n\t"
                "mulpd    (%4,%0), %%xmm3           \n\t"
                "addpd     %%xmm4, %%xmm1           \n\t"
                "addpd     %%xmm3, %%xmm0           \n\t"
                "add       $16,    %0               \n\t"
                "jl 1b                              \n\t"
                "movhlps   %%xmm0, %%xmm3           \n\t"
                "movhlps   %%xmm1, %%xmm4           \n\t"
                "addsd     %%xmm3, %%xmm0           \n\t"
                "addsd     %%xmm4, %%xmm1           \n\t"
                "movsd     %%xmm0, %1               \n\t"
                "movsd     %%xmm1, %2               \n\t"
                :"+&r"(i), "=m"(autoc[j]), "=m"(autoc[j+1])
                :"r"(data+len), "r"(data+len-j)
                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
            );
        }
    }

This function, it is using an inline assembler at runtime because it is located in a .c file. This is for x86 systems to use SSE2 to do the calculation and data transformation.

Conclusion

To conclude the usage of SIMD in FFmpeg, I only used two examples from AArch64 and x86 systems but I would say that FFmpeg takes full advantage of SIMD implementation. It saves a lot of time for this package when the software converts videos and audio to another format. It could calculate multiple data simultaneously by using the SIMD instruction set. For the code structure, after this project, I got a brief picture of the structure management of FFmpeg. I would say FFmpeg is well structured because FFmpeg decided to put the SIMD implementation in separate folders, it provides me with an idea which is that the separate code implementation is very helpful to debug and locate the error position when the program gets into an issue. Also, it provides a chance to reuse some of the code that they developed before. To improve the code structure, I would say maybe use more loop condition swaps to avoid some unnecessary loop procedures.