Shichao's SPO600

Featured

My First Blog Post

Be yourself; Everyone else is already taken.
— Oscar Wilde.

This is the first post on my new blog. I’m just getting this new blog going, so stay tuned for more. Subscribe below to get notified when I post new updates.

Project solution

In spo600 course, I took FLIF as my project.

FLIF is a picture format that is able to lossless compress pictures.

There are 3 stages for the project: Benchmarking, Profiling and optimizing.

In the first stage, I found the project and benchmarked the program using the use case that built by my self. If you are interested in that, please go to my previous posts.
To identify the effects of compiler optimization flags, I tried to compile and run the program with different optimization levels by modifying the Makefile.

One thing surprised me was that O3 and Ofast does not provide the fastest performance, O2 does.

Then I moved to second stage, the profiling. In this stage, I found the function that takes most of time to run and tried to understand it.

That was a clean code. It does not contains too much high-end C++ feature, but only used template and class from C++, other codes looks like C code.

That allows me to understand the code easily. After I looked at the code, I thought that it is hard to optimize the code because it is clear enough: there are almost no obvious logic and syntax issues that result in performance issues.

In the final stage, I tried to use multiple-threading to speed-up the program. It actually work. The function that takes most time to run was complete so that it is hard to move it to multiple-threading – actually that logic should not be done with different threading. Just like render a image, each thread should render only one part, instead of one part being rendered by multiple thread. So I cut the image into 4 different fragments and compress the 4 fragments by the program at the same time. the program takes less time in the real.(not the cpu time, the cpu time should be used a little more).

Lab 6 – Algorithm selection

In this lab, I will look at 3 different approaches that deals with audio volume.

This is the 3 approaches that the course provided:

Three approaches to this problem are provided:
1. The basic or Naive algorithm (vol1.c). This approach multiplies each sound sample by 0.75, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating point can be expensive operations.
2. A lookup-based algorithm (vol2.c). This approach uses a pre-calculated table of all 65536 possible results, and looks up each sample in that table instead of multiplying.
3.A fixed-point algorithm (vol3.c). This approach uses fixed-point math and bit shifting to perform the multiplication without using floating-point math.
https://wiki.cdot.senecacollege.ca/wiki/SPO600_Algorithm_Selection_Lab

First, I will try to predict which is the fastest one.

After I looked at all three solutions, I prefer to say solution 3 is the fastest one because it used bitwise operation to make the division.

        for (x = 0; x < SAMPLES; x++) {
                data[x] = (data[x] * vf) >>16;
        }

This will be extremely fast to the CPU, but the accuracy may be affected.

To identify the time of every solution used, I decided to use linux system call to record the time.

Time of sample 500000000 on israel(in microsecond, us):

vol1 1460840

vol2 2579299

vol3 754098

If I ran the programs again, the time will be a little bit different but most likely the differences are very tiny so that we can ignore it.

The reason why there is such differences is that the OS may dealing with other stuff so that the program cannot take 100% CPU resources.

On israel, vol2 is the slowest one. However, when I try to run the same programs on xerxes, vol1 is the slowest one. Maybe the different instruction set resulted in this. vol3 is the fastest on both israel and xerxes.

Memory usage:

I use ‘top’ to monitor the memory usages.

According to the data, the memory usages of the 3 solutions were almost same.

The things that shocks me is that the same code(logic) might have different performance on different architectures. The CPU is general use chip so that it can do all the stuff, but the hardware architecture does affect the program performance. So that if there are some specific usage logic and the code is hard to be optimized, create a architecture that only process this logic may be able to improve the performance effectively.

Stage 3 – optimization

Examine the code:

    for (int i = 0; i < plane_zoomlevels(images[0], beginZL, endZL); i++) {
    
      std::pair<int, int> pzl = plane_zoomlevel(images[0], beginZL, endZL, i, ranges);
      int p = pzl.first;
      int z = pzl.second;
      if (options.chroma_subsampling && p > 0 && p < 3 && z < 2) continue;
      if (!default_order) metaCoder.write_int(0, nump-1, p);
      if (ranges->min(p) >= ranges->max(p)) continue;
      int predictor = (the_predictor[p] < 0 ? find_best_predictor(images, ranges, p, z) : the_predictor[p]);
      //if (z < 2 && the_predictor < 0) printf("Plane %i, zoomlevel %i: predictor %i\n",p,z,predictor);
      if (the_predictor[p] < 0) metaCoder.write_int(0, MAX_PREDICTOR, predictor);
      if (endZL == 0) v_printf_tty(2,"\r%i%% done [%i/%i] ENC[%i,%ux%u]  ",(int)(100*pixels_done/pixels_todo),i,plane_zoomlevels(images[0], beginZL, endZL)-1,p,images[0].cols(z),images[0].rows(z));
      Properties properties((nump>3?NB_PROPERTIESA[p]:NB_PROPERTIES[p]));
      if (z % 2 == 0) {
        // horizontal: scan the odd rows, output pixel values
          for (uint32_t r = 1; r < images[0].rows(z); r += 2) {
            pixels_done += images[0].cols(z);
            if (endZL == 0 && (r & 257)==257) v_printf_tty(3,"\r%i%% done [%i/%i] ENC[%i,%ux%u]  ",(int)(100*pixels_done/pixels_todo),i,plane_zoomlevels(images[0], beginZL, endZL)-1,p,images[0].cols(z),images[0].rows(z));
            for (int fr=0; fr<(int)images.size(); fr++) {
              const Image& image = images[fr];
              if (image.seen_before >= 0) { continue; }
              uint32_t begin=(image.col_begin[r*image.zoom_rowpixelsize(z)]/image.zoom_colpixelsize(z)),
                         end=(1+(image.col_end[r*image.zoom_rowpixelsize(z)]-1)/image.zoom_colpixelsize(z));
              for (uint32_t c = begin; c < end; c++) {
                    if (alphazero && p<3 && image(3,z,r,c) == 0) continue;
#ifdef SUPPORT_ANIMATION
                    if (FRA && p<4 && image(4,z,r,c) > 0) continue;
#endif
                    ColorVal guess = predict_and_calcProps(properties,ranges,image,z,p,r,c,min,max,predictor);
                    ColorVal curr = image(p,z,r,c);
#ifdef SUPPORT_ANIMATION
                    if (FRA) {
                        if (p==4 && max > fr) max = fr;
                        if (guess>max || guess<min) guess = min;
                    }
#endif
                    assert (curr <= max); assert (curr >= min);
                    coders[p].write_int(properties, min - guess, max - guess, curr - guess);
              }
            }
          }
      } else {
        // vertical: scan the odd columns
          for (uint32_t r = 0; r < images[0].rows(z); r++) {
            pixels_done += images[0].cols(z)/2;
            if (endZL == 0 && (r&513)==513) v_printf_tty(3,"\r%i%% done [%i/%i] ENC[%i,%ux%u]  ",(int)(100*pixels_done/pixels_todo),i,plane_zoomlevels(images[0], beginZL, endZL)-1,p,images[0].cols(z),images[0].rows(z));
            for (int fr=0; fr<(int)images.size(); fr++) {
              const Image& image = images[fr];
              if (image.seen_before >= 0) { continue; }
              uint32_t begin=(image.col_begin[r*image.zoom_rowpixelsize(z)]/image.zoom_colpixelsize(z)),
                         end=(1+(image.col_end[r*image.zoom_rowpixelsize(z)]-1)/image.zoom_colpixelsize(z))|1;
              if (begin>1 && ((begin&1) ==0)) begin--;
              if (begin==0) begin=1;
              for (uint32_t c = begin; c < end; c+=2) {
                    if (alphazero && p<3 && image(3,z,r,c) == 0) continue;
#ifdef SUPPORT_ANIMATION
                    if (FRA && p<4 && image(4,z,r,c) > 0) continue;
#endif
                    ColorVal guess = predict_and_calcProps(properties,ranges,image,z,p,r,c,min,max,predictor);
                    ColorVal curr = image(p,z,r,c);
#ifdef SUPPORT_ANIMATION
                    if (FRA) {
                        if (p==4 && max > fr) max = fr;
                        if (guess>max || guess<min) guess = min;
                    }
#endif
                    assert (curr <= max); assert (curr >= min);
                    coders[p].write_int(properties, min - guess, max - guess, curr - guess);
              }
            }
          }

      }
      if (endZL==0 && io.ftell()>fs) {
          v_printf_tty(3,"    wrote %li bytes    ", io.ftell());
          v_printf_tty(5,"\n");
          fs = io.ftell();
      }
      
    }

This is the loop that takes most of time of the program. To examine the loop, I need to know the run time of every loop so that I added timer to the beginning and the ending of the loop. the result is that with the increasing of value ‘i’, the loops takes more time.

Just go though the code, the thing that most likely be effected by value ‘i’ is the

std::pair<int, int> pzl = plane_zoomlevel(images[0], beginZL, endZL, i, ranges);

function.

This function returns a std::pair<int, int>. The function signature is:

std::pair<int, int> plane_zoomlevel(const Image &image, const int beginZL, const int endZL, int i, const ColorRanges *ranges)

it returns a std::pair<int, int> and the tried to assign it’s value to a variable. This may cause the copy constructor and copy assignment operator to be called over and over again. To save the time here, we can allocate the memory on the heap and make the function returns a reference or pointer to avoid call copy constructor or copy assignment operator of std::pair;

I tested it, the program speeded up less than 1% so that not a signification part that affecting performance.

The most important part is the inside loop.

      if (z % 2 == 0) {
        // horizontal: scan the odd rows, output pixel values
          for (uint32_t r = 1; r < images[0].rows(z); r += 2) {
            pixels_done += images[0].cols(z);
            if (endZL == 0 && (r & 257)==257) v_printf_tty(3,"\r%i%% done [%i/%i] ENC[%i,%ux%u]  ",(int)(100*pixels_done/pixels_todo),i,plane_zoomlevels(images[0], beginZL, endZL)-1,p,images[0].cols(z),images[0].rows(z));
            for (int fr=0; fr<(int)images.size(); fr++) {
              const Image& image = images[fr];
              if (image.seen_before >= 0) { continue; }
              uint32_t begin=(image.col_begin[r*image.zoom_rowpixelsize(z)]/image.zoom_colpixelsize(z)),
                         end=(1+(image.col_end[r*image.zoom_rowpixelsize(z)]-1)/image.zoom_colpixelsize(z));
              for (uint32_t c = begin; c < end; c++) {
                    if (alphazero && p<3 && image(3,z,r,c) == 0) continue;
#ifdef SUPPORT_ANIMATION
                    if (FRA && p<4 && image(4,z,r,c) > 0) continue;
#endif
                    ColorVal guess = predict_and_calcProps(properties,ranges,image,z,p,r,c,min,max,predictor);
                    ColorVal curr = image(p,z,r,c);
#ifdef SUPPORT_ANIMATION
                    if (FRA) {
                        if (p==4 && max > fr) max = fr;
                        if (guess>max || guess<min) guess = min;
                    }
#endif
                    assert (curr <= max); assert (curr >= min);
                    coders[p].write_int(properties, min - guess, max - guess, curr - guess);
              }
            }
          }
      } else {
        // vertical: scan the odd columns
          for (uint32_t r = 0; r < images[0].rows(z); r++) {
            pixels_done += images[0].cols(z)/2;
            if (endZL == 0 && (r&513)==513) v_printf_tty(3,"\r%i%% done [%i/%i] ENC[%i,%ux%u]  ",(int)(100*pixels_done/pixels_todo),i,plane_zoomlevels(images[0], beginZL, endZL)-1,p,images[0].cols(z),images[0].rows(z));
            for (int fr=0; fr<(int)images.size(); fr++) {
              const Image& image = images[fr];
              if (image.seen_before >= 0) { continue; }
              uint32_t begin=(image.col_begin[r*image.zoom_rowpixelsize(z)]/image.zoom_colpixelsize(z)),
                         end=(1+(image.col_end[r*image.zoom_rowpixelsize(z)]-1)/image.zoom_colpixelsize(z))|1;
              if (begin>1 && ((begin&1) ==0)) begin--;
              if (begin==0) begin=1;
              for (uint32_t c = begin; c < end; c+=2) {
                    if (alphazero && p<3 && image(3,z,r,c) == 0) continue;
#ifdef SUPPORT_ANIMATION
                    if (FRA && p<4 && image(4,z,r,c) > 0) continue;
#endif
                    ColorVal guess = predict_and_calcProps(properties,ranges,image,z,p,r,c,min,max,predictor);
                    ColorVal curr = image(p,z,r,c);
#ifdef SUPPORT_ANIMATION
                    if (FRA) {
                        if (p==4 && max > fr) max = fr;
                        if (guess>max || guess<min) guess = min;
                    }
#endif
                    assert (curr <= max); assert (curr >= min);
                    coders[p].write_int(properties, min - guess, max - guess, curr - guess);
              }
            }
          }

      }

This takes most of the time.

The code is simple enough so that it is hard to improve the performance by dealing with the code.

However, there are still some other’s way to improve the performance.

The program run on only one thread, so that the CPU frequency is the most important part which affecting performance.

So optimize it, we can cut the picture into frags and run multiple instances at the same time.

Let’s think about this script.

#include <thread>
#include <stdlib.h>
#include <iostream>
#include <string>
#include "timer.h"
using namespace std;

void func(int i){
    Timer t;
    std::string cmd = "./flif test4k"+to_string(i)+".png out" + to_string(i) +".flif";
    t.reset();
    t.start();
    system(cmd.c_str());
    t.stop();
    cout << i << " finished in " << t.currtime() << endl;
}

int main(){
    Timer t;
    t.start();
    std::thread t1(func, 1);
    std::thread t2(func, 2);
    std::thread t3(func, 3);
    std::thread t4(func, 4);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t.stop();
    cout << "finished in " << t.currtime() << endl;
    //combine picture;
    //need another program to finish this operation
    //...
}

I cut an 8K picture into four 4K pictures, and try to run four FLIF program to cut the four picture at the same time.

If the machine contains 4 psychical cores, the program should be finished with time that is similar as compressing a 4K picture.

The result is:

This is actually much faster than previous and match our prediction.

Because this is a lossless compression software so that there are no problems like the issue of calculating the pixel value at the edge of the picture if we cut one picture into fragments and combine them.

Conclusion:

This is my project of SPO600, improving the performance of this open source software. What I learned from the project is that, we should consider the optimization level as early as possible so that the cost of modifying code will not be unacceptable. Also, multiple threading is another effective tool to improve the CPU sensitive programs.

I will post a conclusion post later.

Project Stage 2 – Profiling FLIF

Introduction

In this stage, I will try to find out with function takes most CPU resources and see if there is possibility to reduce the runtime without effect or with a tiny effect to the output.

I choose “gprof” as the profiler on Linux machine. The reason why I choose it is that it gives me a option to see the call times of the function.

First, I need to add -pg argument when I am compiling the program:

This allows “gprof” to count the call times of the function calls.

Then I compile and run the program, I get a file called “gmon.out”.

Run “gprof flif gmon.out > result.txt”, I redirect the output of the result to the file so I can clearly see the report:

In FLIF project, most of CPU time of costed by encode functions: flif_encode_FLIF2_inner and void flif_encode_FLIF2_pass
They take approximately 87% of the run time when I try to compress a 8K image.

Then I locate the function and see how it works.

This function contains many nested loops and run the loop in a single thread. This could be a point that how can I speed up the project – multiple threading and inline assembly.

Lab 5 – X86_64

In the last blog, we finished the program that prints “Hello World” 30 times on Aarch64 machine, now we are going to code the same program on X86_64 machine using assembly language.

Step 1: start with a basic hello world program

.text
.globl  _start

_start:
        movq    $len,%rdx                       /* message length */
        movq    $msg,%rsi                       /* message location */


        movq    $1,%rdi                         /* file descriptor stdout */
        movq    $1,%rax                         /* syscall sys_write */
        syscall

        movq    $0,%rdi                         /* exit status */
        movq    $60,%rax                        /* syscall sys_exit */
        syscall


.section .data
msg:    .ascii      "Hello, world!: 0\n"
        len = . - msg

Step 2: loop 30 times output

.text
.globl	_start

_start:
	movb 	$'0', %r12b
	movq	$min, %r10

loop:

	movq	%r10, %r9
	movq	$len,%rdx			/* message length */
	movq 	$msg,%rsi			/* message location */
	
	movq 	$msg+15, %r11
	movb 	%r12b, (%r11)
	
	movq	$1,%rdi				/* file descriptor stdout */
	movq	$1,%rax				/* syscall sys_write */
	syscall
	
	inc		%r12b
	
	inc		%r10
	cmp		$max, %r10
	jne		loop

	movq	$0,%rdi				/* exit status */
	movq	$60,%rax			/* syscall sys_exit */
	syscall


.section .data
msg:	.ascii      "Hello, world!: 0\n"
	len = . - msg
min = 0
max = 30
division = 10

Step 3: deal with the second digit output

Now the already looped the output 30 times, but the we have not dealt with second digit output, so the output is :

On X86_64 machine, getting reminder is a little different:

        movq    $0, %rdx         /*initialize rdx to 0 for division*/
        movq    %r10, %rax     /*store the r10 value into rax for division*/
        div     %r9              /*divide rax value by 10(r9)*/

        movq    %rax, %r14    /*store the rax value(quotient) into r14*/
        movq    %rdx, %r15    /*store the rdx value(remainder) int r15*/

After dividing, the quotient and remainder will be stored into rax and rdx.

Lab 5 on X86_64

.text
.globl  _start

_start:
        movq    $min, %r10      /*store the min vlaue into r10 as a loop index*/
        movq    $division, %r9   /*store the division value(10) into r9*/

loop:
        cmp     %r9, %r10      /*compare r10(loop index) with 10*/
        jl      digit_1          /*if r10 is less than 10, go to the subroutine digit_1*/
        jmp     digit_2          /*if r10 is greater or equal to 10, go to the subroutine digit_2*/


digit_1:
        movq    %r10, %r15      /*store the r10 value into r15*/
        add     $'0', %r15       /*add '0' to r15 so the value will be ascii number character value*/

        movq    $msg+15, %r11   /*the digit location within string*/
        movb    %r15b, (%r11)   /*store the digit at the location*/

        jmp     print           /*go to the subroutine print*/

digit_2:
        movq    $0, %rdx         /*initialize rdx to 0 for division*/
        movq    %r10, %rax     /*store the r10 value into rax for division*/
        div     %r9              /*divide rax value by 10(r9)*/

        movq    %rax, %r14    /*store the rax value(quotient) into r14*/
        movq    %rdx, %r15    /*store the rdx value(remainder) int r15*/

        add     $'0', %r14       /*add '0' to r14 so the value will be ascii number character value*/
        add     $'0', %r15       /*add '0' to r15 so the value will be ascii number character value*/

        movq    $msg+14, %r11   /*the digit location within string*/
        movb    %r14b, (%r11)   /*store the digit at the location*/

        movq    $msg+15, %r12   /*the digit loctaion within string*/
        movb    %r15b, (%r12)   /*store the digit at the location*/

        jmp     print   /*go to the subroutine print*/


print:
        movq    $len,%rdx                       /* message length */
        movq    $msg,%rsi                       /* message location */


        movq    $1,%rdi                         /* file descriptor stdout */
        movq    $1,%rax                         /* syscall sys_write */
        syscall
        inc             %r12b

        inc             %r10      /*increment loop index*/
        cmp             $max, %r10  /*compare the max with r10*/
        jne             loop      /*if the max value and the r10(loop index) value is not equal,*/
                                  /*   redo the loop subroutine*/

        movq    $0,%rdi                         /* exit status */
        movq    $60,%rax                        /* syscall sys_exit */
        syscall


.section .data
msg:    .ascii      "Hello, world!:   \n"
        len = . - msg
min = 0
max = 30
division = 10

In X86 part, I encountered many seg faults. To deal with that, I tried to use gdb to deal with that. But unfortunately, the gdb barely gives me useful information.

I will write another blog to describe how to use gdb to debug assembly code.

Lab 5 – Aarch 64

In Lab 5, we will try to disassembly some C programs to look at how their assembly code is generated, and write a simple “Hello Word” program using assembly language.

Disassembly hello programs on Aarch64 machine

The <main> section is the code that inside of int main(){…}

hello:

hello2:

hello3:

Lab 5 on Aarch64

step 1: we need a program which is able to print “Hello Word”

.text
.globl _start
_start:

        mov     x0, 1           /* file descriptor: 1 is stdout */
        adr     x1, msg         /* message location (memory address) */
        mov     x2, len         /* message length (bytes) */

        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Hello, world!\n"
len=    . - msg

We are on Linux, so we need to perform system call “write” to output “Hello World” to screen.

write prototype:

ssize_t write(int fd, const void *buf, size_t count);

We can see the write function has 3 parameters: file descriptor, a pointer that points to where the data is in the memory, and the length of the data.

We declared a message in .data section and calculated the length of the message. Now we have all the data we need to perform “write” system call.

We move 1(stdout) to x0 as file descriptor, move address of msg to x1 as the pointer of the buffer, and len to x2 as the length of the message.

After all, we need to indicate the system call number to x8, which should be 64(write) here.

svc 0 to invoke the system call on Aarch64.

Now we can see the message “Hello World” on the screen.

After the output, we need to exit the program, so we performed another system call – exit. the parameter is 0.

Step 2:

We need to loop the output 30 times

.text

.globl _start



_start:

        mov     x19,min

loop:
        mov     x0, 1           /* file descriptor: 1 is stdout /
        adr     x1, msg         / message location (memory address) /
        mov     x2, len         / message length (bytes) /

        mov     x8, 64          / write is syscall #64 /
        svc     0               / invoke syscall /

        add     x19, x19, 1
        cmp     x19, max
        b.ne    loop

        mov     x0, 0           / status -> 0 /
        mov     x8, 93          / exit is syscall #93 /
        svc     0               / invoke syscall */


.data
msg:    .ascii      "Hello, world!\n"
len=    . - msg
min = 0
max = 30

We indicated a counter in x19, and loop the out put until the counter is max, which is 30.

Step 3: print the digits

For the first 10 loops is easy, we just need to output the char 0-9, which are 30-39 in ascii code.

         add x20, x19, 30

then print it to the screen by replacing the char at 14th char of msg.

        adr     x30, msg+14
        strb    w20, [x30]

But the problem is print 10-30. We need to deal with the second digit:

        udiv    x25, x19, x18
        msub    x26, x25, x18, x19

We store 10 in x18, and divide the number by 10, then store the remainder in x26. Then we can use the way that we display the 1-digit number to display 2-digit number.

Lab 5 on Aarch64:

.text
.globl _start

_start:
        mov     x19,min         /*store the min value into x19 as a loop index*/
        mov     x18,division    /*store the division value(10) into x18*/
loop:
        mov     x0, 1           /* file descriptor: 1 is stdout */

        cmp     x19, 9          /* compare x19(loop index value) with 9*/
        b.gt    digit_2         /* if the value is greater than 9(2-digit), go to the subroutine digit_2*/
        bl      digit_1         /* if the value is less or equal than 9(1-digit), go to the subroutine digit_1*/

digit_1:
        add     x20, x19, '0'   /* ascii number character*/

        adr     x30, msg+14     /* the digit location within string */
        strb    w20, [x30]      /* store the digit at the location */
        bl      print           /* go to the print subroutine */


digit_2:
        udiv    x25, x19, x18   /* divide the value by 10 and store the value into the x25 */
        msub    x26, x25, x18, x19 /* store the remainder into x26 */

        add     x21, x25, '0'   /* ascii number character */
        add     x20, x26, '0'   /* ascii number character */

        adr     x25, msg+14     /* the digit location within string */
        strb    w21, [x25]      /* store the digit at the location */
        adr     x26, msg+15     /* the digit location within string */
        strb    w20, [x26]      /* store the digit at the location */

        bl      print

print:
        adr     x1, msg         /* store the locatio of message */
        mov     x2, len         /* store the string length into x2 */

        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        add     x19, x19, 1     /* increment x19 value which is loop index */
        cmp     x19, max        /* compare x19 with max value */

        b.ne    loop            /* if the value is not equal to max value, loop it again */

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Hello World!:   \n"
len= .- msg
min = 0
max = 30
division=10

After we finished the program on Aarch64, we will code the program again on x86_64

Project stage 1 – profiling

Project: FLIF

https://github.com/FLIF-hub/FLIF

“

FLIF is a lossless image format based on MANIAC compression. MANIAC (Meta-Adaptive Near-zero Integer Arithmetic Coding) is a variant of CABAC (context-adaptive binary arithmetic coding), where the contexts are nodes of decision trees which are dynamically learned at encode time.

FLIF outperforms PNG, FFV1, lossless WebP, lossless BPG and lossless JPEG2000 in terms of compression ratio.

Moreover, FLIF supports a form of progressive interlacing (essentially a generalization/improvement of PNG’s Adam7) which means that any prefix (e.g. partial download) of a compressed file can be used as a reasonable lossy encoding of the entire image.

For more information on FLIF, visit https://flif.info

“

To profile the project, I decided to write a script to store the run time.

I used Linux system calls record the run time of the program. To control the variables that may affect the run time and to get a more accurate time, I run the program 5 times and get the average run time of the 5 as the run time of the program.

To build a benchmark test case that is able to take 4 minutes, I created a 4k, 8k, and 16k picture as the test samples.

The result of 8k image on Xerxes server:

./flif test8k.png out.flif
Run 1/5 time: 112.959
Run 2/5 time: 113.265
Run 3/5 time: 113.794
Run 4/5 time: 112.956
Run 5/5 time: 113.011
Avg: 113.197

Then I need to run the test using the sample that I created on different optimization level. Therefore, fixing the test script is needed.

First, I created 4 different Makefile to allow my script to build the program with different optimization level:

Then in my script, after every test case, I just clean and compile it again, then run the test case.

To test the script, I selected a small image as a sample.

Seems like it works well. Then I just need to modify the script to keep only the output that I need – the times and some other critical information.

The script code is:

#include <stdlib.h>
#include <iostream>
#include <string>
#include "timer.h"
using namespace std;
int main(){
	Timer t;
	system("make clean > /dev/null"); //useless output
	std::string cmd = "./flif seneca.png out.flif";
	int c = 0;
test:
	string builds[4] = {string("cp MakefileO1 Makefile"), string("cp MakefileO2 Makefile"), string("cp MakefileO3 Makefile"), string("cp MakefileOf Makefile")};
	cout << builds[c] << endl;
	system((string(builds[c]) + string(" > /dev/null")).c_str()); //useless output
	system("make all >> buildinfo.txt");//record the build info into a file to log the building procedure
	cout << cmd << endl;
	double cul = 0;
	for(int i = 1; i <= 5; i++){
            t.reset();
	    t.start();
	    system(cmd.c_str());
	    t.stop();
	    system("rm out.flif");
	    cout << "Run " << i << "/5 time: " << t.currtime() << endl; 
	    cul += t.currtime();
	}
	cout << "Avg: " << cul/5 << endl;
	system("make clean > /dev/null"); //useless output
	c++;
	if(c != 4) goto test;	
}

output:

Finally, I just need to use the correct sample image and redirect the output into a file, then I can do others staff.

./swtest.out > 8k_result.txt

This blog will be updated when I get the result on both Xerxes server(x86) and Israel(AArch86).

Result on Xerxes(8K Image):

O1:

Run 1/5 time: 136.447
Run 2/5 time: 136.021
Run 3/5 time: 136.356
Run 4/5 time: 136.181
Run 5/5 time: 136.28
Avg: 136.257

O2:

Run 1/5 time: 113.303
Run 2/5 time: 113.05
Run 3/5 time: 113.058
Run 4/5 time: 113.007
Run 5/5 time: 113.434
Avg: 113.17

O3:

Run 1/5 time: 115.2
Run 2/5 time: 115.248
Run 3/5 time: 115.254
Run 4/5 time: 115.537
Run 5/5 time: 115.56

Ofast

Run 1/5 time: 114.524
Run 2/5 time: 114.434
Run 3/5 time: 114.959
Run 4/5 time: 114.584
Run 5/5 time: 114.871
Avg: 114.675

Result on Israel:

O1:

Run 1/5 time: 480.112
Run 2/5 time: 482.835
Run 3/5 time: 487.211
Run 4/5 time: 481.755
Run 5/5 time: 486.157
Avg: 483.614

O2:

Run 1/5 time: 342.601
Run 2/5 time: 343.368
Run 3/5 time: 343.536
Run 4/5 time: 343.573
Run 5/5 time: 350.155
Avg: 344.647

O3:

Run 1/5 time: 354.89
Run 2/5 time: 358.898
Run 3/5 time: 359.501
Run 4/5 time: 354.217
Run 5/5 time: 349.274
Avg: 355.356

Ofast:

Run 1/5 time: 351.706
Run 2/5 time: 350.837
Run 3/5 time: 345.919
Run 4/5 time: 350.003
Run 5/5 time: 346.357
Avg: 348.964

According to the result, the optimization level affects the runtime of the program.

Usually, the O3 is the fastest optimization level, but in this testing case, the O2 and O3 does not have big differences.

Complier Bootstrapping

Compiler bootstrapping means that the a compiler is coded by it’s own language.

For example, a C compiler may write in C.

??????????????????

This is an Chicken-And-Egg question right???

Where is the first version of the language’s complier from? More exactly, which language implemented first Compiler?

No question, it is written in Machine Code.

After that, we can use a coding language to program the computers. After we got the first compiler of a language, for example, first version of C compiler, we will be able to compile C code, and we can write another C compiler using C, and write a better C compiler base on the compiler we have.

For example:

I have a compiler call mcc, all c code compiler by mcc is 100% faster than gcc. How can I make it work?

First, I need to compile my mcc source code using gcc, so I have mcc 1.0.

All code that compiled by mcc is faster than gcc. So C programs 100 faster!

Is that end? Using mcc to compile code is still slow be cause mcc 1.0 is compiled by gcc. we know that C code that compiled by mcc runs faster, so what if I compiler mcc source code again using mcc1.0?

Yes! mcc 2.0(by mcc 1.0) runs faster 100% than mcc 1.0(by gcc)!

This is compiler bootstrapping.

Nowadays, we have so many great coding languages so seems like there is no reason to implement a compiler using it’s own language?

Well there are many reasons to do that.

When a new language is innovated, people need to validate that the language is easy to use.

Bootstrapping is a validation. Also, Bootstrapping helps the innovator to improve their languages to make it better.

But this does not means that the all common compiler are bootstrapped.

For example: Python and Java.

Python is a bootstrapped language, there is a Python implementation called PyPy. But most of people are using CPython(Implemented by C). The reason is efficiency. Also, Java Virtual Machine also has a Java implemented version, but it is barely used.

However, bootstrapping is still necessary to a language because it is able to validate the language and helps the language to be improved!

LAB 3

Lab 3 Code:

define 	POINTER         $10      
define 	POINTER_H	$11
define 	ROW_COUNT 	$13 
define 	ROW          	$14 
define 	COLUMN  	$15
define	CHECK_LINE	$16 
define	ROWFLIP_FLAG	$17
define	COLFLIP_FLAG	$18  
define	PTR_CALCULATE	$19
define	DATA_INDEX	$20
define	SCREENCOL_INDEX	$21
define	KEY		$22

define	WIDTH		5 
define	HEIGHT		5

 	lda  	#$04          	; Set a row value 1f
 	sta 	ROW
 	lda 	#$04		; Set a column value 5
 	sta 	COLUMN

	lda	#$00
	sta	ROWFLIP_FLAG
	sta	COLFLIP_FLAG

; ====== Calculation ======
; ====== Initialize for calculation ======
cal_initialize:

 	lda 	#$00  		; create a pointer at $10
 	sta 	POINTER
 	lda 	#$02
 	sta 	POINTER_H
	
	lda	#$00  
	sta	ROW_COUNT

	lda	#$00   
	sta	PTR_CALCULATE 

	cmp	ROW
	beq	column_count
	bne	row_count

;====== row count =======
row_count:
	lda	PTR_CALCULATE
	clc
	adc	#$20
	sta	PTR_CALCULATE
	
	lda	POINTER_H
	adc	#$00
	sta 	POINTER_H

	inc	ROW_COUNT 
	
	lda	ROW
	cmp	ROW_COUNT
	bne	row_count
	beq	column_count

;===== column_count =====
column_count:
 	lda	PTR_CALCULATE
 	clc
 	adc 	COLUMN
	sta	PTR_CALCULATE

;===== store the value to pointer (calculation done) =====
 	sta 	POINTER

;===== draw graphic =====
; initializing for drawing graph
	lda	#$00
	sta	ROW_COUNT

	ldx	#$00		; index for data
	ldy	#$00		; index for screen column

; draw graph
draw:	lda	ball,x

	sta	(POINTER),y 

	inx
	iny
	cpy	#WIDTH
	bne	draw

	inc	ROW_COUNT

	lda	#HEIGHT 
	cmp	ROW_COUNT
	beq	getkey_initialize
  
	lda	POINTER
	clc
	adc	#$20 
	sta	POINTER 
	lda	POINTER_H 
	adc	#$00
	sta	POINTER_H  

	ldy	#$00 
	beq	draw

; ======= get key =======
getkey_initialize:
	txa		   
	sta	DATA_INDEX
	tya	
	sta	SCREENCOL_INDEX	  
getkey:	lda 	$ff		; get a keystroke 
 
	ldx	#$00	; clear out the key buffer
	stx	$ff

 
 	cmp 	#$80	; if not a cursor key, ignore
 	bmi 	getkey_done
 	cmp 	#$84
 	bpl 	getkey_done 
 
 	cmp 	#$80		; check key == up
 	bne 	check1 
 
 	dec 	ROW		; ... if yes, decrement ROW
 	jmp 	getkey_done
 
 check1:	
	cmp 	#$81		; check key == right
 	bne 	check2
 
 	inc 	COLUMN  	; ... if yes, increment COL
 	jmp 	getkey_done
 
 check2:	
	cmp 	#$82		; check if key == down 
 	bne 	check3
 
 	inc 	ROW		; ... if yes, increment ROW
 	jmp 	getkey_done
 	
 check3:	
	cmp 	#$83		; check if key == left
 	bne 	getkey_done
 
 	dec 	COLUMN  	; ... if yes, decrement COL
 	clc
 	bcc 	getkey_done 

getkey_done:
	ldx	DATA_INDEX
	ldy	SCREENCOL_INDEX
	jmp	check_location

; ======= Check the location ======
check_location: 
	jsr 	check_top
	jsr	check_bottom
	jsr	check_right 
	jsr 	check_left
	jsr	move_pointer

check_initialize: 
	lda	#$00
	sta	CHECK_LINE
	rts
 
check_top:	
	jsr	check_initialize 
	lda	ROW  
	cmp	#$01
	lda	CHECK_LINE
	adc	#$00
	cmp	#$00		
	beq	flip_rowFlag
	rts

check_bottom:	
	jsr	check_initialize
	lda	ROW
	cmp	#$1b
	lda	CHECK_LINE
	adc	#$00
	cmp	#$01
	beq	flip_rowFlag 
	rts

check_left:
	jsr	check_initialize
	lda	COLUMN
	cmp	#$01
	lda	CHECK_LINE
	adc	#$00
	cmp	#$00
	beq	flip_colFlag
	rts

check_right:
	jsr	check_initialize
	lda	COLUMN
	cmp	#$1b
	lda	CHECK_LINE
	adc	#$00
	cmp	#$01
	beq	flip_colFlag
	rts

; ======= Flip Row Flag ====== 
flip_rowFlag:
	lda	ROWFLIP_FLAG
	cmp	#$00 
	beq	inc_rowFlag
	bne	dec_rowFlag
	rts

inc_rowFlag:
	inc	ROWFLIP_FLAG
	rts

dec_rowFlag:
	dec	ROWFLIP_FLAG
	rts


; ======= Flip Col Flag ======
flip_colFlag:
	lda	COLFLIP_FLAG
	cmp	#$00
	beq	inc_colFlag
	bne	dec_colFlag
	rts

inc_colFlag:
	inc	COLFLIP_FLAG
	rts

dec_colFlag:
	dec	COLFLIP_FLAG
	rts

; ======= move the graph ======== 
move_pointer:
	jsr	row_check
	jsr	col_check
	jmp	clear

row_check:
	lda	ROWFLIP_FLAG
	cmp	#$01
	beq	dec_row
	bne	inc_row

col_check:
	lda	COLFLIP_FLAG
	cmp	#$01
	beq	dec_col
	bne	inc_col

inc_row:
	inc	ROW
	rts

dec_row:
	dec	ROW
	rts

inc_col:
	inc	COLUMN
	rts

dec_col:
	dec	COLUMN 
	rts

; ======= move the graph ======= 
;move:	inc	ROW
;	inc	COLUMN

; ======= clear the screen before redraw the graph =====
clear:	lda	ball
	sta	POINTER 
	lda	#$02
	sta	POINTER_H
	
	ldy	#$00
	tya

clear_loop:
	sta	(POINTER),y
	iny
	bne	clear_loop
 
	inc	POINTER_H
	ldx	POINTER_H 
	cpx	#$06
	bne	clear_loop

	jsr 	cal_initialize

; data constant byte
ball:
 dcb 00,03,03,03,00
 dcb 07,00,00,00,07
 dcb 07,00,00,00,07
 dcb 07,00,00,00,07
 dcb 00,03,03,03,00

The lab we created a shape that will bounce around the screen.

To do this, we created a x/y coordinator location function to locate the screen using X and y:

code:


get_ptr:
	lda row
	sta counter
start:
	lda counter
	cmp #$00
	beq draw_return
	ldy counter
	dey
	sty counter
	lda index
	clc	
	adc #$20
	sta index
	bcs increase_page
	jmp start

draw_return:
	clc
	lda index
	adc col
	sta index
	lda #$00
	clc
	clv
	sta counter
	rts
 
increase_page: 
	inc page
	jmp start

This function takes col and row as X and Y value. After processed, it stores the pointer that point to the address that is map to the screen pixel to index and page.

By using this function, we are able to locate a pixel on the screen using X and Y coordinator value.

After we can locate a pixel on the screen, we need to make the pixel move around the screen.

Code:

move: 
	jsr slower 
	lda x_flag
	cmp #$00
	bne dec_col
	inc col
	lda col
	cmp #$1a
	bne goon
	jsr set_x_flag
	jmp goon
dec_col:
	dec col
	lda col
	cmp #$00
	bne goon
	jsr clr_x_flag
	jmp goon
goon:
	lda y_flag
	cmp #$00
	bne dec_row
	inc row
	lda row
	cmp #$1a
	bne done_mv
	jsr set_y_flag
	jmp done_mv
dec_row:
	dec row
	lda row
	cmp #$00
	bne done_mv
	jsr clr_y_flag
	jmp done_mv
done_mv:
	jsr backup_ptr
	jsr reset_ptr 
	jsr get_ptr
	jsr clr_px_at_backup_ptr
	lda #$00
	sta row_counter
	ldy #$00
	ldx #$00
	lda page
	sta draw_page
	lda index
	sta draw_index
	jsr draw
	
	jmp move
        brk

This function simply increase or decrease the X and Y value. We have 2 indicators to indicate if the X or Y needs to be increased or decreased. After the pixel reaches the edge of the screen, we reverse the indicators

After we got this 2 functions, we almost got the lab!

We just draw the shape from pixel, then clear the screen, and move the pixel, then draw it again. finally, we got the code that we have at the beginning of the blog

Modern CPU and OS – System call

Think about a situation, program A wants to send an message to screen.

As we know, the modern CPUs are running on protected mode, which means that program A is not able to perform that task by itself – program A cannot access the hardware, OS, other program memory area, etc.

To finish this task, the program needs some help from the OS, which is able to access hardware, or has the fully control of the CPU so that enables the CPU to work in non-protected mode.

Here is helloworld.c:

#include <stdio.h>
int main(){
    printf("hello world!");
    return 0;
}

The program send “hello world!” to the screen.

To do this, it invoked a function called printf() from c standard library.

But is that possible that we can send a message without using printf(), or without including <stdio.h>?

/* Hello World with a direct write to stdout (file descriptor 1) */

#include <unistd.h>

int main() {
	write(STDOUT_FILENO,"Hello World!\n",13);
}

This still works.

We write “Hello world!” to STDOUT_FILENO, which is 1 in Linux. In this case, we give our string to Linux system, can Linux put this string in a file of STDOUT_FILENO. Linux monitoring the STDOUT_FILENO, if there is anything in there, the system just print it to screen.

But, more directly?

#include <unistd.h>
#include <sys/syscall.h>

int main() {
	syscall(__NR_write,1,"Hello World!\n",13);
}

This still works!

Here, we performed a system call. When system received the system call, the system will do the same thing that it did for us during the last example.

but we still included some libraries.

Is there any possible that we can do it without including any libraries?

int main() {
   char* c = "HELLO WORLD!\n";
   int len = 13;
   int result = 0;
   register int    syscall_no  asm("rax") = 1;
   register int    arg1        asm("rdi") = 1;
   register char*  arg2        asm("rsi") = c;
   register int    arg3        asm("rdx") = 14;
   __asm__ volatile(
      "syscall" :
      "=a" (result)
   );
}

This is it.

We did not include any libraries, but it still output HELLO WORLD to screen. This time, can we say that our program directly output the string to the screen?

The answer is no.

If we analysis the program, we can see there is a word: syscall.

After we passed the necessary information to the registers, we performed a system call.

In this case, the CPU still moves the control from our program to the Linux system.

How? We did not tell the CPU to do anything with Linux system involved.

In fact, the relationship between CPU and the system is created when the system is booting.

Before the CPU enters protected mode, the Linux system tells the CPU a table that contains many system calls.

In the table, read() is 0, write() is 1.(X86_64)

when the CPU execute system call, the CPU get a number from %rax, then the CPU open the table that it got when the system is booting.

In %rax is 1, CPU found it in the table, it is write()! Then the CPU move the control to the system, execute the code of write() that provided by the system. After it’s done, the system move the control of CPU back to our program!

In other word, the system call is a bridge between our program and other program area. If our program want’s to access other memory, the program can only perform system calls and the system calls follows the pre-defined behaviors.

This how the CPUs and Memory protected our PCs.