trufae <trufae gmail com> |
Thursday, April 24 2008 10:25.05 CDT |
I have managed to write a small, ugly, obfuscated and hacky program that measures the execution cost of different opcodes and builds a small table with the execution time of each one.
It runs the opcode given in the input file (one per line) 20K times and reports the mean time over 100 runs.
Not all opcodes can be tested this way (branches, traps and interrupts won't work, and pipeline/cache effects are not isolated), but it's quite useful for writing optimizations.
I think GCC has similar execution-cost tables for its own optimizations, but I prefer to recreate them on the real hardware.
This program is obviously targeted at UNIX boxes with the GNU compiler, but it should be effortless to port to Windows.
Here's a sample session and the source code:
$ ./opcost opcodes.txt
srl $t0, $t0, 8: 0 45
lb $t0, 8($sp): 0 25
lw $t1, 0($sp): 0 25
lw $t1, 16($sp): 0 31
sll $t0, $t1, 16: 0 18
move $v0, $v1: 0 18
li $v0, 33: 0 18
add $v0, $v0, $v0: 0 45
mult $v0, $v0: 0 45
nop: 0 18
$ cat opcost.c
/**
* Opcode cost payload calculation
*
* pancake <pancake/A/youterm/D/com>
*/
#include <stdio.h>
#include <stdlib.h>	/* getenv(), system() */
#include <string.h>
#define TIMES "20560"
#define LAUNCHER ""
/* generated program: runs the opcode TIMES times per iteration and keeps a
 * running average over the timed iterations (the first one is discarded) */
char *prg = "echo '#include <sys/time.h>\n#include <time.h>\n#include <stdio.h>\n"
"int main() { int i, a=0,b=0; struct timeval tv,tv2;"
" for(i=0;i<100;i++) { gettimeofday(&tv, NULL);"
" asm(\"\\n.rept "TIMES"\\n%s\\n.endr\\n\"); gettimeofday(&tv2, NULL);"
" if (i==0) continue; /* skip first iteration (cold caches) */"
" /* printf(\"%%d %%d\\n\", tv2.tv_sec-tv.tv_sec, tv2.tv_usec - tv.tv_usec); */"
" if (i==1) { a = tv2.tv_sec-tv.tv_sec; b = tv2.tv_usec-tv.tv_usec; }"
" else { a=(a+tv2.tv_sec-tv.tv_sec)>>1; b=(b+tv2.tv_usec-tv.tv_usec)>>1; }"
" } printf(\"%20s:\t %%d %%d\\n\", a, b); return 0;}' > .tst.c && gcc %s .tst.c -o .a.out"
" && "LAUNCHER" ./.a.out && rm -f .a.out .tst.c";
int main(int argc, char **argv)
{
	FILE *fd;
	char buf[4096]; /* must hold prg plus two copies of the opcode line */
	char line[1024];
	const char *cflags;
	if (argc < 2) {
		printf("Usage: opcost [file]\n");
		return 1;
	}
	fd = fopen(argv[1], "r");
	if (fd == NULL) {
		printf("Cannot open '%s'\n", argv[1]);
		return 1;
	}
	cflags = getenv("CFLAGS");
	if (!cflags) cflags = "";
	while (fgets(line, sizeof(line), fd)) {
		if (!*line || line[1] == '\0') /* stop at an empty line */
			break;
		line[strlen(line)-1] = '\0'; /* chop the trailing newline */
		snprintf(buf, sizeof(buf), prg, line, line, cflags);
		if (system(buf))
			break;
	}
	fclose(fd);
	return 0;
}
Have fun!
Why are you doing it in C and not in assembler? I mean, using RDTSC <opcode> RDTSC or an equivalent timestamping function if you are not on x86.
|
Because I want something portable and I don't want to spend my -precious- time rewriting the same 'shit' of a test for every architecture.
This is MIPS and there is no rdtsc-like opcode for it.
Writing it for a specific architecture/OS would give better precision. I know that embedding C and shell script inside a C program is not that clean.
BTW I'm not interested in the "real" execution time, because that is impossible to achieve this way. I just want a proportion between the execution times of all the analyzed opcodes (a ratio), which can be calculated by parsing the output.
BTW nice to read you! :) |
|
Okay, no rdtsc, but what about using clock() instead of gettimeofday()? It seems more reliable to me and, oh! it's portable too ;D |
|
Uhm, gettimeofday() is a syscall; I have no idea how clock() is implemented, but I prefer to call the kernel directly :) |
The problem with calculating an instruction's "execution cost" is that, due to pipelining (and other modern processor features), the speed of execution of an instruction is not only dependent upon the instruction itself, but also on the instructions executed beforehand. Hence, the term "instruction execution cost" is meaningless. This is why the Intel manuals no longer list cycle-level timing counts, and this is also why compilers beat hand-written assembly code in many cases (optimization is no longer as simple as "use a faster instruction"). Though you are talking about MIPS and not x86, the concept of pipelining applies to both RISC and CISC processors.
This is a complicated subject, but this article does a good job in explaining it. |
This is why the Intel manuals no longer list cycle-level timing counts
Intel lists the number of clock cycles in the Optimization Reference Manual, Appendix C |
[i]Though you are talking about MIPS and not x86, the concept of pipelining applies to both RISC and CISC processors.[/i]
The pipeline usage in MIPS and x86 is completely different, and the way to optimize code for it is CPU-specific (not all MIPS cores have the same pipeline rules).
I know that per-opcode execution cost is not an always-valid rule for choosing one opcode over another, but from a bottom-up point of view you can think in terms of per-instruction cost, per-context cost and per-block cost.
The instruction cost will change when shuffling instructions, because of the dependency penalty on the next opcode. On MIPS, for example, your jumps must be aligned to %32+24 to avoid refilling the pipeline in a sub-optimal way.
So these costs have little to do with reality (unless your program is composed of a single instruction :P).
It would be interesting (and not very hard) to write a program that checks the per-instruction, per-loop and per-block execution costs of an algorithm. That way it would be possible to see the pipeline and instruction penalties in a context-sensitive environment. (Just for analysis and fun.) |
|
Sorry for the slow response -- GTA IV, enough said. I stand corrected on the point about cycle counts being included in the manual, and thank you for pointing this out. Nonetheless my overall point is borne out by the footnotes in appendix C, which indicate that the cycle counts are for the worst-case scenario, due to pipelining and OOO, and are "not meant to be used ... for instruction-level performance benchmarks". By the way, MazeGen, thanks for geek32 -- I have found it handy for disassembling disassemblers. |
thanks for geek32 -- I have found it handy for disassembling disassemblers.
Good to hear it is useful for you. Just one question: can you actually read geek's cryptic operand codes, or do you merely need the mnemonics? I mean, coder32 is more useful in some cases. |
|