This is another post about a modern Linux-based performance measurement utility. It is called perf, it lives in the Linux kernel sources, and it entered the kernel in v2.6.31-rc1. In many ways it is making OProfile obsolete; in fact, on many architectures oprofile is just a wrapper around the perf support in the kernel. perf comes with a few nice applications: perf top shows statistics about which symbols in user and kernel space are being called, perf record records a running application or starts an application and records it, and perf report lets you browse the recording with a very simple CLI utility. There are also tools to bundle the recording and the application into an archive, and a diff utility.
For the last year I have been playing a lot with GNU Smalltalk, and someone posted the results of a very simplistic VM benchmark run across many different Smalltalk implementations. In one of the benchmarks GNU Smalltalk scores last among the interpreters and I wanted to understand why it is slower. In many ways the GNU Smalltalk interpreter is a lot like the JavaScriptCore one: a simple direct-threaded bytecode interpreter that uses computed goto (it is even compiled with -fno-gcse as indicated by the online help, not that this changed anything for JSC) and heavily inlines many functions.
There are also some differences. The GNU Smalltalk implementation is a lot older and written in C. The first notable difference is that it is a stack machine and not register based, and there are global pointers for the SP and the IP. There is some magic to make sure that in the hot loop the IP/SP are 'local' in a register and, depending on the available registers, the current argument is kept in one as well. The interpreter definition is in a special file format but is mostly similar to what Interpreter::privateExecute looks like. The global state mostly comes from the fact that it needs to support switching processes, and some event during the run might require access to the IP to store it so that the old process can be resumed later. But in general the implementation is already optimized, there is little low-hanging fruit, and most experiments result in a slowdown.
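To make the "global pointers, local copies in the hot loop" idea concrete, here is a minimal sketch of a stack-machine loop dispatched with computed goto. It is not GNU Smalltalk's code: the opcode set and the names vm_run, vm_ip, vm_sp and vm_stack are hypothetical, it assumes well-formed bytecode (no stack bounds checks), and it looks up a label table per opcode rather than pre-translating the bytecode into label addresses as a truly direct-threaded interpreter would. It needs GCC or Clang, since labels as values are a GNU C extension.

```c
/* Hypothetical opcode set, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Global VM state, visible to code outside the loop (e.g. a process switch). */
static const unsigned char *vm_ip;
static long vm_stack[64];
static long *vm_sp = vm_stack;

long vm_run(const unsigned char *code)
{
    /* One label address per opcode, indexed by the opcode value. */
    static void *const dispatch[] = { &&do_push, &&do_add, &&do_halt };

    /* Work on local copies so the compiler can keep IP and SP in registers. */
    const unsigned char *ip = code;
    long *sp = vm_stack;

/* Fetch the next opcode and jump straight to its handler. */
#define NEXT() goto *dispatch[*ip++]

    NEXT();

do_push:                /* OP_PUSH <byte>: push a small constant */
    *sp++ = *ip++;
    NEXT();

do_add:                 /* OP_ADD: replace the top two values with their sum */
    sp--;
    sp[-1] += sp[0];
    NEXT();

do_halt:                /* OP_HALT: write the cached pointers back, return top of stack */
    vm_ip = ip;
    vm_sp = sp;
    return sp[-1];

#undef NEXT
}
```

Calling vm_run on the byte sequence OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_HALT leaves 42 on top of the stack. The write-back in do_halt stands in for the points where a real interpreter has to publish IP/SP to the global state before something else can look at them.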
The two important things are again: having a stable benchmark, and having a tool that helps you know where to look. In my case the important tools are perf stat, perf record, perf report and perf annotate. I have put a copy of the output at the end of this blog post. The stat utility reports the number of instructions executed, branches, branch misses (i.e. badly predicted branches), and L1/L2 cache hits and cache misses.
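The values perf stat prints after the '#' are simple ratios of the raw counters, which is handy to know when comparing two runs by hand. A small sketch of that arithmetic, with hypothetical helper names that are not part of perf:

```c
/* Instructions retired per CPU cycle (the "IPC" column). */
double instructions_per_cycle(double instructions, double cycles)
{
    return instructions / cycles;
}

/* Share of branches that were mispredicted, in percent. */
double branch_miss_percent(double branch_misses, double branches)
{
    return 100.0 * branch_misses / branches;
}

/* Fraction of wall-clock time the task was actually on a CPU. */
double cpus_utilized(double task_clock_msecs, double elapsed_seconds)
{
    return task_clock_msecs / (elapsed_seconds * 1000.0);
}
```

Fed with the counters from the output at the end of this post, these give roughly 1.827 instructions per cycle (40593277297 / 22215005506), 1.396 % branch misses (70691940 / 5063469832) and 0.969 CPUs (17280.101683 ms / 17.838888599 s), matching what perf printed.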
The stable benchmark helps me judge whether a change is good, bad or neutral for performance within the margin of error of the test. E.g. if I attempt to reduce the code size, the number of instructions executed should decrease, and if I start putting __builtin_expect.. into my code, the number of branch misses should go down as well. The other useful utility is perf report, which allows one to browse the recorded data. This helps to identify the methods one wants to start optimizing, and it can annotate these functions inside the simple TUI, but it does not support searching in it.
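For readers who have not used it, this is roughly what a __builtin_expect hint looks like. The likely/unlikely macros follow the convention used in the Linux kernel; the function itself is a made-up example and assumes GCC or Clang.

```c
#include <stddef.h>

/* Tell the compiler which outcome of a condition is the common one. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum values until the first negative one,
 * which is assumed to be rare. */
long sum_until_negative(const int *values, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++) {
        /* Hint: the early exit is the cold path. */
        if (unlikely(values[i] < 0))
            break;
        total += values[i];
    }
    return total;
}
```

The hint only changes how the compiler lays out the code, not what it computes, so the result is the same either way. Whether it helps is exactly what the branch-misses line of perf stat is for: a hint that does not match the real data can make things worse.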
Because the codebase is already highly optimized, any of my attempts should either decrease the code size (and the pressure on the i-cache), decrease the data size (d-cache), remove stores or loads from memory (e.g. by reordering instructions), or fix branch predictions. The sad truth is that most of my changes were either slowdowns or neutral for performance, and it is really important to undo such changes and not have false pride (unless the change was also a code cleanup or such).
So after about 14 hours of toying with it, the speedups I have managed to make come from inlining a method to unwind a context (call frame), reordering some compares on the GC path and disabling the __builtin_expect branch hints, as they were mostly wrong (something the kernel people found to be true in 2010 as well). I will just try harder, or try to work on the optimizer, or attempt something more radical...
$ perf stat gst -f Bench.st
219037433 bytecodes/sec; 6025895 sends/sec

 Performance counter stats for 'gst -f Bench.st':

   17280.101683  task-clock-msecs   #      0.969 CPUs
           2076  context-switches   #      0.000 M/sec
            123  CPU-migrations     #      0.000 M/sec
           3925  page-faults        #      0.000 M/sec
    22215005506  cycles             #   1285.583 M/sec  (scaled from 70.02%)
    40593277297  instructions       #      1.827 IPC    (scaled from 80.00%)
     5063469832  branches           #    293.023 M/sec  (scaled from 79.98%)
       70691940  branch-misses      #      1.396 %      (scaled from 79.98%)
       27844326  cache-references   #      1.611 M/sec  (scaled from 20.02%)
         134229  cache-misses       #      0.008 M/sec  (scaled from 20.03%)

   17.838888599  seconds time elapsed
PS: The perf support probably works best on Intel-based platforms, and the biggest other problem is that perf annotate has some issues when the code is included from other C files.
Sunday, February 13, 2011
Friday, February 11, 2011
Nokia and Windows Phone 7
I had the opportunity to play with MeeGo for Handsets 1.1 and the upcoming 1.2 in December of last year on a Nokia N900. It very much reminded me of Openmoko before I joined them. There were a lot of promises and dreams, but the reality looked different. The Handset 1.1 release was not working at all: besides being very slow, the xterm and the window manager couldn't quite agree. The pre-releases of 1.2 worked a bit better but are still far away from being usable. The SDK situation is not much better. The madde approach is very promising and the tools to recreate the SDK actually work, but the Qt installation differs depending on whether you are using the X86 or the ARM SDK. E.g. the WebKit headers are in the X86 SDK but not in the ARM SDK. The reason is that the two SDK package descriptions are in different files and they do not get synced.
Nokia announced today that it is entering a strategic partnership with one of the biggest software companies in the world, Microsoft. I am glad Nokia doesn't try to turn Symbian into a real operating system, and I understand that after throwing out the GTK+ platform and restarting with Qt, MeeGo is not close to challenging Android/iPhone. Nokia has missed out on purchasing WebOS from Palm, and the best alternative really is to go with Microsoft Windows Phone 7. On the other hand it is scary that Windows Phone 7 is the best alternative.
Wednesday, February 02, 2011
Wireshark dissector for TETRA
The Professional Mobile Communication Research Group of the Beijing Institute of Technology (BIT) was kind enough to send us their TETRA Wireshark dissector. They went through the specification and created ASN.1 files out of the tables. I helped with the integration and cleanup to get the code into Wireshark, the Wireshark developers were kind enough to do a fast review, and the code is now merged.
The next part is to extend the GSMTAP dissector to pass the TETRA bits on to the TETRA decoder. I have already written the code; now I need to get it merged into the Wireshark codebase too.
My TETRA agenda includes figuring out majority-logic decision decoding with syndromes, which is required for the shortened Reed-Muller code used in the AACH, and creating a proper GNU Radio block for the decoder that emits soft bits we can feed into our Viterbi decoder... maybe I will manage to do this during my vacation in Taiwan... who knows.