Analytics

Sunday, February 28, 2010

Things I didn't know about RTP and AMR

Oh my god... Looking at an AMR payload wrapped in RTP, wrapped in UDP, wrapped in IP, one will recognize that the headers add an overhead of more than 70% compared to the payload. The other thing is that the AMR payload can carry a CRC or not, but this is not indicated in the RTP header and must be signalled out of band. It took someone a while to figure that out. Hurray for having clever colleagues.
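To get a feeling for the numbers, here is a rough sketch of the arithmetic. It assumes IPv4 without options, RTP without CSRC entries or extensions, and the octet-aligned AMR payload format with one frame per packet and no CRC. The function names are made up for this example, and the exact ratio depends on the AMR mode and on the payload format that was negotiated.

```python
import math

# Assumed fixed header sizes in bytes (no IP options, no RTP CSRC/extension).
IPV4_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12

# Speech bits per 20 ms frame for the narrowband AMR modes (kbit/s -> bits).
AMR_SPEECH_BITS = {
    "4.75": 95, "5.15": 103, "5.90": 118, "6.70": 134,
    "7.40": 148, "7.95": 159, "10.2": 204, "12.2": 244,
}


def amr_payload_bytes(speech_bits):
    """Octet-aligned payload: 1 byte CMR + 1 byte ToC + padded speech data."""
    return 1 + 1 + math.ceil(speech_bits / 8)


def header_overhead(speech_bits):
    """Return (header bytes, payload bytes, headers relative to payload)."""
    headers = IPV4_HEADER + UDP_HEADER + RTP_HEADER
    payload = amr_payload_bytes(speech_bits)
    return headers, payload, headers / payload


if __name__ == "__main__":
    for mode, bits in AMR_SPEECH_BITS.items():
        headers, payload, ratio = header_overhead(bits)
        print(f"AMR {mode}: {headers} header bytes, {payload} payload bytes, "
              f"headers are {ratio:.0%} of the payload")
```

Whether the payload carries a CRC cannot be derived from any of these header bytes, so the receiver has to learn it from the session setup.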

Dealing with Performance Improvements

I hope this post is educational and helps those among us who do performance optimisations without any kind of measurement. If you do these things without a benchmark you are either a genius or, very likely, your application is going to run slower. I'm not going to talk about performance analysis right now, but tools like OProfile, callgrind, sysprof and speedprof are very handy utilities. The reason I'm writing this up is that I saw a performance regression in one of my testcase reductions, which is something I don't appreciate. In general I see a lot of claims about performance tuning but very little in the way of measurements, and that part is very worrying.

For QtWebKit we have the performance repository with utilities, high level tests and something I labeled reductions. In detail we do have the following things:


  1. Macros for benchmarking. I started with the QBENCHMARK macros but they didn't really provide what I needed, and changing them turned out to be a task I didn't have time for. So I created WEB_BENCHMARK macros that work the same way as the QBENCHMARK macros. One of the benefits is better statistics: the mean, standard deviation and similar figures are printed at the end of the run. They also use a different metric for measuring time. I'm using the setitimer(2) syscall to measure the CPU time spent in userspace and in kernelspace on behalf of the application. This metric is robust against issues like CPU scheduling. It would be the wrong metric for measuring latency though, as we are not executing anything while waiting.


  2. Pick the area you want to optimize. In the QtWebKit performance repository we have a set of reductions. These reductions consist of real code, a test pattern and test data. The real code comes from WebCore and drives Qt, the test pattern comes from loading real webpages. It is created by adding printf and such to the code, and the test data is the data that was used when creating the test pattern. We have these reductions for all image decoding operations we do on webpages, for our font usage and for our QTextLayout usage.
    The really awesome bit about these reductions is that they generate stable timings and are (or should be) fully deterministic. This allows me to really measure any change I make to, let's say, QImageReader and the decoders.
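The idea behind the benchmark macros can be sketched in a few lines. This is not the WEB_BENCHMARK implementation: it is a minimal Python stand-in that swaps setitimer(2) for time.process_time(), which also reports user plus system CPU time of the process and therefore ignores time spent sleeping or waiting. The names web_benchmark, summarize and busy_work are made up for this example.

```python
import statistics
import time


def web_benchmark(func, iterations=10):
    """Run func repeatedly and return the CPU time (user + system) per run."""
    samples = []
    for _ in range(iterations):
        start = time.process_time()
        func()
        samples.append(time.process_time() - start)
    return samples


def summarize(samples):
    """Compute the statistics that get printed at the end of a run."""
    return {
        "runs": len(samples),
        "mean": statistics.mean(samples),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def busy_work():
    """Hypothetical stand-in for a reduction: deterministic CPU-bound work."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    stats = summarize(web_benchmark(busy_work, iterations=5))
    print("runs={runs} mean={mean:.6f}s stdev={stdev:.6f}s "
          "min={min:.6f}s max={max:.6f}s".format(**stats))
```

Because only CPU time is counted, a benchmark that mostly waits on the network would look almost free here, which is exactly why this metric is the wrong one for latency.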



Using the setitimer(2) syscall we get a pretty accurate CPU usage of the benchmark, and using /lib/libmemusage.so from GLIBC we should get an accurate graph of the memory usage of the application. It is simple to create a benchmark, it is simple to run the benchmark and it is simple to run the benchmark with memory profiling. By looking at both CPU and memory usage it becomes pretty clear if and where you have tradeoffs between memory and CPU.

And I think that is the key to a benchmark. It must be simple so people can understand what is going on, and it must be simple to execute so everyone can do their own measurements and verify your claims. And especially having a benchmark and having people verify your measurements keeps you honest.

Finally, the commit message should state that you have measured the change, it should show the result of the measurement and it should contain some interpretation, e.g. you are optimizing for memory usage and therefore a small CPU usage hit is acceptable...

Friday, February 26, 2010

Explorations in the field of GSM

Something like 14 months ago I had no idea about GSM protocols. 12 months ago I was implementing paging for OpenBSC. Since last summer I have explored SS7 and SCCP, written a simple SCCP stack for On-Waves and started to implement the GSM A interface for OpenBSC. Last week I found myself learning more about MTP Level 3. With Osmocom I am starting to explore GSM Layer 1 (TDMA, bursts, syncing) and GSM Layer 2 (LAPDm), and on GSM Layer 3 we mostly see the counterpart of OpenBSC.

I feel like I am back at school (in a positive way). I have learned a lot in the past year, and looking forward I will learn more about the protocols used on the MSC side and such. I'm very excited about what the future is going to be like. Will we have a complete GSM network (BTS, BSC, MSC, MS, SMSC, GPRS gateway(s)) with GPL software by the end of the year?

Thursday, February 11, 2010

Conclusions of my QtWebKit performance work

My work on QtWebKit performance came to a surprising end late last month. It might be interesting for others to see how QtWebKit compares to the various other WebKit ports, where we have some strong points, where we have some homework left to do and where to pick up from where I had to leave it.

Memory consumption


Before I started, our ImageDecoderQt was decoding every image as soon as the data was complete. The biggest problem with that is that the ImageSource we are embedded into does not tell the WebCore::Cache about the size of the images we have already decoded.

In this case there is no need to decode the whole image as soon as the data comes in. Instead we can wait for the ImageSource to request the image size and the image data. This makes a noticeable difference in memory benchmarks and allows the WebCore::Cache to control the lifetime of decoded image data.
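The lazy approach can be illustrated with a small sketch. This is not the ImageDecoderQt code: LazyImageDecoder and the toy image format (first byte width, second byte height, the rest pixel data) are made up for this example. The point is only that appending data allocates nothing decoded, and that the decoded bytes can be reported to and dropped by whoever manages the cache.

```python
def toy_decode_size(data):
    """Toy format, assumed for this example: byte 0 is width, byte 1 is height."""
    return data[0], data[1]


def toy_decode_pixels(data):
    """Toy format: everything after the two size bytes is pixel data."""
    return bytes(data[2:])


class LazyImageDecoder:
    """Hypothetical decoder that postpones decoding until someone asks."""

    def __init__(self, decode_size=toy_decode_size, decode_pixels=toy_decode_pixels):
        self._data = bytearray()
        self._decode_size = decode_size
        self._decode_pixels = decode_pixels
        self._size = None
        self._pixels = None

    def append_data(self, chunk):
        # Only buffer the encoded bytes, do not decode anything yet.
        self._data.extend(chunk)

    def size(self):
        # Asking for the size must not force a full decode.
        if self._size is None:
            self._size = self._decode_size(self._data)
        return self._size

    def pixels(self):
        # The full decode happens on the first request for image data.
        if self._pixels is None:
            self._pixels = self._decode_pixels(self._data)
        return self._pixels

    def decoded_bytes(self):
        # What a cache would need to know to account for this image.
        return 0 if self._pixels is None else len(self._pixels)

    def clear_decoded(self):
        # Lets the cache drop decoded data; it can be decoded again later.
        self._pixels = None
```

A caller would append the network data, ask for size() during layout, and only pay for pixels() once the image is actually painted.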

We still have one case where we have more image data allocated than the WebCore::Cache thinks. This is the case for GIF images, as we are decoding every frame to figure out how many frames we have.

To fix that we should patch the ImageSource to ask the ImageDecoder for "extra" allocated data, and we should fix/verify the GIF Image Reader so we can jump to a given GIF frame and decode it. This means we should remember where certain frames begin...
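Remembering where frames begin does not require decoding them. The following is a sketch under the assumption of a well-formed GIF stream, not a patch for the Qt GIF reader, and index_gif_frames is a name made up for this example. It walks the block structure of the file, skips over the compressed data sub-blocks and records the byte offset of every image descriptor.

```python
def _skip_sub_blocks(data, pos):
    """Skip a chain of length-prefixed sub-blocks, return the offset after it."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:  # a zero-length sub-block terminates the chain
            return pos
        pos += size


def _color_table_bytes(packed):
    """Size of a color table announced by a packed flags byte (0 if absent)."""
    if packed & 0x80:
        return 3 * (2 ** ((packed & 0x07) + 1))
    return 0


def index_gif_frames(data):
    """Return the byte offsets of all image descriptors without decoding."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF stream")
    pos = 6
    # Logical screen descriptor: 7 bytes, flags in the fifth byte.
    pos += 7 + _color_table_bytes(data[pos + 4])
    offsets = []
    while pos < len(data):
        block = data[pos]
        if block == 0x3B:  # trailer, end of stream
            break
        if block == 0x21:  # extension: introducer, label, then sub-blocks
            pos = _skip_sub_blocks(data, pos + 2)
        elif block == 0x2C:  # image descriptor: one frame starts here
            offsets.append(pos)
            # Separator + 8 bytes of geometry + flags, then optional local table.
            pos += 10 + _color_table_bytes(data[pos + 9])
            pos += 1  # LZW minimum code size
            pos = _skip_sub_blocks(data, pos)
        else:
            raise ValueError("unexpected block 0x%02X at offset %d" % (block, pos))
    return offsets
```

With such an index the frame count is simply len(offsets). Note that a real decoder would also have to look at the graphic control extension in front of each frame, and that a frame may depend on the frames before it, so jumping to an offset is not always enough to render it.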

Performance


Networking


Markus Götz and Peter Hartmann are busy working on the QNetworkAccessManager stack. Their work includes improving the parsing speed of HTTP headers and making sure HTTP connections are started after the first iteration of the mainloop instead of the third.

In one of my tests wget is still twice as fast as the Qt stack at downloading the same set of files, and wget is using one connection at a time with no pipelining, while Qt is attempting to have up to 6 connections in parallel. This means there is still some work to do in reducing latency and improving the scheduling of requests. I'm pretty confident that Markus and Peter will work on this!

Images


The biggest limitation of the Qt image decoders is that, in general, progressive loading is not possible. On the other hand, unless I have messed up my reduction, the Qt image decoders are faster than the ones we have in WebCore.

With some of my reductions I can make some things twice as fast for the usage pattern QtWebKit has on QImageReader. Currently, when asking the QImageReader for the size, the GIF decoder will decode the full frame (size + image data). For the JPEG decoder we start the JPEG decompression separately for getting the size, the image and the image format.

A proof of concept patch for the JPEGReader to reuse the decompression handler showed that I can cut the runtime of the image_cycling reduction by 50%.

Misc


One misc. performance goal is to remove temporary allocations, e.g. to remove QString::detach() calls from the paint path and to not copy data when moving from QString to WebCore::String or from QByteArray to WebCore::String. Some of this includes not using WebCore::String::utf8(), but having a zero cost conversion from WebCore::String to QString and using Qt's utf8()...

Text


But the biggest problem of QtWebKit performance is text, and I started to work on this. For Qt we always have to go through the complex text path of WebCore, which means we will end up at QTextLayout, which will ask harfbuzz to shape the text.

There are two things to consider here. For QtWebKit we are using Lars's QTextBoundaryFinder instead of ICU. I'm not sure if we have ever compared how ICU and QTextBoundaryFinder split text. We might do more work than is necessary, at least it would be good to know. Especially for Japanese and Korean we might split words too early, creating more work for our complex text layout path.

The second part is to look at our QTextLayout usage pattern and start to optimize for it... The quick solutions of asking QFont to not do kerning and to not do font merging (to not use the QFontEngineMulti) didn't really make a noticeable difference... To get an idea of the size of the problem: when loading pages like the Wikipedia article on the Maxwell equations, we are spending as much time in WebCore::Font::floatWidthForComplexText as other ports like WebKit/GTK+ take to load the entire page. This also seems to be the case for sites like Google News.


And this is exactly where I would have loved to continue to work on it, but that is now pushed back to my spare time where it needs to compete with the other hobby projects.