Thursday, April 22, 2010
Recipe for disaster
I really hate closed source systems. Every time I have to deal with one, it turns out to be a waste of time. Right now I'm dealing with a closed source BSS (I wish I could name names). I can make it fail by sending a very simple message, a message that is crucial for everything (SMS, calls) to terminate on the MS. Of course an expensive support contract was signed, but apparently it does not buy more than the right to create a ticket in a bug tracker and let it rot there. Never ever make your business depend on a black box!
Monday, April 19, 2010
Thoughts on Fedora 12 and 13
I was a bit unhappy with the performance of Ubuntu Karmic and Ubuntu Lucid (mostly the start scripts not handling filesystem errors at all...) and decided it was time to try an RPM based distribution again. I really don't like to wait for the download of a DVD and then upgrade the whole system with new software right after the installation. Coming from Debian I totally love the slick network installer. This means I have to be able to install my RPM based distro the same way, or I will stay with Ubuntu or go back to Debian. I tried doing this with openSUSE some time ago and it failed miserably, so now it was time for Fedora. So here is a random list of things I hate about Fedora.
- Netinstaller needing a 200MB download. One can build a whole GNOME image with that...
- The guided partitioner is nice, but it fails. It picks a 500MB boot partition, which is not enough if you want to use preupgrade later to upgrade your system (which silently fails when running out of disk space...). So I want to increase the boot partition's size, but I need to delete the LG and PV of LVM first... In the end my guided partitioning was all manual, just because I wanted to change the size of one partition... this could be handled way better.
- YUM is awfully slow and one is not encouraged to do the equivalent of an "apt-get dist-upgrade" with it; one needs to use an installer CD to do the upgrade. This seems really backward.
- I had to learn the wonders of rpm --rebuilddb during a failed upgrade... Something upgraded the Berkeley DB version or such... and the package database was corrupted.
- Removing PulseAudio (it takes like 50% of CPU time on my netbook) removes BlueZ as well...
Besides that, Fedora seems to be a robust and well maintained distribution which contains really recent versions and sometimes even the future... like SystemTap with a utrace enabled kernel to trace user applications.
Got a new MSI X340 Laptop
After the breakdown of my beloved MacBook I decided it was time for a new notebook. So I went out to the various markets and searched around. I was looking for something slim but still powerful. So could it be a MacBook Air? Looking at the specs revealed it contains NVIDIA graphics, so that is a no go. Then I saw a Sony X-Series... very very slim... then my math failed and I thought it was cheap... well, Sonys are never cheap and the repairs are always outrageous... I stumbled across the MSI X-Series and I really liked it...
So my X-340 has an Intel chipset and is a Centrino 2, with one of these ultra low voltage CPUs... and the Linux support rocks. Wireless, Bluetooth, camera (not that I need it), sound, touchpad, graphics (Intel GMA 4500) all work out of the box. The CPU seems to be powerful enough for what I want to do.
So far I seem to have made a good pick.
Sunday, April 18, 2010
GSM RACH Bursts and Paging Requests
Yesterday I had the pleasure of trying OpenBSC on a real network and the result was a disaster, but honestly, what else to expect when trying it for the first time. It is not that OpenBSC was crashing, leaking memory, or not recovering from failure; it is just that the load of the network was different from what I assumed, and that leads to problems.
What happens is that one sees a lot of location updating requests, which will load the SDCCH, but that is really fine and we have seen such things at the Chaos Congress. What is different is the result of the location updating requests: the network will flood us with paging requests... Right now we are sending up to 20 paging requests every two seconds... The first thing to notice is that this is too much for the nanoBTS. It is sending us a nice CCCH/ACCH/BCCH overload warning which we do not handle (we should start two timers and throttle the amount of messages we send). The other part is... if we are out of SDCCHs and ask 20 more phones a second to get one... we have created the RACH DoS that Dieter Spaar has done with a Mobile Station.
The Random Access Request contains the channel type and one to IIRC four bits of random numbers, so even if we have a free channel... it can happen that two phones both believe that we have assigned the channel to them... and then we see RF failures, which in turn will trigger the phone to try again (or we page it again)... and then nothing will work...
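To get a feeling for how quickly this goes wrong, here is a rough back-of-the-envelope sketch in Python. It assumes (my simplification, not something measured on the network) that all the phones in question send their request in the same RACH slot with the same channel type and pick their random bits uniformly, so only the random bits can tell them apart. The function name is made up for this example.

```python
def collision_probability(phones: int, random_bits: int) -> float:
    """Chance that at least two of `phones` pick the same random bits.

    Assumption for illustration: every phone draws its random bits
    uniformly and independently (a plain birthday-problem model).
    """
    values = 2 ** random_bits
    if phones > values:
        # More phones than distinct values: a clash is certain.
        return 1.0
    p_all_unique = 1.0
    for i in range(phones):
        p_all_unique *= (values - i) / values
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for bits in (1, 2, 3, 4):
        print(bits, "random bits, 2 phones:", collision_probability(2, bits))
```

Even with four random bits, two phones answering at the same moment clash about 6% of the time in this model, and the more phones we page at once, the worse it gets.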
The other observation is that if our cell is really busy we should start to assign TCHs to fulfill location updating requests...
So there are two changes I need to make. The first is to change the paging to not page as much as we can physically stuff into the PCH, but only as much as we can handle in responses (pretty obvious?). The other is to allocate a "bigger" channel in case we have no smaller channel... E.g. use the number of free channels divided by X for paging requests...
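The two changes could look roughly like the following Python sketch. This is not OpenBSC code: the function names, the fallback order and the divisor are assumptions for illustration, and the cap of 20 is simply the "up to 20 paging requests" figure from above.

```python
from typing import Dict, Optional

# Assumed preference order: smallest channel first, then "bigger" ones.
FALLBACK_ORDER = ["SDCCH", "TCH/H", "TCH/F"]


def paging_budget(free_channels: int, divisor: int, physical_max: int = 20) -> int:
    """How many paging requests to send in the next round.

    Page according to what we can answer (free channels divided by X),
    never more than what physically fits (physical_max).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return max(0, min(physical_max, free_channels // divisor))


def pick_channel(free: Dict[str, int]) -> Optional[str]:
    """Pick the smallest free channel type, falling back to bigger ones."""
    for chan_type in FALLBACK_ORDER:
        if free.get(chan_type, 0) > 0:
            return chan_type
    return None  # cell is completely full


if __name__ == "__main__":
    print(paging_budget(free_channels=8, divisor=2))
    print(pick_channel({"SDCCH": 0, "TCH/F": 3}))
```

With no free channel at all the budget drops to zero, so we stop inviting phones onto a RACH we cannot serve.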
Tuesday, April 13, 2010
Traveling woes..
I'm just back at the flat. On this trip (in order) my luggage didn't arrive with the same airplane, my cellphone broke during the flight (the headset speaker is dead), my laptop backlight broke, and my luggage didn't arrive at the final destination... So after doing a backup of the data and removing the disk I will bring my laptop to an Apple dealer, and over the weekend I will try to get my phone replaced... and my luggage should be home later today as well...
Besides that the traveling was fine...
Sunday, April 11, 2010
Hacking on OpenBSC
I was invited to visit the On-Waves (they have a shiny new website) office in Paris this week and I was quite busy hacking away on the OpenBSC codebase. On-Waves allows me to play a bit with their MSC and learn more about GSM and in exchange OpenBSC gains a more and more complete and stable GSM A-Interface.
When developing code for OpenBSC we are mostly sitting very close to the BTS, have only one active subscriber, test one thing, restart, test another thing, restart. But as with any piece of software I'm writing, I want OpenBSC to be rock solid: run unattended for years, have no memory leaks, deal with the nanoBTS going away and coming back and the MSC going away and coming back, all this at any point in time. So far events like Hacking at Random and the Congress have been the ideal testing ground, with many different handsets, subscribers, etc.
My testing was limited to a small set of handsets connected via USB and executing AT commands for call handling and sending SMS. I'm addressing subscribers on the same cell. That means whenever I do a call I have mobile originated and mobile terminated testing covered, and this is done by funny chat scripts that work most of the time. The next thing is to simulate failure. For the cases where a specific layer3 message would have to be sent, we have to wait for a more complete OsmocomBB, so what I can easily do is cut off TCP connections. I have done this with another piece of weird shell magic: I take the output of $RANDOM, treat it as seconds, and then use kill -SIGUSR2 `pidof bsc_msc_ip` to close the MSC connection at a random time. And then I let it run and wait for failures.
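For the record, the shell magic boils down to something like this Python sketch. The function names are made up; the upper bound of 32767 is the range of bash's $RANDOM, and the hooks for finding PIDs, sleeping and sending the signal are only parameters so that it can be tried without a running bsc_msc_ip.

```python
import os
import random
import signal
import subprocess
import time


def pid_of(name):
    """Return the PIDs of all processes called `name`, using pidof(8)."""
    out = subprocess.run(["pidof", name], capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]


def drop_msc_connection_at_random(max_wait=32767, find_pids=pid_of,
                                  sleep=time.sleep, send=os.kill):
    """Wait a random number of seconds, then send SIGUSR2 to bsc_msc_ip."""
    delay = random.randint(0, max_wait)
    sleep(delay)
    pids = find_pids("bsc_msc_ip")
    for pid in pids:
        send(pid, signal.SIGUSR2)
    return delay, pids


if __name__ == "__main__":
    while True:
        delay, pids = drop_msc_connection_at_random(max_wait=600)
        print("waited", delay, "seconds, signalled", pids)
```

The endless loop at the bottom is the "let it run and wait for failures" part; the 600 second limit there is an arbitrary pick.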
I have fixed a bug/issue in the way we release a channel. There are multiple things involved. The first is instructing the BTS to open a given channel on a timeslot or to close it (RF Channel Release in RSL). The other part is that on the channel one can have logical applications running (SAPI); this can be call control (SAPI=0) and SMS (SAPI=3). When opening a connection to a Mobile Station (MS), SAPI=0 is always established; when attempting to deliver an SMS we need to open SAPI=3 first.
Now our issue with bringing this down was that whenever we got a SAPI release confirm (we asked for the release and it was released) or a release indication (the MS closed it), we used to respond with an RF Channel Release. So when trying to bring down a connection where we had delivered an SMS, we would issue an RF Channel Release twice and the nanoBTS ACKed it twice! To make matters worse, whenever we get an RF Channel Release ACK we mark the channel as free. We had this small window where we got the first RF Channel Release ACK, allocated the channel again, and then got the second RF Channel Release ACK.
I have fixed this issue in multiple ways. The first is to use the T3111 timer to delay issuing the RF Channel Release. The second is to handle (RF) failures by "blocking" the lchan for a short time so it can absorb multiple errors and release ACKs. The last bit is to properly bring down the channel: when we have SAPI!=0 we bring that down first, then we send SACCH deactivate, followed by the SAPI=0 release, and then finally we send the RF Channel Release. This makes things more reliable on our side, but we need to fix some more things. There is a FIXME inside gsm_04_08_utils.c that mentions the start of a T3109 timer. In any case, when sending a SAPI release the BTS will answer with success or a timeout, and we handle both.
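The ordering and the double ACK problem can be shown with a toy model in Python. This is not the OpenBSC code: the class, its state names and the log strings are made up for illustration, and the waiting for the release confirms and the T3111 timer between the steps is left out completely. It only shows the order of the messages and why a stale ACK must not free a channel that is in use again.

```python
class Lchan:
    """Toy logical channel: ACTIVE -> RELEASING -> FREE -> ACTIVE ..."""

    def __init__(self):
        self.state = "ACTIVE"
        self.sapis = {0}   # SAPI=0 is always established
        self.log = []      # messages we would send to the BTS

    def release(self):
        """Bring the channel down in order and request the release once."""
        if self.state != "ACTIVE":
            return  # a release is already in progress, do not send it twice
        self.state = "RELEASING"
        for sapi in sorted(self.sapis - {0}, reverse=True):
            self.log.append(f"RELEASE REQUEST SAPI={sapi}")
        self.log.append("DEACTIVATE SACCH")
        self.log.append("RELEASE REQUEST SAPI=0")
        self.log.append("RF CHANNEL RELEASE")

    def on_rf_release_ack(self):
        """Only an ACK for a release we are waiting for frees the channel."""
        if self.state != "RELEASING":
            return False  # duplicate or stale ACK, ignore it
        self.state = "FREE"
        self.sapis = set()
        return True

    def activate(self):
        """Hand the free channel to the next subscriber."""
        assert self.state == "FREE"
        self.state = "ACTIVE"
        self.sapis = {0}


if __name__ == "__main__":
    chan = Lchan()
    chan.sapis.add(3)                 # we delivered an SMS on this channel
    chan.release()
    print(chan.log)
    print(chan.on_rf_release_ack())   # first ACK frees the channel
    chan.activate()                   # channel is allocated again
    print(chan.on_rf_release_ack())   # second ACK must not free it
```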
Today I addressed losing the RSL or OML connection to the nanoBTS, making sure we reconnect and do not leak any memory. This took me most of the day to get stable, and I found a bug or such inside osmocore/select.c when releasing a bsc_fd that is the last one in the list. The difficulty here is making sure we do not leak memory, close all file descriptors, close all channels that live on the RSL connection, and make sure that when the BTS is up again we can use the channels that were allocated during the failure. To help with testing I added two commands to our vty interface to drop the OML or the RSL connection of a given BTS. The other thing that was helpful is to use Linux's Netfilter to drop packets on a TCP connection and wait for a failure. Now I can simulate most of the network failures easily and could build some trust.
And my final wishlist item would be to have like 16 GTA02 boards, use FSO on each and run a simple script to dial, send SMS and pick up phone calls. This would allow me to heavily test the network in an automated way. On top of that it would be nice to have an OsmocomBB enabled Calypso or C123, and then I could even send messages that are normally not sent at all. And thanks to Free Software development I'm sure we are going to reach that goal.
Thursday, April 08, 2010
GSM Fail...
Hi,
okay... I have a simple task... dial some numbers, drop some SMS. Right now I'm using a wacky mix of shell and chat scripts to do it, but it is not as nice and reliable as it should be. So I was trying ofono and FSO to use my serial device to do some calls for me... Or at least I tried to. FSO was relatively easy to install on Debian, having a nice readme, picking a GSM driver... well, and then... how do I specify the config value... looking at the source, finding stuff... taking an old example from Charlie... nothing... hmm... Okay, on to ofono...
So ofono is written to make the life of designers easier; it must be so easy that there is no documentation required... after finding some examples in the test directory of the code... I sent one message to enable the power of my modem and hoped the VoiceCallManager would come up... instead I see a segfault in ofonod... *sigh*
I think I will write a custom application to send and parse some simple AT commands for now... how frustrating.
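The parsing half of such an application could start out like this Python sketch. It is only a guess at a minimal design: it assumes the modem answers with CR/LF separated lines that end in a final result code such as OK, ERROR or +CME ERROR, and the function name is made up. Talking to the serial device itself is left out.

```python
FINAL_OK = {"OK"}
FINAL_ERROR = {"ERROR", "NO CARRIER", "BUSY", "NO ANSWER", "NO DIALTONE"}
ERROR_PREFIXES = ("+CME ERROR:", "+CMS ERROR:")


def parse_at_response(raw):
    """Split a modem reply into (status, lines).

    status is "OK", "ERROR", or None when no final result code has been
    seen yet (keep reading from the serial port in that case).
    """
    lines = []
    for line in raw.replace("\r", "\n").split("\n"):
        line = line.strip()
        if not line:
            continue
        if line in FINAL_OK:
            return "OK", lines
        if line in FINAL_ERROR or line.startswith(ERROR_PREFIXES):
            lines.append(line)
            return "ERROR", lines
        lines.append(line)  # informational line such as "+CSQ: 21,99"
    return None, lines


if __name__ == "__main__":
    print(parse_at_response("\r\n+CSQ: 21,99\r\n\r\nOK\r\n"))
    print(parse_at_response("\r\n+CME ERROR: 10\r\n"))
```

If the modem has command echo enabled, the echoed command shows up as the first informational line, so one would either switch echo off or skip that line.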