Interested in working with us on improving the performance of Linux on ARM? We’re looking for motivated engineers to work in our toolchain team on compiler technology, developer tools, and low-level performance libraries. You will use your specialised knowledge to work in the open, work upstream, and make ARM-flavoured improvements to a range of tools that are fundamental to making the latest mobile and server products.
We’re currently looking for:
We’re a distributed team. Working from the home office is an option.
Please see the careers page on our website for more information and how to contact us.
Raspbian is a hard float build of Debian wheezy for the ARMv6K processor in the Raspberry Pi. Here’s yet another set of instructions for running the image under QEMU, this time using the pre-built Linaro goodness that comes with Ubuntu Precise. This is a hack that happens to work – see Peter’s comment below for more.
The cliff notes are:
The kernel is an ugly hack based on CNXSoft’s notes. Basically:
- Install Linaro GCC: sudo apt-get install gcc-arm-linux-gnueabihf
- Grab the upstream 3.2 kernel. I used git and the v3.2 tag.
- Apply a hack to build an ARMv6 Versatile PB
- Use this .config
- Build: ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- make -j4 zImage
The ARM Realview emulation would be a better choice but I was short on time and a plain build didn’t boot.
One of my after-hours projects is to get my Traxxas Rustler XL-5 to take itself for a drive. The first steps are measuring the open loop response, including how the PWM drive into the motor controller turns into a no-load wheel RPM.
I don’t have a tacho but I do have a web cam, an open source vision library, and too much spare time. Here’s a video of the result:
Each scene is a step in the process. First is the colour image, then the image in HSV (to pick out the red no matter what the brightness), then the squared distance of each pixel from red (to highlight the tape), then thresholding (needed for contouring), then the contours, and finally a circle on the middle of the largest rotated rectangle.
Running the motor and plotting the X and Y of the centre gives a nice sin/cos plot. An FFT will give the frequency and so the RPM.
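As a rough sketch of that last step, the snippet below finds the strongest frequency in a list of centre X positions and converts it to RPM. It uses a plain discrete Fourier transform from the standard library rather than a real FFT, and the names `dominant_frequency` and `rpm_from_track` are made up for illustration. It assumes evenly spaced frames, one marker per wheel, and a wheel turning slower than half the frame rate, as anything faster will alias.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency in Hz of the strongest non-DC component."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]  # drop the DC offset

    best_bin, best_power = 0, 0.0
    # Only bins up to n/2 (the Nyquist limit) carry distinct frequencies.
    for k in range(1, n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                  for i, v in enumerate(centred))
        power = abs(acc) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def rpm_from_track(xs, frame_rate_hz):
    """Convert the tracked X positions of the marker into wheel RPM."""
    return dominant_frequency(xs, frame_rate_hz) * 60.0


# Synthetic check: a marker circling 5 times a second, filmed at 30 fps for 3 s.
xs = [320 + 100 * math.cos(2 * math.pi * 5 * i / 30) for i in range(90)]
rpm = rpm_from_track(xs, 30.0)
```

The resolution is one bin, or frame rate divided by the number of frames, so a longer capture gives a finer RPM estimate.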
Thanks to the OpenCV guys for their library!
I’m looking into profile guided optimisation (PGO) in GCC as a future topic for the Linaro Toolchain team. PGO works by having you build your program twice: once to instrument and record what the program actually does and then again using that profile to better optimise.
One optimisation is to track the values used in a function and special case the most frequent one. I was quite impressed with what GCC currently does:
- Rewrite divides and modulos: change a = b / c to if c == N then a = b / N else a = b / c
- Rewrite modulo a power of two: change a = b % c to if c == N and N is a power of 2 then a = b % N else a = b % c
- Rewrite an indirect call to direct: change (*callback)() to if callback == N then N() else (*callback)()
- Rewrite string operations of known length: change memcpy(a, b, c) to if c == N then memcpy(a, b, N) else memcpy(a, b, c)
GCC's later optimisations can then improve the special cases even further, such as changing a divide by a power of two to a shift or inlining the memcpy() completely instead of doing a function call.
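The idea behind the divide case can be sketched outside the compiler. The Python below only shows the control flow: record the divisors seen during a training run, pick the most frequent one, then guard a fast path on it. The helper names are hypothetical and this is not how GCC implements it internally. In C the gain comes from what the optimiser can do once the divisor is a constant.

```python
from collections import Counter


def most_frequent_value(observed):
    """Pick the value seen most often during the training run."""
    value, _count = Counter(observed).most_common(1)[0]
    return value


def make_specialised_divide(hot_divisor):
    """Build a divide that special-cases the profiled divisor."""
    is_pow2 = hot_divisor > 0 and hot_divisor & (hot_divisor - 1) == 0
    if is_pow2:
        shift = hot_divisor.bit_length() - 1

        def divide(b, c):
            if c == hot_divisor:
                return b >> shift      # fast path: divide became a shift
            return b // c              # fallback: the original divide
    else:
        def divide(b, c):
            if c == hot_divisor:
                return b // hot_divisor  # fast path: constant divisor
            return b // c
    return divide


# "Training run": the divisors the program actually used.
profile = [16, 16, 3, 16, 7]
hot = most_frequent_value(profile)
divide = make_specialised_divide(hot)
```

Python's `//` and `>>` both round towards negative infinity, so the shift matches the divide for negative values here. C division truncates towards zero, so a C compiler has to add a small fix-up for signed operands.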
Is your ARM Linux kernel not booting when building with Linaro GCC or FSF GCC 4.7? Does it halt shortly after showing ‘Uncompressing Linux’? You may have run into an interaction between older kernels and the new unaligned access support in GCC. This affects Linaro GCC from 4.6-2011.11 onwards, GCC from 4.7.0 on, and kernels earlier than 3.2 including the Galaxy Nexus Ice Cream Sandwich release.
The work-around is to add -mno-unaligned-access to KBUILD_CFLAGS in the top level kernel Makefile or to backport 8428e84d42179c2a00f5f6450866e70d802d1d05 from the current kernel tree.
ARMv6K and later processors have hardware support for doing unaligned loads and stores which is faster than the old byte-by-byte/recombine that was done in software. Later versions of GCC use this to do quicker loads when working on known unaligned data, such as when working on a protocol buffer or a packed structure.
The CPU can be configured to trap on unaligned access. This trap is off at reset, but pre-3.2 kernels turn it on during the initial boot. An interaction between -fconserve-stack and -munaligned-access on a char buffer leads to an unaligned access, which causes a trap, which causes the kernel to halt.
This does not affect userspace programs as they run with the trap turned off.
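To make "known unaligned data" concrete, here is a small sketch of a packed header using Python's struct module. The header layout is invented for illustration: a one byte tag followed by a 32-bit little-endian length with no padding, which puts the 32-bit field at offset 1. A C compiler reading that field from a packed structure either assembles it byte by byte or, with unaligned access support, issues a single load.

```python
import struct

# Hypothetical wire header: 1-byte tag, then a 32-bit little-endian length.
# The "<" prefix means standard sizes with no padding between fields.
PACKED_HEADER = struct.Struct("<BI")

# The 32-bit field starts straight after the tag byte, so it is not
# 4-byte aligned even when the buffer itself is.
LENGTH_FIELD_OFFSET = 1


def parse_header(buf, offset=0):
    """Return (tag, length) read from buf starting at offset."""
    return PACKED_HEADER.unpack_from(buf, offset)


tag, length = parse_header(bytes([7, 0x78, 0x56, 0x34, 0x12]))
```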
We squirrel away the results of each Linaro GCC auto build so that they can be used for later benchmarking, testing, or regression hunting. This was taking around 25 minutes on a PandaBoard which, even on a 16 hour build, is too long.
The old method was:
- Install to $build/install
- Copy $build/install to gcc-linaro-$version-$buildid so the tarball had a unique top directory
- Tar up without compression
- Use xz at a non-default level to make the final .tar.xz
The new method skips the intermediate tarball by setting the compression level through the environment, and skips the copy by using tar’s transform rules to rewrite the top level directory:
tar cJf $(B)/$(ARCHIVE_BASE)-$(SNAME)$(SUFFIX).tar.xz \
-C $(VBUILD)/install \
--transform "s,^\./,$(ARCHIVE_BASE)-$(SNAME)$(SUFFIX)/," \
This cuts the archive step down to 4 minutes 20 seconds by skipping a lot of disk I/O, reducing the time spent compressing, and compressing and tarring in parallel.
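The same two tricks, renaming the top directory on the fly and picking the xz level up front, can be sketched with Python's tarfile module. This is an illustration rather than the build scripts themselves: `archive_install_tree` and its arguments are made-up names, and unlike the tar and xz pipeline, tarfile compresses in the same process rather than in parallel.

```python
import tarfile


def archive_install_tree(install_dir, top_dir, out_path, preset=2):
    """Write install_dir to out_path as a .tar.xz rooted at top_dir.

    arcname does the job of tar's --transform rule: the files are read
    from install_dir but stored under top_dir, so no copy is needed.
    preset is the xz compression level (0 to 9).
    """
    with tarfile.open(out_path, "w:xz", preset=preset) as tar:
        tar.add(install_dir, arcname=top_dir)
    return out_path
```

A call would look something like `archive_install_tree("build/install", "gcc-linaro-VERSION-BUILDID", "out.tar.xz")`, where the paths are placeholders.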
xz is impressive. At the lowest ‘-1’ compression level it takes the same time as gzip but produces an archive 65 % of the size. I settled on ‘-2’ which gives an archive 15 % bigger than the default ‘-6’ but takes a quarter of the time.
As part of our development process, we take each merge request or commit and build it natively on all of our supported architectures. It’s a bit painful on ARM as GCC is properly big, so a three stage quad language bootstrap plus the testsuite can take 19 hours.
I thought I’d profile one of the new PandaBoard ES hard float builders to see where we’re bound. Here’s the result:
The load looks good. Both cores are fully utilised for most of the build. The memory usage is OK. I’ve seen spikes when running the testsuite in the past but that seems to have cleared. The disk usage is fine at around 5 GB peak.
The spike in temperature is when the sun came through the window. Must fix that. The drop off to the right matches the overnight drop in ambient.
The good news is that we’re CPU bound. This suggests that a spiffy quad core i.MX6 Sabre Lite could halve the build time.
We use QEMU to test programs built by the toolchain binary release for correctness. I’ve written up the instructions for spinning up your own at:
It’s focused on simplicity – getting a running, SSH only Cortex-A9 up and going as soon as possible. It’s not the latest, not graphical, and doesn’t replace the deeper documentation at:
We host Linaro GCC up on Launchpad. It’s a bit tricky uploading a ~70 MB tarball over a non-resuming web form from New Zealand so I use VNC, Chromium, and an EC2 instance with a nice fat pipe instead. Here’s how.
- SSH in with a bonus tunnel for the VNC server using ssh -L 5902:localhost:5902 ec2-host-name
- Ensure vnc4server and chromium-browser are installed
- Run vncserver :2. This spawns the VNC server in the background
- Launch Chromium using DISPLAY=:2 chromium-browser
From your laptop:
- Ensure remmina is installed
- Start remmina
- Create a new VNC connection to the server :2
You should see a new window on your laptop with a full screen Chromium. There’s no window manager but we don’t need one in this case.
Remmina seems to have SSH tunnel support built in but I’m happy with my method.
To close down on EC2:
- Close Chromium
- Run vncserver -kill :2
A bonus of going to Linaro Connect is that I can satisfy my gadget urges without paying for the extra postage to get things to NZ.
This trip included a BeagleBone from Farnell’s. I’ve written some terse notes on using the Linaro LEB on it and setting up the USB network gadget for easy networking.