ARM performance

I recently purchased a couple of KwikByte KBOC_BB2 OMAP3530-based development boards. I was a bit surprised by how slow a native build of FFmpeg was on them but, after a bit of reflection, it all adds up.

The following tests are for a sample size of one, and therefore aren’t very scientific, but here goes:

Test	Host	Device	Host’s relative performance	Normalised
Compile ffmpeg	140 s	2690 s	19.2 x (9.6 x per core)	2.9 x
BogoMips	4787	494	9.7 x	2.9 x
Python looper	12.3 million/s	1.32 million/s	9.3 x	2.8 x

The normalised column divides the relative performance (per core, for the compile test) by the 3.33 x difference in clock speed between the 2.4 GHz host and the 720 MHz device. Incidentally, distcc did a great job of speeding up the ffmpeg build. Sharing the build across two boards brought the compilation down to 1493 s for a 1.8 x improvement.
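The arithmetic behind the table can be reproduced with a short sketch. The function name and the layout of the rows are made up for illustration; the figures come from the table and the configuration list. Expect the odd difference in the last digit from rounding.

```python
# Clock speeds taken from the configuration list, in MHz.
HOST_MHZ = 2400.0
DEVICE_MHZ = 720.0
CLOCK_RATIO = HOST_MHZ / DEVICE_MHZ  # about 3.33


def relative_and_normalised(host_rate, device_rate, host_cores=1):
    """Return (per-core host advantage, advantage corrected for clock speed).

    Both rates must be "higher is better", so pass 1/seconds for timings.
    """
    per_core = host_rate / device_rate / host_cores
    return per_core, per_core / CLOCK_RATIO


# (host rate, device rate, host cores used) for each row of the table.
ROWS = {
    "Compile ffmpeg": (1.0 / 140, 1.0 / 2690, 2),
    "BogoMips": (4787.0, 494.0, 1),
    "Python looper": (12.3, 1.32, 1),
}

for name, (host, device, cores) in sorted(ROWS.items()):
    per_core, normalised = relative_and_normalised(host, device, cores)
    print("%-15s %.1f x per core, %.1f x normalised" % (name, per_core, normalised))
```

Compile times are inverted before being passed in so that every row is a "higher is better" rate.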

The configuration was:

The host is a Lenovo R500 laptop with a 2.4 GHz P8600 Core 2 Duo
The device is a KwikByte with a 720 MHz OMAP3530
Ubuntu 10.04 LTS on both
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
Python 2.6.5-0ubuntu1
ffmpeg 0.6

The ‘Compile ffmpeg’ test involves compiling ffmpeg using the default target configuration. This may have led to more or less code being compiled on one machine than on the other. The host build used ‘make -sj4’ to keep both cores taxed. The device build used ‘make -sj3’ to keep the CPU busy.

The ‘BogoMips’ test is the BogoMips value reported by the Linux kernel at startup.
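If the boot log has scrolled away, the value can usually also be read from /proc/cpuinfo. The sketch below assumes that file reports the same figure as the boot message, and that the key is spelled ‘bogomips’ on x86 and ‘BogoMIPS’ on ARM, which is why the match ignores case. The function name parse_bogomips is hypothetical.

```python
import os
import re


def parse_bogomips(cpuinfo_text):
    """Return one BogoMips value per matching line of /proc/cpuinfo-style text."""
    # Match the key case-insensitively at the start of a line, then the number.
    pattern = re.compile(r"^bogomips\s*:\s*([0-9.]+)", re.IGNORECASE | re.MULTILINE)
    return [float(value) for value in pattern.findall(cpuinfo_text)]


# Only try the real file where it exists (i.e. on Linux).
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as cpuinfo:
        print(parse_bogomips(cpuinfo.read()))
```

On a multi-core machine the list has one entry per core, so the value to compare is a single entry rather than the sum.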

The ‘Python looper’ test is a simple ‘for i in range(500000) …’ test that counts the number of loops performed in 10 s.
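The original script isn’t shown, so the following is only a guess at its shape: run an empty ‘for i in range(500000)’ batch repeatedly until the time is up, then divide the loop count by the elapsed time. The function name and the empty loop body are assumptions.

```python
import time


def loops_per_second(duration_s=10.0, batch=500000):
    """Run empty for-loops for about duration_s seconds and return loops per second."""
    loops = 0
    start = time.time()
    while time.time() - start < duration_s:
        for i in range(batch):
            pass  # Empty body: only the interpreter's loop overhead is measured.
        loops += batch
    # The last batch can overrun the deadline, so divide by the real elapsed time.
    elapsed = time.time() - start
    return loops / elapsed
```

Calling loops_per_second(10.0) and dividing by a million would give a figure in the same units as the table. Checking the clock only once per batch keeps the timing call out of the inner loop.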

I’m fairly sure the tests were CPU bound. Note that the TDP of the P8600 is 25 W and the OMAP3530 is (as best as I can tell) 1.5 W. The 2.9 x drop in performance is made up for by a 16.7 x drop in power. I wonder how much power the host was actually using?

Update: I had a quick go with the same tests using qemu-maemo to simulate a Cortex-A8 based board. My P8600 does 0.54 million loops/s while a triple-core AMD Phenom 8650 achieved 0.35. Running a qemu instance on each core on both machines gives me the equivalent of a 1.2 GHz board, which isn’t really worth the effort.
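One way to arrive at the 1.2 GHz figure, assuming each qemu instance keeps its single-instance score when every core is busy and that the Python score scales linearly with clock speed:

```python
# Emulated Cortex-A8 scores in million loops/s per qemu instance, with core counts.
P8600_SCORE, P8600_CORES = 0.54, 2
PHENOM_SCORE, PHENOM_CORES = 0.35, 3

# The real OMAP3530 board: 1.32 million loops/s at 720 MHz.
BOARD_SCORE, BOARD_MHZ = 1.32, 720.0


def equivalent_mhz(total_score):
    """Clock speed a real board would need to match total_score, assuming linear scaling."""
    return total_score / BOARD_SCORE * BOARD_MHZ


# One instance per core on both machines.
total = P8600_SCORE * P8600_CORES + PHENOM_SCORE * PHENOM_CORES
print("%.2f million loops/s in total, roughly a %.0f MHz board" % (total, equivalent_mhz(total)))
```

This treats five separate emulated boards as one faster board, which only holds for work that splits cleanly across instances.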

Update: I had a go with the Python test on an Atom N450 1.66 GHz machine in 64-bit mode. It scored 3.48 million loops/s, or 2.45 x slower per clock than the host. On this (very poor) benchmark, the Atom is 1.14 x faster per clock than the ARM. That’s not surprising when you consider that the Python test probably has poor memory locality and the Atom has a 512 k cache vs the ARM’s 32 k.
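The per-clock comparison is just each score divided by its clock speed. A rough sketch, taking 1.66 GHz as exactly 1660 MHz (an assumption; the last digit of the result shifts with the exact clock used):

```python
# (million loops/s, clock in MHz) for the three machines in the Python test.
MACHINES = {
    "P8600": (12.3, 2400.0),
    "Atom N450": (3.48, 1660.0),
    "OMAP3530": (1.32, 720.0),
}


def loops_per_mhz(name):
    """Loops per second per MHz of clock for the named machine."""
    score, mhz = MACHINES[name]
    return score * 1e6 / mhz


def per_clock_ratio(faster, slower):
    """How many times more work per clock the first machine does than the second."""
    return loops_per_mhz(faster) / loops_per_mhz(slower)


print("P8600 vs Atom: %.2f x" % per_clock_ratio("P8600", "Atom N450"))
print("Atom vs OMAP3530: %.2f x" % per_clock_ratio("Atom N450", "OMAP3530"))
```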
