WQUXGA – IBM T221 3840×2400 204dpi Monitor – Part 5: When You Are Really Stuck With an SL-DVI

I recently had to make one of these beasts work bearably well with only a single SL-DVI cable. This was dictated by the fact that I needed to get it working on a graphics card with only a single DVI output, and my 2xDL-DVI -> 2xLFH-60 adapter was already in use. As I mentioned previously, I found the 13Hz that a standard 1xSL-DVI link provides just too slow a refresh rate (I could see the mouse pointer skipping along the screen), but the default 20Hz from 2xSL-DVI was just fine for practically any purpose.

So, faced with the need to run with just a single SL-DVI port, it was time to see whether a bit of tweaking could reduce the blanking periods and squeeze a few more FPS out of the monitor. In the end, 17.1Hz turned out to be the limit of what could be achieved. It turns out this is sufficient to make the mouse skipping go away and the monitor reasonably pleasant to use.

(Note: My wife disagrees – she claims she can see the mouse skipping at 17.1Hz. OTOH, she is unable to read my normal font size (MiscFixed 8-point) on this monitor at full resolution. So how you get along with this setup will largely depend on whether your eyes’ sensitivity is skewed toward high pixel density or high frame rates.)

The xorg.conf I used is here:

Section "Monitor"
  Identifier    "DVI-0"
  HorizSync    31.00 - 105.00
  VertRefresh    12.00 - 60.00
  Modeline "3840x2400@17.1"  165.00  3840 3848 3880 4008  2400 2402 2404 2406 +hsync +vsync
EndSection

Section "Device"
  Identifier    "ATI"
  Driver        "radeon"
EndSection

Section "Screen"
  Identifier    "Default Screen"
  Device        "ATI"
  Monitor        "DVI-0"
  DefaultDepth    24
  SubSection "Display"
    Modes    "3840x2400@17.1"
  EndSubSection
EndSection

The Modeline could easily be used to create an equivalent setting in Windows using PowerStrip or a similar tool, or you could hand-craft a custom monitor .inf file.
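
As a sanity check, the refresh rate a modeline yields is simply the pixel clock divided by the total (blanking included) horizontal and vertical timings, which for the modeline above works out to the advertised figure:

awk 'BEGIN { printf "%.2f Hz\n", 165.00e6 / (4008 * 2406) }'    # prints 17.11 Hz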

In the process of this, however, I discovered a major limitation of some of the Xorg drivers. The generic frame buffer (fbdev) and VESA (vesa) drivers do not support Modelines and will simply ignore them. ATI's binary driver (fglrx) doesn't support them either – the Linux CCC application's documentation mentions a section for custom resolutions, but no such section exists in the program. So if you want to run a monitor in any mode other than what its EDID reports, you cannot use any of these drivers, which is a hugely frustrating limitation. In the case of the fbdev driver it is reasonably forgivable, because it relies on whatever modes the kernel frame buffer exposes. In the case of the VESA driver it is understandable that it only supports standard VESA modes. But ATI's official binary driver lacking this feature is quite difficult to forgive – it has clearly been dumbed down too far.
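
With a driver that does honour custom modes, such as the open source radeon driver used above, you can also try a candidate mode out at runtime with xrandr before committing it to xorg.conf. A rough sketch, assuming the output is reported as DVI-0 (check with xrandr -q):

xrandr --newmode "3840x2400@17.1" 165.00 3840 3848 3880 4008 2400 2402 2404 2406 +hsync +vsync
xrandr --addmode DVI-0 "3840x2400@17.1"
xrandr --output DVI-0 --mode "3840x2400@17.1"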

Getting the Best out of the MacBook Pro Retina 15 Screen in VMware Fusion

I make no secret of the fact that I am neither a fan of Apple nor a fan of virtualization. But sometimes they make for the best available option. I have recently found myself in such a situation. My current employer, mercifully, allows employees a choice of something other than vanilla Windows machines to work on, and there was an option of getting a MacBook Pro. As you can probably guess from some of the previous articles here, I find the single most important productivity feature of a computer to be the screen resolution, an opinion I appear to share with Linus Torvalds. So I opted for the 15″ MacBook Pro Retina.

Unfortunately, native Linux support on that machine still isn't quite perfect. Since speed is not a concern in this particular case, I opted to run Linux under VMware Fusion on OSX. VMware Fusion, however, cannot handle the full 2880×1800 resolution of the display, and with lower resolutions running in full screen mode the quality is badly degraded by blurring and aliasing. The solution is to create a custom 2880×1800 mode in /etc/X11/xorg.conf that fits within the capabilities of VMware's virtual graphics driver. This took a bit of working out, since the mode had to fit within the driver's horizontal and vertical refresh rate limits and the total pixel clock it allows. The following are the settings that work for me:

Section "Monitor"
        Identifier "MacBookPro"
        HorizSync 30.0 - 90.0
        VertRefresh 30.0 - 60.0
        ModeLine "2880x1800C" 358.21 2880 2912 4272 4304 1800 1839 1852 1891
EndSection

Section "Screen"
        Identifier "Default Screen"
        Monitor "MacBookPro"
        DefaultDepth 24
        SubSection "Display"
                Modes "2880x1800C"
        EndSubSection
EndSection

The result is being able to run a full screen 2880×1800 mode, and it looks absolutely superb.
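
For reference, the modeline above works out to roughly 44Hz. If you need to derive a similar mode for a different panel or a different pixel clock budget, cvt can generate a starting modeline which you can then trim until it fits the driver's limits:

awk 'BEGIN { printf "%.1f Hz\n", 358.21e6 / (4304 * 1891) }'    # prints 44.0 Hz
cvt 2880 1800 44    # prints a candidate Modeline for 2880x1800 at 44Hz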

Virtual Performance – Or Lack Thereof

People always seem very shocked when I suggest that virtualization comes with a very substantial performance penalty even when virtualization hardware extensions are used. Concerningly, this surprise often comes from people who have already either committed their organization’s IT infrastructure to virtualization, or have made firm plans to do so. The only thing I can conclude in these cases, unbelievable as it may appear, is that they haven’t done any performance testing of their own to assess the solution they are planning to adopt.

So I decided to document some basic performance tests that show just how substantial the performance hit of virtualization is.

Test Setup

Hardware:
Core2 Quad 3.2GHz
8GB of RAM
2x500GB 7200rpm SATA DM RAID1 for the main system
1x250GB 7200rpm SATA for testing

Virtual Test Configuration (VMware ESXi 5.0.0, VMware Player 4.0.4 and 5.0.0, Xen 4.1.2 (PV and HVM), KVM (RHEL6), VirtualBox 4.1.18):
CPU Cores: 4 (all)
RAM: 6GB
Disk: System booting off the 2×500 RAID1. Raw 250GB SATA disk passed to the VM.

Disk write caching was enabled in the VMware configuration. You may think that this unfairly gives the VM configuration an advantage, but as you will see from the results, even with this “cheat”, the performance is still very disappointing compared to bare metal. In any case, the amount of disk I/O is negligible – the caches and the working set always fit into memory.

Physical Test Configuration:
CPU Cores: 4 (all)
RAM: 6GB (limited using mem=6G boot parameter)
Disk: Booting directly off the same 250GB SATA disk used for VM testing, with the same kernel and configuration.

The Test

The test performed is a compile of the vanilla 2.6.32.59 Linux kernel. This is the script used for testing:

#!/bin/bash

echo Cleaning...
make clean > /dev/null 2>&1
make mrproper > /dev/null 2>&1
sync
echo 3 > /proc/sys/vm/drop_caches
echo Configuring...
make allmodconfig > /dev/null 2>&1
echo Syncing...
sync
find . -type f -print0 | xargs --null cat > /dev/null 
echo "Timing build..."
time (make -j16 all > /dev/null 2>&1)

The source tree is cleaned and all caches dropped. The allmodconfig configuration is used to get some degree of disk I/O testing by creating the maximum number of files. The caches are then primed by pre-loading all the source files. This is done in order to more accurately measure the CPU and RAM subsystems without bottlenecking on disk I/O. The CPU in the system has 4 cores, and 16 build threads are used to ensure the CPU and memory I/O are saturated, but without creating enough memory pressure to cause swapping.

On the host and in the guest, all unnecessary services and processes were stopped (especially crond which could theoretically cause additional load on the system that would distort the results).

All tests were carried out 3 times in a row, and the best result for each is considered here (the differences between the runs were minimal).

This is very much a redneck, brute-force test. There isn’t much finesse to it. But I like tests like this because they cannot be cheated with the sort of smoke and mirrors illusions that virtualization software is very good at applying.

Results

Bare metal:           1,042.523s  (100%)
Xen 4.1.2 (PV):       1,316.984s  (79.16%)
VMware ESXi 5.0.0:    1,361.321s  (76.58%)
VMware Player 5.0.0:  1,478.732s  (70.50%)
VMware Player 4.0.4:  1,520.023s  (68.59%)
KVM (RHEL6):          1,691.849s  (61.62%)
Xen 4.1.2 (HVM):      2,839.442s  (36.72%)
VirtualBox 4.1.18:    8,876.945s  (19.06%)

Note: No, this is not a typo – VirtualBox really is that bad.
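
For clarity, the relative figures are simply the bare metal time divided by each hypervisor's time; for example, for paravirtualized Xen:

awk 'BEGIN { printf "%.2f%%\n", 100 * 1042.523 / 1316.984 }'    # prints 79.16%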

To make this difference easier to visualise, here it is in graph form.

Virtualization Performance – Time in Seconds

To give a better idea of relative performance, here it is in % points, with bare metal being 100%.

Virtualization Performance – Relative Difference

The difference is substantial even with the least poorly performing hypervisor. Virtualization performance is over a fifth (21%) down compared to bare metal with paravirtualized Xen, nearly a quarter (24%) down with VMware ESXi, and even worse with KVM. Or, if you prefer to look at it the other way around, bare metal is more than a quarter as fast again (26.32%) as the best performing hypervisor on the same hardware.

Don’t get me wrong – virtualization is handy for all sorts of low-performance tasks. In cases where it is used to consolidate a number of mostly idle systems into one mostly idle physical machine, it brings clear benefits. (Except maybe in the case of VirtualBox – the performance there is just too appalling for anything, and HVM Xen is pretty poor, too.) But for uses where performance is important, thoughts of virtualizing need to undergo a serious reality check. Even if your system is designed to scale completely horizontally, requiring 26%+ extra hardware (in the best case – it could be a lot worse depending on which hypervisor you use) is likely to put a significant strain on your budget and running costs.

Note: It is worth stressing that these tests are carried out on hardware with VT-x, and support for this is enabled and used for all the tested hypervisors. So the results here are based on optimal hardware support.

Here is a link to an excellent paper on virtualisation performance overheads with similar findings to my brief research.

RedSleeve Linux Public Alpha

Here is something that I have been working on of late.

RedSleeve Linux is a 3rd party ARM port of a Linux distribution of a Prominent North American Enterprise Linux Vendor (PNAELV). They object to being referred to by name in the context of clones and ports of their distribution, but if you are aware of CentOS and Scientific Linux, you can probably guess what RedSleeve is based on.

RedSleeve is different from CentOS and Scientific Linux in that it isn’t a mere clone of the upstream distribution it is based on – it is a port to a new platform, since the upstream distribution does not include a version for ARM.

RedSleeve was created because ARM is making inroads into mainstream computing, and although Fedora has supported ARM for a while, it is a bleeding edge distribution that puts the emphasis on keeping up with the latest developments rather than on long term support and stability. That was not an acceptable solution for the people behind this project, so we set out to instead port a distribution that puts more emphasis on long term stability and support.

More/Better Internal Storage on the Toshiba AC100 – Part 2

Following my research for the previous article about the performance of SD/CF/USB flash modules, the only conclusion I could reach is that most of them are pretty dire. The only notable exception among the SD cards seems to be the latest generation of the SanDisk Extreme Pro (95MB/s) cards that just about managed to squeeze out enough performance on random writes to match a 7200rpm disk. Still, this is pretty dire compared to any reasonable SSD, so I wanted to see what else could be done about installing extra storage with good performance into an AC100.

What I came across is this: the SuperTalent RC8 USB stick. It may look like a USB stick, but it is actually a full-on SSD, featuring a SandForce 1200 flash controller. I figured this was worth a shot, even though the specifications indicate it is rather large (far too large to fit inside an AC100 in its standard form). Stripped out of its casing, however, it looks like the RC8 might just fit inside the AC100.

This is what I ended up with. There appears to be only one place inside an AC100 where a bare RC8 circuit board could be fitted. You will need the following:

1) P3MU mini-PCIe USB break-out module

2) SuperTalent RC8 USB stick

3) Custom made USB cable (male and female type A USB connectors, some single core wire, and some skill with a soldering iron)

Measure out exactly how long you need the cable to be – there is no room to tuck away excess cable inside an AC100. Here is what my cable layout ended up looking like.

AC100 motherboard with P3MU and custom USB cable fitted

This is what it looks like with the top panel fitted. Note the large cut-out that has been made below the mini-PCIe slot access hole.

AC100 modified to receive RC8 USB SSD

And again with the screws fitted. Note that one of the screw holes is in the area that had to be cut out. This shouldn’t affect the structural integrity of the AC100, though. Also note that the right speaker cable has been re-routed slightly to now go over the LED ribbon cable.

AC100 modified to receive RC8 SSD

This is what it looks like with the RC8 attached. Now you can see why the cut-out in the top panel was exactly the shape it was – I specifically cut out the minimum possible amount to allow the RC8 to fit.

Toshiba AC100 with the SuperTalent RC8 USB SSD installed

I also put a piece of thin transparent sticky tape over it to hold it in place, just to make sure nothing can short out against the underside of the keyboard.

Toshiba AC100 with the SuperTalent RC8 SSD

And that is pretty much it. Put the keyboard back in and bolt it all together. The metal part of the USB connector will sit a tiny bit above the line of the panel, but the only way you’ll notice it once you put the keyboard back on is by knowing that there is a tiny bulge there.

Your AC100 should now be able to handle ~2000 IOPS on both random reads and random writes, along with the much better life expectancy that proper flash management brings.

At this point I would like to point out just how impressed I am with the SuperTalent RC8 USB SSD. Not only is the performance phenomenal (for a USB stick, at least), but it really behaves like a SATA SSD – to the point where you can use tools like hdparm and smartctl on it (yes, it even supports SMART).
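
For example (the device name here is an assumption – the RC8 will show up as whatever USB disk node your system assigns it, e.g. /dev/sdb):

hdparm -I /dev/sdb              # full ATA identify data, as if it were a SATA disk
smartctl -a -d sat /dev/sdb     # SMART attributes via SAT pass-through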

Flash Module Benchmark Collection: SD Cards, CF Cards, USB Sticks

Having spent a considerable amount of time, effort, and ultimately money trying to find decently performing SD, CF and USB flash modules, I feel I really need to make the lives of other people with the same requirements easier by publishing my findings – especially since I have been unable to find a reasonably comprehensive data source with similar information.

Unfortunately, virtually all SD/microSD (referred to as uSD from now on), CF and USB flash modules have truly atrocious performance for use as normal disks (e.g. when running the OS from them on a small, low power or embedded device), regardless of what their advertised performance may be. The performance problem is specifically related to their appalling random-write performance, so this is the figure that you should be specifically paying attention to in the tables below.

As you will see, the sequential read and write performance of flash modules is generally quite good, as is random-read performance. But on their own these are largely irrelevant to overall performance you will observe when using the card to run the operating system from, if the random-write performance is below a certain level. And yes, your system will do several MB of writing to the disk just by booting up, before you even log in, so don’t think that it’s all about reads and that writes are irrelevant.

For comparison, a typical cheap laptop disk spinning at 5400rpm can achieve around 90 IOPS on both random reads and random writes with a typical (4KB) block size. This is an important figure to bear in mind, purely to be able to see just how appalling the random write performance of most removable flash media is.

All media was primed with two passes of:

 dd if=/dev/urandom of=/dev/$device bs=1M oflag=direct

in order to simulate long term use and ensure that the performance figures reasonably accurately reflect what you might expect after the device has been in use for some time.

There are two sets of results:

1) Linear read/write test performed using:

dd if=/dev/$device of=/dev/null    iflag=direct
dd if=/dev/zero    of=/dev/$device oflag=direct

The linear read-write test script I use can be downloaded here.

2) Random read/write test performed using:

iozone -i 0 -i 2 -I -r 4K -s 512m -o -O +r +D -f /path/to/file

In all cases, the test size was 512MB. Partitions are aligned to 2MB boundaries. The file system is ext4 with a 4KB block size (-b 4096) and a 16-block (64KB) stripe-width (-E stride=1,stripe-width=16), no journal (-O ^has_journal), mounted without access time logging (-o noatime). The partition used for the tests starts at half of the card's capacity, e.g. on a 16GB card the test partition spans the space from 8GB up to the end. This is done in order to nullify the effect of some cards having faster flash at the front of the card.
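
For reproducibility, this is roughly how each test file system was created and mounted (the device and mount point are assumptions – adjust /dev/sdb1 and /mnt/test to match your card reader):

mkfs.ext4 -b 4096 -E stride=1,stripe-width=16 -O ^has_journal /dev/sdb1
mount -o noatime /dev/sdb1 /mnt/test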

The data here covers only the first few modules I have tested, and it will be extensively updated as and when I test additional ones. Unfortunately, a single module can take over 24 hours to test if its performance is poor (e.g. 1 IOPS) – and unfortunately, most of them are that bad, even those made by reputable manufacturers.

The dd linear test is probably more meaningful if you intend to use the flash card in a device that only ever performs large, sequential writes (e.g. a digital camera). For everything else, however, the dd figures are meaningless and you should instead be paying attention to the iozone results, particularly the random-write (r-w). Good random write performance also usually indicates a better flash controller, which means better wear leveling and better longevity of the card, so all other things being similar, the card with faster random-write performance is the one to get.

Due to WordPress being a little too rigid in its templates to allow for wide tables, you can see the SD / CF / USB benchmark data here. This table will be updated a lot, so check back often.

More/Better Internal Storage on the Toshiba AC100

One of the unfortunate things about the AC100 is that the internal storage isn't removable, and thus isn't easily upgradable or replaceable. The latter could become an issue in the longer term because it is flash memory, so it will eventually wear out, and since it is relatively basic eMMC, I don't expect the flash controller to be particularly advanced when it comes to wear leveling and minimizing write amplification. Using the SD slot is an option, but if we are running the operating system from it, we cannot use it for removable media, which could be handy. We could use a USB stick instead, but then we lose the only USB port on the machine. There is no SATA controller inside the AC100.

What can be done about this? Well, models that have a 3G modem have it on a mini-PCIe USB card. Even though Tegra 2 has a PCIe controller built into it, the mini-PCIe slot isn't fully wired up – only the USB lines are connected. Since most of us can tether a data connection via our phones, and since this is more cost effective than paying for two separate mobile connections, the 3G module isn't particularly vital. The main issue is that the slot only has USB wired up, so what we would need is a USB mini-PCIe SSD. Is there such a thing? It turns out that there is. I have been able to find two:

  1. EMPhase Mini PCIe USB S1 SSD
  2. InnoDisk miniDOM-U SSD

The specification of the two modules is virtually identical (both use SLC flash, among other similarities), so I decided to investigate both of them. Unfortunately, when I contacted an EMPhase re-seller, they spoke to the manufacturer, called me back, and talked me out of buying one, citing unspecified issues.

My local InnoDisk re-seller was more interested in selling me a product, but there were two reasons why despite very good pre-sales service I ultimately decided against buying one of these. The first and foremost was the performance specification. According to the manufacturer’s own figures, the random access performance with 4KB blocks is 1440 random read IOPS and 30 random write IOPS. Considering the price per GB of these modules is approximately 4x that of similarly performing SLC SD cards, this module was discarded on the basis of cost effectiveness.

Having discarded the above modules, there are still a few alternative options available. The low risk, tidy options include an SD mini-PCIe USB adapter and a micro-SD mini-PCIe USB adapter. They are very reasonably priced so I got one of each for testing, and I am pleased to say that they work absolutely fine in the AC100. Here is what they look like fitted into the AC100.

Dual micro-SD mini-PCIe USB Adapter

SD mini-PCIe USB Adapter

The SD cards will appear as USB disks. If you use the dual micro-SD adapter you can RAID the two cards together.
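
A minimal sketch of RAIDing the two cards together (the device names are assumptions – check dmesg for what the adapter actually enumerates as; RAID0 shown here for capacity and speed, use RAID1 if you prefer redundancy):

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mkfs.ext4 /dev/md0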

Unfortunately, I have found that the best results are achieved using a single SD card, purely because I haven’t found any micro-SD cards that have reasonable performance when it comes to random-write IOPS. SD cards fare a little better, but the best SD card I have found in terms of random write IOPS still tops out at a mere 19 random write IOPS using 4KB blocks. Still, it is 2/3 of the marketed figures for the InnoDisk SSD at 4x lower price per GB, and the performance just about scrapes past what I would consider minimal requirements for reasonable use.

I am currently putting together a list of SD, micro SD and USB flash devices and consistent benchmark performance figures for them, which should hopefully help you to choose the ones most suitable for your application. I hope to have the article up reasonably soon, but don’t expect it too soon – benchmarking SD cards takes a long time to do properly.

Alleviating Memory Pressure on Toshiba AC100

After all the upgrades and tweaks to the AC100 (screen upgrade to 1280×720, cooling improvements and boosting the clock speed by over 40%), only one significant issue remains: it only has 512MB of RAM. Unfortunately, the memory controller initialization is done by the closed-source boot loader, so even if we were to solder in bigger chips (Tegra2 can handle up to 1GB of RAM), it is unlikely in the extreme that it would just work.

So, other than increasing the physical amount of memory, can we actually do anything to improve the situation? Well, as a matter of fact, there are a few things.

Clawing Back Some Memory

By default, the GPU gets allocated a hefty 64MB out of the 512MB of RAM that we have. This is quite a substantial fraction of our memory, and it would be nice to claw some of it back if we are not using it. I find Nvidia's binary accelerated Tegra driver too buggy to use under normal circumstances, so I use the basic unaccelerated frame buffer driver instead. There are two frame buffer allocations on the AC100: the internal display and the HDMI port. The latter is only intended for use with TVs, which means we shouldn't need a resolution of more than 1920×1080 on that port. The highest resolution display we can have on the internal port is 1280×720. That means the maximum amount of memory used by those two frame buffers is 8100KB + 3600KB = 11700KB. To be on the safe side, let's call that 16MB. That still leaves 48MB that we should be able to safely reclaim. We can do that by telling the kernel that there is extra memory at certain addresses, using the following boot parameters:

mem=448M@0M mem=48M@464M

Make sure the accelerated binary Tegra driver is disabled in your xorg.conf, reboot and you should now have 496MB of usable RAM instead of 448MB. It’s just over an extra 10%, which should make a noticeable difference given how tight the memory is to begin with.

If you aren’t using the HDMI interface, my tests show that it is in fact possible to reduce the GPU memory to just 2MB with no ill effects, when using the 1280×720 display panel, because the frame buffer seems to operate in 16-bit mode by default:

mem=448M@0M mem=62M@450M

That leaves a total of 510MB for applications.
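
For reference, the frame buffer figures above are simply width × height × bytes per pixel:

echo $(( 1920 * 1080 * 4 / 1024 ))    # 8100 (KB) - HDMI port at 1080p, 32bpp
echo $(( 1280 * 720 * 4 / 1024 ))     # 3600 (KB) - internal panel at 32bpp
echo $(( 1280 * 720 * 2 / 1024 ))     # 1800 (KB) - internal panel in 16-bit mode, hence 2MB being enough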

Memory Compression

In recent kernels there are two modules that are very useful when we have plenty of CPU resources but very little memory – exactly the case on the AC100. They are zcache and zram. On 3.0 kernels, instead of zram we can use frontcache, which is similar but has the advantage of being aware of and cooperating with zcache. Since, at the time of writing, 3.0 isn't quite as polished and stable on the AC100 as 2.6.38, let's focus on zram instead.

Assuming you have compiled zcache support into your kernel, all you need to do to enable it is add the kernel boot parameter "zcache". From there on, your caches should be compressed, thus increasing the amount they can store.

zram provides a virtual block device backed by RAM, but the contents are compressed, so it should always end up using less memory than the size it presents as a block device (unless all of the data is incompressible, which is very unlikely). To err on the side of caution, we shouldn't set the combined size of all the zram devices to more than half of the total memory. To ensure optimal performance, we should also set the number of zram devices to be the same as the number of CPU cores in the system, to make sure that all CPUs end up being used (each zram device handler is a single thread).

To set the number of zram devices to 2 (Tegra2 has 2 CPU cores), we need to create the file /etc/modprobe.d/zram.conf containing the following line:

options zram num_devices=2

Then, once we load the zram module (modprobe zram), we should see device nodes called /dev/zram*. We can set the size of each device by writing the desired size, in bytes, to its disksize attribute:

echo <size_in_bytes> > /sys/block/zram<N>/disksize

The amount of memory assigned to each zram device should be such that their total combined size doesn’t exceed half of the total physical memory in the system.

Then we can create swap headers on those zram devices using mkswap (e.g. mkswap /dev/zram0) and enable swapping to them (swapon -p100 /dev/zram0).
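
Putting it all together, a minimal sketch for the two-core AC100 (the 120MB per-device size is just an example – keep the combined total under half of physical RAM):

modprobe zram num_devices=2    # or just 'modprobe zram' if you created the zram.conf above
for i in 0 1; do
    echo $(( 120 * 1024 * 1024 )) > /sys/block/zram$i/disksize
    mkswap /dev/zram$i
    swapon -p 100 /dev/zram$i
done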

We should now have some compressed RAM for swapping to instead of swapping to a slow SD card.

Tweaks

It turns out that some of the default settings on Linux distributions aren't as sensible as they could be. By default, each thread is allocated 8MB of stack space. This is unnecessarily large and results in higher memory consumption than needed. Instead, we can set the soft limit to 256KB using "ulimit -s 256". Ideally, we should make this happen automatically at startup by creating a file /etc/security/limits.d/90-stack.conf containing the following:

* soft stack 256

Some users have reported that this can increase the amount of available memory after booting by a rather substantial amount. Since this is a soft limit, programs that require more stack space can still allocate it by asking for it.

Choice of Software

One of the most commonly used types of software nowadays is the web browser, and unfortunately most web browsers have become unreasonably bloated in recent years. This is a problem when the amount of memory is as limited as it is on most ARM machines. Firefox and, to a somewhat lesser extent, Chrome require a substantial amount of memory. However, there is another reasonably fully featured alternative that works on ARM – Midori. Midori is based on the WebKit rendering engine, the same one used by Chrome and Safari, but its memory footprint is approximately half that of the other browsers. Unfortunately, its JavaScript support isn't quite as good as Firefox's or Chrome's yet, but it is sufficiently good for most things, and if memory pressure is a serious issue, you might want to try it out.