Wednesday, September 24, 2008

The fastest loop in C

Sure, practically nobody knows assembler nowadays. But if you use a certain construct very often in time critical code, you might get curious about the fastest way to do it. One example is loops. I tested this with a loop which has the following properties:
  • The loop count is unknown at compile time, so the loop cannot be unrolled
  • The loop index i is not used within the loop body. In particular, it doesn't matter whether it's incremented or decremented
This kind of loop occurs very often in gavl colorspace conversion routines. The following C file has three functions which do exactly the same thing:

void do_something();

static int num;

void loop_1()
  {
  int i;
  for(i = 0; i < num; i++)
    do_something();
  }

void loop_2()
  {
  int i = num;
  while(i--)
    do_something();
  }

void loop_3()
  {
  int i = num + 1;
  while(--i)
    do_something();
  }

This code can be compiled with gcc -O2 -S. The resulting loop bodies are the following:

loop 1:

xorl %eax, %eax
addl $1, %ebx
call do_something
cmpl %ebp, %ebx
jne .L17
loop 2:
xorl %eax, %eax
addl $1, %ebx
call do_something
cmpl %ebp, %ebx
jne .L11
loop 3:
xorl %eax, %eax
call do_something
subl $1, %ebx
jne .L5
As you see, the winner is loop 3. Here, the whole loop logic (without initialization) needs only two machine instructions:
  • Decrement ebx
  • Jump if the result is nonzero
I doubt it can be done faster. You also see that loop 2 looks simpler than loop 1 in C, but the compiler produces exactly the same code. I changed the innermost gavl pixelformat conversion loops to loop 3 and could even measure a tiny speedup for very simple conversions.

Monday, September 22, 2008

Fun with the colormatrix

Some years ago, when I wrote lemuria, I was always complaining that OpenGL doesn't have a colormatrix. Later I found that some extensions provide this, but not the core API. Today I think the main problem was that I used a graphics system (OpenGL) which is optimized to look as realistic as possible for a visualization which should look as surrealistic as possible :)

Later, when I concentrated on more serious work like developing high quality video filters, I rediscovered the colormatrix formalism. Now what is that all about and what's so exciting about it? If you consider each pixel to be a vector of color channels, multiplying the vector by a matrix is simply a linear transformation of each pixel. In homogeneous coordinates and in an RGBA or YUVA (i.e. 4 channel) colorspace, it can be completely described by a 4x5 matrix. Now there are lots of commonly used video filters which can be described by a colormatrix multiplication:

Brightness/Contrast (Y'CbCrA)

[ c  0  0  0  b - (1+c)/2 ]
[ 0  1  0  0  0           ]
[ 0  0  1  0  0           ]
[ 0  0  0  1  0           ]
b and c are between 0.0 and 2.0.

Saturation/Hue rotation (Y'CbCrA)

h is between -pi and pi (0 is neutral). s is between 0.0 (grayscale) and 2.0 (oversaturated).

Invert single RGB channels (RGBA)

This inverts only the first (red) channel. Inverting other channels is done by changing any of the other lines accordingly.

Swap RGB channels (RGBA)

This swaps the first (red) and 3rd (blue) channel. It can be used to rescue things if some buggy routine confused RGB and BGR. Other swapping schemes are trivial to do.

RGB gain (RGBA)


gr, gg and gb are in the range 0.0..2.0.

There are of course countless other filters possible, like generating an alpha channel from (inverted) luminance values etc. Now, what do you do if you want to make e.g. a brightness/contrast filter work on RGBA images? The naive method is to transform the RGBA values to Y'CbCrA, do the filtering and transform back to RGBA. And this is what can be optimized in a very elegant way by using some simple matrix arithmetic. Instead of performing 2 colorspace conversions in addition to the actual filter for each pixel, you can simply transform the colormatrix (M) from Y'CbCrA to RGBA:

Untransformed pixel in RGBA:(p)
Untransformed pixel in Y'CbCrA:(RGBA->Y'CbCrA)*(p)
Transformed pixel in Y'CbCrA:(M)*(RGBA->Y'CbCrA)*(p)
Transformed pixel in RGBA:(Y'CbCrA->RGBA)*(M)*(RGBA->Y'CbCrA)*(p)
The matrices (RGBA->Y'CbCrA) and (Y'CbCrA->RGBA) are the ones which convert between RGBA and Y'CbCrA. Since matrix multiplication is associative you can combine the 3 matrices to a single one during initialization:

M_RGBA = (Y'CbCrA->RGBA) * M_Y'CbCrA * (RGBA->Y'CbCrA)

If you have generic number-crunching routines, which do the vector-matrix multiplication for all supported pixelformats, you can reduce conversion overhead in your processing pipeline significantly.

Another trick is to combine multiple transformations in one matrix. The gmerlin video equalizer filter does this for brightness/contrast/saturation/hue rotation.

A number of gmerlin filters use the colormatrix method. Of course, for native pixelformats (image colorspace = matrix colorspace), the transformation is done directly, since it usually needs far fewer operations than a generic matrix-vector multiplication. But for foreign colorspaces, the colormatrix is transformed to the image colorspace as described above.

Some final remarks:
  • The colormatrix multiplication needs all channels for each pixel. Therefore it doesn't work with subsampled chroma planes. An exception is the brightness/contrast/saturation/hue filter, because luminance and chroma operations are completely separated here.
  • For integer pixelformats the floating point matrix is converted to an integer matrix, where the actual ranges/offsets for Y'CbCr are taken into account.
  • In practically all cases, the color values can over- or underflow. The processing routines must do proper clipping.

Sunday, September 21, 2008

Back from California

My job sometimes requires me to attend international conferences and workshops. It's pretty exciting, but also very hard work. You have to prepare presentations, posters and articles for the conference digest. You must concentrate on others' presentations (ignoring your jet lag), and the brain gets much more input per day than usual. Some people, who wish me a nice holiday before I leave, don't really know what this is about :)

This year, I had the honor to make a 2 weeks trip to Southern California, first to San Diego and then to Pasadena. It was my first visit to the US so here are some differences I noticed with respect to Old Europe. Of course these are highly subjective and based on what I saw in two weeks in just a tiny corner of a huge country.

Shops, Restaurants
If you are used to the unfriendliness of German shop employees and waiters, you'll be positively surprised in the US. They always have some friendly words for you. They might not be serious about it, but they do it well enough so it works. This is definitely something where Germans can learn from the US.

Public transport
As a resident of a big German city, I can live perfectly well without owning a car. Of course people here always complain about the local trains being too expensive, finishing operation too early in the night etc. But this is still paradise compared to the US. At the bus stops in San Diego I saw mostly people who didn't really look prosperous. It seems that everyone who can afford a car buys one. For a reason.

International news
If you want to learn about a foreign society it's a good idea to watch their TV programs. I learned that American TV news (aside from being extremely hysteric) mostly deal with domestic issues. I talked to an American colleague about that. I was quite surprised that he told me exactly, what I already thought: Americans are self-centered. Maybe a bit of information about foreign countries and societies (especially the ones you plan to bomb) would make some things go more smoothly.

Both Western European societies and the US appreciate personal freedom. But the definitions of freedom seem to be somewhat different. In Europe you can usually drink alcohol in public and on many beaches you can decide yourself how much you wear. Americans want to be able to buy firearms, drive big cars and put their dishes into the trashcan after eating. In Europe, I don't miss any of the American freedoms. Not sure about the vice versa.

Shows of any kind in the US definitely have another dimension. Germans seem to be way too modest to do something like the killer-whale show in the San Diego Seaworld or the shows in the Universal studios in Hollywood.

Sunday, September 7, 2008

Image transformation

Lots of graphics and video effects can be described by a coordinate transformation: There is a rule for calculating destination coordinates from the source coordinates. The rule completely describes the type of transformation.

The most prominent family of transformations are the affine transforms, where the transform rule can be described by a vector matrix multiplication:

(x_dst, y_dst) = (A) (x_src, y_src)

With this, one can implement e.g. scaling and rotation. If the vectors and the matrix are given in homogeneous coordinates, one can also shift the image by a vector. Other transforms include lens distortion, perspective distortion, wave effects and much more.

Now the question: how can such a transform be implemented for digital images? The algorithm is straightforward:
  • Find the inverse transform. For affine transforms, it will simply be the inverse matrix. For other transforms it might not be that easy. The inverse transform will give you the source coordinates as a function of the destination coordinates
  • For each destination pixel, find the coordinates in the source image
  • If the source coordinates are fractional (they usually will be), interpolate the destination pixel from the surrounding source pixels

Since the interpolation is always the same, I decided to implement this in a generic way in gavl (gavl_image_transform_t). With this, we can implement lots of different transforms by just defining the inverse coordinate transform as a C function and passing it to the interpolation engine. While the algorithm is simple in theory, the implementation has to take care of some nasty details:

1. Y'CbCr formats with subsampled chroma planes
In the video area, these are the rule rather than the exception. Many filters don't support them and use an RGB format instead (causing some conversion overhead). They can, however, be handled easily if you set up separate interpolation engines for each plane. For the subsampled planes, you do the following:
  • Transform the coordinates of the chroma location to image coordinates (i.e. multiply by the subsampling factors and shift according to the chroma placement)
  • Call the function to get the source coordinates the usual way
  • Transform the source coordinates back to the coordinates of the chroma plane
2. Destination pixels which are outside the source image
This is e.g. the case where an image is downscaled. Gavl handles these by not touching the destination pixel at all. Then you can fill the destination frame with a color before the transformation and this color will be the background color later on.

3. Destination pixel is inside the source image, but surrounding pixels (needed for interpolation) are not
Here, one can argue a lot about what should be done. For gavl, I decided to assume that the "missing pixels" have the same color as the closest border pixel. The reason is that instead of handling all possible cases inside the conversion loop for each pixel (which would slow things down due to the additional branches), one can simply shift the source indices and modify the interpolation coefficients once during initialization. The following figure illustrates how this is done:

The start index n and the interpolation coefficients are saved in the interpolation table. After shifting the table, the interpolation routine works without branches (and without crashes). Due to the way the interpolation coefficients are modified we assume that the missing pixels at -2 and -1 are the same color as the border pixel at 0. Of course this is done for x and y directions and also for the case that indices are larger than the maximum one.


The image transformation is very easy to use: just get gavl from CVS and read the API documentation. There is also a gmerlin filter in CVS (fv_transform) which can be used as a reference. Some questions might, however, arise when using this:

1. How exactly are coordinates defined?
Gavl scaling and transformation routines work with subpixel precision internally. This is necessary if one wants to handle chroma placement correctly. To get everything right, one should think a bit about how coordinates are defined exactly. This is an example for a 3x3 image:

The sample values for each pixel are taken from the pixel center. This means, the top-left pixel has a color value corresponding to the location (0.5, 0.5). For chroma planes, the exact sample locations are considered as described here.

2. Handling of nonsquare pixels
These must be handled by the coordinate transform routine provided by you. Basically, you have a "sample aspect ratio" (sar = pixel_width / pixel_height). In your transformation function, you do something like:

x_dst *= sar; /* Distorted -> undistorted */

/* Calculate source coordinate assuming undistorted image */

x_src /= sar; /* Undistorted -> distorted */

3. Image scaling
One is tempted to think that this all-in-one solution can be used for scaling as well. It is, of course, true, but it's a stupid thing to do. Scaling can be highly optimized in many ways. The gavl_video_scaler_t does this. It's thus many times faster than the generic transform.

4. Downsampling issues
The image transform makes no assumptions about the type of the transform, especially not whether the transform corresponds to downsampling or not. And this is where some issues arise. While for upsampling it's sufficient to just interpolate the destination pixels from the source pixels, for downsampling the image must be low-pass filtered (i.e. blurred) first, because otherwise the sampling theorem is violated and aliasing occurs. A very scary example of this is discussed here. One more reason to use the gavl_video_scaler_t wherever possible, because it supports antialiasing filters for downscaling. The good news is that usual video material is already a bit blurry, so aliasing artifacts are hardly visible.

After this much theory, finally some examples. These images were made with the gmerlin transform filter (the original photo was taken in the South Indian ruin city of Hampi).

Perspective effect. Coordinate transform ported from the Gimp.


Whirl/pinch (Coordinate transform ported from cinelerra)

Lens distortion (Coordinate transform ported from EffecTV)

Generic affine (enter single matrix coefficients)

Music from everywhere - everywhere

The project goal was simple. The available audio sources are:
  • Vinyl records
  • Analog tapes
  • CDs
  • Radio (analog and internet)
  • Music files in all gmerlin supported formats on 2 PCs
All these should be audible in stereo in all parts of the apartment including balcony and bathroom. Not everywhere at the same time though and not necessarily in high-end quality.

This had been on my wish list for many years, but I was too lazy to lay cables through the whole apartment (especially through doors, which should be lockable). And since there are lots of signal paths, the result would have been a bit messy. Dedicated wireless audio solutions didn't really convince me, since they are mostly proprietary technology. I never want to become a victim of Vendor lock-in (especially not at home).

When I first read about the WLAN-radios I immediately got the idea, that those are the key to the solution. After researching a lot I found one radio which has all features I wanted:
  • Stereo (if you buy a second speaker)
  • Custom URLs can be added through a web interface (many WLAN radios don't allow this!)
  • Ogg Vorbis support (so I can use Icecast2/ices2 out of the box)
The block diagram of the involved components is here:

Now I had to set up the streaming servers. The icecast server itself installs flawlessly on Ubuntu 8.04. It's started automatically at boot. For encoding and sending the stream to icecast, I use ices2 from the command line. Two tiny problems had to be solved:
  • Ubuntu 8.04 uses PulseAudio as the sound server, while ices2 only supports Alsa. Recording from an Alsa hardware device while PulseAudio is running doesn't work.
  • For grabbing the audio from a running gmerlin player the soundcard and driver need to support loopback (i.e. record what's played back). This is the case for the Audigy soundcard in the Media PC, but not for the onboard soundcard in the desktop machine.
Both problems can be solved by defining pulse devices in the ~/.asoundrc file:
pcm.pulse {
  type pulse
}

pcm.pulsemon {
  type pulse
  device alsa_output.pci_8086_293e_sound_card_0_alsa_playback_0.monitor
}

ctl.pulse {
  type pulse
}
The Alsa devices to be written into the ices2 configuration files are called pulse and pulsemon. The device line is hardware dependent. Use the PulseAudio Manager (section Devices) to find out the corresponding name for your card. If your card and driver support loopback, the pulsemon device isn't necessary.

Some fine-tuning can be done regarding encoding bitrate, buffer sizes and timeouts. When I optimized everything for low latency, icecast considered the WLAN radio too slow and disconnected it. More conservative settings work better. Encoding quality is always 10 (maximum), the corresponding bitrate is around 500 kbit/s.

Mission accomplished

Saturday, September 6, 2008

gcc visibility

Newer versions of gcc (e.g. gcc-4.2.3 coming with Ubuntu-8.04) support a feature called visibility. It is explained in detail here. In essence, it lets you restrict functions to the scope of a shared library. The advantages are:

  • Since these functions are not callable from outside, the dynamic symbol table becomes smaller. This speeds up loading (especially in the debugger I assume) and makes a smaller library.

  • If gcc knows at compile-time, that a called function will be in the same shared lib, it can optimize the function call.

My test candidate for this was libquicktime, because it contains lots of private functions (e.g. those in lqt_funcprotos.h). Supporting visibility is easy; this is what I did:

  • Let the configure script append -fvisibility=hidden to the CFLAGS if that's supported by gcc.

  • Put the function declarations in public headers between

    #pragma GCC visibility push(default)

    and

    #pragma GCC visibility pop

  • If there are functions called from outside (e.g. from libquicktime codecs) declared in private headers, add

    __attribute__ ((visibility("default")))

    to their declarations.

In the libquicktime case, this change also revealed some dirty corners (which I cleaned up after compiling with -Wmissing-declarations).

Let's see, what actually changed:

1. Entries in the dynamic symbol table:
# nm -D /opt/gmerlin/lib/ | grep " T " | wc -l
Before: 866
After: 278

2. Size of the shared library (compiled with standard settings):
Before: 622K
After: 564K

I did some rough speed tests with time qtinfo, but the differences were too small and the measurement method too inaccurate to be significant.

Let's see how this settles down. Doing the same for the gmerlin libraries will be much easier, because they are much cleaner already :)

Friday, September 5, 2008

Make an analog 5.1 switch using scart components

The project had 3 clear goals:

1. Remove the disastrous cable mess around the Hifi-system and TV

2. Avoid repeatedly creeping into dark corners to reconnect the cables (always increasing the mess mentioned in 1)

3. Do 1 and 2 without decreasing functionality of the whole system

After considering all necessary signal paths, it turned out that one component was missing: an analog 5.1 switch. My 5.1 receiver (from Yamaha) is almost 10 years old now, but it still works perfectly. Unfortunately, it has only one 5.1 input (and lots of stereo inputs). Now, I have 2 sources of 5.1 audio: the DVD player and the Media PC.

And to make things worse: many stereo movies on the PC (e.g. rips from VHS tapes) are encoded with the old Dolby Surround method. The Yamaha receiver handles them perfectly, but only when the signal comes into a stereo input. This means that the stereo channels from the 5.1 signal must also be routed to a stereo input of the receiver.

So I was searching for something, which does the same as the green box in the following figure (the AUX channel is reserved for future extensions):

Googling around revealed only one solution, which is unacceptable because it only switches between 2 sources, needs a power supply (violating goal 1) and costs € 258. All other search results were people saying that this sucks.

The question was whether some other device could be tricked into doing exactly what I want, and the answer was yes. The idea is to (ab)use the video wires of scart cables for the additional audio channels. So I got 3 adapters, each with one scart connector and 6 cinch connectors:

And a scart switch-box:

The funny thing about the switch box is that the stereo channels are available at additional cinch connectors (right side), so I can route front L/R to a stereo input of my receiver, like I wanted.

Mission accomplished.

Thursday, September 4, 2008

The future from the last century

When I was younger (in the 70s), I spent a lot of time reading books about the exciting things to come in the year 2000 and beyond. We were supposed to have at least one permanent moon base, public transport with Cabin taxis and underwater cities. And of course peace, happiness and prosperity all over the planet.

Today, it's a bit disappointing to see that none of these things actually became true. Humans seem to be most innovative when it comes to better weapons or more effective Vendor lock-in.

Technical advances must go hand in hand with advances of the human mind; everything else is doomed to failure. Maybe the internet (which was not mentioned in my children's books) can help here.