Thursday, December 11, 2008

How gmerlin-avdecoder works

If you study multimedia decoding software like xine, ffmpeg or MPlayer, you find that they all work in surprisingly similar ways. Gmerlin-avdecoder is no exception. The important components are shown in the image below:

The input module obtains the data. Examples of data sources are regular files, DVDs, and DVB or network streams. Usually the data is delivered as raw bytes. For DVDs and VCDs, however, read and seek operations are sector based. Since both formats require each sector to start with a syncpoint (an MPEG pack header), having sector-based data in the demultiplexer speeds up several things.

The demultiplexer is where the compressed frames are extracted from the container. During initialization, the demultiplexer creates the track table (bgav_track_table_t). This contains the tracks (in most cases just one). Each track contains the streams for audio, video and subtitles. For most containers the demultiplexer already knows the audio and video formats. For others, the codec must detect them. This means you should never trust the formats before you have called bgav_start().

In some cases (DVD, VCD, DVB) the input already knows the complete track layout and which demultiplexer to use. The initialization of the demultiplexer can then skip the stream detection. In the general case the demultiplexer is selected according to the file content (i.e. the first few bytes). Some formats (MPEG, mp3) can have garbage before the first detection pattern, so we must repeatedly skip bytes before checking for one of these.
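The content-based selection described above can be sketched roughly as follows. This is only an illustration, not the gmerlin-avdecoder code: the probe functions, the `PROBES` table, and the `max_skip` limit are all made up for the example. The two signatures used (the MPEG pack start code 00 00 01 BA and the `RIFF` tag) are real.

```python
# Hypothetical sketch: names and structure are illustrative, not the bgav API.

def probe_mpeg_pack(data: bytes, pos: int) -> bool:
    """MPEG program stream pack header start code: 00 00 01 BA."""
    return data[pos:pos + 4] == b"\x00\x00\x01\xba"

def probe_riff(data: bytes, pos: int) -> bool:
    """RIFF containers (AVI, WAV) start with the ASCII tag 'RIFF'."""
    return data[pos:pos + 4] == b"RIFF"

# (name, probe function, may have garbage before the first pattern)
PROBES = [
    ("riff", probe_riff, False),
    ("mpegps", probe_mpeg_pack, True),
]

def detect_format(data: bytes, max_skip: int = 2048):
    """Return (format name, offset of the detection pattern) or (None, -1)."""
    # First pass: every format is checked at byte 0.
    for name, probe, _ in PROBES:
        if probe(data, 0):
            return name, 0
    # Second pass: skip bytes, but only for formats that tolerate garbage.
    for pos in range(1, min(max_skip, len(data))):
        for name, probe, skippable in PROBES:
            if skippable and probe(data, pos):
                return name, pos
    return None, -1
```

Checking all formats at offset 0 first keeps a strict signature from being shadowed by a skipping format that happens to match a few bytes later.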

The demultiplexer has a routine which reads the next packet from the input. Depending on the format, this involves decoding a packet header and extracting the compressed data, which can be handled by the codec later on. Some formats (rm, asf, MPEG-2 transport streams, Ogg) use 2-layer multiplexing: there are top-level packets which contain subpackets.

If the format is well designed, we also know the timestamp and duration of the packet and whether the packet contains a keyframe. Not all formats are well designed though (or the encoders are buggy). In that case we must do a lot of black magic to get as much information as possible.

Most demultiplexers are native implementations. As a fallback for very obscure formats we also support demultiplexing with libavformat.

Gmerlin-avdecoder is strictly pull-based. If a video codec requests a packet, but the next packet in the stream belongs to an audio stream, that packet must be stored for later use. For this, we have buffers, which are just linked lists of packets. They can grow dynamically. This approach makes the decoding process mostly insensitive to badly interleaved streams.
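A minimal sketch of the pull model with per-stream buffers might look like this. The class and method names are hypothetical, packets are reduced to `(stream_id, payload)` tuples, and a `deque` stands in for the linked list.

```python
from collections import deque

class StreamingDemuxer:
    """Hypothetical pull-based demuxer, not the real bgav data structures."""

    def __init__(self, packets, stream_ids):
        self._input = iter(packets)  # stands in for the input module
        self.buffers = {sid: deque() for sid in stream_ids}

    def get_packet(self, stream_id):
        buf = self.buffers[stream_id]
        # Read sequentially until the requested stream has a packet buffered.
        while not buf:
            try:
                sid, payload = next(self._input)
            except StopIteration:
                return None  # end of input reached
            if sid in self.buffers:
                # Packets of other active streams are stored for later.
                self.buffers[sid].append(payload)
            # Packets of streams nobody uses are simply dropped.
        return buf.popleft()
```

With a badly interleaved input, asking for the first video packet just makes the audio buffer grow until a video packet turns up. Nothing is lost and no seeking is needed.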

Interleaved vs. noninterleaved
The demultiplexing method described above reads the file strictly sequentially. This has the advantage that we never seek in the stream so we can do this for non-seekable sources.

Some files (more than you might think) are however completely non-interleaved, e.g. the audio packets follow all video packets. These always have a global index though. In this case, if a video packet is requested, the demultiplexer seeks to the packet start and reads the packet. This mode, which is also used in sample accurate mode, only works for seekable sources.
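The index-driven mode can be sketched like this, assuming a hypothetical global index with one `(stream_id, byte_offset, size)` entry per packet. Real container indices carry more information, and the function name is made up.

```python
import io

def read_indexed(source, index, stream_id, packet_number):
    """Read the n-th packet of one stream by seeking to its indexed position."""
    entries = [e for e in index if e[0] == stream_id]
    if packet_number >= len(entries):
        return None  # EOF for this stream only
    _, offset, size = entries[packet_number]
    source.seek(offset)  # this is why the source must be seekable
    return source.read(size)

# A completely non-interleaved "file": all video packets, then all audio packets.
source = io.BytesIO(b"V0V1V2A0A1")
index = [("v", 0, 2), ("v", 2, 2), ("v", 4, 2), ("a", 6, 2), ("a", 8, 2)]
first_audio = read_indexed(source, index, "a", 0)
```

In contrast to the sequential mode, the first audio packet is available immediately, without buffering the video packets that precede it in the file.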

Codecs convert packets to A/V frames. In most cases, one packet equals one frame. In some cases (mostly for MPEG streams), the codec must first assemble frames from packets or split packets containing multiple frames. The codec outputs gavl frames, which are handled by the application.

Codecs are selected according to fourccs. For formats which don't have fourccs, we either invent them or use fourccs from AVI or Quicktime.

Video codecs must take care of timestamps. For MPEG streams, the timestamps at the multiplex level (with a 90 kHz clock) must be converted to timestamps in the timescale given by the video framerate. Audio codecs must do buffering because the application can decide how many samples to read at once.
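As a worked example of the timestamp conversion: for NTSC video (30000/1001 fps) a natural timescale is 30000 with a frame duration of 1001, and one frame lasts 3003 ticks of the 90 kHz clock. The helper below is a sketch with a made-up name and a round-to-nearest policy that is my assumption, not necessarily what the codecs do.

```python
MPEG_CLOCK = 90000  # ticks per second of MPEG system timestamps

def rescale_pts(pts_90k: int, timescale: int) -> int:
    """Convert a 90 kHz timestamp to `timescale` ticks per second (rounded)."""
    return (pts_90k * timescale + MPEG_CLOCK // 2) // MPEG_CLOCK

ntsc_second_frame = rescale_pts(3003, 30000)  # one NTSC frame duration
pal_second_frame = rescale_pts(3600, 25)      # 25 fps: 3600 ticks per frame
```

Doing this in integer arithmetic avoids accumulating floating-point error over long streams.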

Text subtitles
There are no codecs for text subtitles. Each packet contains the string (converted to UTF-8 by the demultiplexer), the presentation timestamp and the duration.

Reading a Video frame
If an application calls bgav_read_video, the following happens:
  • The core calls the decode function of the codec
  • The codec checks if there is already a decoded frame available. This is the case after initialization because some codecs need to decode the first picture to detect the video format
  • If no frame is left, the codec decodes one. For this it will most likely need a new packet
  • The codec requests a packet. Either the packet buffer already has one, or the demultiplexer must get one
  • In streaming mode, the demultiplexer gets the next packet from the input and puts it into the packet buffer of the stream to which it belongs. If the stream is not used, no packet is produced and its data is skipped. This is repeated until the end is reached or a packet for the video stream is found. If the end is reached, the demultiplexer signals EOF for the whole track.
  • In non-streaming mode, the demultiplexer knows which stream requested the packet. It seeks to the position of the next packet and reads it. EOF is signaled per stream.
Index building
Building file indices for sample accurate access can happen in different ways depending on the container format. In the end, we need byte positions in the file, the associated timestamps (in output timescale) and keyframe flags. The following modes are supported:
  • MPEG mode: The codec must build the index. This involves parsing the frames (only needed parts) to extract timing information with sample accuracy. Codecs supporting this mode are libmpeg2, libavcodec, libmad, liba52 and faad2.
  • Simple mode: The demultiplexer knows about the output timescale, gets precise timestamps and one packet equals one frame. Then no codec is needed for building the index.
  • Mix of the above. E.g. in flv, timestamps are always in milliseconds. This is precise for video streams. For audio streams (mostly mp3), we need MPEG mode.
B-frames are omitted from the index. That's because no one will use them as a seek point anyway, and without them the PTS become strictly monotonic. This lets us do a fast binary search in the index, but the demultiplexer must be prepared for packets not contained in the index.
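Because the PTS in the index increase strictly, a seek can be a plain binary search followed by a short walk back to the previous keyframe. The entry layout `(pts, byte_position, is_keyframe)` and the function name below are hypothetical. The real index structure is not shown in this post.

```python
from bisect import bisect_right

def find_seek_point(index, target_pts):
    """Return the last keyframe entry with pts <= target_pts, or None."""
    pts_list = [entry[0] for entry in index]
    i = bisect_right(pts_list, target_pts) - 1  # last entry with pts <= target
    while i >= 0 and not index[i][2]:           # walk back to a keyframe
        i -= 1
    return index[i] if i >= 0 else None

# I P P I P with the B-frames left out, so the pts column is strictly increasing.
index = [
    (0, 0, True),
    (3, 4000, False),
    (6, 7000, False),
    (12, 15000, True),
    (15, 18000, False),
]
seek_entry = find_seek_point(index, 13)
```

If the B-frames were in the index in file order, the PTS column would not be sorted and `bisect` could not be used on it.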

Friday, December 5, 2008

Downscaling algorithms

The theory
Downscaling images is a commonly needed operation, e.g. if HD videos or photos from megapixel cameras are converted for burning on DVD or uploading to the web. Mathematically, image scaling is exactly the same as audio samplerate conversion (only in 2D, but it can be decomposed into two subsequent 1D operations). All these have in common that samples of the destination must be calculated from the source samples, which are given on a (1D or 2D) grid with a different spacing (temporally or spatially). The reciprocals of the grid spacings are the sample frequencies.

In the general case (i.e. if the resample ratio is arbitrary) one will end up with interpolating the destination samples from the source samples. The interpolation method (nearest neighbor, linear, cubic...) can be configured in better applications.

One might think we are done here, but unfortunately we aren't. Interpolation is the second step of downscaling. Before that, we must make sure that the sampling theorem is not violated. The sampling theorem requires the original signal to be band-limited with a cutoff frequency of half the sample frequency. This cutoff frequency is also called the Nyquist frequency.

So if we upscale an image (or resample an audio signal to a higher sample frequency), we can assume that the original was already band-limited to half the lower sample frequency, so we have no problem. If we downsample, we must first apply a digital low-pass filter to the image. Low-pass filtering of images is the same as blurring.

The imagemagick solution
What pointed me to this subject in the first place was this post in the gmerlin-help forum (I knew about sampling theory before, I simply forgot about it when programming the scaler). The suggestion, which is implemented in ImageMagick, was to simply widen the filter kernel by the inverse scaling ratio. For linear downscaling to 1/2 size this would mean doing an interpolation involving 4 source pixels instead of 2. The implementation is extremely simple: it's just one additional variable for calculating the number and values of the filter coefficients. This blurs the image because it does some kind of averaging involving more source pixels. Also, the amount of blurring is proportional to the inverse scale factor, which is correct. The results actually look ok.
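To make the "one additional variable" concrete, here is a rough sketch of how the coefficients for one destination pixel could be computed with a widened linear (triangle) kernel. It is neither the ImageMagick nor the gavl code. The function names, the pixel-center convention, and the `stretch` variable are assumptions for illustration, and edge handling is left out.

```python
import math

def triangle(x):
    """Linear interpolation kernel with support [-1, 1]."""
    return max(0.0, 1.0 - abs(x))

def linear_coeffs(dst_index, scale, widen=True):
    """Return [(src_index, weight), ...] for one destination pixel.

    scale is destination size / source size, so scale < 1 means downscaling.
    """
    # The one extra variable: stretch the kernel by the inverse scale factor.
    stretch = 1.0 / scale if (widen and scale < 1.0) else 1.0
    center = (dst_index + 0.5) / scale - 0.5      # position in source pixels
    first = math.floor(center - stretch) + 1      # first tap with weight > 0
    last = math.ceil(center + stretch) - 1        # last tap with weight > 0
    taps = [(j, triangle((j - center) / stretch)) for j in range(first, last + 1)]
    total = sum(w for _, w in taps)
    return [(j, w / total) for j, w in taps]      # normalize to sum 1

wide = linear_coeffs(1, 0.5)                 # downscale to 1/2 size
plain = linear_coeffs(1, 0.5, widen=False)   # ordinary linear interpolation
```

For the 1/2 size case the widened kernel touches source pixels 1 to 4 with weights 0.125, 0.375, 0.375, 0.125, while the plain kernel only averages pixels 2 and 3.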

The gavl solution
I thought about what the correct way to do this would be. As explained above, we must blur the image with a defined cutoff frequency and then interpolate it. For the blur filter, I decided to use a Gaussian low-pass because it seems the best suited for this.

The naive implementation is to blur the source image into a temporary video frame and then interpolate into the destination frame. It can however be done much faster and without a temporary frame, because the 2 operations are both convolutions, and convolution has the nice property that it's associative. This means that we can convolve the blur coefficients with the interpolation coefficients, resulting in the filter coefficients for the combined operation. These are then used on the images.
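The associativity argument can be checked numerically in 1D. The sketch below is not the gavl implementation: the Gaussian radius and sigma are arbitrary demo values, and a fixed two-tap kernel stands in for the interpolation step (in a real scaler the interpolation coefficients differ per destination pixel, so the combined kernel would be computed per pixel as well).

```python
import math

def convolve(a, b):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def gaussian(radius, sigma):
    """Normalized Gaussian kernel with 2 * radius + 1 taps."""
    taps = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

blur = gaussian(radius=3, sigma=1.0)   # arbitrary demo parameters
interp = [0.5, 0.5]                    # linear interpolation halfway between 2 pixels
combined = convolve(blur, interp)      # one kernel doing both steps

row = [10.0, 20.0, 80.0, 40.0, 0.0, 30.0, 90.0, 60.0, 50.0, 70.0]
two_pass = convolve(convolve(row, blur), interp)  # blur, then interpolate
one_pass = convolve(row, combined)                # single pass, no temporary row
```

Both results agree up to floating-point rounding, and the combined kernel still sums to 1, so the overall brightness is preserved.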

The difference
The 2 solutions have a lot in common. Both run in 1 pass and blur the image according to the inverse scaling ratio. The difference is that the imagemagick method simply widens the filter kernel by a factor, while gavl widens the filter by convolving with a low-pass.

During my research I found this page. I downloaded the sample image (1000x1000) and downscaled it to 200x200 with different methods.

First the scary ones:

OpenGL (Scale mode linear, GeForce 8500 GT, binary drivers)

XVideo (GeForce 8500 GT, binary drivers)

Firefox 3.0.4 (that's why I never let the browser scale when writing html)

Gimp (linear, indexed mode)

gavl (downscale filter: GAVL_DOWNSCALE_FILTER_NONE)

Now it gets better:

Gimp (linear, grayscale mode)

mplayer -vf scale=200:200
The movie was made with qtrechunk from the png file.

gavl with imagemagick method (downscale filter: GAVL_DOWNSCALE_FILTER_WIDE)

gavl with gaussian preblur (downscale filter: GAVL_DOWNSCALE_FILTER_GAUSS)

Blogger thumbnail (400x400). Couldn't resist uploading the original size image to blogger to see what happens. Not bad, but not 200x200.