Saturday, February 28, 2009

Parallelizing video routines

Basics
One project shortly after the last gmerlin release was to push the CPU usage beyond 50 % on my dual core machine. There is a lot of talk about parallelizing commonly used video routines, but little actual code.

In contrast to other areas (like scientific simulations) video routines are extremely simple to parallelize because most of them fulfill the following 2 conditions:
  • The outermost loop runs over scanlines of the destination image
  • The destination image is (except for a few local variables) the only memory area where data is written
So the first step is to change the conversion functions of the type:
void conversion_func(void * data, int width, int height)
  {
  int i, j;
  for(i = 0; i < height; i++)
    {
    for(j = 0; j < width; j++)
      {
      /* Do something here */
      }
    }
  }
to something like:
void conversion_func(void * data, int width, int start, int end)
  {
  int i, j;
  for(i = start; i < end; i++)
    {
    for(j = 0; j < width; j++)
      {
      /* Do something here */
      }
    }
  }

The data argument points to a struct containing everything needed for the conversion: destination frame, source frame(s), lookup tables etc. The height argument was replaced by start and end arguments, which means that the function now processes a slice (i.e. a number of consecutive scanlines) instead of the whole image. What remains is to call the conversion function from multiple threads. It is important that the outermost loop is split, so that the overhead stays as small as possible.
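As a minimal sketch (not actual gavl code; the function name and parameters are just assumptions for illustration, and the actual thread dispatch is omitted), splitting the work into slices could look like this:

/* Sketch only: split the image into one slice per thread.
   In the real code each call is dispatched to a worker thread. */
static void convert_parallel(void * data, int width, int height, int num_threads)
  {
  int t;
  int delta = (height + num_threads - 1) / num_threads; /* scanlines per slice, rounded up */
  for(t = 0; t < num_threads; t++)
    {
    int start = t * delta;
    int end   = start + delta;
    if(end > height)
      end = height;
    /* This call would run in worker thread t */
    conversion_func(data, width, start, end);
    }
  /* Here one would wait until all worker threads finished their slices */
  }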

The API perspective
Everything was implemented
  • With a minimal public API
  • Backwards compatible (i.e. if you ignore the API extensions things still work)
  • Completely independent of the actual thread implementation (i.e. no -lpthread is needed for gavl)
The whole magic can be boiled down to:
/* Slice process function, which is passed by gavl to the application */
typedef void (*gavl_video_process_func)(void * data, int start, int end);

/* Routine, which passes the actual work to the worker thread
   (which is managed by the application) */
typedef void (*gavl_video_run_func)(gavl_video_process_func func,
                                    void * gavl_data,
                                    int start, int end,
                                    void * client_data, int thread);

/* Synchronization routine, which waits for the worker thread to finish */
typedef void (*gavl_video_stop_func)(void * client_data, int thread);

/* Set the maximum number of available worker threads (gavl might not need
   all of them if you have more CPUs than scanlines) */
void gavl_video_options_set_num_threads(gavl_video_options_t * opt, int n);

/* Pass the run function to gavl */
void gavl_video_options_set_run_func(gavl_video_options_t * opt,
                                     gavl_video_run_func func,
                                     void * client_data);

/* Pass the stop function to gavl */
void gavl_video_options_set_stop_func(gavl_video_options_t * opt,
                                      gavl_video_stop_func func,
                                      void * client_data);
As always, the gavl_video_options_set_* routines have corresponding
gavl_video_options_get_* routines. These can be used to reuse the same
multithreading mechanism outside gavl (e.g. in a video filter).
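To show how the pieces fit together, here is a hedged sketch of how an application could hook its own thread pool into gavl. It assumes a gavl_video_converter_t instance cnv created elsewhere and uses the libgmerlin thread pool functions described in the next section; bg_thread_pool_create() is assumed as the pool constructor.

/* Sketch only: wiring a thread pool into a gavl video converter.
   bg_thread_pool_create() is an assumption, the run/stop functions
   are the libgmerlin ones shown below. */
bg_thread_pool_t * pool = bg_thread_pool_create(2);
gavl_video_options_t * opt = gavl_video_converter_get_options(cnv);

gavl_video_options_set_num_threads(opt, 2);
gavl_video_options_set_run_func(opt, bg_thread_pool_run, pool);
gavl_video_options_set_stop_func(opt, bg_thread_pool_stop, pool);

/* Subsequent conversions with cnv can now dispatch slices to the worker threads */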

The application perspective
As noted above, there is no pthread-specific code inside gavl. Everything is passed via callbacks. libgmerlin has a pthread-based thread pool, which does exactly what gavl needs.

The thread pool is just an array of context structures for each thread:
typedef struct
  {
  /* Thread handle */
  pthread_t t;

  /* gavl -> thread: Do something */
  sem_t run_sem;

  /* thread -> gavl: I'm done */
  sem_t done_sem;

  /* Mechanism to make the function finish */
  pthread_mutex_t stop_mutex;
  int do_stop;

  /* Passed from gavl */
  void (*func)(void*, int, int);
  void * data;
  int start;
  int end;
  } thread_t;
The main loop (passed to pthread_create) for each worker thread looks like:
static void * thread_func(void * data)
  {
  thread_t * t = data;
  int do_stop;
  while(1)
    {
    sem_wait(&t->run_sem);

    pthread_mutex_lock(&t->stop_mutex);
    do_stop = t->do_stop;
    pthread_mutex_unlock(&t->stop_mutex);

    if(do_stop)
      break;
    t->func(t->data, t->start, t->end);
    sem_post(&t->done_sem);
    }
  return NULL;
  }
The worker threads are launched at program start and run all the time. As long as there is nothing to do, they wait on the run_sem semaphore (using zero CPU). Launching new threads for every little piece of work would have a much higher overhead.
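For completeness, here is a simplified sketch of how such a pool could be created at startup and torn down at exit. It reuses the thread_t and thread_func definitions above; the struct layout, the names bg_thread_pool_create/bg_thread_pool_destroy, and omitted error checking and #includes (pthread.h, semaphore.h, stdlib.h) are assumptions, the authoritative version is in lib/threadpool.c.

/* Simplified sketch, not the exact gmerlin code: pool setup and teardown */
typedef struct
  {
  thread_t * threads;
  int num_threads;
  } bg_thread_pool_t;

bg_thread_pool_t * bg_thread_pool_create(int num_threads)
  {
  int i;
  bg_thread_pool_t * p = calloc(1, sizeof(*p));
  p->num_threads = num_threads;
  p->threads = calloc(num_threads, sizeof(*p->threads));
  for(i = 0; i < num_threads; i++)
    {
    sem_init(&p->threads[i].run_sem, 0, 0);
    sem_init(&p->threads[i].done_sem, 0, 0);
    pthread_mutex_init(&p->threads[i].stop_mutex, NULL);
    pthread_create(&p->threads[i].t, NULL, thread_func, &p->threads[i]);
    }
  return p;
  }

void bg_thread_pool_destroy(bg_thread_pool_t * p)
  {
  int i;
  for(i = 0; i < p->num_threads; i++)
    {
    /* Tell the thread to leave its main loop and wake it up */
    pthread_mutex_lock(&p->threads[i].stop_mutex);
    p->threads[i].do_stop = 1;
    pthread_mutex_unlock(&p->threads[i].stop_mutex);
    sem_post(&p->threads[i].run_sem);
    pthread_join(p->threads[i].t, NULL);

    sem_destroy(&p->threads[i].run_sem);
    sem_destroy(&p->threads[i].done_sem);
    pthread_mutex_destroy(&p->threads[i].stop_mutex);
    }
  free(p->threads);
  free(p);
  }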

Passing work to a worker thread happens with the following gavl_video_run_func:
void bg_thread_pool_run(void (*func)(void *, int start, int end),
                        void * gavl_data,
                        int start, int end,
                        void * client_data, int thread)
  {
  bg_thread_pool_t * p = client_data;
  p->threads[thread].func  = func;
  p->threads[thread].data  = gavl_data;
  p->threads[thread].start = start;
  p->threads[thread].end   = end;

  sem_post(&p->threads[thread].run_sem);
  }
Synchronizing (waiting for the work to finish) happens with the following gavl_video_stop_func:
void bg_thread_pool_stop(void * client_data, int thread)
  {
  bg_thread_pool_t * p = client_data;
  sem_wait(&p->threads[thread].done_sem);
  }
The whole thread pool implementation can be found in the gmerlin source tree in lib/threadpool.c (128 lines including copyright header).

Implementations
Parallelization was implemented for
Benchmarks
For benchmarking one needs a scenario where the parallelized processing routine(s) take the lion's share of the total CPU usage. I decided to apply a (completely nonsensical) Gaussian blur with a radius of 50 to 1000 frames of a 720x576 PAL sequence. All code was compiled with default (i.e. optimized) options. Enabling 2 threads decreased the processing time from 81.31 sec to 43.35 sec.
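That corresponds to a speedup of roughly 1.88 (81.31 / 43.35), i.e. about 94 % of the theoretical maximum for two cores.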

Credits
Finally found this page for making source snippets in Blogger suck less.

Friday, February 27, 2009

Elementary stream parsing

See the image below for an updated block schematic of gmerlin-avdecoder, as an addition to my last blog entry:



You can see what's new here: a parser, which converts "bad" packets into "good" packets. Now what exactly does that mean? A good (or well-formed) packet has the following properties:
  • It always has a correct timestamp (the presentation time at least)
  • It has a flag to determine if the packet is a keyframe
  • It has a valid duration. This is necessary for frame accurate repositioning of the stream after seeking, at least if you want to support variable framerates.
Good packets (e.g. from quicktime files) can directly be passed to the decoder, and it's possible to build an index from them. Unfortunately some formats (most notably MPEG program- and transport streams) don't have such nice packets. You neither know where a frame starts, nor whether it is a keyframe. Large frames can be split across many packets, and some packets can contain several small frames. Also, timestamps are not available for each frame. To make things worse, these formats are very widely used. You find them in .mpg files, on DVDs, (S)VCDs, Blu-ray disks and in DVB broadcasts. All newer formats for consumer cameras (HDV and AVCHD) also use MPEG-2 transport streams. Since they are important for video editing applications, sample accurate access is a main goal here.
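Purely as an illustration (the struct and field names here are made up, not the actual gmerlin-avdecoder types), a well-formed packet essentially carries the following information:

/* Illustrative sketch only, not the actual gmerlin-avdecoder packet structure */
typedef struct
  {
  uint8_t * data;     /* Compressed frame data */
  int       data_len;
  int64_t   pts;      /* Presentation timestamp, always valid */
  int64_t   duration; /* Needed for frame accurate seeking with variable framerates */
  int       keyframe; /* 1 if decoding can start at this packet */
  } packet_t;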

The first solutions for this were realized inside the decoders. libmpeg2 is very tolerant with regard to the frame alignment and ffmpeg has an AVParser, which splits a continuous stream into frames. Additional routines were written for building an index.

It was predictable that this would not be the ultimate solution. The decoders got very complicated, and building the index was not possible without firing up an ffmpeg decoder (because the AVParser doesn't report keyframes), so index building was very slow.

So I spent some time writing parsers for elementary A/V streams, which extract all the information necessary for creating well-formed packets.

After that worked, I could be sure that every codec always gets well-formed packets. What followed was by far the biggest cleanup in the history of gmerlin-avdecoder. Many things could be simplified, lots of cruft got kicked out, and duplicate code was moved to central locations.

New features are:
  • Decoding of (and sample accurate seeking within) elementary H.264 streams
  • Sample accurate access for elementary MPEG-1/2 streams
  • Sample accurate access for MPEG program streams with DTS- and LPCM audio
  • Faster seeking because non-reference frames are skipped while approaching the new stream position
  • Much cleaner and simpler architecture.
The cleaner architecture doesn't necessarily mean fewer bugs, but they are easier to fix now :)