Saturday, February 28, 2009

Parallelizing video routines

Basics
One project shortly after the last gmerlin release was to push the CPU usage beyond 50% on my dual-core machine. There is lots of talk about parallelizing commonly used video routines, but little actual code.

In contrast to other areas (like scientific simulations), video routines are extremely simple to parallelize because most of them fulfill the following two conditions:
  • The outermost loop runs over the scanlines of the destination image
  • The destination image is (except for a few local variables) the only memory area where data are written
So the first step is to change the conversion functions of the type:
void conversion_func(void * data, int width, int height)
{
  int i, j;
  for(i = 0; i < height; i++)
  {
    for(j = 0; j < width; j++)
    {
      /* Do something here */
    }
  }
}
to something like:
void conversion_func(void * data, int width, int start, int end)
{
  int i, j;
  for(i = start; i < end; i++)
  {
    for(j = 0; j < width; j++)
    {
      /* Do something here */
    }
  }
}

The data argument points to a struct containing everything needed for the conversion: destination frame, source frame(s), lookup tables etc. The height argument was replaced by start and end arguments, which means that the function now processes a slice (i.e. a number of consecutive scanlines) instead of the whole image. What remains is to call the conversion function from multiple threads, one slice per thread, as sketched below. It is important to split the outermost loop, so the threading overhead stays as small as possible.
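For illustration, here is how the slice boundaries might be computed. dispatch_slices and num_threads are made-up names for this sketch, and in the real thing each call would run in its own worker thread rather than sequentially:

/* Illustrative only: split the image into num_threads slices
   and process each one */
static void dispatch_slices(void * data, int width, int height,
                            int num_threads)
{
  int i;
  int delta = height / num_threads; /* scanlines per slice */
  for(i = 0; i < num_threads; i++)
  {
    int start = i * delta;
    /* The last slice absorbs the remainder if height is not
       divisible by num_threads */
    int end = (i == num_threads - 1) ? height : start + delta;
    conversion_func(data, width, start, end);
  }
}

For a 576-line PAL frame and 2 threads this yields the slices [0,288) and [288,576).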

The API perspective
Everything was implemented
  • With a minimal public API
  • In a backwards-compatible way (i.e. if you ignore the API extensions, things still work)
  • Completely independently of the actual thread implementation (i.e. no -lpthread is needed for gavl)
The whole magic can be boiled down to:
/* Slice process function, which gavl passes to the application */
typedef void (*gavl_video_process_func)(void * data, int start, int end);

/* Routine which passes the actual work to a worker thread
   (which is managed by the application) */
typedef void (*gavl_video_run_func)(gavl_video_process_func func,
                                    void * gavl_data,
                                    int start, int end,
                                    void * client_data, int thread);

/* Synchronization routine, which waits for a worker thread to finish */
typedef void (*gavl_video_stop_func)(void * client_data, int thread);

/* Set the maximum number of available worker threads (gavl might not
   need all of them if you have more CPUs than scanlines) */
void gavl_video_options_set_num_threads(gavl_video_options_t * opt, int n);

/* Pass the run function to gavl */
void gavl_video_options_set_run_func(gavl_video_options_t * opt,
                                     gavl_video_run_func func,
                                     void * client_data);

/* Pass the stop function to gavl */
void gavl_video_options_set_stop_func(gavl_video_options_t * opt,
                                      gavl_video_stop_func func,
                                      void * client_data);
As always, the gavl_video_options_set_* routines have corresponding
gavl_video_options_get_* routines. These can be used to reuse the same
multithreading mechanism outside gavl (e.g. in a video filter).
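If you use the gmerlin thread pool described below, the setup might look roughly like this. bg_thread_pool_run and bg_thread_pool_stop are the functions shown in the next section; bg_thread_pool_create is an assumed constructor name (the actual gmerlin API may differ in detail):

/* Sketch: enable 2 worker threads for a gavl video converter.
   bg_thread_pool_create() is an assumed name. */
bg_thread_pool_t * pool = bg_thread_pool_create(2);
gavl_video_converter_t * cnv = gavl_video_converter_create();
gavl_video_options_t * opt = gavl_video_converter_get_options(cnv);

gavl_video_options_set_num_threads(opt, 2);
gavl_video_options_set_run_func(opt, bg_thread_pool_run, pool);
gavl_video_options_set_stop_func(opt, bg_thread_pool_stop, pool);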

The application perspective
As noted above there is no pthread specific code inside gavl. Everything is passed via callbacks. libgmerlin has a pthread based thread pool, which does exactly what gavl needs.

The thread pool is just an array of context structures, one per thread:
typedef struct
{
  /* Thread handle */
  pthread_t t;

  /* gavl -> thread: Do something */
  sem_t run_sem;

  /* thread -> gavl: I'm done */
  sem_t done_sem;

  /* Mechanism to make the function finish */
  pthread_mutex_t stop_mutex;
  int do_stop;

  /* Passed from gavl */
  void (*func)(void*, int, int);
  void * data;
  int start;
  int end;
} thread_t;
The main loop (passed to pthread_create) for each worker thread looks like:
static void * thread_func(void * data)
{
  thread_t * t = data;
  int do_stop;
  while(1)
  {
    sem_wait(&t->run_sem);

    pthread_mutex_lock(&t->stop_mutex);
    do_stop = t->do_stop;
    pthread_mutex_unlock(&t->stop_mutex);

    if(do_stop)
      break;
    t->func(t->data, t->start, t->end);
    sem_post(&t->done_sem);
  }
  return NULL;
}
The worker threads are launched at program start and run all the time. As long as there is nothing to do, they wait on the run_sem semaphore (using zero CPU). Launching new threads for every little piece of work would have a much higher overhead.
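A sketch of such a launch, assuming the thread_t structure from above wrapped in a pool struct. Both bg_thread_pool_create and the struct layout are illustrative; the real code lives in lib/threadpool.c (this sketch needs <stdlib.h>, <pthread.h> and <semaphore.h>):

/* Assumed pool structure; the actual layout may differ */
typedef struct
{
  thread_t * threads;
  int num_threads;
} bg_thread_pool_t;

bg_thread_pool_t * bg_thread_pool_create(int num_threads)
{
  int i;
  bg_thread_pool_t * p = calloc(1, sizeof(*p));
  p->num_threads = num_threads;
  p->threads = calloc(num_threads, sizeof(*p->threads));
  for(i = 0; i < num_threads; i++)
  {
    sem_init(&p->threads[i].run_sem, 0, 0);
    sem_init(&p->threads[i].done_sem, 0, 0);
    pthread_mutex_init(&p->threads[i].stop_mutex, NULL);
    p->threads[i].do_stop = 0;
    pthread_create(&p->threads[i].t, NULL, thread_func, &p->threads[i]);
  }
  return p;
}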

Passing work to a worker thread happens with the following gavl_video_run_func:
void bg_thread_pool_run(void (*func)(void *, int start, int end),
                        void * gavl_data,
                        int start, int end,
                        void * client_data, int thread)
{
  bg_thread_pool_t * p = client_data;
  p->threads[thread].func = func;
  p->threads[thread].data = gavl_data;
  p->threads[thread].start = start;
  p->threads[thread].end = end;

  sem_post(&p->threads[thread].run_sem);
}
Synchronizing (waiting for the work to finish) happens with the following gavl_video_stop_func:
void bg_thread_pool_stop(void * client_data, int thread)
{
  bg_thread_pool_t * p = client_data;
  sem_wait(&p->threads[thread].done_sem);
}
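Putting run and stop together, the gavl side can be imagined roughly like this: fire one slice per thread, then wait for all of them. This is a simplified illustration using only the typedefs from above, not gavl's actual internal code:

static void run_parallel(gavl_video_run_func run,
                         gavl_video_stop_func stop,
                         void * client_data,
                         gavl_video_process_func func, void * gavl_data,
                         int height, int num_threads)
{
  int i;
  int delta = height / num_threads;
  for(i = 0; i < num_threads; i++)
  {
    int start = i * delta;
    int end = (i == num_threads - 1) ? height : start + delta;
    run(func, gavl_data, start, end, client_data, i);
  }
  /* Wait until every slice is done before the frame is used */
  for(i = 0; i < num_threads; i++)
    stop(client_data, i);
}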
The whole thread pool implementation can be found in the gmerlin source tree in lib/threadpool.c (128 lines including copyright header).

Implementations
Parallelization was implemented for
Benchmarks
For benchmarking one needs a scenario where the parallelized processing routine(s) take the lion's share of the total CPU usage. I decided to run a (completely nonsensical) Gaussian blur with a radius of 50 over 1000 frames of a 720x576 PAL sequence. All code was compiled with default (i.e. optimized) options. Enabling 2 threads decreased the processing time from 81.31 sec to 43.35 sec.

Credits
Finally found this page for making source snippets in Blogger suck less.
