Thursday, October 16, 2008

Streaming through the NAT

The only missing network streaming protocol for gmerlin-avdecoder was RTSP/RTP, so I decided to implement it. Some parts (the ones needed for playing the Real-rtsp variant) were already there, but the whole RTP stuff was missing.

The advantage of these protocols is that they are well documented in RFCs. With some knowledge about how sockets and their API work, implementation was straightforward. What is special about RTSP/RTP is that there is one RTSP connection (usually TCP port 554) which acts like a remote control, while the actual A/V data are delivered over RTP, which usually uses UDP. To make things more complicated, each stream is transported over its own UDP socket, with another socket (RTCP) used for QoS information. Playing a normal movie with audio and video therefore needs 4 UDP sockets.

Once the basic functions were implemented, I opened 4 UDP ports on my DSL router and could play movies :)

Then I stumbled across something strange:
  • My code behaved completely predictably with respect to the router configuration. When I closed the ports on the router (or changed the ports in my code), it stopped working.
  • Both ffmpeg and vlc (which, like MPlayer, uses live555 for RTSP) always work in UDP mode, no need to manually open the UDP ports. Somehow they make my router forward the incoming RTP packets to my machine.
So the question was: How?

After spending some time with wireshark and strace, I made sure that I set up my sockets the same way as the other applications and that the RTSP requests were the same. When gmerlin-avdecoder still didn't make it through the NAT (and I was almost freaking out), I decided to take a look at some TCP packets which were marked with the string "TCP segment of a reassembled PDU". I noticed that these occurred only in the wireshark dump of gmerlin-avdecoder, not in the others.

After googling a bit, the mystery was solved:
  • The router (which turned out to be a MIPS-based Linux box) recognizes the RTSP protocol. By parsing the client_port field of the SETUP request, it knows which UDP ports it must open and forward to the client machine.
  • The "TCP segment of a reassembled PDU" packets are small pieces belonging to one larger RTSP request.
  • If the SETUP line is not in the same TCP packet as the line which defines the transport, the recognition by the router will fail.
  • Wireshark fooled me by assembling the packets belonging to the same request into a larger one and displaying it together with the pieces (this feature can be turned off in the wireshark TCP configuration).
  • The fix was simple: I write the whole request into one string, and send this string at once. Finally the router automagically sends the RTP packets to gmerlin-avdecoder.
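The fix from the last point can be sketched roughly like this. It is not the actual gmerlin-avdecoder code: the function names build_setup_request() and send_all(), the buffer handling and the header values are assumptions made up for illustration. The idea is only that the request line and the Transport header are formatted into one buffer, which is then passed to the kernel in a single send() call.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Format a complete RTSP SETUP request into buf.
 * Returns the request length, or -1 if buf is too small.
 * Function name and header values are illustrative assumptions. */
int build_setup_request(char *buf, size_t size, const char *url,
                        int cseq, int rtp_port)
{
    int len = snprintf(buf, size,
                       "SETUP %s RTSP/1.0\r\n"
                       "CSeq: %d\r\n"
                       "Transport: RTP/AVP;unicast;client_port=%d-%d\r\n"
                       "\r\n",
                       url, cseq, rtp_port, rtp_port + 1);
    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}

/* Pass the whole request to the kernel at once instead of writing it
 * line by line. The loop only repeats if send() accepts less than the
 * full buffer, which is not expected for such a small request. */
int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A single send() is no hard guarantee for a single TCP packet, but for a request that is much smaller than the segment size this is what normally happens.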
What did I learn through this process? Most notably that TCP is stream-based only for the client and the server. Any hardware between these two only sees packets. Applications relying on intelligent network hardware must indeed take care which data end up in which packet.

You might think that it's actually no problem to open UDP ports on the router, and that doing such things manually is better than automatically. But then you'll have many people running their clients with the same UDP ports, which makes attacks easier in case gmerlin-avdecoder has a security hole. It is much better to choose the ports randomly. Then we can also have multiple clients on the same machine. The live555 library uses random ports, ffmpeg doesn't.
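Choosing the ports randomly could look roughly like the following sketch. It is not taken from gmerlin-avdecoder or live555: the names bind_udp() and open_rtp_pair(), the port range and the retry count are assumptions for illustration. It follows the usual RTP convention of an even port for the data and the next odd port for the QoS (RTCP) socket, so one stream needs one such pair.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create a UDP socket bound to the given port on all interfaces.
 * Returns the descriptor, or -1 on failure. */
static int bind_udp(int port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons((unsigned short)port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Pick a random even port for RTP and bind the next (odd) port for RTCP.
 * Returns the RTP port, or -1 if no free pair was found.
 * The caller is expected to seed rand() beforehand. */
int open_rtp_pair(int *rtp_fd, int *rtcp_fd)
{
    int i;
    for (i = 0; i < 100; i++) {
        /* Even port in [10000, 59998]; the range is an arbitrary choice */
        int port = 10000 + 2 * (rand() % 25000);
        *rtp_fd = bind_udp(port);
        if (*rtp_fd < 0)
            continue;
        *rtcp_fd = bind_udp(port + 1);
        if (*rtcp_fd >= 0)
            return port;
        close(*rtp_fd); /* odd port was taken, try another pair */
    }
    return -1;
}
```

The returned port and port + 1 are then what would go into the client_port field of the SETUP request.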

Wednesday, October 8, 2008

Globals in libs: If and how

As a follow-up to this post I want to concentrate on the cases where global variables are tolerable, and on how this should be done.

Tolerable as global variables are data whose initialization must be done at runtime and takes a significant amount of time. One example is the libquicktime codec registry. Its creation involves scanning the plugin directory, comparing the contents with a registry and loading all modules (with a time-consuming dlopen) for which the registry entries are missing or outdated. This is certainly not something which should be done per instance (i.e. for each opened file). Other libraries have similar things.

The next question is how they can be implemented. A simple goal is that the library must be linkable with a plugin (i.e. a dynamic module) instead of an executable. This means that repeated loading and unloading (from different threads) must work without any problems. A well-designed plugin architecture knows as little as possible about the plugins, so having global reference counters for each library a plugin might link in is not possible.

Global initialization and cleanup functions


Many libraries have functions like libfoo_init() and libfoo_cleanup(), which are to be called before the first and after the last use of other functions from libfoo respectively. This causes problems for a plugin, which has no idea if this library was already loaded/initialized by another plugin (or by another instance of itself). Also, before a plugin is unloaded there is no way to find out if libfoo_cleanup() can safely be called or if this will crash another plugin. Omitting the libfoo_cleanup() call causes a memory leak if the libfoo_init() function allocated memory. From this we find that the global housekeeping functions are ok if either:

  • Initialization doesn't allocate any resources (i.e. the cleanup function is either a noop or missing) and
  • Initialization is (thread safely) protected against multiple calls

or:

  • Initialization and cleanup functions maintain an internal (thread safe) reference counter, so that only the first init and last cleanup call will actually do something
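A minimal sketch of the second variant, assuming a hypothetical libfoo whose only global resource is one allocated table. The names foo_table and libfoo_is_initialized() are made up for illustration; no real library is meant.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t foo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int foo_refcount = 0;
static char *foo_table = NULL; /* stands in for an expensive global resource */

/* Only the first call allocates. Returns 0 on success, -1 on failure. */
int libfoo_init(void)
{
    int ret = 0;
    pthread_mutex_lock(&foo_mutex);
    if (foo_refcount == 0) {
        foo_table = calloc(1024, 1);
        if (!foo_table)
            ret = -1;
    }
    if (ret == 0)
        foo_refcount++;
    pthread_mutex_unlock(&foo_mutex);
    return ret;
}

/* Only the call matching the first libfoo_init() frees the resource.
 * Extra calls without a matching init are ignored. */
void libfoo_cleanup(void)
{
    pthread_mutex_lock(&foo_mutex);
    if (foo_refcount > 0 && --foo_refcount == 0) {
        free(foo_table);
        foo_table = NULL;
    }
    pthread_mutex_unlock(&foo_mutex);
}

/* Helper for illustration only: non-zero while the resource exists. */
int libfoo_is_initialized(void)
{
    int ret;
    pthread_mutex_lock(&foo_mutex);
    ret = (foo_table != NULL);
    pthread_mutex_unlock(&foo_mutex);
    return ret;
}
```

With this, two plugins can each call libfoo_init() and libfoo_cleanup() in pairs without knowing about each other, and the table is freed only when the last of them is done.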


Initialization on demand, cleanup automatically


This is how the libquicktime codec registry is handled. It meets the above goals but doesn't need any global functions. Initialization on demand means that the codec registry is initialized before it's accessed for the first time. Each function which accesses the registry starts with a call to lqt_registry_init(). The subsequent registry access is enclosed by lqt_registry_lock() and lqt_registry_unlock(). These 3 functions do the whole magic and they look like:


static int registry_init_done = 0;
pthread_mutex_t codecs_mutex = PTHREAD_MUTEX_INITIALIZER;

void lqt_registry_lock()
  {
  pthread_mutex_lock(&codecs_mutex);
  }

void lqt_registry_unlock()
  {
  pthread_mutex_unlock(&codecs_mutex);
  }

void lqt_registry_init()
  {
  /* Variable declarations omitted */
  /* ... */

  lqt_registry_lock();
  if(registry_init_done)
    {
    lqt_registry_unlock();
    return;
    }

  registry_init_done = 1;

  /* Lots of stuff */
  /* ... */

  lqt_registry_unlock();
  }


We see that protection against multiple calls is guaranteed. The protecting mutex itself is initialized from the very beginning (before the main function is called).
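How a function accessing the registry uses these three calls can be sketched with a self-contained toy version. Everything prefixed with toy_ is hypothetical and not part of libquicktime, and the codec names are just placeholders. The toy init function follows the same check-under-lock pattern as above, and the accessor initializes on demand before it reads under the lock.

```c
#include <pthread.h>
#include <string.h>

#define TOY_NUM_CODECS 3

static pthread_mutex_t toy_mutex = PTHREAD_MUTEX_INITIALIZER;
static int toy_init_done = 0;
static int toy_init_count = 0; /* counts how often real work was done */
static const char *toy_codecs[TOY_NUM_CODECS];

static void toy_registry_lock(void)   { pthread_mutex_lock(&toy_mutex); }
static void toy_registry_unlock(void) { pthread_mutex_unlock(&toy_mutex); }

static void toy_registry_init(void)
{
    toy_registry_lock();
    if (!toy_init_done) {
        toy_init_done = 1;
        toy_init_count++;
        /* A real registry would scan the plugin directory here */
        toy_codecs[0] = "raw";
        toy_codecs[1] = "jpeg";
        toy_codecs[2] = "twos";
    }
    toy_registry_unlock();
}

/* Accessor: initialize on demand, then read under the lock.
 * Returns the index of the named codec, or -1 if it is unknown. */
int toy_find_codec(const char *name)
{
    int i, ret = -1;
    toy_registry_init();
    toy_registry_lock();
    for (i = 0; i < TOY_NUM_CODECS; i++) {
        if (!strcmp(toy_codecs[i], name)) {
            ret = i;
            break;
        }
    }
    toy_registry_unlock();
    return ret;
}

/* For illustration only: how many times the registry was really built. */
int toy_registry_init_count(void)
{
    int ret;
    toy_registry_lock();
    ret = toy_init_count;
    toy_registry_unlock();
    return ret;
}
```

No matter how often toy_find_codec() is called, the expensive part runs only once, and the caller never has to call an init function itself.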

While this initialization should work on all POSIX systems, automatic freeing is a bit more tricky and only possible with gcc (I don't know if other compilers have similar features). The best time for freeing global resources is right before the library is unloaded. Most binary formats let you mark functions which should be called before unmapping the library (in ELF files, this is done by putting these into the .fini section). In the source code, this looks like:

#if defined(__GNUC__)

static void __lqt_cleanup_codecinfo() __attribute__ ((destructor));

static void __lqt_cleanup_codecinfo()
  {
  lqt_registry_destroy();
  }

#endif

Fortunately the dlopen() and dlclose() functions maintain reference counts for each module. So the cleanup function is guaranteed to be called by the dlclose() call which unloads the last instance of the last plugin linked to libquicktime.

I regularly check my programs for memory leaks with valgrind. Usually (i.e. after I fixed my own code) all remaining leaks come from libraries, which miss some of the goals described above.

Remove globals from libs

Global variables in libraries are bad, everyone knows that. Maybe I'll make another post explaining when they can be tolerated and how this can be done. But for now we assume that they are simply bad.

One common mistake is to declare static data (like tables) as non-const. This makes them practically variables. If the code never changes them, they cause no problem in terms of thread safety. But unfortunately the dynamic linker doesn't know that, so they will be mapped into r/w pages when the library is loaded. And those pages will, of course, not be shared between applications, so you end up with predictably redundant blocks in your precious RAM.

Cleaning this up is simple: Add const to all declarations where it's missing. But how does one find all these declarations in a larger source tree in a reasonable time? The ELF format is well documented and there are numerous tools to examine ELF files.

Let's take the following C file and pretend it's a library built from hundreds of source files with 100000 lines of code:

struct s
  {
  char * str;
  char ** str_list;
  int i;
  };

struct s static_data_1 =
  {
  "String1",
  (char*[]){ "Str1", "Str2" },
  1,
  };

char * static_string_1 = "String2";

int zeroinit = 0;

Now there are 2 sections in an ELF file, which need special attention: The .data section contains statically initialized variables. The .bss section contains data which is initialized to zero. After compiling the file with gcc -c the sizes of the sections can be obtained with:
# size --format=SysV global.o
global.o  :
section            size   addr
.text                 0      0
.data                56      0
.bss                  4      0
.rodata              26      0
.comment             42      0
.note.GNU-stack       0      0
Total               128
So we have 56 bytes in .data and 4 bytes in .bss. After successful cleanup all these should ideally end up in the .rodata section (read-only statically initialized data). Since we have 100000 lines of code, the next step is to find the variable names (linker symbols) contained in the sections:
# objdump -t global.o

global.o: file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*            0000000000000000 global.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .rodata          0000000000000000 .rodata
0000000000000020 l     O .data            0000000000000010 __compound_literal.0
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000000 g     O .data            0000000000000018 static_data_1
0000000000000030 g     O .data            0000000000000008 static_string_1
0000000000000000 g     O .bss             0000000000000004 zeroinit

Now we know that the variables static_data_1, static_string_1 and zeroinit are affected.

The symbol __compound_literal.0 comes from the expression (char*[]){ "Str1", "Str2" }. The bad news is that compound literals are lvalues according to the C99 standard, so they won't be assumed const by gcc. You can declare them const, but they'll still be in the .data section, at least with gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7). The cleaned up file looks like:

struct s
  {
  const char * str;
  char ** const str_list;
  int i;
  };

static const struct s static_data_1 =
  {
  "String1",
  (char*[]){ "Str1", "Str2" },
  1,
  };

char const * const static_string_1 = "String2";

const int zeroinit = 0;

The resulting symbol table:
0000000000000000 l    df *ABS*            0000000000000000 global1.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .rodata          0000000000000000 .rodata
0000000000000010 l     O .rodata          0000000000000018 static_data_1
0000000000000000 l     O .data            0000000000000010 __compound_literal.0
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000040 g     O .rodata          0000000000000008 static_string_1
0000000000000048 g     O .rodata          0000000000000004 zeroinit

Larger libraries have huge symbol tables, so you will of course filter it with:

grep \\.data | grep -v __compound_literal

So if you want to contribute to a library which needs some cleanup, and you are of the "I know just a little C but I want to help"-type, this is a good idea for a patch :)

2 ssh servers on the same port

A TCP port can only be used by one server process for incoming connections. If another process wants to listen on the same port, it will get an "address already in use" error from the OS. If you know the background it's pretty clear why it must be so.

But imagine a case like the following:
  • You want to make a linux machine reachable via ssh
  • From the same subnet passwords are sufficient
  • From outside only public key authentication is allowed
  • Your users are already happy if they get their ssh clients working on Windows XP. You don't want to bother them (and indirectly yourself as the admin) with nonstandard port numbers.
  • Your sshd doesn't support different configurations depending on the source address.
At first glance, this looks unsolvable. But if you have an iptables firewall (and you will have one if the machine is reachable worldwide) there is a little-known trick called port redirection.

You run 2 ssh servers: The external one (with public key authentication) listens on port 22, the internal one (with passwords) listens e.g. on port 2222. Then you configure iptables such that incoming packets which come from the subnet to port 22 are redirected to port 2222. The corresponding lines in the firewall script look like:


# Our Subnet
SUB_NET="192.168.1.0/24"

# iptables command
IPTABLES=/usr/sbin/iptables

# default policies, flush all tables etc....
...

# ssh from our subnet (redirect to port 2222 and let them through)
$IPTABLES -t nat -A PREROUTING -s $SUB_NET -p tcp --dport 22 \
-j REDIRECT --to-ports 2222
$IPTABLES -A INPUT -p tcp -s $SUB_NET --syn --dport 2222 -j ACCEPT

# ssh from outside
$IPTABLES -A INPUT -p tcp -s ! $SUB_NET --syn --dport 22 -j ACCEPT


I have had this configuration running on 2 machines for many months now, with zero complaints so far.