Discussion:
radiusd: Error detected by libpthread: Invalid mutex.
Edgar Fuß
2013-04-26 15:59:00 UTC
Permalink
I just updated from pkgsrc-2012Q4 to -2013Q1, and now FreeRADIUS fails with

radiusd: Error detected by libpthread: Invalid mutex.

There was no update to freeradius2 from 2012Q4 to 2013Q1, and indeed reverting
to the old package doesn't help.
So the issue is probably with one of the packages it depends on, i.e. perl,
gdbm, openldap-client, net-snmp or libltdl. I haven't tried to revert these yet.

It does help to start radiud single-threaded (-s), though.

Any ideas how to debug this? Does it ring a bell to someone?
Christos Zoulas
2013-04-26 18:55:31 UTC
Permalink
Post by Edgar Fuß
I just updated from pkgsrc-2012Q4 to -2013Q1, and now FreeRADIUS fails with
radiusd: Error detected by libpthread: Invalid mutex.
There was no update to freeradius2 from 2012Q4 to 2013Q1, and indeed reverting
to the old package doesn't help.
So the issue is probably with one of the packages it depends on, i.e. perl,
gdbm, openldap-client, net-snmp or libltdl. I haven't tried to revert these yet.
It does help to start radiud single-threaded (-s), though.
Any ideas how to debug this? Does it ring a bell to someone?
It is prolly a radiusd bug:

$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
(gdb) where

christos
Edgar Fuß
2013-04-30 12:44:31 UTC
Permalink
Maybe yes. See below.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
Unfortunately, this (with allow_core_dumps=yes in radiusd.conf) doesn't give
me a core dump.

The problem seems to be that radiusd sets up mutexes (pthread_mutex_init)
while reading the config file. It reads the config file before fork()ing
in order to daemonize. In fact, running it with -f and then manually
backgrounding it makes the problem disappear.
Also, moving the config-file reading code below the fork() also helps.

Before reporting a bug to the FreeRADIUS people: It looks weird to me to
create a mutex, then fork and use the mutex in the child process. However,
I'm unsure whether this is really illegal (I can't find anything saying so
besides the sheer existance of pthread_atfork(), which radiusd doesn't call)
or a problem with NetBSD's implementation.

Any pthread experts to help me?
Christos Zoulas
2013-05-01 00:29:54 UTC
Permalink
Post by Edgar Fuß
Maybe yes. See below.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
Unfortunately, this (with allow_core_dumps=yes in radiusd.conf) doesn't give
me a core dump.
The problem seems to be that radiusd sets up mutexes (pthread_mutex_init)
while reading the config file. It reads the config file before fork()ing
in order to daemonize. In fact, running it with -f and then manually
backgrounding it makes the problem disappear.
Also, moving the config-file reading code below the fork() also helps.
Before reporting a bug to the FreeRADIUS people: It looks weird to me to
create a mutex, then fork and use the mutex in the child process. However,
I'm unsure whether this is really illegal (I can't find anything saying so
besides the sheer existance of pthread_atfork(), which radiusd doesn't call)
or a problem with NetBSD's implementation.
Any pthread experts to help me?
It is implementation dependend; for portability pthread_mutex_init() should
be done in the child process and not cross a fork. Eg:

http://www.sbin.org/doc/glibc/libc_34.html
http://stackoverflow.com/questions/2620313/how-to-use-pthread-atfork-and-pthread-once-to-reinitialize-mutexes-in-child

Please file a PR to add a note in our man pages.

christos
Greg Troxel
2013-05-01 11:35:27 UTC
Permalink
Post by Christos Zoulas
Post by Edgar Fuß
Maybe yes. See below.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
Unfortunately, this (with allow_core_dumps=yes in radiusd.conf) doesn't give
me a core dump.
The problem seems to be that radiusd sets up mutexes (pthread_mutex_init)
while reading the config file. It reads the config file before fork()ing
in order to daemonize. In fact, running it with -f and then manually
backgrounding it makes the problem disappear.
Also, moving the config-file reading code below the fork() also helps.
Before reporting a bug to the FreeRADIUS people: It looks weird to me to
create a mutex, then fork and use the mutex in the child process. However,
I'm unsure whether this is really illegal (I can't find anything saying so
besides the sheer existance of pthread_atfork(), which radiusd doesn't call)
or a problem with NetBSD's implementation.
Any pthread experts to help me?
It is implementation dependend; for portability pthread_mutex_init() should
http://www.sbin.org/doc/glibc/libc_34.html
http://stackoverflow.com/questions/2620313/how-to-use-pthread-atfork-and-pthread-once-to-reinitialize-mutexes-in-child
Please file a PR to add a note in our man pages.
Also, in the general case, when a threaded program forks, essentially
nothing can be done in the child - only "async signal safe" calls can be
made (without invoking undefined behavior).

In practice, I would be surprised if a program with only one thread had
any trouble. The basic issue that can't be gotten around is that when
fork happens, if some other thread holds a mutex, that thread, being
missing in the child, cannot release it, and you have a mess.

In FreeBSD, and I think now in NetBSD -current, there is a
pthread_atfork handler called by libc to make sure the internal malloc
locks don't end up in a bad state. THis is necessary becuase there are
so many buggy programs out there.

In the radius case, is the parent process running mulitple threads
before forking? If so, it really should not do that. If not, I am
surprised there is trouble.
Edgar Fuß
2013-05-03 13:52:30 UTC
Permalink
Post by Greg Troxel
In the radius case, is the parent process running mulitple threads
before forking?
No. To be precise, it doesn't hit a breakpoint on pthread_create before hitting
the one on fork.
Post by Greg Troxel
If so, it really should not do that.
Well. apparantly, it doesn't
Post by Greg Troxel
If not, I am surprised there is trouble.
Hm. I'll probably have to ask the FreeRADIUS developers why they fork so late.
Martin Husemann
2013-05-03 14:06:40 UTC
Permalink
Can you reproduce the problem with a small C program?

Martin
Emmanuel Dreyfus
2013-05-04 00:16:45 UTC
Permalink
Post by Christos Zoulas
It is implementation dependend; for portability pthread_mutex_init() should
An important point: it seems current Linux implementation is perfectly
fine with using mutexes that were not initialized, leading Linux
developper to often forget the pthread_mutex_init() call. This is a bug
I encountered a few times when porting glusterfs on NetBSD.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Christos Zoulas
2013-05-04 02:51:02 UTC
Permalink
On May 4, 2:16am, ***@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: radiusd: Error detected by libpthread: Invalid mutex.

| Christos Zoulas <***@astron.com> wrote:
|
| > It is implementation dependend; for portability pthread_mutex_init() should
| > be done in the child process and not cross a fork. Eg:
|
| An important point: it seems current Linux implementation is perfectly
| fine with using mutexes that were not initialized, leading Linux
| developper to often forget the pthread_mutex_init() call. This is a bug
| I encountered a few times when porting glusterfs on NetBSD.

I agree, but how do you suggest that we improve the situation?

christos
Emmanuel Dreyfus
2013-05-04 03:53:45 UTC
Permalink
Post by Christos Zoulas
| An important point: it seems current Linux implementation is perfectly
| fine with using mutexes that were not initialized, leading Linux
| developper to often forget the pthread_mutex_init() call. This is a bug
| I encountered a few times when porting glusterfs on NetBSD.
I agree, but how do you suggest that we improve the situation?
PTHREAD_DIAGASSERT=ale helps because it makes problems reproductible,
but it does not reports everything needed to track the problem: the log
message gives no hint of where the problem is, and for reasons I have
not yet fully understood, core are very often unusable, with gdb being
unable to see the relevant thread.

Attaching the process with gdb before the problem arise almost never
works at mine: at attach time, gdb tells it got confused and suggests
dumping a core of itself.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Edgar Fuß
2013-05-08 16:57:44 UTC
Permalink
And I probably need some help here.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
(gdb) where
I don't get a core (with allow_core_dumps = yes in radiusd.conf).

I tried gdb with follow-fork-mode set to child and a breakpoint on
pthread__errorfunc, but I get

/usr/src/external/gpl3/gdb/dist/gdb/infrun.c:3036: internal-error: handle_inferior_event: Assertion `inf' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

So I'm having a pretty hard time finding out who is calling pthread_mutex_lock()
with a non-initialized mutex.

This is on a pretty recent (few days old) 6 (i.e. 6.1_RC4).
Christos Zoulas
2013-05-08 18:16:05 UTC
Permalink
Post by Edgar Fuß
And I probably need some help here.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
(gdb) where
I don't get a core (with allow_core_dumps = yes in radiusd.conf).
I tried gdb with follow-fork-mode set to child and a breakpoint on
pthread__errorfunc, but I get
handle_inferior_event: Assertion `inf' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
So I'm having a pretty hard time finding out who is calling
pthread_mutex_lock()
with a non-initialized mutex.
This is on a pretty recent (few days old) 6 (i.e. 6.1_RC4).
You can do the stopfork trick with sysctl in the proc tree and then attach
to the child before it dumps.

christos
Christos Zoulas
2013-05-08 20:15:48 UTC
Permalink
Post by Edgar Fuß
And I probably need some help here.
Post by Christos Zoulas
$ PTHREAD_DIAGASSERT=a radiusd
$ gdb radiusd radiusd.core
(gdb) where
I don't get a core (with allow_core_dumps = yes in radiusd.conf).
I tried gdb with follow-fork-mode set to child and a breakpoint on
pthread__errorfunc, but I get
handle_inferior_event: Assertion `inf' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
So I'm having a pretty hard time finding out who is calling
pthread_mutex_lock()
with a non-initialized mutex.
This is on a pretty recent (few days old) 6 (i.e. 6.1_RC4).
Otoh, this seems to say that mutexes work across fork.

#include <sys/wait.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <err.h>
#include <errno.h>

int
main(int argc, char *argv[])
{
pthread_mutex_t mt;
int status;

pthread_mutex_init(&mt, NULL);

switch (fork()) {
case -1:
err(EXIT_FAILURE, "fork");
case 0:
if ((errno = pthread_mutex_lock(&mt)) == 0)
errx(EXIT_FAILURE, "no error from forked mutex");
else
err(EXIT_SUCCESS, "ok, expected error");
default:
if (wait(&status) == -1)
err(EXIT_FAILURE, "wait");
return WEXITSTATUS(status);
}
}
Edgar Fuß
2013-05-08 22:45:29 UTC
Permalink
Post by Christos Zoulas
You can do the stopfork trick with sysctl in the proc tree and then attach
to the child before it dumps.
Can you elaborate how this is done?

I gdb'd radiusd, set a breakpoint on main and let it run.
When it hit the breakpoint, I looked up the debugged radiusd's PID and did
sysctl -w proc.XXXX.stopfork=1
and let gdb continue. The main process exited, but I would expect to find a
child in ps output. Or are these stopped childs hidden from ps? If yes, how
do I learn their PID to attach to them?
Or what am I doing wrong?
Christos Zoulas
2013-05-09 00:20:39 UTC
Permalink
Post by Edgar Fuß
Post by Christos Zoulas
You can do the stopfork trick with sysctl in the proc tree and then attach
to the child before it dumps.
Can you elaborate how this is done?
I gdb'd radiusd, set a breakpoint on main and let it run.
When it hit the breakpoint, I looked up the debugged radiusd's PID and did
sysctl -w proc.XXXX.stopfork=1
and let gdb continue. The main process exited, but I would expect to find a
child in ps output. Or are these stopped childs hidden from ps? If yes, how
do I learn their PID to attach to them?
That sounds right to me. ps -x should show you a child too. Perhaps
add stopexit too to return to the debugger before the parent exits.
Then you should be able to attach to the child via pid.

christos
Edgar Fuß
2013-05-09 19:29:11 UTC
Permalink
Post by Christos Zoulas
That sounds right to me. ps -x should show you a child too.
It doesn't.

Unsing this trivial program:

#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
printf("pid=%d\n", getpid());
puts("sleep");
sleep(15);
puts("fork");
int pid = fork();
if (pid < 0) return 1;
if (pid > 0) puts("parent");
if (pid == 0) puts("child");
return 0;
}

If I set proc.N.stopfork while the program sleeps, it just prints "parent",
i.e. the child doesn't run, and it doesn't show up in ps alx.
Post by Christos Zoulas
Perhaps add stopexit too to return to the debugger before the parent exits.
If I set both stopfork and stopexit, it behaves as expected, i.e. it prints
"parent", the parent shows up as stopped, the child too, and I can attach gdb
to that child.

If, with both stopfork and stopexit set, I kill -CONT the child, it prints
"child" (as expected). If I then kill -CONT the parent, it exits (as expected).

But, if I kill -CONT the parent first, the stopped child vanishes and doesn't
print "child".

This is on 6.1_RC4/amd64. Is there something broken? (At least gdb seems to be
broken since follow-fork-mode child doesn't work.)

I managed to catch radiusd in the act by setting both stopfork and stopexit
of the shell, then exec'ing radiusd and attach gdb to the child. More on the
results in another message.
Edgar Fuß
2013-05-09 19:34:37 UTC
Permalink
Hm, I finally managed to catch FreeRADIUS in the act (by setting both
stopfork and stopexit of teh shell and then exec'ing it, after which I
was finally able to attach to the child.

I set a breakpoint to pthread__errorfunc, and accoring to the backtrace,
the problem is indeed fork(). To be precise, it's somewhere INSIDE fork():

0x00007f7ff568d35a in __fork () from /usr/lib/libc.so.12
(gdb) break pthread__errorfunc
Breakpoint 1 at 0x7f7ff5a0b2d6
(gdb) cont
Continuing.

Breakpoint 1, 0x00007f7ff5a0b2d6 in pthread__errorfunc () from /usr/lib/libpthread.so.1
(gdb) bt
#0 0x00007f7ff5a0b2d6 in pthread__errorfunc () from /usr/lib/libpthread.so.1
#1 0x00007f7ff5a089ae in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7fece2df14 in ldap_ld_free () from /usr/lib/libldap_r.so.4
#3 0x00007f7ff4e05ca2 in ?? () from /usr/lib/nss_ldap.so.0
#4 0x00007f7ff4e076e4 in ?? () from /usr/lib/nss_ldap.so.0
#5 0x00007f7ff568026e in fork () from /usr/lib/libc.so.12
#6 0x000000000041b6a9 in main ()

OOPS?
Joerg Sonnenberger
2013-05-09 20:10:59 UTC
Permalink
Post by Edgar Fuß
I set a breakpoint to pthread__errorfunc, and accoring to the backtrace,
Are you sure this isn't a double free?

Joerg
Edgar Fuß
2013-05-09 20:22:57 UTC
Permalink
Post by Joerg Sonnenberger
Are you sure this isn't a double free?
Eh, no. But the backtrace shows it's happening inside fork(), isn't it?
Christos Zoulas
2013-05-09 20:22:02 UTC
Permalink
Post by Edgar Fuß
Hm, I finally managed to catch FreeRADIUS in the act (by setting both
stopfork and stopexit of teh shell and then exec'ing it, after which I
was finally able to attach to the child.
I set a breakpoint to pthread__errorfunc, and accoring to the backtrace,
0x00007f7ff568d35a in __fork () from /usr/lib/libc.so.12
(gdb) break pthread__errorfunc
Breakpoint 1 at 0x7f7ff5a0b2d6
(gdb) cont
Continuing.
Breakpoint 1, 0x00007f7ff5a0b2d6 in pthread__errorfunc () from
/usr/lib/libpthread.so.1
(gdb) bt
#0 0x00007f7ff5a0b2d6 in pthread__errorfunc () from /usr/lib/libpthread.so.1
#1 0x00007f7ff5a089ae in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7fece2df14 in ldap_ld_free () from /usr/lib/libldap_r.so.4
#3 0x00007f7ff4e05ca2 in ?? () from /usr/lib/nss_ldap.so.0
#4 0x00007f7ff4e076e4 in ?? () from /usr/lib/nss_ldap.so.0
#5 0x00007f7ff568026e in fork () from /usr/lib/libc.so.12
#6 0x000000000041b6a9 in main ()
Could be a double free. Why don't you rebuild your system with MKDEBUG=yes
and then we can better see what's going on?

christos
Edgar Fuß
2013-05-10 18:14:48 UTC
Permalink
Post by Edgar Fuß
#0 0x00007f7ff5a0b2d6 in pthread__errorfunc () from /usr/lib/libpthread.so.1
#1 0x00007f7ff5a089ae in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7fece2df14 in ldap_ld_free () from /usr/lib/libldap_r.so.4
#3 0x00007f7ff4e05ca2 in ?? () from /usr/lib/nss_ldap.so.0
#4 0x00007f7ff4e076e4 in ?? () from /usr/lib/nss_ldap.so.0
#5 0x00007f7ff568026e in fork () from /usr/lib/libc.so.12
#6 0x000000000041b6a9 in main ()
OOPS?
I'm making some progress on this. That ldap_ld_free() is called from a
pthread_atfork (child) handler installed by nss_ldap.

However, I had a hard time setting breakpoints that where not hit despite
the corresponding function showing up in a later backtrace until I realized
that I had TWO copies of LDAP libraries in memory: One libldap pulled in by
nss_ldap and one libldap_r pulled in by FreeRADIUS' rlm_ldap module.

Question one: Is that alone (i.e. using both libldap and libldap_r) asking
for trouble?

Question two: If not, how do I set a breakpoint on a specific incarnation
of ldap_foobar?
Christos Zoulas
2013-05-10 20:39:32 UTC
Permalink
Post by Edgar Fuß
Post by Edgar Fuß
#0 0x00007f7ff5a0b2d6 in pthread__errorfunc () from /usr/lib/libpthread.so.1
#1 0x00007f7ff5a089ae in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7fece2df14 in ldap_ld_free () from /usr/lib/libldap_r.so.4
#3 0x00007f7ff4e05ca2 in ?? () from /usr/lib/nss_ldap.so.0
#4 0x00007f7ff4e076e4 in ?? () from /usr/lib/nss_ldap.so.0
#5 0x00007f7ff568026e in fork () from /usr/lib/libc.so.12
#6 0x000000000041b6a9 in main ()
OOPS?
I'm making some progress on this. That ldap_ld_free() is called from a
pthread_atfork (child) handler installed by nss_ldap.
However, I had a hard time setting breakpoints that where not hit despite
the corresponding function showing up in a later backtrace until I realized
that I had TWO copies of LDAP libraries in memory: One libldap pulled in by
nss_ldap and one libldap_r pulled in by FreeRADIUS' rlm_ldap module.
Question one: Is that alone (i.e. using both libldap and libldap_r) asking
for trouble?
Question two: If not, how do I set a breakpoint on a specific incarnation
of ldap_foobar?
Using 2 copies of the library is asking for trouble. I suggest that the
package should be changed to use the system ldap copy if available like
we do with openssl.

christos
Edgar Fuß
2013-05-10 21:01:51 UTC
Permalink
I suggest that the package should be changed to use the system ldap copy if available
But it does use the system ldap, doesn't it? Only nss uses libldap while rlm_ldap uses libldap_r.
Edgar Fuß
2013-05-13 16:55:37 UTC
Permalink
Post by Christos Zoulas
Using 2 copies of the library is asking for trouble.
Yes, the problem is the following:

radiusd calls initgroups() from the C library.
That, by /etc/nsswitch.conf, dynamically loads nss_ldap.
nss_ldap pulls in libldap (not libldap_r).
Later, radiusd dynamically loads rlm_ldap.
rlm_ldap pulls in libldap_r.
Later, nss_ldap, via PLT, calls ldap_ld_free(). Being linked against libldap,
it expects to call the libldap version of ldap_ld_free().
However, since the dynamic linker, in the meantime, has loaded libldap_r,
it resolves that PLT entry to the libldap_r version of ldap_ld_free().

My impression is that this is fundamentally broken and cannot easily be fixed.
We can link nss_ldap against libldap_r, but then some program may later pull in
libldap, and we get the same mess the other way round.
Edgar Fuß
2013-05-14 10:25:28 UTC
Permalink
Do we have something a la nscd so one can confine the dynamic linking
to just one process?
Edgar Fuß
2013-05-16 14:21:07 UTC
Permalink
Is there a method of forcing resolution of a PLT entry other then calling the
function in question?

As stated earlier in this thread, my problem is that pam_ldap loads libldap,
doesn't resolve ldap_ld_free until libldap_r has been loaded, causing havoc.
Nick Hudson
2013-05-16 15:40:49 UTC
Permalink
Post by Edgar Fuß
Is there a method of forcing resolution of a PLT entry other then calling the
function in question?
As stated earlier in this thread, my problem is that pam_ldap loads libldap,
doesn't resolve ldap_ld_free until libldap_r has been loaded, causing havoc.
Look for LD_BIND_NOW in ld.elf_so(1).

Nick
David Laight
2013-05-20 21:22:54 UTC
Permalink
Post by Edgar Fuß
Is there a method of forcing resolution of a PLT entry other then calling the
function in question?
As stated earlier in this thread, my problem is that pam_ldap loads libldap,
doesn't resolve ldap_ld_free until libldap_r has been loaded, causing havoc.
I wouldn't have thought it caused that much grief - unless someone
is using RTLD_GLOBAL...

David
--
David Laight: ***@l8s.co.uk
Edgar Fuß
2013-05-21 14:26:59 UTC
Permalink
The more I think about it, the more fundamental the problem appears to be.

If a library uses LDAP, i.e. libldap{,_r}, it can't know whether it will be
used in a threaded or non-threaded world. Should it link to libldap or to
libldap_r?

Loading...