Discussion:
Odd pigz behavior on build cluster - libpthread bug?
Thor Lancelot Simon
2014-06-14 13:31:18 UTC
Permalink
26269 builds 0 0 25M 1372K CPU/18 130:47 99.02% 99.02% nbpigz
21425 builds 0 0 25M 1344K CPU/13 130:31 99.02% 99.02% nbpigz
This looks very odd to me. Why has pigz been running so long, and even
if it's doing something, why's it on so few CPUs -- shouldn't each process
have a thread per CPU? It doesn't seem to, the load average is only 4.
OK, here's the deal: it's doing something, but, to me it also looks like
it may be exhibiting pathological libpthread behavior of some kind.

26269 1 nbpigz EMUL "netbsd"
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL ___lwp_park50(0,0xffffffff,0x7f7ff7e12ac0,0)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
[repeat 685,651 times]
26269 2 nbpigz CALL _lwp_unpark(1,0x7f7ff7e12ac0)
26269 2 nbpigz RET _lwp_unpark 0
26269 2 nbpigz CALL ___lwp_park50(0,0,0x7f7ff7b030f8,0x7f7ff7b030f8)
26269 1 nbpigz RET ___lwp_park50 -1 errno 3 No such process
26269 1 nbpigz CALL __sigprocmask14(3,0x7f7fffffc120,0)
26269 1 nbpigz RET __sigprocmask14 0
26269 1 nbpigz CALL getcontext(0x7f7ff6400a88)
26269 1 nbpigz RET getcontext 0
26269 1 nbpigz CALL _lwp_create(0x7f7ff6400a88,0x40,0x7f7ff6400180)
26269 1 nbpigz RET _lwp_create 0
26269 1 nbpigz CALL ___lwp_park50(0,0,0x7f7ff6800030,0x7f7ff6800030)
26269 1 nbpigz RET ___lwp_park50 -1 errno 37 Operation already in progress
26269 1 nbpigz CALL ___lwp_park50(0,0,0x7f7ff6800030,0x7f7ff6800030)
26269 3 nbpigz CALL _lwp_ctl(1,0x7f7ff6400208)
26269 3 nbpigz RET _lwp_ctl 0
26269 3 nbpigz CALL mmap(0,0x1000,3,0x1002,0xffffffff,0,0)
[mmap more stuff]
26269 3 nbpigz CALL _lwp_unpark(2,0x7f7ff7b030f8)
26269 3 nbpigz RET _lwp_unpark 0
26269 2 nbpigz CSW stop kernel
26269 2 nbpigz CSW resume kernel
26269 2 nbpigz RET ___lwp_park50 0
26269 2 nbpigz CALL write(1,0x7f7ff7747000,0x1127)
26269 3 nbpigz CALL ___lwp_park50(0,0,0x7f7ff7b03098,0x7f7ff7b03098)
26269 2 nbpigz GIO fd 1 wrote 4088 bytes
[after the write, there is a short burst of repeated calls to lwp_self
and sigprocmask, then the cycle seems to resume]

WTF? Ideas most welcome.

Thor
Thor Lancelot Simon
2014-06-14 13:38:25 UTC
Permalink
Post by Thor Lancelot Simon
26269 1 nbpigz CALL ___lwp_park50(0,0,0x7f7ff6800030,0x7f7ff6800030)
26269 3 nbpigz CALL _lwp_ctl(1,0x7f7ff6400208)
26269 3 nbpigz RET _lwp_ctl 0
26269 3 nbpigz CALL mmap(0,0x1000,3,0x1002,0xffffffff,0,0)
[mmap more stuff]
26269 3 nbpigz CALL _lwp_unpark(2,0x7f7ff7b030f8)
26269 3 nbpigz RET _lwp_unpark 0
26269 2 nbpigz CSW stop kernel
26269 2 nbpigz CSW resume kernel
26269 2 nbpigz RET ___lwp_park50 0
26269 2 nbpigz CALL write(1,0x7f7ff7747000,0x1127)
26269 3 nbpigz CALL ___lwp_park50(0,0,0x7f7ff7b03098,0x7f7ff7b03098)
26269 2 nbpigz GIO fd 1 wrote 4088 bytes
[after the write, there is a short burst of repeated calls to lwp_self
and sigprocmask, then the cycle seems to resume]
Correction - the cycle did *not* resume. The process did one more -- very
short -- write to stdout, then exited.

Perhaps one thread is spinning in some way waiting for the others to finish
work, and attacking it with ktrace -t A resolved a race of some kind.

The other pigz process in question, which I didn't ktrace, was still
"happily" spinning along. Perhaps not surprisingly, attaching GDB to
try to get a backtrace allowed it to exit as well. "Yay".

The backtrace isn't terribly useful:

(gdb) attach 21425
Attaching to process 21425
Reading symbols from
/home/builds/ab/HEAD/amd64/201406141020Z-tools/bin/nbpigz...(no debugging
symbols found)...done.
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libpthread.so.1...(no debugging symbols
found)...done.
[New LWP 2]
Loaded symbols for /usr/lib/libpthread.so.1
Reading symbols from /usr/lib/libc.so.12...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/lib/libgcc_s.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libgcc_s.so.1
Reading symbols from /usr/libexec/ld.elf_so...(no debugging symbols
found)...done.
Loaded symbols for /usr/libexec/ld.elf_so
[Switching to LWP 1]
0x00007f7ff7c0d818 in strcmp () from /usr/libexec/ld.elf_so
(gdb) where
#0 0x00007f7ff7c0d818 in strcmp () from /usr/libexec/ld.elf_so
#1 0x00007f7ff7c073cf in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#2 0x00007f7ff7c079a9 in _rtld_symlook_default.clone.3 ()
from /usr/libexec/ld.elf_so
#3 0x00007f7ff7c07f23 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#4 0x00007f7ff7c03fbd in _rtld_bind () from /usr/libexec/ld.elf_so
#5 0x00007f7ff7c03c1d in _rtld_bind_start () from /usr/libexec/ld.elf_so
#6 0x0000000000000246 in ?? ()
#7 0x00007f7ff707624a in getcontext () from /usr/lib/libc.so.12
#8 0x00000000003fe000 in ?? ()
#9 0x00007f7ff6802000 in ?? ()
#10 0x0000000000408e92 in twist ()
#11 0x0000000000000000 in ?? ()
(gdb) detach

Thor

Loading...