Discussion:
lxc_monitor exiting, but not cleaning monitor-fifo?
Florian Klink
2014-03-29 22:39:33 UTC
Hi,

when running multiple lxc actions in a row using the command line tools, I
sometimes observe the following state:


- lxc-monitord is not running anymore
- /run/lxc/var/lib/lxc/monitor-fifo still exists, but is "refusing
connection"

In the logs, I then see the following:


lxc-start 1395671045.703 ERROR lxc_monitor - connect : backing off 10
lxc-start 1395671045.713 ERROR lxc_monitor - connect : backing off 50
lxc-start 1395671045.763 ERROR lxc_monitor - connect : backing off 100
lxc-start 1395671045.864 ERROR lxc_monitor - connect : Connection refused


... and the command fails.
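
For readers following along, the retry pattern visible in that log can be sketched roughly as below. This is only an illustration: the delays (10, 50, 100 ms) are read off the log lines above, while the function name `connect_with_backoff`, the use of a path-based AF_UNIX stream socket, and the injectable `sleep` parameter are assumptions made for the sketch, not taken from the lxc source.

```python
import socket
import time

# Delays in milliseconds, as they appear in the log above (assumed, not from lxc source).
BACKOFF_MS = (10, 50, 100)


def connect_with_backoff(path, backoff_ms=BACKOFF_MS, sleep=time.sleep):
    """Try to connect to a unix socket, sleeping between refused attempts.

    Returns a connected socket, or re-raises ConnectionRefusedError once
    every delay has been used up.
    """
    delays = list(backoff_ms)
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except ConnectionRefusedError:
            sock.close()
            if not delays:
                raise  # final attempt failed: "Connection refused"
            sleep(delays.pop(0) / 1000.0)  # "backing off N"
        except OSError:
            sock.close()
            raise
```

With a socket file whose listener has gone away, every attempt is refused, so the caller sees three back-offs followed by the final error, which matches the four log lines.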



A possible workaround would be to check whether the monitor-fifo file
exists while no lxc-monitord process is running, and to remove the fifo
in that case before running the next lxc command, but that's ugly ;-)
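
A minimal sketch of that workaround, assuming a Linux /proc layout. The helper names `process_running` and `remove_stale_fifo` and the `proc_root` parameter are made up for illustration, and the check is inherently racy: the daemon may start or exit between the scan and the unlink.

```python
import os


def process_running(name, proc_root="/proc"):
    """Return True if any process under proc_root reports `name` in its comm file."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # process went away while we were scanning
    return False


def remove_stale_fifo(fifo_path, daemon_name="lxc-monitord", proc_root="/proc"):
    """Remove fifo_path if it exists and no process named daemon_name is found.

    Returns True if the file was removed, False otherwise.
    """
    if os.path.exists(fifo_path) and not process_running(daemon_name, proc_root):
        os.unlink(fifo_path)
        return True
    return False
```

It would be called as `remove_stale_fifo("/run/lxc/var/lib/lxc/monitor-fifo")` before the next lxc command.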


Is this behaviour known? Is there some missing "cleanup code" in
lxc(_monitord) or why is it failing like this?


Florian
Dwight Engen
2014-03-31 18:10:32 UTC
On Sat, 29 Mar 2014 23:39:33 +0100
Post by Florian Klink
Hi,
when running multiple lxc actions in a row using the command line
tools, I sometimes observe the following state:
- lxc-monitord is not running anymore
- /run/lxc/var/lib/lxc/monitor-fifo still exists, but is "refusing
connection"
lxc-start 1395671045.703 ERROR lxc_monitor - connect : backing off 10
lxc-start 1395671045.713 ERROR lxc_monitor - connect : backing off 50
lxc-start 1395671045.763 ERROR lxc_monitor - connect : backing off 100
lxc-start 1395671045.864 ERROR lxc_monitor - connect : Connection refused
... and the command fails.
The only time I've seen this happen is if lxc-monitord is hard killed
so it doesn't have a chance to clean up and remove the socket.
Post by Florian Klink
A possible workaround would be checking for non-running lxc-monitord
process but existing monitor-fifo file then removing the fifo if it
exists before running the next lxc command, but thats ugly ;-)
Is there a good non-racy way to do this? I guess monitord could write
its pid in $LXCPATH and we could kill(pid, 0) it.
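
To make the kill(pid, 0) idea concrete, here is a rough sketch using Python's wrapper around the same system call. The pidfile location and the names `pid_alive` and `monitord_alive` are hypothetical. Note that this probe cannot tell whether the pid has been reused by an unrelated process.

```python
import os


def pid_alive(pid):
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: only existence and permission are checked
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to another user
    return True


def monitord_alive(pidfile):
    """Read a pid from pidfile (hypothetical location) and probe it."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unparsable pidfile: treat as not running
    # Guard against 0 or negative values, which kill() treats as process groups.
    return pid > 0 and pid_alive(pid)
```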
Post by Florian Klink
Is this behaviour known? Is there some missing "cleanup code" in
lxc(_monitord) or why is it failing like this?
Currently it catches SIGILL, SIGSEGV, SIGBUS, and SIGTERM and cleans
up. Other than hard kill I'm not sure what else might cause it to exit
without cleaning up.
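
As an illustration of the clean-up-on-signal pattern described here, a small sketch follows. It is not the lxc-monitord code: `install_cleanup` is a made-up name, and only SIGTERM is handled, since installing Python-level handlers for SIGSEGV or SIGBUS is not advisable. A hard kill (SIGKILL) cannot be caught at all, so no handler of this kind ever runs in that case.

```python
import os
import signal
import sys


def install_cleanup(path, signals=(signal.SIGTERM,)):
    """Arrange for `path` to be removed when one of `signals` arrives."""

    def handler(signum, frame):
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # already gone, nothing to clean up
        sys.exit(128 + signum)  # conventional "killed by signal" exit status

    for sig in signals:
        signal.signal(sig, handler)
```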
_______________________________________________
lxc-users mailing list
lxc-users at lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users
Florian Klink
2014-03-31 18:34:15 UTC
Post by Dwight Engen
The only time I've seen this happen is if lxc-monitord is hard killed
so it doesn't have a chance to clean up and remove the socket.
Here, it's happening quite frequently. However, the script never kills
lxc-monitord on its own; it just tries to detect and fix this state by
removing the socket file...
Post by Dwight Engen
Currently it catches SIGILL, SIGSEGV, SIGBUS, and SIGTERM and cleans
up. Other than hard kill I'm not sure what else might cause it to exit
without cleaning up.
I shut down containers with `lxc-stop -n container-name`
(lxc.stopsignal=30 (SIGPWR)); however, this signal should never go to
lxc_monitord, right?
Dwight Engen
2014-03-31 19:13:44 UTC
On Mon, 31 Mar 2014 20:34:15 +0200
Post by Florian Klink
Here, it's happening quite frequently. However, the script never kills
lxc-monitord on its own, it just tries to detect and fix this state by
removing the socket file...
Right, removing the socket file makes it so another lxc-monitord will
start, but the question is why is the first one exiting without
cleaning up? Can you reliably reproduce it at will? If so then maybe
you could attach an strace to lxc-monitord and see why it is exiting.
Post by Florian Klink
I shutdown containers with `lxc-stop -n container-name`
(lxc.stopsignal=30 (SIGPWR)), however this signal should never go to
lxc_monitord, right?
Right, that goes to the init process of the container.
Florian Klink
2014-03-31 21:18:13 UTC
Post by Dwight Engen
Right, removing the socket file makes it so another lxc-monitord will
start, but the question is why is the first one exiting without
cleaning up? Can you reliably reproduce it at will? If so then maybe
you could attach an strace to lxc-monitord and see why it is exiting.
So far I have not been able to reproduce the bug while strace is
running. :-( But I'll keep trying!
Post by Dwight Engen
Is there a good non-racy way to do this? I guess monitord could
write its pid in $LXCPATH and we could kill(pid, 0) it.
I also think that lxc should be able to recover from this problem
automatically.
Dwight Engen
2014-03-31 23:49:00 UTC
On Mon, 31 Mar 2014 23:18:13 +0200
Post by Florian Klink
I also think that lxc should be able to recover from this problem
automatically.
I agree, though I would like to understand the root cause. Can you try
out the attached patch? I think it will cure your issues.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-make-monitor-monitord-more-resilient-to-unexpected-t.patch
Type: text/x-patch
Size: 8722 bytes
Desc: not available
URL: <http://lists.linuxcontainers.org/pipermail/lxc-users/attachments/20140331/91621633/attachment.bin>
Serge Hallyn
2014-04-01 18:01:36 UTC
As an alternative to doing pidfiles, how about following the way
that lxcapi_create does it with fcntl(fd, F_SETLKW, ...) (see
create_partial() and ongoing_create())?

Then if the monitor exited without being able to clean up, we can
detect it and clean up.
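
A rough sketch of that lock-based liveness check, using Python's `fcntl.lockf` (which is built on the same fcntl record locks). The function names and the lock file path are hypothetical, and this is not how create_partial() or ongoing_create() are actually written. The key property is that the kernel drops the lock when the holding process exits, even on a hard kill, so no cleanup code has to run.

```python
import errno
import fcntl
import os


def hold_lock(lock_path):
    """Daemon side: take an exclusive lock and keep the fd open for the process lifetime."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until acquired, like F_SETLKW
    return fd


def holder_alive(lock_path):
    """Client side: return True if another process currently holds the lock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking probe
        except OSError as e:
            if e.errno in (errno.EACCES, errno.EAGAIN):
                return True  # someone else holds it: the daemon is running
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False  # we got the lock, so no live holder exists
    finally:
        os.close(fd)
```

Because these locks are per process, `holder_alive` has to be called from a different process than the one that called `hold_lock`. Unlike a pidfile probed with kill(pid, 0), this check is not fooled by pid reuse.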
Dwight Engen
2014-04-02 14:59:21 UTC
On Tue, 1 Apr 2014 13:01:36 -0500
Post by Serge Hallyn
As an alternative to doing pidfiles, how about following the way
that lxcapi_create does it with fcntl(fd, F_SETLKW, ...) (see
create_partial() and ongoing_create())?
Then if the monitor exited without being able to clean up, we can
detect it and clean up.
Hey Serge, that's a good idea. I think F_SETLKW could be used instead if
we're worried about the kill(pid, 0) and pid reuse. Is that the case
you're worried about?
Serge Hallyn
2014-04-02 15:23:52 UTC
Post by Dwight Engen
Hey Serge, that's a good idea. I think F_SETLKW could be used instead if
we're worried about the kill(pid, 0) and pid reuse. Is that the case
you're worried about?
And I just don't want all that pidfile code in there if we don't
need it. Is there any advantage to using a pidfile?
Dwight Engen
2014-04-02 17:08:40 UTC
On Wed, 2 Apr 2014 10:23:52 -0500
Post by Serge Hallyn
And I just don't want all that pidfile code in there if we don't
need it. Is there any advantage to using a pidfile?
Nope, I agree; I don't think there is an advantage to a pidfile. I'll
post a patch over on devel with the SETLK approach.
Florian Klink
2014-04-01 20:15:25 UTC
Post by Dwight Engen
Right, removing the socket file makes it so another lxc-monitord
will start, but the question is why is the first one exiting without
cleaning up? Can you reliably reproduce it at will? If so then maybe
you could attach an strace to lxc-monitord and see why it is
exiting.
I was so far not successful in reproducing the bug while having an
strace running. :-( But I'll continue to try!
Success :-) I managed to get an strace while trying to reproduce the
bug. I gzipped and attached it to this mail.

It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
/var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt

I fired off a bunch of lxc-starts and lxc-stops in a row, then stopped my
script and waited for lxc-monitord (and strace too) to stop.

Then I started my script again and had the "leftover monitor-fifo state".
Post by Dwight Engen
I agree, though I would like to understand the root cause. Can you try
out the attached patch? I think it will cure your issues.
Thanks for the patch! Just tell me if you need more information for the
strace above. If not, I'll happily apply the patch :-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace_output.txt.gz
Type: application/gzip
Size: 2863 bytes
Desc: not available
URL: <http://lists.linuxcontainers.org/pipermail/lxc-users/attachments/20140401/2cedaee3/attachment.bin>
Dwight Engen
2014-04-02 14:42:41 UTC
On Tue, 01 Apr 2014 22:15:25 +0200
Post by Florian Klink
Success :-) I managed to get an strace while trying to reproduce the
bug. I gzipped and attached it to this mail.
Its the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
/var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
I fired a bunch of lxc-starts and lxc-stops in row, then stopped my
script and waited for lxc-monitord (and strace too) to stop.
Then I started my script again and had the "leftover monitor-fifo state".
Unfortunately, I don't think that strace shows the problem. It looks to
me like a normal exit with a successful
unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.

You can't really run monitord by hand like that since it is expecting a
pipe fd as argv[2]. That's why I was suggesting attaching to it. So
something like:

lxc-start <your ct>
lxc-monitor -n '.*'

in another terminal:
ps aux |grep monitord -> find the pid of lxc-monitord
strace -v -t -o straceout.txt -p <pid of monitord>

and then do whatever you do to make things fail :)
Post by Florian Klink
Thanks for the patch! Just tell me if you need more information for
the strace above. If not, I'll happily apply the patch :-)
You can try the patch to see if it solves your issue, though I'd still
like to understand why it's happening in the first place. I may rework
the patch based on Serge's suggestion, but it'd be nice to know if the
one I sent does fix what you are seeing. It worked for all the
hard-kill cases I tried.
Florian Klink
2014-04-04 20:22:05 UTC
Post by Dwight Engen
Unfortunately, I don't think that strace shows the problem. It looks to
me like a normal exit with a successful
unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.
You can't really run monitord by hand like that since it is expecting a
pipe fd as argv[2]. That's why I was suggesting attaching to it. So
something like:
lxc-start <your ct>
lxc-monitor -n '.*'
in another terminal:
ps aux |grep monitord -> find the pid of lxc-monitord
strace -v -t -o straceout.txt -p <pid of monitord>
and then do whatever you do to make things fail :)
I was not able to get an strace of the bug. I think it is only
triggered by a lot of lxc-monitord start/stop traffic ;-)
Post by Dwight Engen
You can try the patch to see if it solves your issue, though I'd still
like to understand why its happening in the first place. I may rework
the patch based on Serge's suggestion, but it'd be nice to know if the
one I sent does fix what you are seeing. It worked for all the
hard-kill cases I tried.
Both patches, the pidfile version and the reworked version, fixed my
problem. So I'm very happy with it :-)


Will this patch also go to the stable-1.0 branch?
I'd really like to see this fixed in the 1.0.3 release ;-)

Florian

Dwight Engen
2014-04-08 14:30:02 UTC
On Fri, 04 Apr 2014 22:22:05 +0200
Post by Florian Klink
Both patches, the pidfile version and the reworked version fixed my
problem. So I'm very happy with it :-)
Will this patch also go to the stable-1.0 branch?
I'd really like to see this fixed in the 1.0.3 release ;-)
Looks like Stéphane did pull it onto stable so you should be good.
Thanks for trying to debug/strace it. I still don't know why this is
happening in the first place but at least this should work around the
problem when it does happen.