How are Jenkins builds killed exactly?

classic Classic list List threaded Threaded
7 messages Options
sberg sberg
Reply | Threaded
Open this post in threaded view
|

How are Jenkins builds killed exactly?

Still trying to track down why sometimes zombie processes survive on the
(Linux) Jenkins build machines (and then make later, unrelated Jenkins
builds on those machines fail when zombie soffice.bin processes still
hold onto named pipes that tests from the new builds want to create too).

One such recent case on tb79 was the aborted
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>.  It
left behind a zombie python.bin -> oosplash -> soffice.bin process tree
executing UITest_calc_tests3.  (Where presumably the soffice.bin process
had deadlocked, which then caused the Jenkins

> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED

reaction.  But once I noticed, the images of the involved processes had
already been overwritten by later builds, so I couldn't use gdb to get
backtraces.)

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
shows that some entity runs lode's tb_slave_wrapper as (the main) part
of the build, see

> [linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
> + /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean

That tb_slave_wrapper script contains

> trap cleanup 1 2 3 6 15
>
> cleanup()
> {
>   echo "Caught Signal ... killing everything...."
>   # kill everything in same process group (pseudo-pid 0)
>   kill -9 0
> }

intended to kill all processes if the script itself receives any of
SIGHUP/-INT/-QUIT/-ABRT/-TERM.

But how does the tb_slave_wrapper script get terminated by whatever
entity that starts it and prints out the

> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED

mentioned above?  Could it be that the script itself gets killed with
SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly)
spawned from the script may stay alive?

Interestingly, the output from the above

>   echo "Caught Signal ... killing everything...."

doesn't show up anywhere in
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
(supporting the theory that cleanup() doesn't run), while other output
that apparently stems from similar echo/printf commands in that script
does show up there, see

> OS:
> pwd:/home/tdf/lode/jenkins/workspace/lo_gerrit/Config/linux_clang_dbgutil_64
> config mode : linux_clang_dbgutil_64
> Taking configuration values from ./distro-configs/Jenkins/linux_clang_dbgutil_64

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
Noel Grandin-2 Noel Grandin-2
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

Note that over the last week or so, I have been frequently manually killing these zombie processes.

If we can't fix this, I suggest we add:

kill $(ps -o pid= --ppid $$)

to the end of the Jenkins build script
 

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

In reply to this post by sberg
On 29/12/2019 15:17, Stephan Bergmann wrote:
>> [linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
>> + /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean

...where those /tmp/jenkins*.sh contain

>
> ${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean

(i.e., apparently what is specified at "Build - Execute shell - Command"
at
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>),
and the parent of that `/bin/sh -xe /tmp/jenkins3389683698813990355.sh`
is some `java -jar remoting.jar ...`, i.e., apparently a Jenkins slave
process.

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
Christian Lohmaier-6 Christian Lohmaier-6
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

In reply to this post by sberg
Hi Stephan, *,

On Sun, Dec 29, 2019 at 3:17 PM Stephan Bergmann <[hidden email]> wrote:

>
> Still trying to track down why sometimes zombie processes survive on the
> (Linux) Jenkins build machines […]
> […]
> That tb_slave_wrapper script contains
>
> > trap cleanup 1 2 3 6 15
> >
> > cleanup()
> > {
> >   echo "Caught Signal ... killing everything...."
> >   # kill everything in same process group (pseudo-pid 0)
> >   kill -9 0
> > }
>
> intended to kill all processes if the script itself receives any of
> SIGHUP/-INT/-QUIT/-ABRT/-TERM.
>
> But how does the tb_slave_wrapper script get terminated by whatever
> entity that starts it and prints out the
>
> > Build timed out (after 15 minutes). Marking the build as aborted.
> > Build was aborted
> > Finished: ABORTED
>
> mentioned above?

Regular abortion should be using
https://wiki.jenkins.io/display/JENKINS/ProcessTreeKiller

it looks for processes with the BUILD_ID environment variable that the
scripts inherit when running under jenkins.
that tries a term and kill - although our handler from the scirpt then
should take over.
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L776

>  Could it be that the script itself gets killed with
> SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly)
> spawned from the script may stay alive?

Timeout based abortion is done via
https://github.com/jenkinsci/build-timeout-plugin and that triggers
interrupt for the thread.

ciao
Christian

ciao
Christian


--
Christian Lohmaier, Release Engineer
Tel: +49 30 5557992-60 | IRC: cloph on Freenode
The Document Foundation, Kurfürstendamm 188, 10707 Berlin, DE
Gemeinnützige rechtsfähige Stiftung des bürgerlichen Rechts
Legal details: http://www.documentfoundation.org/imprint
_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

In reply to this post by sberg
On 29/12/2019 15:17, Stephan Bergmann wrote:

> Still trying to track down why sometimes zombie processes survive on the
> (Linux) Jenkins build machines (and then make later, unrelated Jenkins
> builds on those machines fail when zombie soffice.bin processes still
> hold onto named pipes that tests from the new builds want to create too).
>
> One such recent case on tb79 was the aborted
> <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>.  It
> left behind a zombie python.bin -> oosplash -> soffice.bin process tree
> executing UITest_calc_tests3.  (Where presumably the soffice.bin process
> had deadlocked, which then caused the Jenkins
>
>> Build timed out (after 15 minutes). Marking the build as aborted.
>> Build was aborted
>> Finished: ABORTED
>
> reaction.  But once I noticed, the images of the involved processes had
> already been overwritten by later builds, so I couldn't use gdb to get
> backtraces.)

I think I now understand what's going on:  Assume some UITest hangs with
a deadlock in soffice.bin.  <https://plugins.jenkins.io/build-timeout>
will kick in after the specified timeout of 900s to kill the build.

The relevant process tree is

> java───sh───tb-slave-wrapper─┬─make───make───sh───sh───python.bin───oosplash───soffice.bin
>                              └─tee

where:

(1) java is running Jenkins' remoting.jar.

(2) The following sh is Jenkins' way of running

> ${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean

(as specified in "Build - Execute shell - Command" at
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>)
via some `/bin/sh -xe /tmp/jenkins*.sh` intermediary.

(3) tb-slave-wrapper is running the
<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/af02c3a9564062b4d04e457275624a7a30ba2ba2/bin/tb_slave_wrapper>
script, which at line 325 does

>         make ${keep_going} $target 2>&1 | tee -a ${build_log}

which explains...

(4) ...the first make and...

(5) the tee.

(6) The second make is due to gbuild calling make recursively.

(7) The following sh is running the UITest target's (heavily redacted)
recipe line

> /bin/sh -c 'S=... && I=... && W=.. && rm -rf $W/UITest/... && mkdir -p $W/UITest/... && ... && ( TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ... || ( RET=$?; $S/solenv/bin/gdb-core-bt.sh ...))'

and...

(8) ...the following sh is running the

> TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ...

part in a subshell.

(9) python.bin is running LO's uitest/test_main.py, which...

(10) ...spawns the soffice script which then execs oosplash, which...

(11) ...spawns soffice.bin.

<https://ci.libreoffice.org/> says at the bottom "Jenkins ver. 2.212",
so lets assume that
<https://github.com/jenkinsci/jenkins/commit/13ab7b9909927b7afa31797097b6114bf22a2a41>
"[maven-release-plugin] prepare release jenkins-2.212" contains the
source code of the remoting.jar running on the slave.

I assume the relevant starting point for the killing of the Jenkins
job's processes is
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/Launcher.java#L953>

>         @Override
>         public void kill(Map<String, String> modelEnvVars) throws InterruptedException {
>             ProcessTree.get().killAll(modelEnvVars);
>         }

which will try to kill all processes it finds that have inherited the
BUILD_ID env var identifying the given Jenkins job.

<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L731>

>         public void killAll(@Nonnull Map<String, String> modelEnvVars) throws InterruptedException {
>             for (OSProcess p : this)
>                 if(p.hasMatchingEnvVars(modelEnvVars))
>                     p.killRecursively();
>         }

first read the /proc file system tree to get a list of process IDs (in
the ProcfsUnix constructor at
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L741>),
then iterates over the list to find processes with matching BUILD_ID and
try to kill them.

This looks racy, in that a to-be-killed build can spawn further
processes after /proc has been read to produce the list.  But that
should not be relevant in our scenario of killing a hung build after a
timeout, as that build will no longer be spawning new processes
(everything but the deadlocked soffice.bin and its parent chain has
already terminated).

However, what appears to be relevant here is that processes are killed
in a random order (see below).

The actual process killing appears to take place at
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L909>

>         public static void destroy(int pid) throws IllegalAccessException,
>                 InvocationTargetException {
>             if (JAVA8_DESTROY_PROCESS != null) {
>                 JAVA8_DESTROY_PROCESS.invoke(null, pid, false);
>             } else {
>                 final Optional handle = (Optional)JAVA_9_PROCESSHANDLE_OF.invoke(null, pid);
>                 if (handle.isPresent()) {
>                     JAVA_9_PROCESSHANDLE_DESTROY.invoke(handle.get());
>                 }
>             }
>         }
>

For Java <= 8 it will call the internal
java.lang.UNIXProcess.destroyProcess with force=false at
<http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/5b5973c3db08/src/solaris/native/java/lang/UNIXProcess_md.c#l717>

> JNIEXPORT void JNICALL
> Java_java_lang_UNIXProcess_destroyProcess(JNIEnv *env,
>                                           jobject junk,
>                                           jint pid,
>                                           jboolean force)
> {
>     int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
>     kill(pid, sig);
> }

which thus sends a SIGTERM.  Similarly for Java >= 9 it will call the
internal java.lang.ProcessHandle.destroy at
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/classes/java/lang/ProcessHandle.java#l324>
(rather than the companion java.lang.ProcessHandle.destroyForcibly),
which ends up at
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libjava/ProcessHandleImpl_unix.c#l312>

> JNIEXPORT jboolean JNICALL
> Java_java_lang_ProcessHandleImpl_destroy0(JNIEnv *env,
>                                           jobject obj,
>                                           jlong jpid,
>                                           jlong startTime,
>                                           jboolean force) {
>     pid_t pid = (pid_t) jpid;
>     int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
>     jlong start = Java_java_lang_ProcessHandleImpl_isAlive0(env, obj, jpid);
>
>     if (start == startTime || start == 0 || startTime == 0) {
>         return (kill(pid, sig) < 0) ? JNI_FALSE : JNI_TRUE;
>     } else {
>         return JNI_FALSE;
>     }
> }

with force=false, which thus also sends a SIGTERM.

To summarize (and if my browsing of the Jenkins source code is correct),
Jenkins will send each process, in effectively random order, a SIGTERM,
and will send no process a SIGKILL.

<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/bea0738dbadfe8784e5d3c00f533acf101db4e7e%5E!/>
"tb_slave_wrapper: trap signal and kill -9 everything"

> trap cleanup 1 2 3 6 15
>
> cleanup()
> {
>   echo "Caught Signal ... killing everything...."
>   # kill everything in same process group (pseudo-pid 0)
>   kill -9 0
> }

ensures that, if tb_slave_wrapper receives a signal asking it to quit
(incl. SIGTERM=15), it will eventually (i.e., once the Bash interpreter
would proceed to a new line in the script) send a SIGKILL to all
processes in the process group.

The processes (9)--(11) (pythion.bin, oosplash, soffice.bin) will not
respond to their SIGTERM, as the former two are stuck waiting on their
respective child, while the latter is otherwise deadlocked.  They need
to rely on the SIGKILL that might get send from (3) tb-slave-wrapper.

But, as processes get the SIGTERM in a random order, if e.g. (8) (the
innermost sh) gets a SIGTERM first, it will terminate and cause its
parent chain (7), (6), (4), (3) to terminate too.  That means that (3)
tb-slave-wrapper can terminate before it would receive a SIGTERM, so it
will not call cleanup.

That should explain those occasional

> python.bin───oosplash───soffice.bin

zombie chains (hanging off processes 1) from timed-out Jenkins builds
that we observe on tb75, tb76, and tb79.

Which in turn means that

On 29/12/2019 15:35, Noel Grandin wrote:
> If we can't fix this, I suggest we add:
>
> kill $(ps -o pid= --ppid $$)
>
> to the end of the Jenkins build script

would not help, as the problematic processes do not have that $$ as
their (immediate) parent.

(And what would happen if the cleanup function in tb_slave_wrapper did
kick in?  I think it would wreck havoc, as the outer Jenkins
remoting.jar process, as well as any parallel tb_slave_wrapper
instances, are in the same process group, and would thus all get
forcibly killed.)

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

On 08/01/2020 16:23, Stephan Bergmann wrote:
> I think I now understand what's going on:

...and hope that <https://gerrit.libreoffice.org/c/lode/+/86522> "Make
sure tb_slave_wrapper kills all spawned processes" will get us towards a fix

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|

Re: How are Jenkins builds killed exactly?

On 10/01/2020 15:57, Stephan Bergmann wrote:
> On 08/01/2020 16:23, Stephan Bergmann wrote:
>> I think I now understand what's going on:
>
> ...and hope that <https://gerrit.libreoffice.org/c/lode/+/86522> "Make
> sure tb_slave_wrapper kills all spawned processes" will get us towards a
> fix

This is in place now for
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/> for a
while, and appears to be effective so far.

Just now I also enabled it for
<https://ci.libreoffice.org/job/lo_tb_master_linux_dbg/>, changing the
build shell command at
<https://ci.libreoffice.org/job/lo_tb_master_linux_dbg/configure> from

> ${LODE_HOME}/bin/tb_slave_wrapper --branch=MASTER --os=Linux --real --name=Jenkins_Linux_Dbg --distro-config="dev_master" --check --docs

to

> ${LODE_HOME}/bin/kill-wrapper "${LODE_HOME}/bin/tb_slave_wrapper --branch=MASTER --os=Linux --real --name=Jenkins_Linux_Dbg --distro-config=dev_master --check --docs"

(That job is restricted to tb75-lilith, so will have a recent enough
lode repo with a known-working bin/kill-wrapper executable checked out.)

Build <https://ci.libreoffice.org/job/lo_tb_master_linux_dbg/28606/> had
been aborted with

> Build timed out (after 45 minutes). Marking the build as aborted.

and the following
<https://ci.libreoffice.org/job/lo_tb_master_linux_dbg/28607/> had
failed UITest_chart in a way that smelled like a leftover soffice.bin
again, and indeed there was a matching zombie on tb75 (which I killed
manually).

_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice