Bash trap error

I'm trying to create some error reporting using a Trap to call a function on all errors: Trap "_func" ERR Is it possible to get what line the ERR signal was sent from? The shell is bash. If I do...

Is it possible to get what line the ERR signal was sent from?

Yes, the LINENO and BASH_LINENO variables are super useful for getting the line of failure and the lines that lead up to it.
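For example, here's a minimal sketch (reusing the question's _func name) that passes the failing line and the exit status into the handler; the single quotes matter, so ${LINENO} expands when the trap fires rather than when it is set:

#!/usr/bin/env bash

_func() {
    ## $1 = line the ERR trap fired on, $2 = exit status of the failing command
    echo "ERR on line ${1} (exit status ${2})" >&2
}

trap '_func "${LINENO}" "${?}"' ERR

false   # the handler reports this line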

Or maybe I’m going at this all wrong?

Nope, just missing -q option with grep…

echo hello | grep -q "asdf"

… With the -q option grep will return 0 for true and 1 for false. And in Bash it’s trap not Trap

trap "_func" ERR

… I need a native solution…

Here’s a trapper that ya might find useful for debugging things that have a bit more cyclomatic complexity…

failure.sh

## Outputs Front Matter formatted failures for functions not returning 0
## Use the following line after sourcing this file to set failure trap
##    trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR
failure(){
    local -n _lineno="${1:-LINENO}"
    local -n _bash_lineno="${2:-BASH_LINENO}"
    local _last_command="${3:-${BASH_COMMAND}}"
    local _code="${4:-0}"

    ## Workaround for read EOF combo tripping traps
    if ! ((_code)); then
        return "${_code}"
    fi

    local _last_command_height="$(wc -l <<<"${_last_command}")"

    local -a _output_array=()
    _output_array+=(
        '---'
        "lines_history: [${_lineno} ${_bash_lineno[*]}]"
        "function_trace: [${FUNCNAME[*]}]"
        "exit_code: ${_code}"
    )

    if [[ "${#BASH_SOURCE[@]}" -gt '1' ]]; then
        _output_array+=('source_trace:')
        for _item in "${BASH_SOURCE[@]}"; do
            _output_array+=("  - ${_item}")
        done
    else
        _output_array+=("source_trace: [${BASH_SOURCE[*]}]")
    fi

    if [[ "${_last_command_height}" -gt '1' ]]; then
        _output_array+=(
            'last_command: ->'
            "${_last_command}"
        )
    else
        _output_array+=("last_command: ${_last_command}")
    fi

    _output_array+=('---')
    printf '%s\n' "${_output_array[@]}" >&2
    exit ${_code}
}

… and an example usage script for exposing the subtle differences in how to set the above trap for function tracing too…

example_usage.sh

#!/usr/bin/env bash

set -E -o functrace

## Optional, but recommended to find true directory this script resides in
__SOURCE__="${BASH_SOURCE[0]}"
while [[ -h "${__SOURCE__}" ]]; do
    __SOURCE__="$(find "${__SOURCE__}" -type l -ls | sed -n 's@^.* -> (.*)@1@p')"
done
__DIR__="$(cd -P "$(dirname "${__SOURCE__}")" && pwd)"


## Source module code within this script
source "${__DIR__}/modules/trap-failure/failure.sh"

trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR


something_functional() {
    _req_arg_one="${1:?something_functional needs two arguments, missing the first already}"
    _opt_arg_one="${2:-SPAM}"
    _opt_arg_two="${3:-0}"
    printf 'something_functional: %s %s %s' "${_req_arg_one}" "${_opt_arg_one}" "${_opt_arg_two}"
    ## Generate an error by calling nothing
    "${__DIR__}/nothing.sh"
}


## Ignoring errors prevents trap from being triggered
something_functional || echo "Ignored something_functional returning $?"
if [[ "$(something_functional 'Spam!?')" == '0' ]]; then
    printf 'Nothing somehow was something?!\n' >&2 && exit 1
fi


## And generating an error state will cause the trap to _trace_ it
something_functional '' 'spam' 'Jam'

The above were tested on Bash version 4+, so leave a comment if something for versions prior to four is needed, or open an issue if it fails to trap failures on systems with a minimum version of four.

The main takeaways are…

set -E -o functrace
  • -E causes errors within functions to bubble up

  • -o functrace allows for more verbosity when something within a function fails

trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR
  • Single quotes are used around function call and double quotes are around individual arguments

  • References to LINENO and BASH_LINENO are passed instead of the current values, though this might be shortened in later versions of the linked trap, such that the final failing line makes it into the output (a small sketch of this follows the list)

  • Values of BASH_COMMAND and exit status ($?) are passed, first to get the command that returned an error, and second for ensuring that the trap does not trigger on non-error statuses
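Not part of the original answer, but a quick sketch of why that quoting matters: with single quotes the expansions happen when the trap fires, so the failing line and command are reported; a double-quoted trap string would expand ${LINENO} once, at the line where the trap was set.

#!/usr/bin/env bash
set -E -o functrace

## Single quotes delay expansion until the trap actually fires
trap 'printf "ERR at line %s: %s (exit %s)\n" "${LINENO}" "${BASH_COMMAND}" "${?}" >&2' ERR

false   # reported with this line number and the command that failed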

And while others may disagree, I find it's easier to build an output array and use printf for printing each array element on its own line…

printf '%s\n' "${_output_array[@]}" >&2

… also the >&2 bit at the end causes errors to go where they should (standard error), and allows for capturing just errors…

## ... to a file...
some_trapped_script.sh 2>some_trapped_errors.log

## ... or by ignoring standard out...
some_trapped_script.sh 1>/dev/null

As shown by these and other examples on Stack Overflow, there be lots of ways to build a debugging aid using built-in utilities.

Running the following code:

#!/bin/bash

set -o pipefail
set -o errtrace
set -o nounset
set -o errexit

function err_handler ()
{
    local error_code="$?"
    echo "TRAP!"
    echo "error code: $error_code"
    exit
}
trap err_handler ERR

echo "wrong command in if statement"

if xsxsxsxs 
    then
        echo "if result is true"
    else
        echo "if result is false"
fi

echo -e "nwrong command directly"
xsxsxsxs 

exit

produces the following output:

wrong command in if statement
trap.sh: line 21: xsxsxsxs: command not found
if result is false

wrong command directly
trap.sh: line 29: xsxsxsxs: command not found
TRAP!
error code: 127

How can I trap the ‘command not found‘ error inside the if statement too?

asked Oct 27, 2012 at 20:03

Luca Borrione

You can’t trap ERR for the test in the if

From bash man:

The ERR trap is not executed if the failed  command
is  part  of  the  command list immediately following a while or
until keyword, part of the test in an if statement,  part  of  a
command  executed in a && or || list, or if the command's return
value is being inverted via !

But you could change this

if xsxsxsxs 
then ..

to this

xsxsxsxs 
if [[ $? -eq 0 ]]
then ..
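Here is a sketch of the question's script with that restructuring applied (same err_handler, only the if rewritten); the failing command now runs as a plain command, so the ERR trap fires before the test is evaluated:

#!/bin/bash
set -o errexit

function err_handler ()
{
    local error_code="$?"
    echo "TRAP!"
    echo "error code: $error_code"
    exit
}
trap err_handler ERR

echo "wrong command moved out of the if statement"

xsxsxsxs
if [[ $? -eq 0 ]]    # never reached: err_handler exits first
    then
        echo "if result is true"
    else
        echo "if result is false"
fi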

answered Oct 27, 2012 at 21:32

German Garcia

Unluckily, German Garcia is right, so I wrote a workaround.
Any suggestion or improvement is more than welcome, thanks.

Here’s what I did so far:

#!/bin/bash

set -o pipefail
set -o errtrace
set -o nounset
set -o errexit

declare stderr_log="/dev/shm/stderr.log"
exec 2>"${stderr_log}"

function err_handler ()
{
    local error_code=${1:-$?}
    echo "TRAP! ${error_code}"
    echo "exit status: $error_code"

    stderr=$( tail -n 1 "$stderr_log" )
    echo "error message: $stderr"


    echo "" > "${stderr_log}"
    echo "Normally I would exit now but I carry on the demo instead"

    # uncomment the following two lines to exit now.
    # rm "${stderr_log}"
    # exit "${error_code}"
}
trap err_handler ERR

function check ()
{
    local params=( "$@" )

    local result=0

    local statement=''
    for param in "${params[@]}"
    do
        local regex='\s+'
        if [[ ${param} =~ ${regex} ]]
            then
                param=$( echo "${param}" | sed 's/"/\\"/g' )
                param="\"$param\""
        fi
        statement="${statement} $param"
    done

    eval "if $statement; then result=1; fi"

    stderr=$( tail -n 1 "$stderr_log" )

    ($statement); local error_code="$?"

    test -n "$stderr" && err_handler "${error_code}"
    test $result = 1 && [[ $( echo "1" ) ]] || [[ $( echo "" ) ]]
}

echo -e "n1) wrong command in if statement"

if check xsxsxs -d "/etc"
    then
        echo "if returns true"
    else
        echo "if returns false"
fi

echo -e "n2) right command in if statement"

if check test -d "/etc"
    then
        echo "if returns true"
    else
        echo "if returns false"
fi

echo -e "n3) wrong command directly"
xsxsxsxs 

exit

Running the above will produce:

1) wrong command in if statement
TRAP!
error code found: 0
error message: trap.sh: line 52: xsxsxs: command not found 
I would exit now but I carry on instead
if returns false

2) right command in if statement
if returns true

3) wrong command directly
TRAP!
error code found: 127
error message: trap.sh: line 77: xsxsxsxs: command not found
I would exit now but I carry on instead

So the idea is basically to create a method called ‘check’, then to add it before the command to debug in the if statement.
I cannot catch the error code in this way, but it doesn’t matter too much as long as I can get a message.

It would be nice to hear from you about that. Thanks

answered Oct 29, 2012 at 18:04

Luca Borrione

Contents

  • 1 Problem
  • 2 Solutions
    • 2.1 Executed in subshell, exit on error
    • 2.2 Executed in subshell, trap error
  • 3 Caveat 1: `Exit on error’ ignoring subshell exit status
    • 3.1 Solution: Generate error yourself if subshell fails
      • 3.1.1 Example 1
      • 3.1.2 Example 2
  • 4 Caveat 2: `Exit on error’ not exiting subshell on error
    • 4.1 Solution: Use logical operators (&&, ||) within subshell
      • 4.1.1 Example
  • 5 Caveat 3: `Exit on error’ not exiting command substitution on error
    • 5.1 Solution 1: Use logical operators (&&, ||) within command substitution
    • 5.2 Solution 2: Enable posix mode
  • 6 The tools
    • 6.1 Exit on error
      • 6.1.1 Specify `bash -e’ as the shebang interpreter
        • 6.1.1.1 Example
      • 6.1.2 Set ERR trap to exit
        • 6.1.2.1 Example
  • 7 Solutions revisited: Combining the tools
    • 7.1 Executed in subshell, trap on exit
      • 7.1.1 Rationale
    • 7.2 Sourced in current shell
      • 7.2.1 Todo
      • 7.2.2 Rationale
        • 7.2.2.1 `Exit’ trap in sourced script
        • 7.2.2.2 `Break’ trap in sourced script
        • 7.2.2.3 Trap in function in sourced script without `errtrace’
        • 7.2.2.4 Trap in function in sourced script with ‘errtrace’
        • 7.2.2.5 `Break’ trap in function in sourced script with `errtrace’
  • 8 Test
  • 9 See also
  • 10 Journal
    • 10.1 20210114
    • 10.2 20060524
    • 10.3 20060525
  • 11 Comments

Problem

I want to catch errors in a bash script using set -e (or set -o errexit or trap ERR). What are best practices?

Solutions

See #Solutions revisited: Combining the tools for detailed explanations.

If the script is executed in a subshell, it’s relatively easy: you don’t have to worry about backing up and restoring shell options and shell traps, because they’re automatically restored when you exit the subshell.
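For instance (a quick illustration, assuming errexit is off and no traps are set in the parent shell), an errexit setting and an ERR trap set inside `(...)’ disappear together with the subshell:

$ ( set -e; trap 'echo trapped' ERR; false )   # option and trap live only in the subshell
trapped
$ set -o | grep errexit                        # the parent shell option is untouched
errexit         off
$ trap                                         # and no ERR trap is left in the parent
$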

Executed in subshell, exit on error

Example script:

#!/bin/bash -eu
# -e: Exit immediately if a command exits with a non-zero status.
# -u: Treat unset variables as an error when substituting.
 
(false)                   # Caveat 1: If an error occurs in a subshell, it isn't detected
(false) || false          # Solution: If you want to exit, you have to detect the error yourself
(false; true) || false    # Caveat 2: The return status of the ';' separated list is `true'
(false && true) || false  # Solution: If you want to control the last command executed, use `&&'

See also #Caveat 1: `Exit on error’ ignoring subshell exit status

Executed in subshell, trap error

#!/bin/bash -Eu
# -E: ERR trap is inherited by shell functions.
# -u: Treat unset variables as an error when substituting.
# 
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

    #  Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
trap onexit 1 2 3 15 ERR


#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'

function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}


# myscript


    # Always call `onexit' at end of script
onexit

Caveat 1: `Exit on error’ ignoring subshell exit status

The `-e’ setting does not exit if an error occurs within a subshell, for example with these subshell commands: (false) or bash -c false

Example script caveat1.sh:

#!/bin/bash -e
echo begin
(false)
echo end

Executing the script above gives:

$ ./caveat1.sh
begin
end
$

Conclusion: the script didn’t exit after (false).

Solution: Generate error yourself if subshell fails

( SHELL COMMANDS ) || false

In the line above, the exit status of the subshell is checked. The subshell must exit with a zero status, indicating success; otherwise `false’ will run, generating an error in the current shell.

Note that within a bash `list’, with commands separated by a `;’, the return status is the exit status of the last command executed. Use the control operators `&&’ and `||’ if you want to control the last command executed:

$ (false; true) || echo foo
$ (false && true) || echo foo
foo
$

Example 1

Example script example.sh:

#!/bin/bash -e
echo begin
(false) || false
echo end

Executing the script above gives:

$ ./example.sh
begin
$

Conclusion: the script exits after false.

Example 2

Example bash commands:

$ trap 'echo error' ERR       # Set ERR trap
$ false                       # Non-zero exit status will be trapped
error
$ (false)                     # Non-zero exit status within subshell will not be trapped
$ (false) || false            # Solution: generate error yourself if subshell fails
error
$ trap - ERR                  # Reset ERR trap

Caveat 2: `Exit on error’ not exiting subshell on error

The `-e’ setting doesn’t always immediately exit the subshell `(…)’ when an error occurs. It appears the subshell behaves as a simple command and has the same restrictions as `-e’:

Exit immediately if a simple command exits with a non-zero status, unless the subshell is part of the command list immediately following a `while’ or `until’ keyword, part of the test in an `if’ statement, part of the right-hand-side of a `&&’ or `||’ list, or if the command’s return status is being inverted using `!’

Example script caveat2.sh:

#!/bin/bash -e
(false; echo A)                        # Subshell exits after `false'
!(false; echo B)                       # Subshell doesn't exit after `false'
true && (false; echo C)                # Subshell exits after `false'
(false; echo D) && true                # Subshell doesn't exit after `false'
(false; echo E) || false               # Subshell doesn't exit after `false'
if (false; echo F); then true; fi      # Subshell doesn't exit after `false'
while (false; echo G); do break; done  # Subshell doesn't exit after `false'
until (false; echo H); do break; done  # Subshell doesn't exit after `false'

Executing the script above gives:

$ ./caveat2.sh
B
D
E
F
G
H

Solution: Use logical operators (&&, ||) within subshell

Use logical operators `&&’ or `||’ to control execution of commands within a subshell.

Example

#!/bin/bash -e
(false && echo A)
!(false && echo B)
true && (false && echo C)
(false && echo D) && true
(false && echo E) || false
if (false && echo F); then true; fi
while (false && echo G); do break; done
until (false && echo H); do break; done

Executing the script above gives no output:

$ ./example.sh
$

Conclusion: the subshells do not output anything because the `&&’ operator is used instead of the command separator `;’ as in caveat2.sh.

Caveat 3: `Exit on error’ not exiting command substitution on error

The `-e’ setting doesn’t immediately exit command substitution when an error occurs, except when bash is in posix mode:

$ set -e
$ echo $(false; echo A)
A

Solution 1: Use logical operators (&&, ||) within command substitution

$ set -e
$ echo $(false || echo A)

Solution 2: Enable posix mode

When posix mode is enabled via set -o posix, command substitution will exit if `-e’ has been set in the parent shell.

$ set -e
$ set -o posix
$ echo $(false; echo A)

Enabling posix might have other effects though?

The tools

Exit on error

Bash can be told to exit immediately if a command fails. From the bash manual («set -e»):

«Exit immediately if a simple command (see SHELL GRAMMAR above) exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command’s return value is being inverted via !. A trap on ERR, if set, is executed before the shell exits.»

To let bash exit on error, different notations can be used:

  1. Specify `bash -e’ as shebang interpreter
  2. Start shell script with `bash -e’
  3. Use `set -e’ in shell script
  4. Use `set -o errexit’ in shell script
  5. Use `trap exit ERR’ in shell script

Specify `bash -e’ as the shebang interpreter

You can add `-e’ to the shebang line, the first line of your shell script:

#!/bin/bash -e

This will execute the shell script with `-e’ active. Note `-e’ can be overridden by invoking bash explicitly (without `-e’):

$ bash shell_script
Example

Create this shell script example.sh and make it executable with chmod u+x example.sh:

#!/bin/bash -e
echo begin
false     # This should exit bash because `false' returns error
echo end  # This should never be reached

Example run:

$ ./example.sh
begin
$ bash example.sh
begin
end
$

Set ERR trap to exit

By setting an ERR trap you can catch errors as well:

trap command ERR

By setting the command to `exit’, bash exits if an error occurs.

trap exit ERR
Example

Example script example.sh

#!/bin/bash
trap exit ERR
echo begin
false
echo end

Example run:

$ ./example.sh
begin
$

The non-zero exit status of `false’ is caught by the error trap. The error trap exits and `echo end’ is never reached.

Solutions revisited: Combining the tools

Executed in subshell, trap on exit

#!/bin/bash
# --- subshell_trap.sh -------------------------------------------------
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling
 
    # Let shell functions inherit ERR trap.  Same as `set -E'.
set -o errtrace 
    # Trigger error when expanding unset variables.  Same as `set -u'.
set -o nounset
    #  Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
    #  NOTE1: - 9/KILL cannot be trapped.
    #+        - 0/EXIT isn't trapped because:
    #+          - with ERR trap defined, trap would be called twice on error
    #+          - with ERR trap defined, syntax errors exit with status 0, not 2
    #  NOTE2: Setting ERR trap does implicit `set -o errexit' or `set -e'.
trap onexit 1 2 3 15 ERR
 
 
#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'
 
function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}
 
 
 
# myscript
 
 
 
    # Always call `onexit' at end of script
onexit

Rationale

+-------+   +----------+  +--------+  +------+
| shell |   | subshell |  | script |  | trap |
+-------+   +----------+  +--------+  +------+
     :           :            :           :
    +-+         +-+          +-+  error  +-+
    | |         | |          | |-------->| |
    | |  exit   | |          | !         | |
    | |<-----------------------------------+
    +-+          :            :           :
     :           :            :           :

Figure 1. Trap in executed script
When a script is executed from a shell, bash will create a subshell in which the script is run. If a trap catches an error, and the trap says `exit’, this will cause the subshell to exit.

Sourced in current shell

If the script is sourced (included) in the current shell, you have to worry about restoring shell options and shell traps. If they aren’t restored, they might cause problems in other programs which rely on specific settings.

#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $ set +o errtrace         # Make sure errtrace is not set (bash default)
#    $ trap - ERR              # Make sure no ERR trap is set (bash default)
#    $ . listing6.inc.sh       # Source listing6.inc.sh
#    $ foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $ trap                    # Check if ERR trap is reset.
#    $ set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off        # ...restored.
#    $
#
# See: http://fvue.nl/wiki/Bash:_Error_handling
 
function foo_init {
    echo foo_init 
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap 
}
function foo_deinit {
    echo foo_deinit
    trap - ERR                # Reset ERR trap
    eval $fooOldErrtrace      # Restore `errtrace' setting
    unset fooOldErrtrace      # Delete global variable
}
function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop'
        false
        echo This should never be reached because the \`false' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}

Todo

  • an existing ERR trap must be restored and called
  • test if the `trap-loop’ is reached if the script breaks from a nested loop

Rationale

`Exit’ trap in sourced script

When the script is sourced in the current shell, it’s not possible to use `exit’ to terminate the program: This would terminate the current shell as well, as shown in the picture underneath.

+-------+                 +--------+  +------+
| shell |                 | script |  | trap |
+-------+                 +--------+  +------+
    :                         :           :
   +-+                       +-+  error  +-+
   | |                       | |-------->| |
   | |                       | |         | |
   | | exit                  | |         | |
<------------------------------------------+
    :                         :           :

Figure 2. `Exit’ trap in sourced script
When a script is sourced from a shell, bash will run the script in the current shell. If a trap catches an error, and the trap says `exit’, this will cause the current shell to exit.

`Break’ trap in sourced script

A solution is to introduce a main loop in the program, which is terminated by a `break’ statement within the trap.

+-------+    +--------+  +--------+   +------+
| shell |    | script |  | `loop' |   | trap |
+-------+    +--------+  +--------+   +------+
     :           :            :          :  
    +-+         +-+          +-+  error +-+
    | |         | |          | |------->| |
    | |         | |          | |        | |
    | |         | |  break   | |        | |
    | |  return | |<----------------------+
    | |<----------+           :          :
    +-+          :            :          :
     :           :            :          :

Figure 3. `Break’ trap in sourced script
When a script is sourced from a shell, e.g. . ./script, bash will run the script in the current shell. If a trap catches an error, and the trap says `break’, this will cause the `loop’ to break and to return to the script.

For example:

#!/bin/bash
#--- listing3.sh -------------------------------------------------------
# See: http://fvue.nl/wiki/Bash:_Error_handling

trap 'echo trapped; break' ERR;  # Set ERR trap

function foo { echo foo; false; }  # foo() exits with error

    # `trap-loop'
while true; do
    echo Entered \`trap-loop'
    foo
    echo This is never reached
    break
done

echo This is always executed - with or without a trap occurring

trap - ERR  # Reset ERR trap

Listing 3. `Break’ trap in sourced script
When a script is sourced from a shell, e.g. ./script, bash will run the script in the current shell. If a trap catches an error, and the trap says `break’, this will cause the `loop’ to break and to return to the script.

Example output:

$> source listing3.sh
Entered `trap-loop'
foo
trapped
This is always executed after a trap
$>
Trap in function in sourced script without `errtrace’

A problem arises when the trap is reset from within a function of a sourced script. From the bash manual, set -o errtrace or set -E:

If set, any trap on `ERR’ is inherited by shell functions, command
substitutions, and commands executed in a subshell environment.
The `ERR’ trap is normally not inherited in such cases.

So with errtrace not set, a function does not know of any `ERR’ trap set, and thus the function is unable to reset the `ERR’ trap. For example, see listing 4 underneath.

#!/bin/bash
#--- listing4.inc.sh ---------------------------------------------------
# Demonstration of ERR trap not being reset by foo_deinit()
# Example run:
# 
#    $> set +o errtrace     # Make sure errtrace is not set (bash default)
#    $> trap - ERR          # Make sure no ERR trap is set (bash default)
#    $> . listing4.inc.sh   # Source listing4.inc.sh
#    $> foo                 # Run foo()
#    foo_init
#    foo
#    foo_deinit             # This should've reset the ERR trap...
#    $> trap                # but the ERR trap is still there:
#    trap -- 'echo trapped' ERR
#    $> trap

# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init   { echo foo_init 
                      trap 'echo trapped' ERR;} # Set ERR trap 

function foo_deinit { echo foo_deinit
                      trap - ERR             ;} # Reset ERR trap

function foo        { foo_init
                      echo foo
                      foo_deinit             ;}

Listing 4. Trap in function in sourced script
foo_deinit() is unable to unset the ERR trap, because errtrace is not set.

Trap in function in sourced script with ‘errtrace’

The solution is to set -o errtrace. See listing 5 underneath:

#!/bin/bash
#--- listing5.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing5.inc.sh       # Source listing5.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    foo
#    foo_deinit                 # This should reset the ERR trap...
#    $> trap                    # and it is indeed.
#    $> set +o | grep errtrace  # And the `errtrace' setting is restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init   { echo foo_init 
                      fooOldErrtrace=$(set +o | grep errtrace)
                      set -o errtrace
                      trap 'echo trapped' ERR   # Set ERR trap 
                    }
function foo_deinit { echo foo_deinit
                      trap - ERR                # Reset ERR trap
                      eval $fooOldErrtrace      # Restore `errtrace' setting
                      fooOldErrtrace=           # Delete global variable
                    }
function foo        { foo_init
                      echo foo
                      foo_deinit             ;}
`Break’ trap in function in sourced script with `errtrace’

Everything combined in listing 6 underneath:

#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#    $> set +o errtrace         # Make sure errtrace is not set (bash default)
#    $> trap - ERR              # Make sure no ERR trap is set (bash default)
#    $> . listing6.inc.sh       # Source listing6.inc.sh
#    $> foo                     # Run foo()
#    foo_init
#    Entered `trap-loop'
#    trapped
#    This is always executed - with or without a trap occurring
#    foo_deinit
#    $> trap                    # Check if ERR trap is reset.
#    $> set -o | grep errtrace  # Check if the `errtrace' setting is...
#    errtrace        off        # ...restored.
#    $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init 
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR   # Set ERR trap 
}
function foo_deinit {
    echo foo_deinit
    trap - ERR                # Reset ERR trap
    eval $fooOldErrtrace      # Restore `errtrace' setting
    unset fooOldErrtrace      # Delete global variable
}
function foo {
    foo_init
        # `trap-loop'
    while true; do
        echo Entered \`trap-loop'
        false
        echo This should never be reached because the \`false' above is trapped
        break
    done
    echo This is always executed - with or without a trap occurring
    foo_deinit
}

Test

#!/bin/bash

    # Tests

    # An erroneous command should have exit status 127.
    # The erroneous command should be trapped by the ERR trap.
#erroneous_command

    #  A simple command exiting with a non-zero status should have exit status
    #+ <> 0, in this case 1.  The simple command is trapped by the ERR trap.
#false

    # Manually calling 'onexit'
#onexit

    # Manually calling 'onexit' with exit status
#onexit 5

    #  Killing a process via CTRL-C (signal 2/SIGINT) is handled via the SIGINT trap
    #  NOTE: `sleep' cannot be killed via `kill' plus 1/SIGHUP, 2/SIGINT, 3/SIGQUIT
    #+       or 15/SIGTERM.
#echo $$; sleep 20

    #  Killing a process via 1/SIGHUP, 2/SIGQUIT, 3/SIGQUIT or 15/SIGTERM is
    #+ handled via the respective trap.
    #  NOTE: Unfortunately, I haven't found a way to retrieve the signal number from
    #+       within the trap function.
echo $$; while true; do :; done

    # A syntax error is not trapped, but should have exit status 2
#fi

    # An unbound variable is not trapped, but should have exit status 1
    # thanks to 'set -u'
#echo $foo

     # Executing `false' within a function should exit with 1 because of `set -E'
#function foo() {
#    false
#    true
#} # foo()
#foo

echo End of script
   # Always call 'onexit' at end of script
onexit

See also

Bash: Err trap not reset
Solution for trap - ERR not resetting ERR trap.

Journal

20210114

Another caveat: exit (or an error trap) executed within «process substitution» doesn’t end the outer process. The script underneath keeps outputting «loop1»:

#!/bin/bash
# This script outputs "loop1" forever, while I hoped it would exit all while-loops
set -o pipefail
set -Eeu
 
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done < <( exit 1 )
done

The ‘< <()’ notation is called process substitution.

See also:

  • https://mywiki.wooledge.org/ProcessSubstitution
  • https://unix.stackexchange.com/questions/128560/how-do-i-capture-the-exit-code-handle-errors-correctly-when-using-process-subs
  • https://superuser.com/questions/696855/why-doesnt-a-bash-while-loop-exit-when-piping-to-terminated-subcommand

Workaround: Use «Here Strings» ([n]<<<word):

#!/bin/bash
# This script will exit correctly if building up $rows results in an error
 
set -Eeu
 
rows=$(exit 1)
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done <<< "$rows"
done

20060524

#!/bin/bash
#--- traptest.sh --------------------------------------------
# Example script for trapping bash errors.
# NOTE: Why doesn't this script catch syntax errors?

    # Exit on all errors
set -e
    # Trap exit
trap trap_exit_handler EXIT


    # Handle exit trap
function trap_exit_handler() {
        # Backup exit status if you're interested...
    local exit_status=$?
        # Change value of $?
    true
    echo $?
    #echo trap_handler $exit_status
} # trap_exit_handler()


    # An erroneous command will trigger a bash error and, because
    # of 'set -e', will 'exit 127' thus falling into the exit trap.
#erroneous_command
    # The same goes for a command with a false return status
#false

    # A manual exit will also fall into the exit trap
#exit 5

    # A syntax error isn't caught?
fi

    # Disable exit trap
trap - EXIT
exit 0

Normally, a syntax error exits with status 2, but when both ‘set -e’ and ‘trap EXIT’ are defined, my script exits with status 0. How can I have both ‘errexit’ and ‘trap EXIT’ enabled, *and* catch syntax errors
via exit status? Here’s an example script (test.sh):

set -e
trap 'echo trapped: $?' EXIT
fi

$> bash test.sh; echo $?: $?
test.sh: line 3: syntax error near unexpected token `fi'
trapped: 0
$?: 0

More trivia:

  • With the line ‘#set -e’ commented, bash traps 258 and returns an exit status of 2:
trapped: 258
$?: 2
  • With the line ‘#trap ‘echo trapped $?’ EXIT’ commented, bash returns an exit status of 2:
$?: 2
  • With a bogus function definition on top, bash returns an exit status of 2, but no exit trap is executed:
function foo() { foo=bar }
set -e
trap 'echo trapped: $?' EXIT
fi
fred@linux:~>bash test.sh; echo $?: $?
test.sh: line 4: syntax error near unexpected token `fi'
test.sh: line 4: `fi'
$?: 2

20060525

Example of a ‘cleanup’ script

trap

Writing Robust Bash Shell Scripts

#!/bin/bash
#--- cleanup.sh ---------------------------------------------------------------
# Example script for trapping bash errors.
# NOTE: Use 'cleanexit [status]' instead of 'exit [status]'

    # Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
    # @see catch_sig()
trap catch_sig 1 2 3 15
    # Trap errors (simple commands exiting with a non-zero status)
    # @see catch_err()
trap catch_err ERR


#--- cleanexit() --------------------------------------------------------------
#  Wrapper around 'exit' to cleanup on exit.
#  @param $1 integer  Exit status.  If $1 not defined, exit status of global
#+                    variable 'EXIT_STATUS' is used.  If neither $1 or
#+                    'EXIT_STATUS' defined, exit with status 0 (success).
function cleanexit() {
    echo "Exiting with ${1:-${EXIT_STATUS:-0}}"
    exit ${1:-${EXIT_STATUS:-0}}
} # cleanexit()


#--- catch_err() --------------------------------------------------------------
#  Catch ERR trap.
#  This traps simple commands exiting with a non-zero status.
#  See also: info bash | "Shell Builtin Commands" | "The Set Builtin" | "-e"
function catch_err() {
    local exit_status=$?
    echo "Inside catch_err"
    cleanexit $exit_status
} # catch_err()


#--- catch_sig() --------------------------------------------------------------
# Catch signal trap.
# Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
# @NOTE1: Non-trapped signals are 0/EXIT, 9/KILL.
function catch_sig() {
    local exit_status=$?
    echo "Inside catch_sig"
    cleanexit $exit_status
} # catch_sig()


    # An erroneous command should have exit status 127.
    # The erroneous command should be trapped by the ERR trap.
#erroneous_command

    # A command returning false should have exit status <> 0
    # The false returning command should be trapped by the ERR trap.
#false

    # Manually calling 'cleanexit'
#cleanexit

    # Manually calling 'cleanexit' with exit status
#cleanexit 5

    # Killing a process via CTRL-C is handled via the SIGINT trap
#sleep 20

    # A syntax error is not trapped, but should have exit status 2
#fi

    # Always call 'cleanexit' at end of script
cleanexit


Introduction

A shell script can run into problems during its execution, resulting in an error signal that interrupts the script unexpectedly.

Errors occur due to a faulty script design, user actions, or system failures. A script that fails may leave behind temporary files that cause trouble when a user restarts the script.

This tutorial will show you how to use the trap command to ensure your scripts always exit predictably.

Bash trap command explained.

Prerequisites

  • Access to the terminal/command line.
  • A text editor (Nano, Vi/Vim, etc.).

Bash trap Syntax

The syntax for the trap command is:

trap [options] "[arguments]" [signals]

The command has the following components:

  • Options provide added functionality to the command.
  • Arguments are the commands trap executes upon detecting a signal. Unless the command is only one word, it should be enclosed in quotation marks (" "). If the argument contains more than one command, separate them with a semicolon (;), as in the example after this list.
  • Signals are asynchronous notifications sent by the system, usually indicating a user-generated or system-related interruption. Signals can be called by their name or number.
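For example (the file name is only illustrative), two commands in a single quoted argument, separated by a semicolon and bound to two signals:

trap "echo 'Interrupted, cleaning up'; rm -f /tmp/work.$$" SIGINT SIGTERM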

Bash trap Options

The trap command accepts the following options:

  • -p — Prints the trap commands associated with each signal (see the example after this list).
  • -l — Prints a list of all the signals and their numbers.
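For example, a short sketch of -p after setting a trap (the echo command is only illustrative):

trap 'echo "Cleaning up"' EXIT
trap -p EXIT

The second command prints the stored trap: trap -- 'echo "Cleaning up"' EXIT.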

Below is the complete list of the 64 signals and their numbers:

 #  Signal        #  Signal        #  Signal
 1  SIGHUP       23  SIGURG       45  SIGRTMIN+11
 2  SIGINT       24  SIGXCPU      46  SIGRTMIN+12
 3  SIGQUIT      25  SIGXFSZ      47  SIGRTMIN+13
 4  SIGILL       26  SIGVTALRM    48  SIGRTMIN+14
 5  SIGTRAP      27  SIGPROF      49  SIGRTMIN+15
 6  SIGABRT      28  SIGWINCH     50  SIGRTMAX-14
 7  SIGBUS       29  SIGIO        51  SIGRTMAX-13
 8  SIGFPE       30  SIGPWR       52  SIGRTMAX-12
 9  SIGKILL      31  SIGSYS       53  SIGRTMAX-11
10  SIGUSR1      32  SIGWAITING   54  SIGRTMAX-10
11  SIGSEGV      33  SIGLWP       55  SIGRTMAX-9
12  SIGUSR2      34  SIGRTMIN     56  SIGRTMAX-8
13  SIGPIPE      35  SIGRTMIN+1   57  SIGRTMAX-7
14  SIGALRM      36  SIGRTMIN+2   58  SIGRTMAX-6
15  SIGTERM      37  SIGRTMIN+3   59  SIGRTMAX-5
16  SIGSTKFLT    38  SIGRTMIN+4   60  SIGRTMAX-4
17  SIGCHLD      39  SIGRTMIN+5   61  SIGRTMAX-3
18  SIGCONT      40  SIGRTMIN+6   62  SIGRTMAX-2
19  SIGSTOP      41  SIGRTMIN+7   63  SIGRTMAX-1
20  SIGTSTP      42  SIGRTMIN+8   64  SIGRTMAX
21  SIGTTIN      43  SIGRTMIN+9
22  SIGTTOU      44  SIGRTMIN+10

Note: Signals 32 and 33 are not supported in Linux, and the trap -l command does not display them in the output.

The signals most commonly used with the trap command are:

  • SIGHUP (1) — Hangup (the controlling terminal was closed)
  • SIGINT (2) — Interrupt
  • SIGQUIT (3) — Quit
  • SIGABRT (6) — Abort
  • SIGALRM (14) — Alarm clock
  • SIGTERM (15) — Terminate

Note: The SIG prefix in signal names is optional. For example, SIGTERM signal can also be written as TERM.

How to Use trap in Bash

A typical scenario for using the trap command is catching the SIGINT signal. This signal is sent by the system when the user interrupts the execution of the script by pressing Ctrl+C.

The following example script prints the word «Test» every second until the user interrupts it with Ctrl+C. The script then prints a message and quits.

trap "echo The script is terminated; exit" SIGINT

while true
do
    echo Test
    sleep 1
done

The while loop in the example above executes infinitely. The first line of the script contains the trap command and the instructions to wait for the SIGINT signal, then print the message and exit the script.

Executing the trap-test.sh script and terminating it with Ctrl+C.

The trap command is frequently used to clean up temporary files if the script exits due to interruption. The following example defines the cleanup function, which prints a message, removes all the files added to the $TRASH variable, and exits the script.

TRASH=$(mktemp -t tmp.XXXXXXXXXX)

trap cleanup 1 2 3 6

cleanup()
{
  echo "Removing temporary files:"
  rm -rf "$TRASH"
  exit
}

...

The trap in the example above executes the cleanup function when it detects one of the four signals: SIGHUP, SIGINT, SIGQUIT, or SIGABRT. The signals are referred to by their number.

You can also use trap to ensure the user cannot interrupt the script execution. This feature is important when executing sensitive commands whose interruption may permanently damage the system. The syntax for disabling a signal is:

trap "" [signal]

Double quotation marks mean that no command will be executed. For example, to trap the SIGINT and SIGABRT signals, type:

trap "" SIGINT SIGABRT
[a command that must not be interrupted]

If you wish to re-enable the signals at any time during the script, reset the rules by using the dash symbol:

trap - SIGINT SIGABRT
[a command that can be interrupted]

Note: The SIGKILL signal cannot be trapped. It always immediately interrupts the script.

Conclusion

After reading this tutorial, you know how to use the trap command to ensure your bash script always exits properly. If you are interested in more Bash-related topics, read How to Run a Bash Script.

It’s easy to detect when a shell script starts, but it’s not always easy to know when it stops. A script might end normally, just as its author intends it to end, but it could also fail due to an unexpected fatal error. Sometimes it’s beneficial to preserve the remnants of whatever was in progress when a script failed, and other times it’s inconvenient. Either way, detecting the end of a script and reacting to it in some pre-calculated manner is why the Bash trap directive exists.

Responding to failure

Here’s an example of how one failure in a script can lead to future failures. Say you have written a program that creates a temporary directory in /tmp so that it can unarchive and process files before bundling them back together in a different format:

#!/usr/bin/env bash
CWD=`pwd`
TMP=${TMP:-/tmp/tmpdir}

## create tmp dir
mkdir "${TMP}"

## extract files to tmp
tar xf "${1}" --directory "${TMP}"

## move to tmpdir and run commands
pushd "${TMP}"
for IMG in *.jpg; do
  mogrify -verbose -flip -flop "${IMG}"
done
tar --create --file "${1%.*}".tar *.jpg

## move back to origin
popd

## bundle with bzip2
bzip2 --compress "${TMP}"/"${1%.*}".tar 
      --stdout > "${1%.*}".tbz

## clean up
/usr/bin/rm -r /tmp/tmpdir

Most of the time, the script works as expected. However, if you accidentally run it on an archive filled with PNG files instead of the expected JPEG files, it fails halfway through. One failure leads to another, and eventually, the script exits without reaching its final directive to remove the temporary directory. As long as you manually remove the directory, you can recover quickly, but if you aren’t around to do that, then the next time the script runs, it has to deal with an existing temporary directory full of unpredictable leftover files.

One way to combat this is to reverse and double-up on the logic by adding a precautionary removal to the start of the script. While valid, that relies on brute force instead of structure. A more elegant solution is trap.

Catching signals with trap

The trap keyword catches signals that may happen during execution. You’ve used one of these signals if you’ve ever used the kill or killall commands, which call SIGTERM by default. There are many other signals that shells respond to, and you can see most of them with trap -l (as in «list»):

$ trap -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Any of these signals may be anticipated with trap. In addition to these, trap recognizes:

  • EXIT: Occurs when the shell process itself exits
  • ERR: Occurs when a command (such as tar or mkdir) or a built-in command (such as pushd or cd) completes with a non-zero status
  • DEBUG: Occurs before every simple command; handy for tracing what a script is doing (see the sketch after this list)
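A small sketch of the EXIT and DEBUG traps (the messages are only illustrative):

#!/usr/bin/env bash

## DEBUG fires before every simple command; BASH_COMMAND holds the command about to run
trap 'echo "about to run: ${BASH_COMMAND}"' DEBUG

## EXIT fires when the script ends, however it ends
trap 'echo "script is exiting"' EXIT

echo hello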

To set a trap in Bash, use trap followed by a list of commands you want to be executed, followed by a list of signals to trigger it.

For instance, this trap detects a SIGINT, the signal sent when a user presses Ctrl+C while a process is running:

trap "{ echo 'Terminated with Ctrl+C'; }" SIGINT

The example script with temporary directory problems can be fixed with a trap detecting SIGINT, errors, and successful exits:

#!/usr/bin/env bash
CWD=`pwd`
TMP=${TMP:-/tmp/tmpdir}

trap \
 "{ /usr/bin/rm -r "${TMP}" ; exit 255; }" \
 SIGINT SIGTERM ERR EXIT

## create tmp dir
mkdir "${TMP}"
tar xf "${1}" --directory "${TMP}"

## move to tmp and run commands
pushd "${TMP}"
for IMG in *.jpg; do
  mogrify -verbose -flip -flop "${IMG}"
done
tar --create --file "${1%.*}".tar *.jpg

## move back to origin
popd

## zip tar
bzip2 --compress $TMP/"${1%.*}".tar \
      --stdout > "${1%.*}".tbz

For complex actions, you can simplify trap statements with Bash functions.
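For example, here is a sketch of the cleanup above rewritten as a function (assuming the same TMP variable; exit 255 mirrors the inline trap):

#!/usr/bin/env bash
TMP=${TMP:-/tmp/tmpdir}

cleanup() {
  ## Remove the work directory, then leave with the same status the inline trap used
  /usr/bin/rm -r "${TMP}"
  exit 255
}

## Name the function instead of spelling the commands out inline
trap cleanup SIGINT SIGTERM ERR EXIT

mkdir "${TMP}"
## ... extract, mogrify, and re-archive as in the script above ...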

Traps in Bash

Traps are useful to ensure that your scripts end cleanly, whether they run successfully or not. It’s never safe to rely completely on automated garbage collection, so this is a good habit to get into in general. Try using them in your scripts, and see what they can do!

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

In this article, I present a few tricks for handling error conditions. Some strictly don’t fall under the category of error handling (a reactive way to handle the unexpected), and some are techniques for avoiding errors before they happen.

Case study: Simple script that downloads a hardware report from multiple hosts and inserts it into a database.

Say that you have a cron job on each one of your Linux systems, and you have a script to collect the hardware information from each:

#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
dmaf5
)

DATADIR="$HOME/Documents/lshw-dump"

/usr/bin/mkdir -p -v "$DATADIR"
for server in ${servers[*]}; do
    echo "Visiting: $server"
    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
done
wait
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done

If everything goes well, then you collect your files in parallel because you don’t have more than ten systems. You can afford to ssh to all of them at the same time and then show the hardware details of each one.

Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 136.9MB/s   00:00    
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
  "boot": "normal",
  "chassis": "desktop",
  "family": "Default string",
  "sku": "Default string",
  "uuid": "00020003-0004-0005-0006-000700080009"
}

Here are some possibilities of why things went wrong:

  • Your report didn’t run because the server was down
  • You couldn’t create the directory where the files need to be saved
  • The tools you need to run the script are missing
  • You can’t collect the report because your remote machine crashed
  • One or more of the reports is corrupt

The current version of the script has a problem: it will run from beginning to end, errors or not:

./collect_data_from_servers.sh 
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB  48.8MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9

Next, I demonstrate a few things to make your script more robust and, in some cases, recover from failure.

The nuclear option: Failing hard, failing fast

The proper way to handle errors is to check if the program finished successfully or not, using return codes. It sounds obvious, but return codes (an integer stored in the bash $? or $! variable) sometimes have a broader meaning. The bash man page tells you:

For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.

As usual, you should always read the man page of the scripts you’re calling, to see what the conventions are for each of them. If you’ve programmed with a language like Java or Python, then you’re most likely familiar with their exceptions, different meanings, and how not all of them are handled the same way.
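As a quick sketch of the 128+N rule quoted above: a background sleep killed with SIGTERM (signal 15) reports 143.

sleep 30 &
pid=$!
kill -TERM "${pid}"     # terminate the background job with signal 15
wait "${pid}"
echo "exit status: $?"  # prints 143, that is 128 + 15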

If you add set -o errexit to your script, from that point forward it will abort the execution if any command exits with a code != 0. But errexit isn’t applied when executing functions inside an if condition, so instead of remembering that exception, I’d rather do explicit error handling.
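Here is a minimal sketch of that exception: under errexit, a failure inside a function is ignored while the function is being used as an if condition.

#!/bin/bash
set -o errexit

collect() {
    false                      # would abort the script if collect were called normally
    echo "collect kept going"  # still runs when collect is an if condition
}

if collect; then
    echo "collect reported success"
fi
echo "script reached the end"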

Take a look at version two of the script. It’s slightly better:

1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/        ) 
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14 macmini2
15 mac-pro-1-1
16 dmaf5
17 )
18  
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then 
21    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi 
23 declare -A server_pid
24 for server in ${servers[*]}; do
25    echo "Visiting: $server"
26    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
27   server_pid[$server]=$! # Save the PID of the scp  of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in ${!server_pid[*]}; do
33    wait ${server_pid[$server]}
34    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
37    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
38 done

Here’s what changed:

  • Lines 11 and 12: I enable error trace and added a ‘trap’ to tell the user there was an error and there is turbulence ahead. You may want to kill your script here instead; I’ll show you later why that may not be the best idea.
  • Line 20, if the directory doesn’t exist, then try to create it on line 21. If directory creation fails, then exit with an error.
  • On line 27, after running each background job, I capture the PID and associate that with the machine (1:1 relationship).
  • On lines 33-35, I wait for the scp task to finish, get the return code, and if it’s an error, abort.
  • On line 37, I check that the file could be parsed, otherwise, I exit with an error.

So how does the error handling look now?

Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 146.1MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory

As you can see, this version is better at detecting errors but it’s very unforgiving. Also, it doesn’t detect all the errors, does it?

When you get stuck and you wish you had an alarm

The code looks better, except that sometimes the scp could get stuck on a server (while trying to copy a file) because the server is too busy to respond or just in a bad state.

Another example is to try to access a directory through NFS where $HOME is mounted from an NFS server:

/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

And you discover hours later that the NFS mount point is stale and your script is stuck.

A timeout is the solution, and GNU timeout comes to the rescue:

/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

Here you try to kill the process nicely (TERM signal) 10.0 seconds after it has started. If it’s still running after 20.0 seconds, then send a KILL signal (kill -9). If in doubt, check which signals are supported in your system (kill -l, for example).

If this isn’t clear from my dialog, then look at the script for more clarity.

/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real    0m20.003s
user    0m0.000s
sys     0m0.003s

Back to the original script to add a few more options and you have version three:

 1 #!/bin/bash
  2 # Script to collect the status of lshw output from home servers
  3 # Dependencies:
  4 # * Open SSH: http://www.openssh.com/portable.html
  5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
  6 # * JQ: http://stedolan.github.io/jq/
  7 # * timeout: https://www.gnu.org/software/coreutils/
  8 #
  9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
 10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
 11 # Author: Jose Vicente Nunez
 12 #
 13 set -o errtrace # Enable the err trap, code will get called when an error is detected
 14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
 15 
 16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/jq)
 17 for dependency in ${dependencies[@]}; do
 18     if [ ! -x $dependency ]; then
 19         echo "ERROR: Missing $dependency"
 20         exit 100
 21     fi
 22 done
 23 
 24 declare -a servers=(
 25 macmini2
 26 mac-pro-1-1
 27 dmaf5
 28 )
 29 
 30 function remote_copy {
 31     local server=$1
 32     echo "Visiting: $server"
 33     /usr/bin/timeout --kill-after 25.0s 20.0s \
 34         /usr/bin/scp \
 35             -o BatchMode=yes \
 36             -o logLevel=Error \
 37             -o ConnectTimeout=5 \
 38             -o ConnectionAttempts=3 \
 39             ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json
 40     return $?
 41 }
 42 
 43 DATADIR="$HOME/Documents/lshw-dump"
 44 if [ ! -d "$DATADIR" ]; then
 45     /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
 46 fi
 47 declare -A server_pid
 48 for server in ${servers[*]}; do
 49     remote_copy $server &
 50     server_pid[$server]=$! # Save the PID of the scp  of a given server for later
 51 done
 52 # Iterate through all the servers and:
 53 # Wait for the return code of each
 54 # Check the exit code from each scp
 55 for server in ${!server_pid[*]}; do
 56     wait ${server_pid[$server]}
 57     test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
 58 done
 59 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
 60     /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
 61 done

What are the changes?:

  • Between lines 16-22, check if all the required dependency tools are present. If any of them cannot be executed, then ‘Houston, we have a problem.’
  • Created a remote_copy function, which uses a timeout to make sure the scp finishes no later than 45.0s—line 33.
  • Added a connection timeout of 5 seconds instead of the TCP default—line 37.
  • Added a retry to scp on line 38—3 attempts that wait 1 second between each.

There are other ways to retry when there’s an error.

Waiting for the end of the world: how and when to retry

You noticed there’s an added retry to the scp command. But that retries only for failed connections, what if the command fails during the middle of the copy?

Sometimes you just want to fail, because there’s very little chance to recover from the issue (a system that requires hardware fixes, for example). Other times you can fall back to a degraded mode, meaning that your system keeps working without the updated data. In those cases, it makes no sense to wait forever; waiting only for a specific amount of time is the better choice.

Here are the changes to the remote_copy, to keep this brief (version four):

#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3

# Blah blah blah...

function remote_copy {
    local server=$1
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        echo "INFO: Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point in waiting...
        fi
        ((now=now+1))
    done
    return $status
}

DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
    remote_copy $server $MAX_RETRIES &
    server_pid[$server]=$! # Save the PID of the scp  of a given server for later
done

# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
    wait ${server_pid[$server]}
    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done

# Blah blah blah, process the files you just copied...

How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two there’s a retry, after a random wait of between 1 and 60 seconds, before the script eventually gives up:

INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue

If I fail, do I have to do this all over again? Using a checkpoint

Suppose that the remote copy is the most expensive operation of this whole script, and that you’re willing or able to re-run it, maybe using cron or by hand a couple of times a day, to make sure you pick up the files even if one or more systems are down.

You could create a small per-day ‘status cache’, where you record only the machines that were processed successfully. If a system is in there, then don’t bother to check it again for that day.

Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry).

A new version (version five) of the script has code to record the status of the copy (lines 15-33):

15 declare SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
16 declare YYYYMMDD=$(/usr/bin/date +%Y%m%d)|| exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on a daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20   /usr/bin/mkdir -p -v "$CACHE_DIR"|| exit 100
21 fi
22 trap "/bin/rm -rf $CACHE_DIR" INT TERM # KILL cannot be trapped, so use TERM instead
23
24 function check_previous_run {
25  local machine=$1
26  test -f $CACHE_DIR/$machine && return 0|| return 1
27 }
28
29 function mark_previous_run {
30    machine=$1
31    /usr/bin/touch $CACHE_DIR/$machine
32    return $?
33 }

Did you notice the trap on line 22? If the script is interrupted (with Ctrl+C or a TERM signal; a KILL signal cannot be trapped), I want to make sure the whole cache is invalidated.
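
In case the two helpers read a bit terse, here is a hypothetical use of them (server names taken from the rest of the article):

mark_previous_run dmaf5        # touches $CACHE_DIR/dmaf5 to record a successful run
check_previous_run dmaf5 && echo "dmaf5 already done today"
check_previous_run macmini2 || echo "macmini2 still needs a copy"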

And then, add this new helper logic into the remote_copy function (lines 52-81):

52 function remote_copy {
53    local server=$1
54    check_previous_run $server
55    test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56    local retries=$2
57    local now=1
58    status=0
59    while [ $now -le $retries ]; do
60        echo "INFO: Trying to copy file from: $server, attempt=$now"
61        /usr/bin/timeout --kill-after 25.0s 20.0s \
62            /usr/bin/scp \
63                -o BatchMode=yes \
64                -o logLevel=Error \
65                -o ConnectTimeout=5 \
66                -o ConnectionAttempts=3 \
67                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
68        status=$?
69        if [ $status -ne 0 ]; then
70            sleep_time=$(((RANDOM % 60)+ 1))
71            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72            /usr/bin/sleep ${sleep_time}s
73        else
 74            break # All good, no point in waiting...
75        fi
76        ((now=now+1))
77    done
78    test $status -eq 0 && mark_previous_run $server
79    test $? -ne 0 && status=1
80    return $status
81 }

The first time it runs, a new message for the cache directory is printed out:

./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow

If you run it again, the script knows that dmaf5 is good to go, and there’s no need to retry the copy:

./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1

Imagine how much this speeds things up when you have more machines that don’t need to be revisited.

Leaving crumbs behind: What to log, how to log, and verbose output

If you’re like me, you like a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?

If you use logger, you can save the output in the systemd journal for later review with journalctl (and even aggregate it with other tools out there). The best part is that the script gets the power of journalctl right away.

So instead of just calling echo, you can also add a call to logger, using a new Bash function called ‘message’:

SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
FULL_PATH=$(/usr/bin/realpath ${BASH_SOURCE[0]})|| exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"

function message {
    message="$1"
    func_name="${2-unknown}"
    priority=6   # informational, used when no function name is passed
    if [ -z "$2" ]; then
        echo "INFO:" $message
    else
        echo "ERROR:" $message
        priority=0   # highest journal priority, so errors are easy to filter later
    fi
    /usr/bin/logger --journald<<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}
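
A quick usage sketch (hypothetical calls, not copied from the script): the presence of a second argument is what flips the function from INFO to ERROR:

message "Trying to copy file from: dmaf5, attempt=1"     # logged as INFO, PRIORITY=6
message "Copy failed for dmaf5" "remote_copy"            # logged as ERROR, PRIORITY=0, CODE_FUNC=remote_copy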

Notice that you can store separate fields as part of the message, like the priority, the script that produced the message, and so on.

So how is this useful? Well, you could get only the messages logged between 1:26 PM and 1:27 PM, only errors (PRIORITY=0), and only for our script (collect_data_from_servers.v6.sh), with the output in JSON format:

journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
        "_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
        "_AUDIT_LOGINUID" : "1000",
        "SYSLOG_IDENTIFIER" : "logger",
        "PRIORITY" : "0",
        "_TRANSPORT" : "journal",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "__REALTIME_TIMESTAMP" : "1623518797641880",
        "_AUDIT_SESSION" : "3",
        "_GID" : "1000",
        "MESSAGE_ID" : "collect_data_from_servers.v6.sh",
        "MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
        "_CAP_EFFECTIVE" : "0",
        "CODE_FUNC" : "remote_copy",
        "_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
        "_COMM" : "logger",
        "CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
        "_PID" : "41832",
        "__MONOTONIC_TIMESTAMP" : "25928272252",
        "_HOSTNAME" : "dmaf5",
        "_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
        "__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
        "_UID" : "1000"
}

Because this is structured data, other log collectors can go through all your machines, aggregate your script logs, and then you have not just data but also information.
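
For example, since jq is already a dependency of this script, a quick local aggregation could count errors per function (a sketch; the field names match the message function above):

journalctl --output=json MESSAGE_ID=collect_data_from_servers.v6.sh PRIORITY=0 \
    | jq -r '.CODE_FUNC' | sort | uniq -c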

You can take a look at the whole version six of the script.

Don’t be so eager to replace your data until you’ve checked it

As you may have noticed from the very beginning, I’ve been copying a corrupted JSON file over and over:

Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'

That’s easy to prevent. Copy the file into a temporary location, and if the file is corrupted, don’t attempt to replace the previous version (and leave the bad one for inspection); see lines 99-107 of version seven of the script:

function remote_copy {
    local server=$1
    check_previous_run $server
    test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        message "Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json.$$
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." ${FUNCNAME[0]}
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point in waiting...
        fi
        ((now=now+1))
    done
    if [ $status -eq 0 ]; then
        /usr/bin/jq '.' ${DATADIR}/lshw-$server-dump.json.$$ > /dev/null 2>&1
        status=$?
        if [ $status -eq 0 ]; then
            /usr/bin/mv -v -f ${DATADIR}/lshw-$server-dump.json.$$ ${DATADIR}/lshw-$server-dump.json && mark_previous_run $server
            test $? -ne 0 && status=1
        else
            message "${DATADIR}/lshw-$server-dump.json.$$ Is corrupted. Leaving for inspection..." ${FUNCNAME[0]}
        fi
    fi
    return $status
}

Choose the right tools for the task and prep your code from the first line

One very important aspect of error handling is proper coding. If you have bad logic in your code, no amount of error handling will make it better. To keep this short and Bash-related, I’ll give you a few hints below.

You should ALWAYS check your script for syntax errors before running it:

bash -n my_bash_script.sh

Seriously. It should be as automatic as performing any other test.
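
One way to make it automatic is to run the check over every script in your project, for example from a Makefile target or a pre-commit hook (a minimal sketch, assuming your scripts end in .sh):

for script in ./*.sh; do
    bash -n "$script" || { echo "FATAL: Syntax error in $script"; exit 100; }
done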

Read the bash man page and get familiar with must-know options, like:

set -xv   # -v echoes each script line as it is read, -x traces each command as it runs
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv   # turn verbose mode and tracing back off

Use ShellCheck to check your bash scripts

It’s very easy to miss simple issues when your scripts start to grow large. ShellCheck is one of those tools that saves you from making mistakes.
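
If you don’t have it yet, ShellCheck is packaged for most distributions (the package names below are the usual ones; check your own repositories):

sudo dnf install ShellCheck     # Fedora (and RHEL with EPEL enabled)
sudo apt install shellcheck     # Debian and Ubuntu

With it installed, running it against version seven of the script looks like this: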

shellcheck collect_data_from_servers.v7.sh

In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
                  ^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.


In collect_data_from_servers.v7.sh line 16:
    if [ ! -x $dependency ]; then
              ^---------^ SC2086: Double quote to prevent globbing and word splitting.

Did you mean: 
    if [ ! -x "$dependency" ]; then
...

If you’re wondering, the final version of the script, after passing ShellCheck, is here. Squeaky clean.

You noticed something with the background scp processes

You probably noticed that if you kill the script, it leaves some forked processes behind. That isn’t good, and it’s one of the reasons I prefer to use tools like Ansible or Parallel to handle this type of task on multiple hosts, letting those frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation, as sketched below.
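
For instance, one way to clean up after yourself (a sketch that reuses the server_pid array from the script; it is not part of the published versions) is to trap the exit and kill whatever copies are still running:

function cleanup_background_copies {
    for server in "${!server_pid[@]}"; do
        kill "${server_pid[$server]}" 2> /dev/null || true
    done
}
trap cleanup_background_copies EXIT INT TERM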

This Bash script could potentially create a fork bomb: it has no control over how many processes it spawns at the same time, which is a big problem in a real production environment. There is also a limit on how many concurrent ssh sessions you can have (not to mention the bandwidth they consume). Again, I wrote this fictional example in Bash to show you how you can always improve a program to handle errors better. One simple way to throttle the copies is sketched below.
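
A minimal way to throttle the number of simultaneous copies, as a variation on the launching loop from the script (a sketch that assumes Bash 4.3 or newer for wait -n; MAX_JOBS is a made-up knob):

MAX_JOBS=4
running=0
for server in "${servers[@]}"; do
    remote_copy "$server" "$MAX_RETRIES" &
    server_pid[$server]=$!
    running=$(( running + 1 ))
    if [ "$running" -ge "$MAX_JOBS" ]; then
        wait -n                   # block until any one background copy finishes
        running=$(( running - 1 ))
    fi
done
wait                              # finally, wait for whatever is still in flight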

Let’s recap

1.  You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2.  Speaking of transitory conditions, you don’t need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3.  Bash ‘trap’ is your friend. Use it for cleanup and error handling.
4.  When downloading data from any source, assume it’s corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5.  Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6.  You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7.  And finally: Use a Bash lint helper like ShellCheck. You can integrate it with your favorite editor (like Vim or PyCharm). You will be surprised how many errors go undetected in Bash scripts…

If you enjoyed this content or would like to expand on it, contact the team at enable-sysadmin@redhat.com.
