Is it possible to get what line the ERR signal was sent from?
Yes, the LINENO and BASH_LINENO variables are super useful for getting the line of failure and the lines that led up to it.
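As a minimal sketch of that idea (nothing here beyond stock bash), an ERR trap can report both the line and the exit code:

```shell
#!/usr/bin/env bash
# Minimal sketch: report where an ERR trap fired.  Inside the single-quoted
# trap string, ${LINENO} and $? expand when the trap runs, not when it is set.
trap 'echo "error near line ${LINENO}, exit code $?" >&2' ERR

true    # succeeds: no trap
false   # fails: the trap reports roughly this line with exit code 1
```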
Or maybe I’m going at this all wrong?
Nope, just missing the -q option with grep…
echo hello | grep -q "asdf"
… With the -q option, grep will return 0 for true and 1 for false.
And in Bash it’s trap, not Trap…
trap "_func" ERR
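The -q behaviour is easy to check at a prompt; a quick sketch:

```shell
# grep -q prints nothing; only the exit status carries the answer.
if echo hello | grep -q "hello"; then
  echo "match"       # grep exited 0
fi
echo hello | grep -q "asdf" || echo "no match"   # grep exited 1
```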
… I need a native solution…
Here’s a trapper that ya might find useful for debugging things that have a bit more cyclomatic complexity…
failure.sh
## Outputs Front-Matter formatted failures for functions not returning 0
## Use the following line after sourcing this file to set failure trap
## trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR
failure(){
    local -n _lineno="${1:-LINENO}"
    local -n _bash_lineno="${2:-BASH_LINENO}"
    local _last_command="${3:-${BASH_COMMAND}}"
    local _code="${4:-0}"

    ## Workaround for read EOF combo tripping traps
    if ! ((_code)); then
        return "${_code}"
    fi

    local _last_command_height="$(wc -l <<<"${_last_command}")"

    local -a _output_array=()
    _output_array+=(
        '---'
        "lines_history: [${_lineno} ${_bash_lineno[*]}]"
        "function_trace: [${FUNCNAME[*]}]"
        "exit_code: ${_code}"
    )

    if [[ "${#BASH_SOURCE[@]}" -gt '1' ]]; then
        _output_array+=('source_trace:')
        for _item in "${BASH_SOURCE[@]}"; do
            _output_array+=(" - ${_item}")
        done
    else
        _output_array+=("source_trace: [${BASH_SOURCE[*]}]")
    fi

    if [[ "${_last_command_height}" -gt '1' ]]; then
        _output_array+=(
            'last_command: ->'
            "${_last_command}"
        )
    else
        _output_array+=("last_command: ${_last_command}")
    fi

    _output_array+=('---')
    printf '%s\n' "${_output_array[@]}" >&2
    exit "${_code}"
}
… and an example usage script for exposing the subtle differences in how to set the above trap for function tracing too…
example_usage.sh
#!/usr/bin/env bash

set -E -o functrace

## Optional, but recommended to find true directory this script resides in
__SOURCE__="${BASH_SOURCE[0]}"
while [[ -h "${__SOURCE__}" ]]; do
    __SOURCE__="$(find "${__SOURCE__}" -type l -ls | sed -n 's@^.* -> \(.*\)@\1@p')"
done
__DIR__="$(cd -P "$(dirname "${__SOURCE__}")" && pwd)"

## Source module code within this script
source "${__DIR__}/modules/trap-failure/failure.sh"

trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR

something_functional() {
    _req_arg_one="${1:?something_functional needs two arguments, missing the first already}"
    _opt_arg_one="${2:-SPAM}"
    _opt_arg_two="${3:-0}"
    printf 'something_functional: %s %s %s\n' "${_req_arg_one}" "${_opt_arg_one}" "${_opt_arg_two}"
    ## Generate an error by calling nothing
    "${__DIR__}/nothing.sh"
}

## Ignoring errors prevents trap from being triggered
something_functional || echo "Ignored something_functional returning $?"

if [[ "$(something_functional 'Spam!?')" == '0' ]]; then
    printf 'Nothing somehow was something?!\n' >&2 && exit 1
fi

## And generating an error state will cause the trap to _trace_ it
something_functional '' 'spam' 'Jam'
The above were tested on Bash version 4+, so leave a comment if something for versions prior to four is needed, or open an Issue if it fails to trap failures on systems with a minimum version of four.
The main takeaways are…

set -E -o functrace

- -E causes errors within functions to bubble up
- -o functrace allows for more verbosity when something within a function fails

trap 'failure "LINENO" "BASH_LINENO" "${BASH_COMMAND}" "${?}"' ERR

- Single quotes are used around the function call and double quotes around individual arguments
- References to LINENO and BASH_LINENO are passed instead of their current values (though this might be shortened in later versions of the linked trap), so that the final failure line makes it into the output
- Values of BASH_COMMAND and the exit status ($?) are passed: the first to get the command that returned an error, and the second to ensure that the trap does not trigger on non-error statuses
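To illustrate the name-versus-value point, here is a small hypothetical sketch (report() is not part of the linked trap):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a nameref re-resolves the variable wherever it is
# expanded, while a plain value is frozen at the call site.
report() {
  local -n _by_name="$1"   # nameref to the caller-supplied variable name
  local _by_value="$2"     # plain copy, fixed when the caller expanded it
  echo "by name: ${_by_name}, by value: ${_by_value}"
}

report "LINENO" "${LINENO}"   # the two numbers differ
```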
And while others may disagree, I find it’s easier to build an output array and use printf for printing each array element on its own line…

printf '%s\n' "${_output_array[@]}" >&2
… also the >&2 bit at the end causes errors to go where they should (standard error), and allows for capturing just errors…
## ... to a file...
some_trapped_script.sh 2>some_trapped_errors.log
## ... or by ignoring standard out...
some_trapped_script.sh 1>/dev/null
As shown by these and other examples on Stack Overflow, there be lots of ways to build a debugging aid using built in utilities.
Running the following code:
#!/bin/bash
set -o pipefail
set -o errtrace
set -o nounset
set -o errexit
function err_handler ()
{
    local error_code="$?"
    echo "TRAP!"
    echo "error code: $error_code"
    exit
}
trap err_handler ERR
echo "wrong command in if statement"
if xsxsxsxs
then
echo "if result is true"
else
echo "if result is false"
fi
echo -e "\nwrong command directly"
xsxsxsxs
exit
produces the following output:
wrong command in if statement
trap.sh: line 21: xsxsxsxs: command not found
if result is false
wrong command directly
trap.sh: line 29: xsxsxsxs: command not found
TRAP!
error code: 127
How can I trap the ‘command not found‘ error inside the if statement too?
asked Oct 27, 2012 at 20:03
Luca Borrione
You can’t trap ERR for the test in the if
From the bash manual:
The ERR trap is not executed if the failed command
is part of the command list immediately following a while or
until keyword, part of the test in an if statement, part of a
command executed in a && or || list, or if the command's return
value is being inverted via !
But you could change this
if xsxsxsxs
then ..
to this
xsxsxsxs
if [[ $? -eq 0 ]]
then ..
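A runnable sketch of the difference (without errexit, so the script continues after the trap fires):

```shell
#!/usr/bin/env bash
trap 'echo "trapped with status $?"' ERR

if xsxsxsxs 2>/dev/null; then   # inside the if test: ERR trap suppressed
  echo "if result is true"
else
  echo "if result is false"
fi

xsxsxsxs 2>/dev/null            # bare command: trap fires with status 127
```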
answered Oct 27, 2012 at 21:32
Unluckily, German Garcia is right, so I wrote a workaround.
Any suggestion or improvement is more than welcome, thanks.
Here’s what I did so far:
#!/bin/bash
set -o pipefail
set -o errtrace
set -o nounset
set -o errexit
declare stderr_log="/dev/shm/stderr.log"
exec 2>"${stderr_log}"
function err_handler ()
{
    local error_code=${1:-$?}
    echo "TRAP! ${error_code}"
    echo "exit status: $error_code"
    stderr=$( tail -n 1 "$stderr_log" )
    echo "error message: $stderr"
    echo "" > "${stderr_log}"
    echo "Normally I would exit now but I carry on the demo instead"
    # uncomment the following two lines to exit now.
    # rm "${stderr_log}"
    # exit "${error_code}"
}
trap err_handler ERR

function check ()
{
    local params=( "$@" )
    local result=0
    local statement=''
    for param in "${params[@]}"
    do
        local regex='\s+'
        if [[ ${param} =~ ${regex} ]]
        then
            param=$( echo "${param}" | sed 's/"/\\"/g' )
            param="\"$param\""
        fi
        statement="${statement} $param"
    done
    eval "if $statement; then result=1; fi"
    stderr=$( tail -n 1 "$stderr_log" )
    ($statement); local error_code="$?"
    test -n "$stderr" && err_handler "${error_code}"
    test $result = 1 && [[ $( echo "1" ) ]] || [[ $( echo "" ) ]]
}
echo -e "\n1) wrong command in if statement"
if check xsxsxs -d "/etc"
then
echo "if returns true"
else
echo "if returns false"
fi
echo -e "\n2) right command in if statement"
if check test -d "/etc"
then
echo "if returns true"
else
echo "if returns false"
fi
echo -e "\n3) wrong command directly"
xsxsxsxs
exit
Running the above will produce:
1) wrong command in if statement
TRAP!
error code found: 0
error message: trap.sh: line 52: xsxsxs: command not found
I would exit now but I carry on instead
if returns false
2) right command in if statement
if returns true
3) wrong command directly
TRAP!
error code found: 127
error message: trap.sh: line 77: xsxsxsxs: command not found
I would exit now but I carry on instead
So the idea is basically to create a method called ‘check’, then to add it before the command to debug in the if statement.
I cannot catch the error code in this way, but it doesn’t matter too much as long as I can get a message.
It would be nice to hear from you about that. Thanks
answered Oct 29, 2012 at 18:04
Luca Borrione
Contents
- 1 Problem
- 2 Solutions
- 2.1 Executed in subshell, exit on error
- 2.2 Executed in subshell, trap error
- 3 Caveat 1: `Exit on error’ ignoring subshell exit status
- 3.1 Solution: Generate error yourself if subshell fails
- 3.1.1 Example 1
- 3.1.2 Example 2
- 4 Caveat 2: `Exit on error’ not exiting subshell on error
- 4.1 Solution: Use logical operators (&&, ||) within subshell
- 4.1.1 Example
- 5 Caveat 3: `Exit on error’ not exiting command substitution on error
- 5.1 Solution 1: Use logical operators (&&, ||) within command substitution
- 5.2 Solution 2: Enable posix mode
- 6 The tools
- 6.1 Exit on error
- 6.1.1 Specify `bash -e’ as the shebang interpreter
- 6.1.1.1 Example
- 6.1.2 Set ERR trap to exit
- 6.1.2.1 Example
- 7 Solutions revisited: Combining the tools
- 7.1 Executed in subshell, trap on exit
- 7.1.1 Rationale
- 7.2 Sourced in current shell
- 7.2.1 Todo
- 7.2.2 Rationale
- 7.2.2.1 `Exit’ trap in sourced script
- 7.2.2.2 `Break’ trap in sourced script
- 7.2.2.3 Trap in function in sourced script without `errtrace’
- 7.2.2.4 Trap in function in sourced script with ‘errtrace’
- 7.2.2.5 `Break’ trap in function in sourced script with `errtrace’
- 8 Test
- 9 See also
- 10 Journal
- 10.1 20210114
- 10.2 20060524
- 10.3 20060525
- 11 Comments
Problem
I want to catch errors in a bash script using set -e (or set -o errexit, or trap ERR). What are best practices?
Solutions
See #Solutions revisited: Combining the tools for detailed explanations.
If the script is executed in a subshell, it’s relatively easy: you don’t have to worry about backing up and restoring shell options and shell traps, because they’re automatically restored when you exit the subshell.
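A short sketch of that restoration behaviour:

```shell
#!/usr/bin/env bash
# Options and traps set inside a subshell vanish when the subshell exits.
(
  set -e
  trap 'echo "subshell trap fired"' ERR
  false        # trap fires, then the subshell exits because of -e
)
echo "parent continues; no errexit, no ERR trap to restore"
```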
Executed in subshell, exit on error
Example script:
#!/bin/bash -eu
# -e: Exit immediately if a command exits with a non-zero status.
# -u: Treat unset variables as an error when substituting.

(false)                  # Caveat 1: If an error occurs in a subshell, it isn't detected
(false) || false         # Solution: If you want to exit, you have to detect the error yourself
(false; true) || false   # Caveat 2: The return status of the `;' separated list is `true'
(false && true) || false # Solution: If you want to control the last command executed, use `&&'
See also #Caveat 1: `Exit on error’ ignoring subshell exit status
Executed in subshell, trap error
#!/bin/bash -Eu
# -E: ERR trap is inherited by shell functions.
# -u: Treat unset variables as an error when substituting.
#
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

# Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
trap onexit 1 2 3 15 ERR

#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'
function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}

# myscript

# Always call `onexit' at end of script
onexit
Caveat 1: `Exit on error’ ignoring subshell exit status
The `-e’ setting does not exit if an error occurs within a subshell, for example with these subshell commands: (false)
or bash -c false
Example script caveat1.sh:
#!/bin/bash -e
echo begin
(false)
echo end
Executing the script above gives:
$ ./caveat1.sh
begin
end
$
Conclusion: the script didn’t exit after (false).
Solution: Generate error yourself if subshell fails
( SHELL COMMANDS ) || false
In the line above, the exit status of the subshell is checked. The subshell must exit with a zero status — indicating success, otherwise `false’ will run, generating an error in the current shell.
Note that within a bash `list’, with commands separated by a `;’, the return status is the exit status of the last command executed. Use the control operators `&&’ and `||’ if you want to control the last command executed:
$ (false; true) || echo foo
$ (false && true) || echo foo
foo
$
Example 1
Example script example.sh:
#!/bin/bash -e
echo begin
(false) || false
echo end
Executing the script above gives:
$ ./example.sh
begin
$
Conclusion: the script exits after false.
Example 2
Example bash commands:
$ trap 'echo error' ERR   # Set ERR trap
$ false                   # Non-zero exit status will be trapped
error
$ (false)                 # Non-zero exit status within subshell will not be trapped
$ (false) || false        # Solution: generate error yourself if subshell fails
error
$ trap - ERR              # Reset ERR trap
Caveat 2: `Exit on error’ not exiting subshell on error
The `-e’ setting doesn’t always immediately exit the subshell `(…)’ when an error occurs. It appears the subshell behaves as a simple command and has the same restrictions as `-e’:
- Exit immediately if a simple command exits with a non-zero status, unless the subshell is part of the command list immediately following a `while’ or `until’ keyword, part of the test in an `if’ statement, part of the right-hand-side of a `&&’ or `||’ list, or if the command’s return status is being inverted using `!’
Example script caveat2.sh:
#!/bin/bash -e
(false; echo A)                        # Subshell exits after `false'
!(false; echo B)                       # Subshell doesn't exit after `false'
true && (false; echo C)                # Subshell exits after `false'
(false; echo D) && true                # Subshell doesn't exit after `false'
(false; echo E) || false               # Subshell doesn't exit after `false'
if (false; echo F); then true; fi      # Subshell doesn't exit after `false'
while (false; echo G); do break; done  # Subshell doesn't exit after `false'
until (false; echo H); do break; done  # Subshell doesn't exit after `false'
Executing the script above gives:
$ ./caveat2.sh
B
D
E
F
G
H
Solution: Use logical operators (&&, ||) within subshell
Use logical operators `&&’ or `||’ to control execution of commands within a subshell.
Example
#!/bin/bash -e
(false && echo A)
!(false && echo B)
true && (false && echo C)
(false && echo D) && true
(false && echo E) || false
if (false && echo F); then true; fi
while (false && echo G); do break; done
until (false && echo H); do break; done
Executing the script above gives no output:
$ ./example.sh
$
Conclusion: the subshells do not output anything because the `&&’ operator is used instead of the command separator `;’ as in caveat2.sh.
Caveat 3: `Exit on error’ not exiting command substitution on error
The `-e’ setting doesn’t immediately exit command substitution when an error occurs, except when bash is in posix mode:
$ set -e
$ echo $(false; echo A)
A
Solution 1: Use logical operators (&&, ||) within command substitution
$ set -e
$ echo $(false || echo A)
Solution 2: Enable posix mode
When posix mode is enabled via set -o posix, command substitution will exit if `-e’ has been set in the parent shell.
$ set -e
$ set -o posix
$ echo $(false; echo A)
Enabling posix might have other effects though?
The tools
Exit on error
Bash can be told to exit immediately if a command fails. From the bash manual («set -e»):
- «Exit immediately if a simple command (see SHELL GRAMMAR above) exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of a && or || list, or if the command’s return value is being inverted via !. A trap on ERR, if set, is executed before the shell exits.»
To let bash exit on error, different notations can be used:
- Specify `bash -e’ as shebang interpreter
- Start shell script with `bash -e’
- Use `set -e’ in shell script
- Use `set -o errexit’ in shell script
- Use `trap exit ERR’ in shell script
Specify `bash -e’ as the shebang interpreter
You can add `-e’ to the shebang line, the first line of your shell script:
#!/bin/bash -e
This will execute the shell script with `-e’ active. Note `-e’ can be overridden by invoking bash explicitly (without `-e’):
$ bash shell_script
Example
Create this shell script example.sh and make it executable with chmod u+x example.sh:
#!/bin/bash -e
echo begin
false      # This should exit bash because `false' returns error
echo end   # This should never be reached
Example run:
$ ./example.sh
begin
$ bash example.sh
begin
end
$
Set ERR trap to exit
By setting an ERR trap you can catch errors as well:
trap command ERR
By setting the command to `exit’, bash exits if an error occurs.
trap exit ERR
Example
Example script example.sh
#!/bin/bash
trap exit ERR
echo begin
false
echo end
Example run:
$ ./example.sh
begin
$
The non-zero exit status of `false’ is caught by the error trap. The error trap exits and `echo end’ is never reached.
Solutions revisited: Combining the tools
Executed in subshell, trap on exit
#!/bin/bash
# --- subshell_trap.sh -------------------------------------------------
# Example script for handling bash errors.  Exit on error.  Trap exit.
# This script is supposed to run in a subshell.
# See also: http://fvue.nl/wiki/Bash:_Error_handling

# Let shell functions inherit ERR trap.  Same as `set -E'.
set -o errtrace
# Trigger error when expanding unset variables.  Same as `set -u'.
set -o nounset
# Trap non-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM, ERR
# NOTE1: - 9/KILL cannot be trapped.
#+       - 0/EXIT isn't trapped because:
#+         - with ERR trap defined, trap would be called twice on error
#+         - with ERR trap defined, syntax errors exit with status 0, not 2
# NOTE2: Setting ERR trap does implicit `set -o errexit' or `set -e'.
trap onexit 1 2 3 15 ERR

#--- onexit() -----------------------------------------------------
#  @param $1 integer  (optional) Exit status.  If not set, use `$?'
function onexit() {
    local exit_status=${1:-$?}
    echo Exiting $0 with $exit_status
    exit $exit_status
}

# myscript

# Always call `onexit' at end of script
onexit
Rationale
+-------+        +----------+   +--------+   +------+
| shell |        | subshell |   | script |   | trap |
+-------+        +----------+   +--------+   +------+
    :                 :             :           :
   +-+               +-+           +-+  error  +-+
   | |               | |           | |-------->| |
   | |      exit     | |           | |    !    | |
   | |<------------------------------------------+
   +-+                :             :           :
    :                 :             :           :
Figure 1. Trap in executed script
When a script is executed from a shell, bash will create a subshell in which the script is run. If a trap catches an error, and the trap says `exit’, this will cause the subshell to exit.
Sourced in current shell
If the script is sourced (included) in the current shell, you have to worry about restoring shell options and shell traps. If they aren’t restored, they might cause problems in other programs which rely on specific settings.
#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#   $ set +o errtrace         # Make sure errtrace is not set (bash default)
#   $ trap - ERR              # Make sure no ERR trap is set (bash default)
#   $ . listing6.inc.sh       # Source listing6.inc.sh
#   $ foo                     # Run foo()
#   foo_init
#   Entered `trap-loop'
#   trapped
#   This is always executed - with or without a trap occurring
#   foo_deinit
#   $ trap                    # Check if ERR trap is reset.
#   $ set -o | grep errtrace  # Check if the `errtrace' setting is...
#   errtrace  off             # ...restored.
#   $
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR  # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR            # Reset ERR trap
    eval $fooOldErrtrace  # Restore `errtrace' setting
    unset fooOldErrtrace  # Delete global variable
}

function foo {
    foo_init
    # `trap-loop'
    while true; do
        echo "Entered \`trap-loop'"
        false
        echo "This should never be reached because the \`false' above is trapped"
        break
    done
    echo "This is always executed - with or without a trap occurring"
    foo_deinit
}
Todo
- an existing ERR trap must be restored and called
- test if the `trap-loop’ is reached if the script breaks from a nested loop
Rationale
`Exit’ trap in sourced script
When the script is sourced in the current shell, it’s not possible to use `exit’ to terminate the program: This would terminate the current shell as well, as shown in the picture underneath.
+-------+   +--------+   +------+
| shell |   | script |   | trap |
+-------+   +--------+   +------+
    :           :           :
   +-+         +-+  error  +-+
   | |         | |-------->| |
   | |         | |         | |
   | |        exit         | |
   | |<--------------------+ |
    :           :           :
Figure 2. `Exit’ trap in sourced script
When a script is sourced from a shell, bash will run the script in the current shell. If a trap catches an error, and the trap says `exit’, this will cause the current shell to exit.
`Break’ trap in sourced script
A solution is to introduce a main loop in the program, which is terminated by a `break’ statement within the trap.
+-------+   +--------+   +--------+   +------+
| shell |   | script |   | `loop' |   | trap |
+-------+   +--------+   +--------+   +------+
    :           :            :           :
   +-+         +-+          +-+  error  +-+
   | |         | |          | |-------->| |
   | |         | |          | |         | |
   | |         | |          | |  break  | |
   | |         | |  return  | |<--------+ |
   | |<--------+ |<---------+-+          :
    :          +-+           :           :
    :           :            :           :
Figure 3. `Break’ trap in sourced script
When a script is sourced from a shell, e.g. . ./script, bash will run the script in the current shell. If a trap catches an error, and the trap says `break’, this will cause the `loop’ to break and to return to the script.
For example:
#!/bin/bash
#--- listing3.sh -------------------------------------------------------
# See: http://fvue.nl/wiki/Bash:_Error_handling

trap 'echo trapped; break' ERR  # Set ERR trap

function foo { echo foo; false; }  # foo() exits with error

# `trap-loop'
while true; do
    echo "Entered \`trap-loop'"
    foo
    echo "This is never reached"
    break
done
echo "This is always executed - with or without a trap occurring"

trap - ERR  # Reset ERR trap
Listing 3. `Break’ trap in sourced script
Example output:
$> source listing3.sh
Entered `trap-loop'
foo
trapped
This is always executed after a trap
$>
Trap in function in sourced script without `errtrace’
A problem arises when the trap is reset from within a function of a sourced script. From the bash manual, set -o errtrace or set -E:
If set, any trap on `ERR’ is inherited by shell functions, command
substitutions, and commands executed in a subshell environment.
The `ERR’ trap is normally not inherited in such cases.
So with errtrace not set, a function does not know of any `ERR’ trap set, and thus the function is unable to reset the `ERR’ trap. For example, see listing 4 underneath.
#!/bin/bash
#--- listing4.inc.sh ---------------------------------------------------
# Demonstration of ERR trap not being reset by foo_deinit()
# Example run:
#
#   $> set +o errtrace    # Make sure errtrace is not set (bash default)
#   $> trap - ERR         # Make sure no ERR trap is set (bash default)
#   $> . listing4.inc.sh  # Source listing4.inc.sh
#   $> foo                # Run foo()
#   foo_init
#   foo
#   foo_deinit            # This should've reset the ERR trap...
#   $> trap               # but the ERR trap is still there:
#   trap -- 'echo trapped' ERR
#   $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    trap 'echo trapped' ERR  # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR  # Reset ERR trap
}

function foo {
    foo_init
    echo foo
    foo_deinit
}
Listing 4. Trap in function in sourced script
foo_deinit() is unable to unset the ERR trap, because errtrace is not set.
Trap in function in sourced script with ‘errtrace’
The solution is to set -o errtrace. See listing 5 underneath:
#!/bin/bash
#--- listing5.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#   $> set +o errtrace         # Make sure errtrace is not set (bash default)
#   $> trap - ERR              # Make sure no ERR trap is set (bash default)
#   $> . listing5.inc.sh       # Source listing5.inc.sh
#   $> foo                     # Run foo()
#   foo_init
#   foo
#   foo_deinit                 # This should reset the ERR trap...
#   $> trap                    # and it is indeed.
#   $> set +o | grep errtrace  # And the `errtrace' setting is restored.
#   $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped' ERR  # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR            # Reset ERR trap
    eval $fooOldErrtrace  # Restore `errtrace' setting
    unset fooOldErrtrace  # Delete global variable
}

function foo {
    foo_init
    echo foo
    foo_deinit
}
`Break’ trap in function in sourced script with `errtrace’
Everything combined in listing 6 underneath:
#!/bin/bash
#--- listing6.inc.sh ---------------------------------------------------
# Demonstration of ERR trap being reset by foo_deinit() with the use
# of `errtrace'.
# Example run:
#
#   $> set +o errtrace         # Make sure errtrace is not set (bash default)
#   $> trap - ERR              # Make sure no ERR trap is set (bash default)
#   $> . listing6.inc.sh       # Source listing6.inc.sh
#   $> foo                     # Run foo()
#   foo_init
#   Entered `trap-loop'
#   trapped
#   This is always executed - with or without a trap occurring
#   foo_deinit
#   $> trap                    # Check if ERR trap is reset.
#   $> set -o | grep errtrace  # Check if the `errtrace' setting is...
#   errtrace  off              # ...restored.
#   $>
#
# See: http://fvue.nl/wiki/Bash:_Error_handling

function foo_init {
    echo foo_init
    fooOldErrtrace=$(set +o | grep errtrace)
    set -o errtrace
    trap 'echo trapped; break' ERR  # Set ERR trap
}

function foo_deinit {
    echo foo_deinit
    trap - ERR            # Reset ERR trap
    eval $fooOldErrtrace  # Restore `errtrace' setting
    unset fooOldErrtrace  # Delete global variable
}

function foo {
    foo_init
    # `trap-loop'
    while true; do
        echo "Entered \`trap-loop'"
        false
        echo "This should never be reached because the \`false' above is trapped"
        break
    done
    echo "This is always executed - with or without a trap occurring"
    foo_deinit
}
Test
#!/bin/bash

# Tests

# An erroneous command should have exit status 127.
# The erroneous command should be trapped by the ERR trap.
#erroneous_command

# A simple command exiting with a non-zero status should have exit status
#+ <> 0, in this case 1.  The simple command is trapped by the ERR trap.
#false

# Manually calling 'onexit'
#onexit

# Manually calling 'onexit' with exit status
#onexit 5

# Killing a process via CTRL-C (signal 2/SIGINT) is handled via the SIGINT trap
# NOTE: `sleep' cannot be killed via `kill' plus 1/SIGHUP, 2/SIGINT, 3/SIGQUIT
#+ or 15/SIGTERM.
#echo $$; sleep 20

# Killing a process via 1/SIGHUP, 2/SIGINT, 3/SIGQUIT or 15/SIGTERM is
#+ handled via the respective trap.
# NOTE: Unfortunately, I haven't found a way to retrieve the signal number from
#+ within the trap function.
echo $$; while true; do :; done

# A syntax error is not trapped, but should have exit status 2
#fi

# An unbound variable is not trapped, but should have exit status 1
# thanks to 'set -u'
#echo $foo

# Executing `false' within a function should exit with 1 because of `set -E'
#function foo() {
#    false
#    true
#} # foo()
#foo

echo End of script

# Always call 'onexit' at end of script
onexit
See also
- Bash: Err trap not reset (solution for trap - ERR not resetting the ERR trap)
Journal
20210114
Another caveat: exit (or an error-trap) executed within «process substitution» doesn’t end outer process. The script underneath keeps outputting «loop1»:
#!/bin/bash
# This script outputs "loop1" forever, while I hoped it would exit all while-loops
set -o pipefail
set -Eeu
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done < <( exit 1 )
done
The ‘< <()’ notation is called process substitution.
See also:
- https://mywiki.wooledge.org/ProcessSubstitution
- https://unix.stackexchange.com/questions/128560/how-do-i-capture-the-exit-code-handle-errors-correctly-when-using-process-subs
- https://superuser.com/questions/696855/why-doesnt-a-bash-while-loop-exit-when-piping-to-terminated-subcommand
Workaround: Use «Here Strings» ([n]<<<word):
#!/bin/bash
# This script will exit correctly if building up $rows results in an error
set -Eeu
rows=$(exit 1)
while true; do
    echo loop1
    while read FOO; do
        echo loop2
        echo FOO: $FOO
    done <<< "$rows"
done
20060524
#!/bin/bash
#--- traptest.sh --------------------------------------------
# Example script for trapping bash errors.
# NOTE: Why doesn't this script catch syntax errors?

# Exit on all errors
set -e
# Trap exit
trap trap_exit_handler EXIT

# Handle exit trap
function trap_exit_handler() {
    # Backup exit status if you're interested...
    local exit_status=$?
    # Change value of $?
    true
    echo $?
    #echo trap_handler $exit_status
} # trap_exit_handler()

# An erroneous command will trigger a bash error and, because
# of 'set -e', will 'exit 127' thus falling into the exit trap.
#erroneous_command

# The same goes for a command with a false return status
#false

# A manual exit will also fall into the exit trap
#exit 5

# A syntax error isn't caught?
fi

# Disable exit trap
trap - EXIT

exit 0
Normally, a syntax error exits with status 2, but when both ‘set -e’ and ‘trap EXIT’ are defined, my script exits with status 0. How can I have both ‘errexit’ and ‘trap EXIT’ enabled, *and* catch syntax errors
via exit status? Here’s an example script (test.sh):
set -e
trap 'echo trapped: $?' EXIT
fi

$> bash test.sh; echo $?: $?
test.sh: line 3: syntax error near unexpected token `fi'
trapped: 0
$?: 0
More trivia:
- With the line ‘#set -e’ commented, bash traps 258 and returns an exit status of 2:
trapped: 258
$?: 2
- With the line ‘#trap ‘echo trapped $?’ EXIT’ commented, bash returns an exit status of 2:
$?: 2
- With a bogus function definition on top, bash returns an exit status of 2, but no exit trap is executed:
function foo() {
    foo=bar
}
set -e
trap 'echo trapped: $?' EXIT
fi
fred@linux:~> bash test.sh; echo $?: $?
test.sh: line 4: syntax error near unexpected token `fi'
test.sh: line 4: `fi'
$?: 2
20060525
Example of a ‘cleanup’ script (see also: trap, Writing Robust Bash Shell Scripts):
#!/bin/bash
#--- cleanup.sh ---------------------------------------------------------------
# Example script for trapping bash errors.
# NOTE: Use 'cleanexit [status]' instead of 'exit [status]'

# Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
# @see catch_sig()
trap catch_sig 1 2 3 15
# Trap errors (simple commands exiting with a non-zero status)
# @see catch_err()
trap catch_err ERR

#--- cleanexit() --------------------------------------------------------------
# Wrapper around 'exit' to cleanup on exit.
# @param $1 integer  Exit status.  If $1 not defined, exit status of global
#+                   variable 'EXIT_STATUS' is used.  If neither $1 or
#+                   'EXIT_STATUS' defined, exit with status 0 (success).
function cleanexit() {
    echo "Exiting with ${1:-${EXIT_STATUS:-0}}"
    exit ${1:-${EXIT_STATUS:-0}}
} # cleanexit()

#--- catch_err() --------------------------------------------------------------
# Catch ERR trap.
# This traps simple commands exiting with a non-zero status.
# See also: info bash | "Shell Builtin Commands" | "The Set Builtin" | "-e"
function catch_err() {
    local exit_status=$?
    echo "Inside catch_err"
    cleanexit $exit_status
} # catch_err()

#--- catch_sig() --------------------------------------------------------------
# Catch signal trap.
# Trap not-normal exit signals: 1/HUP, 2/INT, 3/QUIT, 15/TERM
# @NOTE1: Non-trapped signals are 0/EXIT, 9/KILL.
function catch_sig() {
    local exit_status=$?
    echo "Inside catch_sig"
    cleanexit $exit_status
} # catch_sig()

# An erroneous command should have exit status 127.
# The erroneous command should be trapped by the ERR trap.
#erroneous_command

# A command returning false should have exit status <> 0
# The false returning command should be trapped by the ERR trap.
#false

# Manually calling 'cleanexit'
#cleanexit

# Manually calling 'cleanexit' with exit status
#cleanexit 5

# Killing a process via CTRL-C is handled via the SIGINT trap
#sleep 20

# A syntax error is not trapped, but should have exit status 2
#fi

# Always call 'cleanexit' at end of script
cleanexit
Introduction
A shell script can run into problems during its execution, resulting in an error signal that interrupts the script unexpectedly.
Errors occur due to a faulty script design, user actions, or system failures. A script that fails may leave behind temporary files that cause trouble when a user restarts the script.
This tutorial will show you how to use the trap
command to ensure your scripts always exit predictably.
Prerequisites
- Access to the terminal/command line.
- A text editor (Nano, Vi/Vim, etc.).
Bash trap Syntax
The syntax for the trap
command is:
trap [options] "[arguments]" [signals]
The command has the following components:
- Options provide added functionality to the command.
- Arguments are the commands trap executes upon detecting a signal. Unless the command is only one word, it should be enclosed in quotation marks (" "). If the argument contains more than one command, separate them with a semicolon (;).
- Signals are asynchronous notifications sent by the system, usually indicating a user-generated or system-related interruption. Signals can be called by their name or number.
Bash trap Options
The trap
command accepts the following options:
- -p — Displays signal commands.
- -l — Prints a list of all the signals and their numbers.
Below is the complete list of the 64 signals and their numbers:
# | Signal | # | Signal | # | Signal |
---|---|---|---|---|---|
1 | SIGHUP | 23 | SIGURG | 45 | SIGRTMIN+11 |
2 | SIGINT | 24 | SIGXCPU | 46 | SIGRTMIN+12 |
3 | SIGQUIT | 25 | SIGXFSZ | 47 | SIGRTMIN+13 |
4 | SIGILL | 26 | SIGVTALRM | 48 | SIGRTMIN+14 |
5 | SIGTRAP | 27 | SIGPROF | 49 | SIGRTMIN+15 |
6 | SIGABRT | 28 | SIGWINCH | 50 | SIGRTMAX-14 |
7 | SIGBUS | 29 | SIGIO | 51 | SIGRTMAX-13 |
8 | SIGFPE | 30 | SIGPWR | 52 | SIGRTMAX-12 |
9 | SIGKILL | 31 | SIGSYS | 53 | SIGRTMAX-11 |
10 | SIGUSR1 | 32 | SIGWAITING | 54 | SIGRTMAX-10 |
11 | SIGSEGV | 33 | SIGLWP | 55 | SIGRTMAX-9 |
12 | SIGUSR2 | 34 | SIGRTMIN | 56 | SIGRTMAX-8 |
13 | SIGPIPE | 35 | SIGRTMIN+1 | 57 | SIGRTMAX-7 |
14 | SIGALRM | 36 | SIGRTMIN+2 | 58 | SIGRTMAX-6 |
15 | SIGTERM | 37 | SIGRTMIN+3 | 59 | SIGRTMAX-5 |
16 | SIGSTKFLT | 38 | SIGRTMIN+4 | 60 | SIGRTMAX-4 |
17 | SIGCHLD | 39 | SIGRTMIN+5 | 61 | SIGRTMAX-3 |
18 | SIGCONT | 40 | SIGRTMIN+6 | 62 | SIGRTMAX-2 |
19 | SIGSTOP | 41 | SIGRTMIN+7 | 63 | SIGRTMAX-1 |
20 | SIGTSTP | 42 | SIGRTMIN+8 | 64 | SIGRTMAX |
21 | SIGTTIN | 43 | SIGRTMIN+9 | | |
22 | SIGTTOU | 44 | SIGRTMIN+10 | | |
Note: Signals 32 and 33 are not supported in Linux, and the trap -l
command does not display them in the output.
The signals most commonly used with the trap
command are:
- SIGHUP (1) — Clean tidy-up
- SIGINT (2) — Interrupt
- SIGQUIT (3) — Quit
- SIGABRT (6) — Cancel
- SIGALRM (14) — Alarm clock
- SIGTERM (15) — Terminate

Note: The SIG prefix in signal names is optional. For example, the SIGTERM signal can also be written as TERM.
How to Use trap in Bash
A typical scenario for using the trap
command is catching the SIGINT
signal. This signal is sent by the system when the user interrupts the execution of the script by pressing Ctrl+C.
The following example script prints the word "Test" every second until the user interrupts it with Ctrl+C. The script then prints a message and quits.
trap "echo The script is terminated; exit" SIGINT
while true
do
echo Test
sleep 1
done
The while
loop in the example above executes infinitely. The first line of the script contains the trap
command and the instructions to wait for the SIGINT
signal, then print the message and exit the script.
The trap
command is frequently used to clean up temporary files if the script exits due to interruption. The following example defines the cleanup
function, which prints a message, removes all the files added to the $TRASH
variable, and exits the script.
TRASH=$(mktemp -t tmp.XXXXXXXXXX)
trap cleanup 1 2 3 6
cleanup()
{
echo "Removing temporary files:"
rm -rf "$TRASH"
exit
}
...
The trap
in the example above executes the cleanup
function when it detects one of the four signals: SIGHUP
, SIGINT
, SIGQUIT
, or SIGABRT
. The signals are referred to by their number.
You can also use trap
to ensure the user cannot interrupt the script execution. This feature is important when executing sensitive commands whose interruption may permanently damage the system. The syntax for disabling a signal is:
trap "" [signal]
Double quotation marks mean that no command will be executed, so the signal is effectively ignored. For example, to disable the SIGINT and SIGABRT signals, type:
trap "" SIGINT SIGABRT
[a command that must not be interrupted]
If you wish to re-enable the signals at any time during the script, reset the rules by using the dash symbol:
trap - SIGINT SIGABRT
[a command that can be interrupted]
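Putting the disable and re-enable steps together, here is a minimal runnable sketch; the echo and sleep lines are placeholders for real commands:

```shell
#!/usr/bin/env bash

trap "" SIGINT SIGABRT     # empty command string: these signals are now ignored
echo "critical section: Ctrl+C is ignored here"
sleep 1                    # stand-in for a command that must not be interrupted

trap - SIGINT SIGABRT      # dash: restore the default signal behavior
echo "normal section: Ctrl+C works again"
```

If you run it and press Ctrl+C during the sleep, nothing happens until the trap is reset, because children started while a signal is ignored inherit that disposition.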
Note: The SIGKILL
signal cannot be trapped. It always immediately interrupts the script.
Conclusion
After reading this tutorial, you know how to use the trap
command to ensure your bash script always exits properly. If you are interested in more Bash-related topics, read How to Run a Bash Script.
It’s easy to detect when a shell script starts, but it’s not always easy to know when it stops. A script might end normally, just as its author intends it to end, but it could also fail due to an unexpected fatal error. Sometimes it’s beneficial to preserve the remnants of whatever was in progress when a script failed, and other times it’s inconvenient. Either way, detecting the end of a script and reacting to it in some pre-calculated manner is why the Bash trap
directive exists.
Responding to failure
Here’s an example of how one failure in a script can lead to future failures. Say you have written a program that creates a temporary directory in /tmp
so that it can unarchive and process files before bundling them back together in a different format:
#!/usr/bin/env bash
CWD=`pwd`
TMP=${TMP:-/tmp/tmpdir}
## create tmp dir
mkdir "${TMP}"
## extract files to tmp
tar xf "${1}" --directory "${TMP}"
## move to tmpdir and run commands
pushd "${TMP}"
for IMG in *.jpg; do
mogrify -verbose -flip -flop "${IMG}"
done
tar --create --file "${1%.*}".tar *.jpg
## move back to origin
popd
## bundle with bzip2
bzip2 --compress "${TMP}"/"${1%.*}".tar \
      --stdout > "${1%.*}".tbz
## clean up
/usr/bin/rm -r "${TMP}"
Most of the time, the script works as expected. However, if you accidentally run it on an archive filled with PNG files instead of the expected JPEG files, it fails halfway through. One failure leads to another, and eventually, the script exits without reaching its final directive to remove the temporary directory. As long as you manually remove the directory, you can recover quickly, but if you aren’t around to do that, then the next time the script runs, it has to deal with an existing temporary directory full of unpredictable leftover files.
One way to combat this is to reverse and double-up on the logic by adding a precautionary removal to the start of the script. While valid, that relies on brute force instead of structure. A more elegant solution is trap
.
Catching signals with trap
The trap
keyword catches signals that may happen during execution. You’ve used one of these signals if you’ve ever used the kill
or killall
commands, which call SIGTERM
by default. There are many other signals that shells respond to, and you can see most of them with trap -l
(as in "list"):
$ trap -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
Any of these signals may be anticipated with trap
. In addition to these, trap
recognizes:
- EXIT: Occurs when the shell process itself exits
- ERR: Occurs when a command (such as tar or mkdir) or a built-in command (such as pushd or cd) completes with a non-zero status
- DEBUG: A Boolean representing debug mode
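The EXIT condition in particular is handy because it fires no matter how the script ends, whether by success, failure, or interruption, which makes it a natural home for cleanup. A small sketch; the scratch file is just an illustration:

```shell
#!/usr/bin/env bash

scratch="$(mktemp)"        # hypothetical temporary file

# The EXIT trap runs whether the script finishes normally or aborts partway.
trap '/usr/bin/rm -f "$scratch"; echo "cleaned up"' EXIT

echo "working with $scratch"
# ...even an early 'exit 1' here would still trigger the cleanup above...
```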
To set a trap in Bash, use trap
followed by a list of commands you want to be executed, followed by a list of signals to trigger it.
For instance, this trap detects a SIGINT
, the signal sent when a user presses Ctrl+C while a process is running:
trap "{ echo 'Terminated with Ctrl+C'; }" SIGINT
The example script with temporary directory problems can be fixed with a trap detecting SIGINT
, errors, and successful exits:
#!/usr/bin/env bash
CWD=`pwd`
TMP=${TMP:-/tmp/tmpdir}
trap \
  '{ /usr/bin/rm -r "${TMP}" ; exit 255; }' \
  SIGINT SIGTERM ERR EXIT
## create tmp dir
mkdir "${TMP}"
tar xf "${1}" --directory "${TMP}"
## move to tmp and run commands
pushd "${TMP}"
for IMG in *.jpg; do
mogrify -verbose -flip -flop "${IMG}"
done
tar --create --file "${1%.*}".tar *.jpg
## move back to origin
popd
## zip tar
bzip2 --compress $TMP/"${1%.*}".tar \
      --stdout > "${1%.*}".tbz
For complex actions, you can simplify trap
statements with Bash functions.
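For instance, the earlier cleanup example could hand the whole job to a function, leaving the trap itself one short line; the function name and directory below are illustrative:

```shell
#!/usr/bin/env bash

TMP=${TMP:-/tmp/tmpdir}

cleanup() {
    echo "Removing ${TMP}"
    /usr/bin/rm -rf "${TMP}"
}

# One short trap line, however elaborate the cleanup becomes.
trap cleanup SIGINT SIGTERM ERR EXIT

mkdir -p "${TMP}"
# ...work with the temporary directory here...
```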
Traps in Bash
Traps are useful to ensure that your scripts end cleanly, whether they run successfully or not. It’s never safe to rely completely on automated garbage collection, so this is a good habit to get into in general. Try using them in your scripts, and see what they can do!
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
In this article, I present a few tricks for handling error conditions. Some strictly don't fall under the category of error handling (a reactive way to deal with the unexpected), and I also cover some techniques to avoid errors before they happen.
Case study: Simple script that downloads a hardware report from multiple hosts and inserts it into a database.
Say that you have a cron
job on each one of your Linux systems, and you have a script to collect the hardware information from each:
#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
dmaf5
)
DATADIR="$HOME/Documents/lshw-dump"
/usr/bin/mkdir -p -v "$DATADIR"
for server in ${servers[*]}; do
echo "Visiting: $server"
/usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
done
wait
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
/usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done
If everything goes well, then you collect your files in parallel because you don’t have more than ten systems. You can afford to ssh to all of them at the same time and then show the hardware details of each one.
Visiting: dmaf5
lshw-dump.json 100% 54KB 136.9MB/s 00:00
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
"boot": "normal",
"chassis": "desktop",
"family": "Default string",
"sku": "Default string",
"uuid": "00020003-0004-0005-0006-000700080009"
}
Here are some possibilities of why things went wrong:
- Your report didn’t run because the server was down
- You couldn’t create the directory where the files need to be saved
- The tools you need to run the script are missing
- You can’t collect the report because your remote machine crashed
- One or more of the reports is corrupt
The current version of the script has a problem: it will run from beginning to end, errors or not:
./collect_data_from_servers.sh
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 48.8MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9
Next, I demonstrate a few things to make your script more robust and, in some cases, able to recover from failure.
The nuclear option: Failing hard, failing fast
The proper way to handle errors is to check if the program finished successfully or not, using return codes. It sounds obvious, but return codes (an integer stored in the Bash $? variable; the related $! variable holds the PID of the last background job, whose status you retrieve with wait) sometimes have a broader meaning. The Bash man page tells you:
For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.
As usual, you should always read the man page of the scripts you’re calling, to see what the conventions are for each of them. If you’ve programmed with a language like Java or Python, then you’re most likely familiar with their exceptions, different meanings, and how not all of them are handled the same way.
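You can observe both conventions from the quote directly in a terminal. This is just a demonstration, not part of the collection script:

```shell
#!/usr/bin/env bash

false                                # a command that fails
echo "false exited with: $?"         # prints 1: plain failure

/usr/bin/sleep 30 &                  # start a background job...
kill -TERM $!                        # ...and end it with SIGTERM (signal 15)
wait $!                              # wait propagates the job's exit status
echo "sleep exited with: $?"         # prints 143, i.e. 128 + 15
```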
If you add set -o errexit to your script, from that point forward it will abort the execution if any command exits with a code != 0. But errexit isn't applied when executing commands inside an if condition (or on the left-hand side of && or ||), so instead of remembering that exception, I'd rather do explicit error handling.
Take a look at version two of the script. It’s slightly better:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/ )
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14 macmini2
15 mac-pro-1-1
16 dmaf5
17 )
18
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then
21 /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi
23 declare -A server_pid
24 for server in ${servers[*]}; do
25 echo "Visiting: $server"
26 /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
27 server_pid[$server]=$! # Save the PID of the scp of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in ${!server_pid[*]}; do
33 wait ${server_pid[$server]}
34 test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
37 /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
38 done
Here’s what changed:
- Lines 11 and 12: I enable error tracing and add a trap to tell the user there was an error and there is turbulence ahead. You may want to kill your script here instead; I'll show you later why that may not be the best idea.
- Line 20: If the directory doesn't exist, try to create it on line 21. If directory creation fails, exit with an error.
- Line 27: After running each background job, I capture the PID and associate it with the machine (a 1:1 relationship).
- Lines 33-35: I wait for each scp task to finish, get the return code, and abort if there's an error.
- Line 37: I check that each file can be parsed; otherwise, I exit with an error.
So how does the error handling look now?
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json 100% 54KB 146.1MB/s 00:00
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory
As you can see, this version is better at detecting errors but it’s very unforgiving. Also, it doesn’t detect all the errors, does it?
When you get stuck and you wish you had an alarm
The code looks better, except that sometimes the scp
could get stuck on a server (while trying to copy a file) because the server is too busy to respond or just in a bad state.
Another example is to try to access a directory through NFS where $HOME
is mounted from an NFS server:
/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
And you discover hours later that the NFS mount point is stale and your script is stuck.
A timeout is the solution, and GNU timeout comes to the rescue:
/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt
Here you try to kill the process nicely (with a TERM signal) 10.0 seconds after it has started. If it's still running 20.0 seconds after that, timeout sends a KILL signal (kill -9). If in doubt, check which signals are supported on your system (kill -l, for example).
If this isn't clear from the description, then look at this timed example for more clarity.
/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real 0m20.003s
user 0m0.000s
sys 0m0.003s
Back to the original script to add a few more options and you have version three:
1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * Open SSH: http://www.openssh.com/portable.html
5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
6 # * JQ: http://stedolan.github.io/jq/
7 # * timeout: https://www.gnu.org/software/coreutils/
8 #
9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
11 # Author: Jose Vicente Nunez
12 #
13 set -o errtrace # Enable the err trap, code will get called when an error is detected
14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
15
16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/jq)
17 for dependency in ${dependencies[@]}; do
18 if [ ! -x $dependency ]; then
19 echo "ERROR: Missing $dependency"
20 exit 100
21 fi
22 done
23
24 declare -a servers=(
25 macmini2
26 mac-pro-1-1
27 dmaf5
28 )
29
30 function remote_copy {
31 local server=$1
32 echo "Visiting: $server"
33 /usr/bin/timeout --kill-after 25.0s 20.0s \
34 /usr/bin/scp \
35 -o BatchMode=yes \
36 -o logLevel=Error \
37 -o ConnectTimeout=5 \
38 -o ConnectionAttempts=3 \
39 ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json
40 return $?
41 }
42
43 DATADIR="$HOME/Documents/lshw-dump"
44 if [ ! -d "$DATADIR" ]; then
45 /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
46 fi
47 declare -A server_pid
48 for server in ${servers[*]}; do
49 remote_copy $server &
50 server_pid[$server]=$! # Save the PID of the scp of a given server for later
51 done
52 # Iterate through all the servers and:
53 # Wait for the return code of each
54 # Check the exit code from each scp
55 for server in ${!server_pid[*]}; do
56 wait ${server_pid[$server]}
57 test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
58 done
59 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
60 /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
61 done
What are the changes?:
- Between lines 16-22, check that all the required dependency tools are present. If one cannot be executed, then 'Houston, we have a problem.'
- Created a remote_copy function, which uses a timeout to make sure the scp finishes no later than 45.0s (line 33).
- Added a connection timeout of 5 seconds instead of the TCP default (line 37).
- Added a retry to scp on line 38: 3 attempts that wait 1 second between each.
There are other ways to retry when there's an error.
Waiting for the end of the world: How and when to retry
You noticed there's an added retry on the scp command. But that only retries failed connections; what if the command fails in the middle of the copy?
Sometimes you just want to fail because there's very little chance of recovering from an issue (a system that requires hardware fixes, for example), or you can fall back to a degraded mode, meaning that your system can continue working without the updated data. In those cases, it makes no sense to wait forever, only for a specific amount of time.
Here are the changes to the remote_copy
, to keep this brief (version four):
#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3
# Blah blah blah...
function remote_copy {
local server=$1
local retries=$2
local now=1
status=0
while [ $now -le $retries ]; do
echo "INFO: Trying to copy file from: $server, attempt=$now"
/usr/bin/timeout --kill-after 25.0s 20.0s \
/usr/bin/scp \
-o BatchMode=yes \
-o logLevel=Error \
-o ConnectTimeout=5 \
-o ConnectionAttempts=3 \
${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
status=$?
if [ $status -ne 0 ]; then
sleep_time=$(((RANDOM % 60)+ 1))
echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
/usr/bin/sleep ${sleep_time}s
else
break # All good, no point on waiting...
fi
((now=now+1))
done
return $status
}
DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
/usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
remote_copy $server $MAX_RETRIES &
server_pid[$server]=$! # Save the PID of the scp of a given server for later
done
# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
wait ${server_pid[$server]}
test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done
# Blah blah blah, process the files you just copied...
How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two, there’s a retry for a random time between 1 and 60 seconds before exiting:
INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
If I fail, do I have to do this all over again? Using a checkpoint
Suppose that the remote copy is the most expensive operation of this whole script and that you’re willing or able to re-run this script, maybe using cron
or doing so by hand two times during the day to ensure you pick up the files if one or more systems are down.
You could, for the day, create a small ‘status cache’, where you record only the successful processing operations per machine. If a system is in there, then don’t bother to check again for that day.
Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry
).
A new version (version five) of the script has code to record the status of the copy (lines 15-33):
15 declare SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
16 declare YYYYMMDD=$(/usr/bin/date +%Y%m%d)|| exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20 /usr/bin/mkdir -p -v "$CACHE_DIR"|| exit 100
21 fi
22 trap "/bin/rm -rf $CACHE_DIR" INT TERM
23
24 function check_previous_run {
25 local machine=$1
26 test -f $CACHE_DIR/$machine && return 0|| return 1
27 }
28
29 function mark_previous_run {
30 machine=$1
31 /usr/bin/touch $CACHE_DIR/$machine
32 return $?
33 }
Did you notice the trap on line 22? If the script is interrupted (killed), I want to make sure the whole cache is invalidated.
And then, add this new helper logic into the remote_copy
function (lines 52-81):
52 function remote_copy {
53 local server=$1
54 check_previous_run $server
55 test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56 local retries=$2
57 local now=1
58 status=0
59 while [ $now -le $retries ]; do
60 echo "INFO: Trying to copy file from: $server, attempt=$now"
61 /usr/bin/timeout --kill-after 25.0s 20.0s \
62 /usr/bin/scp \
63 -o BatchMode=yes \
64 -o logLevel=Error \
65 -o ConnectTimeout=5 \
66 -o ConnectionAttempts=3 \
67 ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
68 status=$?
69 if [ $status -ne 0 ]; then
70 sleep_time=$(((RANDOM % 60)+ 1))
71 echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72 /usr/bin/sleep ${sleep_time}s
73 else
74 break # All good, no point on waiting...
75 fi
76 ((now=now+1))
77 done
78 test $status -eq 0 && mark_previous_run $server
79 test $? -ne 0 && status=1
80 return $status
81 }
The first time it runs, new messages for the cache directory are printed out:
./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
If you run it again, then the script knows that dmaf5 is good to go, and there's no need to retry the copy:
./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
Imagine how much time this saves when you have more machines that should not be revisited.
Leaving crumbs behind: What to log, how to log, and verbose output
If you're like me, you like a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?
If you use logger, you can save the output to the journal for later review (and even aggregate it with other tools out there). The best part is that you harness the power of journalctl right away.
So instead of just doing echo, you can also add a call to logger, like this, using a new Bash function called message:
SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
FULL_PATH=$(/usr/bin/realpath ${BASH_SOURCE[0]})|| exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
function message {
message="$1"
func_name="${2-unknown}"
priority=6
if [ -z "$2" ]; then
echo "INFO:" $message
else
echo "ERROR:" $message
priority=0
fi
/usr/bin/logger --journald <<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}
You can see that you can store separate fields as part of the message, like the priority, the script that produced the message, etc.
So how is this useful? Well, you could get the messages between 1:26 PM and 1:27 PM, only errors (priority=0), and only for our script (collect_data_from_servers.v6.sh), like this, with the output in JSON format:
journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
"_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
"_AUDIT_LOGINUID" : "1000",
"SYSLOG_IDENTIFIER" : "logger",
"PRIORITY" : "0",
"_TRANSPORT" : "journal",
"_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
"__REALTIME_TIMESTAMP" : "1623518797641880",
"_AUDIT_SESSION" : "3",
"_GID" : "1000",
"MESSAGE_ID" : "collect_data_from_servers.v6.sh",
"MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
"_CAP_EFFECTIVE" : "0",
"CODE_FUNC" : "remote_copy",
"_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
"_COMM" : "logger",
"CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
"_PID" : "41832",
"__MONOTONIC_TIMESTAMP" : "25928272252",
"_HOSTNAME" : "dmaf5",
"_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
"__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
"_UID" : "1000"
}
Because this is structured data, other log collectors can go through all your machines, aggregate your script logs, and then you not only have data, but also information.
You can take a look at the whole version six of the script.
Don’t be so eager to replace your data until you’ve checked it.
If you noticed from the very beginning, I’ve been copying a corrupted JSON file over and over:
Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'
That's easy to prevent. Copy the file into a temporary location, and if the file is corrupted, then don't attempt to replace the previous version (and leave the bad one for inspection). See lines 99-107 of version seven of the script:
function remote_copy {
local server=$1
check_previous_run $server
test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
local retries=$2
local now=1
status=0
while [ $now -le $retries ]; do
message "Trying to copy file from: $server, attempt=$now"
/usr/bin/timeout --kill-after 25.0s 20.0s \
/usr/bin/scp \
-o BatchMode=yes \
-o logLevel=Error \
-o ConnectTimeout=5 \
-o ConnectionAttempts=3 \
${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json.$$
status=$?
if [ $status -ne 0 ]; then
sleep_time=$(((RANDOM % 60)+ 1))
message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." ${FUNCNAME[0]}
/usr/bin/sleep ${sleep_time}s
else
break # All good, no point on waiting...
fi
((now=now+1))
done
if [ $status -eq 0 ]; then
/usr/bin/jq '.' ${DATADIR}/lshw-$server-dump.json.$$ > /dev/null 2>&1
status=$?
if [ $status -eq 0 ]; then
/usr/bin/mv -v -f ${DATADIR}/lshw-$server-dump.json.$$ ${DATADIR}/lshw-$server-dump.json && mark_previous_run $server
test $? -ne 0 && status=1
else
message "${DATADIR}/lshw-$server-dump.json.$$ Is corrupted. Leaving for inspection..." ${FUNCNAME[0]}
fi
fi
return $status
}
Choose the right tools for the task and prep your code from the first line
One very important aspect of error handling is proper coding. If you have bad logic in your code, no amount of error handling will make it better. To keep this short and Bash-related, here are a few hints.
You should ALWAYS check your scripts for syntax errors before running them:
bash -n my_bash_script.sh
Seriously. It should be as automatic as performing any other test.
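If you want to make it automatic, a small wrapper does the trick. This is a minimal sketch (the `check_syntax` helper is hypothetical, not part of the article's script): `bash -n` parses each file without executing it, so it is safe to run on anything.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: run 'bash -n' on every script it is
# given; 'bash -n' parses the file without executing any command.
check_syntax() {
    local script rc=0
    for script in "$@"; do
        if ! bash -n "$script"; then
            echo "Syntax error in ${script}" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Demonstration with a good and a deliberately broken script:
printf 'echo ok\n' > /tmp/good.$$.sh
printf 'if true; then\n  echo "missing fi"\n' > /tmp/broken.$$.sh
check_syntax /tmp/good.$$.sh && echo "good script parsed"
check_syntax /tmp/broken.$$.sh 2>/dev/null || echo "broken script rejected"
rm -f /tmp/good.$$.sh /tmp/broken.$$.sh
```

You could call a helper like this from a Makefile or a pre-commit hook, so a script with a parse error never makes it to a server.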
Read the bash man page and get familiar with must-know options, like:
set -xv
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv
Use ShellCheck to check your bash scripts
It’s very easy to miss simple issues when your scripts start to grow large, and ShellCheck is one of those tools that save you from mistakes.
shellcheck collect_data_from_servers.v7.sh
In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.
In collect_data_from_servers.v7.sh line 16:
if [ ! -x $dependency ]; then
^---------^ SC2086: Double quote to prevent globbing and word splitting.
Did you mean:
if [ ! -x "$dependency" ]; then
...
If you’re wondering, the final version of the script, after passing ShellCheck is here. Squeaky clean.
You noticed something with the background scp processes
You probably noticed that if you kill the script, it leaves some forked processes behind. That isn’t good, and it is one of the reasons I prefer tools like Ansible or Parallel to handle this type of task on multiple hosts, letting those frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation.
This Bash script could potentially create a fork bomb: it has no control over how many processes it spawns at the same time, which is a big problem in a real production environment. There is also a limit on how many concurrent SSH sessions you can open (not to mention the bandwidth they consume). Again, I wrote this fictional example in Bash to show you how you can always improve a program to handle errors better.
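One way to keep the fork count under control is to cap the number of concurrent background jobs. The sketch below is not from the original script: it assumes Bash 4.3+ (for `wait -n`), and `fetch_from` is a hypothetical stand-in for the real scp logic.

```shell
#!/usr/bin/env bash
# Sketch: throttle background tasks so at most MAX_JOBS run at once.
# Requires Bash 4.3+ for 'wait -n'. fetch_from is a hypothetical
# placeholder for the real remote copy.
MAX_JOBS=3
RESULTS="$(mktemp)"

fetch_from() {
    sleep 0.1                      # simulate network time
    echo "done: $1" >> "$RESULTS"
}

servers=(server1 server2 server3 server4 server5 server6)
for server in "${servers[@]}"; do
    # If we are at the limit, block until any one job finishes
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n
    done
    fetch_from "$server" &
done
wait    # reap the remaining jobs
count="$(wc -l < "$RESULTS")"
echo "processed ${count} servers"
rm -f "$RESULTS"
```

With a cap like this, a list of hundreds of servers no longer means hundreds of simultaneous scp sessions.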
Let’s recap
1. You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2. Speaking of transitory conditions, you don’t need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3. Bash ‘trap’ is your friend. Use it for cleanup and error handling.
4. When downloading data from any source, assume it’s corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5. Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6. You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7. And finally: Use a Bash lint helper like ShellCheck. You can install it in your favorite editor (like Vim or PyCharm). You will be surprised how many errors go undetected in Bash scripts…
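Point six of the recap can be sketched in a few lines. This example is illustrative only (the server names and the simulated failure are made up): each subshell's PID is saved, and `wait` on that PID recovers its exit code.

```shell
#!/usr/bin/env bash
# Sketch for recap item 6: save each background PID, then 'wait' on it
# to recover the exit code. Requires Bash 4+ for associative arrays.
declare -A pid_for
for server in server1 server2; do
    (
        sleep 0.1
        [ "$server" = "server1" ]   # simulate: only server1 succeeds
    ) &
    pid_for["$server"]=$!
done

failed=0
for server in "${!pid_for[@]}"; do
    if wait "${pid_for[$server]}"; then
        echo "$server: OK"
    else
        echo "$server: FAILED"
        failed=$((failed + 1))
    fi
done
echo "failures: $failed"
```

Without the saved PID, a bare `wait` would still block for the children, but you would lose each task's individual exit status.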
If you enjoyed this content or would like to expand on it, contact the team at enable-sysadmin@redhat.com.