Bash raise error

I want to raise an error in a Bash script with the message "Test cases Failed !!!". How do I do this in Bash? For example:

if [ condition ]; then
    raise error "Test cases failed !!!"
fi

There are a couple more ways to approach this problem, assuming your requirement is to run a shell script or function containing a few shell commands, check whether it ran successfully, and raise an error on failure.

Shell commands generally rely on exit codes to let the shell know whether they succeeded or failed because of some unexpected event.

So what you want to do falls into one of these two categories:

  • exit on error
  • exit and clean-up on error

Depending on which one you want, there are shell features available for each. For the first case, the shell provides the set -e option; for the second, you can set a trap on EXIT.

Should I use exit in my script/function?

Using exit generally enhances readability. In certain routines, once you know the answer, you want to exit to the calling routine immediately. If the routine is defined in such a way that it doesn't require any further cleanup once it detects an error, not exiting immediately means that you have to write more code.

So if you need to perform clean-up actions to make the termination of the script clean, it is preferable not to call exit directly.

Should I use set -e to exit on error?

No!

set -e was an attempt to add "automatic error detection" to the shell. Its goal was to make the shell abort any time an error occurred, but it comes with a lot of potential pitfalls, for example:

  • Commands that are part of an if test, or that sit on the left-hand side of && or ||, are immune. In the example below you might expect the script to abort when test fails on the non-existent directory, but it does not; execution simply carries on with the next line. (Wrapping the same list in a function and calling it changes the outcome: the function call itself then returns non-zero as a plain command, and the script exits.)

    set -e
    test -d nosuchdir && echo no dir
    echo survived
    
  • Commands in a pipeline other than the last one are immune. In the example below, only the exit code of the most recently executed (rightmost) command, cat, is considered, and it was successful. This can be avoided by setting the set -o pipefail option, but it is still a caveat.

    set -e
    somecommand that fails | cat -
    echo survived 
    

Recommended for use: trap on exit

The verdict is that if you want to be able to handle an error instead of blindly exiting, use a trap on the ERR pseudo-signal rather than set -e.

The ERR trap does not run code when the shell itself exits with a non-zero status; it runs when any command run by that shell that is not part of a condition (as in if cmd, or cmd ||) exits with a non-zero status.

The general practice is to define a trap handler that provides additional debug information about which line caused the exit and why. Remember that the exit code of the last command that caused the ERR signal is still available at this point.

cleanup() {
    exitcode=$?    # must come first: status of the command that fired the trap
    printf 'error condition hit\n' 1>&2
    printf 'exit code returned: %s\n' "$exitcode"
    printf 'the command executing at the time of the error was: %s\n' "$BASH_COMMAND"
    printf 'command present on line: %d\n' "${BASH_LINENO[0]}"
    # Some more clean up code can be added here before exiting
    exit $exitcode
}

and we install this handler at the top of the script that is failing:

trap cleanup ERR

Putting this together in a simple script that contains false on line 15, you would get the following information:

error condition hit
exit code returned: 1
the command executing at the time of the error was: false
command present on line: 15
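
For a version that can be run as-is, here is a minimal sketch of the same idea. The names on_error and run_checks are hypothetical, set -E is an addition so that the trap is also active inside functions and sub-shells, and the failing work runs in a sub-shell only so that the demo does not terminate the shell it is pasted into.

```bash
#!/usr/bin/env bash
# Sketch: report failures through an ERR trap (names are illustrative).
set -E   # errtrace: functions and sub-shells inherit the ERR trap

on_error() {
    local exitcode=$?                      # status of the command that failed
    printf 'error: "%s" exited with %d on line %d\n' \
        "$BASH_COMMAND" "$exitcode" "${BASH_LINENO[0]}" >&2
    exit "$exitcode"
}

run_checks() {          # hypothetical stand-in for the real work
    trap on_error ERR
    echo "step 1 ok"
    false               # fails and fires the trap
    echo "not reached"
}

# Run in a sub-shell so the demo does not terminate the calling shell.
report=$( (run_checks) 2>&1 )
status=$?
printf '%s\n' "$report"
echo "run_checks exited with status $status"
```

In a real script the trap would normally be installed once near the top, and the handler would do any clean-up before calling exit.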

trap can also run the cleanup on shell completion (e.g. when your shell script exits), regardless of whether an error occurred, by trapping the EXIT signal. You can also trap multiple signals at the same time. The list of supported signals can be found on the trap(1p) Linux manual page.
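
As a rough illustration of an EXIT trap, the sketch below creates a scratch file and removes it on every exit path. The function name job and the use of mktemp are assumptions made for the example; the function body is written with ( ) so that it runs in its own sub-shell and the trap fires when that sub-shell ends.

```bash
#!/usr/bin/env bash
# Sketch: an EXIT trap runs on every exit path, successful or not.

job() (                               # ( ) body: the function runs in a sub-shell
    scratch=$(mktemp) || exit 1
    trap 'rm -f -- "$scratch"' EXIT   # cleanup is registered once
    printf '%s\n' "$scratch"          # tell the caller which file was used
    exit "${1:-0}"                    # leave with the requested status
)

path=$(job 3)
status=$?
echo "job exited with $status"
[ -e "$path" ] || echo "scratch file was cleaned up"
```

Registering the cleanup once, instead of repeating it before every exit, is what keeps the error paths short.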

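Another thing to keep in mind is how these methods behave when sub-shells are involved, where you might need to add your own error handling.

  • With set -e, a failure inside a sub-shell terminates only the sub-shell. The false below never reaches the parent directly; the parent sees just the sub-shell's exit status. Older versions of Bash, which applied set -e to simple commands only, ignored that status and printed survived, while Bash 4 and later exit at this point. To get the same result everywhere, add your own logic, such as (false) || false

    set -e
    (false)
    echo survived
    
  • Something similar happens with trap. An ERR trap is not inherited by sub-shells unless set -E (errtrace) is enabled, so the handler below does not run inside the sub-shell; at best it runs in the parent after the sub-shell has returned a non-zero status.

    trap 'echo error' ERR
    (false)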
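
Two ways of making a sub-shell failure visible can be sketched as follows. This is only an illustration: traced_demo is a hypothetical name, and its ( ) body is there to keep the option and the trap from leaking into the calling shell.

```bash
#!/usr/bin/env bash
# Sketch: two ways to notice a failure that happened inside a sub-shell.

# 1. Check the sub-shell's exit status explicitly in the parent.
( false ) || echo "sub-shell failed with status $?"

# 2. Enable errtrace so the ERR trap is inherited by sub-shells.
traced_demo() (
    set -E
    trap 'echo "ERR trap fired" >&2' ERR
    ( false )
    exit 0
)
trap_log=$(traced_demo 2>&1)
printf '%s\n' "$trap_log"
```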
    

Answer by Juliet Hodges

When you raise an exception you stop the program's execution. How you report it depends on where you want the error message to be stored.

You can write it to a log file (use >> instead of > to append rather than overwrite):

echo "Error!" > logfile.log
exit 125

Or to standard error:

echo "Error!" 1>&2
exit 64

You can use exit xxx, where xxx is the error code you want to return to the operating system (from 0 to 255). Here 125 and 64 are just arbitrary codes you can exit with. When you need to indicate to the OS that the program stopped abnormally (e.g. an error occurred), you need to pass a non-zero exit code to exit. Note that exit 1 indicates that the program stopped because of an unspecified error; you can customize this if you like.
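
The two steps (print a message, then exit non-zero) are often wrapped in a small helper, which gets close to the raise error syntax in the question. The sketch below assumes a hypothetical function name, die, and a placeholder variable tests_passed standing in for the real condition.

```bash
#!/usr/bin/env bash
# Sketch of a "raise error" helper; die is a hypothetical name.

die() {
    local message=$1
    local code=${2:-1}              # default to the generic failure status 1
    printf 'ERROR: %s\n' "$message" >&2
    exit "$code"
}

tests_passed=false                   # placeholder for the real condition

# Run in a sub-shell here so the demo does not close an interactive shell;
# in a real script you would call die directly.
( "$tests_passed" || die "Test cases failed !!!" 2 )
echo "exit status: $?"
```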

Answer by Chris Murphy

If your test case runner returns a non-zero code for failed tests, you can simply write:

test_handler test_case_x; test_result=$?
if ((test_result != 0)); then
  printf '%s\n' "Test case x failed" >&2  # write error message to stderr
  exit 1                                  # or exit $test_result
fi

Or even shorter:

if ! test_handler test_case_x; then
  printf '%s\n' "Test case x failed" >&2
  exit 1
fi

Or the shortest:

test_handler test_case_x || { printf '%s\n' "Test case x failed" >&2; exit 1; }

To exit with test_handler's exit code:

test_handler test_case_x || { ec=$?; printf '%s\n' "Test case x failed" >&2; exit $ec; }

If you want to take a more comprehensive approach, you can have an error handler:

exit_if_error() {
  local exit_code=$1
  shift
  # Do nothing if no error code was passed or if the error code is 0.
  # Using if (rather than a && chain) makes the function return 0 in that case.
  if [[ $exit_code ]] && ((exit_code != 0)); then
    printf 'ERROR: %s\n' "$@" >&2   # we can use better logging here
    exit "$exit_code"               # we could also check to make sure
                                    # error code is numeric when passed
  fi
}

then invoke it after running your test case:

run_test_case test_case_x
exit_if_error $? "Test case x failed"

or

run_test_case test_case_x || exit_if_error $? "Test case x failed"
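
The numeric check mentioned in the comments could look roughly like the sketch below. It is a variation on the handler, not a definitive version: exiting with status 2 for a non-numeric argument and joining the message words with "$*" are choices made for this example.

```bash
#!/usr/bin/env bash
# Sketch: exit_if_error with a numeric check on its first argument.

exit_if_error() {
    local exit_code=$1
    shift
    if [[ ! $exit_code =~ ^[0-9]+$ ]]; then
        printf 'exit_if_error: "%s" is not a numeric exit code\n' "$exit_code" >&2
        exit 2                               # assumed status for misuse
    fi
    if (( 10#$exit_code != 0 )); then        # 10# avoids octal parsing of e.g. 08
        printf 'ERROR: %s\n' "$*" >&2
        exit "$exit_code"
    fi
}

false
( exit_if_error $? "Test case x failed" )    # sub-shell only for the demo
echo "handler exited with $?"
```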

Answer by Trenton Walsh

In this lesson, we're going to look at handling errors during script execution.

Suppose a script contains two lines: a cd into $some_directory followed
by an rm of the files there. Why is this such a bad way of doing it?
It's not, if nothing goes wrong. The two lines change the working
directory to the name contained in $some_directory and delete the files
in that directory. That's the intended behavior. But what happens if the
directory named in $some_directory doesn't exist? In that case, the cd
command will fail and the script executes the rm command on the
current working directory. Not the intended behavior!

A better version examines the exit status of the cd command and, if
it's not zero, prints an error message on standard error and terminates
the script with an exit status of 1. The script can be simplified
further by using the AND and OR control operators.
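
One way to guard against this is to run rm only when cd has succeeded. The sketch below assumes a hypothetical function name, clean_directory, and uses a temporary directory created with mktemp so that it is safe to try out.

```bash
#!/usr/bin/env bash
# Sketch: never run rm unless cd succeeded.

clean_directory() {
    local some_directory=$1
    # The sub-shell keeps the caller's working directory unchanged.
    (
        cd -- "$some_directory" || exit 1   # stop here if cd fails
        rm -f -- ./*                        # only reached inside the target
    )
}

work=$(mktemp -d)
touch "$work/a.tmp" "$work/b.tmp"
clean_directory "$work" && echo "cleaned $work"
clean_directory "$work/missing" 2>/dev/null || echo "refused: no such directory"
rm -rf -- "$work"
```

The same guard can be written on one line as cd "$some_directory" && rm ./*, at the cost of not reporting why nothing was deleted.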

Answer by Porter Williams

Say that you have a cron job on each one of your Linux systems, and you have a script to collect the hardware information from each:

#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
declare -a servers=(
dmaf5
)

DATADIR="$HOME/Documents/lshw-dump"

/usr/bin/mkdir -p -v "$DATADIR"
for server in ${servers[*]}; do
    echo "Visiting: $server"
    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
done
wait
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done

If everything goes well, then you collect your files in parallel because you don’t have more than ten systems. You can afford to ssh to all of them at the same time and then show the hardware details of each one.

Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 136.9MB/s   00:00    
"DMAF5 (Default string)"
"BESSTAR TECH LIMITED"
{
  "boot": "normal",
  "chassis": "desktop",
  "family": "Default string",
  "sku": "Default string",
  "uuid": "00020003-0004-0005-0006-000700080009"
}

The current version of the script has a problem: it will run from the beginning to the end, errors or not:

./collect_data_from_servers.sh 
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB  48.8MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9

Take a look at version two of the script. It’s slightly better:

#!/bin/bash
# Script to collect the status of lshw output from home servers
# Dependencies:
# * LSHW: http://ezix.org/project/wiki/HardwareLiSter
# * JQ: http://stedolan.github.io/jq/
#
# On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
# 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
# Author: Jose Vicente Nunez
#
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
declare -a servers=(
macmini2
mac-pro-1-1
dmaf5
)

DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
    # Group the message and the exit so they run only when mkdir fails
    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR" >&2; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
    echo "Visiting: $server"
    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
    server_pid[$server]=$! # Save the PID of the scp of a given server for later
done
# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
    wait ${server_pid[$server]}
    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done
for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
done

So how does the error handling look now?

Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 146.1MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory

Answer by Braylon Mills

Key points from this answer:

  • Once called with the -e option, the set command causes the Bash shell to exit immediately if any subsequent command exits with a non-zero status (caused by an error condition). The +e option turns the shell back to the default mode. set -e is equivalent to set -o errexit. Likewise, set +e is shorthand for set +o errexit.
  • The trycatch.sh script further down defines several custom Bash functions to mimic the semantics of try and catch statements. The throw() function is supposed to raise a custom (non-zero) exception. set +e is needed so that the non-zero status returned by throw() does not terminate the script. Inside catch(), the value of the exception raised by throw() is stored in the variable exception_code, so that the exception can be handled in a user-defined fashion.

As the first line of defense, it is always recommended to check the exit status of a command, as a non-zero exit status typically indicates some type of error. For example:

if ! some_command; then
    echo "some_command returned an error"
fi

Another (more compact) way to trigger error handling based on an exit status is to use an OR list:

<command1> || <command2>

With this OR statement, <command2> is executed if and only if <command1> returns a non-zero exit status. So you can replace <command2> with your own error handling routine. For example:

error_exit()
{
    echo "Error: $1"
    exit 1
}

run-some-bad-command || error_exit "Some error occurred"

Bash provides a built-in variable called $?, which tells you the exit status of the last executed command. Note that when a bash function is called, $? reads the exit status of the last command called inside the function. Since some non-zero exit codes have special meanings, you can handle them selectively. For example:

# run some command
status=$?
if [ $status -eq 1 ]; then
    echo "General error"
elif [ $status -eq 2 ]; then
    echo "Misuse of shell builtins"
elif [ $status -eq 126 ]; then
    echo "Command invoked cannot execute"
elif [ $status -eq 128 ]; then
    echo "Invalid argument"
fi
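
The same dispatch reads a little more compactly as a case statement. The sketch below is only an illustration: describe_status is a hypothetical name, and the extra branches rely on the usual Bash conventions that 127 means "command not found" and that statuses above 128 mean termination by a signal.

```bash
#!/usr/bin/env bash
# Sketch: map an exit status to a description with case.

describe_status() {
    local status=$1
    case $status in
        0)   echo "Success" ;;
        1)   echo "General error" ;;
        2)   echo "Misuse of shell builtins" ;;
        126) echo "Command found but not executable" ;;
        127) echo "Command not found" ;;
        *)
            if (( status > 128 )); then
                echo "Terminated by signal $(( status - 128 ))"
            else
                echo "Application-specific error $status"
            fi
            ;;
    esac
}

no_such_command_for_demo 2>/dev/null     # deliberately missing command
describe_status $?
```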

By default, a Bash script carries on after a command fails. This default shell behavior may not be desirable for some Bash scripts. For example, if your script contains a critical code block where no error is allowed, you want your script to exit immediately upon encountering any error inside that code block. To activate this "exit-on-error" behavior in Bash, you can use the set command as follows.

set -e
#
# some critical code block where no error is allowed
#
set +e

However, one special error condition not captured by set -e is when an error occurs somewhere inside a pipeline of commands. This is because a pipeline returns a non-zero status only if the last command in the pipeline fails. Any error produced by previous command(s) in the pipeline is not visible outside the pipeline, and so does not kill a bash script. For example:

set -e
true | false | true   
echo "This will be printed"  # "false" inside the pipeline not detected

If you want any failure in a pipeline to also exit the script, you need to add the -o pipefail option. For example:

set -o pipefail -e
true | false | true          # "false" inside the pipeline detected correctly
echo "This will not be printed"
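
If you need to know which stage of a pipeline failed rather than just that something failed, Bash also records every stage's status in the PIPESTATUS array. The sketch below copies the array immediately, because the next command overwrites it; the variable name stages is arbitrary.

```bash
#!/usr/bin/env bash
# Sketch: inspect every stage of a pipeline with PIPESTATUS.

true | false | true
stages=("${PIPESTATUS[@]}")     # copy at once; the next command overwrites it

for i in "${!stages[@]}"; do
    if (( stages[i] != 0 )); then
        echo "stage $i failed with status ${stages[i]}"
    fi
done
```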

Therefore, to protect a critical code block against any type of command errors or pipeline errors, use the following pair of set commands.

set -o pipefail -e
#
# some critical code block where no error or pipeline error is allowed
#
set +o pipefail +e

To be able to detect and handle different types of errors/exceptions more flexibly, you will need try/catch statements, which however are missing in bash. At least we can mimic the behaviors of try/catch as shown in this trycatch.sh script:

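function try()
{
    # Remember whether errexit (-e) was on: 0 means on, 1 means off.
    [[ $- = *e* ]]; SAVED_OPT_E=$?
    set +e
}

function throw()
{
    exit $1
}

function catch()
{
    export exception_code=$?
    # Restore errexit if it was on before try() disabled it.
    (( SAVED_OPT_E == 0 )) && set -e
    return $exception_code
}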

Perhaps an example bash script will make it clear how trycatch.sh works. See the example below that utilizes trycatch.sh.

# Include trycatch.sh as a library
source ./trycatch.sh

# Define custom exception types
export ERR_BAD=100
export ERR_WORSE=101
export ERR_CRITICAL=102

try
(
    echo "Start of the try block"

    # When a command returns a non-zero, a custom exception is raised.
    run-command || throw $ERR_BAD
    run-command2 || throw $ERR_WORSE
    run-command3 || throw $ERR_CRITICAL

    # This statement is not reached if there is any exception raised
    # inside the try block.
    echo "End of the try block"
)
catch || {
    case $exception_code in
        $ERR_BAD)
            echo "This error is bad"
        ;;
        $ERR_WORSE)
            echo "This error is worse"
        ;;
        $ERR_CRITICAL)
            echo "This error is critical"
        ;;
        *)
            echo "Unknown error: $exception_code"
            throw $exception_code    # re-throw an unhandled exception
        ;;
    esac
}
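
To try the pattern without a separate library file or real commands, the following self-contained sketch defines compact versions of the three helpers inline and uses false as a stand-in for a failing command. The variable handled is an addition for the demo.

```bash
#!/usr/bin/env bash
# Sketch: try/throw/catch exercised with a command that always fails.

try()   { [[ $- = *e* ]]; SAVED_OPT_E=$?; set +e; }
throw() { exit "$1"; }
catch() {
    exception_code=$?
    (( SAVED_OPT_E == 0 )) && set -e     # restore errexit if it was on before try
    return "$exception_code"
}

ERR_BAD=100

try
(
    echo "start of the try block"
    false || throw "$ERR_BAD"            # false stands in for a failing command
    echo "not reached"
)
catch || {
    case $exception_code in
        "$ERR_BAD") handled="bad" ;;
        *)          handled="unknown" ;;
    esac
    echo "caught exception $exception_code ($handled)"
}
```

The try block has to be a ( ) sub-shell: throw calls exit, and without the sub-shell it would end the whole script instead of just the block.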

Answer by Kieran Wise

Use set -e to set exit-on-error mode: if a simple command returns a nonzero status (indicating failure), the shell exits. Since cd returns a non-zero status on failure, a failed cd is enough to stop the script.

Beware that set -e doesn't always kick in. Commands in test positions are allowed to fail (e.g. if failing_command, failing_command || fallback). Commands in a subshell only lead directly to exiting the subshell, not the parent: set -e; (false); echo foo displays foo in some older shells, while Bash 4 and later exit because the subshell as a whole returned a non-zero status.

Your script changes directories as it runs, which means it won’t work
with a series of relative pathnames. You then commented later that you
only wanted to check for directory existence, not the ability to use
cd, so answers don’t need to use cd at all. Revised. Using tput
and colours from man terminfo:

#!/bin/bash -u
# OUTPUT-COLORING
red=$( tput setaf 1 )
green=$( tput setaf 2 )
NC=$( tput sgr0 )         # reset all attributes (setaf 0 would select black)

# FUNCTIONS
# directoryExists - Does the directory exist?
function directoryExists {
    # was: do the cd in a sub-shell so it doesn't change our own PWD
    # was: if errmsg=$( cd -- "$1" 2>&1 ) ; then
    if [ -d "$1" ] ; then
        # was: echo "${green}$1${NC}"
        printf "%s\n" "${green}$1${NC}"
    else
        # was: echo "${red}$1${NC}"
        printf "%s\n" "${red}$1${NC}"
        # was: optional: printf "%s\n" "${red}$1 -- $errmsg${NC}"
    fi
}

Answer by Hope Gomez

Writing robust Bash scripts is tricky, but not impossible. Start your scripts with set -euo pipefail; shopt -s inherit_errexit nullglob compat"${BASH_COMPAT=42}" and use ShellCheck, and you'll be 90% of the way there!

By default, if a glob has no match, the glob expression isn't replaced; it's just passed to the command as-is, typically producing an error or unexpected behaviour. There are two more-reasonable alternative behaviours, and I strongly suggest you set one of them: shopt -s nullglob will replace the glob expression with the empty string if it doesn't match, and shopt -s failglob will raise an error.
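
The difference is easy to see by expanding a pattern in an empty directory. The sketch below uses a directory from mktemp and arbitrary array names, and it switches nullglob back off at the end so the demo leaves the shell as it found it.

```bash
#!/usr/bin/env bash
# Sketch: how an unmatched glob behaves with and without nullglob.

empty_dir=$(mktemp -d)

shopt -u nullglob
default_matches=("$empty_dir"/*.txt)     # no match: the pattern itself is kept
echo "default:  ${#default_matches[@]} word(s)"

shopt -s nullglob
nullglob_matches=("$empty_dir"/*.txt)    # no match: expands to nothing
echo "nullglob: ${#nullglob_matches[@]} word(s)"

shopt -u nullglob
rmdir -- "$empty_dir"
```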

Answer by Fisher Strickland

Sample output:

Traceback (most recent call last):
  File "/tmp/a.sh", line 38, in main
  File "/tmp/a.sh", line 23, in func1
  File "/tmp/a.sh", line 27, in func2
  File "/tmp/a.sh", line 31, in func3
  File "/tmp/a.sh", line 35, in func4
Exception: SomeError
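
Output in this style can be produced from Bash's own call-stack arrays. The following is a minimal sketch, assuming a hypothetical helper named raise; it walks FUNCNAME, BASH_SOURCE and BASH_LINENO from the outermost caller inwards and then exits with status 1. The file name and line numbers it prints depend on how the script is run.

```bash
#!/usr/bin/env bash
# Sketch: print a call stack from FUNCNAME, BASH_SOURCE and BASH_LINENO.

raise() {                                 # hypothetical helper name
    local message=$1 i
    echo "Traceback (most recent call last):" >&2
    # Walk from the outermost caller down to the function that called raise.
    for (( i = ${#FUNCNAME[@]} - 1; i >= 1; i-- )); do
        printf '  File "%s", line %d, in %s\n' \
            "${BASH_SOURCE[i]}" "${BASH_LINENO[i-1]}" "${FUNCNAME[i]}" >&2
    done
    echo "Exception: $message" >&2
    exit 1
}

func2() { raise "SomeError"; }
func1() { func2; }

trace=$( (func1) 2>&1 )                   # sub-shell keeps the demo from exiting
printf '%s\n' "$trace"
```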

The advantages of having an error handler like the exit_if_error function shown earlier are:

  • we can standardize all the error handling logic such as logging, printing a stack trace, notification, doing cleanup etc., in one place
  • by making the error handler get the error code as an argument, we can spare the caller from the clutter of if blocks that test exit codes for errors
  • if we have a signal handler (using trap), we can invoke the error handler from there

Error handling and logging library

Here is a complete implementation of error handling and logging:

https://github.com/codeforester/base/blob/master/lib/stdlib.sh


Related posts

  • Error handling in Bash
  • The ‘caller’ builtin command on Bash Hackers Wiki
  • Are there any standard exit status codes in Linux?
  • BashFAQ/105 — Why doesn’t set -e (or set -o errexit, or trap ERR) do what I expected?
  • Equivalent of __FILE__, __LINE__ in Bash
  • Is there a TRY CATCH command in Bash
  • To add a stack trace to the error handler, you may want to look at this post: Trace of executed programs called by a Bash script
  • Ignoring specific errors in a shell script
  • Catching error codes in a shell pipe
  • How do I manage log verbosity inside a shell script?
  • How to log function name and line number in Bash?
  • Is double square brackets [[ ]] preferable over single square brackets [ ] in Bash?

In this article, I present a few tricks to handle error conditions. Some strictly don't fall under the category of error handling (a reactive way to handle the unexpected) but are techniques to avoid errors before they happen.

Case study: Simple script that downloads a hardware report from multiple hosts and inserts it into a database.

Here are some possibilities of why things went wrong:

  • Your report didn’t run because the server was down
  • You couldn’t create the directory where the files need to be saved
  • The tools you need to run the script are missing
  • You can’t collect the report because your remote machine crashed
  • One or more of the reports is corrupt

The current version of the script has a problem: it will run from the beginning to the end, errors or not:

./collect_data_from_servers.sh 
Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB  48.8MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
scp: /var/log/lshw-dump.json: No such file or directory
parse error: Expected separator between values at line 3, column 9

Next, I demonstrate a few things to make your script more robust and sometimes recover from failure.

The nuclear option: Failing hard, failing fast

The proper way to handle errors is to check whether the program finished successfully, using return codes. It sounds obvious, but return codes, integers that bash stores in the $? variable, sometimes have a broader meaning. (Don't confuse $? with $!, which holds the PID of the most recent background job.) The bash man page tells you:

For the shell’s purposes, a command which exits with a zero exit
status has succeeded. An exit status of zero indicates success.
A non-zero exit status indicates failure. When a command
terminates on a fatal signal N, bash uses the value of 128+N as
the exit status.

As usual, you should always read the man page of the scripts you’re calling, to see what the conventions are for each of them. If you’ve programmed with a language like Java or Python, then you’re most likely familiar with their exceptions, different meanings, and how not all of them are handled the same way.

If you add set -o errexit to your script, from that point forward it will abort execution if any command exits with a code != 0. But errexit is ignored while executing functions inside an if condition, so instead of remembering that exception, I'd rather do explicit error handling.
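
The exception mentioned above is easy to see in a small experiment. The sketch below is only an illustration: the function name check and the variable names are made up, and the snippet is run in a child bash process so that errexit cannot abort the shell you try it from.

```bash
#!/usr/bin/env bash

# The same small script is executed twice in a child bash process:
# once calling check inside an if condition, once calling it directly.
demo_script='
set -o errexit
check() {
    false
    echo "after false"
}
if [ "$1" = "inside-if" ]; then
    if check; then echo "if-branch taken"; fi
else
    check
    echo "not reached"
fi
'

# Inside the if condition errexit is ignored, so check runs to the end.
out_if=$(bash -c "$demo_script" demo inside-if)

# Called directly, the failing command aborts the child script.
plain_status=0
out_plain=$(bash -c "$demo_script" demo plain) || plain_status=$?

printf 'inside if: %s\n' "$out_if"
printf 'plain call: output="%s" status=%s\n' "$out_plain" "$plain_status"
```

The first run prints both lines even though false failed; the second run prints nothing and returns a non-zero status.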

Take a look at version two of the script. It’s slightly better:

1 #!/bin/bash
2 # Script to collect the status of lshw output from home servers
3 # Dependencies:
4 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
5 # * JQ: http://stedolan.github.io/jq/
6 #
7 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/        ) 
8 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
9 # Author: Jose Vicente Nunez
10 #
11 set -o errtrace # Enable the err trap, code will get called when an error is detected
12 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
13 declare -a servers=(
14 macmini2
15 mac-pro-1-1
16 dmaf5
17 )
18  
19 DATADIR="$HOME/Documents/lshw-dump"
20 if [ ! -d "$DATADIR" ]; then 
21    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
22 fi 
23 declare -A server_pid
24 for server in ${servers[*]}; do
25    echo "Visiting: $server"
26    /usr/bin/scp -o logLevel=Error ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json &
27   server_pid[$server]=$! # Save the PID of the scp  of a given server for later
28 done
29 # Iterate through all the servers and:
30 # Wait for the return code of each
31 # Check the exit code from each scp
32 for server in ${!server_pid[*]}; do
33    wait ${server_pid[$server]}
34    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
35 done
36 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
37    /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
38 done

Here’s what changed:

  • Lines 11 and 12: I enable error tracing and add a ‘trap’ to tell the user there was an error and that there is turbulence ahead. You may want to kill your script here instead; I’ll show you why that may not be the best choice.
  • Line 20: if the directory doesn’t exist, then try to create it on line 21. If directory creation fails, then exit with an error.
  • On line 27, after starting each background job, I capture the PID and associate that with the machine (1:1 relationship).
  • On lines 33-35, I wait for the scp task to finish, get the return code, and if it’s an error, abort.
  • On line 37, jq parses each file. If a file can’t be parsed, the ERR trap reports the failure, but the script doesn’t exit.
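
The PID bookkeeping from lines 27 and 33-35 can be tried in isolation. In this sketch the two subshells are stand-ins for the scp commands, and the job names good and bad are invented for the example; associative arrays need bash 4 or later.

```bash
#!/usr/bin/env bash

declare -A job_pid      # job name -> PID of its background process
declare -A job_status   # job name -> exit code reported by wait

# Two stand-in background jobs: one succeeds, one exits with code 3.
( exit 0 ) &
job_pid[good]=$!
( exit 3 ) &
job_pid[bad]=$!

# wait PID returns the exit code of that specific background job.
for name in "${!job_pid[@]}"; do
    status=0
    wait "${job_pid[$name]}" || status=$?
    job_status[$name]=$status
    echo "job=$name pid=${job_pid[$name]} exit=$status"
done
```

Collecting every status before deciding what to do is a design choice; the script above instead exits on the first failed copy.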

So how does the error handling look now?

Visiting: macmini2
Visiting: mac-pro-1-1
Visiting: dmaf5
lshw-dump.json                                                                                         100%   54KB 146.1MB/s   00:00    
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
scp: /var/log/lshw-dump.json: No such file or directory

As you can see, this version is better at detecting errors but it’s very unforgiving. Also, it doesn’t detect all the errors, does it?

When you get stuck and you wish you had an alarm

The code looks better, except that sometimes the scp could get stuck on a server (while trying to copy a file) because the server is too busy to respond or just in a bad state.

Another example is to try to access a directory through NFS where $HOME is mounted from an NFS server:

/usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

And you discover hours later that the NFS mount point is stale and your script is stuck.

A timeout is the solution. And, GNU timeout comes to the rescue:

/usr/bin/timeout --kill-after 20.0s 10.0s /usr/bin/find $HOME -type f -name '*.csv' -print -fprint /tmp/report.txt

Here you try to kill the process nicely (TERM signal) 10.0 seconds after it has started. If it’s still running 20.0 seconds after the TERM signal was sent, then timeout sends a KILL signal (kill -9). If in doubt, check which signals are supported on your system (kill -l, for example).

If this isn’t clear from my description, then look at this run for more clarity. Here the TERM signal is sent after 20 seconds and sleep stops right away:

/usr/bin/time /usr/bin/timeout --kill-after=10.0s 20.0s /usr/bin/sleep 60s
real    0m20.003s
user    0m0.000s
sys     0m0.003s

Back to the original script to add a few more options and you have version three:

 1 #!/bin/bash
  2 # Script to collect the status of lshw output from home servers
  3 # Dependencies:
  4 # * Open SSH: http://www.openssh.com/portable.html
  5 # * LSHW: http://ezix.org/project/wiki/HardwareLiSter
  6 # * JQ: http://stedolan.github.io/jq/
  7 # * timeout: https://www.gnu.org/software/coreutils/
  8 #
  9 # On each machine you can run something like this from cron (Don't know CRON, no worries: https://crontab-generator.org/)
 10 # 0 0 * * * /usr/sbin/lshw -json -quiet > /var/log/lshw-dump.json
 11 # Author: Jose Vicente Nunez
 12 #
 13 set -o errtrace # Enable the err trap, code will get called when an error is detected
 14 trap "echo ERROR: There was an error in ${FUNCNAME-main context}, details to follow" ERR
 15 
 16 declare -a dependencies=(/usr/bin/timeout /usr/bin/ssh /usr/bin/jq)
 17 for dependency in ${dependencies[@]}; do
 18     if [ ! -x $dependency ]; then
 19         echo "ERROR: Missing $dependency"
 20         exit 100
 21     fi
 22 done
 23 
 24 declare -a servers=(
 25 macmini2
 26 mac-pro-1-1
 27 dmaf5
 28 )
 29 
 30 function remote_copy {
 31     local server=$1
 32     echo "Visiting: $server"
 33     /usr/bin/timeout --kill-after 25.0s 20.0s \
 34         /usr/bin/scp \
 35             -o BatchMode=yes \
 36             -o logLevel=Error \
 37             -o ConnectTimeout=5 \
 38             -o ConnectionAttempts=3 \
 39             ${server}:/var/log/lshw-dump.json ${DATADIR}/lshw-$server-dump.json
 40     return $?
 41 }
 42 
 43 DATADIR="$HOME/Documents/lshw-dump"
 44 if [ ! -d "$DATADIR" ]; then
 45     /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
 46 fi
 47 declare -A server_pid
 48 for server in ${servers[*]}; do
 49     remote_copy $server &
 50     server_pid[$server]=$! # Save the PID of the scp  of a given server for later
 51 done
 52 # Iterate through all the servers and:
 53 # Wait for the return code of each
 54 # Check the exit code from each scp
 55 for server in ${!server_pid[*]}; do
 56     wait ${server_pid[$server]}
 57     test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
 58 done
 59 for lshw in $(/usr/bin/find $DATADIR -type f -name 'lshw-*-dump.json'); do
 60     /usr/bin/jq '.["product","vendor", "configuration"]' $lshw
 61 done

What are the changes?:

  • Between lines 16-22, check if all the required dependency tools are present. If it cannot execute, then ‘Houston we have a problem.’
  • Created a remote_copy function, which uses a timeout to make sure the scp finishes no later than 45.0s—line 33.
  • Added a connection timeout of 5 seconds instead of the TCP default—line 37.
  • Added a retry to scp on line 38—3 attempts that wait 1 second between each.
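
The dependency check on lines 16-22 relies on fixed /usr/bin paths. If the tools may live elsewhere, a variation is to look them up on PATH with command -v. The following is a sketch under that assumption; missing_dependencies is a hypothetical helper, and no-such-tool-for-demo is a deliberately bogus name used to show the failure case.

```bash
#!/usr/bin/env bash

# missing_dependencies: hypothetical helper that prints every tool
# from its argument list that cannot be found on PATH.
missing_dependencies() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# sh should exist everywhere; the second name is intentionally bogus.
missing=$(missing_dependencies sh no-such-tool-for-demo)
if [ -n "$missing" ]; then
    echo "ERROR: Missing dependencies: $missing" >&2
fi
```

A real script would exit with an error at this point instead of only printing the list.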

There are other ways to retry when there’s an error.

Waiting for the end of the world: how and when to retry

You noticed there’s an added retry to the scp command. But that retries only for failed connections, what if the command fails during the middle of the copy?

Sometimes you want to just fail because there’s very little chance to recover from an issue: a system that requires hardware fixes, for example. Or you can fall back to a degraded mode, meaning that you’re able to continue your system work without the updated data. In those cases, it makes no sense to wait forever, but only for a specific amount of time.

Here are the changes to the remote_copy, to keep this brief (version four):

#!/bin/bash
# Omitted code for clarity...
declare REMOTE_FILE="/var/log/lshw-dump.json"
declare MAX_RETRIES=3

# Blah blah blah...

function remote_copy {
    local server=$1
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        echo "INFO: Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point on waiting...
        fi
        ((now=now+1))
    done
    return $status
}

DATADIR="$HOME/Documents/lshw-dump"
if [ ! -d "$DATADIR" ]; then
    /usr/bin/mkdir -p -v "$DATADIR" || { echo "FATAL: Failed to create $DATADIR"; exit 100; }
fi
declare -A server_pid
for server in ${servers[*]}; do
    remote_copy $server $MAX_RETRIES &
    server_pid[$server]=$! # Save the PID of the scp  of a given server for later
done

# Iterate through all the servers and:
# Wait for the return code of each
# Check the exit code from each scp
for server in ${!server_pid[*]}; do
    wait ${server_pid[$server]}
    test $? -ne 0 && echo "ERROR: Copy from $server had problems, will not continue" && exit 100
done

# Blah blah blah, process the files you just copied...

How does it look now? In this run, I have one system down (mac-pro-1-1) and one system without the file (macmini2). You can see that the copy from server dmaf5 works right away, but for the other two, there’s a retry for a random time between 1 and 60 seconds before exiting:

INFO: Trying to copy file from: macmini2, attempt=1
INFO: Trying to copy file from: mac-pro-1-1, attempt=1
INFO: Trying to copy file from: dmaf5, attempt=1
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '60 seconds' before re-trying...
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '32 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=2
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '18 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=2
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '3 seconds' before re-trying...
INFO: Trying to copy file from: macmini2, attempt=3
scp: /var/log/lshw-dump.json: No such file or directory
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '6 seconds' before re-trying...
INFO: Trying to copy file from: mac-pro-1-1, attempt=3
ssh: connect to host mac-pro-1-1 port 22: No route to host
ERROR: There was an error in main context, details to follow
WARNING: Copy failed for mac-pro-1-1:/var/log/lshw-dump.json. Waiting '47 seconds' before re-trying...
ERROR: There was an error in main context, details to follow
ERROR: Copy from mac-pro-1-1 had problems, will not continue
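
The retry loop does not have to live inside remote_copy. As a sketch, the same idea can be written as a generic helper that wraps any command. The names retry and flaky are hypothetical, and unlike the function above this version uses a fixed delay and does not sleep after the last failed attempt.

```bash
#!/usr/bin/env bash

# retry: hypothetical helper. Runs a command up to MAX_ATTEMPTS times,
# sleeping DELAY seconds between attempts (but not after the last one).
# Usage: retry MAX_ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    local status=0
    while true; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "ERROR: '$*' failed after $attempt attempts" >&2
            return "$status"
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Stand-in for a flaky operation: fails twice, then succeeds.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
    local calls
    calls=$(( $(cat "$counter_file") + 1 ))
    echo "$calls" > "$counter_file"
    [ "$calls" -ge 3 ]
}

retry_status=0
retry 5 0 flaky || retry_status=$?
calls_made=$(cat "$counter_file")
rm -f "$counter_file"
echo "status=$retry_status calls=$calls_made"
```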

If I fail, do I have to do this all over again? Using a checkpoint

Suppose that the remote copy is the most expensive operation of this whole script and that you’re willing or able to re-run this script, maybe using cron or doing so by hand two times during the day to ensure you pick up the files if one or more systems are down.

You could, for the day, create a small ‘status cache’, where you record only the successful processing operations per machine. If a system is in there, then don’t bother to check again for that day.

Some programs, like Ansible, do something similar and allow you to retry a playbook on a limited number of machines after a failure (--limit @/home/user/site.retry).

A new version (version five) of the script has code to record the status of the copy (lines 15-33):

15 declare SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
16 declare YYYYMMDD=$(/usr/bin/date +%Y%m%d)|| exit 100
17 declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"
18 # Logic to clean up the cache dir on daily basis is not shown here
19 if [ ! -d "$CACHE_DIR" ]; then
20   /usr/bin/mkdir -p -v "$CACHE_DIR"|| exit 100
21 fi
22 trap "/bin/rm -rf $CACHE_DIR" INT TERM # KILL cannot be trapped
23
24 function check_previous_run {
25  local machine=$1
26  test -f $CACHE_DIR/$machine && return 0|| return 1
27 }
28
29 function mark_previous_run {
30    machine=$1
31    /usr/bin/touch $CACHE_DIR/$machine
32    return $?
33 }

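Did you notice the trap on line 22? If the script is interrupted, I want to make sure the whole cache is invalidated. Keep in mind that SIGKILL can never be trapped, so only catchable signals such as INT or TERM can trigger this cleanup.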
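
The same trap mechanism can also run cleanup on every exit, not only on signals, by trapping the EXIT pseudo-signal. The sketch below uses a subshell as a stand-in for the script and a scratch directory made with mktemp; the directory and file names are invented for the example.

```bash
#!/usr/bin/env bash

scratch_parent=$(mktemp -d)

sub_status=0
(
    workdir="$scratch_parent/work"
    mkdir "$workdir"
    # Runs when this (sub)shell exits, whatever the exit code is.
    trap 'rm -rf "$workdir"' EXIT
    touch "$workdir/partial-result"
    exit 100    # simulate a failure halfway through
) || sub_status=$?

if [ -d "$scratch_parent/work" ]; then
    leftover=yes
else
    leftover=no
fi
echo "exit code: $sub_status, leftover directory: $leftover"
rmdir "$scratch_parent"
```

Note the single quotes around the trap command: they delay the expansion of $workdir until the trap actually runs.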

And then, add this new helper logic into the remote_copy function (lines 52-81):

52 function remote_copy {
53    local server=$1
54    check_previous_run $server
55    test $? -eq 0 && echo "INFO: $1 ran successfully before. Not doing again" && return 0
56    local retries=$2
57    local now=1
58    status=0
59    while [ $now -le $retries ]; do
60        echo "INFO: Trying to copy file from: $server, attempt=$now"
61        /usr/bin/timeout --kill-after 25.0s 20.0s \
62            /usr/bin/scp \
63                -o BatchMode=yes \
64                -o logLevel=Error \
65                -o ConnectTimeout=5 \
66                -o ConnectionAttempts=3 \
67                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json
68        status=$?
69        if [ $status -ne 0 ]; then
70            sleep_time=$(((RANDOM % 60)+ 1))
71            echo "WARNING: Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..."
72            /usr/bin/sleep ${sleep_time}s
73        else
74            break # All good, no point on waiting...
75        fi
76        ((now=now+1))
77    done
78    test $status -eq 0 && mark_previous_run $server
79    test $? -ne 0 && status=1
80    return $status
81 }

The first time it runs, a new message for the cache directory is printed out:

./collect_data_from_servers.v5.sh
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh'
/usr/bin/mkdir: created directory '/tmp/collect_data_from_servers.v5.sh/20210612'
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow

If you run it again, then the script knows that dmaf5 is good to go, no need to retry the copy:

./collect_data_from_servers.v5.sh
INFO: dmaf5 ran successfully before. Not doing again
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: macmini2, attempt=1
ERROR: There was an error in main context, details to follow
INFO: Trying to copy file from: mac-pro-1-1, attempt=1

Imagine how this speeds up when you have more machines that should not be revisited.
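
Stripped of the scp details, the checkpoint idea is just one marker file per machine. Here is a minimal sketch of that pattern; already_done, mark_done and process_once are hypothetical names, and the cache lives in a throwaway mktemp directory rather than under /tmp/$SCRIPT_NAME.

```bash
#!/usr/bin/env bash

cache_dir=$(mktemp -d)

# A marker file named after the machine records a successful run.
already_done() { [ -f "$cache_dir/$1" ]; }
mark_done()    { touch "$cache_dir/$1"; }

process_once() {
    local machine=$1
    if already_done "$machine"; then
        echo "skip $machine"
        return 0
    fi
    echo "process $machine"     # the expensive work would go here
    mark_done "$machine"
}

first=$(process_once dmaf5)
second=$(process_once dmaf5)
echo "first run: $first"
echo "second run: $second"
rm -rf "$cache_dir"
```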

Leaving crumbs behind: What to log, how to log, and verbose output

If you’re like me, you like a bit of context to correlate with when something goes wrong. The echo statements in the script are nice, but what if you could add a timestamp to them?

If you use logger, you can save the output in the systemd journal for later review with journalctl (and even aggregate it with other tools out there). The best part is that you can use the power of journalctl right away.

So instead of just doing echo, you can also add a call to logger like this using a new bash function called ‘message’:

SCRIPT_NAME=$(/usr/bin/basename $BASH_SOURCE)|| exit 100
FULL_PATH=$(/usr/bin/realpath ${BASH_SOURCE[0]})|| exit 100
set -o errtrace # Enable the err trap, code will get called when an error is detected
trap "echo ERROR: There was an error in ${FUNCNAME[0]-main context}, details to follow" ERR
declare CACHE_DIR="/tmp/$SCRIPT_NAME/$YYYYMMDD"

function message {
    message="$1"
    func_name="${2-unknown}"
    priority=6
    if [ -z "$2" ]; then
        echo "INFO:" $message
    else
        echo "ERROR:" $message
        priority=0
    fi
    /usr/bin/logger --journald<<EOF
MESSAGE_ID=$SCRIPT_NAME
MESSAGE=$message
PRIORITY=$priority
CODE_FILE=$FULL_PATH
CODE_FUNC=$func_name
EOF
}

You can see that you can store separate fields as part of the message, like the priority, the script that produced the message, etc.
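
On systems without journald, a plain timestamped line on stderr already gives some of that context. This is a fallback sketch, not part of the script above; the function name log and the timestamp format are my own choices.

```bash
#!/usr/bin/env bash

# log: hypothetical helper that prefixes a message with a timestamp
# and a level, and writes it to stderr.
# Usage: log LEVEL MESSAGE...
log() {
    local level=$1
    shift
    printf '%s [%s] %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$level" "$*" >&2
}

# Capture one log line to inspect its shape.
line=$(log ERROR "Copy failed for macmini2" 2>&1)
echo "$line"
```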

So how is this useful? Well, you could get the messages between 1:26 PM and 1:27 PM, only errors (priority=0) and only for our script (collect_data_from_servers.v6.sh) like this, output in JSON format:

journalctl --since 13:26 --until 13:27 --output json-pretty PRIORITY=0 MESSAGE_ID=collect_data_from_servers.v6.sh
{
        "_BOOT_ID" : "dfcda9a1a1cd406ebd88a339bec96fb6",
        "_AUDIT_LOGINUID" : "1000",
        "SYSLOG_IDENTIFIER" : "logger",
        "PRIORITY" : "0",
        "_TRANSPORT" : "journal",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "__REALTIME_TIMESTAMP" : "1623518797641880",
        "_AUDIT_SESSION" : "3",
        "_GID" : "1000",
        "MESSAGE_ID" : "collect_data_from_servers.v6.sh",
        "MESSAGE" : "Copy failed for macmini2:/var/log/lshw-dump.json. Waiting '45 seconds' before re-trying...",
        "_CAP_EFFECTIVE" : "0",
        "CODE_FUNC" : "remote_copy",
        "_MACHINE_ID" : "60d7a3f69b674aaebb600c0e82e01d05",
        "_COMM" : "logger",
        "CODE_FILE" : "/home/josevnz/BashError/collect_data_from_servers.v6.sh",
        "_PID" : "41832",
        "__MONOTONIC_TIMESTAMP" : "25928272252",
        "_HOSTNAME" : "dmaf5",
        "_SOURCE_REALTIME_TIMESTAMP" : "1623518797641843",
        "__CURSOR" : "s=97bb6295795a4560ad6fdedd8143df97;i=1f826;b=dfcda9a1a1cd406ebd88a339bec96fb6;m=60972097c;t=5c494ed383898;x=921c71966b8943e3",
        "_UID" : "1000"
}

Because this is structured data, other log collectors can go through all your machines and aggregate your script logs, so you have not only data but also information.

You can take a look at the whole version six of the script.

Don’t be so eager to replace your data until you’ve checked it.

If you noticed from the very beginning, I’ve been copying a corrupted JSON file over and over:

Parse error: Expected separator between values at line 4, column 11
ERROR parsing '/home/josevnz/Documents/lshw-dump/lshw-dmaf5-dump.json'

That’s easy to prevent. Copy the file into a temporary location and, if the file is corrupted, don’t attempt to replace the previous version; leave the bad one for inspection (lines 99-107 of version seven of the script):

function remote_copy {
    local server=$1
    check_previous_run $server
    test $? -eq 0 && message "$1 ran successfully before. Not doing again" && return 0
    local retries=$2
    local now=1
    status=0
    while [ $now -le $retries ]; do
        message "Trying to copy file from: $server, attempt=$now"
        /usr/bin/timeout --kill-after 25.0s 20.0s \
            /usr/bin/scp \
                -o BatchMode=yes \
                -o logLevel=Error \
                -o ConnectTimeout=5 \
                -o ConnectionAttempts=3 \
                ${server}:$REMOTE_FILE ${DATADIR}/lshw-$server-dump.json.$$
        status=$?
        if [ $status -ne 0 ]; then
            sleep_time=$(((RANDOM % 60)+ 1))
            message "Copy failed for $server:$REMOTE_FILE. Waiting '${sleep_time} seconds' before re-trying..." ${FUNCNAME[0]}
            /usr/bin/sleep ${sleep_time}s
        else
            break # All good, no point on waiting...
        fi
        ((now=now+1))
    done
    if [ $status -eq 0 ]; then
        /usr/bin/jq '.' ${DATADIR}/lshw-$server-dump.json.$$ > /dev/null 2>&1
        status=$?
        if [ $status -eq 0 ]; then
            /usr/bin/mv -v -f ${DATADIR}/lshw-$server-dump.json.$$ ${DATADIR}/lshw-$server-dump.json && mark_previous_run $server
            test $? -ne 0 && status=1
        else
            message "${DATADIR}/lshw-$server-dump.json.$$ Is corrupted. Leaving for inspection..." ${FUNCNAME[0]}
        fi
    fi
    return $status
}
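
The validate-then-replace step can be isolated into a helper that takes the validator as a parameter. This is a sketch under my own assumptions: install_if_valid is a hypothetical name, and grep -q '^{' is only a crude stand-in for the jq check so the example runs where jq is not installed.

```bash
#!/usr/bin/env bash

# install_if_valid: hypothetical helper. Moves CANDIDATE over TARGET only
# if the validator command succeeds when given CANDIDATE as last argument.
# Usage: install_if_valid CANDIDATE TARGET VALIDATOR [ARGS...]
install_if_valid() {
    local candidate=$1
    local target=$2
    shift 2
    if "$@" "$candidate" >/dev/null 2>&1; then
        mv -f "$candidate" "$target"
    else
        echo "WARNING: $candidate failed validation, keeping $target" >&2
        return 1
    fi
}

workdir=$(mktemp -d)
echo '{"ok": true}' > "$workdir/report.json"

# A corrupted download must not replace the good file.
echo 'garbage' > "$workdir/report.json.new"
bad_status=0
install_if_valid "$workdir/report.json.new" "$workdir/report.json" grep -q '^{' || bad_status=$?
kept=$(cat "$workdir/report.json")

# A download that passes the check replaces it.
echo '{"ok": false}' > "$workdir/report.json.new"
install_if_valid "$workdir/report.json.new" "$workdir/report.json" grep -q '^{'
replaced=$(cat "$workdir/report.json")

echo "after bad download: $kept"
echo "after good download: $replaced"
rm -rf "$workdir"
```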

Choose the right tools for the task and prep your code from the first line

One very important aspect of error handling is proper coding. If you have bad logic in your code, no amount of error handling will make it better. To keep this short and bash-related, I’ll give you below a few hints.

You should ALWAYS check for syntax errors before running your script:

bash -n my_bash_script.sh

Seriously. It should be as automatic as performing any other test.
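
To make the check automatic, you can act on the exit status of bash -n. The sketch below writes two tiny throwaway scripts with mktemp, one valid and one with a missing fi, purely to show the difference; the file contents are invented for the example.

```bash
#!/usr/bin/env bash

good_script=$(mktemp)
bad_script=$(mktemp)
printf 'if true; then\n  echo ok\nfi\n' > "$good_script"
printf 'if true; then\n  echo missing fi\n' > "$bad_script"

# bash -n parses the file without executing any command in it.
good_status=0
bash -n "$good_script" || good_status=$?

bad_status=0
bash -n "$bad_script" 2>/dev/null || bad_status=$?

echo "valid script: $good_status, broken script: $bad_status"
rm -f "$good_script" "$bad_script"
```

A zero status means the file parsed cleanly; it says nothing about whether the logic is right.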

Read the bash man page and get familiar with must-know options, like:

set -xv
my_complicated_instruction1
my_complicated_instruction2
my_complicated_instruction3
set +xv

Use ShellCheck to check your bash scripts

It’s very easy to miss simple issues when your scripts start to grow large. ShellCheck is one of those tools that saves you from making mistakes.

shellcheck collect_data_from_servers.v7.sh

In collect_data_from_servers.v7.sh line 15:
for dependency in ${dependencies[@]}; do
                  ^----------------^ SC2068: Double quote array expansions to avoid re-splitting elements.


In collect_data_from_servers.v7.sh line 16:
    if [ ! -x $dependency ]; then
              ^---------^ SC2086: Double quote to prevent globbing and word splitting.

Did you mean: 
    if [ ! -x "$dependency" ]; then
...

If you’re wondering, the final version of the script, after passing ShellCheck is here. Squeaky clean.

You noticed something with the background scp processes

You probably noticed that if you kill the script, it leaves some forked processes behind. That isn’t good and this is one of the reasons I prefer to use tools like Ansible or Parallel to handle this type of task on multiple hosts, letting the frameworks do the proper cleanup for me. You can, of course, add more code to handle this situation.

This bash script could potentially create a fork bomb. It has no control of how many processes to spawn at the same time, which is a big problem in a real production environment. Also, there is a limit on how many concurrent ssh sessions you can have (let alone consume bandwidth). Again, I wrote this fictional example in bash to show you how you can always improve a program to better handle errors.
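
One way to put a ceiling on the number of concurrent background jobs is to count them and call wait -n (available in bash 4.3 and later) when the limit is reached. This is a sketch only: the host names are placeholders, the limit of 2 is arbitrary, and the sleep plus touch pair stands in for the real copy.

```bash
#!/usr/bin/env bash

max_jobs=2
running=0
results_dir=$(mktemp -d)

for host in host1 host2 host3 host4 host5; do
    # At the limit: block until any one background job finishes.
    if [ "$running" -ge "$max_jobs" ]; then
        wait -n || true
        running=$((running - 1))
    fi
    ( sleep 0.1; touch "$results_dir/$host" ) &
    running=$((running + 1))
done

# Wait for the jobs that are still running.
wait

files=("$results_dir"/*)
completed=${#files[@]}
echo "completed jobs: $completed"
rm -rf "$results_dir"
```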

Let’s recap

1.  You must check the return code of your commands. That could mean deciding to retry until a transitory condition improves or to short-circuit the whole script.
2.  Speaking of transitory conditions, you don’t need to start from scratch. You can save the status of successful tasks and then retry from that point forward.
3.  Bash ‘trap’ is your friend. Use it for cleanup and error handling.
4.  When downloading data from any source, assume it’s corrupted. Never overwrite your good data set with fresh data until you have done some integrity checks.
5.  Take advantage of journalctl and custom fields. You can perform sophisticated searches looking for issues, and even send that data to log aggregators.
6.  You can check the status of background tasks (including sub-shells). Just remember to save the PID and wait on it.
7.  And finally: Use a Bash lint helper like ShellCheck. You can install it in your favorite editor (like VIM or PyCharm). You will be surprised how many errors go undetected in Bash scripts…


bash (Bourne Again shell) is the standard GNU shell, a powerful tool for the advanced and professional user.
This shell is a so-called superset of the Bourne shell: it extends it with a set of add-ons and plug-ins.
This means that the Bourne Again shell is compatible with the Bourne shell: commands that work in sh also work in bash.

However, the reverse is not always the case.

Bash interprets user commands, which are either entered directly by the user or read from a file called a shell script or shell program.

Shell scripts are interpreted, not compiled, so the shell reads commands from the script line by line and searches for those commands on the system.

Below is my own brief collection of notes about bash and bash scripting.


Variables

There are no data types. A variable in bash can contain a number, a character, a string of characters.

You have no need to declare a variable, just assigning a value to its reference will create it.

#!/usr/bin/env bash

NAME="Andrea"
echo "Hello $NAME!"

Some more examples:

NAME="Andrea"
echo $NAME
echo "$NAME"
echo "${NAME}!"

String quotes

NAME="Andrea"
echo "Hi $NAME" #=> Hi Andrea
echo 'Hi $NAME' #=> Hi $NAME

Shell execution

echo "I'm in $(pwd)"
echo "I'm in `pwd`" # Same

Conditional execution

git commit && git push
git commit || echo "Commit failed"

Parameter expansions

Basics

The ‘$’ character introduces parameter expansion, command substitution, or arithmetic expansion. The parameter name or symbol to be expanded may be enclosed in braces, which are optional but serve to protect the variable to be expanded from characters immediately following it which could be interpreted as part of the name.

name="Andrea"
echo ${name}
echo ${name/A/a}    #=> "andrea" (substitution)
echo ${name:0:2}    #=> "An" (slicing)
echo ${name::2}     #=> "An" (slicing)
echo ${name::-1}    #=> "Andre" (slicing)
echo ${food:-Cake}  #=> $food or "Cake"

length=2
echo ${name:0:length} #=> "An"

STR="/path/to/foo.cpp"
echo ${STR%.cpp}         # /path/to/foo
echo ${STR%.cpp}.o       # /path/to/foo.o

echo ${STR##*.}          # cpp (extension)
echo ${STR##*/}          # foo.cpp (basename)

echo ${STR#*/}           # path/to/foo.cpp
echo ${STR##*/}          # foo.cpp

echo ${STR/foo/bar}      # /path/to/bar.cpp

STR="Hello world"
echo ${STR:6:5}          # "world"
echo ${STR: -5:5}        # "world" (the space before the minus sign is required)

SRC="/path/to/foo.cpp"
BASE=${SRC##*/}          #=> "foo.cpp" (basename)
DIR=${SRC%$BASE}         #=> "/path/to/" (dirpath)

Substitution

${FOO%suffix}            Remove suffix
${FOO#prefix}            Remove prefix
${FOO%%suffix}           Remove long suffix
${FOO##prefix}           Remove long prefix
${FOO/from/to}           Replace first match
${FOO//from/to}          Replace all
${FOO/%from/to}          Replace suffix
${FOO/#from/to}          Replace prefix

printf

printf "Hello %s, I'm %s: my bowl is empty!\n" Andrea Ivo #=> "Hello Andrea, I'm Ivo: my bowl is empty!"

Comments

# Single line comment

: '
This is a
multi line
comment
'

Substrings

${parameter:offset:length}

The Substring Expansion expands to up to length characters of the value of parameter starting at the character specified by offset.

Example:

${FOO:0:3}    Substring (position, length)
${FOO: -3:3}  Substring from the right (the space before the minus sign is required)

Length

${#FOO}  Length of $FOO

Default values

${FOO:-val}      $FOO, or val if unset or empty
${FOO:=val}      Set $FOO to val if unset or empty
${FOO:+val}      val if $FOO is set and not empty
${FOO:?message}  Show error message and exit if $FOO is unset or empty
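
The four forms are easier to tell apart when run side by side. A small sketch, using throwaway variable names (FOO, BAR and the result variables are only for this demonstration); the :? form is tried in a subshell because it makes a non-interactive shell exit.

```bash
#!/usr/bin/env bash

unset FOO
first="${FOO:-val}"       # "val"; FOO itself stays unset
second="${FOO:+val}"      # empty, because FOO is unset
: "${FOO:=val}"           # assigns "val" to FOO
third="$FOO"              # "val"
fourth="${FOO:+other}"    # "other", because FOO is now set

# :? aborts a non-interactive shell, so try it inside a subshell.
status=0
error_text=$( (unset BAR; : "${BAR:?BAR is required}") 2>&1 ) || status=$?

echo "first=$first second=$second third=$third fourth=$fourth"
echo "status=$status error=$error_text"
```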

Loops

Basic ‘for’ loop

The for loop is a little bit different from other programming languages. Basically, it lets you iterate over a series of ‘words’ within a string.

for i in /etc/rc.*; do
  echo $i
done

Ranges

for i in {1..5}; do
  echo "Welcome $i"
done

With step size

for i in {5..50..5}; do
  echo "Welcome $i"
done

Reading lines with ‘while’ loop

while IFS= read -r line; do
  echo "$line"
done < file.txt

Endless ‘while’ loop

while true; do
  ...some code...
done

Functions

As in almost any programming language, you can use functions to group pieces of code in a more logical way or practice the divine art of recursion.

Defining functions

myfunc() {
  echo "hello $1"
}

Alternate syntax:

function myfunc() {
  echo "hello $1"
}

Returning values

myfunc() {
  local myresult='some value'
  echo "$myresult"
}

result=$(myfunc)

Raise errors

myfunc() {
  return 1
}

if myfunc; then
  echo "success"
else
  echo "failure"
fi

Arguments

$#   Number of arguments
$*   All arguments, joined into a single word when quoted
$@   All arguments, each kept as a separate word when quoted
$1   First argument

Trap errors

trap 'echo Error at about $LINENO' ERR

or

traperr() {
  echo "ERROR: ${BASH_SOURCE[1]} at about ${BASH_LINENO[0]}"
}
set -o errtrace
trap traperr ERR

Conditionals

Conditionals let you decide whether to perform an action or not by evaluating an expression.

Basic conditions
[ -z STRING ]           Empty string
[ -n STRING ]           Not empty string
[ NUM -eq NUM ]         Equal
[ NUM -ne NUM ]         Not equal
[ NUM -lt NUM ]         Less than
[ NUM -le NUM ]         Less than or equal
[ NUM -gt NUM ]         Greater than
[ NUM -ge NUM ]         Greater than or equal

[[ STRING =~ STRING ]]  Regexp

(( NUM < NUM ))         Numeric conditions

[ ! EXPR ]              Not
[ X ] && [ Y ]          And
[ X ] || [ Y ]          Or
File conditions
[ -e FILE ]             Exists
[ -r FILE ]             Readable
[ -h FILE ]             Symlink
[ -d FILE ]             Directory
[ -w FILE ]             Writable
[ -s FILE ]             Size is > 0 bytes
[ -f FILE ]             File
[ -x FILE ]             Executable
[ FILE1 -nt FILE2 ]     1 is more recent than 2
[ FILE1 -ot FILE2 ]     2 is more recent than 1
[ FILE1 -ef FILE2 ]     Same files
Case/switch
case "$1" in
  start | up)
    vagrant up
    ;;

  *)
    echo "Usage: $0 {start|stop|ssh}"
    ;;
esac
Other examples
# String
if [ -z "$string" ]; then
  echo "String is empty"
elif [ -n "$string" ]; then
  echo "String is not empty"
fi

# Combinations
if [ X ] && [ Y ]; then
  ...some code...
fi

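# Regex (leave the pattern unquoted; a quoted pattern is matched literally)
if [[ "A" =~ . ]]; then echo "match"; fi

# Arithmetic comparison
if (( a < b )); then echo "a is less than b"; fi

# File test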

if [ -e "file.txt" ]; then
  echo "file exists"
fi

Arrays

Bash provides one-dimensional indexed and associative array variables.

Any variable may be used as an indexed array and there is no maximum limit on the size of an array.

Defining arrays
Fruits=('Apple' 'Banana' 'Orange')

Fruits[0]="Apple"
Fruits[1]="Banana"
Fruits[2]="Orange"
Working with arrays
echo ${Fruits[0]}     # Element #0
echo ${Fruits[@]}     # All elements, space-separated
echo ${#Fruits[@]}    # Number of elements
echo ${#Fruits}       # String length of the 1st element
echo ${#Fruits[3]}    # String length of the Nth element
echo ${Fruits[@]:3:2} # Range (from position 3, length 2)
Operations
Fruits=("${Fruits[@]}" "Watermelon")     # Push
Fruits=( ${Fruits[@]/Ap*/} )             # Remove by glob pattern match
unset 'Fruits[2]'                        # Remove one item
Fruits=("${Fruits[@]}")                  # Duplicate
Fruits=("${Fruits[@]}" "${Veggies[@]}")  # Concatenate
mapfile -t lines < "logfile"             # Read lines from file (Bash 4+)
Iteration
for i in "${arrayName[@]}"; do
  echo "$i"
done
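
The examples above cover indexed arrays only. As a minimal sketch of the associative arrays mentioned at the start of this section (Bash 4 or later), with keys and values made up for illustration:

```shell
# Requires Bash 4 or later. The keys and values are made up for illustration.
declare -A colors=([apple]="red" [banana]="yellow")
colors[lime]="green"

count=${#colors[@]}   # number of entries

# Key order is not guaranteed, so sort the keys for stable output.
summary=""
while IFS= read -r key; do
  summary+="$key=${colors[$key]};"
done < <(printf '%s\n' "${!colors[@]}" | sort)

echo "$count entries: $summary"
```

"${!colors[@]}" expands to the keys, while "${colors[@]}" expands to the values.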

Options

Some useful options:

set -o noclobber    # Avoid overwriting existing files (echo "hi" > foo)
set -o errexit      # Used to exit upon error, avoiding cascading errors
set -o pipefail     # Unveils hidden failures
set -o nounset      # Exposes unset variables

While debugging, it can be useful to add the -x option:

#!/bin/bash -x

Bash then prints each command, after expansion, to stderr before running it.
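
The effect of the pipefail option listed above can be seen in a short sketch. The variable names without and with are chosen here for illustration.

```shell
# Without pipefail, a pipeline reports the status of its last command only.
set +o pipefail
without=0
false | true || without=$?    # pipeline succeeds, so without stays 0

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
with=0
false | true || with=$?       # pipeline fails, so with becomes 1
set +o pipefail               # restore the default afterwards

echo "without=$without with=$with"
```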


History

Commands
history                Show history
shopt -s histverify    Don’t execute expanded result immediately
Expansions

The History library provides a history expansion feature that makes it easy to repeat commands, insert the arguments of a previous command into the current input line, or quickly fix errors in previous commands.

!$         Expand last parameter of most recent command
!*         Expand all parameters of most recent command
!-n        Expand nth most recent command
!n         Expand nth command in history
!<command> Expand most recent invocation of command <command>

and

!!:s/<FROM>/<TO>/       Replace first occurrence of <FROM> with <TO> in most recent command
!!:gs/<FROM>/<TO>/      Replace all occurrences of <FROM> with <TO> in most recent command
!$:t                    Expand only basename from last parameter of most recent command
!$:h                    Expand only directory from last parameter of most recent command

!! and !$ can be replaced with any valid expansion.

and finally

!!:n              Expand only nth token from most recent command (command is 0; first param is 1)
!!:n-m            Expand range of tokens from most recent command
!!:n-$            Expand from the nth token to the last token of most recent command

!! can be replaced with any valid expansion i.e. !cat, !-2, !42, etc.

Miscellaneous

Numeric calculations
$((a + 200))         # Add 200 to $a

$((RANDOM % 200))    # Random number 0..199
Redirection
python hello.py > output.txt      # stdout to (file)
python hello.py >> output.txt     # stdout to (file), append
python hello.py 2> error.log      # stderr to (file)
python hello.py 2>&1              # stderr to stdout
python hello.py 2>/dev/null       # stderr to (null)
python hello.py &>/dev/null       # stdout and stderr to (null)
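
The order of redirections matters when both streams are involved. The sketch below uses a hypothetical function emit, which writes one line to each stream, to show what each form captures.

```shell
# emit is a hypothetical function that writes one line to each stream.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

out_only=$(emit 2>/dev/null)       # stderr discarded, stdout captured
err_only=$(emit 2>&1 >/dev/null)   # stderr captured, stdout discarded
both=$(emit 2>&1)                  # both streams captured

printf '%s\n' "$out_only" "$err_only"
```

In the second form, 2>&1 is processed first, so stderr is pointed at the capture before stdout is sent to /dev/null.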
Inspecting commands
command -V cd      #=> "cd is a shell builtin"
Source relative
source "${0%/*}/../share/foo.sh"
Directory of script
DIR="${0%/*}"
Getting options
while [[ "$1" =~ ^- && ! "$1" == "--" ]]; do case $1 in 
  -V | --version )
    echo "$version"
    exit
  ;;
  -s | --string )
    shift; string=$1
  ;;
  -f | --flag )
    flag=1
  ;;
esac; shift; done

if [[ "$1" == '--' ]]; then shift; fi
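
For short options only, the getopts builtin is an alternative to the hand-written loop. This is a sketch under that assumption, not a replacement for the long-option handling above. parse_args, flag, string and rest are illustrative names.

```shell
# parse_args is a hypothetical function name; it handles -f and -s VALUE.
parse_args() {
  local OPTIND=1 opt
  flag=0
  string=""
  while getopts ":fs:" opt; do
    case $opt in
      f) flag=1 ;;
      s) string=$OPTARG ;;
      \?) echo "Unknown option: -$OPTARG" >&2; return 1 ;;
      :) echo "Option -$OPTARG needs a value" >&2; return 1 ;;
    esac
  done
  shift $((OPTIND - 1))   # drop the parsed options
  rest=("$@")             # remaining positional arguments
}

parse_args -f -s hello file.txt
echo "flag=$flag string=$string rest=${rest[*]}"
```

The leading colon in ":fs:" switches getopts to silent mode, so the script prints its own error messages.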
Heredoc

A here document (heredoc) is a block of text that the shell feeds to a command or interactive program as its standard input.

cat <<END
hello world
END
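
Whether variables are expanded inside a heredoc depends on whether the delimiter is quoted. A small sketch, with name as a hypothetical variable:

```shell
name="world"   # hypothetical variable for the demonstration

# Unquoted delimiter: $name is expanded.
expanded=$(cat <<END
hello $name
END
)

# Quoted delimiter: the text is taken literally.
literal=$(cat <<'END'
hello $name
END
)

printf '%s\n' "$expanded" "$literal"
```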
Reading input
echo -n "Proceed? [y/n]: "
read -r ans
echo "$ans"
read -r -n 1 ans # Just one character
Special variables
$?     Exit status of last command
$!     PID of last background task
$$     PID of shell

Using grep on the results of ps is a bad idea in a script, since some proportion of the time it will also match the grep process you’ve just invoked. The command pgrep avoids this problem, so if you need to know the process ID, that’s a better option. (Note that, of course, there may be many processes matched.)

However, in your example, you could just use the similar command pkill to kill all matching processes:

pkill ruby

Incidentally, you should be aware that using -9 is overkill (ho ho) in almost every case. There is some useful advice about that in the text of the "Useless Use of kill -9" form letter:

No no no. Don’t use kill -9.

It doesn’t give the process a chance to cleanly:

  1. shut down socket connections
  2. clean up temp files
  3. inform its children that it is going away
  4. reset its terminal characteristics

and so on and so on and so on.

Generally, send 15, and wait a second or two, and if that doesn’t
work, send 2, and if that doesn’t work, send 1. If that doesn’t,
REMOVE THE BINARY because the program is badly behaved!

Don’t use kill -9. Don’t bring out the combine harvester just to tidy
up the flower pot.
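
The escalation described in the letter could be scripted roughly as follows. This is only a sketch: stop_politely is a hypothetical helper name, the two-second wait per signal is an assumption, and the sleep 60 process merely stands in for a real program.

```shell
# stop_politely is a hypothetical helper; it never sends signal 9.
stop_politely() {
  local pid=$1 sig i
  for sig in TERM INT HUP; do                        # signals 15, 2 and 1, in that order
    kill -s "$sig" "$pid" 2>/dev/null || return 0    # process is already gone
    for ((i = 0; i < 20; i++)); do                   # wait up to about 2 seconds
      kill -0 "$pid" 2>/dev/null || return 0         # process has exited
      sleep 0.1
    done
  done
  return 1                                           # still running after all three
}

sleep 60 &          # stand-in for a long-running process
pid=$!
status=0
stop_politely "$pid" && echo "process stopped"
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"
```

A process ended by a signal reports 128 plus the signal number, so a clean stop via signal 15 shows up as status 143.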