Error job failed exit code 137

Issue created Jul 19, 2019 by Kathiresan (@Kathiresan24)

ERROR: Job failed: exit code 137

Summary

I got this error during CI. The runner finishes about 90% of its work, then the pipeline fails and throws an error like:
ERROR: Job failed: exit code 137

Steps to reproduce

My project has 3 runners with the Docker executor on a single Linux machine.

Actual behavior

The job failed with the error ERROR: Job failed: exit code 137

Expected behavior

The job needs to run without failure.

Relevant logs and/or screenshots

job log

![Screenshot_from_2019-07-19_12-02-18](/uploads/c2311cc626a2cf45215afb2936a5e887/Screenshot_from_2019-07-19_12-02-18.png)

Environment description

I am using specific runners with the Docker executor,
and my docker info output:

Containers: 39
Running: 2
Paused: 0
Stopped: 37
Images: 52
Server Version: 18.09.6
Storage Driver: overlay2
Kernel Version: 4.15.0-51-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.52GiB

config.toml contents

check_interval = 0

[[runners]]
  name = "RUNNER#1"
  url = "https://example.com/"
  token = "123456xxxx"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "node:latest"
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]

[[runners]]
  name = "RUNNER#2"
  url = "https://example.com/"
  token = "123456xxxx"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "node:latest"
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]

[[runners]]
  name = "RUNNER#3"
  url = "https://example.com/"
  token = "123456xxxx"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "node:latest"
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
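
Since exit code 137 is the OOM-kill signature (128 + 9), one hedged option is to cap each job container's memory in the Docker executor section of config.toml so that a single runaway job cannot exhaust the 15.52 GiB host. A minimal sketch; the values are illustrative, and memory, memory_swap and memory_reservation are the relevant [runners.docker] options:

  [runners.docker]
    image = "node:latest"
    # Hard cap per job container; the job is OOM-killed above this value
    memory = "4g"
    # Keeping memory_swap equal to memory disables extra swap for the job
    memory_swap = "4g"
    # Soft limit applied when the host itself is under memory pressure
    memory_reservation = "2g"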

Used GitLab Runner version

Version:      10.8.0
Git revision: 079aad9e
Git branch:   
GO version:   go1.8.7
Built:        2018-05-22T03:24:56+00:00
OS/Arch:      linux/amd64

Edited Jul 19, 2019 by Kathiresan

Contents

  1. What is Docker container exit code 137? #21083
  2. Comments
  3. BUG REPORT INFORMATION
  4. How to prevent Docker containers from crashing with error 137
  5. Docker containers crashing often?
  6. What causes error 137 in Docker
  7. How to debug error 137 in Docker
  8. Reasons for ‘Out of memory’ error 137 in Docker
  9. 1. Docker container has run out of memory
  10. 2. Docker host has no free memory
  11. How to resolve error 137 in Docker
  12. 1. Optimize the services
  13. 2. Mount config files from outside
  14. 3. Monitor the container usage
  15. 4. Add more RAM to the host machine
  16. In short..
  17. Looking for a stable Docker setup?
  18. How to Fix Exit Code 137 | Memory Issues
  19. What Is Exit Code 137?
  20. Causes of Container Memory Issues
  21. Container Memory Limit Exceeded
  22. Application Memory Leak
  23. Natural Increases in Load
  24. Requesting More Memory Than Your Compute Nodes Can Provide
  25. Running Too Many Containers Without Memory Limits
  26. Preventing Pods and Containers From Causing Memory Issues
  27. Setting Memory Limits
  28. Investigating Application Problems
  29. Using ContainIQ to Monitor and Debug Memory Problems
  30. Final Thoughts

What is Docker container exit code 137? #21083

Is there an official exit code list?


If you are reporting a new issue, make sure that we do not have any duplicates already open. You can ensure this by searching the issue list for this repository. If there is a duplicate, please close your issue and add a comment to the existing issue instead.

If you suspect your issue is a bug, please edit your issue description to include the BUG REPORT INFORMATION shown below. If you fail to provide this information within 7 days, we cannot debug your issue and will close it. We will, however, reopen it if you later provide the information.

For more information about reporting issues, see CONTRIBUTING.md.

You don’t have to include this information if this is a feature request

(This is an automated, informational response)

BUG REPORT INFORMATION

Use the commands below to provide key information from your environment:

docker version :
docker info :

Provide additional environment details (AWS, VirtualBox, physical, etc.):

List the steps to reproduce the issue:
1.
2.
3.

Describe the results you received:

Describe the results you expected:

Provide additional info you think is important:


How to prevent Docker containers from crashing with error 137

Docker systems can be used for a wide range of applications, from setting up development environments to hosting web instances.

Live Docker containers that crash often end up defeating their purpose. As a result, a major concern for Docker providers is ensuring container uptime.

Containers crash for many reasons, the main one being a lack of memory. During a crash, the container shows an exit code that explains the reason for the crash.

Docker containers crashing often?

Today we’ll see what causes Docker ‘Exited (137)’ crash message and how to fix it.

What causes error 137 in Docker

Docker crashes are often denoted by the message ‘Exited’ in the container ‘STATUS’ column when listing containers with the ‘docker ps -a’ command.

Error 137 in Docker denotes that the container was killed by the ‘oom-killer’ (Out of Memory). This happens when there isn’t enough memory in the container for the running process.

The ‘OOM killer’ is a kernel mechanism that jumps in to save the system when free memory gets too low, by killing memory-hungry processes to free up memory for the system.

Here is a snippet that shows the MySQL container exited with error 137:

Docker container exited with error 137

When the MySQL process running in the container exceeded its memory limit, the OOM-killer killed the container and it exited with code 137.

How to debug error 137 in Docker

Each Docker container has a log file associated with it. These log files store all the relevant information and updates related to that container.

Examining the container log file is vital in troubleshooting its crash. Details of the crash can be identified by checking the container logs with the ‘docker logs’ command.

For the MySQL container that crashed, the log files showed the following information:

Docker error 137 – Out of memory

The logs in this case clearly show that the MySQL container was killed because the mysqld process took up too much memory.
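
A hedged way to confirm the same thing from the command line (the container name is a placeholder):

# Check the last log lines for the killed process
docker logs --tail 50 mysql-container

# Ask Docker whether the kernel OOM-killed the container and which exit code it returned
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' mysql-container
# expected output for an OOM-killed container: true 137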

Reasons for ‘Out of memory’ error 137 in Docker

In the Docker architecture, containers are hosted on a single physical machine. Error 137 in Docker usually happens due to two main out-of-memory reasons:

1. Docker container has run out of memory

By default, Docker containers use the available memory in the host machine. To prevent a single container from abusing the host resources, we set memory limits per container.

But if the processes in the container exceed this memory limit, the OOM-killer kills the application and the container crashes.

An application can use too much memory due to improper configuration, an unoptimized service, high traffic, or resource abuse by users.
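
A hedged example of setting such a per-container limit at run time; the container name, image tag and the 1 GiB value are illustrative:

# Hard cap at 1 GiB; if mysqld exceeds it, the container is OOM-killed (exit code 137)
docker run -d --name mysql-container --memory=1g --memory-swap=1g \
    -e MYSQL_ROOT_PASSWORD=example mysql:5.7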

Docker system architecture

2. Docker host has no free memory

The memory that can be allotted to Docker containers is limited by the total memory available on the host machine that hosts them.

Often, when usage and traffic increase, the available free memory may be insufficient for all the containers. As a result, containers may crash.

How to resolve error 137 in Docker

When a Docker container exits with an OOM error, it shows that there is a lack of memory. But the first resort should not be to increase the RAM in the host machine.

Improperly configured services, abusive processes, or peak traffic can lead to memory shortage. So the first step is to identify the cause of this memory usage.

After identifying the cause, the following corrective actions can be done in the Docker system to avoid further such OOM crashes.

1. Optimize the services

Unoptimized applications can take up more memory than necessary. For instance, an improperly configured MySQL service can quickly consume the entire host memory.

So, the first step is to monitor the application running in the container and to optimize the service. This can be done by editing the configuration file or recompiling the service.

2. Mount config files from outside

It is always advisable to mount the config files of services from outside the Docker container. This allows you to edit them easily without rebuilding the Docker image.

For instance, in the MySQL Docker image, “/etc/mysql/conf.d” can be mounted as a volume. Any configuration changes for the MySQL service can then be made without affecting the image.
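
For example, a hedged docker run sketch (the host path is illustrative):

# Keep the MySQL configuration on the host so it can be tuned without rebuilding the image
docker run -d --name mysql-container \
    -v /srv/mysql/conf.d:/etc/mysql/conf.d:ro \
    -e MYSQL_ROOT_PASSWORD=example mysql:5.7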

3. Monitor the container usage

Monitoring the container’s memory usage to detect abusive users, resource-depleting processes, traffic spikes, etc. is vital in Docker system management.

Depending on the traffic and resource usage of processes, memory limits for the containers can be changed to suit their business purpose better.
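
A quick, hedged way to watch this is docker stats, which reports per-container memory usage against its limit:

# One-off snapshot of memory usage per container (drop --no-stream for a live view)
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"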

4. Add more RAM to the host machine

After optimizing the services and setting memory limits, if the containers are running at their maximum memory limits, then we should add more RAM.

Adding more RAM and ensuring enough swap memory in the host machine would help the containers to utilize that memory whenever there is a memory crunch.

In short..

Today we saw how to fix error 137 in Docker using a systematic debugging method. But there may be scenarios where the error code may not be displayed, especially when the container is launched from some shell script.

At Bobcares, examining Docker logs, optimizing applications, limiting resource usage for containers, monitoring the traffic, etc. are routinely done to avoid crashes.

Before deploying Docker containers for live hosting or development environment setup, we perform stress tests by simulating the estimated peak traffic.

This helps us to minimize crashes in the live server. If you’d like to know how to manage your Docker resources efficiently for your business purpose, we’d be happy to talk to you.

Looking for a stable Docker setup?

Talk to our Docker specialists today to know how we can keep your containers top notch!


How to Fix Exit Code 137 | Memory Issues

Exit code 137 errors happen when a container or pod is terminated because it used more memory than allowed. The purpose of this tutorial is to show readers how to fix exit code 137 errors related to memory issues.

Exit code 137 occurs when a process is terminated because it’s using too much memory. Your container or Kubernetes pod will be stopped to prevent the excessive resource consumption from affecting your host’s reliability.

Processes that end with exit code 137 need to be investigated. The problem could be that your system simply needs more physical memory to meet user demands. However, there might also be a memory leak or sub-optimal programming inside your application that’s causing resources to be consumed excessively.

In this article, you’ll learn how to identify and debug exit code 137 so your containers run reliably. This will reduce your maintenance overhead and help stop inconsistencies caused by services stopping unexpectedly. Although some causes of exit code 137 can be highly specific to your environment, most problems can be solved with a simple troubleshooting sequence.

What Is Exit Code 137?

All processes emit an exit code when they terminate. Exit codes provide a mechanism for informing the user, operating system, and other applications why the process stopped. Each code is a number between 0 and 255. The meaning of codes below 125 is application-dependent, while higher values have special meanings.

A 137 code is issued when a process is terminated externally because of its memory consumption. The operating system’s out-of-memory (OOM) killer intervenes to stop the program before it destabilizes the host.

When you start a foreground program in your shell, you can read the $? variable to inspect the process exit code:
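
A minimal shell illustration of reading $?, using the demo-binary name from the text as a placeholder:

$ ./demo-binary
Killed
$ echo $?
137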

As this example returned 137, you know that demo-binary was stopped because it used too much memory. The same thing happens for container processes, too: when a memory limit is reached, the process is terminated and a 137 code is issued.

Pods running in Kubernetes will show a status of OOMKilled when they encounter a 137 exit code. Although this looks like any other Kubernetes status, it’s caused by the operating system’s OOM killer terminating the pod’s process. You can check for pods that have used too much memory by running Kubectl’s get pods command:

$ kubectl get pods

NAME READY STATUS RESTARTS AGE
demo-pod 0/1 OOMKilled 2m05s

Memory consumption problems can affect anyone, not just organizations using Kubernetes. You could run into similar issues with Amazon ECS, RedHat OpenShift, Nomad, CloudFoundry, and plain Docker deployments. Regardless of the platform, if a container fails with a 137 exit code, the root cause will be the same: there’s not enough memory to keep it running.

For example, you can view a stopped Docker container’s exit code by running docker ps -a:

CONTAINER ID IMAGE COMMAND CREATED STATUS
cdefb9ca658c demo-org/demo-image:latest "demo-binary" 2 days ago Exited (137) 1 day ago

The exit code is shown in brackets under the STATUS column. The 137 value confirms this container stopped because of a memory problem.

Causes of Container Memory Issues

Understanding the situations that lead to memory-related container terminations is the first step towards debugging exit code 137. Here are some of the most common issues that you might experience.

Container Memory Limit Exceeded

Kubernetes pods will be terminated when they try to use more memory than their configured limit allows. You might be able to resolve this situation by increasing the limit if your cluster has spare capacity available.

Application Memory Leak

Poorly optimized code can create memory leaks. A memory leak occurs when an application uses memory, but doesn’t release it when the operation’s complete. This causes the memory to gradually fill up, and will eventually consume all the available capacity.

Natural Increases in Load

Sometimes adding physical memory is the only way to solve a problem. Growing services that experience an increase in active users can reach a point where more memory is required to serve the increase in traffic.

Requesting More Memory Than Your Compute Nodes Can Provide

Kubernetes pods configured with memory resource requests can use more memory than the cluster’s nodes have if limits aren’t also used. A request allows consumption overages because it’s only an indication of how much memory a pod will consume, and it doesn’t prevent the pod from consuming more memory if it’s available.

Running Too Many Containers Without Memory Limits

Running several containers without memory limits can create unpredictable Kubernetes behavior when the node’s memory capacity is reached. Containers without limits have a greater chance of being killed, even if a neighboring container caused the capacity breach.

Preventing Pods and Containers From Causing Memory Issues

Debugging container memory issues in Kubernetes—or any other orchestrator—can seem complex, but using the right tools and techniques helps make it less stressful. Kubernetes assigns memory to pods based on the requests and limits they declare. Unless it resides in a namespace with a default memory limit, a pod that doesn’t use these mechanisms can normally access limitless memory.
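
For reference, a namespace-level default of that kind comes from a LimitRange object. A minimal sketch with illustrative values:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory
  namespace: demo
spec:
  limits:
    - type: Container
      # Applied as the limit when a container declares none
      default:
        memory: 512Mi
      # Applied as the request when a container declares none
      defaultRequest:
        memory: 256Mi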

Setting Memory Limits

Pods without memory limits increase the chance of OOM kills and exit code 137 errors. These pods are able to use more memory than the node can provide, which poses a stability risk. When memory consumption gets close to the physical limit, the Linux kernel OOM killer intervenes to stop processes that are using too much memory.

Making sure each of your pods includes a memory limit is a good first step towards preventing OOM kill issues. Here’s a sample pod manifest:
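
A minimal manifest matching the 256 Mi request and 512 Mi limit discussed below (the pod and image names are placeholders taken from the earlier examples):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      image: demo-org/demo-image:latest
      resources:
        requests:
          memory: 256Mi
        limits:
          memory: 512Mi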

The requests field indicates the pod wants 256 Mi of memory. Kubernetes will use this information to influence scheduling decisions, and will ensure that the pod is hosted by a node with at least 256 Mi of memory available. Requests help to reduce resource contention, ensuring your applications have the resources they need. It’s important to note, though, that they don’t prevent the pod from using more memory if it’s available on the node.

This sample pod also includes a memory limit of 512 Mi. If memory consumption goes above 512 Mi, the pod becomes a candidate for termination. If there’s too much memory pressure and Kubernetes needs to free up resources, the pod could be stopped. Setting limits on all of your pods helps prevent excessive memory consumption in one from affecting the others.

Investigating Application Problems

Once your pods have appropriate memory limits, you can start investigating why those limits are being reached. Start by analyzing traffic levels to identify anomalies as well as natural growth in your service. If memory use has grown in correlation with user activity, it could be time to scale your cluster with new nodes, or to add more memory to existing ones.

If your nodes have sufficient memory, you’ve set limits on all your pods, and service use has remained relatively steady, the problem is likely to be within your application. To figure out where, you need to look at the nature of your memory consumption issues: is usage suddenly spiking, or does it gradually increase over the course of the pod’s lifetime?

A memory usage graph that shows large peaks can point to poorly optimized functions in your application. Specific parts of your codebase could be allocating a lot of memory to handle demanding user requests. You can usually work out the culprit by reviewing pod logs to determine which actions were taken around the time of the spike. It might be possible to refactor your code to use less memory, such as by explicitly freeing up variables and destroying objects after you’ve finished using them.

Memory graphs that show continual increases over time usually mean you’ve got a memory leak. These problems can be tricky to find, but reviewing application logs and running language-specific analysis tools can help you discover suspect code. Unchecked memory leaks will eventually fill all the available physical memory, forcing the OOM killer to stop processes so the capacity can be reclaimed.

Using ContainIQ to Monitor and Debug Memory Problems

Debugging Kubernetes problems manually is time-consuming and error-prone. You have to inspect pod status codes and retrieve their logs using terminal commands, which can create delays in your incident response. Kubernetes also lacks a built-in way of alerting you when memory consumption’s growing. You might not know about spiraling resource usage until your pods start to terminate and knock parts of your service offline.

ContainIQ addresses these challenges by providing a complete monitoring solution for your Kubernetes cluster. With ContainIQ, you can view real-time metrics using visual dashboards, and create alerts for when limits are breached. The platform surfaces events within your cluster, such as pod OOM kills, and provides convenient access to the logs created by your containers.

You can start inspecting pod memory issues in ContainIQ by heading to the Events tab from your dashboard. This lets you search for cluster activity, or Kubernetes events, that led up to a pod being terminated. Try “OOMKilled” as your search term to find a pod’s termination event, then review the events that occurred immediately prior to the termination to understand why the container was stopped.

You can access a live overview of current memory usage compared to pod limits by going to the Nodes tab and scrolling down to the “MEM Per Node” graph. Toggle the Show Limits button to include limits on the graph. If the limits are higher than the available memory, this is a sign they’re too permissive, and memory exhaustion could occur. Conversely, relatively low limits might mean there are pods in your cluster that haven’t been configured with a limit. This could also lead to skyrocketing memory use.

Finally, you can set up alerts to notify you when pods are terminated by the OOM killer. Click the New Monitor button in the top right of the screen, and choose the “Event” alert type in the popup that appears. On the next screen, type “OOMKilled” as the event reason. You’ll now be notified each time a pod terminates with exit code 137. You can set up monitors that alert based on metrics, too, letting you detect high memory consumption before your containers are terminated.

Final Thoughts

Exit code 137 means a container or pod is trying to use more memory than it’s allowed. The process gets terminated to prevent memory usage ballooning indefinitely, which could cause your host system to become unstable.

Excessive memory usage can occur due to natural growth in your application’s use, or as the result of a memory leak in your code. It’s important to set correct memory limits on your pods to guard against these issues; while reaching the limit will prompt termination with a 137 exit code, this mechanism is meant to protect you against worse problems that will occur if system memory is depleted entirely.

When you’re using Kubernetes, you should proactively monitor your cluster so you’re aware of normal memory consumption and can identify any spikes. ContainIQ is an all-inclusive solution for analyzing your cluster’s health that can track metrics like memory use and send you alerts when pods are close to their limits. This provides a single source of truth when you’re inspecting and debugging Kubernetes performance.


I need some advice on an issue I am facing with k8s 1.14 and running GitLab pipelines on it. Many jobs are throwing exit code 137 errors, and I found that it means that the container is being terminated abruptly.


Cluster information:

Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: C5.4xLarge


After digging in, I found the below logs:

kubelet: I0114 03:37:08.639450  4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).

kubelet: E0114 03:37:08.653132  4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes

kubelet: W0114 03:37:23.240990  4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up

kubelet: W0114 00:15:51.106881   4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage

kubelet: I0114 00:15:51.106907   4781 container_gc.go:85] attempting to delete unused containers

kubelet: I0114 00:15:51.116286   4781 image_gc_manager.go:317] attempting to delete unused images

kubelet: I0114 00:15:51.130499   4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage

kubelet: I0114 00:15:51.130648   4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:

 1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
 2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
 3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
 4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
 5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)

And then the pods get terminated, resulting in the exit code 137 errors.

Can anyone help me understand the reason and a possible solution to overcome this?

Thank you :)

asked Jan 14, 2020 at 8:24 by YYashwanth

Exit code 137 does not necessarily mean OOMKilled. It indicates that the container received SIGKILL (some external interrupt or the ‘oom-killer’ [out of memory]).

If the pod got OOMKilled, you will see the lines below when you describe the pod:

      State:        Terminated
      Reason:       OOMKilled
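
A hedged way to pull that state block (pod and namespace names are placeholders):

kubectl describe pod demo-pod -n demo | grep -A 3 'State'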

Edit on 2/2/2022
I see that you added "kubelet: I0114 03:37:08.639450 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%)." and "must evict pod(s) to reclaim ephemeral-storage" from the log. This usually happens when application pods write a lot to disk, such as log files. Admins can configure when (at what disk usage percentage) eviction happens.
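
Those thresholds are configurable on the kubelet; a hedged KubeletConfiguration sketch using the 85% / 80% image GC values from the log and illustrative eviction signals:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Image garbage collection runs between these two disk-usage percentages
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
# Hard eviction starts when free disk space drops below these signals
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"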

answered Jan 16, 2020 at 6:20 by ffran09

137 means that k8s killed the container for some reason (maybe it didn’t pass a liveness probe).

Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.

answered Oct 23, 2020 at 7:12 by werewolf

The typical causes for this error code are the system running out of RAM, or a failed health check.

answered Oct 13, 2020 at 1:30 by Chris Halcrow

I was able to solve the problem.

The nodes initially had a 20 GB EBS volume on a c5.4xlarge instance type. I increased the EBS volume to 50 and then 100 GB, but that did not help, as I kept seeing the error below:

"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%)."

I then changed the instance type to c5d.4xlarge, which comes with 400 GB of local cache storage, and allotted 300 GB of EBS. This solved the error.

Some of the GitLab jobs were for Java applications that were eating up a lot of cache space and writing a lot of logs.

answered Jan 16, 2020 at 6:12 by YYashwanth

Exit code 137 in detail:

  1. It denotes that the process was terminated by an external signal.
  2. The number 137 is the sum of two numbers: 128 + x, where x is the number of the signal that caused the process to terminate.
  3. In this case, x equals 9, which is the number of the SIGKILL signal, meaning the process was killed forcibly (see the quick shell demonstration below).
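
A quick shell demonstration of that 128 + 9 arithmetic (any long-running command works in place of sleep):

$ sleep 300 &
$ kill -9 $!
$ wait $!
$ echo $?
137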

Hope this helps better.

answered Sep 23, 2022 at 14:06 by Gupta

Check the Jenkins master node’s memory and CPU profile. In my case, the master was under high memory and CPU utilization, and the agents were getting restarted with 137.

answered Aug 19, 2021 at 10:45 by Chandrapurnima Bhatnagar



There are several containers:

root# docker-compose ps
      Name                    Command               State                 Ports
---------------------------------------------------------------------------------------------
back_async_1       /bin/sh -c python main.py        Up      0.0.0.0:8070->8070/tcp
back_beat_1        celery -A backend beat           Up
back_celery_1      celery -A backend worker - ...   Up
back_flower_1      /bin/sh -c celery -A backe ...   Up      0.0.0.0:5555->5555/tcp
back_mysql_1       /entrypoint.sh mysqld            Up      0.0.0.0:3300->3306/tcp, 33060/tcp
back_redis_1       docker-entrypoint.sh redis ...   Up      0.0.0.0:6379->6379/tcp
back_web_1         uwsgi --ini /app/uwsgi.ini       Up      0.0.0.0:8080->8080/tcp
back_ws_server_1   /bin/sh -c python main.py        Up      0.0.0.0:8060->8060/tcp

The server has 4 GB of RAM.

back_celery_1 crashes with code 137. I read that this can be caused by a lack of RAM, but monitoring shows that at most 1.5 GB was used.
The crash happens somewhere between 3 and 6 hours of normal operation. The server makes requests to social networks.

After starting all the containers:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
ca8934bd9a75        back_redis_1        0.17%               1.223MiB / 1.977GiB   0.06%               1.77MB / 2.32MB     0B / 16.4kB         4
4014dd4a2ecc        back_celery_1       0.17%               517.8MiB / 1.977GiB   25.58%              1.52MB / 1.46MB     4.1kB / 4.1kB       21
d95f6fd88e96        back_web_1          0.00%               49.92MiB / 1.977GiB   2.47%               1.9kB / 2.4kB       0B / 0B             5
bda4eee57ded        back_beat_1         0.00%               42.23MiB / 1.977GiB   2.09%               7.6kB / 64kB        0B / 106kB          1
ce40f4a1b4af        back_flower_1       0.02%               46.61MiB / 1.977GiB   2.30%               938kB / 334kB       0B / 0B             7
b2d84df70cfe        back_async_1        0.01%               19.75MiB / 1.977GiB   0.98%               1.21kB / 0B         0B / 0B             2
430734923ffe        back_ws_server_1    0.02%               19.75MiB / 1.977GiB   0.98%               1.21kB / 0B         0B / 0B             2
f0c7143c246e        back_mysql_1        0.08%               201.8MiB / 1.977GiB   9.97%               94.6kB / 145kB      1.18MB / 14.6MB     30

The latest logs; however, server monitoring shows that resource usage dropped at 20:50, so the logs may not have been written all that time.

$docker logs --details back_celery_1

[2018-02-28 17:56:16,151: INFO/ForkPoolWorker-18] Task stats.views.collect_stats[7d4730f8-ab6a-446d-b516-2d8d4ba0b9c8] succeeded in 42.99442209396511s: None
 Import Error

  -------------- celery@4014dd4a2ecc v4.1.0 (latentcall)
 ---- **** -----
 --- * ***  * -- Linux-4.4.0-34-generic-x86_64-with-debian-8.9 2018-02-28 13:16:20
 -- * - **** ---
 - ** ---------- [config]
 - ** ---------- .> app:         backend:0x7f9bd22247f0
 - ** ---------- .> transport:   redis://redis:6379/0
 - ** ---------- .> results:     disabled://
 - *** --- * --- .> concurrency: 20 (prefork)
 -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
 --- ***** -----
  -------------- [queues]
                 .> celery           exchange=celery(direct) key=celery


 [tasks]
   . CallbackNotifier
   . FB posting
   . FB token status
   . MD posting
   . MD token status
   . OK posting
   . OK token status
   . TW posting
   . TW token status
   . VK posting
   . VK token status
   . api.controllers.message.scheduled_message
   . backend.celery.debug_task
   . stats.views.collect_stats

 /usr/local/lib/python3.4/site-packages/celery/platforms.py:795: RuntimeWarning: You're running the worker with superuser privileges: this is
 absolutely not recommended!

 Please specify a different user using the -u option.

 User information: uid=0 euid=0 gid=0 egid=0

   uid=uid, euid=euid, gid=gid, egid=egid,

Celery logs containing the error about Redis; literally one minute later, the Celery logs stopped being written.

[2018-02-28 17:55:34,221: CRITICAL/MainProcess] Unrecoverable error: ResponseError('MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.',)
 Traceback (most recent call last):
   File "/usr/local/lib/python3.4/site-packages/celery/worker/worker.py", line 203, in start
     self.blueprint.start(self)
   File "/usr/local/lib/python3.4/site-packages/celery/bootsteps.py", line 119, in start
     step.start(parent)
   File "/usr/local/lib/python3.4/site-packages/celery/bootsteps.py", line 370, in start
     return self.obj.start()
   File "/usr/local/lib/python3.4/site-packages/celery/worker/consumer/consumer.py", line 320, in start
     blueprint.start(self)
   File "/usr/local/lib/python3.4/site-packages/celery/bootsteps.py", line 119, in start
     step.start(parent)
   File "/usr/local/lib/python3.4/site-packages/celery/worker/consumer/consumer.py", line 596, in start
     c.loop(*c.loop_args())
   File "/usr/local/lib/python3.4/site-packages/celery/worker/loops.py", line 88, in asynloop
     next(loop)
   File "/usr/local/lib/python3.4/site-packages/kombu/async/hub.py", line 354, in create_loop
     cb(*cbargs)
   File "/usr/local/lib/python3.4/site-packages/kombu/transport/redis.py", line 1040, in on_readable
     self.cycle.on_readable(fileno)
   File "/usr/local/lib/python3.4/site-packages/kombu/transport/redis.py", line 337, in on_readable
     chan.handlers[type]()
   File "/usr/local/lib/python3.4/site-packages/kombu/transport/redis.py", line 714, in _brpop_read
     **options)
   File "/usr/local/lib/python3.4/site-packages/redis/client.py", line 680, in parse_response
     response = connection.read_response()
   File "/usr/local/lib/python3.4/site-packages/redis/connection.py", line 629, in read_response
     raise response
 redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

Redis logs after startup:

1:M 01 Mar 08:24:09.060 * Background saving started by pid 8738
 8738:C 01 Mar 08:24:09.060 # Failed opening the RDB file root (in server root dir /run) for saving: Permission denied
 1:M 01 Mar 08:24:09.160 # Background saving error
 1:C 01 Mar 08:24:16.265 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
 1:C 01 Mar 08:24:16.269 # Redis version=4.0.6, bits=64, commit=00000000, modified=0, pid=1, just started
 1:C 01 Mar 08:24:16.269 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
 1:M 01 Mar 08:24:16.270 * Running mode=standalone, port=6379.
 1:M 01 Mar 08:24:16.271 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
 1:M 01 Mar 08:24:16.271 # Server initialized
 1:M 01 Mar 08:24:16.271 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

Permission settings for Redis and running as the redis user:

CMD ["chown", "redis:redis", "-R", "/etc"]
CMD ["chown", "redis:redis", "-R", "/var/lib"]
CMD ["chown", "redis:redis", "-R", "/run"]

CMD ["sudo", "chmod", "644", "/data/dump.rdb" ]
CMD ["sudo", "chmod", "755", "/etc" ]
CMD ["sudo", "chmod", "770", "/var/lib" ]
CMD ["sudo", "chmod", "770", "/run" ]

Has anyone run into something like this? What could the causes be?

Last updated: 2022-08-01

My Apache Spark job on Amazon EMR fails with a "Container killed on request" stage failure:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 4 times, most recent failure: Lost task 2.3 in stage 3.0 (TID 23, ip-xxx-xxx-xx-xxx.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed: container_1516900607498_6585_01_000008 on host: ip-xxx-xxx-xx-xxx.compute.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Short description

When a container (Spark executor) runs out of memory, YARN automatically kills it. This causes the "Container killed on request. Exit code is 137" error. These errors can happen in different job stages, both in narrow and wide transformations. YARN containers can also be killed by the OS oom_reaper when the OS is running out of memory, causing the "Container killed on request. Exit code is 137" error.

Resolution

Use one or more of the following methods to resolve "Exit status: 137" stage failures.

Increase driver or executor memory

Increase container memory by tuning the spark.executor.memory or spark.driver.memory parameters (depending on which container caused the error).

On a running cluster:

Modify spark-defaults.conf on the master node. Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.executor.memory 10g
spark.driver.memory 10g

For a single job:

Use the --executor-memory or --driver-memory option to increase memory when you run spark-submit. Example:

spark-submit --executor-memory 10g --driver-memory 10g ...

Add more Spark partitions

If you can’t increase container memory (for example, if you’re using maximizeResourceAllocation on the node), then increase the number of Spark partitions. Doing this reduces the amount of data that’s processed by a single Spark task, and that reduces the overall memory used by a single executor. Use the following Scala code to add more Spark partitions:

val numPartitions = 500
val newDF = df.repartition(numPartitions)

Increase the number of shuffle partitions

If the error happens during a wide transformation (for example join or groupBy), add more shuffle partitions. The default value is 200.

On a running cluster:

Modify spark-defaults.conf on the master node. Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.sql.shuffle.partitions 500

For a single job:

Use the --conf spark.sql.shuffle.partitions option to add more shuffle partitions when you run spark-submit. Example:

spark-submit --conf spark.sql.shuffle.partitions=500 ...

Reduce the number of executor cores

Reducing the number of executor cores reduces the maximum number of tasks that the executor processes simultaneously. Doing this reduces the amount of memory that the container uses.

On a running cluster:

Modify spark-defaults.conf on the master node. Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.executor.cores  1

For a single job:

Use the --executor-cores option to reduce the number of executor cores when you run spark-submit. Example:

spark-submit --executor-cores 1 ...

Increase instance size

YARN containers can also be killed by the OS oom_reaper when the OS is running out of memory. If this error happens due to oom_reaper, use a larger instance with more RAM. You can also lower yarn.nodemanager.resource.memory-mb to keep YARN containers from using up all of the EC2 instance's RAM.
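
A hedged sketch of that setting as an Amazon EMR configuration classification; the 24576 MB value is illustrative and should be sized below the instance's physical RAM:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.memory-mb": "24576"
    }
  }
]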

You can detect if the error is due to oom_reaper by reviewing your Amazon EMR Instance logs for the dmesg command output. Start by finding the core or task node where the killed YARN container was running. You can find this information by using the YARN Resource Manager UI or logs. Then, check the Amazon EMR Instance state logs on this node before and after the container was killed to see what killed the process.

In the following example, the process with ID 36787 corresponding to YARN container_165487060318_0001_01_000244 was killed by the kernel (Linux’s OOM killer):

# how's the kernel looking
dmesg | tail -n 25

[ 3910.032284] Out of memory: Kill process 36787 (java) score 96 or sacrifice child
[ 3910.043627] Killed process 36787 (java) total-vm:15864568kB, anon-rss:13876204kB, file-rss:0kB, shmem-rss:0kB
[ 3910.748373] oom_reaper: reaped process 36787 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


