OpenShift 137 error

For reference, these are the error codes returned by the older, cartridge-based OpenShift (v2) client tools and API. Note that code 137 in this list ("Cartridge cannot be added or removed from the application") is unrelated to the container exit code 137 (OOMKilled) covered below.

Code  Description
1  Non-specific error
97  Invalid user credentials
99  User does not exist
100  An application with specified name already exists
101  An application with specified name does not exist and cannot be operated on
102  A user with login already exists
103  Given namespace is already in use
104  User’s gear limit has been reached
105  Invalid application name
106  Invalid namespace
107  Invalid user login
108  Invalid SSH key
109  Invalid cartridge types
110  Invalid application type specified
111  Invalid action
112  Invalid API
113  Invalid auth key
114  Invalid auth iv
115  Too many cartridges of one type per user
116  Invalid SSH key type
117  Invalid SSH key name or tag
118  SSH key name does not exist
119  SSH key or key name not specified
120  SSH key name already exists
121  SSH key already exists
122  Last SSH key for user
123  No SSH key for user
124  Could not delete default or primary key
125  Invalid template
126  Invalid event
127  A domain with specified namespace does not exist and cannot be operated on
128  Could not delete domain because domain has valid applications
129  The application is not configured with this cartridge
130  Invalid parameters to estimates controller
131  Error during estimation
132  Insufficient Access Rights
133  Could not delete user
134  Invalid gear profile
135  Cartridge not found in the application
136  Cartridge already embedded in the application
137  Cartridge cannot be added or removed from the application
138  User deletion not permitted for normal or non-subaccount user
139  Could not delete user because user has valid domain or applications
140  Alias already in use
141  Unable to find nameservers for domain
150  A plan with specified id does not exist
151  Billing account was not found for user
152  Billing account status not active
153  User has more consumed gears than the new plan allows
154  User has gears that the new plan does not allow
155  Error getting account information from billing provider
156  Updating user plan on billing provider failed
157  Plan change not allowed for subaccount user
158  Domain already exists for user
159  User has additional filesystem storage that the new plan does not allow
160  User max gear limit capability does not match with current plan
161  User gear sizes capability does not match with current plan
162  User max untracked additional filesystem storage per gear capability does not match with current plan
163  Gear group does not exist
164  User is not allowed to change storage quota
165  Invalid storage quota value provided
166  Storage value not within allowed range
167  Invalid value for nolinks parameter
168  Invalid scaling factor provided. Value out of range.
169  Could not completely distribute scales_from to all groups
170  Could not resolve DNS
171  Could not obtain lock
172  Invalid or missing private key is required for SSL certificate
173  Alias does exist for this application
174  Invalid SSL certificate
175  User is not authorized to add private certificates
176  User has private certificates that the new plan does not allow
180  This command is not available in this application
181  User maximum tracked additional filesystem storage per gear capability does not match with current plan
182  User does not have gear_sizes capability provided by current plan
183  User does not have max_untracked_addtl_storage_per_gear capability provided by current plan
184  User does not have max_tracked_addtl_storage_per_gear capability provided by current plan
185  Cartridge X can not be added without cartridge Y
186  Invalid environment variables: expected array of hashes
187  Invalid environment variable X. Valid keys name (required), value
188  Invalid environment variable name X: specified multiple times
189  Environment name X not found in application
190  Value not specified for environment variable X
191  Specify parameters name/value or environment_variables
192  Environment name X already exists in application
193  Environment variable deletion not allowed for this operation
194  Name can only contain letters, digits and underscore and cannot begin with a digit
210  Cannot override existing location for Git repository
211  Parent directory for Git repository does not exist
212  Could not find libra_id_rsa
213  Could not read from SSH configuration file
214  Could not write to SSH configuration file
215  Host could not be created or found
216  Error in Git pull
217  Destroy aborted
218  Not found response from request
219  Unable to communicate with server
220  Plan change is not allowed for this account
221  Plan change is not allowed at this time for this account. Wait a few minutes and try again. If problem persists contact Red Hat support.
253  Could not open configuration file
255  Usage error

What is OOMKilled (exit code 137)

The OOMKilled error, also indicated by exit code 137, means that a container or pod was terminated because it used more memory than allowed. OOM stands for “Out Of Memory”.

Kubernetes allows pods to limit the resources their containers are allowed to utilize on the host machine. A pod can specify a memory limit – the maximum amount of memory the container is allowed to use, and a memory request – the minimum memory the container is expected to use.
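As a minimal sketch (the resource names and values here are illustrative, not taken from this article), requests and limits are declared per container in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: my-app                                    # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:latest     # illustrative image
    resources:
      requests:
        memory: "256Mi"     # the scheduler reserves this much memory for the container
      limits:
        memory: "512Mi"     # exceeding this amount gets the container OOM killed

The scheduler places the pod based on the request; the limit is what the kernel enforces at runtime.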

If a container uses more memory than its memory limit, it is terminated with an OOMKilled status. Similarly, if overall memory usage across all containers or pods on the node exceeds the defined limit, one or more pods may be terminated.

You can identify the error by running the kubectl get pods command—the pod status will appear as OOMKilled.

NAME        READY    STATUS       RESTARTS    AGE
my-pod-1    0/1      OOMKilled    0           3m12s

We’ll provide a general process for identifying and resolving OOMKilled. More complex cases will require advanced diagnosis and troubleshooting, which is beyond the scope of this article.

How Does the OOM Killer Mechanism Work?

OOMKilled is actually not native to Kubernetes—it is a feature of the Linux kernel, known as the OOM Killer, which Kubernetes uses to manage container lifecycles. The OOM Killer mechanism monitors node memory and selects processes that are taking up too much memory and should be killed. It is important to realize that the OOM Killer may kill a process even if there is free memory on the node.

The Linux kernel maintains an oom_score for each process running on the host. The higher this score, the greater the chance that the process will be killed. Another value, called oom_score_adj, allows users to customize the OOM process and define when processes should be terminated.

Kubernetes uses the oom_score_adj value when defining a Quality of Service (QoS) class for a pod. There are three QoS classes that may be assigned to a pod:

  • Guaranteed
  • Burstable
  • BestEffort

Each QoS class has a matching value for oom_score_adj:

Quality of Service    oom_score_adj
Guaranteed            -997
BestEffort            1000
Burstable             min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Because “Guaranteed” pods have a lower value, they are the last to be killed on a node that is running out of memory. “BestEffort” pods are the first to be killed.
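As a worked example with illustrative numbers (not from the source): a Burstable pod that requests 1 GiB of memory on a node with 4 GiB of capacity gets oom_score_adj = min(max(2, 1000 - (1000 * 1) / 4), 999) = 750, which places it between Guaranteed pods (-997) and BestEffort pods (1000) in the kill order.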

A pod that is killed due to a memory issue is not necessarily evicted from its node. If the pod’s restart policy is set to “Always”, the kubelet will try to restart the pod.

To see the QoS class of a pod, run the following command:

kubectl get pod [name] -o jsonpath='{.status.qosClass}'

To see the oom_score of a pod:

  1. Run kubectl exec -it [pod-name] -- /bin/bash to get a shell in the container
  2. To see the oom_score, run cat /proc/[pid]/oom_score
  3. To see the oom_score_adj, run cat /proc/[pid]/oom_score_adj

The pod with the highest oom_score is the first to be killed when the node runs out of memory.

OOMKilled: Common Causes

The following are the common causes of this error and how to resolve them. However, note that there are many more possible causes of OOMKilled errors, and many cases are difficult to diagnose and troubleshoot.

Cause: Container memory limit was reached, and the application is experiencing higher load than normal
Resolution: Increase the memory limit in the pod specification

Cause: Container memory limit was reached, and the application is experiencing a memory leak
Resolution: Debug the application and resolve the memory leak

Cause: Node is overcommitted (the total memory used by pods is greater than the node's memory)
Resolution: Adjust memory requests (minimal threshold) and memory limits (maximal threshold) in your containers

OOMKilled: Diagnosis and Resolution

Step 1: Gather Information

Run kubectl describe pod [name] and save the content to a text file for future reference:

kubectl describe pod [name] > /tmp/troubleshooting_describe_pod.txt

Step 2: Check Pod Events Output for Exit Code 137

Check the Events section of the describe pod text file, and look for the following message:

State:          Running
       Started:      Thu, 10 Oct 2019 11:14:13 +0200
       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137
       ...

Exit code 137 indicates that the container was terminated due to an out of memory issue. Now look through the events in the pod’s recent history, and try to determine what caused the OOMKilled error:

  • The pod was terminated because a container limit was reached.
  • The pod was terminated because the node was “overcommitted”—pods were scheduled to the node that, put together, request more memory than is available on the node.

Step 3: Troubleshooting

If the pod was terminated because container limit was reached:

  • Determine if your application really needs more memory. For example, if the application is a website that is experiencing additional load, it may need more memory than originally specified. In this case, to resolve the error, increase the memory limit for the container in the pod specification (see the sketch after this list).
  • If memory use suddenly increases, and does not seem to be related to application loads, the application may be experiencing a memory leak. Debug the application and resolve the memory leak. In this case you should not increase the memory limit, because this will cause the application to use up too many resources on the nodes.
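If you concluded (per the first bullet above) that the application genuinely needs more memory, a minimal sketch of raising the limit looks like this; the deployment name, container name, and values are placeholders:

kubectl set resources deployment/my-app -c my-app --requests=memory=512Mi --limits=memory=1Gi

Alternatively, edit the resources.requests.memory and resources.limits.memory fields directly in the pod or deployment spec and re-apply it.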

If the pod was terminated because of overcommit on the node:

  • Overcommit on a node can occur because pods are allowed to schedule on a node if their memory requests value—the minimal memory value—is less than the memory available on the node.
  • For example, Kubernetes may run 10 containers with a memory request value of 1 GB on a node with 10 GB memory. However, if these containers have a memory limit of 1.5 GB, some of the pods may use more than the minimum memory, and then the node will run out of memory and need to kill some of the pods.
  • You need to determine why Kubernetes decided to terminate the pod with the OOMKilled error, and adjust memory requests and limit values to ensure that the node is not overcommitted (the commands sketched after this list can help you check current usage).
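To check how close a node is to being overcommitted, these standard commands can help; the node name is a placeholder, and kubectl top requires the metrics server to be installed in the cluster:

kubectl describe node [node-name]        # "Allocated resources" shows total requests/limits vs. capacity
kubectl top node [node-name]             # actual memory usage on the node
kubectl top pod --all-namespaces         # actual memory usage per pod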

When adjusting memory requests and limits, keep in mind that when a node is overcommitted, Kubernetes kills nodes according to the following priority order:

  1. Pods that do not have requests or limits
  2. Pods that have requests, but not limits
  3. Pods that are using more than their memory request value—minimal memory specified—but under their memory limit
  4. Pods that are using more than their memory limit

To fully diagnose and resolve Kubernetes memory issues, you’ll need to monitor your environment, understand the memory behavior of pods and containers compared to the limits, and fine tune your settings. This can be a complex, unwieldy process without the right tooling.

Solving Kubernetes Errors Once and for All with Komodor

The troubleshooting process in Kubernetes is complex and, without the right tools, can be stressful, ineffective and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually something will go wrong—simply because it can.

This is the reason why we created Komodor, a tool that helps dev and ops teams stop wasting their precious time looking for needles in (hay)stacks every time things go wrong.

Acting as a single source of truth (SSOT) for all of your k8s troubleshooting needs, Komodor offers:

  • Change intelligence: Every issue is a result of a change. Within seconds we can help you understand exactly who did what and when.
  • In-depth visibility: A complete activity timeline, showing all code and config changes, deployments, alerts, code diffs, pod logs, and more. All within one pane of glass with easy drill-down options.
  • Insights into service dependencies: An easy way to understand cross-service changes and visualize their ripple effects across your entire system.
  • Seamless notifications: Direct integration with your existing communication channels (e.g., Slack) so you’ll have all the information you need, when you need it.

Whether you’re looking for a quick fix or gearing up for future troubleshooting, the rest of this article covers pretty standard errors you’ll run into as you’re developing on OpenShift. Below are the 10 most common ones I’ve seen when working with developers who are getting started with the platform.

But first:

Where to Look for Error Info

Pod/Container Logs

If your build or deployment started and failed halfway through, this is the best place to start. You can view build logs from the build's page in the web console, or from the command line as sketched below.

You can see deployment logs by looking at the specific deployment and either looking at that deployment's logs or the pod's logs directly. Click on the "1 pod" section to find that deployment's pods, then click "Logs."
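From the command line, the same logs are available with oc logs; this is a quick sketch assuming a reasonably recent oc client, with resource names as placeholders:

oc logs -f bc/[buildconfig-name]       # follow the logs of the latest build
oc logs build/[build-name]             # logs of a specific build
oc logs dc/[deploymentconfig-name]     # logs of the latest deployment
oc logs [pod-name]                     # logs of a specific pod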

Monitoring/Events

Most OpenShift objects include an "Events" tab so you can watch new events as they happen. You can also see all of the events happening in the project by clicking on "Monitoring" in the sidebar.

Most of the time, errors will be visible in either of those locations.

10 Common Errors

1. Missing configmap/secret/volume in deployment config

This will appear as a "RunContainerError" when your pods are attempting to spin up. If the required ConfigMap/Secret is missing, or if the key you're looking for in a ConfigMap/Secret is missing, you'll see this error under "Events."
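As an illustrative sketch (the names are hypothetical), this is the kind of reference that breaks when the ConfigMap or key is missing; the pod's Events tab will show which one could not be found:

env:
- name: DATABASE_URL                 # hypothetical variable
  valueFrom:
    configMapKeyRef:
      name: my-config                # fails if this ConfigMap does not exist
      key: database-url              # ...or if this key is missing from it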

2. Health check using the wrong port

This one is a bit harder to find generally, but if your application looks like it has spun up fine with no errors and then appears as Failed with the pods constantly restarting, the liveness probe might be hitting the wrong port. Your readiness probe should also hit the correct port, but it won't restart the pod if it fails (the pod will just appear as "not ready").
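A minimal sketch of a liveness and readiness probe, with an illustrative path and port; the port must match the one your application actually listens on:

livenessProbe:
  httpGet:
    path: /healthz                   # illustrative path
    port: 8080                       # must match the container's real listening port
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready                     # illustrative path
    port: 8080
  periodSeconds: 10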

3. Missing build secret for authenticating with source repo

If you’re seeing a Fetch source failed error when you try to build, you might need to set up a build secret to authenticate with your Git repo. This will either be a username and password (new-basicauth secret) or an SSH key (new-sshauth secret) depending on the URL.
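A sketch of the basic-auth case, assuming a reasonably recent oc client; the secret and BuildConfig names are placeholders, and for SSH URLs you would create an SSH auth secret instead:

oc create secret generic repo-credentials \
  --from-literal=username=[user] --from-literal=password=[password] \
  --type=kubernetes.io/basic-auth
oc set build-secret --source bc/[buildconfig-name] repo-credentials
# equivalently, set spec.source.sourceSecret.name in the BuildConfig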

4. Project quota exceeded

The system admins for your OCP cluster usually set project quotas to keep individual projects from taking up too many resources. If you’ve already reached your project quota, trying to deploy a new container will fail. You can decrease replicas for other containers, reduce the resource requests/limits for each service, or get the OCP admins to increase your project quota.
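To see how much of the quota is already consumed before deciding which option to take (project name is a placeholder):

oc describe quota -n [project]         # shows Used vs. Hard for each limited resource
oc get resourcequota -n [project]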

5. Resources outside of request/limit ratio

In addition to project quotas, sometimes OCP admins will add limits on what individual pods can request in terms of CPU and memory. Sometimes you’ll be inside the limit range for the pod but you’ll still get an error about your resource request/limit. This is because there can also be max/min ratios set on pod resources that require your request and limit values to be within a certain ratio.
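You can inspect the limit ranges defined on the project to see which ratios are in play; the YAML fragment in the comment is an illustrative example of such a constraint, not output from a real cluster:

oc describe limitrange -n [project]
# an illustrative LimitRange entry enforcing a limit/request ratio:
#   - type: Container
#     maxLimitRequestRatio:
#       memory: "2"                   # the limit may be at most 2x the request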

6. Build terminates with exit code 137

I really only see this on Maven builds, but it’s not limited to that. This is an Out Of Memory (OOM) error while trying to build. Increase the memory request and limit on your BuildConfig and this should go away. 
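A minimal sketch of the relevant part of the BuildConfig, with illustrative values:

# in the BuildConfig:
spec:
  resources:
    requests:
      memory: "1Gi"
    limits:
      memory: "2Gi"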

7. Image pull from external registry instead of internal

If you're seeing an image pull error where the pod is trying to pull its image from an external registry rather than the internal OpenShift registry, there are a couple reasons that this could be happening.

In most cases, you probably aren't trying to get the "hello-world" image from registry.access.redhat.com but want to retrieve it from the internal OCP registry instead. If this is the case, you should take a look at your ImageChange Trigger in the DeploymentConfig and make sure that it is properly set up to update to the latest image when a new one is pushed to the internal registry.

To fix your current failed deployment, you'll need to get the full docker pull spec for the image and copy that into your DeploymentConfig. Go to the ImageStream for the image you want and click "Actions" and then "Edit YAML".

In the YAML, search for "dockerImageReference" and copy its value; it is the full pull spec for the image in the internal registry.

Then paste this value into the image field for that container in your DeploymentConfig, as in the sketch below.
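The registry address, project, and digest below are purely illustrative; yours will differ:

# value copied from the ImageStream's dockerImageReference:
#   172.30.1.1:5000/my-project/hello-world@sha256:0123456789abcdef...
# pasted into the DeploymentConfig container spec:
spec:
  template:
    spec:
      containers:
      - name: hello-world
        image: 172.30.1.1:5000/my-project/hello-world@sha256:0123456789abcdef...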

This will ensure that your DeploymentConfig is pulling the correct image from the local registry instead of going out to an external registry.

8. Environment variables are "invalid"

If you try to use "oc apply" to update an environment variable from a name/value pair to a "valueFrom" retrieved from a ConfigMap or Secret, or vice versa, you'll get this error:

The DeploymentConfig "hello-world" is invalid: spec.template.spec.containers[0].env[0].valueFrom: Invalid value: "": may not be specified when `value` is not empty

There are a couple bug reports out for this error, but there isn't a fix as of this post. The easiest way to get rid of this error is to delete all of the environment variables for your DeploymentConfig and run the update again, or update them manually in the console rather than running an "oc apply."
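For reference, these are the two mutually exclusive forms involved (names and values are illustrative); switching a variable from one form to the other with oc apply is what triggers the error:

# plain name/value pair:
env:
- name: GREETING
  value: "hello"
# versus a value sourced from a ConfigMap:
env:
- name: GREETING
  valueFrom:
    configMapKeyRef:
      name: my-config
      key: greeting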

9. Deployment config always appears as "canceled"

This error is caused by the deployment number: for example, you might currently be running deployment #6, but deployments #1 and #2 somehow appear more recent. The latest deployment/replication controller must always have the highest number. The reset to #1 generally happens if the DeploymentConfig is deleted and recreated with "oc delete"/"oc create" or "oc replace." If this happens, the quickest way to get the new deployments running again is to delete all of the previous replication controllers and re-deploy.
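A sketch of that cleanup from the command line; the names are placeholders, and you should double-check which replication controllers belong to the DeploymentConfig before deleting anything:

oc get rc                                          # find the stale replication controllers
oc delete rc [old-rc-name-1] [old-rc-name-2]
oc rollout latest dc/[deploymentconfig-name]       # kick off a fresh deployment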

10. Build succeeds but fails to push image

Private registries sometimes require image push or image pull secrets for security purposes. If a BuildConfig doesn't include this secret or includes the wrong one for a secured registry, the build will fail when it tries to push the image.

You can fix this by editing your BuildConfig and choosing "Show advanced options" to choose the right image push/pull secrets.
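From the command line, the equivalent is attaching a push secret to the BuildConfig (names are placeholders, assuming a reasonably recent oc client):

oc set build-secret --push bc/[buildconfig-name] [registry-push-secret]
# or set spec.output.pushSecret.name directly in the BuildConfig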

A related issue report: the documentation shows a pod that uses too much memory being promptly killed, with the error reason given as "OOM" and the error code as 137. When I go through similar steps myself, the termination reason is just "Error", though I do still get the 137 exit code. Is there a reason this was changed? "OOM" is very clear about what happened, while "Error" can send people down a wild chase trying to figure out what happened to their pod, hence my filing this issue.

For reference, the script run in my docker image just eats memory until the container gets killed.

$kubectl version
Client Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.1", GitCommit:"92635e23dfafb2ddc828c8ac6c03c7a7205a84d8", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.3", GitCommit:"6a81b50c7e97bbe0ade075de55ab4fa34f049dc2", GitTreeState:"clean"}
$ kubectl get pod -o json  memtest
{
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {
        "name": "memtest",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/pods/memtest",
        "uid": "c480d1ba-bec2-11e5-ad45-062d2421a4bd",
        "resourceVersion": "21949421",
        "creationTimestamp": "2016-01-19T15:39:03Z"
    },
    "spec": {
        "containers": [
            {
                "name": "memtest",
                "image": "nyxcharon/docker-stress:latest",
                "args": [
                    "python",
                    "/scripts/mem-fill"
                ],
                "resources": {
                    "limits": {
                        "memory": "10M"
                    },
                    "requests": {
                        "memory": "10M"
                    }
                },
                "terminationMessagePath": "/dev/termination-log",
                "imagePullPolicy": "Always"
            }
        ],
        "restartPolicy": "Never",
        "terminationGracePeriodSeconds": 30,
        "dnsPolicy": "ClusterFirst",
        "nodeName": "<IP removed>"
    },
    "status": {
        "phase": "Failed",
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "lastProbeTime": null,
                "lastTransitionTime": null
            }
        ],
        "hostIP": "<IP Removed>",
        "startTime": "2016-01-19T15:39:03Z",
        "containerStatuses": [
            {
                "name": "memtest",
                "state": {
                    "terminated": {
                        "exitCode": 137,
                        "reason": "Error",
                        "startedAt": "2016-01-19T15:39:15Z",
                        "finishedAt": "2016-01-19T15:39:16Z",
                        "containerID": "docker://3dd77f77dfd6e715c8792c625f388e0b31cbd36ccdb4a11dafbb6d381bf83943"
                    }
                },
                "lastState": {},
                "ready": false,
                "restartCount": 0,
                "image": "nyxcharon/docker-stress:latest",
                "imageID": "docker://bacbb71b34e92ed2074621d86b10ec15a856f5918537c4d75b6f16925b5b93e7",
                "containerID": "docker://3dd77f77dfd6e715c8792c625f388e0b31cbd36ccdb4a11dafbb6d381bf83943"
            }
        ]
    }
}
$ kubectl describe pod memtest
Name:               memtest
Namespace:          default
Image(s):           nyxcharon/docker-stress:latest
Node:               <IP removed>
Start Time:         Tue, 19 Jan 2016 10:39:03 -0500
Labels:             <none>
Status:             Failed
Reason:
Message:
IP:
Replication Controllers:    <none>
Containers:
  memtest:
    Container ID:   docker://3dd77f77dfd6e715c8792c625f388e0b31cbd36ccdb4a11dafbb6d381bf83943
    Image:      nyxcharon/docker-stress:latest
    Image ID:       docker://bacbb71b34e92ed2074621d86b10ec15a856f5918537c4d75b6f16925b5b93e7
    QoS Tier:
      memory:   Guaranteed
    Limits:
      memory:   10M
    Requests:
      memory:       10M
    State:      Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 19 Jan 2016 10:39:15 -0500
      Finished:     Tue, 19 Jan 2016 10:39:16 -0500
    Ready:      False
    Restart Count:  0
    Environment Variables:
Conditions:
  Type      Status
  Ready     False
No volumes.
Events:
  FirstSeen LastSeen    Count   From                    SubobjectPath               Reason  Message
  ─────────   ────────    ───── ────                    ─────────────             ──────  ───────
  2m        2m      1   {kubelet <IP removed>}  implicitly required container POD   Pulled  Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
  2m        2m      1   {scheduler }                                    Scheduled   Successfully assigned memtest to <IP removed>
  2m        2m      1   {kubelet <IP removed>}  implicitly required container POD   Created Created with docker id 65a446677edd
  2m        2m      1   {kubelet <IP removed>}  spec.containers{memtest}        Pulling Pulling image "nyxcharon/docker-stress:latest"
  2m        2m      1   {kubelet <IP removed>}  implicitly required container POD   Started Started with docker id 65a446677edd
  2m        2m      1   {kubelet <IP removed>}  spec.containers{memtest}        Pulled  Successfully pulled image "nyxcharon/docker-stress:latest"
  2m        2m      1   {kubelet <IP removed>}  spec.containers{memtest}        Created Created with docker id 3dd77f77dfd6
  2m        2m      1   {kubelet <IP removed>}  spec.containers{memtest}        Started Started with docker id 3dd77f77dfd6
  2m        2m      1   {kubelet <IP removed>}  implicitly required container POD   Killing Killing with docker id 65a446677edd

Here is the pod definition I’m using:

kind: Pod
apiVersion: v1
metadata:
  name: memtest
spec:
  containers:
  - name: memtest
    image: nyxcharon/docker-stress:latest
    args:
    - python
    - /scripts/mem-fill
    resources:
      limits:
        memory: 10M
  restartPolicy: Never
