Failed create pod sandbox rpc error code unknown desc failed to set up sandbox container

Last updated: 2023-01-09

Why is my Amazon EKS pod stuck in the ContainerCreating state with the error «failed to create pod sandbox»?

My Amazon Elastic Kubernetes Service (Amazon EKS) pod is stuck in the ContainerCreating state with the error «failed to create pod sandbox».


Your Amazon EKS pods might be stuck in the ContainerCreating state with a network connectivity error for several reasons. Use the following troubleshooting steps based on the error message that you get.

Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

This error occurs because of an operating system limitation that’s caused by the defined kernel settings for maximum PID or maximum number of files.

Run the following command to get information about your pod:

$ kubectl describe pod example_pod

Example output:

kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "example_pod": Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

To temporarily resolve the issue, restart the node.

To troubleshoot the issue, do the following:

  • Gather the node logs.
  • Review the Docker logs for the error «dockerd[4597]: runtime/cgo: pthread_create failed: Resource temporarily unavailable».
  • Review the Kubelet log for the following errors:
    • «kubelet[5267]: runtime: failed to create new OS thread (have 2 already; errno=11)»
    • «kubelet[5267]: runtime: may need to increase max user processes (ulimit -u)».
  • Identify the zombie processes by running the ps command. All the processes listed with the Z state in the output are the zombie processes.
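
The zombie check in the last step can be sketched as a small filter. This is a minimal example that assumes a Linux node with the procps version of ps; the helper name filter_zombies is made up for illustration. The commented commands print the process and file limits that this error refers to.

```shell
# Keep only zombie (defunct) rows from `ps -eo pid,ppid,stat,comm` output.
# Input is read from stdin so the filter can also be tried on saved output.
filter_zombies() {
  # NR > 1 skips the header row; column 3 is STAT, and a leading Z marks a zombie.
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# On the affected node (not run here):
#   ps -eo pid,ppid,stat,comm | filter_zombies   # PID, parent PID, command
#   cat /proc/sys/kernel/pid_max                 # kernel-wide PID limit
#   ulimit -u                                    # max user processes
#   ulimit -n                                    # max open files
```

The parent PID column is included because a zombie is only cleaned up when its parent reaps it, so the parent is usually the process to look at.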

Network plugin cni failed to set up pod network: add cmd: failed to assign an IP address to container

This error indicates that the Container Network Interface (CNI) can’t assign an IP address for the newly provisioned pod.

The following are reasons why the CNI fails to provide an IP address to the newly created pod:

  • The instance used the maximum allowed elastic network interfaces and IP addresses.
  • The Amazon Virtual Private Cloud (Amazon VPC) subnets have an IP address count of zero.

The following is an example of network interface IP address exhaustion:

Instance type    Maximum network interfaces    Private IPv4 addresses per interface    IPv6 addresses per interface
t3.medium        3                             6                                       6

In this example, the instance t3.medium has a maximum of 3 network interfaces, and each network interface has a maximum of 6 IP addresses. The first IP address is used for the node and isn’t assignable. This leaves 17 IP addresses that the network interface can allocate.
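
The arithmetic in the preceding paragraph can be written out as a short shell function. This is only a sketch of the calculation as the paragraph describes it (interfaces multiplied by addresses per interface, minus the one address used by the node); the function name is hypothetical.

```shell
# Addresses left for pods, following the paragraph's arithmetic:
# (interfaces x addresses per interface) minus the one used by the node.
assignable_ips() {
  interfaces=$1
  ips_per_interface=$2
  echo $(( interfaces * ips_per_interface - 1 ))
}

assignable_ips 3 6   # t3.medium values from the table; prints 17
```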

The Local IP Address Management daemon (ipamD) logs show the following message when the network interface runs out of IP addresses:

"ipamd/ipamd.go:1285","msg":"Total number of interfaces found: 3 "
"AssignIPv4Address: IP address pool stats: total: 17, assigned 17"
"AssignPodIPv4Address: ENI eni-abc123 does not have available addresses"

Run the following command to get information about your pod:

$ kubectl describe pod example_pod

Example output:

Warning FailedCreatePodSandBox 23m (x2203 over 113m) kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "provisioning-XXXXXXXXXXXXXXX": networkPlugin cni failed to set up pod "provisioning-XXXXXXXXXXXXXXX" network: add cmd: failed to assign an IP address to container

Review the subnet to identify if the subnet ran out of free IP addresses. You can view available IP addresses for each subnet in the Amazon VPC console under the Subnets section.

IPv4 CIDR Block   Number of allocated ips 254   Free address count 0

To resolve this issue, scale down some of the workload to free up available IP addresses. If additional subnet capacity is available, then you can scale the node. You can also create an additional subnet. For more information, see How do I use multiple CIDR ranges with Amazon EKS? Follow the instructions in the Create subnets with a new CIDR range section.
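
If you prefer the command line to the console, the free address count can also be read with the AWS CLI. The following is a sketch under assumptions: the helper name exhausted_subnets and the VPC ID vpc-0abc123 are placeholders, and the AWS CLI must already be configured for the account.

```shell
# List subnets with no free addresses. Expects two whitespace-separated
# columns on stdin: subnet ID and available address count.
exhausted_subnets() {
  awk '$2 == 0 { print $1 }'
}

# With the AWS CLI configured (vpc-0abc123 is a placeholder VPC ID):
#   aws ec2 describe-subnets \
#     --filters "Name=vpc-id,Values=vpc-0abc123" \
#     --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' \
#     --output text | exhausted_subnets
```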

Error while dialing dial tcp connect: connection refused

This error indicates that the aws-node pod failed to communicate with IPAM because the aws-node pod failed to run on the node.

Run the following commands to get information about the pod:

$ kubectl describe pod example_pod

$ kubectl describe pod/aws-node-XXXXX -n kube-system

Example outputs:

Warning  FailedCreatePodSandBox  51s  kubelet, ip-xx-xx-xx-xx.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to set up pod "example_pod" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused", failed to clean up sandbox container

"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to teardown pod "example_pod" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp connect: connection refused"]

To troubleshoot this issue, verify that the aws-node pod is deployed and is in the Running state:

kubectl get pods --selector=k8s-app=aws-node -n kube-system

Note: Make sure that you’re running the correct version of the VPC CNI plugin for the cluster version.

The pods might be in Pending state due to Liveness and Readiness probe errors. Be sure that you have the latest recommended VPC CNI add-on version according to the compatibility table.
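
To spot aws-node pods that aren't in the Running state, and to see which VPC CNI image the cluster runs, something like the following can help. It is a minimal sketch; the helper name pods_not_running is made up for illustration.

```shell
# Print the name and status of every pod that is not Running, given
# `kubectl get pods` output on stdin (the header row is skipped).
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Against a cluster (not run here):
#   kubectl get pods --selector=k8s-app=aws-node -n kube-system | pods_not_running
#   kubectl describe daemonset aws-node -n kube-system | grep Image
```

The image tag printed by the second command is the value to compare against the compatibility table.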

Run the following command to view the last log message from the aws-node pod:

kubectl -n kube-system exec -it aws-node-XXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log

The issue might also occur because the Dockershim mount point fails to mount. The following is an example message that you can receive when this issue occurs:

Getting running pod sandboxes from "unix:///var/run/dockershim.sock
Not able to get local pod sandboxes yet (attempt 1/5): rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or director

The preceding message indicates that the pod didn’t mount /var/run/dockershim.sock.

To resolve this issue, try the following:

  • Restart the aws-node pod to remap the mount point.
  • Cordon the node, and scale the nodes in the node group.
  • Upgrade the Amazon VPC CNI plugin to the latest version that’s supported for your cluster version.
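
The first two options can be sketched as shell helpers. This is an illustration, not a prescribed procedure: the function names are hypothetical, and the run wrapper only prints each command unless DRY_RUN is set to 0, so you can review the commands first.

```shell
# Print each command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delete one aws-node pod so that the DaemonSet recreates it with fresh mounts.
restart_aws_node_pod() {
  run kubectl delete pod "$1" -n kube-system
}

# Stop new pods from being scheduled on a node.
cordon_node() {
  run kubectl cordon "$1"
}
```

For example, restart_aws_node_pod aws-node-XXXXX prints the delete command in the default dry-run mode.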

If you added the CNI as a managed plugin in the AWS Management Console, then the aws-node fails the probes. Managed plugins overwrite the service account. However, the service account isn’t configured with the selected role. To resolve this issue, turn off the plugin from the AWS Management Console, and create the service account using a manifest file. Or, edit the current aws-node service account to add the role that’s used on the managed plugin.

Network plugin cni failed to set up pod «my-app-xxbz-zz» network: failed to parse Kubernetes args: pod does not have label

You get this error because of either of the following reasons:

  • The pod isn’t running properly.
  • The certificate that the pod is using isn’t created successfully.

This error relates to the Amazon VPC admission controller webhook that’s required on Amazon EKS clusters to run Windows workloads. The webhook is a plugin that runs a pod in the kube-system namespace. The component runs on Linux nodes and allows networking for incoming pods on Windows nodes.

Run the following command to get the list of pods that are affected:

$ kubectl get pods -o wide

Example output:

my-app-xxx-zz        0/1     ContainerCreating   0          58m   <none>            ip-XXXXXXX.compute.internal   <none>
my-app-xxbz-zz       0/1     ContainerCreating   0          58m   <none>

Run the following command to get information about the pod:

$ kubectl describe pod my-app-xxbz-zz

Example output:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "<POD_NAME>": networkPlugin cni failed to set up pod "example_pod" network: failed to parse Kubernetes args: pod does not have label

Reconciler worker 1 starting processing node ip-XXXXXXX.compute.internal.
Reconciler checking resource warmpool size 1 desired 3 on node ip-XXXXXXX.compute.internal.
Reconciler creating resource on node ip-XXXXXXX.compute.internal.
Reconciler failed to create resource on node ip-XXXXXXX.compute.internal: node has no open IP address slots.

Windows nodes support one network interface per node. Each Windows node can run as many pods as the available IP addresses per network interface, minus one. To resolve this issue, scale up the number of Windows nodes.

If the IP addresses aren’t the issue, then review the Amazon VPC admission controller pod event and logs.

Run the following command to confirm that the Amazon VPC admission controller pod is created:

$ kubectl get pods -n kube-system

Or, to filter the list:

$ kubectl get pods -n kube-system | grep "vpc-admission"

Example output:

vpc-admission-webhook-5bfd555984-fkj8z     1/1     Running   0          25m

Run the following command to get information about the pod:

$ kubectl describe pod vpc-admission-webhook-5bfd555984-fkj8z -n kube-system

Example output:

  Normal  Scheduled  27m   default-scheduler  Successfully assigned kube-system/vpc-admission-webhook-5bfd555984-fkj8z to ip-xx-xx-xx-xx.ec2.internal
  Normal  Pulling    27m   kubelet            Pulling image ""
  Normal  Pulled     27m   kubelet            Successfully pulled image "" in 1.299938222s
  Normal  Created    27m   kubelet            Created container vpc-admission-webhook
  Normal  Started    27m   kubelet            Started container vpc-admission-webhook

Run the following command to check the pod logs for any configuration issues:

$ kubectl logs vpc-admission-webhook-5bfd555984-fkj8z -n kube-system

Example output:

I1109 07:32:59.352298       1 main.go:72] Initializing vpc-admission-webhook version v0.2.7.
I1109 07:32:59.352866       1 webhook.go:145] Setting up webhook with OSLabelSelectorOverride: windows.
I1109 07:32:59.352908       1 main.go:105] Webhook Server started.
I1109 07:32:59.352933       1 main.go:96] Listening on :61800 for metrics and healthz
I1109 07:39:25.778144       1 webhook.go:289] Skip mutation for  as the target platform is .

The preceding output shows that the container started successfully. The pod then adds the label to the application pod. However, the manifest for the application pod must contain a node selector or affinity so that the pod is scheduled on the Windows nodes.
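
One way to confirm that an application pod carries a Windows node selector is to read the field back from the cluster. The following is a rough sketch: the helper name selects_windows is hypothetical, it only inspects nodeSelector (not affinity rules), and it assumes the standard kubernetes.io/os label.

```shell
# Succeed if a pod's nodeSelector (read from stdin) pins it to Windows nodes.
# Accepts both the JSON form {"kubernetes.io/os":"windows"} and the older
# map[kubernetes.io/os:windows] form that kubectl jsonpath can print.
selects_windows() {
  grep -q 'kubernetes\.io/os[":]*windows'
}

# Against a cluster (pod name taken from the example above; not run here):
#   kubectl get pod my-app-xxbz-zz -o jsonpath='{.spec.nodeSelector}' | selects_windows \
#     && echo "nodeSelector targets Windows" \
#     || echo "no Windows nodeSelector; check affinity rules as well"
```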

Other options to troubleshoot the issue include verifying the following:

  • You deployed the Amazon VPC admission controller pod in the kube-system namespace.
  • Logs or events aren’t pointing to an expired certificate. If the certificate is expired and Windows pods are stuck in the ContainerCreating state, then you must delete and redeploy the pods.
  • There aren’t any timeouts or DNS-related issues.

If you don’t create the Amazon VPC admission controller, then turn on Windows support for your cluster.

Important: Amazon EKS doesn’t require you to turn on the Amazon VPC admission controller to support Windows node groups. If you turned on the Amazon VPC admission controller, then remove legacy Windows support from your data plane.

When I used Calico as the CNI, I faced a similar issue.

The container remained in the ContainerCreating state. I checked for /etc/cni/net.d and /opt/cni/bin on the master and both are present, but I am not sure whether they are required on the worker node as well.

root@KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
nginx-5c7588df-5zds6   0/1     ContainerCreating   0          21m
root@KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME               STATUS   ROLES    AGE   VERSION
kubernetesmaster   Ready    master   26m   v1.13.4
kubernetesslave1   Ready    <none>   22m   v1.13.4

kubectl describe pods
Name:               nginx-5c7588df-5zds6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               kubernetesslave1/
Start Time:         Sun, 17 Mar 2019 05:13:30 +0000
Labels:             app=nginx
Annotations:        <none>
Status:             Pending
Controlled By:      ReplicaSet/nginx-5c7588df
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
      /var/run/secrets/ from default-token-qtfbs (ro)
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qtfbs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations: for 300s
        for 300s
  Type     Reason                  Age                    From                       Message
  ----     ------                  ----                   ----                       -------
  Normal   Scheduled               18m                    default-scheduler          Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedCreatePodSandBox  18m                    kubelet, kubernetesslave1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          18m (x12 over 18m)     kubelet, kubernetesslave1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  3m39s (x829 over 18m)  kubelet, kubernetesslave1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/

On the worker node, the NGINX container keeps trying to come up and then exits. I am not sure what is going on here. I am new to Kubernetes and have not been able to fix this issue.

root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   3 minutes ago       Up 3 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
    94b2994401d0   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563   "/pause"                 4 minutes ago       Up 4 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
    f72500cae2b7   "/pause"                 1 second ago        Up Less than a second                       k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   4 minutes ago       Up 4 minutes                                k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563   "/pause"                 4 minutes ago       Up 4 minutes                                k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    root@kubernetesslave1:/home/ubuntu# docker ps
    CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
    5ad5500e8270        fadcc5d2b066           "/usr/local/bin/kube…"   5 minutes ago       Up 5 minutes                            k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
    b1c9929ebe9e   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
    ceb78340b563   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1

I checked /etc/cni/net.d and /opt/cni/bin on the worker node as well, and both are there:

root@kubernetesslave1:/home/ubuntu# cd /etc/cni
root@kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root@kubernetesslave1:/etc/cni# cd /opt/cni
root@kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root@kubernetesslave1:/opt/cni# cd bin
root@kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root  3890407 Aug 17  2017 bridge
-rwxr-xr-x 1 root root  3475802 Aug 17  2017 ipvlan
-rwxr-xr-x 1 root root  3520724 Aug 17  2017 macvlan
-rwxr-xr-x 1 root root  3877986 Aug 17  2017 ptp
-rwxr-xr-x 1 root root  3475750 Aug 17  2017 vlan
-rwxr-xr-x 1 root root  9921982 Aug 17  2017 dhcp
-rwxr-xr-x 1 root root  2605279 Aug 17  2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root  2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root  3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root  3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root  3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root  2850029 Mar 17 05:19 tuning

Pods remain in Pending phase

Check RunBook Match

When you see these errors:

Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedSync              47m (x27 over 3h)       kubelet  error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  7m56s (x38 over 3h23m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "xxx": operation timeout: context deadline exceeded

We also noticed the following when running kubectl top nodes:

NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1-xxx   158m         16%    2311Mi          87%
node-2-xxx   111m         11%    1456Mi          55%
node-3-xxx   391m         41%    1662Mi          63%
node-4-xxx   169m         17%    2210Mi          83%
node-5-xxx   <unknown>   <unknown>   <unknown>   <unknown>

The node events also showed that the node was repeatedly restarting. Output of kubectl describe node node-5-xxx:

  Type    Reason        Age                     From     Message
  ----    ------        ----                    ----     -------
  Normal  NodeNotReady  3m15s (x2344 over 10d)  kubelet  Node node-5-xxx status is now: NodeNotReady
  Normal  NodeReady     2m15s (x2345 over 49d)  kubelet  Node node-5-xxx status is now: NodeReady

Initial Steps Overview

  1. Investigate the failing pod

  2. Identify the node the pod is meant to be scheduled on

  3. Investigate the node

Detailed Steps

1) Investigate the failing pod

Check the logs of the pod:

$ kubectl logs pod-xxx

Check the events of the pod:

$ kubectl describe pod pod-xxx

2) Investigate the node the pod is meant to be scheduled on

Describe pod and see what node the pod is meant to be running on:

$ kubectl describe pod pod-xxx

Output will start with something like this. Look for the "Node:" part:

Name:         pod-xxx
Namespace:    namespace-xxx
Priority:     0
Node:         node-xxx
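
If you want the node name on its own, for example to feed it into the next step, a small filter over the describe output is one option. This is a sketch; the helper name pod_node is made up.

```shell
# Print the node name from `kubectl describe pod` output on stdin.
# The Node: field may look like "node-xxx/10.0.0.5", so anything after
# the slash is dropped.
pod_node() {
  awk '$1 == "Node:" { split($2, parts, "/"); print parts[1]; exit }'
}

# Against a cluster (not run here):
#   kubectl describe pod pod-xxx | pod_node
# or read the field directly:
#   kubectl get pod pod-xxx -o jsonpath='{.spec.nodeName}'
```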

3) Investigate the node

Check the resources of the nodes:

$ kubectl top nodes

Check the events of the node you identified the pod was meant to be scheduled on:

$ kubectl describe node node-xxx
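
A node that reports <unknown> in kubectl top nodes, like node-5-xxx earlier, is a good candidate to describe first. The following sketch picks those nodes out of the output; the helper name is hypothetical.

```shell
# Print nodes whose CPU column is "<unknown>" in `kubectl top nodes` output
# read from stdin (the header row is skipped).
nodes_without_metrics() {
  awk 'NR > 1 && $2 == "<unknown>" { print $1 }'
}

# Against a cluster (not run here):
#   kubectl top nodes | nodes_without_metrics
```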

Solutions List

A) Remove problematic node

Solutions Detail

The solution was to remove the problematic node, see more details below.

A) Remove problematic node

Check Resolution

Further Steps

  1. Create Extra Node

  2. Drain Problematic Node

  3. Delete Problematic Node

1) Create Extra Node

You may need to create a new node before draining this node. Just check with kubectl top nodes to see if the nodes have extra capacity to schedule your drained pods.

If you see you need an extra node before you drain the node, make sure to do so.

In our situation we were using a managed GKE cluster, so we added a new node via the console.

2) Drain Problematic Node

Once you are sure there is enough capacity amongst your remaining nodes to schedule the pods that are on the problematic node, then you can go ahead and drain the node.

$ kubectl drain node-xxx

3) Delete Problematic Node

Once the drain has finished, check the node list. The drained node shows SchedulingDisabled in its STATUS column.

$ kubectl get nodes

Once done you can delete the node:

$ kubectl delete node node-xxx
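
The drain and delete steps can be strung together as below. This is a sketch rather than the exact commands we ran: the --ignore-daemonsets flag and the pod listing by spec.nodeName are additions, the function names are made up, and the run wrapper only prints commands unless DRY_RUN is set to 0.

```shell
# Print each command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List the pods still bound to a node. After a drain, only DaemonSet
# pods should remain.
pods_on_node() {
  run kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Drain the node, show what is left on it, then delete it.
remove_node() {
  node=$1
  run kubectl drain "$node" --ignore-daemonsets &&
    pods_on_node "$node" &&
    run kubectl delete node "$node"
}
```

Calling remove_node node-xxx in the default dry-run mode prints the three kubectl commands in order, which makes it easy to review them before running anything against the cluster.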

About k8s deployment error resolution

  • Error message
Warning  FailedCreatePodSandBox  89s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "1c97ad2710e2939c0591477f9d6dde8e0d7d31b3fbc138a7fa38aaa657566a9a" network for pod "coredns-7f89b7bc75-qg924": networkPlugin cni failed to set up pod "coredns-7f89b7bc75-qg924_kube-system" network: error getting ClusterInformation: Get "https://[]:443/apis/": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), failed to clean up sandbox container "1c97ad2710e2939c0591477f9d6dde8e0d7d31b3fbc138a7fa38aaa657566a9a" network for pod "coredns-7f89b7bc75-qg924": networkPlugin cni failed to teardown pod "coredns-7f89b7bc75-qg924_kube-system" network: error getting ClusterInformation: Get "https://[]:443/apis/": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
  • Symptoms
$ kubectl get pods -n kube-system
NAME                              READY   STATUS              RESTARTS   AGE
coredns-7f89b7bc75-jzs26          0/1     ContainerCreating   0          63s
coredns-7f89b7bc75-qg924          0/1     ContainerCreating   0          63s

# coredns cannot run
  • Change calico.yaml

# Cluster type to identify the deployment type
- name: CLUSTER_TYPE
  value: "k8s,bgp"
# New below
- name: IP_AUTODETECTION_METHOD
  value: "interface=ens192"
  # ens192 is the local NIC name
  • kubectl apply -f calico.yaml
  • Check that it is running
$ kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-69496d8b75-2nm5k   1/1     Running   0          23m
calico-node-8wfk9                          1/1     Running   0          23m
calico-node-9vn4v                          1/1     Running   0          23m
calico-node-qm8s2                          1/1     Running   0          23m
coredns-7f89b7bc75-jzs26                   1/1     Running   0          26m
coredns-7f89b7bc75-qg924                   1/1     Running   0          26m
etcd-linux03                               1/1     Running   0          26m
kube-apiserver-linux03                     1/1     Running   0          26m
kube-controller-manager-linux03            1/1     Running   0          26m
kube-proxy-29lcf                           1/1     Running   0          25m
kube-proxy-c29wz                           1/1     Running   0          26m
kube-proxy-lpgrr                           1/1     Running   0          25m
kube-scheduler-linux03                     1/1     Running   0          26m
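
To find the NIC name to use in calico.yaml, the interface list on the node can be reduced to names only. This is a sketch under assumptions: the helper name is made up, and the commented kubectl command assumes the setting being changed is Calico's IP_AUTODETECTION_METHOD environment variable, which is what the interface=ens192 value suggests.

```shell
# Print interface names from `ip -o link show` output read from stdin.
# Each line looks like "2: ens192: <BROADCAST,...> mtu 1500 ...".
interface_names() {
  awk -F': ' '{ print $2 }'
}

# On the node (not run here):
#   ip -o link show | interface_names
# Alternative to editing calico.yaml by hand:
#   kubectl set env daemonset/calico-node -n kube-system \
#     IP_AUTODETECTION_METHOD=interface=ens192
```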

