Why is my Amazon EKS pod stuck in the ContainerCreating state with the error "failed to create pod sandbox"?
Last updated: 2023-01-09
My Amazon Elastic Kubernetes Service (Amazon EKS) pod is stuck in the ContainerCreating state with the error "failed to create pod sandbox".
Resolution
Your Amazon EKS pods might be stuck in the ContainerCreating state with a network connectivity error for several reasons. Use the following troubleshooting steps based on the error message that you get.
Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown
This error occurs because of an operating system limitation that’s caused by the defined kernel settings for maximum PID or maximum number of files.
Run the following command to get information about your pod:
$ kubectl describe pod example_pod
Example output:
kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "example_pod": Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown
To temporarily resolve the issue, restart the node.
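To see whether the node is actually near those limits, the kernel settings named above can be inspected directly; a quick sketch, assuming shell access to a Linux node:

```shell
# Low values here can cause fork/exec of containerd-shim to fail with
# "resource temporarily unavailable"
pid_max=$(cat /proc/sys/kernel/pid_max)   # kernel-wide PID limit
max_user_procs=$(ulimit -u)               # per-user process limit
echo "kernel.pid_max=${pid_max} max user processes=${max_user_procs}"
```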
To troubleshoot the issue, do the following:
- Gather the node logs.
- Review the Docker logs for the error "dockerd[4597]: runtime/cgo: pthread_create failed: Resource temporarily unavailable".
- Review the kubelet logs for the following errors:
- "kubelet[5267]: runtime: failed to create new OS thread (have 2 already; errno=11)"
- "kubelet[5267]: runtime: may need to increase max user processes (ulimit -u)"
- Identify the zombie processes by running the ps command. All the processes listed with the Z state in the output are the zombie processes.
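As a sketch, zombie processes can be counted by filtering the STAT column of ps output. The sample below uses hypothetical output so that it runs anywhere; on a real node, pipe `ps -eo pid,stat,comm` directly instead:

```shell
# Hypothetical ps output; on the node itself run: ps -eo pid,stat,comm
sample='  PID STAT COMMAND
  101 S    containerd
  202 Z    defunct-worker
  303 Z    defunct-worker'
# Count rows whose STAT column starts with Z (zombie)
zombies=$(echo "$sample" | awk '$2 ~ /^Z/ {n++} END {print n+0}')
echo "$zombies zombie process(es) found"
```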
Network plugin cni failed to set up pod network: add cmd: failed to assign an IP address to container
This error indicates that the Container Network Interface (CNI) can’t assign an IP address for the newly provisioned pod.
The following are reasons why the CNI fails to provide an IP address to the newly created pod:
- The instance used the maximum allowed elastic network interfaces and IP addresses.
- The Amazon Virtual Private Cloud (Amazon VPC) subnets have an IP address count of zero.
The following is an example of network interface IP address exhaustion:
Instance type Maximum network interfaces Private IPv4 addresses per interface IPv6 addresses per interface
t3.medium 3 6 6
In this example, the t3.medium instance has a maximum of 3 network interfaces, and each network interface has a maximum of 6 IP addresses, for 18 in total. The first IP address is used for the node and isn't assignable. This leaves 17 IP addresses that the instance can allocate to pods.
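The arithmetic above can be sketched as a quick shell calculation, using the t3.medium values from the table:

```shell
enis=3          # maximum network interfaces for t3.medium
ips_per_eni=6   # private IPv4 addresses per interface
# One address is reserved for the node's primary IP and isn't assignable
assignable=$(( enis * ips_per_eni - 1 ))
echo "Assignable IP addresses: $assignable"
```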
The Local IP Address Management daemon (ipamD) logs show the following message when the network interface runs out of IP addresses:
"ipamd/ipamd.go:1285","msg":"Total number of interfaces found: 3 "
"AssignIPv4Address: IP address pool stats: total: 17, assigned 17"
"AssignPodIPv4Address: ENI eni-abc123 does not have available addresses"
Run the following command to get information about your pod:
$ kubectl describe pod example_pod
Example output:
Warning FailedCreatePodSandBox 23m (x2203 over 113m) kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "provisioning-XXXXXXXXXXXXXXX": networkPlugin cni failed to set up pod "provisioning-XXXXXXXXXXXXXXX" network: add cmd: failed to assign an IP address to container
Review the subnet to determine whether it has run out of free IP addresses. You can view the available IP addresses for each subnet in the Amazon VPC console, under the Subnets section.
Subnet: XXXXXXXXXX
IPv4 CIDR Block 10.2.1.0/24 Number of allocated ips 254 Free address count 0
To resolve this issue, scale down some of the workload to free up available IP addresses. If additional subnet capacity is available, then you can scale the node. You can also create an additional subnet. For more information, see How do I use multiple CIDR ranges with Amazon EKS? Follow the instructions in the Create subnets with a new CIDR range section.
Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused
This error indicates that the aws-node pod failed to communicate with IPAM because the aws-node pod failed to run on the node.
Run the following commands to get information about the pod:
$ kubectl describe pod example_pod
$ kubectl describe pod/aws-node-XXXXX -n kube-system
Example outputs:
Warning FailedCreatePodSandBox 51s kubelet, ip-xx-xx-xx-xx.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to set up pod "example_pod" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container
"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to teardown pod "example_pod" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
To troubleshoot this issue, verify that the aws-node pod is deployed and is in the Running state:
$ kubectl get pods --selector=k8s-app=aws-node -n kube-system
Note: Make sure that you’re running the correct version of the VPC CNI plugin for the cluster version.
The pods might be in Pending state due to Liveness and Readiness probe errors. Be sure that you have the latest recommended VPC CNI add-on version according to the compatibility table.
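A sketch for spotting aws-node pods that aren't Running, using hypothetical kubectl output; on a live cluster, pipe the `kubectl get pods` command above directly instead of using the sample:

```shell
# Hypothetical output of: kubectl get pods --selector=k8s-app=aws-node -n kube-system
sample='NAME              READY   STATUS             RESTARTS   AGE
aws-node-abcde    1/1     Running            0          3d
aws-node-fghij    0/1     CrashLoopBackOff   12         3d'
# Print pods whose STATUS column is anything other than Running
not_running=$(echo "$sample" | awk 'NR > 1 && $3 != "Running" {print $1}')
echo "Pods not Running: ${not_running:-none}"
```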
Run the following command to view the last log message from the aws-node pod:
$ kubectl -n kube-system exec -it aws-node-XXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log
The issue might also occur because the Dockershim mount point fails to mount. The following is an example message that you can receive when this issue occurs:
Getting running pod sandboxes from "unix:///var/run/dockershim.sock"
Not able to get local pod sandboxes yet (attempt 1/5): rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
The preceding message indicates that the pod didn't mount /var/run/dockershim.sock.
To resolve this issue, try the following:
- Restart the aws-node pod to remap the mount point.
- Cordon the node, and scale the nodes in the node group.
- Upgrade the Amazon VPC CNI plugin to the latest version that's supported for your cluster.
If you added the CNI as a managed plugin in the AWS Management Console, then the aws-node fails the probes. Managed plugins overwrite the service account. However, the service account isn’t configured with the selected role. To resolve this issue, turn off the plugin from the AWS Management Console, and create the service account using a manifest file. Or, edit the current aws-node service account to add the role that’s used on the managed plugin.
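As a sketch, the service account edit can look like the following manifest; the account ID and role name in the ARN are hypothetical placeholders for the role selected on the managed add-on:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-node
  namespace: kube-system
  annotations:
    # Hypothetical account ID and role name; substitute the role used by the add-on
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-vpc-cni-role
```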
Network plugin cni failed to set up pod "my-app-xxbz-zz" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
You get this error because of either of the following reasons:
- The pod isn’t running properly.
- The certificate that the pod is using isn’t created successfully.
This error relates to the Amazon VPC admission controller webhook that’s required on Amazon EKS clusters to run Windows workloads. The webhook is a plugin that runs a pod in the kube-system namespace. The component runs on Linux nodes and allows networking for incoming pods on Windows nodes.
Run the following command to get the list of pods that are affected (the -o wide output includes the node column shown below):
$ kubectl get pods -o wide
Example output:
my-app-xxx-zz 0/1 ContainerCreating 0 58m <none> ip-XXXXXXX.compute.internal <none>
my-app-xxbz-zz 0/1 ContainerCreating 0 58m <none>
Run the following command to get information about the pod:
$ kubectl describe pod my-app-xxbz-zz
Example output:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "<POD_NAME>": networkPlugin cni failed to set up pod "example_pod" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
Reconciler worker 1 starting processing node ip-XXXXXXX.compute.internal.
Reconciler checking resource vpc.amazonaws.com/PrivateIPv4Address warmpool size 1 desired 3 on node ip-XXXXXXX.compute.internal.
Reconciler creating resource vpc.amazonaws.com/PrivateIPv4Address on node ip-XXXXXXX.compute.internal.
Reconciler failed to create resource vpc.amazonaws.com/PrivateIPv4Address on node ip-XXXXXXX.compute.internal: node has no open IP address slots.
Windows nodes support one network interface per node. Each Windows node can run as many pods as the available IP addresses per network interface, minus one. To resolve this issue, scale up the number of Windows nodes.
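The per-node capacity and the scale-up described above can be sketched numerically; the interface size and pod count below are hypothetical values for illustration:

```shell
ips_per_interface=14   # hypothetical IPv4 addresses on the Windows node's single interface
total_windows_pods=50  # hypothetical number of Windows pods to schedule
# Each Windows node can run (addresses per interface - 1) pods
pods_per_node=$(( ips_per_interface - 1 ))
# Round up: number of Windows nodes needed to host all pods
nodes_needed=$(( (total_windows_pods + pods_per_node - 1) / pods_per_node ))
echo "pods per node: $pods_per_node, nodes needed: $nodes_needed"
```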
If the IP addresses aren’t the issue, then review the Amazon VPC admission controller pod event and logs.
Run either of the following commands to confirm that the Amazon VPC admission controller pod is created:
$ kubectl get pods -n kube-system
$ kubectl get pods -n kube-system | grep "vpc-admission"
Example output:
vpc-admission-webhook-5bfd555984-fkj8z 1/1 Running 0 25m
Run the following command to get information about the pod:
$ kubectl describe pod vpc-admission-webhook-5bfd555984-fkj8z -n kube-system
Example output:
Normal Scheduled 27m default-scheduler Successfully assigned kube-system/vpc-admission-webhook-5bfd555984-fkj8z to ip-xx-xx-xx-xx.ec2.internal
Normal Pulling 27m kubelet Pulling image "xxxxxxx.dkr.ecr.xxxx.amazonaws.com/eks/vpc-admission-webhook:v0.2.7"
Normal Pulled 27m kubelet Successfully pulled image "xxxxxxx.dkr.ecr.xxxx.amazonaws.com/eks/vpc-admission-webhook:v0.2.7" in 1.299938222s
Normal Created 27m kubelet Created container vpc-admission-webhook
Normal Started 27m kubelet Started container vpc-admission-webhook
Run the following command to check the pod logs for any configuration issues:
$ kubectl logs vpc-admission-webhook-5bfd555984-fkj8z -n kube-system
Example output:
I1109 07:32:59.352298 1 main.go:72] Initializing vpc-admission-webhook version v0.2.7.
I1109 07:32:59.352866 1 webhook.go:145] Setting up webhook with OSLabelSelectorOverride: windows.
I1109 07:32:59.352908 1 main.go:105] Webhook Server started.
I1109 07:32:59.352933 1 main.go:96] Listening on :61800 for metrics and healthz
I1109 07:39:25.778144 1 webhook.go:289] Skip mutation for as the target platform is .
The preceding output shows that the container started successfully. The pod then adds the vpc.amazonaws.com/PrivateIPv4Address label to the application pod. However, the manifest for the application pod must contain a node selector or affinity so that the pod is scheduled on the Windows nodes.
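For example, a minimal pod spec fragment that pins the workload to Windows nodes uses the standard kubernetes.io/os label as a node selector; the pod name and image below are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-windows-pod   # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/os: windows   # schedule only on Windows nodes
  containers:
    - name: app
      image: mcr.microsoft.com/windows/servercore:ltsc2022   # hypothetical image
```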
Other options to troubleshoot the issue include verifying the following:
- You deployed the Amazon VPC admission controller pod in the kube-system namespace.
- Logs or events aren't pointing to an expired certificate. If the certificate is expired and Windows pods are stuck in the ContainerCreating state, then you must delete and redeploy the pods.
- There aren’t any timeouts or DNS-related issues.
If you don’t create the Amazon VPC admission controller, then turn on Windows support for your cluster.
Important: Amazon EKS doesn’t require you to turn on the Amazon VPC admission controller to support Windows node groups. If you turned on the Amazon VPC admission controller, then remove legacy Windows support from your data plane.
When I used Calico as the CNI, I faced a similar issue.
The container remained in the creating state. I checked /etc/cni/net.d and /opt/cni/bin on the master; both are present, but I'm not sure if they're required on the worker node as well.
root@KubernetesMaster:/opt/cni/bin# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5c7588df-5zds6 0/1 ContainerCreating 0 21m
root@KubernetesMaster:/opt/cni/bin# kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetesmaster Ready master 26m v1.13.4
kubernetesslave1 Ready <none> 22m v1.13.4
root@KubernetesMaster:/opt/cni/bin#
kubectl describe pods
Name: nginx-5c7588df-5zds6
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: kubernetesslave1/10.0.3.80
Start Time: Sun, 17 Mar 2019 05:13:30 +0000
Labels: app=nginx
pod-template-hash=5c7588df
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/nginx-5c7588df
Containers:
nginx:
Container ID:
Image: nginx
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qtfbs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-qtfbs:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qtfbs
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned default/nginx-5c7588df-5zds6 to kubernetesslave1
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "123d527490944d80f44b1976b82dbae5dc56934aabf59cf89f151736d7ea8adc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8cc5e62ebaab7075782c2248e00d795191c45906cc9579464a00c09a2bc88b71" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "30ffdeace558b0935d1ed3c2e59480e2dd98e983b747dacae707d1baa222353f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "630e85451b6ce2452839c4cfd1ecb9acce4120515702edf29421c123cf231213" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "820b919b7edcfc3081711bb78b79d33e5be3f7dafcbad29fe46b6d7aa22227aa" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abbfb5d2756f12802072039dec20ba52f546ae755aaa642a9a75c86577be589f" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfeb46ffda4d0f8a434f3f3af04328fcc4b6c7cafaa62626e41b705b06d98cc4" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9ae3f47bb0282a56e607779d3267127ee8b0ae1d7f416f5a184682119203b1c8" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Warning FailedCreatePodSandBox 18m kubelet, kubernetesslave1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "546d07f1864728b2e2675c066775f94d658e221ada5fb4ed6bf6689ec7b8de23" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
Normal SandboxChanged 18m (x12 over 18m) kubelet, kubernetesslave1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 3m39s (x829 over 18m) kubelet, kubernetesslave1 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f586be437843537a3082f37ad139c88d0eacfbe99ddf00621efd4dc049a268cc" network for pod "nginx-5c7588df-5zds6": NetworkPlugin cni failed to set up pod "nginx-5c7588df-5zds6_default" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
root@KubernetesMaster:/etc/cni/net.d#
On the worker node, NGINX is trying to come up but keeps getting exited. I'm not sure what's going on here; I'm new to Kubernetes and not able to fix this issue:
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 3 minutes ago Up 3 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
94b2994401d0 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_534
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f72500cae2b7 k8s.gcr.io/pause:3.1 "/pause" 1 second ago Up Less than a second k8s_POD_nginx-5c7588df-5zds6_default_677a722b-4873-11e9-a33a-06516e7d78c4_585
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 4 minutes ago Up 4 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
root@kubernetesslave1:/home/ubuntu# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5ad5500e8270 fadcc5d2b066 "/usr/local/bin/kube…" 5 minutes ago Up 5 minutes k8s_kube-proxy_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
b1c9929ebe9e k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_calico-node-749qx_kube-system_4e2d8c9c-4873-11e9-a33a-06516e7d78c4_1
ceb78340b563 k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_kube-proxy-f24gd_kube-system_4e2d313a-4873-11e9-a33a-06516e7d78c4_1
I checked /etc/cni/net.d and /opt/cni/bin on the worker node as well; they are there:
root@kubernetesslave1:/home/ubuntu# cd /etc/cni
root@kubernetesslave1:/etc/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 net.d
root@kubernetesslave1:/etc/cni# cd /opt/cni
root@kubernetesslave1:/opt/cni# ls -ltr
total 4
drwxr-xr-x 2 root root 4096 Mar 17 05:19 bin
root@kubernetesslave1:/opt/cni# cd bin
root@kubernetesslave1:/opt/cni/bin# ls -ltr
total 107440
-rwxr-xr-x 1 root root 3890407 Aug 17 2017 bridge
-rwxr-xr-x 1 root root 3475802 Aug 17 2017 ipvlan
-rwxr-xr-x 1 root root 3520724 Aug 17 2017 macvlan
-rwxr-xr-x 1 root root 3877986 Aug 17 2017 ptp
-rwxr-xr-x 1 root root 3475750 Aug 17 2017 vlan
-rwxr-xr-x 1 root root 9921982 Aug 17 2017 dhcp
-rwxr-xr-x 1 root root 2605279 Aug 17 2017 sample
-rwxr-xr-x 1 root root 32351072 Mar 17 05:19 calico
-rwxr-xr-x 1 root root 31490656 Mar 17 05:19 calico-ipam
-rwxr-xr-x 1 root root 2856252 Mar 17 05:19 flannel
-rwxr-xr-x 1 root root 3084347 Mar 17 05:19 loopback
-rwxr-xr-x 1 root root 3036768 Mar 17 05:19 host-local
-rwxr-xr-x 1 root root 3550877 Mar 17 05:19 portmap
-rwxr-xr-x 1 root root 2850029 Mar 17 05:19 tuning
root@kubernetesslave1:/opt/cni/bin#
Failed Create Pod Sandbox
Overview
Pods remain in Pending phase
This was on a Managed GKE cluster
Check RunBook Match
When you see these errors:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedSync 47m (x27 over 3h) kubelet error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedCreatePodSandBox 7m56s (x38 over 3h23m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "xxx": operation timeout: context deadline exceeded
Also noticed the following when running kubectl top nodes:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
node-1-xxx 158m 16% 2311Mi 87%
node-2-xxx 111m 11% 1456Mi 55%
node-3-xxx 391m 41% 1662Mi 63%
node-4-xxx 169m 17% 2210Mi 83%
node-5-xxx <unknown> <unknown> <unknown> <unknown>
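As a sketch, a node reporting <unknown> metrics can be picked out of the kubectl top nodes output automatically; the sample below is hypothetical, so on a live cluster pipe the command's output directly instead:

```shell
# Hypothetical output of: kubectl top nodes
sample='NAME         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
node-1-xxx   158m         16%         2311Mi          87%
node-5-xxx   <unknown>    <unknown>   <unknown>       <unknown>'
# Nodes whose metrics are unknown usually indicate kubelet problems
bad_nodes=$(echo "$sample" | awk 'NR > 1 && $2 == "<unknown>" {print $1}')
echo "Nodes with unknown metrics: ${bad_nodes:-none}"
```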
Node events also showed that the node was repeatedly restarting. We ran kubectl describe node node-5-xxx:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotReady 3m15s (x2344 over 10d) kubelet Node node-5-xxx status is now: NodeNotReady
Normal NodeReady 2m15s (x2345 over 49d) kubelet Node node-5-xxx status is now: NodeReady
Initial Steps Overview
- Investigate the failing pod
- Identify the node the pod is meant to be scheduled on
- Investigate the node
Detailed Steps
1) Investigate the failing pod
Check the logs of the pod:
$ kubectl logs pod-xxx
Check the events of the pod:
$ kubectl describe pod pod-xxx
2) Investigate the node the pod is meant to be scheduled on
Describe the pod to see which node it is meant to be running on:
$ kubectl describe pod pod-xxx
Output will start with something like this; look for the "Node:" part:
Name: pod-xxx
Namespace: namespace-xxx
Priority: 0
Node: node-xxx
3) Investigate the node
Check the resources of the nodes:
$ kubectl top nodes
Check the events of the node you identified the pod was meant to be scheduled on:
$ kubectl describe node node-xxx
Solutions List
A) Remove problematic node
Solutions Detail
The solution was to remove the problematic node; see more details below.
A) Remove problematic node
Check Resolution
Further Steps
- Create Extra Node
- Drain Problematic Node
- Delete Problematic Node
1) Create Extra Node
You may need to create a new node before draining this node. Check with kubectl top nodes to see if the remaining nodes have enough capacity to schedule your drained pods.
If you see that you need an extra node before you drain, make sure to create it first.
In our situation we were using Managed GKE cluster, so we added a new node via the console.
2) Drain Problematic Node
Once you are sure there is enough capacity amongst your remaining nodes to schedule the pods that are on the problematic node, then you can go ahead and drain the node.
$ kubectl drain node-xxx --ignore-daemonsets
Note: The --ignore-daemonsets flag is needed when DaemonSet-managed pods (for example, kube-proxy) are running on the node.
3) Delete Problematic Node
Check that all scheduled pods have been drained off the node:
$ kubectl get nodes
Once done you can delete the node:
$ kubectl delete node node-xxx
Further Information
Owner
Cari Liebenberg
About k8s deployment error resolution
- Error message
Warning FailedCreatePodSandBox 89s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "1c97ad2710e2939c0591477f9d6dde8e0d7d31b3fbc138a7fa38aaa657566a9a" network for pod "coredns-7f89b7bc75-qg924": networkPlugin cni failed to set up pod "coredns-7f89b7bc75-qg924_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), failed to clean up sandbox container "1c97ad2710e2939c0591477f9d6dde8e0d7d31b3fbc138a7fa38aaa657566a9a" network for pod "coredns-7f89b7bc75-qg924": networkPlugin cni failed to teardown pod "coredns-7f89b7bc75-qg924_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
- Performance status
[root@linux03 ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-7f89b7bc75-jzs26 0/1 ContainerCreating 0 63s
coredns-7f89b7bc75-qg924 0/1 ContainerCreating 0 63s
# coredns cannot run
- Change calico.yaml
# Cluster type to identify the deployment type
- name: CLUSTER_TYPE
  value: "k8s,bgp"
# New below
- name: IP_AUTODETECTION_METHOD
  value: "interface=ens192"
# ens192 is the local NIC name
- kubectl apply -f calico.yaml
- Check that it is running
[root@linux03 ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-69496d8b75-2nm5k 1/1 Running 0 23m
calico-node-8wfk9 1/1 Running 0 23m
calico-node-9vn4v 1/1 Running 0 23m
calico-node-qm8s2 1/1 Running 0 23m
coredns-7f89b7bc75-jzs26 1/1 Running 0 26m
coredns-7f89b7bc75-qg924 1/1 Running 0 26m
etcd-linux03 1/1 Running 0 26m
kube-apiserver-linux03 1/1 Running 0 26m
kube-controller-manager-linux03 1/1 Running 0 26m
kube-proxy-29lcf 1/1 Running 0 25m
kube-proxy-c29wz 1/1 Running 0 26m
kube-proxy-lpgrr 1/1 Running 0 25m
kube-scheduler-linux03 1/1 Running 0 26m