Error from server (BadRequest): container in pod is waiting to start: ContainerCreating

kubernetes pods stuck at containercreating #722 (Comments): I have a Raspberry Pi cluster (one master, 3 nodes). My base image is Raspbian Stretch Lite. I already set up a basic Kubernetes cluster where the master can see all of its nodes (kubectl get nodes) and they're all running. I used the Weave network […]

Contents

  1. kubernetes pods stuck at containercreating #722
  2. Comments
  3. Kubeadm/Kubernetes dashboard pod is in ContainerCreating status after installation #2863
  4. Comments
  5. cluster-api-provider-vsphere / docs / troubleshooting.md
  6. Unable to start pod/container in lab 2.3 — Error message is "Error from server (BadRequest) .."
  7. Best Answers
  8. Answers

kubernetes pods stuck at containercreating #722

I have a Raspberry Pi cluster (one master, 3 nodes).

My base image is Raspbian Stretch Lite.

I already set up a basic Kubernetes cluster where the master can see all of its nodes (kubectl get nodes) and they're all running.
I used the Weave network plugin for the network communication.

When everything was set up I tried to run an nginx pod (first with some replicas, but now just 1 pod) on my cluster as follows:
kubectl run my-nginx --image=nginx

But somehow the pod gets stuck in the "ContainerCreating" status. When I run docker images I can't see the nginx image being pulled, and normally an nginx image is not that large, so it should have been pulled by now (15 minutes).
kubectl describe pods gives the error that the pod sandbox failed to be created and that Kubernetes will re-create it.

I searched everything about this issue and tried the solutions on Stack Overflow (rebooting to restart the cluster, reading describe pods, trying a new network plugin (flannel)), but I can't see what the actual problem is.
I did the exact same thing in VirtualBox (just Ubuntu, not ARM) and everything worked.

First I thought it was a permission issue because I run everything as a normal user, but in the VM I did the same thing and nothing changed.
Then I checked kubectl get pods --all-namespaces to verify that the pods for the Weave network and kube-dns are running, and nothing is wrong there either.

Is this a firewall issue on the Raspberry Pi?
Is the Weave network plugin not compatible with ARM devices (even though the Kubernetes website says it is)?
I'm guessing there is an API network problem and that's why I can't get my pod running on a node.

kubectl describe podName

kubectl logs podName

Source

Kubeadm/Kubernetes dashboard pod is in ContainerCreating status after installation #2863

Environment
Steps to reproduce
  1. install kubeadm: sudo apt-get install kubeadm kubectl kubelet kubernetes-cni
  2. kubeadm init
  3. Configure kubectl access:
    mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
  4. Install dashboard:
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
Observed result

NAMESPACE     NAME                                    READY   STATUS              RESTARTS   AGE
kube-system   kube-dns-6f4fd4bdf-pkpbf                0/3     ContainerCreating   0          5h
kube-system   kubernetes-dashboard-7f645d76f6-gpbhq   0/1     ContainerCreating   0          1h

kubectl logs kubernetes-dashboard-7f645d76f6-gpbhq --namespace=kube-system

Error from server (BadRequest): container "kubernetes-dashboard" in pod "kubernetes-dashboard-7f645d76f6-gpbhq" is waiting to start: ContainerCreating

The pod stays in "ContainerCreating" status and, consequently, the dashboard can't be accessed via the proxy due to a "Service is not available" error.
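
Since kubectl logs can't return anything for a container that hasn't started, the pod's events are the more useful place to look. A minimal sketch, using the same pod and namespace as above:

kubectl describe pod kubernetes-dashboard-7f645d76f6-gpbhq --namespace=kube-system
kubectl get events --namespace=kube-system --sort-by=.metadata.creationTimestamp

The Events section at the bottom of the describe output usually names the reason the sandbox or container can't be created (for example, a missing CNI network plugin or an image pull failure).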

Expected result

Generate correctly working dashboard service and pods

Comments

I am not sure why this happens.

Source

cluster-api-provider-vsphere / docs / troubleshooting.md

This is a guide on how to troubleshoot issues related to the Cluster API provider for vSphere (CAPV).

This section describes how to debug issues that occur while trying to deploy a new cluster with clusterctl and CAPV.

Bootstrapping with logging

The first step to figuring out what went wrong is to increase the logging.

Adjusting log levels

There are three places to adjust the log level when bootstrapping a cluster.

Adjusting the CAPI manager log level

The following steps may be used to adjust the CAPI manager’s log level:

Open the provider-components.yaml file, e.g. ./out/management-cluster/provider-components.yaml

Search for cluster-api/cluster-api-controller

Modify the pod spec for the CAPI manager to indicate where to send logs and the log level:

A log level of six should provide additional information useful for figuring out most issues.

Adjusting the CAPV manager log level

Open the provider-components.yaml file, e.g. ./out/management-cluster/provider-components.yaml

Search for cluster-api-provider-vsphere

Modify the pod spec for the CAPV manager to indicate the log level:

A log level of six should provide additional information useful for figuring out most issues.

Adjusting the clusterctl log level

The clusterctl log level may be specified when running clusterctl:
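
A sketch of such an invocation is shown below. The subcommand and most flags vary across the older clusterctl releases used with CAPV, and the manifest paths are simply the ./out/management-cluster files referenced earlier in this guide, so treat everything except the final flag as illustrative:

clusterctl create cluster \
  --provider vsphere \
  -c ./out/management-cluster/cluster.yaml \
  -m ./out/management-cluster/machines.yaml \
  -p ./out/management-cluster/provider-components.yaml \
  -a ./out/management-cluster/addons.yaml \
  -v 6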

The last line of the above command, -v 6 , tells clusterctl to log messages at level six. This should provide additional information that may be used to diagnose issues.

Accessing the logs in the bootstrap cluster

The clusterctl logs are client-side only. The more interesting information is occurring inside of the bootstrap cluster. This section describes how to access the logs in the bootstrap cluster.

Exporting the kubeconfig

To make the subsequent steps easier, please go ahead and export a KUBECONFIG environment variable to point to the bootstrap cluster that is or will be running via Kind:
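
For example, assuming the bootstrap cluster is a kind cluster named clusterapi (the name is an assumption; use whatever name your tooling created):

export KUBECONFIG="$(kind get kubeconfig-path --name=clusterapi)"   # older kind releases
# or, with newer kind releases:
kind get kubeconfig --name clusterapi > /tmp/bootstrap.kubeconfig
export KUBECONFIG=/tmp/bootstrap.kubeconfig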

Following the CAPI manager logs

The following command may be used to follow the logs from the CAPI manager:
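
A sketch of one way to do this, assuming the CAPI controller runs as a StatefulSet named cluster-api-controller-manager in the cluster-api-system namespace with a container called manager (names vary by CAPI release); the loop simply retries until the pod exists:

while ! kubectl -n cluster-api-system logs \
    statefulset/cluster-api-controller-manager -c manager -f 2>/dev/null; do
  sleep 5
done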

The above command immediately begins trying to follow the CAPI manager log, even before the bootstrap cluster and the CAPI manager pod exist. Once the latter is finally available, the command will start following its log.

Following the CAPV manager logs

To tail the logs from the CAPV manager image, use the following command:
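
A similar sketch for CAPV, using the capv-system namespace and capv-controller-manager workload that also appear later in this guide (the manager container name is an assumption):

while ! kubectl -n capv-system logs \
    deployment/capv-controller-manager -c manager -f 2>/dev/null; do
  sleep 5
done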

The above command immediately begins trying to follow the CAPV manager log, even before the bootstrap cluster and the CAPV manager pod exist. Once the latter is finally available, the command will start following its log.

Following Kubernetes core component logs

Solving issues may also require accessing the logs from the bootstrap cluster’s core components:

  • The API server
  • The controller manager
  • The scheduler
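
In a kubeadm-based bootstrap cluster (which is what kind creates) these components run as static pods in the kube-system namespace, so their logs can be followed the same way; a sketch, with the node-name suffix depending on your node names:

kubectl -n kube-system get pods
kubectl -n kube-system logs -f kube-apiserver-<node-name>
kubectl -n kube-system logs -f kube-controller-manager-<node-name>
kubectl -n kube-system logs -f kube-scheduler-<node-name>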

This section contains issues commonly encountered by people using CAPV.

Ensure prerequisites are up to date

The Getting Started guide lists the prerequisites for deploying clusters with CAPV. Make sure those prerequisites, such as clusterctl, kubectl, and kind, are up to date.

Missing manifest files during bootstrap phase

If you previously tested an older version of CAPV, you may be using an out-of-date manifest Docker image. You can remedy this by removing the existing CAPV manifest image with docker rmi gcr.io/cluster-api-provider-vsphere/release/manifests:latest or by updating the command to specify a specific manifest image, for example:
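
A hedged sketch of what pinning the manifest image might look like; the <version> tag, the /out mount, and any trailing container arguments are placeholders based on the envvars.txt bind mount described below, not an exact CAPV release command:

docker rmi gcr.io/cluster-api-provider-vsphere/release/manifests:latest

docker run --rm \
  -v "$(pwd)/envvars.txt":/envvars.txt:ro \
  -v "$(pwd)/out":/out \
  gcr.io/cluster-api-provider-vsphere/release/manifests:<version>   # pin an explicit tag instead of :latest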

This will ensure that the desired image is being used.

envvars.txt is a directory

When generating the YAML manifest the following error may occur:

This means that "$(pwd)/envvars.txt" does not refer to an existing file on the local host. So instead of bind mounting a file into the container, Docker created a new directory on the local host at the path "$(pwd)/envvars.txt" and bind mounted it into the container.

Make sure the path to the envvars.txt file is correct before using it to generate the YAML manifests.

Failed to retrieve kubeconfig secret

When bootstrapping the management cluster, the vSphere manager log may emit errors similar to the following:

The above error does not mean there is a problem. Kubernetes components operate in a reconciliation model — a control loop attempts to reconcile the desired state over and over until it is achieved or a timeout occurs.

The error message simply indicates that the first control plane node for the target cluster has not yet come online and provided the information necessary to generate the kubeconfig for the target cluster.

It is quite typical to see many errors in Kubernetes service logs, from the API server, to the controller manager, to the kubelet — the errors eventually stop as the expected configuration is provided and the desired state is reached.

Timed out while failing to retrieve kubeconfig secret

When clusterctl times out waiting for the management cluster to come online, and the vSphere manager log repeats failed to retrieve kubeconfig secret for Cluster over and over again, it means there was an error bringing the management cluster’s first control plane node online. Possible reasons include:

Cannot access the vSphere endpoint

Two common causes for a failed deployment are related to accessing the remote vSphere endpoint:

  1. The host from which clusterctl is executed must have access to the vSphere endpoint to which the management cluster is being deployed.
  2. The provided vSphere credentials are invalid.

A quick way to validate both access and the credentials is to use the govc program or its container image, vmware/govc:
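
A minimal sketch with a locally installed govc (the vmware/govc container image works the same way if you pass these variables with -e); the endpoint and credentials are placeholders:

export GOVC_URL='https://vcenter.example.com'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='<password>'
export GOVC_INSECURE=true   # only if the endpoint uses a self-signed certificate
govc about

If govc about prints the endpoint's version information, both connectivity and the credentials are fine.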

If the above command fails then there is an issue with accessing the vSphere endpoint, and it must be corrected before clusterctl will succeed.

A VM with the same name already exists

Deployed VMs get their names from the names of the machines in machines.yaml and machineset.yaml. If a VM with the same name already exists in the same location as one of the VMs that would be created by a new cluster, then the new cluster will fail to deploy and the CAPV manager log will include an error similar to the following:

Use the govc image to check to see if there is a VM with the same name:
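
A sketch with govc again; the VM name is whatever appears in your machines.yaml:

govc vm.info <machine-name>
# or search the full inventory:
govc find / -type m -name '<machine-name>'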

A static IP address must include the segment length

Another common error is to omit the segment length when using a static IP address. For example:

The above network configuration defines a static IP address, 192.168.6.20, but also includes the required segment length. Without the segment length, clusterctl will time out waiting for the control plane to come online.

A machine with multiple networks may cause the bootstrap process to fail for various reasons.

Multiple default routes

A machine that defines two networks may lead to failure if both networks use DHCP and two default routes are defined on the guest. For example:

The above network configuration from a machine definition includes two network devices, both using DHCP. This likely causes two default routes to be defined on the guest, meaning it's not possible to determine the default IPv4 address that should be used by Kubernetes.

Preferring an IP address

Another reason a machine with two networks can lead to failure is because the order in which IP addresses are returned externally from a VM is not guaranteed to be the same order as they are when inspected inside the guest. The solution for this is to define a preferred CIDR — the network segment that contains the IP that the kubeadm bootstrap process selected for the API server. For example:

The above network definition specifies the CIDR to which the IP address belongs that is bound to the Kubernetes API server on the guest.

Network Time Protocol (NTP) related problems causing Kubernetes CA related problems

During the bootstrapping process a CA certificate is transferred to the new VM. This CA has a "not valid until" date associated with it. If the ESXi host does not have NTP properly configured, there is a chance the kubeadm bootstrapping process will fail, with output similar to this in the /var/log/cloud-init-output.log log on the VM:

The solution is to either properly configure NTP on the ESXi host (see the VMware KB article "Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Web Client" (57147)) or add an NTP server block to the KubeadmConfig:

Machine object stuck in a provisioning state

This section discusses issues that can cause a Machine object to be stuck in a provisioning state.

To troubleshoot these types of scenarios, the capv-controller-manager logs are a good starting point. These logs can be retrieved using kubectl logs capv-controller-manager-88f646758-nj8fs -n capv-system (substitute your own pod name).

VM folder does not exist

One of the scenarios where a machine object fails to provision successfully and gets stuck in a provisioning state is when the VM folder specified in the manifest does not exist. The following error messages can be seen in the capv-controller-manager logs:

Source

Unable to start pod/container in lab 2.3 — Error message is "Error from server (BadRequest) .."

I have completed lab 2.2 to setup the Kubernetes cluster. When I tried to create a pod/container in lab 2.3, I keep seeing this error:

Error from server (BadRequest): container "nginx" in pod "nginx" is waiting to start: ContainerCreating

When I do a kubectl describe pod nginx, I see these errors:

...
Warning  FailedCreatePodSandBox  95s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown
desc = failed to mount container k8s_POD_nginx_default_e1106b28-303c-4e75-afc2-d6d14bd67913_0 in pod sandbox
k8s_nginx_default_e1106b28-303c-4e75-afc2-d6d14bd67913_0(507780f27bf6a769b6e7178ebe52a032e8967f2af9d720f1931933e0a202c917):
error creating overlay mount to /var/lib/containers/storage/overlay/0c9cccedaee7f6a42d1546dc06d3100072fb4ac860040aeb7b58d85d3e39c9ac/merged,
mount_data="nodev,metacopy=on,lowerdir=/var/lib/containers/storage/overlay/l/4NMXH7DOMDNBJNKEOKA65YCBZS,upperdir=/var/lib/containers/storage/overlay/0c9cccedaee7f6a42d1546dc06d3100072fb4ac860040aeb7b58d85d3e39c9ac/diff,workdir=/var/lib/containers/storage/overlay/0c9cccedaee7f6a42d1546dc06d3100072fb4ac860040aeb7b58d85d3e39c9ac/work": invalid argument
...

What am I doing wrong?

Best Answers

I notice you have AppArmor enabled. That could be the cause of some headaches; does the problem persist when you disable it?

As all the failed pods are on your worker, I would suspect it is either AppArmor, a GCE VPC firewall issue, or a networking issue where the nodes are using IP addresses that overlap with the host.

Could you disable AppArmor on all nodes, ensure your VPC allows all traffic, show the IP ranges used by your primary interface (something like ens4) on both nodes, and post the results afterwards?
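
A hedged sketch of how to gather that on each Ubuntu node; ens4 is the usual primary interface name on GCE, so adjust it if yours differs:

# AppArmor: check what is loaded, then stop/disable it while testing
sudo aa-status
sudo systemctl stop apparmor
sudo systemctl disable apparmor

# IP ranges on the primary interface
ip addr show ens4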

Thanks for your help. I recreated the VMs in GCE and it is working now. I must have misconfigured the VPC the first time. I managed to pass the exam after going through the labs.

Answers

Is this a generic symptom observed on multiple pods or just this one? Please provide the output of the following command:

kubectl get pods -A -o wide

This error is occurring for any pod that I try to create on the cluster. The output of the command is:

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default nginx 0/1 ContainerCreating 0 52s worker
kube-system calico-kube-controllers-5d995d45d6-6mk6b 1/1 Running 1 2d23h 192.168.242.65 cp
kube-system calico-node-s824n 0/1 Init:0/3 0 2d23h 10.2.0.5 worker
kube-system calico-node-zkxrn 1/1 Running 1 2d23h 10.2.0.4 cp
kube-system coredns-78fcd69978-4svtg 1/1 Running 1 2d23h 192.168.242.66 cp
kube-system coredns-78fcd69978-m4nsp 1/1 Running 1 2d23h 192.168.242.67 cp
kube-system etcd-cp 1/1 Running 1 2d23h 10.2.0.4 cp
kube-system kube-apiserver-cp 1/1 Running 1 2d23h 10.2.0.4 cp
kube-system kube-controller-manager-cp 1/1 Running 1 2d23h 10.2.0.4 cp
kube-system kube-proxy-fn5xm 0/1 ContainerCreating 0 2d23h 10.2.0.5 worker
kube-system kube-proxy-trxxb 1/1 Running 1 2d23h 10.2.0.4 cp
kube-system kube-scheduler-cp 1/1 Running 1 2d23h 10.2.0.4 cp

From the look of things, Calico is not running on your worker. There are a few reasons this could happen, but chances are it has to do with a networking configuration error or a firewall between your instances.

What are you using to run the lab exercises: GCE, AWS, DigitalOcean, VMware, VirtualBox, two Linux laptops?

I created the 2 VMs in GCE following the GCE Lab setup video.

Thank you for the provided output. It seems that none of the containers scheduled to the worker node are able to start. The node itself may not be ready.
What may help are the outputs of the following two commands:

kubectl get nodes

kubectl describe node worker

I would also double check that the VPC is allowing all traffic between your VMs, as well.

This is the output of kubectl describe node worker. You can see the last message:

Node worker status is now: NodeReady

Source

I have a Raspberry Pi cluster (one master, 3 nodes).

My base image is Raspbian Stretch Lite.

I already set up a basic Kubernetes cluster where the master can see all of its nodes (kubectl get nodes) and they're all running.
I used the Weave network plugin for the network communication.

When everything was set up I tried to run an nginx pod (first with some replicas, but now just 1 pod) on my cluster as follows:
kubectl run my-nginx --image=nginx

But somehow the pod gets stuck in the "ContainerCreating" status. When I run docker images I can't see the nginx image being pulled, and normally an nginx image is not that large, so it should have been pulled by now (15 minutes).
kubectl describe pods gives the error that the pod sandbox failed to be created and that Kubernetes will re-create it.

I searched everything about this issue and tried the solutions on Stack Overflow (rebooting to restart the cluster, reading describe pods, trying a new network plugin (flannel)), but I can't see what the actual problem is.
I did the exact same thing in VirtualBox (just Ubuntu, not ARM) and everything worked.

First I thought it was a permission issue because I run everything as a normal user, but in the VM I did the same thing and nothing changed.
Then I checked kubectl get pods --all-namespaces to verify that the pods for the Weave network and kube-dns are running, and nothing is wrong there either.

Is this a firewall issue on the Raspberry Pi?
Is the Weave network plugin not compatible with ARM devices (even though the Kubernetes website says it is)?
I'm guessing there is an API network problem and that's why I can't get my pod running on a node.

[EDIT]
Log files

kubectl describe podName

Name:           my-nginx-9d5677d94-g44l6
Namespace:      default
Node:           kubenode1/10.1.88.22
Start Time:     Tue, 06 Mar 2018 08:24:13 +0000
Labels:         pod-template-hash=581233850
                run=my-nginx
Annotations:    <none>
Status:         Pending
IP:
Controlled By:  ReplicaSet/my-nginx-9d5677d94
Containers:
  my-nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           80/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-phdv5 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-phdv5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-phdv5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age   From                Message
  ----     ------                  ----  ----                -------
  Normal   Scheduled               5m    default-scheduler   Successfully assigned my-nginx-9d5677d94-g44l6 to kubenode1
  Normal   SuccessfulMountVolume   5m    kubelet, kubenode1  MountVolume.SetUp succeeded for volume "default-token-phdv5"
  Warning  FailedCreatePodSandBox  1m    kubelet, kubenode1  Failed create pod sandbox.
  Normal   SandboxChanged          1m    kubelet, kubenode1  Pod sandbox changed, it will be killed and re-created.

kubectl logs podName

Error from server (BadRequest): container "my-nginx" in pod "my-nginx-9d5677d94-g44l6" is waiting to start: ContainerCreating


If you try to run an image from ACR (Azure Container Registry) with the following command in the Azure CLI:

kubectl run nodeapp --image=mydanaksacr.azurecr.io/node:v1 --port=8080

the output indicates that the pod was created. However, when you check the pod, the result is:

danny@Azure:~/clouddrive$ kubectl get pods
NAME      READY   STATUS         RESTARTS   AGE
nodeapp   0/1     ErrImagePull   0          36s

After checking the logs with kubectl logs on the pod:

danny@Azure:~/clouddrive$ kubectl logs nodeapp
Error from server (BadRequest): container "nodeapp" in pod "nodeapp" is waiting to start: image can't be pulled

the message indicates that the service principal does not have the right to pull the image from ACR.

Here is the solution: run the following command in the CLI to grant the service principal the acrpull role.

az role assignment create --assignee "<service principal ID>" --role acrpull --scope "<ACR resource ID>"

This is a specific example run in a development environment:

az role assignment create --assignee "34d6880e-bc51-416f-b250-b87904390d0c" --role acrpull --scope "/subscriptions/3f2c3687-9d93-45be-a8e0-b8ca6e4f5944/resourceGroups/MyResourceGroup/providers/Microsoft.ContainerRegistry/registries/myDanAksAcr"

Find out how to troubleshoot issues you might encounter in the following situations.

General troubleshooting

Debug logs

By default, OneAgent logs are located in /var/log/dynatrace/oneagent.

To debug Dynatrace Operator issues, run one of the following commands, depending on whether you use kubectl or oc:

kubectl -n dynatrace logs -f deployment/dynatrace-operator

oc -n dynatrace logs -f deployment/dynatrace-operator

You might also want to check the logs from OneAgent pods deployed through Dynatrace Operator.

kubectl get pods -n dynatrace
NAME                                  READY   STATUS    RESTARTS   AGE
dynatrace-operator-64865586d4-nk5ng   1/1     Running   0          1d
dynakube-oneagent-<id>                1/1     Running   0          22h

kubectl logs dynakube-oneagent-<id> -n dynatrace

oc get pods -n dynatrace
NAME                                  READY   STATUS    RESTARTS   AGE
dynatrace-operator-64865586d4-nk5ng   1/1     Running   0          1d
dynakube-classic-8r2kq                1/1     Running   0          22h

oc logs oneagent-66qgb -n dynatrace

Troubleshoot common Dynatrace Operator setup issues using the troubleshoot subcommand

Dynatrace Operator version 0.9.0+

Run the command below to retrieve a basic output on DynaKube status, such as:

  • Namespace: If the dynatrace namespace exists (name can be overwritten via parameter)

  • DynaKube:

    • If CustomResourceDefinition exists
    • If CustomResource with the given name exists (name can be overwritten via parameter)
    • If the API URL ends with /api
    • If the secret name is the same as DynaKube (or .spec.tokens if used)
    • If the secret has apiToken and paasToken set
    • If the secret for customPullSecret is defined
  • Environment: If your environment is reachable from the Dynatrace Operator pod using the same parameters as the Dynatrace Operator binary (such as proxy and certificate).

  • OneAgent and ActiveGate image: If the registry is accessible; if the image is accessible from the Dynatrace Operator pod using the registry from the environment with (custom) pull secret.

kubectl exec deploy/dynatrace-operator -n dynatrace -- dynatrace-operator troubleshoot

Note: If you use a different DynaKube name, add the --dynakube <your_dynakube_name> argument to the command.

Example output if there are no errors for the above-mentioned fields:

{"level":"info","ts":"2022-09-12T08:45:21.437Z","logger":"dynatrace-operator-version","msg":"dynatrace-operator","version":"<operator version>","gitCommit":"<commithash>","buildDate":"<release date>","goVersion":"<go version>","platform":"<platform>"}
[namespace ] --- checking if namespace 'dynatrace' exists ...
[namespace ] √ using namespace 'dynatrace'
[dynakube  ] --- checking if 'dynatrace:dynakube' Dynakube is configured correctly
[dynakube  ] CRD for Dynakube exists
[dynakube  ] using 'dynatrace:dynakube' Dynakube
[dynakube  ] checking if api url is valid
[dynakube  ] api url is valid
[dynakube  ] checking if secret is valid
[dynakube  ] 'dynatrace:dynakube' secret exists
[dynakube  ] secret token 'apiToken' exists
[dynakube  ] customPullSecret not used
[dynakube  ] pull secret 'dynatrace:dynakube-pull-secret' exists
[dynakube  ] secret token '.dockerconfigjson' exists
[dynakube  ] proxy secret not used
[dynakube  ] √ 'dynatrace:dynakube' Dynakube is valid
[dtcluster ] --- checking if tenant is accessible ...
[dtcluster ] √ tenant is accessible

Debug configuration and monitoring issues using the Kubernetes Monitoring Statistics extension

The Kubernetes Monitoring Statistics extension can help you:

  • Troubleshoot your Kubernetes monitoring setup
  • Troubleshoot your Prometheus integration setup
  • Get detailed insights into queries from Dynatrace to the Kubernetes API
  • Receive alerts when your Kubernetes monitoring setup experiences issues
  • Get alerted on slow response times of your Kubernetes API

Set up monitoring errors

Pods stuck in Terminating state after upgrade

If your CSI driver and OneAgent pods get stuck in Terminating state after upgrading from Dynatrace Operator version 0.9.0, you need to manually delete the pods that are stuck.

Run the command below.

kubectl delete pod -n dynatrace --selector=app.kubernetes.io/component=csi-driver,app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/version=0.9.0 --force --grace-period=0

oc delete pod -n dynatrace --selector=app.kubernetes.io/component=csi-driver,app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/version=0.9.0 --force --grace-period=0

Unable to retrieve the complete list of server APIs

Dynatrace Operator

Example error:

unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request

If the Dynatrace Operator pod logs this error, you need to identify and fix the problematic services. To identify them:

  1. Check the available resources:

kubectl api-resources

  2. If the command returns this error, list all the API services and make sure none of them are in a False state (a quick filter for this follows the list):

kubectl get apiservice
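
A minimal sketch of that filter, assuming the default kubectl table output where the AVAILABLE column reads False for broken services:

kubectl get apiservice | grep False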

CrashLoopBackOff: Downgrading OneAgent is not supported, please uninstall the old version first

Dynatrace Operator

If you get this error, the OneAgent version installed on your host is later than the version you’re trying to run.

Solution: First uninstall OneAgent from the host, and then select your desired version in the Dynatrace web UI or in DynaKube. To uninstall OneAgent, connect to the host and run the uninstall.sh script. (The default location is /opt/dynatrace/oneagent/agent/uninstall.sh)

Note: For CSI driver deployments, use the following steps instead (a command-level sketch follows the list):

  1. Delete the Dynakube custom resources.
  2. Delete the CSI driver manifest.
  3. Delete the /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com directory from all Kubernetes nodes.
  4. Reapply the CSI driver and DynaKube custom resources.
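
A sketch of those steps as commands; kubernetes-csi.yaml is the CSI driver manifest referenced later on this page, and dynakube.yaml stands in for whatever file holds your DynaKube custom resource (both names are placeholders for what you actually applied):

kubectl delete dynakube --all -n dynatrace
kubectl delete -f kubernetes-csi.yaml

# on every Kubernetes node:
sudo rm -rf /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com

kubectl apply -f kubernetes-csi.yaml
kubectl apply -f dynakube.yaml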

Crash loop on pods when installing OneAgent

Application-only monitoring

If you get a crash loop on the pods when you install OneAgent, you need to increase the CPU and memory resources of the pods.

Deployment seems successful but the dynatrace-oneagent container doesn’t show up as ready

DaemonSet

kubectl get ds/dynatrace-oneagent --namespace=kube-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR                 AGE
dynatrace-oneagent   1         1         0       1            0           beta.kubernetes.io/os=linux   14m

kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
09:46:18 Started agent deployment as Docker image, PID 1234.
09:46:18 Agent installer can only be downloaded from secure location. Your installer URL should start with 'https': REPLACE_WITH_YOUR_URL

Change the value REPLACE_WITH_YOUR_URL in the dynatrace-oneagent.yml DaemonSet with the Dynatrace OneAgent installer URL.

oc get pods
NAME                       READY   STATUS         RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ErrImagePull   0          3s

oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: image can't be pulled

This is typically the case if the dynatrace service account hasn’t been allowed to pull images from the RHCC.

Deployment seems successful, however the dynatrace-oneagent image can’t be pulled

DaemonSet

Example error:

oc get pods
NAME                       READY   STATUS         RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ErrImagePull   0          3s

oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: image can't be pulled

This is typically the case if the dynatrace service account hasn’t been allowed to pull images from the RHCC.

Deployment seems successful, but the dynatrace-oneagent container doesn’t produce meaningful logs

DaemonSet

Example error:

kubectl get pods --namespace=kube-system
NAME                       READY   STATUS              RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ContainerCreating   0          3s

kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: ContainerCreating

oc get pods
NAME                       READY   STATUS              RESTARTS   AGE
dynatrace-oneagent-abcde   0/1     ContainerCreating   0          3s

oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: ContainerCreating

This is typically the case if the container hasn’t yet fully started. Simply wait a few more seconds.

Deployment seems successful, but the dynatrace-oneagent container isn’t running

DaemonSet

oc process -f dynatrace-oneagent-template.yml ONEAGENT_INSTALLER_SCRIPT_URL="[oneagent-installer-script-url]" | oc apply -f -
daemonset "dynatrace-oneagent" created

Please note that quotes are needed to protect the special shell characters in the OneAgent installer URL.

oc get pods
No resources found.

This is typically the case if the dynatrace service account hasn't been configured to run privileged pods.

oc describe ds/dynatrace-oneagent
Name:           dynatrace-oneagent
Image(s):       dynatrace/oneagent
Selector:       name=dynatrace-oneagent
Node-Selector:  <none>
Labels:         template=dynatrace-oneagent
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen  LastSeen  Count  From          SubObjectPath  Type     Reason        Message
  ---------  --------  -----  ----          -------------  ----     ------        -------
  6m         3m        17     {daemon-set }                Warning  FailedCreate  Error creating: pods "dynatrace-oneagent-" is forbidden: unable to validate against any security context constraint: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]

Deployment was successful, but monitoring data isn’t available in Dynatrace

DaemonSet

Example:

kubectl get pods --namespace=kube-system
NAME                       READY   STATUS    RESTARTS   AGE
dynatrace-oneagent-abcde   1/1     Running   0          1m

oc get pods
NAME                       READY   STATUS    RESTARTS   AGE
dynatrace-oneagent-abcde   1/1     Running   0          1m

This is typically caused by a timing issue that occurs if application containers have started before OneAgent was fully installed on the system. As a consequence, some parts of your application run uninstrumented. To be on the safe side, OneAgent should be fully integrated before you start your application containers. If your application has already been running, restarting its containers will have the very same effect.

No pods scheduled on control-plane nodes

DaemonSet

Kubernetes version 1.24+

Taints on master and control plane nodes are changed on Kubernetes versions 1.24+, and the OneAgent DaemonSet is missing appropriate tolerations in the DynaKube custom resource.

To add the necessary tolerations, edit the DynaKube YAML as follows.

tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists

Error when applying the custom resource on GKE

Example error:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.dynatrace.com": Post "https://dynatrace-webhook.dynatrace.svc:443/validate?timeout=2s": context deadline exceeded

If you are getting this error when trying to apply the custom resource on your GKE cluster, the firewall is blocking requests from the Kubernetes API to the Dynatrace Webhook because the required port (8443) is blocked by default.

The default allowed ports (443 and 10250) on GCP refer to the ports exposed by your nodes and pods, not the ports exposed by any Kubernetes services. For example, if the cluster control plane attempts to access a service on port 443 such as the Dynatrace webhook, but the service is implemented by a pod using port 8443, this is blocked by the firewall.

To fix this, add a firewall rule to explicitly allow ingress to port 8443.
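
A sketch of such a rule using the gcloud CLI; the rule name, network, and source range (your cluster's control-plane CIDR) are placeholders:

gcloud compute firewall-rules create allow-apiserver-to-dynatrace-webhook \
  --network <cluster-vpc-network> \
  --direction INGRESS \
  --action ALLOW \
  --rules tcp:8443 \
  --source-ranges <control-plane-cidr>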

For more information about this issue, see API request that triggers admission webhook timing out.

CannotPullContainerError

If you get errors like this on your pods when installing Dynatrace OneAgent, your Docker download rate limit has been exceeded.

CannotPullContainerError: inspect image has been retried [X] time(s): httpReaderSeeker: failed open: unexpected status code

For details, consult the Docker documentation.

Limit log timeframe

cloudNativeFullStack applicationMonitoring

Dynatrace Operator version 0.10.0+

If there’s DiskPressure on your nodes, you can configure the CSI driver log garbage collection interval to lower the storage usage of the CSI driver. The default value of keeping logs before they are deleted from the file system is 7 (days). To edit this timeframe, select one of the options below, depending on your deployment mode.

Be careful when setting this value; you might need the logs to investigate problems.

Manual (kubectl/oc)

  1. Edit the manifest of the CSI driver DaemonSet (kubernetes-csi.yaml or openshift-csi.yaml), replacing the placeholder (<your_value>) with your value.

apiVersion: apps/v1
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        ...
        - name: provisioner
          ...
          env:
            - name: MAX_UNMOUNTED_VOLUME_AGE
              value: <your_value> # defined in days, must be a plain number. `0` means logs are immediately deleted. If not set, defaults to `7`.

  2. Apply the changes.

Helm

Edit values.yaml to set the maxUnmountedVolumeAge parameter under the csidriver section.

csidriver:
  enabled: true
  ...
  maxUnmountedVolumeAge: "" # defined in days, must be a plain number. `0` means logs are immediately deleted. If not set, defaults to `7`.

Connectivity issues between Dynatrace and your cluster

Problem with ActiveGate token

Example error on the ActiveGate deployment status page:

Problem with ActiveGate token (reason:Absent)

Example error on Dynatrace Operator logs:

{"level":"info","ts":"2022-09-22T06:49:17.351Z","logger":"dynakube-controller","msg":"reconciling DynaKube","namespace":"dynatrace","name":"dynakube"}
{"level":"info","ts":"2022-09-22T06:49:17.502Z","logger":"dynakube-controller","msg":"problem with token detected","dynakube":"dynakube","token":"APIToken","msg":"Token on secret dynatrace:dynakube missing scopes [activeGateTokenManagement.create]"}

Example error on DynaKube status:

status:
  ...
  conditions:
    - message: Token on secret dynatrace:dynakube missing scopes [activeGateTokenManagement.create]
      reason: TokenScopeMissing
      status: "False"
      type: APIToken

Starting Dynatrace Operator version 0.9.0, Dynatrace Operator handles the ActiveGate token by default. If you’re getting one of these errors, follow the instructions below, according to your Dynatrace Operator version.

  • For Dynatrace Operator versions earlier than 0.7.0: you need to upgrade to the latest Dynatrace Operator version.
  • For Dynatrace Operator version 0.7.0 or later, but earlier than version 0.9.0: you need to create a new API token. For instructions, see Tokens and permissions required: Dynatrace Operator token.

ImagePullBackoff error on OneAgent and ActiveGate pods

The underlying host’s container runtime doesn’t contain the certificate presented by your endpoint.

Note: The skipCertCheck field in the DynaKube YAML does not control this certificate check.

Example error (the error message may vary):

desc = failed to pull and unpack image "<environment>/linux/activegate:latest": failed to resolve reference "<environment>/linux/activegate:latest": failed to do request: Head "<environment>/linux/activegate/manifests/latest": x509: certificate signed by unknown authority
Warning  Failed   ...  Error: ErrImagePull
Normal   BackOff  ...  Back-off pulling image "<environment>/linux/activegate:latest"
Warning  Failed   ...  Error: ImagePullBackOff

In this example, if the description on your pod shows x509: certificate signed by unknown authority, you must fix the certificates on your Kubernetes hosts, or use the private repository configuration to store the images.

There was an error with the TLS handshake

The certificate for the communication is invalid or expired. If you’re using a self-signed certificate, check the mitigation procedures for the ActiveGate.

Invalid bearer token

The bearer token is invalid and the request has been rejected by the Kubernetes API. Verify the bearer token. Make sure it doesn't contain any whitespace. If you're connecting to a Kubernetes cluster API via a centralized external role-based access control (RBAC) system, consult the documentation of your Kubernetes cluster manager. For Rancher, see the guidelines on the official Rancher website.

Could not check credentials. Process is started by other user

There is already a request pending for this integration with an ActiveGate. Wait a couple of minutes and check back.

If you get this error after applying the DynaKube custom resource, your Kubernetes API server may be configured with a proxy. You need to exclude https://dynatrace-webhook.dynatrace.svc from that proxy.

OneAgent unable to connect when using Istio

cloudNativeFullStack applicationMonitoring
Example error in the logs on the OneAgent pods: Initial connect: not successful - retrying after xs.
You can fix this problem by increasing the OneAgent timeout. Add the following feature flag to DynaKube:
Note: Be sure to replace the placeholder (<...>) with the name of your DynaKube custom resource.

kubectl annotate dynakube <name-of-your-DynaKube> feature.dynatrace.com/oneagent-initial-connect-retry-ms=6000 -n dynatrace

Connectivity issues when using Calico

If you use Calico to handle or restrict network connections, you might experience connectivity issues, such as:

  • The operator, webhook, and CSI driver pods are constantly restarting
  • The operator cannot reach the API
  • The CSI driver fails to download OneAgent
  • Injection into pods doesn’t work

If you experience these or similar problems, use our GitHub sample policies for common problems.

Notes:

  • For the activegate-policy.yaml and dynatrace-policies.yaml policies, if Dynatrace Operator isn’t installed in the dynatrace namespace (Kubernetes) or project (OpenShift), you need to adapt the metadata and namespace properties in the YAML files accordingly.
  • The purpose of the agent-policy.yaml and agent-policy-external-only.yaml policies is to let OneAgents that are injected into pods open external connections. Only agent-policy-external-only.yaml is required, while agent-policy.yaml allows internal connections to be made, such as pod-to-pod connections, where needed.
  • Because these policies are needed for all pods where OneAgent injects, you also need to adapt the podSelector property of the YAML files.

Potential issues when changing the monitoring mode

  • Changing the monitoring mode from classicFullStack to cloudNativeFullStack affects the host ID calculation for monitored hosts, leading to new IDs being assigned and no connection between the old and new entities.
  • If you want to change the monitoring mode from applicationMonitoring or cloudNativeFullStack to classicFullStack or hostMonitoring, you need to restart all the pods that were previously instrumented with applicationMonitoring or cloudNativeFullStack.

I like to break things. Kubernetes is really good at maintaining desired state so I decided to play whack-a-mole with my pod:

# kubectl get pods
NAME                       READY     STATUS    RESTARTS   AGE
etcd-2372957639-s3107   2/2       Running   1          1d

# kubectl delete pods etcd-2372957639-s3107
pod "etcd-2372957639-s3107" deleted

# kubectl get pods
NAME                       READY     STATUS              RESTARTS   AGE
etcd-2372957639-s3107   2/2       Terminating         1          1d
etcd-2372957639-tl5xh   0/2       ContainerCreating   0          4s

After dozens of ‘whacks’ I noticed it got stuck with a status of ContainerCreating. Let's check the logs:

# kubectl logs etcd-2372957639-tl5xh etcd
Error from server (BadRequest): container "etcd" in pod "etcd-2372957639-tl5xh" is waiting to start: ContainerCreating

Oh yeah, if the container hasn't started yet it has no logs to view! Let's describe the pod and check the events section:

# kubectl describe pods etcd-2372957639-tl5xh
...
Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  17m           17m             1       default-scheduler                       Normal          Scheduled       Successfully assigned etcd-2372957639-tl5xh to 3-docker2

  17m           3m              15      kubelet, 3-docker2                      Warning         FailedMount     MountVolume.SetUp failed for volume
   "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
   (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
   /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
   /dev/sda contains a file system with errors, check forced.
   /dev/sda: Unattached inode 23
   /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
   (i.e., without -a or -p options)

  1m             1m              1       kubelet, 3-docker2                     Warning         FailedMount     MountVolume.SetUp failed for volume
   "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c")
   with: mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
  Output: mount: /dev/sda is already mounted or
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy

  15m            24s             8       kubelet, 3-docker2                     Warning         FailedMount     Unable to mount volumes for pod  
   "etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list
   of unattached/unmounted volumes=[etcd-vol]

  15m            24s             8       kubelet, 3-docker2                     Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to
   attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]

So we can see the pod was assigned to the 3-docker2 host, but then when the kubelet tried to mount a volume the ext4 filesystem inside had errors that require manual intervention. Not good.

Since the pod is scheduled to host 3-docker2, let's go there and check out the logs:

# journalctl -r -u kubelet

Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794177   13812 iscsi_util.go:195] iscsi: failed to mount iscsi volume
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 [ext4] to
/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0, error 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794443   13812 disk_manager.go:50] failed to attach disk
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794471   13812 iscsi.go:228] iscsi: failed to setup
Jun 21 21:29:39 3-docker2 kubelet[13812]: E0621 21:29:39.794898   13812 nestedpendingoperations.go:262] Operation for
""kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" ("fb6d5a06-56b5-11e7-bafb-525400f4fd4c")" failed. No retries permitted until 2017-06-21 21:31:39.794803785
+0200 CEST (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod
"fb6d5a06-56b5-11e7-bafb-525400f4fd4c" (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: 'fsck' found errors on device
/dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 but could not correct them: fsck from util-linux 2.23.2
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda contains a file system with errors, check forced.
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: Unattached inode 23
Jun 21 21:29:39 3-docker2 kubelet[13812]: /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jun 21 21:29:39 3-docker2 kubelet[13812]: (i.e., without -a or -p options)
Jun 21 21:29:39 3-docker2 kubelet[13812]: .
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869197   13812 kubelet.go:1549] Unable to mount volumes for pod
"etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)": timeout expired waiting for volumes to attach/mount
for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]; skipping pod
Jun 21 21:30:46 3-docker2 kubelet[13812]: E0621 21:30:46.869346   13812 pod_workers.go:182] Error syncing pod fb6d5a06-56b5-11e7-bafb-525400f4fd4c
("etcd-2372957639-dkdwz_default(fb6d5a06-56b5-11e7-bafb-525400f4fd4c)"), skipping:
timeout expired waiting for volumes to attach/mount for pod "default"/"etcd-2372957639-dkdwz". list of unattached/unmounted volumes=[etcd-vol]

So basically the same story.

The device is already presented to the host but just not mounted so run the fsck:

# fsck /dev/sda

fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sda contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 23
Connect to /lost+found<y>? yes
Inode 23 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences:  +(7939--23563)
Fix<y>? yes
Free blocks count wrong for group #5 (16785, counted=32410).
Fix<y>? yes
Free blocks count wrong (391677, counted=439712).
Fix<y>? yes
Inode bitmap differences:  -17
Fix<y>? yes
Free inodes count wrong for group #0 (8125, counted=8126).
Fix<y>? yes
Free inodes count wrong (122141, counted=122142).
Fix<y>? yes

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 18/122160 files (16.7% non-contiguous), 48736/488448 blocks

So we fixed it. I'm not an expert on ext4, but there is probably a file in lost+found that we might want to check in case we need it. In any case, the filesystem should be mountable now.

Let's give it a minute and check whether Kubernetes brought it back up:

# kubectl get pods
NAME                       READY     STATUS    RESTARTS   AGE
etcd-2372957639-tl5xh      2/2       Running   1          50m

There we go, our pod is running again!

For the paranoid (me included): I was curious whether other admins, or Kubernetes itself, could corrupt things even more while I ran my fsck. Checking the event log, I see that while my fsck was active it logged a new error because the device was busy:

Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  10m           2m              5       kubelet, 3-docker2      Warning         FailedMount     MountVolume.SetUp failed for volume
  "kubernetes.io/iscsi/fb6d5a06-56b5-11e7-bafb-525400f4fd4c-etcd" (spec.Name: "etcd") pod "fb6d5a06-56b5-11e7-bafb-525400f4fd4c"
  (UID: "fb6d5a06-56b5-11e7-bafb-525400f4fd4c") with: mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: /dev/disk/by-path/ip-192.168.1.190:3260-iscsi-iqn.2010-01.com.s:pnay.default-trash-7971f.2484-lun-0
  /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 ext4 [defaults]
  Output: mount: /dev/sdh is already mounted or /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/192.168.1.190:3260-iqn.2010-01.com.s:pnay.etcd-etcd.1306-lun-0 busy

A bit of filesystem recovery container style :-)

As always, comments are welcome!
