If you run into problems with your Dataflow pipeline or job, this
page lists error messages that you might see and provides suggestions for how to
fix each error.
Errors in the log types dataflow.googleapis.com/worker-startup, dataflow.googleapis.com/harness-startup, and dataflow.googleapis.com/kubelet indicate configuration problems with a job. They can also indicate conditions that prevent the normal logging path from functioning.
Your pipeline might throw exceptions while processing data. Some of these errors
are transient, for example when temporary difficulty accessing an external
service occurs. Some of these errors are permanent, such as errors caused by
corrupt or unparseable input data, or null pointers during computation.
Dataflow processes elements in arbitrary bundles and retries the
complete bundle when an error is thrown for any element in that bundle. When
running in batch mode, bundles including a failing item are retried four times.
The pipeline fails completely when a single bundle fails four times. When
running in streaming mode, a bundle including a failing item is retried
indefinitely, which might cause your pipeline to permanently stall.
Exceptions in user code, for example, your DoFn instances, are reported in the Dataflow Monitoring Interface. If you run your pipeline with BlockingDataflowPipelineRunner, you also see error messages printed in your console or terminal window.
Consider guarding against errors in your code by adding exception handlers. For example, if you want to drop elements that fail some custom input validation done in a ParDo, use a try/catch block within your ParDo to handle the exception, then log and drop the element. For production workloads, implement an unprocessed message pattern. To keep track of the error count, use aggregation transforms.
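The drop-and-log pattern described above can be sketched as a plain Python function; the element format, the validation rule, and the function name are illustrative, and in a real pipeline this logic would live inside a ParDo's DoFn.process method:

```python
import json
import logging

def process_element(element):
    """Validate one element; drop it (emit nothing) when validation fails.

    Illustrative sketch: in a real pipeline this body would be
    DoFn.process, and the failed element would typically go to an
    unprocessed-message (dead-letter) output instead of being dropped.
    """
    try:
        record = json.loads(element)
        if "id" not in record:
            raise ValueError("missing required field: id")
        yield record
    except ValueError as exc:
        # Log and drop the bad element so the bundle is not retried forever.
        logging.warning("Dropping bad element %r: %s", element, exc)

good = list(process_element('{"id": 1}'))
bad = list(process_element("not json"))
```

Because json.JSONDecodeError subclasses ValueError, both unparseable input and the custom validation failure are caught by the same handler.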
Missing log files
If you don’t see any logs for your jobs, remove any exclusion filters containing resource.type="dataflow_step" from all of your Cloud Logging Log Router sinks.
For more details about removing your log exclusions, refer to the Removing exclusions guide.
Pipeline errors
The following sections contain common pipeline errors that you might encounter
and steps for resolving or troubleshooting the errors.
Some Cloud APIs need to be enabled
When you try to run a Dataflow job, the following error occurs:
Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.
This issue occurs because some required APIs are not enabled in your project.
To resolve this issue and run a Dataflow job, enable the following
Google Cloud APIs in your project:
- Compute Engine API (Compute Engine)
- Cloud Logging API
- Cloud Storage
- Cloud Storage JSON API
- BigQuery API
- Pub/Sub
- Datastore API
For detailed instructions, see the Getting Started section on enabling Google Cloud APIs.
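If you use the gcloud CLI, you can enable these APIs in one command. PROJECT_ID is a placeholder, and the service names below are the standard identifiers for the APIs listed above:

```shell
# Enable the APIs that Dataflow jobs depend on (replace PROJECT_ID).
gcloud services enable \
    compute.googleapis.com \
    logging.googleapis.com \
    storage.googleapis.com \
    storage-api.googleapis.com \
    bigquery.googleapis.com \
    pubsub.googleapis.com \
    datastore.googleapis.com \
    --project=PROJECT_ID
```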
Bad request
When you run a Dataflow job,
Cloud Monitoring logs display
a series of warnings similar to the following:
Unable to update setup work item STEP_ID error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id LEASE_ID
with expiration time: TIMESTAMP, now: TIMESTAMP. Full status: generic::invalid_argument: Http(400) Bad Request
Bad request warnings occur if worker state information is stale or out of sync
due to processing delays. Often, your Dataflow job succeeds
despite the bad request warnings. If that is the case, ignore the warnings.
Cannot read and write in different locations
When you run a Dataflow job, you might see the following error in
the log files:
message:Cannot read and write in different locations: source: SOURCE_REGION, destination: DESTINATION_REGION,reason:invalid
This error occurs when the source and destination are in different regions. It can also occur when the staging location and destination are in different regions. For example, if the job reads from Pub/Sub and then writes to a Cloud Storage temp bucket before writing to a BigQuery table, the Cloud Storage temp bucket and the BigQuery table must be in the same region.
Multi-region locations are considered different than single-region locations,
even if the single region falls within the scope of the multi-region location.
For example, us (multiple regions in the United States) and us-central1 are different regions.
To resolve this issue, have your destination, source, and staging locations in
the same region. Cloud Storage bucket locations can’t be changed, so you
might need to create a new Cloud Storage bucket in the correct region.
No such object
When you run your Dataflow jobs, you might see the following error in
the log files:
..., 'server': 'UploadServer', 'status': '404'}>, <content <No such object:...
These errors typically occur when some of your running Dataflow jobs use the same temp_location to stage temporary job files created when the pipeline runs. When multiple concurrent jobs share the same temp_location, these jobs might step on each other’s temporary data, and a race condition might occur. To avoid this issue, use a unique temp_location for each job.
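One way to guarantee a unique temp_location per job is to append a random suffix when you construct the pipeline options; the bucket name here is an assumption:

```python
import uuid

# Illustrative: derive a unique temp_location per job so that concurrent
# jobs never share a staging path (bucket name is a placeholder).
base = "gs://my-bucket/tmp"
job_suffix = uuid.uuid4().hex
temp_location = f"{base}/{job_suffix}"
# Pass this value to your pipeline, e.g. --temp_location=<temp_location>.
```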
DEADLINE_EXCEEDED or Server Unresponsive
When you run your jobs, you might encounter RPC timeout exceptions or one of the
following errors:
DEADLINE_EXCEEDED
Or:
Server Unresponsive
These errors typically occur for one of the following reasons:
- The VPC network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. See Specifying your network and subnetwork for more details.
- Your job is shuffle-bound.
To resolve this issue, make one or more of the following changes.
Java
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
Python
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
Go
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
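For example, a Python launch command combining these flags might look like the following sketch (the script name, project, region, and values are placeholders):

```shell
# Illustrative launch command; adjust values for your project and job.
python my_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --experiments=shuffle_mode=service \
    --num_workers=20 \
    --disk_size_gb=200 \
    --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"
```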
Encoding errors, IOExceptions, or unexpected behavior in user code
The Apache Beam SDKs and the Dataflow workers depend on common
third-party components. These components import additional dependencies. Version
collisions can result in unexpected behavior in the service. If you are using
any of these packages in your code, be aware that some libraries are not
forward-compatible. You might need to pin to the listed versions that are in
scope during execution.
SDK and Worker Dependencies
contains a list of dependencies and their required versions.
Error running LookupEffectiveGuestPolicies
When you run a Dataflow job, you might see the following error in
the log files:
OSConfigAgent Error policies.go:49: Error running LookupEffectiveGuestPolicies:
error calling LookupEffectiveGuestPolicies: code: "Unauthenticated",
message: "Request is missing required authentication credential.
Expected OAuth 2 access token, login cookie or other valid authentication credential.
This error occurs if
OS configuration management is
enabled for the entire project.
To resolve this issue, disable VM Manager
policies that apply to the entire project. If disabling VM Manager
policies for the entire project isn’t possible, you can safely ignore this error
and filter it out of log monitoring tools.
Exhausted resource pool
When you create a Google Cloud resource, you might see the following error for
an exhausted resource pool:
ERROR: ZONE_RESOURCE_POOL_EXHAUSTED
This error occurs for temporary stock-out conditions for a specific resource in
a specific zone.
To resolve the issue, you can either wait for a period of time or create the
same resource in another zone. As a best practice, we recommend that you
distribute your resources across
multiple zones and regions
to tolerate outages.
A fatal error has been detected by the Java Runtime Environment
The following error occurs during worker startup:
A fatal error has been detected by the Java Runtime Environment
This error occurs if the pipeline is using Java Native Interface (JNI) to run
non-Java code and that code or the JNI bindings contain an error.
A hot key … was detected
The following error occurs:
A hot key HOT_KEY_NAME was detected in...
These errors occur if your data contains a hot key. A hot key is a key with
enough elements to negatively impact pipeline performance. These keys limit
Dataflow’s ability to process elements in parallel, which
increases execution time.
To print the human-readable key to the logs when a hot key is detected in the
pipeline, use the
hot key pipeline option.
To resolve this issue, check that your data is evenly distributed. If a key has
disproportionately many values, consider the following courses of action:
- Rekey your data. Apply a ParDo transform to output new key-value pairs.
- For Java jobs, use the Combine.PerKey.withHotKeyFanout transform.
- For Python jobs, use the CombinePerKey.with_hot_key_fanout transform.
- Enable Dataflow Shuffle.
To view hot keys in the Dataflow monitoring UI, see
Troubleshoot stragglers in batch jobs.
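Rekeying can be sketched as key salting: spread a hot key across several sub-keys before grouping, then recombine the partial results per original key in a second step. The function below is a plain-Python illustration (the names and fanout value are arbitrary), mirroring what a ParDo would do per element:

```python
import random

def salt_key(kv, fanout=8):
    """Spread a hot key across `fanout` sub-keys (illustrative rekeying).

    Apply this before the grouping step; afterwards, aggregate the
    per-sub-key results back to the original key.
    """
    key, value = kv
    return ((key, random.randrange(fanout)), value)

salted = salt_key(("hot-user", 1), fanout=8)
```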
Invalid table specification in Data Catalog
When you use Dataflow SQL to create Dataflow SQL jobs, your job might fail
with the following error in the log files:
Invalid table specification in Data Catalog: Could not resolve table in Data Catalog
This error occurs if the Dataflow service account doesn’t have
access to the Data Catalog API.
To resolve this issue, enable the Data Catalog API in the Google Cloud project that you’re using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.
The job graph is too large
Your job might fail with the following error:
The job graph is too large. Please try again with a smaller job graph,
or split your job into two or more smaller jobs.
This error occurs if your job’s graph size exceeds 10 MB. Certain
conditions in your pipeline can cause the job graph to exceed the limit. Common
conditions include:
- A Create transform that includes a large amount of in-memory data.
- A large DoFn instance that is serialized for transmission to remote workers.
- A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
- A directed acyclic graph (DAG) is being used as part of a programmatic loop that is enumerating a large list.
To avoid these conditions, consider restructuring your pipeline.
Key Commit Too Large
When running a streaming job, the following error appears in the worker log
files:
KeyCommitTooLargeException
This error occurs in streaming scenarios if a very large amount of data is grouped without using a Combine transform, or if a large amount of data is produced from a single input element.
To reduce the possibility of encountering this error, use the following strategies:
- Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
- If multiple elements were grouped by a key, consider increasing the key space to reduce the elements grouped per key.
- If elements for a key are emitted at a high frequency over a short period of time, that might result in many GB of events for that key in windows. Rewrite the pipeline to detect keys like this and only emit an output indicating the key was frequently present in that window.
- Make sure to use sub-linear space Combine transforms for commutative and associative operations. Don’t use a combiner if it doesn’t reduce space. For example, a combiner for strings that just appends strings together is worse than not using a combiner.
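For instance, a sub-linear combiner keeps only a fixed-size accumulator, such as a count and a sum, regardless of how many elements it sees. The plain-Python sketch below illustrates the accumulator shape; the functions mirror a Beam CombineFn's add_input and merge_accumulators but are not the Beam API:

```python
# Constant-space combiner sketch: accumulate (count, sum), never the
# elements themselves, so state per key stays small.
def add_input(acc, value):
    count, total = acc
    return (count + 1, total + value)

def merge_accumulators(a, b):
    return (a[0] + b[0], a[1] + b[1])

acc = (0, 0)
for v in [3, 5, 7]:
    acc = add_input(acc, v)
merged = merge_accumulators(acc, (2, 10))
```

By contrast, an accumulator that concatenates the input strings grows linearly with the data and saves nothing.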
rejecting message over 7168K
When you run a Dataflow job created from a template, the job
might fail with the following error:
Error: CommitWork failed: status: APPLICATION_ERROR(3): Pubsub publish requests are limited to 10MB, rejecting message over 7168K (size MESSAGE_SIZE) to avoid exceeding limit with byte64 request encoding.
This error occurs when messages written to a dead-letter queue exceed the size
limit of 7168K.
As a workaround, enable
Streaming Engine,
which has a higher size limit.
To enable Streaming Engine, use the following
pipeline option.
Java
--enableStreamingEngine=true
Python
--enable_streaming_engine=true
Request Entity Too Large
When you submit your job, one of the following errors appears in your console or
terminal window:
413 Request Entity Too Large
The size of serialized JSON representation of the pipeline exceeds the allowable limit
Failed to create a workflow job: Invalid JSON payload received
Failed to create a workflow job: Request payload exceeds the allowable limit
When you encounter an error about the JSON payload when submitting your job,
your pipeline’s JSON representation exceeds the maximum 20 MB request size.
The size of your job is specifically tied to the JSON representation of the
pipeline. A larger pipeline means a larger request. Dataflow
currently has a limitation that caps requests at 20 MB.
To estimate the size of your pipeline’s JSON request, run your pipeline with the
following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to some additional information included in the request.
Certain conditions in your pipeline can cause the JSON representation to exceed
the limit. Common conditions include:
- A Create transform that includes a large amount of in-memory data.
- A large DoFn instance that is serialized for transmission to remote workers.
- A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
To avoid these conditions, consider restructuring your pipeline.
SDK pipeline options or staging file list exceeds size limit
When running a pipeline, the following error occurs:
SDK pipeline options or staging file list exceeds size limit.
Please keep their length under 256K Bytes each and 512K Bytes in total.
This error occurs if the pipeline could not be started due to Google
Compute Engine metadata limits being exceeded. These limits cannot be changed.
Dataflow uses Compute Engine metadata for pipeline options. The
limit is documented in the Compute Engine custom metadata
limitations.
Having too many JAR files to stage can cause the JSON representation to exceed
the limit.
To estimate the size of your pipeline’s JSON request, run your pipeline with the
following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
The size of the output file from this command must be less than 256 KB. The 512 KB in the error message refers to the total size of the output file and the custom metadata options for the Compute Engine VM instance.
You can get a rough estimate of the custom metadata options for the VM instance from running Dataflow jobs in the project. Choose any running Dataflow job, take a VM instance, and then navigate to the Compute Engine VM instance details page for that VM to check the custom metadata section. The total length of the custom metadata and the file should be less than 512 KB. An accurate estimate for the failed job is not possible, because the VMs are not spun up for failed jobs.
If your JAR list is hitting the 256 KB limit, review it and reduce any unnecessary JAR files. If it is still too large, try executing the Dataflow job using an uber JAR. The Cloud Functions section of the documentation has an example showing how to create and use an uber JAR.
Shuffle key too large
The following error appears in the worker log files:
Shuffle key too large
This error occurs if the serialized key emitted to a particular (Co-)GroupByKey is too large after the corresponding coder is applied. Dataflow has a limit for serialized shuffle keys.
To resolve this issue, reduce the size of the keys or use more space-efficient coders.
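One space-efficient option is to replace an oversized key with a fixed-size digest before the group-by step, keeping the original key in the value if it is still needed downstream. A plain-Python sketch (the choice of hash is an assumption, not a Dataflow requirement):

```python
import hashlib

def compact_key(key: str) -> str:
    # Replace an arbitrarily large shuffle key with a fixed-size digest.
    # Keep the original key inside the value if you need it later.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

digest = compact_key("x" * 100_000)
```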
Total number of BoundedSource objects … is larger than the allowable limit
One of the following errors might occur when running jobs with Java:
Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Or:
Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Java
This error can occur if you’re reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source, but it is on the order of tens of thousands of files in one pipeline. For example, embedding schema in AvroIO.Read allows fewer files.
This error can also occur if you created a custom data source for your pipeline and your source’s splitIntoBundles method returned a list of BoundedSource objects that takes up more than 20 MB when serialized.
The allowable limit for the total size of the BoundedSource objects generated by your custom source’s splitIntoBundles() operation is 20 MB.
To work around this limitation, modify your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
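The "fewer splits initially" idea can be sketched with a small helper that caps the number of bundles a source returns; this is an illustrative function, not part of the Beam API:

```python
import math

def initial_bundle_count(total_bytes, desired_bundle_bytes, max_bundles=1000):
    """Cap the number of initial splits a custom source produces.

    Illustrative helper: return at most max_bundles coarse splits and
    rely on dynamic work rebalancing to split further on demand, keeping
    the serialized list of sources well under the 20 MB limit.
    """
    wanted = math.ceil(total_bytes / desired_bundle_bytes)
    return min(wanted, max_bundles)

n = initial_bundle_count(10_000_000_000, 1_000_000, max_bundles=1000)
```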
NameError
When you execute your pipeline using the Dataflow service, the
following error occurs:
NameError
This error does not occur when you execute locally, such as when you execute using the DirectRunner.
This error occurs if your DoFns are using values in the global namespace that are not available on the Dataflow worker. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
To resolve this issue, use one of the following methods. If your DoFns are defined in the main file and reference imports and functions in the global namespace, set the --save_main_session pipeline option to True. This change pickles the state of the global namespace and loads it on the Dataflow worker.
If you have objects in your global namespace that can’t be pickled, a pickling error occurs. If the error concerns a module that should be available in the Python distribution, import the module locally, where it is used.
For example, instead of:

import re
…
def myfunc():
  # use re module

use:

def myfunc():
  import re
  # use re module

Alternatively, if your DoFns span multiple files, use a different approach to packaging your workflow and managing dependencies.
Processing stuck/Operation ongoing
If Dataflow spends more time executing a DoFn than the time specified in TIME_INTERVAL without returning, the following message is displayed:
Processing stuck/Operation ongoing in step STEP_ID for at least TIME_INTERVAL without outputting or completing in state finish at STACK_TRACE
This behavior has two possible causes:
- Your DoFn code is simply slow, or waiting for some slow external operation to complete.
- Your DoFn code might be stuck, deadlocked, or abnormally slow to finish processing.
To determine which of these is the case, expand the Cloud Monitoring log entry to see a stack trace. Look for messages that indicate that the DoFn code is stuck or otherwise encountering issues. If no messages are present, the issue might be the execution speed of the DoFn code. Consider using Cloud Profiler or another tool to investigate the code’s performance.
You can further investigate the cause of your stuck code if your pipeline is built on the Java VM (using either Java or Scala). Take a full thread dump of the whole JVM (not just the stuck thread) by following these steps:
- Make note of the worker name from the log entry.
- In the Compute Engine section of the Google Cloud console, find the Compute Engine instance with the worker name you noted.
- SSH into the instance with that name.
- Run the following command:
curl http://localhost:8081/threadz
Pub/Sub quota errors
When running a streaming pipeline from Pub/Sub, the following
errors occur:
429 (rateLimitExceeded)
Or:
Request was throttled due to user QPS limit being reached
These errors occur if your project has insufficient
Pub/Sub quota.
To find out if your project has insufficient quota, follow these steps to check
for client errors:
- Go to the Google Cloud console.
- In the menu on the left, select APIs & services.
- In the Search Box, search for Cloud Pub/Sub.
- Click the Usage tab.
- Check Response Codes and look for (4xx) client error codes.
Pub/Sub unable to determine backlog
When running a streaming pipeline from Pub/Sub, the following
error occurs:
Dataflow is unable to determine the backlog for Pub/Sub subscription
When a Dataflow pipeline pulls data from Pub/Sub,
Dataflow needs to repeatedly request information from
Pub/Sub about the amount of backlog on the subscription and the
age of the oldest unacknowledged message. This information is necessary for
autoscaling and to advance the watermark of the pipeline. Occasionally,
Dataflow is unable to retrieve this information from
Pub/Sub because of internal system issues, in which case the
pipeline might not autoscale properly, and the watermark might not advance. If
these problems persist, contact customer support.
For more information, see
Streaming With Cloud Pub/Sub.
Request is prohibited by organization’s policy
When running a pipeline, the following error occurs:
Error trying to get gs://BUCKET_NAME/FOLDER/FILE:
{"code":403,"errors":[{"domain":"global","message":"Request is prohibited by organization's policy","reason":"forbidden"}],
"message":"Request is prohibited by organization's policy"}
This error occurs if the Cloud Storage bucket is outside of your
service perimeter.
To resolve this issue, create an
egress rule that allows
access to the bucket outside of the service perimeter.
Staged package…is inaccessible
Jobs that used to succeed might fail with the following error:
Staged package...is inaccessible
To resolve this issue:
- Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
- Verify that your Dataflow project’s controller service account has the permission to access the Cloud Storage bucket used for staging. Gaps in permission can be due to any of the following reasons:
  - The Cloud Storage bucket used for staging is present in a different project.
  - The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Due to the inconsistency between IAM and ACL policies, migrating the staging bucket to uniform bucket-level access disallows ACLs for Cloud Storage resources, which includes the permissions held by your Dataflow project’s controller service account over the staging bucket.
For more information, see Accessing Cloud Storage buckets across Google Cloud projects.
A work item failed 4 times
The following error occurs when a job fails:
a work item failed 4 times
This error occurs if a single operation causes the worker code to fail four
times. Dataflow fails the job, and this message is displayed.
You can’t configure this failure threshold. For more details, refer to
pipeline error and exception handling.
To resolve this issue, look in the job’s
Cloud Monitoring logs for the
four individual failures. Look for Error-level or Fatal-level log
entries in the worker logs that show exceptions or errors. The exception or
error should appear at least four times. If the logs only contain generic
timeout errors related to accessing external resources, such as MongoDB, verify
that the worker service account has permission to access the resource’s
subnetwork.
Timeout in Polling Result File
The following occurs when a job fails:
Timeout in polling result file: PATH. Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service account SERVICE_ACCOUNT may not have enough permissions to pull
container image IMAGE_PATH or create new objects in PATH.
3. Transient errors occurred, please try again.
This issue is often related to how the Python dependencies are installed via the requirements.txt file. The Apache Beam stager downloads the source of all dependencies from PyPI, including the sources of transitive dependencies. Then, wheel compilation happens implicitly during the pip download command for some of the Python packages that are dependencies of apache-beam. This process can result in a timeout issue from the requirements.txt file.
This issue is known to the Apache Arrow team. The suggested workaround is to install apache-beam directly in the Dockerfile. This way, the timeout for the requirements.txt file is not applied.
Container image errors
The following sections contain common errors that you might encounter when using
custom containers and steps for resolving or troubleshooting the errors. The
errors are typically prefixed with the following message:
Unable to pull container image due to error: DETAILED_ERROR_MESSAGE
Custom container pipeline fails or is slow to start up workers
A known issue might cause Dataflow to mark workers as unresponsive
and restart the worker if they are using a large custom container (~10 GB) that
takes a long time to download. This issue can cause pipeline startup to be slow
or in some extreme cases prevent workers from starting up at all.
To confirm that this known issue is causing the problem, in dataflow.googleapis.com/kubelet, look for worker logs that show several failed attempts to pull a container, and confirm that kubelet did not start on the worker. For example, the logs might contain Pulling image <URL> without a corresponding Successfully pulled image <URL> or Started Start kubelet after all images have been pulled.
To work around this issue, run the pipeline with the --experiments=disable_worker_container_image_prepull pipeline option.
Error syncing pod … failed to "StartContainer"
The following error occurs during worker startup:
Error syncing pod POD_ID, skipping: [failed to "StartContainer" for CONTAINER_NAME with CrashLoopBackOff: "back-off 5m0s restarting failed container=CONTAINER_NAME pod=POD_NAME].
A pod is a co-located group of Docker containers running on a
Dataflow worker. This error occurs when one of the Docker
containers in the pod fails to start. If the failure is not recoverable, the
Dataflow worker isn’t able to start, and Dataflow
batch jobs eventually fail with errors like the following:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
This error typically occurs when one of the containers is continuously crashing
during startup. To understand the root cause, look for the logs captured
immediately prior to the failure. To analyze the logs, use the
Logs Explorer.
In the Logs Explorer, limit the log files to log entries emitted from the
worker with container startup errors. To limit the log entries, complete the
following steps:
- In the Logs Explorer, find the Error syncing pod log entry.
- To see the labels associated with the log entry, expand the log entry.
- Click the label associated with resource_name, and then click Show matching entries.
In the Logs Explorer, the Dataflow logs are organized into several log streams. The Error syncing pod message is emitted in the log named kubelet. However, the logs from the failing container could be in a different log stream. Each container has a name. Use the following table to determine which log stream might contain logs relevant to the failing container.
Container name | Log names
---|---
sdk, sdk0, sdk1, sdk-0-0, and so on | docker
harness | harness, harness-startup
python, java-batch, java-streaming | worker-startup, worker
artifact | artifact
When you query the Logs Explorer, make sure that the query either includes
the relevant log names
in the query builder UI
or does not have restrictions on the log name.
After you select the relevant logs, the query result might look like the
following example:
resource.type="dataflow_step"
resource.labels.job_id="2022-06-29_08_02_54-JOB_ID"
labels."compute.googleapis.com/resource_name"="testpipeline-jenkins-0629-DATE-cyhg-harness-8crw"
logName=("projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fdocker"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker-startup"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker")
Because the logs reporting the symptom of the container failure are sometimes
reported as INFO, include INFO logs in your analysis.
Typical causes of container failures include the following:
- Your Python pipeline has additional dependencies that are installed at
runtime, and the installation is unsuccessful. You might see errors like
pip install failed with error. This issue might occur due to conflicting
requirements, or due to a restricted networking configuration that prevents
a Dataflow worker from pulling an external dependency from a public
repository over the internet.
- A worker fails in the middle of the pipeline run due to an out-of-memory
error. You might see an error like one of the following:
java.lang.OutOfMemoryError: Java heap space
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 24453/42043/42043 MB, GC last/max =
58.97/99.89 %, #pushbacks=82, gc thrashing=true. Heap dump not written.
To debug an out-of-memory issue, see
Troubleshoot Dataflow out of memory errors.
- Dataflow is unable to pull the container image. For more information, see
Image pull request failed with error.
After you identify the error causing the container to fail, try to address the
error, and then resubmit the pipeline.
Image pull request failed with error
During worker startup, one of the following errors appears in the worker or job
logs:
Image pull request failed with error
pull access denied for IMAGE_NAME
manifest for IMAGE_NAME not found: manifest unknown: Failed to fetch
Get IMAGE_NAME: Service Unavailable
These errors occur if a worker is unable to start up because the worker can’t
pull a Docker container image. This issue happens in the following scenarios:
- The custom SDK container image URL is incorrect.
- The worker lacks credentials or network access to the remote image.
To resolve this issue:
- If you are using a custom container image with your job, verify that your
image URL is correct, has a valid tag or digest, and that the image is
accessible by the Dataflow workers.
- Verify that public images can be pulled locally by running docker pull
$image from an unauthenticated machine.
For private images or private workers:
- Dataflow only supports private container images hosted on
Container Registry. If you are using Container Registry to host your container
image, the default Google Cloud service account has access to images in the
same project. If you are using images in a different project than the one
used to run your Google Cloud job, make sure to configure access control for
the default Google Cloud service account.
- If using a Shared Virtual Private Cloud (VPC), make sure that workers can
access the custom container repository host.
- Use ssh to connect to a running job worker VM and run docker pull $image
to directly confirm that the worker is configured properly.
If workers fail several times in a row due to this error and work has started
on a job, the job can fail with an error similar to the following:
Job appears to be stuck.
If you remove access to the image while the job is running, either by removing
the image itself or by revoking the Dataflow worker service account’s
credentials or internet access, Dataflow only logs errors and
doesn’t take action to fail the job. Dataflow also avoids failing
long-running streaming pipelines in order to avoid losing pipeline state.
Other possible errors can arise from repository quota issues or outages. If you
experience issues exceeding the
Docker Hub quota
for pulling public images or general third-party repository outages, consider
using Container Registry as the image repository.
SystemError: unknown opcode
Your Python custom container pipeline might fail with the following error
immediately after job submission:
SystemError: unknown opcode
In addition, the stack trace might include apache_beam/internal/pickler.py.
To resolve this issue, verify that the Python version that you are using
locally matches the version in the container image up to the major and minor
version. A difference in the patch version, such as 3.6.7 versus 3.6.8, does
not create compatibility issues, but a difference in the minor version, such
as 3.6.8 versus 3.8.2, can cause pipeline failures.
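As a quick illustration, the major.minor rule above can be expressed as a small check (the helper name is hypothetical, not part of any SDK):

```python
# Hypothetical helper illustrating the rule above: Python versions are
# considered compatible when major and minor match; the patch level may differ.
def versions_compatible(local, container):
    """Return True when the major.minor components of two version tuples match."""
    return local[:2] == container[:2]

print(versions_compatible((3, 6, 7), (3, 6, 8)))  # patch differs only -> True
print(versions_compatible((3, 6, 8), (3, 8, 2)))  # minor differs -> False
```

Locally, the running interpreter's version is available as sys.version_info[:3].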
Worker errors
The following sections contain common worker errors that you might encounter and
steps for resolving or troubleshooting the errors.
Call from Java worker harness to Python DoFn fails with error
If a call from the Java worker harness to a Python DoFn
fails, a relevant
error message is displayed.
To investigate the error, expand the
Cloud Monitoring error log
entry and look at the error message and traceback. It shows you which code
failed so you can correct it if necessary. If you believe that this is a bug in
Apache Beam or Dataflow,
report the bug.
EOFError: marshal data too short
The following error appears in the worker logs:
EOFError: marshal data too short
This error sometimes occurs when Python pipeline workers run out of disk space.
To resolve this issue, see No space left on device.
No space left on device
When a job runs out of disk space, the following error might appear in the
worker logs:
No space left on device
This error can occur for one of the following reasons:
- The worker persistent storage runs out of free space, which can occur for
one of the following reasons:
  - A job downloads large dependencies at runtime
  - A job uses large custom containers
  - A job writes a lot of temporary data to local disk
- When using Dataflow Shuffle, Dataflow sets a lower default disk size. As a
result, this error might occur with jobs moving from worker-based shuffle.
- The worker boot disk fills up because it is logging more than 50 entries
per second.
To resolve this issue, follow these troubleshooting steps:
To see disk resources associated with a single worker, look up VM instance
details for worker VMs associated with your job. Part of the disk space is
consumed by the operating system, binaries, logs, and containers.
To increase persistent disk or boot disk space, adjust the
disk size pipeline option.
Track disk space usage on the worker VM instances by using Cloud Monitoring.
See
Receive worker VM metrics from the Monitoring agent
for instructions explaining how to set this up.
Look for boot disk space issues by
Viewing serial port output
on the worker VM instances and looking for messages like:
Failed to open system journal: No space left on device
If you have a large number of worker VM instances, you can create a script to
run gcloud compute instances get-serial-port-output
on all of them at once and
review the output from that instead.
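A minimal sketch of such a review, assuming the serial-port output for each VM has already been captured (for example with gcloud compute instances get-serial-port-output); the helper function is illustrative and not part of any Google Cloud SDK:

```python
# Scan captured serial-port output for disk-full messages.
# This helper is a sketch, not part of any Google Cloud SDK.
def find_disk_full_lines(serial_output):
    """Return the lines that report 'No space left on device'."""
    return [line for line in serial_output.splitlines()
            if "No space left on device" in line]

sample_output = (
    "systemd[1]: Started Docker Application Container Engine.\n"
    "Failed to open system journal: No space left on device\n"
)
print(find_disk_full_lines(sample_output))
```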
Python pipeline fails after one hour of worker inactivity
When using the Apache Beam SDK for Python with Dataflow Runner
V2 on worker machines with a large number of CPU cores and when the pipeline
performs Python dependency installations at startup, the Dataflow
job might take a long time to start.
This issue occurs when Runner V2 starts one container per CPU core on workers
when running Python pipelines. On workers with many cores, the simultaneous
container startup and initialization might cause resource exhaustion.
To resolve this issue, pre-build your Python container. This step can improve
VM startup times and horizontal autoscaling performance. To use this
experimental feature, enable the Cloud Build API on your project and submit
your pipeline with the following parameter:
--prebuild_sdk_container_engine=cloud_build.
For more information, see
Dataflow Runner V2.
Alternatively,
use a custom container image
with all dependencies preinstalled.
As a workaround, you can run the pipeline with the
--experiments=disable_runner_v2 pipeline option, or upgrade to Beam SDK
2.35.0 or later and run the pipeline with the
--dataflow_service_options=use_sibling_sdk_workers pipeline option.
Startup of the worker pool in zone failed to bring up any of the desired workers
The following error occurs:
Startup of the worker pool in zone ZONE_NAME failed to bring up any of the desired NUMBER workers.
The project quota may have been exceeded or access control policies may be preventing the operation;
review the Cloud Logging 'VM Instance' log for diagnostics.
This error occurs for one of the following reasons:
- You have exceeded one of the Compute Engine quotas that Dataflow
worker creation relies on.
- Your organization has constraints in place that prohibit some aspect of
the VM instance creation process, such as the account being used or the zone
being targeted.
To resolve this issue, follow these troubleshooting steps:
Review the VM Instance log
- Go to the Cloud Logging viewer.
- In the Audited Resource drop-down list, select VM Instance.
- In the All logs drop-down list, select compute.googleapis.com/activity_log.
- Scan the log for any entries related to VM instance creation failure.
Check your usage of Compute Engine quotas
- On your local machine command line or in Cloud Shell, run the following
command to view Compute Engine resource usage compared to the quotas for the
zone you are targeting:
gcloud compute regions describe [REGION]
- Review the results for the following resources to see if any are exceeding
quota:
  - CPUS
  - DISKS_TOTAL_GB
  - IN_USE_ADDRESSES
  - INSTANCE_GROUPS
  - INSTANCES
  - REGIONAL_INSTANCE_GROUP_MANAGERS
- If needed, request a quota change.
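As an illustration, the quota review can be automated against the JSON output of gcloud compute regions describe [REGION] --format=json, whose quotas field lists metric, limit, and usage for each resource. The helper and the 90% threshold below are hypothetical choices, not part of any SDK:

```python
# Flag quota metrics whose usage is at or above a fraction of the limit.
# Input mirrors the "quotas" field of `gcloud compute regions describe`.
def near_limit(quotas, threshold=0.9):
    """Return metrics whose usage is at or above `threshold` of the limit."""
    return [q["metric"] for q in quotas
            if q["limit"] > 0 and q["usage"] / q["limit"] >= threshold]

sample = [
    {"metric": "CPUS", "limit": 24.0, "usage": 24.0},
    {"metric": "DISKS_TOTAL_GB", "limit": 4096.0, "usage": 500.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 8.0, "usage": 8.0},
]
print(near_limit(sample))  # -> ['CPUS', 'IN_USE_ADDRESSES']
```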
Review your organization policy constraints
- Go to the Organization policies page.
- Review the constraints for any that might limit VM instance creation for
either the account you are using (by default, the Dataflow service account)
or the zone you are targeting.
Timed out waiting for an update from the worker
When a Dataflow job fails, the following error occurs:
Root cause: Timed out waiting for an update from the worker. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact.
In some cases, this error occurs when the worker runs out of memory or swap
space. In this scenario, to resolve this issue, as a first step, try running the
job again. If the job still fails and the same error occurs, try using a worker
with more memory and disk space. For example, add the following pipeline startup
option:
--worker_machine_type=m1-ultramem-40 --disk_size_gb=500
Note that changing the worker type could impact billed cost. For more information,
see Troubleshoot Dataflow out of memory errors.
This error can also occur when your data contains a hot key. In this scenario,
CPU utilization is high on some workers during most of the job’s duration, but
the number of workers does not reach the maximum allowed. For more information
about hot keys and possible solutions, see
Writing Dataflow pipelines with scalability in mind.
For additional solutions to this issue, see
A hot key … was detected.
If your Python code calls C/C++ code by using the Python extension mechanism,
check whether the extension code releases the Python Global Interpreter Lock
(GIL) in computationally intensive parts of code that don’t access Python
state. Libraries that facilitate interactions with extensions, such as Cython
and PyBind, have primitives to control the GIL status. You can also manually
release the GIL and re-acquire it before returning control to the Python
interpreter by using the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS
macros.
For more information, see Thread State and the Global Interpreter Lock
in the Python documentation.
Also note that in Python pipelines, in the default configuration, Dataflow
assumes that each Python process running on the workers efficiently uses one
vCPU core. If the pipeline code bypasses the GIL limitations,
such as by using libraries that are implemented in C++, processing
elements might use resources from more than one vCPU core, and the
workers might not get enough CPU resources. To work around this issue,
reduce the number of threads
on the workers.
Streaming Engine errors
The following sections contain common Streaming Engine errors that you might
encounter and steps for resolving or troubleshooting the errors.
BigQuery connector errors
The following sections contain common BigQuery connector errors that
you might encounter and steps for resolving or troubleshooting the errors.
quotaExceeded
When using the BigQuery connector to write to BigQuery using
streaming inserts, write throughput is lower than expected, and the following
error might occur:
quotaExceeded
Slow throughput might be due to your pipeline exceeding the available
BigQuery streaming insert quota. If this is the case, quota-related error
messages from BigQuery appear in the Dataflow worker logs (look
for quotaExceeded errors).
If you see quotaExceeded errors, to resolve this issue:
- When using the Apache Beam SDK for Java, set the BigQuery sink option
ignoreInsertIds().
- When using the Apache Beam SDK for Python, use the ignore_insert_ids
option.
These settings make you eligible for one GB/s of per-project
BigQuery streaming insert throughput. For more information on caveats related
to automatic message de-duplication, see the BigQuery documentation.
To increase the BigQuery streaming insert quota above one GB/s,
submit a request through the Google Cloud console.
If you do not see quota-related errors in the worker logs, the issue might be
that default bundling or batching related parameters do not provide adequate
parallelism for your pipeline to scale. You can adjust several
Dataflow BigQuery connector related configurations to achieve the
expected performance when writing to BigQuery using streaming
inserts. For example, for the Apache Beam SDK for Java, adjust
numStreamingKeys to match the maximum number of workers, and consider
increasing insertBundleParallelism to configure the BigQuery connector
to write to BigQuery using more parallel threads.
For configurations available in the Apache Beam SDK for Java, see
BigQueryPipelineOptions, and for configurations available in the Apache Beam
SDK for Python, see the WriteToBigQuery transform.
rateLimitExceeded
When using the BigQuery connector, the following error occurs:
rateLimitExceeded
This error occurs if too many BigQuery API requests are sent during
a short duration. BigQuery has short-term quota limits that apply in
this case. It’s possible for your Dataflow pipeline to temporarily
exceed such a quota. Whenever this happens, API requests from your
Dataflow pipeline to BigQuery might fail, which could
result in rateLimitExceeded errors in worker logs.
Dataflow retries such failures, so you can safely ignore these
errors. If you believe that your pipeline is significantly impacted by
rateLimitExceeded errors, contact Google Cloud Support.
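The retry behavior described above resembles standard exponential backoff. The sketch below is illustrative only, not Dataflow's actual implementation; the function names and the use of RuntimeError as a stand-in for a rateLimitExceeded response are assumptions:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.01):
    """Call fn(), doubling the delay after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for a rateLimitExceeded API response
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

attempts = {"count": 0}
def flaky_insert():
    """Fails twice with a simulated rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rateLimitExceeded")
    return "ok"

print(call_with_backoff(flaky_insert))  # -> ok
```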
Recommendations
For guidance on recommendations generated by Dataflow Insights,
see Insights.
When creating the kubernetes-dashboard pod, the pod appears to be created
successfully, but when viewing the status of the pod it is not found, and the
pod is not running on any node. After troubleshooting, it was found that the
pod-infrastructure image download failed, causing the pod startup to fail.
Pod-infrastructure image download configuration
Open the /etc/kubernetes/kubelet configuration file:
vim /etc/kubernetes/kubelet
###
# kubernetes kubelet (minion) config
# The address for the info server to serve on (set to 0.0.0.0 for all interfaces)
KUBELET_ADDRESS="--address=0.0.0.0"
# The port for the info server to serve on
# KUBELET_PORT="--port=10250"
# You may leave this blank to use the actual hostname
KUBELET_HOSTNAME="--hostname-override=10.0.11.150"
# location of the api-server
KUBELET_API_SERVER="--api-servers=http://10.0.11.150:8080"
# pod infrastructure container image
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=10.0.11.150:5000/rhel7/pod-infrastructure:v1.0.0"
# Add your own!
KUBELET_ARGS=""
The configuration items of the kubelet configuration file are shown above.
The KUBELET_POD_INFRA_CONTAINER configuration item specifies the address from
which the pod-infrastructure image is downloaded. The original configuration
was:
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=registry.access.redhat.com/rhel7/pod-infrastructure:latest"
Testing showed that this pod-infrastructure image could not be downloaded
from inside China.
The solution
First, configure the local Docker daemon to use an image-acceleration mirror.
Then find a downloadable image by running docker search pod-infrastructure:
docker search pod-infrastructure
INDEX      NAME                                     DESCRIPTION                                     STARS   OFFICIAL   AUTOMATED
docker.io  docker.io/openshift/origin-pod           The pod infrastructure image for openshift 3    5
docker.io  docker.io/infrastructureascode/aws-cli   Containerized AWS CLI on alpine to avoid r...   3                  [OK]
docker.io  docker.io/newrelic/infrastructure        Public image for New Relic Infrastructure.      3
docker.io  docker.io/infrastructureascode/uwsgi     uwsgi application server                        2                  [OK]
docker.io  docker.io/manageiq/manageiq-pods         OpenShift based images for manageiq.            2                  [OK]
docker.io  docker.io/podigg/podigg-lc-hobbit        A hobbit dataset generator wrapper for PoDiGG   1                  [OK]
docker.io  docker.io/tianyebj/pod-infrastructure    registry.access.redhat.com/rhel7/pod-infra...   1
docker.io  docker.io/w564791/pod-infrastructure     latest                                          1
After finding the image you want, download it locally with the docker pull
command. Then push the pod-infrastructure image into the local private
registry and modify the KUBELET_POD_INFRA_CONTAINER configuration item in the
kubelet configuration file:
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=10.0.11.150:5000/rhel7/pod-infrastructure:v1.0.0"
Here, 10.0.11.150:5000 is my local private Docker registry,
rhel7/pod-infrastructure is the name of the image saved in the private
registry, and v1.0.0 is the saved version.
Finally, restart the cluster.
Master node restart command:
for SERVICES in kube-apiserver kube-controller-manager kube-scheduler; do
  systemctl restart $SERVICES
done
Node restart command:
systemctl restart kubelet
Note: make this change on both the master node and the worker nodes.
Author: _silent YanQin
Source: CSDN, https://blog.csdn.net/A632189007/article/details/78730903
Copyright statement: this is an original article by the blogger; when
reproducing, please attach a link to the blog.
My kubectl version
Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
I followed Creating Multi-Container Pods
After launching the pod, One container is UP but not other.
kubectl get pods
NAME READY STATUS RESTARTS AGE
redis-django 1/2 CrashLoopBackOff 9 22m
Then I did kubectl describe pod redis-django.
At the bottom I saw the Error syncing pod, skipping error:
31m <invalid> 150 {kubelet 172.25.30.21} spec.containers{frontend} Warning BackOff Back-off restarting failed docker container
25m <invalid> 121 {kubelet 172.25.30.21} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=frontend pod=redis-django_default(9f35ffcd-391e-11e6-b160-0022195df673)"
How can I resolve this error? Any help is appreciated!
Thanks!
OS: Ubuntu 14
UPDATE
Previously I used the below YAML file, which was found at Creating Multi-Container Pods:
apiVersion: v1
kind: Pod
metadata:
name: redis-django
labels:
app: web
spec:
containers:
- name: key-value-store
image: redis
ports:
- containerPort: 6379
- name: frontend
image: django
ports:
- containerPort: 8000
The frontend container was not started. Then I changed the YAML file to two
redis containers with different names and ports, but the result is the same
(getting Error syncing pod, skipping).
Later I changed the YAML file to only one django container. This pod's status
is CrashLoopBackOff with the Error syncing pod, skipping error.
UPDATE-2
I ran tail -f /var/log/upstart/kubelet.log, which shows the same error.
Kubelet is continuously trying to start the container, but it can't!
I0623 12:15:13.943046 445 manager.go:2050] Back-off 5m0s restarting failed container=key-value-store pod=redis-django_default(94683d3c-392e-11e6-b160-0022195df673)
E0623 12:15:13.943100 445 pod_workers.go:138] Error syncing pod 94683d3c-392e-11e6-b160-0022195df673, skipping: failed to "StartContainer" for "key-value-store" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=key-value-store pod=redis-django_default(94683d3c-392e-11e6-b160-0022195df673)"
UPDATE-3
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl describe pod redis-django
Name: redis-django
Namespace: default
Node: 192.168.1.10/192.168.1.10
Start Time: Thu, 23 Jun 2016 22:58:03 -0700
Labels: app=web
Status: Running
IP: 172.16.20.2
Controllers: <none>
Containers:
key-value-store:
Container ID: docker://8dbdd6826c354243964f0306427082223d3da49bf2aaf30e15961ea00362fe42
Image: redis
Image ID: docker://sha256:4465e4bcad80b5b43cef0bace96a8ef0a55c0050be439c1fb0ecd64bc0b8cce4
Port: 6379/TCP
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Running
Started: Thu, 23 Jun 2016 22:58:10 -0700
Ready: True
Restart Count: 0
Environment Variables:
frontend:
Container ID: docker://9c89602739abe7331b3beb3a79e92a7cc42e2a7e40e11618413c8bcfd0afbc16
Image: django
Image ID: docker://sha256:0cb63b45e2b9a8de5763fc9c98b79c38b6217df718238251a21c8c4176fb3d68
Port: 8000/TCP
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jun 2016 22:58:41 -0700
Finished: Thu, 23 Jun 2016 22:58:41 -0700
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jun 2016 22:58:22 -0700
Finished: Thu, 23 Jun 2016 22:58:22 -0700
Ready: False
Restart Count: 2
Environment Variables:
Conditions:
Type Status
Ready False
Volumes:
default-token-0oq7p:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-0oq7p
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
49s 49s 1 {default-scheduler } Normal Scheduled Successfully assigned redis-django to 192.168.1.10
48s 48s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Pulling pulling image "redis"
43s 43s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Pulled Successfully pulled image "redis"
43s 43s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Created Created container with docker id 8dbdd6826c35
42s 42s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Started Started container with docker id 8dbdd6826c35
37s 37s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id 3872ceae75d4
37s 37s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id 3872ceae75d4
30s 30s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id d97b99b6780c
30s 30s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id d97b99b6780c
29s 29s 1 {kubelet 192.168.1.10} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 10s restarting failed container=frontend pod=redis-django_default(9d0a966a-39d0-11e6-9027-000c293d51ab)"
42s 16s 3 {kubelet 192.168.1.10} spec.containers{frontend} Normal Pulling pulling image "django"
11s 11s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id 9c89602739ab
38s 11s 3 {kubelet 192.168.1.10} spec.containers{frontend} Normal Pulled Successfully pulled image "django"
11s 11s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id 9c89602739ab
29s 10s 2 {kubelet 192.168.1.10} spec.containers{frontend} Warning BackOff Back-off restarting failed docker container
10s 10s 1 {kubelet 192.168.1.10} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 20s restarting failed container=frontend pod=redis-django_default(9d0a966a-39d0-11e6-9027-000c293d51ab)"
For the frontend container: not showing any log messages.
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl logs redis-django -p -c frontend
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl logs redis-django -p -c key-value-store
Error from server: previous terminated container "key-value-store" in pod "redis-django" not found
[email protected]:~/kubernetes/cluster/ubuntu/binaries# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8dbdd6826c35 redis "docker-entrypoint.sh" 2 minutes ago Up 2 minutes k8s_key-value-store.f572c2d_redis-django_default_9d0a966a-39d0-11e6-9027-000c293d51ab_11101aea
8995bbf9f4f4 gcr.io/google_containers/pause:2.0 "/pause" 2 minutes ago Up 2 minutes k8s_POD.48e5231f_redis-django_default_9d0a966a-39d0-11e6-9027-000c293d51ab_c00025b0
[email protected]:~/kubernetes/cluster/ubuntu/binaries#