If you run into problems with your Dataflow pipeline or job, this
page lists error messages that you might see and provides suggestions for how to
fix each error.
Errors in the log types dataflow.googleapis.com/worker-startup, dataflow.googleapis.com/harness-startup, and dataflow.googleapis.com/kubelet indicate configuration problems with a job. They can also indicate conditions that prevent the normal logging path from functioning.
Your pipeline might throw exceptions while processing data. Some of these errors
are transient, for example when temporary difficulty accessing an external
service occurs. Some of these errors are permanent, such as errors caused by
corrupt or unparseable input data, or null pointers during computation.
Dataflow processes elements in arbitrary bundles and retries the
complete bundle when an error is thrown for any element in that bundle. When
running in batch mode, bundles including a failing item are retried four times.
The pipeline fails completely when a single bundle fails four times. When
running in streaming mode, a bundle including a failing item is retried
indefinitely, which might cause your pipeline to permanently stall.
Exceptions in user code, for example, your DoFn instances, are reported in the Dataflow Monitoring Interface. If you run your pipeline with BlockingDataflowPipelineRunner, you also see error messages printed in your console or terminal window.
Consider guarding against errors in your code by adding exception handlers. For example, if you want to drop elements that fail some custom input validation done in a ParDo, use a try/catch block within your ParDo to handle the exception, then log and drop the element. For production workloads, implement an unprocessed message pattern. To keep track of the error count, use aggregation transforms.
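The drop-and-log pattern described above can be sketched as a plain Python function; the element format, the validation rule, and the function name are illustrative, and in a real pipeline this logic would live inside a ParDo's DoFn.process method:

```python
import json
import logging

def process_element(element):
    """Validate one element; drop it (emit nothing) when validation fails.

    Illustrative sketch: in a real pipeline this body would be
    DoFn.process, and the failed element would typically go to an
    unprocessed-message (dead-letter) output instead of being dropped.
    """
    try:
        record = json.loads(element)
        if "id" not in record:
            raise ValueError("missing required field: id")
        yield record
    except ValueError as exc:
        # Log and drop the bad element so the bundle is not retried forever.
        logging.warning("Dropping bad element %r: %s", element, exc)

good = list(process_element('{"id": 1}'))
bad = list(process_element("not json"))
```

Because json.JSONDecodeError subclasses ValueError, both unparseable input and the custom validation failure are caught by the same handler.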
Missing log files
If you don’t see any logs for your jobs, remove any exclusion filters containing resource.type="dataflow_step" from all of your Cloud Logging Log Router sinks.
For more details about removing your log exclusions, refer to the Removing exclusions guide.
Pipeline errors
The following sections contain common pipeline errors that you might encounter
and steps for resolving or troubleshooting the errors.
Some Cloud APIs need to be enabled
When you try to run a Dataflow job, the following error occurs:
Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.
This issue occurs because some required APIs are not enabled in your project.
To resolve this issue and run a Dataflow job, enable the following
Google Cloud APIs in your project:
- Compute Engine API (Compute Engine)
- Cloud Logging API
- Cloud Storage
- Cloud Storage JSON API
- BigQuery API
- Pub/Sub
- Datastore API
For detailed instructions, see the Getting Started section on enabling Google Cloud APIs.
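If you use the gcloud CLI, you can enable these APIs in one command. PROJECT_ID is a placeholder, and the service names below are the standard identifiers for the APIs listed above:

```shell
# Enable the APIs that Dataflow jobs depend on (replace PROJECT_ID).
gcloud services enable \
    compute.googleapis.com \
    logging.googleapis.com \
    storage.googleapis.com \
    storage-api.googleapis.com \
    bigquery.googleapis.com \
    pubsub.googleapis.com \
    datastore.googleapis.com \
    --project=PROJECT_ID
```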
Bad request
When you run a Dataflow job,
Cloud Monitoring logs display
a series of warnings similar to the following:
Unable to update setup work item STEP_ID error: generic::invalid_argument: Http(400) Bad Request
Update range task returned 'invalid argument'. Assuming lost lease for work with id LEASE_ID
with expiration time: TIMESTAMP, now: TIMESTAMP. Full status: generic::invalid_argument: Http(400) Bad Request
Bad request warnings occur if worker state information is stale or out of sync
due to processing delays. Often, your Dataflow job succeeds
despite the bad request warnings. If that is the case, ignore the warnings.
Cannot read and write in different locations
When you run a Dataflow job, you might see the following error in
the log files:
message:Cannot read and write in different locations: source: SOURCE_REGION, destination: DESTINATION_REGION,reason:invalid
This error occurs when the source and destination are in different regions. It can also occur when the staging location and destination are in different regions. For example, if the job reads from Pub/Sub and then writes to a Cloud Storage temp bucket before writing to a BigQuery table, the Cloud Storage temp bucket and the BigQuery table must be in the same region.
Multi-region locations are considered different than single-region locations,
even if the single region falls within the scope of the multi-region location.
For example, us (multiple regions in the United States) and us-central1 are different regions.
To resolve this issue, have your destination, source, and staging locations in
the same region. Cloud Storage bucket locations can’t be changed, so you
might need to create a new Cloud Storage bucket in the correct region.
No such object
When you run your Dataflow jobs, you might see the following error in
the log files:
..., 'server': 'UploadServer', 'status': '404'}>, <content <No such object:...
These errors typically occur when some of your running Dataflow jobs use the same temp_location to stage temporary job files created when the pipeline runs. When multiple concurrent jobs share the same temp_location, these jobs might step on each other’s temporary data, and a race condition might occur. To avoid this issue, use a unique temp_location for each job.
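One way to guarantee a unique temp_location per job is to append a random suffix when you construct the pipeline options; the bucket name here is an assumption:

```python
import uuid

# Illustrative: derive a unique temp_location per job so that concurrent
# jobs never share a staging path (bucket name is a placeholder).
base = "gs://my-bucket/tmp"
job_suffix = uuid.uuid4().hex
temp_location = f"{base}/{job_suffix}"
# Pass this value to your pipeline, e.g. --temp_location=<temp_location>.
```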
DEADLINE_EXCEEDED or Server Unresponsive
When you run your jobs, you might encounter RPC timeout exceptions or one of the
following errors:
DEADLINE_EXCEEDED
Or:
Server Unresponsive
These errors typically occur for one of the following reasons:
- The VPC network used for your job might be missing a firewall rule. The firewall rule needs to enable all TCP traffic among VMs in the VPC network you specified in your pipeline options. See Specifying your network and subnetwork for more details.
- Your job is shuffle-bound.
To resolve this issue, make one or more of the following changes.
Java
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
Python
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
Go
- If the job is not using the service-based shuffle, switch to using the service-based Dataflow Shuffle by setting --experiments=shuffle_mode=service. For details and availability, read Dataflow Shuffle.
- Add more workers. Try setting --num_workers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --disk_size_gb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd" when you run your pipeline.
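For example, a Python launch command combining these flags might look like the following sketch (the script name, project, region, and values are placeholders):

```shell
# Illustrative launch command; adjust values for your project and job.
python my_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --experiments=shuffle_mode=service \
    --num_workers=20 \
    --disk_size_gb=200 \
    --worker_disk_type="compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/pd-ssd"
```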
Encoding errors, IOExceptions, or unexpected behavior in user code
The Apache Beam SDKs and the Dataflow workers depend on common
third-party components. These components import additional dependencies. Version
collisions can result in unexpected behavior in the service. If you are using
any of these packages in your code, be aware that some libraries are not
forward-compatible. You might need to pin to the listed versions that are in
scope during execution.
SDK and Worker Dependencies
contains a list of dependencies and their required versions.
Error running LookupEffectiveGuestPolicies
When you run a Dataflow job, you might see the following error in
the log files:
OSConfigAgent Error policies.go:49: Error running LookupEffectiveGuestPolicies:
error calling LookupEffectiveGuestPolicies: code: "Unauthenticated",
message: "Request is missing required authentication credential.
Expected OAuth 2 access token, login cookie or other valid authentication credential.
This error occurs if
OS configuration management is
enabled for the entire project.
To resolve this issue, disable VM Manager
policies that apply to the entire project. If disabling VM Manager
policies for the entire project isn’t possible, you can safely ignore this error
and filter it out of log monitoring tools.
Exhausted resource pool
When you create a Google Cloud resource, you might see the following error for
an exhausted resource pool:
ERROR: ZONE_RESOURCE_POOL_EXHAUSTED
This error occurs for temporary stock-out conditions for a specific resource in
a specific zone.
To resolve the issue, you can either wait for a period of time or create the
same resource in another zone. As a best practice, we recommend that you
distribute your resources across
multiple zones and regions
to tolerate outages.
A fatal error has been detected by the Java Runtime Environment
The following error occurs during worker startup:
A fatal error has been detected by the Java Runtime Environment
This error occurs if the pipeline is using Java Native Interface (JNI) to run
non-Java code and that code or the JNI bindings contain an error.
A hot key … was detected
The following error occurs:
A hot key HOT_KEY_NAME was detected in...
These errors occur if your data contains a hot key. A hot key is a key with
enough elements to negatively impact pipeline performance. These keys limit
Dataflow’s ability to process elements in parallel, which
increases execution time.
To print the human-readable key to the logs when a hot key is detected in the
pipeline, use the
hot key pipeline option.
To resolve this issue, check that your data is evenly distributed. If a key has
disproportionately many values, consider the following courses of action:
- Rekey your data. Apply a ParDo transform to output new key-value pairs.
- For Java jobs, use the Combine.PerKey.withHotKeyFanout transform.
- For Python jobs, use the CombinePerKey.with_hot_key_fanout transform.
- Enable Dataflow Shuffle.
To view hot keys in the Dataflow monitoring UI, see
Troubleshoot stragglers in batch jobs.
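Rekeying can be sketched as key salting: spread a hot key across several sub-keys before grouping, then recombine the partial results per original key in a second step. The function below is a plain-Python illustration (the names and fanout value are arbitrary), mirroring what a ParDo would do per element:

```python
import random

def salt_key(kv, fanout=8):
    """Spread a hot key across `fanout` sub-keys (illustrative rekeying).

    Apply this before the grouping step; afterwards, aggregate the
    per-sub-key results back to the original key.
    """
    key, value = kv
    return ((key, random.randrange(fanout)), value)

salted = salt_key(("hot-user", 1), fanout=8)
```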
Invalid table specification in Data Catalog
When you use Dataflow SQL to create Dataflow SQL jobs, your job might fail
with the following error in the log files:
Invalid table specification in Data Catalog: Could not resolve table in Data Catalog
This error occurs if the Dataflow service account doesn’t have
access to the Data Catalog API.
To resolve this issue, enable the Data Catalog API in the Google Cloud project that you’re using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.
The job graph is too large
Your job might fail with the following error:
The job graph is too large. Please try again with a smaller job graph,
or split your job into two or more smaller jobs.
This error occurs if your job’s graph size exceeds 10 MB. Certain
conditions in your pipeline can cause the job graph to exceed the limit. Common
conditions include:
- A Create transform that includes a large amount of in-memory data.
- A large DoFn instance that is serialized for transmission to remote workers.
- A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
- A directed acyclic graph (DAG) is being used as part of a programmatic loop that is enumerating a large list.
To avoid these conditions, consider restructuring your pipeline.
Key Commit Too Large
When running a streaming job, the following error appears in the worker log
files:
KeyCommitTooLargeException
This error occurs in streaming scenarios if a very large amount of data is grouped without using a Combine transform, or if a large amount of data is produced from a single input element.
To reduce the possibility of encountering this error, use the following strategies:
- Ensure that processing a single element cannot result in outputs or state modifications exceeding the limit.
- If multiple elements were grouped by a key, consider increasing the key space to reduce the elements grouped per key.
- If elements for a key are emitted at a high frequency over a short period of time, that might result in many GB of events for that key in windows. Rewrite the pipeline to detect keys like this and only emit an output indicating the key was frequently present in that window.
- Make sure to use sub-linear space Combine transforms for commutative and associative operations. Don’t use a combiner if it doesn’t reduce space. For example, a combiner for strings that just appends strings together is worse than not using a combiner.
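For instance, a sub-linear combiner keeps only a fixed-size accumulator, such as a count and a sum, regardless of how many elements it sees. The plain-Python sketch below illustrates the accumulator shape; the functions mirror a Beam CombineFn's add_input and merge_accumulators but are not the Beam API:

```python
# Constant-space combiner sketch: accumulate (count, sum), never the
# elements themselves, so state per key stays small.
def add_input(acc, value):
    count, total = acc
    return (count + 1, total + value)

def merge_accumulators(a, b):
    return (a[0] + b[0], a[1] + b[1])

acc = (0, 0)
for v in [3, 5, 7]:
    acc = add_input(acc, v)
merged = merge_accumulators(acc, (2, 10))
```

By contrast, an accumulator that concatenates the input strings grows linearly with the data and saves nothing.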
rejecting message over 7168K
When you run a Dataflow job created from a template, the job
might fail with the following error:
Error: CommitWork failed: status: APPLICATION_ERROR(3): Pubsub publish requests are limited to 10MB, rejecting message over 7168K (size MESSAGE_SIZE) to avoid exceeding limit with byte64 request encoding.
This error occurs when messages written to a dead-letter queue exceed the size
limit of 7168K.
As a workaround, enable
Streaming Engine,
which has a higher size limit.
To enable Streaming Engine, use the following
pipeline option.
Java
--enableStreamingEngine=true
Python
--enable_streaming_engine=true
Request Entity Too Large
When you submit your job, one of the following errors appears in your console or
terminal window:
413 Request Entity Too Large
The size of serialized JSON representation of the pipeline exceeds the allowable limit
Failed to create a workflow job: Invalid JSON payload received
Failed to create a workflow job: Request payload exceeds the allowable limit
When you encounter an error about the JSON payload when submitting your job,
your pipeline’s JSON representation exceeds the maximum 20 MB request size.
The size of your job is specifically tied to the JSON representation of the
pipeline. A larger pipeline means a larger request. Dataflow
currently has a limitation that caps requests at 20 MB.
To estimate the size of your pipeline’s JSON request, run your pipeline with the
following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
This command writes a JSON representation of your job to a file. The size of the serialized file is a good estimate of the size of the request. The actual size is slightly larger due to some additional information included in the request.
Certain conditions in your pipeline can cause the JSON representation to exceed
the limit. Common conditions include:
- A Create transform that includes a large amount of in-memory data.
- A large DoFn instance that is serialized for transmission to remote workers.
- A DoFn as an anonymous inner class instance that (possibly inadvertently) pulls in a large amount of data to be serialized.
To avoid these conditions, consider restructuring your pipeline.
SDK pipeline options or staging file list exceeds size limit
When running a pipeline, the following error occurs:
SDK pipeline options or staging file list exceeds size limit.
Please keep their length under 256K Bytes each and 512K Bytes in total.
This error occurs if the pipeline could not be started due to Google
Compute Engine metadata limits being exceeded. These limits cannot be changed.
Dataflow uses Compute Engine metadata for pipeline options. The
limit is documented in the Compute Engine custom metadata
limitations.
Having too many JAR files to stage can cause the JSON representation to exceed
the limit.
To estimate the size of your pipeline’s JSON request, run your pipeline with the
following option:
Java
--dataflowJobFile=PATH_TO_OUTPUT_FILE
Python
--dataflow_job_file=PATH_TO_OUTPUT_FILE
Go
Outputting your job as JSON is not supported in Go.
The size of the output file from this command must be less than 256 KB. The 512 KB in the error message refers to the total size of the output file and the custom metadata options for the Compute Engine VM instance.
You can get a rough estimate of the custom metadata options for the VM instance from running Dataflow jobs in the project. Choose any running Dataflow job, take a VM instance, and then navigate to the Compute Engine VM instance details page for that VM to check the custom metadata section. The total length of the custom metadata and the file should be less than 512 KB. An accurate estimate for the failed job is not possible, because the VMs are not spun up for failed jobs.
If your JAR list is hitting the 256 KB limit, review it and reduce any unnecessary JAR files. If it is still too large, try executing the Dataflow job using an uber JAR. The Cloud Functions section of the documentation has an example showing how to create and use an uber JAR.
Shuffle key too large
The following error appears in the worker log files:
Shuffle key too large
This error occurs if the serialized key emitted to a particular (Co-)GroupByKey is too large after the corresponding coder is applied. Dataflow has a limit for serialized shuffle keys.
To resolve this issue, reduce the size of the keys or use more space-efficient coders.
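One space-efficient option is to replace an oversized key with a fixed-size digest before the group-by step, keeping the original key in the value if it is still needed downstream. A plain-Python sketch (the choice of hash is an assumption, not a Dataflow requirement):

```python
import hashlib

def compact_key(key: str) -> str:
    # Replace an arbitrarily large shuffle key with a fixed-size digest.
    # Keep the original key inside the value if you need it later.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

digest = compact_key("x" * 100_000)
```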
Total number of BoundedSource objects … is larger than the allowable limit
One of the following errors might occur when running jobs with Java:
Total number of BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Or:
Total size of the BoundedSource objects generated by splitIntoBundles() operation is larger than the allowable limit
Java
This error can occur if you’re reading from a very large number of files via TextIO, AvroIO, or some other file-based source. The particular limit depends on the details of your source, but it is on the order of tens of thousands of files in one pipeline. For example, embedding schema in AvroIO.Read allows fewer files.
This error can also occur if you created a custom data source for your pipeline and your source’s splitIntoBundles method returned a list of BoundedSource objects that takes up more than 20 MB when serialized.
The allowable limit for the total size of the BoundedSource objects generated by your custom source’s splitIntoBundles() operation is 20 MB.
To work around this limitation, modify your custom BoundedSource subclass so that the total size of the generated BoundedSource objects is smaller than the 20 MB limit. For example, your source might generate fewer splits initially, and rely on Dynamic Work Rebalancing to further split inputs on demand.
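The "fewer splits initially" idea can be sketched with a small helper that caps the number of bundles a source returns; this is an illustrative function, not part of the Beam API:

```python
import math

def initial_bundle_count(total_bytes, desired_bundle_bytes, max_bundles=1000):
    """Cap the number of initial splits a custom source produces.

    Illustrative helper: return at most max_bundles coarse splits and
    rely on dynamic work rebalancing to split further on demand, keeping
    the serialized list of sources well under the 20 MB limit.
    """
    wanted = math.ceil(total_bytes / desired_bundle_bytes)
    return min(wanted, max_bundles)

n = initial_bundle_count(10_000_000_000, 1_000_000, max_bundles=1000)
```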
NameError
When you execute your pipeline using the Dataflow service, the
following error occurs:
NameError
This error does not occur when you execute locally, such as when you execute using the DirectRunner.
This error occurs if your DoFns are using values in the global namespace that are not available on the Dataflow worker. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
To resolve this issue, use one of the following methods. If your DoFns are defined in the main file and reference imports and functions in the global namespace, set the --save_main_session pipeline option to True. This change pickles the state of the global namespace and loads it on the Dataflow worker.
If you have objects in your global namespace that can’t be pickled, a pickling error occurs. If the error concerns a module that should be available in the Python distribution, import the module locally, where it is used.
For example, instead of:

import re
…
def myfunc():
  # use re module

use:

def myfunc():
  import re
  # use re module

Alternatively, if your DoFns span multiple files, use a different approach to packaging your workflow and managing dependencies.
Processing stuck/Operation ongoing
If Dataflow spends more time executing a DoFn than the time specified in TIME_INTERVAL without returning, the following message is displayed:
Processing stuck/Operation ongoing in step STEP_ID for at least TIME_INTERVAL without outputting or completing in state finish at STACK_TRACE
This behavior has two possible causes:
- Your DoFn code is simply slow, or waiting for some slow external operation to complete.
- Your DoFn code might be stuck, deadlocked, or abnormally slow to finish processing.
To determine which of these is the case, expand the Cloud Monitoring log entry to see a stack trace. Look for messages that indicate that the DoFn code is stuck or otherwise encountering issues. If no messages are present, the issue might be the execution speed of the DoFn code. Consider using Cloud Profiler or another tool to investigate the code’s performance.
You can further investigate the cause of your stuck code if your pipeline is built on the Java VM (using either Java or Scala). Take a full thread dump of the whole JVM (not just the stuck thread) by following these steps:
- Make note of the worker name from the log entry.
- In the Compute Engine section of the Google Cloud console, find the Compute Engine instance with the worker name you noted.
- SSH into the instance with that name.
- Run the following command:
curl http://localhost:8081/threadz
Pub/Sub quota errors
When running a streaming pipeline from Pub/Sub, the following
errors occur:
429 (rateLimitExceeded)
Or:
Request was throttled due to user QPS limit being reached
These errors occur if your project has insufficient
Pub/Sub quota.
To find out if your project has insufficient quota, follow these steps to check
for client errors:
- Go to the Google Cloud console.
- In the menu on the left, select APIs & services.
- In the Search Box, search for Cloud Pub/Sub.
- Click the Usage tab.
- Check Response Codes and look for (4xx) client error codes.
Pub/Sub unable to determine backlog
When running a streaming pipeline from Pub/Sub, the following
error occurs:
Dataflow is unable to determine the backlog for Pub/Sub subscription
When a Dataflow pipeline pulls data from Pub/Sub,
Dataflow needs to repeatedly request information from
Pub/Sub about the amount of backlog on the subscription and the
age of the oldest unacknowledged message. This information is necessary for
autoscaling and to advance the watermark of the pipeline. Occasionally,
Dataflow is unable to retrieve this information from
Pub/Sub because of internal system issues, in which case the
pipeline might not autoscale properly, and the watermark might not advance. If
these problems persist, contact customer support.
For more information, see
Streaming With Cloud Pub/Sub.
Request is prohibited by organization’s policy
When running a pipeline, the following error occurs:
Error trying to get gs://BUCKET_NAME/FOLDER/FILE:
{"code":403,"errors":[{"domain":"global","message":"Request is prohibited by organization's policy","reason":"forbidden"}],
"message":"Request is prohibited by organization's policy"}
This error occurs if the Cloud Storage bucket is outside of your
service perimeter.
To resolve this issue, create an
egress rule that allows
access to the bucket outside of the service perimeter.
Staged package…is inaccessible
Jobs that used to succeed might fail with the following error:
Staged package...is inaccessible
To resolve this issue:
- Verify that the Cloud Storage bucket used for staging does not have TTL settings that cause staged packages to be deleted.
- Verify that your Dataflow project’s controller service account has the permission to access the Cloud Storage bucket used for staging. Gaps in permission can be due to any of the following reasons:
  - The Cloud Storage bucket used for staging is present in a different project.
  - The Cloud Storage bucket used for staging was migrated from fine-grained access to uniform bucket-level access. Due to the inconsistency between IAM and ACL policies, migrating the staging bucket to uniform bucket-level access disallows ACLs for Cloud Storage resources, which includes the permissions held by your Dataflow project’s controller service account over the staging bucket.
For more information, see Accessing Cloud Storage buckets across Google Cloud projects.
A work item failed 4 times
The following error occurs when a job fails:
a work item failed 4 times
This error occurs if a single operation causes the worker code to fail four
times. Dataflow fails the job, and this message is displayed.
You can’t configure this failure threshold. For more details, refer to
pipeline error and exception handling.
To resolve this issue, look in the job’s
Cloud Monitoring logs for the
four individual failures. Look for Error-level or Fatal-level log
entries in the worker logs that show exceptions or errors. The exception or
error should appear at least four times. If the logs only contain generic
timeout errors related to accessing external resources, such as MongoDB, verify
that the worker service account has permission to access the resource’s
subnetwork.
Timeout in Polling Result File
The following occurs when a job fails:
Timeout in polling result file: PATH. Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service account SERVICE_ACCOUNT may not have enough permissions to pull
container image IMAGE_PATH or create new objects in PATH.
3. Transient errors occurred, please try again.
This issue is often related to how the Python dependencies are installed via the requirements.txt file. The Apache Beam stager downloads the source of all dependencies from PyPI, including the sources of transitive dependencies. Then, wheel compilation happens implicitly during the pip download command for some of the Python packages that are dependencies of apache-beam. This process can result in a timeout issue from the requirements.txt file.
This issue is known to the Apache Arrow team. The suggested workaround is to install apache-beam directly in the Dockerfile. This way, the timeout for the requirements.txt file is not applied.
Container image errors
The following sections contain common errors that you might encounter when using
custom containers and steps for resolving or troubleshooting the errors. The
errors are typically prefixed with the following message:
Unable to pull container image due to error: DETAILED_ERROR_MESSAGE
Custom container pipeline fails or is slow to start up workers
A known issue might cause Dataflow to mark workers as unresponsive
and restart the worker if they are using a large custom container (~10 GB) that
takes a long time to download. This issue can cause pipeline startup to be slow
or in some extreme cases prevent workers from starting up at all.
To confirm that this known issue is causing the problem, in dataflow.googleapis.com/kubelet, look for worker logs that show several failed attempts to pull a container, and confirm that kubelet did not start on the worker. For example, the logs might contain Pulling image <URL> without a corresponding Successfully pulled image <URL> or Started Start kubelet after all images have been pulled.
To work around this issue, run the pipeline with the --experiments=disable_worker_container_image_prepull pipeline option.
Error syncing pod … failed to "StartContainer"
The following error occurs during worker startup:
Error syncing pod POD_ID, skipping: [failed to "StartContainer" for CONTAINER_NAME with CrashLoopBackOff: "back-off 5m0s restarting failed container=CONTAINER_NAME pod=POD_NAME].
A pod is a co-located group of Docker containers running on a
Dataflow worker. This error occurs when one of the Docker
containers in the pod fails to start. If the failure is not recoverable, the
Dataflow worker isn’t able to start, and Dataflow
batch jobs eventually fail with errors like the following:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
This error typically occurs when one of the containers is continuously crashing
during startup. To understand the root cause, look for the logs captured
immediately prior to the failure. To analyze the logs, use the
Logs Explorer.
In the Logs Explorer, limit the log files to log entries emitted from the
worker with container startup errors. To limit the log entries, complete the
following steps:
- In the Logs Explorer, find the Error syncing pod log entry.
- To see the labels associated with the log entry, expand the log entry.
- Click the label associated with resource_name, and then click Show matching entries.
In the Logs Explorer, the Dataflow logs are organized into several log streams. The Error syncing pod message is emitted in the log named kubelet. However, the logs from the failing container could be in a different log stream. Each container has a name. Use the following table to determine which log stream might contain logs relevant to the failing container.
Container name | Log names
---|---
sdk, sdk0, sdk1, sdk-0-0, and so on | docker
harness | harness, harness-startup
python, java-batch, java-streaming | worker-startup, worker
artifact | artifact
When you query the Logs Explorer, make sure that the query either includes
the relevant log names
in the query builder UI
or does not have restrictions on the log name.
After you select the relevant logs, the query result might look like the
following example:
resource.type="dataflow_step"
resource.labels.job_id="2022-06-29_08_02_54-JOB_ID"
labels."compute.googleapis.com/resource_name"="testpipeline-jenkins-0629-DATE-cyhg-harness-8crw"
logName=("projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fdocker"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker-startup"
OR
"projects/apache-beam-testing/logs/dataflow.googleapis.com%2Fworker")
Because the logs reporting the symptom of the container failure are sometimes
reported as INFO, include INFO logs in your analysis.
Typical causes of container failures include the following:
- Your Python pipeline has additional dependencies that are installed at
runtime, and the installation is unsuccessful. You might see errors like
pip install failed with error. This issue might occur due to conflicting
requirements, or due to a restricted networking configuration that prevents
a Dataflow worker from pulling an external dependency from a public
repository over the internet.
- A worker fails in the middle of the pipeline run due to an out-of-memory
error. You might see an error like one of the following:
java.lang.OutOfMemoryError: Java heap space
Shutting down JVM after 8 consecutive periods of measured GC thrashing.
Memory is used/total/max = 24453/42043/42043 MB, GC last/max =
58.97/99.89 %, #pushbacks=82, gc thrashing=true. Heap dump not written.
To debug an out-of-memory issue, see
Troubleshoot Dataflow out of memory errors.
- Dataflow is unable to pull the container image. For more information, see
Image pull request failed with error.
After you identify the error causing the container to fail, try to address the
error, and then resubmit the pipeline.
Image pull request failed with error
During worker startup, one of the following errors appears in the worker or job
logs:
Image pull request failed with error
pull access denied for IMAGE_NAME
manifest for IMAGE_NAME not found: manifest unknown: Failed to fetch
Get IMAGE_NAME: Service Unavailable
These errors occur if a worker is unable to start up because the worker can’t
pull a Docker container image. This issue happens in the following scenarios:
- The custom SDK container image URL is incorrect.
- The worker lacks credentials or network access to the remote image.
To resolve this issue:
- If you are using a custom container image with your job, verify that your
image URL is correct, has a valid tag or digest, and that the image is
accessible by the Dataflow workers.
- Verify that public images can be pulled locally by running docker pull
$image from an unauthenticated machine.
For private images or private workers:
- Dataflow only supports private container images hosted on
Container Registry. If you are using Container Registry to host your container
image, the default Google Cloud service account has access to images in the
same project. If you are using images in a different project than the one
used to run your Google Cloud job, make sure to configure access control for
the default Google Cloud service account.
- If using a Shared Virtual Private Cloud (VPC), make sure that workers can
access the custom container repository host.
- Use ssh to connect to a running job worker VM and run docker pull $image
to directly confirm that the worker is configured properly.
If workers fail several times in a row due to this error and work has started
on a job, the job can fail with an error similar to the following:
Job appears to be stuck.
If you remove access to the image while the job is running, either by removing
the image itself or by revoking the Dataflow worker service account’s
credentials or internet access, Dataflow only logs errors and
doesn’t take action to fail the job. Dataflow also avoids failing
long-running streaming pipelines in order to avoid losing pipeline state.
Other possible errors can arise from repository quota issues or outages. If you
experience issues exceeding the
Docker Hub quota
for pulling public images or general third-party repository outages, consider
using Container Registry as the image repository.
SystemError: unknown opcode
Your Python custom container pipeline might fail with the following error
immediately after job submission:
SystemError: unknown opcode
In addition, the stack trace might include apache_beam/internal/pickler.py.
To resolve this issue, verify that the Python version that you are using
locally matches the version in the container image up to the major and minor
version. A difference in the patch version, such as 3.6.7 versus 3.6.8, does
not create compatibility issues, but a difference in the minor version, such
as 3.6.8 versus 3.8.2, can cause pipeline failures.
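As a quick illustration, the major.minor rule above can be expressed as a small check (the helper name is hypothetical, not part of any SDK):

```python
# Hypothetical helper illustrating the rule above: Python versions are
# considered compatible when major and minor match; the patch level may differ.
def versions_compatible(local, container):
    """Return True when the major.minor components of two version tuples match."""
    return local[:2] == container[:2]

print(versions_compatible((3, 6, 7), (3, 6, 8)))  # patch differs only -> True
print(versions_compatible((3, 6, 8), (3, 8, 2)))  # minor differs -> False
```

Locally, the running interpreter's version is available as sys.version_info[:3].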
Worker errors
The following sections contain common worker errors that you might encounter and
steps for resolving or troubleshooting the errors.
Call from Java worker harness to Python DoFn fails with error
If a call from the Java worker harness to a Python DoFn
fails, a relevant
error message is displayed.
To investigate the error, expand the
Cloud Monitoring error log
entry and look at the error message and traceback. It shows you which code
failed so you can correct it if necessary. If you believe that this is a bug in
Apache Beam or Dataflow,
report the bug.
EOFError: marshal data too short
The following error appears in the worker logs:
EOFError: marshal data too short
This error sometimes occurs when Python pipeline workers run out of disk space.
To resolve this issue, see No space left on device.
No space left on device
When a job runs out of disk space, the following error might appear in the
worker logs:
No space left on device
This error can occur for one of the following reasons:
- The worker persistent storage runs out of free space, which can occur for
one of the following reasons:
  - A job downloads large dependencies at runtime
  - A job uses large custom containers
  - A job writes a lot of temporary data to local disk
- When using Dataflow Shuffle, Dataflow sets a lower default disk size. As a
result, this error might occur with jobs moving from worker-based shuffle.
- The worker boot disk fills up because it is logging more than 50 entries
per second.
To resolve this issue, follow these troubleshooting steps:
To see disk resources associated with a single worker, look up VM instance
details for worker VMs associated with your job. Part of the disk space is
consumed by the operating system, binaries, logs, and containers.
To increase persistent disk or boot disk space, adjust the
disk size pipeline option.
Track disk space usage on the worker VM instances by using Cloud Monitoring.
See
Receive worker VM metrics from the Monitoring agent
for instructions explaining how to set this up.
Look for boot disk space issues by
Viewing serial port output
on the worker VM instances and looking for messages like:
Failed to open system journal: No space left on device
If you have a large number of worker VM instances, you can create a script to
run gcloud compute instances get-serial-port-output
on all of them at once and
review the output from that instead.
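A minimal sketch of such a review, assuming the serial-port output for each VM has already been captured (for example with gcloud compute instances get-serial-port-output); the helper function is illustrative and not part of any Google Cloud SDK:

```python
# Scan captured serial-port output for disk-full messages.
# This helper is a sketch, not part of any Google Cloud SDK.
def find_disk_full_lines(serial_output):
    """Return the lines that report 'No space left on device'."""
    return [line for line in serial_output.splitlines()
            if "No space left on device" in line]

sample_output = (
    "systemd[1]: Started Docker Application Container Engine.\n"
    "Failed to open system journal: No space left on device\n"
)
print(find_disk_full_lines(sample_output))
```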
Python pipeline fails after one hour of worker inactivity
When using the Apache Beam SDK for Python with Dataflow Runner
V2 on worker machines with a large number of CPU cores and when the pipeline
performs Python dependency installations at startup, the Dataflow
job might take a long time to start.
This issue occurs when Runner V2 starts one container per CPU core on workers
when running Python pipelines. On workers with many cores, the simultaneous
container startup and initialization might cause resource exhaustion.
To resolve this issue, pre-build your Python container. This step can improve
VM startup times and horizontal autoscaling performance. To use this
experimental feature, enable the Cloud Build API on your project and submit
your pipeline with the following parameter:
--prebuild_sdk_container_engine=cloud_build.
For more information, see
Dataflow Runner V2.
Alternatively,
use a custom container image
with all dependencies preinstalled.
As a workaround, you can run the pipeline with the
--experiments=disable_runner_v2 pipeline option, or upgrade to Beam SDK
2.35.0 or later and run the pipeline with the
--dataflow_service_options=use_sibling_sdk_workers pipeline option.
Startup of the worker pool in zone failed to bring up any of the desired workers
The following error occurs:
Startup of the worker pool in zone ZONE_NAME failed to bring up any of the desired NUMBER workers.
The project quota may have been exceeded or access control policies may be preventing the operation;
review the Cloud Logging 'VM Instance' log for diagnostics.
This error occurs for one of the following reasons:
- You have exceeded one of the Compute Engine quotas that Dataflow
worker creation relies on.
- Your organization has constraints in place that prohibit some aspect of
the VM instance creation process, such as the account being used or the zone
being targeted.
To resolve this issue, follow these troubleshooting steps:
Review the VM Instance log
- Go to the Cloud Logging viewer.
- In the Audited Resource drop-down list, select VM Instance.
- In the All logs drop-down list, select compute.googleapis.com/activity_log.
- Scan the log for any entries related to VM instance creation failure.
Check your usage of Compute Engine quotas
- On your local machine command line or in Cloud Shell, run the following
command to view Compute Engine resource usage compared to the quotas for the
zone you are targeting:
gcloud compute regions describe [REGION]
- Review the results for the following resources to see if any are exceeding
quota:
  - CPUS
  - DISKS_TOTAL_GB
  - IN_USE_ADDRESSES
  - INSTANCE_GROUPS
  - INSTANCES
  - REGIONAL_INSTANCE_GROUP_MANAGERS
- If needed, request a quota change.
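As an illustration, the quota review can be automated against the JSON output of gcloud compute regions describe [REGION] --format=json, whose quotas field lists metric, limit, and usage for each resource. The helper and the 90% threshold below are hypothetical choices, not part of any SDK:

```python
# Flag quota metrics whose usage is at or above a fraction of the limit.
# Input mirrors the "quotas" field of `gcloud compute regions describe`.
def near_limit(quotas, threshold=0.9):
    """Return metrics whose usage is at or above `threshold` of the limit."""
    return [q["metric"] for q in quotas
            if q["limit"] > 0 and q["usage"] / q["limit"] >= threshold]

sample = [
    {"metric": "CPUS", "limit": 24.0, "usage": 24.0},
    {"metric": "DISKS_TOTAL_GB", "limit": 4096.0, "usage": 500.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 8.0, "usage": 8.0},
]
print(near_limit(sample))  # -> ['CPUS', 'IN_USE_ADDRESSES']
```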
Review your organization policy constraints
- Go to the Organization policies page.
- Review the constraints for any that might limit VM instance creation for
either the account you are using (by default, the Dataflow service account)
or the zone you are targeting.
Timed out waiting for an update from the worker
When a Dataflow job fails, the following error occurs:
Root cause: Timed out waiting for an update from the worker. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact.
In some cases, this error occurs when the worker runs out of memory or swap
space. In this scenario, to resolve this issue, as a first step, try running the
job again. If the job still fails and the same error occurs, try using a worker
with more memory and disk space. For example, add the following pipeline startup
option:
--worker_machine_type=m1-ultramem-40 --disk_size_gb=500
Note that changing the worker type could impact billed cost. For more information,
see Troubleshoot Dataflow out of memory errors.
This error can also occur when your data contains a hot key. In this scenario,
CPU utilization is high on some workers during most of the job’s duration, but
the number of workers does not reach the maximum allowed. For more information
about hot keys and possible solutions, see
Writing Dataflow pipelines with scalability in mind.
For additional solutions to this issue, see
A hot key … was detected.
If your Python code calls C/C++ code by using the Python extension mechanism,
check whether the extension code releases the Python Global Interpreter Lock
(GIL) in computationally intensive parts of code that don’t access Python
state. Libraries that facilitate interactions with extensions, such as Cython
and PyBind, have primitives to control the GIL status. You can also manually
release the GIL and re-acquire it before returning control to the Python
interpreter by using the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS
macros.
For more information, see Thread State and the Global Interpreter Lock
in the Python documentation.
Also note that in Python pipelines, in the default configuration, Dataflow
assumes that each Python process running on the workers efficiently uses one
vCPU core. If the pipeline code bypasses the GIL limitations,
such as by using libraries that are implemented in C++, processing
elements might use resources from more than one vCPU core, and the
workers might not get enough CPU resources. To work around this issue,
reduce the number of threads
on the workers.
Streaming Engine errors
The following sections contain common Streaming Engine errors that you might
encounter and steps for resolving or troubleshooting the errors.
BigQuery connector errors
The following sections contain common BigQuery connector errors that
you might encounter and steps for resolving or troubleshooting the errors.
quotaExceeded
When using the BigQuery connector to write to BigQuery using
streaming inserts, write throughput is lower than expected, and the following
error might occur:
quotaExceeded
Slow throughput might be due to your pipeline exceeding the available
BigQuery streaming insert quota. If this is the case, quota-related error
messages from BigQuery appear in the Dataflow worker logs (look
for quotaExceeded errors).
If you see quotaExceeded errors, to resolve this issue:
- When using the Apache Beam SDK for Java, set the BigQuery sink option
ignoreInsertIds().
- When using the Apache Beam SDK for Python, use the ignore_insert_ids
option.
These settings make you eligible for one GB/s of per-project
BigQuery streaming insert throughput. For more information on caveats related
to automatic message de-duplication, see the BigQuery documentation.
To increase the BigQuery streaming insert quota above one GB/s,
submit a request through the Google Cloud console.
If you do not see quota-related errors in the worker logs, the issue might be
that default bundling or batching related parameters do not provide adequate
parallelism for your pipeline to scale. You can adjust several
Dataflow BigQuery connector related configurations to achieve the
expected performance when writing to BigQuery using streaming
inserts. For example, for the Apache Beam SDK for Java, adjust
numStreamingKeys to match the maximum number of workers, and consider
increasing insertBundleParallelism to configure the BigQuery connector
to write to BigQuery using more parallel threads.
For configurations available in the Apache Beam SDK for Java, see
BigQueryPipelineOptions, and for configurations available in the Apache Beam
SDK for Python, see the WriteToBigQuery transform.
rateLimitExceeded
When using the BigQuery connector, the following error occurs:
rateLimitExceeded
This error occurs if too many BigQuery API requests are sent during
a short duration. BigQuery has short-term quota limits that apply in
this case. It’s possible for your Dataflow pipeline to temporarily
exceed such a quota. Whenever this happens, API requests from your
Dataflow pipeline to BigQuery might fail, which could
result in rateLimitExceeded errors in worker logs.
Dataflow retries such failures, so you can safely ignore these
errors. If you believe that your pipeline is significantly impacted by
rateLimitExceeded errors, contact Google Cloud Support.
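The retry behavior described above resembles standard exponential backoff. The sketch below is illustrative only, not Dataflow's actual implementation; the function names and the use of RuntimeError as a stand-in for a rateLimitExceeded response are assumptions:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.01):
    """Call fn(), doubling the delay after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for a rateLimitExceeded API response
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

attempts = {"count": 0}
def flaky_insert():
    """Fails twice with a simulated rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rateLimitExceeded")
    return "ok"

print(call_with_backoff(flaky_insert))  # -> ok
```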
Recommendations
For guidance on recommendations generated by Dataflow Insights,
see Insights.
When creating the kubernetes-dashboard pod, the pod appears to be created
successfully, but when viewing the status of the pod it is not found, and the
pod is not running on any node. After troubleshooting, it was found that the
pod-infrastructure image download failed, causing the pod startup to fail.
Pod-infrastructure image download configuration
Open the /etc/kubernetes/kubelet configuration file:
vim /etc/kubernetes/kubelet
###
# kubernetes kubelet (minion) config
# The address for the info server to serve on (set to 0.0.0.0 for all interfaces)
KUBELET_ADDRESS="--address=0.0.0.0"
# The port for the info server to serve on
# KUBELET_PORT="--port=10250"
# You may leave this blank to use the actual hostname
KUBELET_HOSTNAME="--hostname-override=10.0.11.150"
# location of the api-server
KUBELET_API_SERVER="--api-servers=http://10.0.11.150:8080"
# pod infrastructure container image
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=10.0.11.150:5000/rhel7/pod-infrastructure:v1.0.0"
# Add your own!
KUBELET_ARGS=""
The configuration items of the kubelet configuration file are shown above.
The KUBELET_POD_INFRA_CONTAINER configuration item specifies the address from
which the pod-infrastructure image is downloaded. The original configuration
was:
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=registry.access.redhat.com/rhel7/pod-infrastructure:latest"
Testing showed that this pod-infrastructure image could not be downloaded
from inside China.
The solution
First, configure the local Docker daemon to use an image-acceleration mirror.
Then find a downloadable image by running docker search pod-infrastructure:
docker search pod-infrastructure
INDEX      NAME                                     DESCRIPTION                                     STARS   OFFICIAL   AUTOMATED
docker.io  docker.io/openshift/origin-pod           The pod infrastructure image for openshift 3    5
docker.io  docker.io/infrastructureascode/aws-cli   Containerized AWS CLI on alpine to avoid r...   3                  [OK]
docker.io  docker.io/newrelic/infrastructure        Public image for New Relic Infrastructure.      3
docker.io  docker.io/infrastructureascode/uwsgi     uwsgi application server                        2                  [OK]
docker.io  docker.io/manageiq/manageiq-pods         OpenShift based images for manageiq.            2                  [OK]
docker.io  docker.io/podigg/podigg-lc-hobbit        A hobbit dataset generator wrapper for PoDiGG   1                  [OK]
docker.io  docker.io/tianyebj/pod-infrastructure    registry.access.redhat.com/rhel7/pod-infra...   1
docker.io  docker.io/w564791/pod-infrastructure     latest                                          1
After finding the image you want, download it locally with the docker pull
command. Then push the pod-infrastructure image into the local private
registry and modify the KUBELET_POD_INFRA_CONTAINER configuration item in the
kubelet configuration file:
KUBELET_POD_INFRA_CONTAINER="--pod-infra-container-image=10.0.11.150:5000/rhel7/pod-infrastructure:v1.0.0"
Here, 10.0.11.150:5000 is my local private Docker registry,
rhel7/pod-infrastructure is the name of the image saved in the private
registry, and v1.0.0 is the saved version.
Finally, restart the cluster.
Master node restart command:
for SERVICES in kube-apiserver kube-controller-manager kube-scheduler; do
  systemctl restart $SERVICES
done
Node restart command:
systemctl restart kubelet
Note: make this change on both the master node and the worker nodes.
Author: _silent YanQin
Source: CSDN, https://blog.csdn.net/A632189007/article/details/78730903
Copyright statement: this is an original article by the blogger; when
reproducing, please attach a link to the blog.
My kubectl version
Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
I followed Creating Multi-Container Pods
After launching the pod, One container is UP but not other.
kubectl get pods
NAME READY STATUS RESTARTS AGE
redis-django 1/2 CrashLoopBackOff 9 22m
Then I did kubectl describe pod redis-django.
At the bottom I saw the Error syncing pod, skipping error:
31m <invalid> 150 {kubelet 172.25.30.21} spec.containers{frontend} Warning BackOff Back-off restarting failed docker container
25m <invalid> 121 {kubelet 172.25.30.21} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=frontend pod=redis-django_default(9f35ffcd-391e-11e6-b160-0022195df673)"
How can I resolve this error? Any help is appreciated!
Thanks!
OS: Ubuntu 14
UPDATE
Previously I used the below YAML file, which was found at Creating Multi-Container Pods:
apiVersion: v1
kind: Pod
metadata:
name: redis-django
labels:
app: web
spec:
containers:
- name: key-value-store
image: redis
ports:
- containerPort: 6379
- name: frontend
image: django
ports:
- containerPort: 8000
The frontend container was not started. Then I changed the YAML file to two
redis containers with different names and ports, but the result is the same
(getting Error syncing pod, skipping).
Later I changed the YAML file to only one django container. This pod's status
is CrashLoopBackOff with the Error syncing pod, skipping error.
UPDATE-2
I ran tail -f /var/log/upstart/kubelet.log, which shows the same error.
Kubelet is continuously trying to start the container, but it can't!
I0623 12:15:13.943046 445 manager.go:2050] Back-off 5m0s restarting failed container=key-value-store pod=redis-django_default(94683d3c-392e-11e6-b160-0022195df673)
E0623 12:15:13.943100 445 pod_workers.go:138] Error syncing pod 94683d3c-392e-11e6-b160-0022195df673, skipping: failed to "StartContainer" for "key-value-store" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=key-value-store pod=redis-django_default(94683d3c-392e-11e6-b160-0022195df673)"
UPDATE-3
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl describe pod redis-django
Name: redis-django
Namespace: default
Node: 192.168.1.10/192.168.1.10
Start Time: Thu, 23 Jun 2016 22:58:03 -0700
Labels: app=web
Status: Running
IP: 172.16.20.2
Controllers: <none>
Containers:
key-value-store:
Container ID: docker://8dbdd6826c354243964f0306427082223d3da49bf2aaf30e15961ea00362fe42
Image: redis
Image ID: docker://sha256:4465e4bcad80b5b43cef0bace96a8ef0a55c0050be439c1fb0ecd64bc0b8cce4
Port: 6379/TCP
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Running
Started: Thu, 23 Jun 2016 22:58:10 -0700
Ready: True
Restart Count: 0
Environment Variables:
frontend:
Container ID: docker://9c89602739abe7331b3beb3a79e92a7cc42e2a7e40e11618413c8bcfd0afbc16
Image: django
Image ID: docker://sha256:0cb63b45e2b9a8de5763fc9c98b79c38b6217df718238251a21c8c4176fb3d68
Port: 8000/TCP
QoS Tier:
cpu: BestEffort
memory: BestEffort
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jun 2016 22:58:41 -0700
Finished: Thu, 23 Jun 2016 22:58:41 -0700
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jun 2016 22:58:22 -0700
Finished: Thu, 23 Jun 2016 22:58:22 -0700
Ready: False
Restart Count: 2
Environment Variables:
Conditions:
Type Status
Ready False
Volumes:
default-token-0oq7p:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-0oq7p
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
49s 49s 1 {default-scheduler } Normal Scheduled Successfully assigned redis-django to 192.168.1.10
48s 48s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Pulling pulling image "redis"
43s 43s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Pulled Successfully pulled image "redis"
43s 43s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Created Created container with docker id 8dbdd6826c35
42s 42s 1 {kubelet 192.168.1.10} spec.containers{key-value-store} Normal Started Started container with docker id 8dbdd6826c35
37s 37s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id 3872ceae75d4
37s 37s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id 3872ceae75d4
30s 30s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id d97b99b6780c
30s 30s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id d97b99b6780c
29s 29s 1 {kubelet 192.168.1.10} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 10s restarting failed container=frontend pod=redis-django_default(9d0a966a-39d0-11e6-9027-000c293d51ab)"
42s 16s 3 {kubelet 192.168.1.10} spec.containers{frontend} Normal Pulling pulling image "django"
11s 11s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Started Started container with docker id 9c89602739ab
38s 11s 3 {kubelet 192.168.1.10} spec.containers{frontend} Normal Pulled Successfully pulled image "django"
11s 11s 1 {kubelet 192.168.1.10} spec.containers{frontend} Normal Created Created container with docker id 9c89602739ab
29s 10s 2 {kubelet 192.168.1.10} spec.containers{frontend} Warning BackOff Back-off restarting failed docker container
10s 10s 1 {kubelet 192.168.1.10} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "frontend" with CrashLoopBackOff: "Back-off 20s restarting failed container=frontend pod=redis-django_default(9d0a966a-39d0-11e6-9027-000c293d51ab)"
For the frontend container: not showing any log messages.
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl logs redis-django -p -c frontend
[email protected]:~/kubernetes/cluster/ubuntu/binaries# kubectl logs redis-django -p -c key-value-store
Error from server: previous terminated container "key-value-store" in pod "redis-django" not found
[email protected]:~/kubernetes/cluster/ubuntu/binaries# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8dbdd6826c35 redis "docker-entrypoint.sh" 2 minutes ago Up 2 minutes k8s_key-value-store.f572c2d_redis-django_default_9d0a966a-39d0-11e6-9027-000c293d51ab_11101aea
8995bbf9f4f4 gcr.io/google_containers/pause:2.0 "/pause" 2 minutes ago Up 2 minutes k8s_POD.48e5231f_redis-django_default_9d0a966a-39d0-11e6-9027-000c293d51ab_c00025b0
[email protected]:~/kubernetes/cluster/ubuntu/binaries#