Upstream connect error or disconnect reset before headers reset reason connection failure: what is it?


Here’s what “upstream connect error or disconnect/reset before headers connection failure” means and how to fix it:

If you are an everyday user, and you see this message while browsing the internet, then it simply means that you need to clear your cache and cookies.

If you are a developer and see this message, then you need to check your service routes, destination rules, and/or traffic management with applications.

So if you want to learn all about what this 503 error means exactly and how to fix it, then this article is for you.

Let’s delve deeper into it!

Upstream Connect Error or Disconnect/Reset: Meaning? (Fix)


Upstream connect error or disconnect/reset before headers. reset reason: connection failure.

That’s a very specific, yet unclear error message to see.

What is it trying to tell you?

Let’s start with an overview.

This is a 503 error message.

It’s a generic message that actually applies to a lot of different scenarios, and the fix for it will depend on the specific scenario at hand.

In general, this error is telling you that a proxy could not get a response from the service behind it, and in practice the failure is usually linked to routing services and rules.

That leaves an absolute ton of possibilities, but I’ll take you through the most common sources.

Then, we can talk about troubleshooting and fixing the problem.


That covers the very zoomed-out picture of this error message, but if you’re getting it, then you probably want to get it to go away.

To fix the problem, we have to address the root cause.

That’s the essence of troubleshooting, and it definitely applies here.

There’s a problem when it comes to identifying the cause of this error.

There are basically two instances where you’re going to see this error, and they are completely different.

One place where you’ll run into it is when you’re coding specific functions that relate to network connection management.

I’m going to break down the three most common scenarios that lead to this error in the next few sections.

But, the other common time you see this error is when you’re browsing the internet.

That means that I’m really answering this question for two very different groups of people.

One group is developing or coding networking resources.

The other group is just browsing the internet.

As you might imagine, it’s hard to consolidate all of that into a single, concise answer.

So, I’m going to split this up.

First, I’ll tackle the developer problems.

If you’re just trying to browse the internet and don’t want to get deep into networking and how it works, then skip to the section that is clearly labeled as not for developers and programmers.

That said, if you want to take a peek behind the curtain and learn a little more about networking, I’ll try to keep these explanations as light as possible.

#1 Reconfiguring Service Routes


I mentioned before that this is a 503 error.

One common place you’ll find it is when reconfiguring service routes.

The boiled-down essence here is that it’s easy to order service routing changes and rules such that the system receives requests for subsets before those subsets have been defined.

Naturally, the system doesn’t know what to do in that case, and you get a 503 error.

The key to avoiding this problem with service route reconfiguring is to follow what you might call a “make-before-break” rule.

Essentially, the steps force the system to add the new subset first and only then update the virtual services that route to it, as in the sketch below.
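As an illustration only, not an example from the article itself, here is what a make-before-break change might look like in an Istio mesh (the reviews service and the v1/v2 subsets are placeholders): apply the destination rule that defines the new subset, let it propagate, and only then point the virtual service at it.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews                 # hypothetical service name
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2                    # "make": define the new subset first
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2              # "break": only now route traffic to it

Applying these in the opposite order leaves a window where the route references a subset the proxies don’t know about yet, which is exactly when these 503s appear.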

#2 Setting Destination Rules


Considering the issue above, it should not come as a surprise that you can trigger 503 errors when setting destination rules.

Most commonly, destination rules are the issue if the 503 errors start right after you apply a new rule for a service.

This issue goes hand in hand with the one above.

The problem is still that the destination rule is creating the issue.

The difference is that this isn’t necessarily a problem with receiving requests for subsets before they have been defined.

Virtually any destination rule error can lead to a 503 message.

Since there are so many ways these rules can break down and so many ways the problems can manifest, I’m going to cheat a little.

If you noticed that the problem correlates with new destination rules, then you can follow this guide.

It breaks down the most common destination rule problems and shows you how to overcome them.
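To make that concrete with one well-documented case: under mesh-wide mutual TLS, a destination rule that adds a trafficPolicy but omits the TLS mode causes the sidecar to send plaintext to a server that expects TLS, and requests start failing with this 503 the moment the rule is applied. A hedged sketch (the httpbin host is a placeholder):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin                 # hypothetical service
spec:
  host: httpbin.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
    tls:
      mode: ISTIO_MUTUAL        # omit this under mesh-wide mTLS and the 503s start immediately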

#3 Traffic Management With Applications


The third primary issue is related to conflicts between applications and any proxy sidecar.

In other words, the applications that work with your traffic management rules might not know those rules, and the application can do things that don’t play well with the traffic management system.

That’s pretty vague because, once again, there are a lot of specific possibilities.

The gist is that you’re trying to offload as much error recovery to the applications as you can.

That will minimize these conflicts and resolve most instances of 503 errors.
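One way to keep the proxy and the application from silently fighting over error recovery is to declare the retry behavior explicitly in the routing configuration, so it is clear which layer retries what. Here is a hedged sketch of such a retry policy in an Istio VirtualService (the service name and the numbers are placeholders, not recommendations):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service              # hypothetical service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream,5xx
    timeout: 10s                # keep attempts * perTryTimeout under this total

The total retry budget (attempts * perTryTimeout) should stay below the route timeout, a point the troubleshooting write-up later in this article makes as well.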


Considering the detailed problems we just covered, what can you do about the 503 error?

I included some solutions and linked to even more, but if you’re looking for a general guide, then here’s another way to think about the whole thing.

This specific message is telling you that the connection to the upstream service either failed outright or was reset before any response headers made it back.

Somewhere in your system, you have conflicting rules that are trying to do things out of order.

The best way to find the specific area is to focus on rules changes as they relate to traffic management.

Essentially, start with what you touched most recently, and work your way backward from there.
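If the mesh in question is Istio, two commands make “start with what you touched most recently” practical (a sketch; flag spellings can differ between istioctl versions):

# Surface known misconfigurations across all namespaces:
istioctl analyze --all-namespaces

# List routing resources sorted by creation time, so recently added rules stand out:
kubectl get virtualservices,destinationrules --all-namespaces \
  --sort-by=.metadata.creationTimestamp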

Ok, but What if I’m Not a Developer or Programmer? (3 Steps)


Alright. That was a relatively deep walk-through of connection rules and routing for developers.

If you’re still with me, that’s great.

We’re going to switch gears and look at this from a simple user perspective.

You don’t need to know any coding to run into this problem, and I’m going to show you how to solve it without any coding either.

It’s actually pretty simple.

#1 The Walmart Bug


The fix itself is simple, but it makes more sense when you know more about what went wrong.

So, I’m going to cite one of the most prominent examples of everyday 503 errors.

In 2020, Walmart’s website ran into widespread issues.

Users could browse the site just fine, but when they tried to go to a specific product page to make a purchase, they got the 503 error.

It popped up word for word as I mentioned before: upstream connect error or disconnect/reset before headers. reset reason: connection failure.

People were just trying to buy some stuff, and they got hit with this crazy message.

What are you supposed to do with it?

#2 An Easy Fix


Well, the message is actually giving you very specific advice, once you know how to read it.

It’s telling you that your computer and the Walmart servers had a connection failure, and when they tried to automatically fix that connection problem, things broke down.

A quick note: I’m using the famous Walmart bug as an example, but the problems and solutions discussed here will work any time you see this message while browsing the web.

What that means is that some piece of information tied to your connection to the Walmart site is messing up the automatic reconnect protocols.

While that might sound a little vague and mysterious, it actually tells us exactly where the problem lies.

The only information that could exist in this space would have to be stored in your browser’s cache.

This is related to your cookies.

Basically, when the error first occurred, your browser remembered the bad state, and so it just kept repeating the same broken exchange over and over again.

The solution requires you to make your computer forget the bad rule that it’s following.

To do that, you simply need to clear your cache and cookies.

#3 Clearing the Cache


The famous Walmart problem plagued Chrome users, so I’ll walk you through how to do this on Google Chrome.

If you use a different browser, you can just look up how to clear cache and cookies.

Before we go through the steps, let me explain what is going to happen here.

We’re not deleting anything that is particularly important.

Your internet cache is just storing information related to the websites you visit.

Then, if you go back to that website or reload it, the stored information means that your computer doesn’t actually have to download as much information, and everything can load a little faster and easier.

So, when you delete this cache, it’s going to do a few things.

It’s going to slow down your first visit to any site that no longer has cached files.

But after you visit a site, it will build new cache files, and things will work normally.

This is also going to make your computer forget your sign-in information for any sites that require it.

Sticking with Walmart as an example, if you were signed into the website with your account, then after you clear the cache, you’re going to be automatically signed out again.

Make sure you know your passwords and usernames.

Because of this last issue, some people don’t like to clear their cache.

If you’re worried about that, then you don’t have to clear everything.

Just clear the cache back through the day when the error started.

Ok. With all of that covered, let’s go through the steps: 

  • Look for the three dots and click on them (this opens the tools menu).
  • Choose “History” from the list.
  • Click on “Clear browsing data.”
  • Choose the time range that covers the data you want to clear.
  • Look at the checkboxes. You can choose browsing history, cookies, and cached images and files.
  • To be sure you resolve the 503 error, check the boxes for cookies and cached files.
  • Click on “Clear data,” and you’re done.

Contents

  1. upstream connect error or disconnect/reset before headers #25734
  2. "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error for .NET Core apps run in docker-compose #15727

upstream connect error or disconnect/reset before headers #25734

Bug description
Large requests over HTTP frequently fail with the error upstream connect error or disconnect/reset before headers. reset reason: connection termination. With the bookinfo application but no sidecar, sending a 3 MB file fails roughly 3% of the time. With the sidecar proxy enabled, sending the same 3 MB file fails roughly 10% of the time.

The detailed output from curl on a failed request is:

Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ X ] Networking
[ X ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Expected behavior
The expected behavior is that I should be able to send a file to an API a thousand times in a row with zero errors. When I run the API without Istio’s functionality, I can do that.

Steps to reproduce the bug
The easiest way to reproduce the bug is using the standard "bookinfo" application.

  1. Run istioctl install --set profile=demo
  2. Run kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml; kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml
  3. Repeatedly run curl -F 'foo=@/path/to/large/file' $/productpage, where you pass in some large file of a few MBs. Some will succeed and some will fail. (See the NOTE below.)

I ran the above experiment 1,000 times WITHOUT sidecar injection enabled. In that experiment, 29 of the 1,000 requests failed to complete and returned the upstream connect error.

I ran the experiment 1,000 times WITH sidecar injection enabled. Interestingly, the error rate INCREASED with the proxy enabled: 96 of 1,000 requests failed to go through; the other 904 returned the expected response (in this case, a 405).

NOTE: A "successful" request here should return a 405 response, as we are POSTing to a GET-only endpoint. A failure is when we get the upstream connection error. I know it's not proper to test this way, but it's the easiest way to replicate. Just pretend for a minute that a 405 is like a 200, and trust (or verify) that you can replicate the same behavior with a POST endpoint, but you'll have to deploy a different container.
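If you want to reproduce the reporter's failure counts, a loop like the following works (a sketch: GATEWAY_URL is assumed to point at the ingress gateway, standing in for the elided variable in step 3, and the file path is a placeholder):

#!/bin/bash
# Send the upload 1,000 times and count upstream connect errors (503s).
fail=0
for _ in $(seq 1 1000); do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -F 'foo=@/path/to/large/file' "$GATEWAY_URL/productpage")
  # Per the NOTE above, a 405 counts as success here;
  # a 503 is the upstream connect error being measured.
  if [ "$code" = "503" ]; then
    fail=$((fail + 1))
  fi
done
echo "failed $fail out of 1000 requests"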

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

How was Istio installed?
Istio was installed as per documentation: https://istio.io/latest/docs/setup/getting-started/

Environment where bug was observed (cloud vendor, OS, etc)
Docker Desktop on macOS




"upstream connect error or disconnect/reset before headers. reset reason: connection failure" error for .NET Core apps run in docker-compose #15727

Description:
Hello, I have 2 .NET Core apps (a Razor Pages web app and a gRPC service) running in docker-compose. Both are running on different localhost ports. If I access them via localhost, like:

  • http://localhost:5105/ or http://127.0.0.1:5105 for the web app,
  • http://localhost:5104/ or http://127.0.0.1:5104 for the gRPC service,
    both are working. But when I added the Envoy configuration (listener and clusters) and try to access them via:
  • http://localhost:8080/imageslibs
  • http://localhost:8080/imagesservice

Envoy returns the error upstream connect error or disconnect/reset before headers. reset reason: connection failure for both apps.
The docker-compose.yml:
version: '3.4'

Config:
Envoy’s dockerfile:

front-envoy_1 | [2021-03-28 16:47:54.444][14][debug][http] [source/common/http/conn_manager_impl.cc:255] [C6] new stream
front-envoy_1 | [2021-03-28 16:47:54.445][14][debug][http] [source/common/http/conn_manager_impl.cc:883] [C6][S14144009116599918894] request headers complete (end_stream=true):
front-envoy_1 | ':authority', 'localhost:8080'
front-envoy_1 | ':path', '/imageslibs'
front-envoy_1 | ':method', 'GET'
front-envoy_1 | 'connection', 'keep-alive'
front-envoy_1 | 'cache-control', 'max-age=0'
front-envoy_1 | 'sec-ch-ua', '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"'
front-envoy_1 | 'sec-ch-ua-mobile', '?0'
front-envoy_1 | 'upgrade-insecure-requests', '1'
front-envoy_1 | 'user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
front-envoy_1 | 'accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
front-envoy_1 | 'sec-fetch-site', 'none'
front-envoy_1 | 'sec-fetch-mode', 'navigate'
front-envoy_1 | 'sec-fetch-user', '?1'
front-envoy_1 | 'sec-fetch-dest', 'document'
front-envoy_1 | 'accept-encoding', 'gzip, deflate, br'
front-envoy_1 | 'accept-language', 'en-US,en;q=0.9'
front-envoy_1 | 'cookie', 'idsrv.session=NlW8VRtzuNJguQYDdVVpIA; .AspNetCore.Cookie=CfDJ8BR22IBZi6xAvAD2wBqZBlG2IUeWsw7hHPiNq4LrY2HBNRWyhGZ2gZuzRIbMi9MLO7IDORqkSIvDTuZDsLDz6RYtLccXi9x2CwlSzHS169Pgs3hs6biCcFKuriLkWZ4lpWHv4OCqZdO4lGgWmdzcrf2ctQbQOA-xPS7O7NSoQ0-a8VGjjthlIolqaxh5gYLtvvdjSI043UZWVOCb_ZDnFNiD4H_WKAtpKmdENFk_4NbSZmmQ3Indj2ty72kNNUUv8OLEswzxI5dBGA9AYI7i-lzMjbl8GjXNhplHR5j7XJTgG7i9dsF2antRfonV_IpL4sabtmLhdti-ZaumXhPewS702E_1BKo-8ELV3LOMfiE_jdkKJTPR15sCSWkSo0-nllUoQczL7de0F8KMolWK8KoB13z8E388w2juHXnmiDYQIAn3MWzKUvhH_bhgK_ZBCEExWvDqgGRRBroI90Nvg6IAwc_-PoJcPE1HE2i6ouzdkNXoBRg6IQWmelHAtDb8uI2CYzYeBu3zYrnJq28vOhAx_Qpr_y7A0GenqHyJO5cw; .AspNetCore.Antiforgery.9TtSrW0hzOs=CfDJ8Do6rlT2pe5IndjlZXmKm7GvuVL61tmcxXKqGH7eWnem071yNAndO5zwY5WDwxxHjY8CnoRIsalbkPMWIIq_ZFysZ-fkQJJdPm78T8dCxUe5DGeKiJqu5GjjEldMAkcnvmYjNYO9Ht13ldBWwzbBUqs'
front-envoy_1 |
front-envoy_1 | [2021-03-28 16:47:54.445][14][debug][http] [source/common/http/filter_manager.cc:774] [C6][S14144009116599918894] request end stream
front-envoy_1 | [2021-03-28 16:47:54.445][14][debug][router] [source/common/router/router.cc:426] [C6][S14144009116599918894] cluster 'imageslibs' match for URL '/imageslibs'
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][router] [source/common/router/router.cc:583] [C6][S14144009116599918894] router decoding headers:
front-envoy_1 | ':authority', 'localhost:8080'
front-envoy_1 | ':path', '/imageslibs'
front-envoy_1 | ':method', 'GET'
front-envoy_1 | ':scheme', 'http'
front-envoy_1 | 'cache-control', 'max-age=0'
front-envoy_1 | 'sec-ch-ua', '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"'
front-envoy_1 | 'sec-ch-ua-mobile', '?0'
front-envoy_1 | 'upgrade-insecure-requests', '1'
front-envoy_1 | 'user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
front-envoy_1 | 'accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
front-envoy_1 | 'sec-fetch-site', 'none'
front-envoy_1 | 'sec-fetch-mode', 'navigate'
front-envoy_1 | 'sec-fetch-user', '?1'
front-envoy_1 | 'sec-fetch-dest', 'document'
front-envoy_1 | 'accept-encoding', 'gzip, deflate, br'
front-envoy_1 | 'accept-language', 'en-US,en;q=0.9'
front-envoy_1 | 'cookie', 'idsrv.session=NlW8VRtzuNJguQYDdVVpIA; .AspNetCore.Cookie=CfDJ8BR22IBZi6xAvAD2wBqZBlG2IUeWsw7hHPiNq4LrY2HBNRWyhGZ2gZuzRIbMi9MLO7IDORqkSIvDTuZDsLDz6RYtLccXi9x2CwlSzHS169Pgs3hs6biCcFKuriLkWZ4lpWHv4OCqZdO4lGgWmdzcrf2ctQbQOA-xPS7O7NSoQ0-a8VGjjthlIolqaxh5gYLtvvdjSI043UZWVOCb_ZDnFNiD4H_WKAtpKmdENFk_4NbSZmmQ3Indj2ty72kNNUUv8OLEswzxI5dBGA9AYI7i-lzMjbl8GjXNhplHR5j7XJTgG7i9dsF2antRfonV_IpL4sabtmLhdti-ZaumXhPewS702E_1BKo-8ELV3LOMfiE_jdkKJTPR15sCSWkSo0-nllUoQczL7de0F8KMolWK8KoB13z8E388w2juHXnmiDYQIAn3MWzKUvhH_bhgK_ZBCEExWvDqgGRRBroI90Nvg6IAwc_-PoJcPE1HE2i6ouzdkNXoBRg6IQWmelHAtDb8uI2CYzYeBu3zYrnJq28vOhAx_Qpr_y7A0GenqHyJO5cw; .AspNetCore.Antiforgery.9TtSrW0hzOs=CfDJ8Do6rlT2pe5IndjlZXmKm7GvuVL61tmcxXKqGH7eWnem071yNAndO5zwY5WDwxxHjY8CnoRIsalbkPMWIIq_ZFysZ-fkQJJdPm78T8dCxUe5DGeKiJqu5GjjEldMAkcnvmYjNYO9Ht13ldBWwzbBUqs'
front-envoy_1 | 'x-forwarded-proto', 'http'
front-envoy_1 | 'x-request-id', '6def488d-7020-4a79-acee-d1bd5a9f7252'
front-envoy_1 | 'x-envoy-expected-rq-timeout-ms', '15000'
front-envoy_1 |
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][pool] [source/common/http/conn_pool_base.cc:79] queueing stream due to no available connections
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:229] trying to create new connection
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:132] creating a new connection
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][client] [source/common/http/codec_client.cc:41] [C8] connecting
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][connection] [source/common/network/connection_impl.cc:861] [C8] connecting to 127.0.0.1:5105
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][connection] [source/common/network/connection_impl.cc:880] [C8] connection in progress
front-envoy_1 | [2021-03-28 16:47:54.446][14][debug][connection] [source/common/network/connection_impl.cc:671] [C8] delayed connection error: 111
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][connection] [source/common/network/connection_impl.cc:243] [C8] closing socket: 0
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][client] [source/common/http/codec_client.cc:101] [C8] disconnect. resetting 0 pending requests
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][pool] [source/common/conn_pool/conn_pool_base.cc:380] [C8] client disconnected, failure reason:
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][router] [source/common/router/router.cc:1040] [C6][S14144009116599918894] upstream reset: reset reason: connection failure, transport failure reason:
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][http] [source/common/http/filter_manager.cc:858] [C6][S14144009116599918894] Sending local reply with details upstream_reset_before_response_started
front-envoy_1 | [2021-03-28 16:47:54.447][14][debug][http] [source/common/http/conn_manager_impl.cc:1454] [C6][S14144009116599918894] encoding headers via codec (end_stream=false):
front-envoy_1 | ':status', '503'
front-envoy_1 | 'content-length', '91'
front-envoy_1 | 'content-type', 'text/plain'
front-envoy_1 | 'date', 'Sun, 28 Mar 2021 16:47:54 GMT'
front-envoy_1 | 'server', 'envoy'

Here is the localhost:9999/clusters output:

imageslibs::default_priority::max_connections::1024
imageslibs::default_priority::max_pending_requests::1024
imageslibs::default_priority::max_requests::1024
imageslibs::default_priority::max_retries::3
imageslibs::high_priority::max_connections::1024
imageslibs::high_priority::max_pending_requests::1024
imageslibs::high_priority::max_requests::1024
imageslibs::high_priority::max_retries::3
imageslibs::added_via_api::false
imageslibs::127.0.0.1:5105::cx_active::0
imageslibs::127.0.0.1:5105::cx_connect_fail::2
imageslibs::127.0.0.1:5105::cx_total::2
imageslibs::127.0.0.1:5105::rq_active::0
imageslibs::127.0.0.1:5105::rq_error::2
imageslibs::127.0.0.1:5105::rq_success::0
imageslibs::127.0.0.1:5105::rq_timeout::0
imageslibs::127.0.0.1:5105::rq_total::0
imageslibs::127.0.0.1:5105::hostname::127.0.0.1
imageslibs::127.0.0.1:5105::health_flags::healthy
imageslibs::127.0.0.1:5105::weight::1
imageslibs::127.0.0.1:5105::region::
imageslibs::127.0.0.1:5105::zone::
imageslibs::127.0.0.1:5105::sub_zone::
imageslibs::127.0.0.1:5105::canary::false
imageslibs::127.0.0.1:5105::priority::0
imageslibs::127.0.0.1:5105::success_rate::-1.0
imageslibs::127.0.0.1:5105::local_origin_success_rate::-1.0
secure_imageslibs::default_priority::max_connections::1024
secure_imageslibs::default_priority::max_pending_requests::1024
secure_imageslibs::default_priority::max_requests::1024
secure_imageslibs::default_priority::max_retries::3
secure_imageslibs::high_priority::max_connections::1024
secure_imageslibs::high_priority::max_pending_requests::1024
secure_imageslibs::high_priority::max_requests::1024
secure_imageslibs::high_priority::max_retries::3
secure_imageslibs::added_via_api::false
secure_imageslibs::127.0.0.1:9105::cx_active::0
secure_imageslibs::127.0.0.1:9105::cx_connect_fail::0
secure_imageslibs::127.0.0.1:9105::cx_total::0
secure_imageslibs::127.0.0.1:9105::rq_active::0
secure_imageslibs::127.0.0.1:9105::rq_error::0
secure_imageslibs::127.0.0.1:9105::rq_success::0
secure_imageslibs::127.0.0.1:9105::rq_timeout::0
secure_imageslibs::127.0.0.1:9105::rq_total::0
secure_imageslibs::127.0.0.1:9105::hostname::127.0.0.1
secure_imageslibs::127.0.0.1:9105::health_flags::healthy
secure_imageslibs::127.0.0.1:9105::weight::1
secure_imageslibs::127.0.0.1:9105::region::
secure_imageslibs::127.0.0.1:9105::zone::
secure_imageslibs::127.0.0.1:9105::sub_zone::
secure_imageslibs::127.0.0.1:9105::canary::false
secure_imageslibs::127.0.0.1:9105::priority::0
secure_imageslibs::127.0.0.1:9105::success_rate::-1.0
secure_imageslibs::127.0.0.1:9105::local_origin_success_rate::-1.0
imagesservice::default_priority::max_connections::1024
imagesservice::default_priority::max_pending_requests::1024
imagesservice::default_priority::max_requests::1024
imagesservice::default_priority::max_retries::3
imagesservice::high_priority::max_connections::1024
imagesservice::high_priority::max_pending_requests::1024
imagesservice::high_priority::max_requests::1024
imagesservice::high_priority::max_retries::3
imagesservice::added_via_api::false
imagesservice::127.0.0.1:5104::cx_active::0
imagesservice::127.0.0.1:5104::cx_connect_fail::1
imagesservice::127.0.0.1:5104::cx_total::1
imagesservice::127.0.0.1:5104::rq_active::0
imagesservice::127.0.0.1:5104::rq_error::1
imagesservice::127.0.0.1:5104::rq_success::0
imagesservice::127.0.0.1:5104::rq_timeout::0
imagesservice::127.0.0.1:5104::rq_total::0
imagesservice::127.0.0.1:5104::hostname::127.0.0.1
imagesservice::127.0.0.1:5104::health_flags::healthy
imagesservice::127.0.0.1:5104::weight::1
imagesservice::127.0.0.1:5104::region::
imagesservice::127.0.0.1:5104::zone::
imagesservice::127.0.0.1:5104::sub_zone::
imagesservice::127.0.0.1:5104::canary::false
imagesservice::127.0.0.1:5104::priority::0
imagesservice::127.0.0.1:5104::success_rate::-1.0
imagesservice::127.0.0.1:5104::local_origin_success_rate::-1.0
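Reading the log and the stats together points at a likely root cause: Envoy is dialing 127.0.0.1:5105 (the delayed connection error: 111 is ECONNREFUSED, and cx_connect_fail is climbing), but in docker-compose each service runs in its own network namespace, so the app is not on Envoy's localhost. A hedged sketch of a cluster definition that addresses the app by its compose service name instead (webapp is a placeholder for whatever the service is called in docker-compose.yml):

clusters:
- name: imageslibs
  connect_timeout: 5s
  type: STRICT_DNS              # resolve the compose service name via Docker's embedded DNS
  load_assignment:
    cluster_name: imageslibs
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: webapp   # hypothetical compose service name, not 127.0.0.1
              port_value: 5105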



Version

cli version: 0.18.1

Description

Intermittent 503 errors on AWS cluster.

Configuration

cortex.yaml

# cortex.yaml

- name: offer-features
  predictor:
    type: python
    path: predictor.py
    config:
      bucket: XXXXXXXXXXXXXXXXXXXX
  compute:
    cpu: 1  # CPU request per replica, e.g. 200m or 1 (200m is equivalent to 0.2) (default: 200m)
    gpu: 0  # GPU request per replica (default: 0)
    inf: 0 # Inferentia ASIC request per replica (default: 0)
    mem: 1Gi
  autoscaling:
    min_replicas: 2
    max_replicas: 3
    init_replicas: 2
    max_replica_concurrency: 13
    target_replica_concurrency: 5
    window: 1m0s
    downscale_stabilization_period: 5m0s
    upscale_stabilization_period: 1m0s
    max_downscale_factor: 0.75
    max_upscale_factor: 1.5
    downscale_tolerance: 0.05
    upscale_tolerance: 0.05
# cluster.yaml

# AWS credentials (if not specified, ~/.aws/credentials will be checked) (can be overridden by $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY)
aws_access_key_id: XXXXXXXXXXXXXX
aws_secret_access_key: XXXXXXXXXXXXXXXXX

# optional AWS credentials for the operator which may be used to restrict its AWS access (defaults to the AWS credentials set above)
cortex_aws_access_key_id: XXXXXXXXXXXXXXXX
cortex_aws_secret_access_key: XXXXXXXXXXXXXXXXXXXXX

# EKS cluster name for cortex (default: cortex)
cluster_name: cortex

# AWS region
region: us-east-1

# S3 bucket (default: <cluster_name>-<RANDOM_ID>)
# note: your cortex cluster uses this bucket for metadata storage, and it should not be accessed directly (a separate bucket should be used for your models)
bucket: # cortex-<RANDOM_ID>

# list of availability zones for your region (default: 3 random availability zones from the specified region)
availability_zones: # e.g. [us-east-1a, us-east-1b, us-east-1c]

# instance type
instance_type: t3.medium

# minimum number of instances (must be >= 0)
min_instances: 1

# maximum number of instances (must be >= 1)
max_instances: 5

# disk storage size per instance (GB) (default: 50)
instance_volume_size: 50

# instance volume type [gp2, io1, st1, sc1] (default: gp2)
instance_volume_type: gp2

# instance volume iops (only applicable to io1 storage type) (default: 3000)
# instance_volume_iops: 3000

# whether the subnets used for EC2 instances should be public or private (default: "public")
# if "public", instances will be assigned public IP addresses; if "private", instances won't have public IPs and a NAT gateway will be created to allow outgoing network requests
# see https://docs.cortex.dev/v/0.18/miscellaneous/security#private-cluster for more information
subnet_visibility: public  # must be "public" or "private"

# whether to include a NAT gateway with the cluster (a NAT gateway is necessary when using private subnets)
# default value is "none" if subnet_visibility is set to "public"; "single" if subnet_visibility is "private"
nat_gateway: none  # must be "none", "single", or "highly_available" (highly_available means one NAT gateway per availability zone)

# whether the API load balancer should be internet-facing or internal (default: "internet-facing")
# note: if using "internal", APIs will still be accessible via the public API Gateway endpoint unless you also disable API Gateway in your API's configuration (if you do that, you must configure VPC Peering to connect to your APIs)
# see https://docs.cortex.dev/v/0.18/miscellaneous/security#private-cluster for more information
api_load_balancer_scheme: internet-facing  # must be "internet-facing" or "internal"

# whether the operator load balancer should be internet-facing or internal (default: "internet-facing")
# note: if using "internal", you must configure VPC Peering to connect your CLI to your cluster operator (https://docs.cortex.dev/v/0.18/guides/vpc-peering)
# see https://docs.cortex.dev/v/0.18/miscellaneous/security#private-cluster for more information
operator_load_balancer_scheme: internet-facing  # must be "internet-facing" or "internal"

# CloudWatch log group for cortex (default: <cluster_name>)
log_group: cortex

# additional tags to assign to aws resources for labelling and cost allocation (by default, all resources will be tagged with cortex.dev/cluster-name=<cluster_name>)
tags:  # <string>: <string> map of key/value pairs

# whether to use spot instances in the cluster (default: false)
# see https://docs.cortex.dev/v/0.18/cluster-management/spot-instances for additional details on spot configuration
spot: false

# see https://docs.cortex.dev/v/0.18/guides/custom-domain for instructions on how to set up a custom domain
ssl_certificate_arn: XXXXXXXXXXXXXXXXXXXXXXXXXXXX

Steps to reproduce

  • Spin up instances on AWS.
  • Wait a couple of days / hours (varies).
  • Notice sudden 503 errors.

Expected behavior

It should work

Actual behavior

503 errors with the message

upstream connect error or disconnect/reset before headers. reset reason: connection failure

Screenshots

NOTE: The endpoint stopped responding around 15:30 in the graphs below.

Monitoring number of bytes in: [graph omitted]

Number of requests: [graph omitted]

Stack traces

Nothing useful, just:

2020-08-16 05:38:34.697979:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:37.643022:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:40.577522:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:42.008412:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:43.513294:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:45.425255:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:48.327276:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:38:51.316962:cortex:pid-447:INFO:200 OK POST /predict
2020-08-16 05:38:54.009212:cortex:pid-447:INFO:200 OK POST /predict
2020-08-16 05:38:55.852878:cortex:pid-447:INFO:200 OK POST /predict
2020-08-16 05:38:57.525264:cortex:pid-447:INFO:200 OK POST /predict
2020-08-16 05:39:00.795236:cortex:pid-447:INFO:200 OK POST /predict
2020-08-16 05:39:04.437013:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:05.981920:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:09.314293:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:12.343143:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:15.821708:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:19.083554:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:22.048843:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:24.943968:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:26.613330:cortex:pid-448:INFO:200 OK POST /predict
2020-08-16 05:39:29.702703:cortex:pid-448:INFO:200 OK POST /predict

Additional context

  • The prediction takes about ~150 ms on my Dell with an Intel Core i7-8750H CPU @ 2.20GHz × 6 and 32 GB RAM.
  • All the load balancer targets are marked as "unhealthy", even when they work (i.e., I can send requests and receive 2XX responses).
  • The load balancer health check endpoint (/healthz) returns the following:
{
        "service": {
                "namespace": "istio-system",
                "name": "ingressgateway-operator"
        },
        "localEndpoints": 0
}
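The localEndpoints: 0 in that response is the telling detail: the gateway service believes nothing healthy is backing it. A quick way to confirm from the cluster side (a sketch):

# An empty ENDPOINTS column means the gateway pods are failing readiness,
# which is what the load balancer reports as "unhealthy":
kubectl get endpoints -n istio-system
kubectl get pods -n istio-system -o wide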


I’m having a problem migrating my pure Kubernetes app to an Istio-managed one. I’m using Google Cloud Platform (GCP), Istio 1.4, Google Kubernetes Engine (GKE), Spring Boot, and Java 11.

I had the containers running in a pure GKE environment without a problem. Then I started migrating my Kubernetes cluster to Istio. Since then, I get the following message when I try to access the exposed service:

upstream connect error or disconnect/reset before headers. reset reason: connection failure

This error message looks really generic. I found a lot of different problems with the same error message, but none of them was related to my problem.

Below is the Istio version:

client version: 1.4.10
control plane version: 1.4.10-gke.5
data plane version: 1.4.10-gke.5 (2 proxies)

Below are my yaml files:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    account: tree-guest
  name: tree-guest-service-account
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tree-guest
    service: tree-guest
  name: tree-guest
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  selector:
    app: tree-guest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tree-guest
    version: v1
  name: tree-guest-v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tree-guest
      version: v1
  template:
    metadata:
      labels:
        app: tree-guestaz
        version: v1
    spec:
      containers:
      - image: registry.hub.docker.com/victorsens/tree-quest:circle_ci_build_00923285-3c44-4955-8de1-ed578e23c5cf
        imagePullPolicy: IfNotPresent
        name: tree-guest
        ports:
        - containerPort: 8080
      serviceAccount: tree-guest-service-account
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: tree-guest-gateway
spec:
  selector:
    istio: ingressgateway # use istio default controller
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: tree-guest-virtual-service
spec:
  hosts:
    - "*"
  gateways:
    - tree-guest-gateway
  http:
    - match:
        - uri:
            prefix: /v1
      route:
        - destination:
            host: tree-guest
            port:
              number: 8080

To apply the yaml file I used the following command:

kubectl apply -f <(istioctl kube-inject -f ./tree-guest.yaml)

Below is the proxy sync status after deploying the application:

istio-ingressgateway-6674cc989b-vwzqg.istio-system SYNCED SYNCED SYNCED SYNCED istio-pilot-ff4489db8-2hx5f 1.4.10-gke.5
tree-guest-v1-774bf84ddd-jkhsh.default SYNCED SYNCED SYNCED SYNCED istio-pilot-ff4489db8-2hx5f 1.4.10-gke.5

If someone has a tip about what is going wrong, please let me know. I’ve been stuck on this problem for a couple of days.

Thanks.

Problem description

A K8s (v1.13.5) + Istio (v1.1.7) environment was set up for testing, and one day more than 30 services (front end, back end, gateway) were deployed into the Istio cluster, with the related Istio routing rules configured. Later, fully confident, I tested the routing between services simply by clicking through the external pages, which call the gateway, which in turn calls the other internal services (web front end -> gateway -> back-end service). In actual testing, though, the gateway kept reporting HTTP response code 503 from the internal services, and the gateway itself would also report 503 from time to time, with no apparent pattern to when the errors occurred, which left me confused...

Related issues

The first thing that comes to mind is to look for related issues on GitHub -> istio. For the specific issues, see the following links:

503 «upstream connect error or disconnect/reset before headers» in 1.1 with low traffic

Sporadic 503 errors

Almost every app gets UC errors, 0.012% of all requests in 24h period

There is a lot of discussion about 503 in these issues. Istio introduced the concept of a sidecar (Envoy). Simply put, a sidecar is a local network proxy sitting in front of each application in the service mesh (corresponding to a Pod in K8s, which contains several containers: istio-proxy and the app, which can communicate over localhost). In Istio the sidecar component is implemented by extending Envoy. The sidecar brings convenience (routing, circuit breaking, connection pool configuration, etc.), but at the same time it complicates calls between services. The original simple call Application1 -> Application2 becomes Application1 -> Envoy1 -> Envoy2 -> Application2 in Istio, as shown below:

In essence, any problem in the communication between Envoy2 and Application2 is wrapped up as a 503, sent back to Envoy1, and finally returned to Application1.

Re-reading the issues revealed that the 503 problem usually mentioned there comes from the connection pool in Envoy2 caching connections to Application2 that have become invalid. Envoy2 calls Application2 over an invalid connection, which causes a connection reset; Envoy2 then wraps that up as a 503 and returns it to the downstream caller.

The typical signature of this 503 can be seen in the istio-proxy log of the corresponding application. The command to adjust the istio-proxy log level is as follows:

curl -X POST localhost:15000/logging?level=trace

A typical 503 log looks like this:

[2019-06-28 13:02:36.790][37][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:97] [C26] using existing connection
[2019-06-28 13:02:36.790][37][debug][router] [external/envoy/source/common/router/router.cc:1210] [C21][S3699665653477458718] pool ready
[2019-06-28 13:02:36.790][37][debug][connection] [external/envoy/source/common/network/connection_impl.cc:518] [C26] remote close
[2019-06-28 13:02:36.790][37][debug][connection] [external/envoy/source/common/network/connection_impl.cc:188] [C26] closing socket: 0
[2019-06-28 13:02:36.791][37][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C26] disconnect. resetting 1 pending requests
[2019-06-28 13:02:36.791][37][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C26] request reset
[2019-06-28 13:02:36.791][37][debug][router] [external/envoy/source/common/router/router.cc:671] [C21][S3699665653477458718] upstream reset: reset reason connection termination
[2019-06-28 13:02:36.791][37][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1137] [C21][S3699665653477458718] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2019-06-28 13:02:36.791][37][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
[2019-06-28 13:02:36.791][37][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1329] [C21][S3699665653477458718] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 28 Jun 2019 13:02:36 GMT'
'server', 'istio-envoy'

In the log above, upstream reset: reset reason connection termination means that a connection in the Envoy connection pool was terminated;

Basic solution

The following four tuning methods can be used to address the problems above (a config sketch of methods (2) and (3) follows the list):

(1) Change HTTPRetry (attempts, perTryTimeout, retryOn) in the VirtualService to set a retry-on-error policy.
(Note: you also need to set the timeout in Envoy (see the Envoy reference), i.e. the total retry time must be less than the timeout; in Istio, HttpRoute.timeout needs to be set at the same time.)

(2) Change HTTPSettings.idleTimeout in the DestinationRule to set the idle lifetime of cached connections in the Envoy connection pool.

(3) Change HTTPSettings.maxRequestsPerConnection in the DestinationRule to 1 (this disables keep-alive; connections are not reused, and performance drops).

(4) Change the Tomcat connectionTimeout (the Spring Boot server.connectionTimeout setting) to increase the connection timeout of the web container.
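Methods (2) and (3) both live in the DestinationRule connection pool settings; method (1) is the same retries stanza sketched earlier in this article. A hedged sketch (the host name and the values are placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: backend                 # hypothetical service
spec:
  host: backend
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 30s              # (2) evict idle cached connections before the app drops them
        maxRequestsPerConnection: 1   # (3) no connection reuse; safest, but costs performance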

At the same time, you can refer to the following article for 503 troubleshooting methods in Istio:

[English version] Istio: 503's with UC's and TCP Fun Times

[Chinese version] Istio: 503, UC and TCP

Overall, the investigation breaks down into four main methods:

(1) Inspect the Jaeger trace records (setting the tag error=true);

(2) Inspect the metrics (Istio, Envoy);

(3) Inspect the istio-proxy debug logs;

(4) Capture network packets;

I only used methods (1), (3), and (4) in my own troubleshooting.

Jaeger UI

When using method (1), Jaeger, to troubleshoot (you can temporarily set PILOT_TRACE_SAMPLING to 100, i.e. trace everything), pay attention to the following points:

(1) Set the tag error=true in the query conditions to quickly find traces with errors;

(2) Pay attention to the response_flags field in the trace details. It indicates the type of response failure and quickly narrows down the cause;

See the description of response_flags in the Envoy documentation:

istio-proxy logs

For method (3), set the istio-proxy log level to debug (or trace) and focus on the following log content:

(1) The HTTP response code, e.g. "503";

(2) Above the HTTP response code (e.g. 503), find the corresponding log line, such as upstream reset: reset reason connection termination, to locate the failure reason;

(3) Keep looking above that for how the connection was obtained: using existing connection | creating a new connection (an existing connection OR a new connection);

Usually the using existing connection case means a connection cached in the Envoy pool had already become invalid; if a creating a new connection case fails, you need to look for other causes. Below I explain the creating a new connection problem I ran into in practice.

Packet capture

You can use the kubectl ksniff plugin, but I couldn't get it to work in practice (wireshark-gtk failed to start), so I used the plain tcpdump command. The main steps are as follows:

(1) Enter the application container's environment: kubectl exec -it xxx -c app -n tsp /bin/bash;

(2) Run tcpdump and write the result to a file: sudo tcpdump -ni lo port 8080 -vv -w my-packets.pcap;
-i sets the capture interface to lo (loopback), observing only traffic between the local Envoy and the application (Envoy and the application are on the same host and communicate over localhost)
-n shows numeric IPs (no name resolution)
port restricts the capture to port 8080 (the port the application exposes)
-vv shows verbose information
-w writes the result to the file my-packets.pcap

(3) Log in to the pod's worker node and copy the my-packets.pcap file from step (2) to the node via docker cp;

(4) Fetch my-packets.pcap from the node host and inspect it in Wireshark;

Note: the istio-proxy container has a read-only file system and cannot write files, so run tcpdump in the application's own app container;

The source of my problem

After all of the above thrashing, I changed my VirtualService and DestinationRule, but the 503 remained. I also considered whether it was related to host connection limits and network settings (ulimit, tcp_tw_recycle, etc.), and I upgraded Istio (from 1.1.7 to 1.1.11; releases after 1.1.7 contain a fix for a 503 bug), but no matter what, the 503 would not go away;

Stranger still, everyone on GitHub was describing the using existing connection problem, but mine was a creating a new connection problem. My full log looks like this:

[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:92] creating a new connection
[2019-07-16 08:59:23.853][31][debug][client] [external/envoy/source/common/http/codec_client.cc:26] [C297] connecting
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:644] [C297] connecting to 127.0.0.1:8080
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:653] [C297] connection in progress
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:94] Called Mixer::Filter : decodeData (84, false)
[2019-07-16 08:59:23.853][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1040] [C93][S18065063288515590867] request end stream
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:94] Called Mixer::Filter : decodeData (0, true)
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:526] [C297] delayed connection error: 111
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C297] closing socket: 0
[2019-07-16 08:59:23.853][31][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C297] disconnect. resetting 0 pending requests
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:133] [C297] client disconnected, failure reason: 
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:173] [C297] purge pending, failure reason: 
[2019-07-16 08:59:23.853][31][debug][router] [external/envoy/source/common/router/router.cc:644] [C93][S18065063288515590867] upstream reset: reset reason connection failure
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-07-16 08:59:23.853][31][trace][http] [external/envoy/source/common/http/conn_manager_impl.cc:1200] [C93][S18065063288515590867] encode headers called: filter=0x5c79f40 status=0
[2019-07-16 08:59:23.853][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1305] [C93][S18065063288515590867] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Tue, 16 Jul 2019 08:59:23 GMT'
'server', 'istio-envoy'

From the log I found that my problem occurred when Envoy connected to the local application at 127.0.0.1:8080: connection failure. The response_flags in the Jaeger UI was UF (upstream service connection failure), and the failure was intermittent, sometimes succeeding and sometimes failing;

On a clear Friday morning (after almost a week of thrashing >_<|||), I noticed the following phenomenon:

Checking my application containers via docker ps | grep app, I saw that every application container had been up for only 6 or 7 minutes;

It looked like the problem had been found. So many containers all running for just 6 or 7 minutes meant the application containers were constantly restarting. The reason the application containers were restarting is that the K8s health check was failing. I went straight to the K8s health check configuration:

The port exposed by the container is containerPort = 8080, while tcpSocket.port in the livenessProbe was set to 80. The two did not match, and given the health check configuration:

300 s initial delay (5 minutes) + first failed probe + (3 - 1) failed retries * 60 s retry interval = 5 minutes + 2 * 1 minute = more than 7 minutes (roughly 7 to 8 minutes)

As a result, the application would be marked as failed after 7 to 8 minutes, so an application container never ran for more than about 8 minutes and kept restarting, and the restart process inevitably made Envoy's connections to the application hit connection failure, which is exactly the intermittent 503 problem. The observation that the front end (a periodic heartbeat) got 503s from the back-end service over a window of time also lined up with the application containers' restart times, further confirming the cause of the connection failures:

A misconfigured health check caused the application containers to restart continuously, and during the restarts Envoy's connections to the application failed with connection failure.

After fixing the livenessProbe in all the deployments (a sketch of the corrected probe follows), the earlier 503 problem disappeared...
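For reference, a sketch of the corrected probe (container details are placeholders); the point is simply that the probe port must match the port the application actually listens on:

livenessProbe:
  tcpSocket:
    port: 8080                  # must match the containerPort; it was mistakenly 80
  initialDelaySeconds: 300      # the 5-minute delay from the calculation above
  periodSeconds: 60             # the 60 s retry interval
  failureThreshold: 3           # the 3 attempts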

Now I can go have fun again this weekend...

Summary

Through my own carelessness, a health check misconfiguration was introduced, which in turn caused the Istio 503 problem. I still don't have a complete grasp of the relevant configuration and need to dig deeper;

Still, by troubleshooting this 503 I got a better understanding of Istio troubleshooting methods, and I'll be able to locate problems faster in the future;

Don't give up easily...
