Upstream connect error or disconnect reset before headers reset reason connection failure: translation - Fixing errors and finding the best solutions to problems

Contents

Русские Блоги
Troubleshooting notes: intermittent 503s in Istio
Problem description
Related issues
Basic solutions
The root cause of my problem
Summary
upstream connect error or disconnect/reset before headers. reset reason: connection termination when using Spring Boot

Русские Блоги

Troubleshooting notes: intermittent 503s in Istio

Problem description

A K8s (v1.13.5) + Istio (v1.1.7) environment was set up for testing. In a single day, more than 30 services (frontend, backend, gateway) were deployed into the Istio mesh, and the corresponding Istio routing rules were configured. I then set out, full of confidence, to verify routing between services by clicking through the frontend pages, which call the gateway, which in turn calls the backend services (web frontend -> gateway -> backend service). During actual testing, however, the gateway kept reporting HTTP 503 responses from the backend services, and the gateway itself also returned 503 from time to time. There was no apparent pattern in when the errors occurred, which left me puzzled.

Related issues

The first thing that came to mind was to look for related issues in the istio repository on GitHub. The relevant issues are:

503 "upstream connect error or disconnect/reset before headers" in 1.1 with low traffic
Sporadic 503 errors
Almost every app gets UC errors, 0.012% of all requests in 24h period

These issues contain a lot of discussion about 503s. Istio introduces the concept of a sidecar (Envoy). Put simply, a sidecar is a local network proxy that sits in front of each application in the service mesh. It corresponds to a Pod in K8s that holds several containers, istio-proxy and app, which can communicate over localhost. In Istio the sidecar is implemented by extending Envoy. The sidecar brings conveniences (routing, circuit breaking, connection pool configuration, and so on), but it also makes calls between services more complex. What used to be a simple Application1 -> Application2 call becomes Application1 -> Envoy1 -> Envoy2 -> Application2 in Istio.

Essentially, any problem in the communication between Envoy2 and Application2 is wrapped into a 503, sent back to Envoy1, and finally returned to Application1.

Reading the issues more closely shows that the 503 usually discussed there arises because the connection pool in Envoy2 caches connections to Application2 that are no longer valid. Envoy2 calls Application2 over such a stale connection, the connection is reset, and Envoy2 wraps the reset into a 503 and returns it to the downstream caller.

The typical signature of this 503 can be seen in the istio-proxy log of the affected application. The command to change the istio-proxy log level is:

curl -X POST localhost:15000/logging?level=trace

A typical 503 log looks like this:

[2019-06-28 13:02:36.790][37][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:97] [C26] using existing connection
[2019-06-28 13:02:36.790][37][debug][router] [external/envoy/source/common/router/router.cc:1210] [C21][S3699665653477458718] pool ready
[2019-06-28 13:02:36.790][37][debug][connection] [external/envoy/source/common/network/connection_impl.cc:518] [C26] remote close
[2019-06-28 13:02:36.790][37][debug][connection] [external/envoy/source/common/network/connection_impl.cc:188] [C26] closing socket: 0
[2019-06-28 13:02:36.791][37][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C26] disconnect. resetting 1 pending requests
[2019-06-28 13:02:36.791][37][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C26] request reset
[2019-06-28 13:02:36.791][37][debug][router] [external/envoy/source/common/router/router.cc:671] [C21][S3699665653477458718] upstream reset: reset reason connection termination
[2019-06-28 13:02:36.791][37][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1137] [C21][S3699665653477458718] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2019-06-28 13:02:36.791][37][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
[2019-06-28 13:02:36.791][37][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1329] [C21][S3699665653477458718] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 28 Jun 2019 13:02:36 GMT'
'server', 'istio-envoy'

In the log above, "upstream reset: reset reason connection termination" means that a connection in Envoy's connection pool was terminated.

Basic solutions

The following four tuning methods can be used to address the problem above:
(1) Configure HTTPRetry (attempts, perTryTimeout, retryOn) in the VirtualService to set a retry policy for errors.
(Note: the Envoy timeout must be set at the same time, because the total retry time must be less than the timeout;
in Istio this means setting HTTPRoute.timeout as well.)

(2) Configure HTTPSettings.idleTimeout in the DestinationRule to set how long idle connections stay cached in Envoy's connection pool.

(3) Set HTTPSettings.maxRequestsPerConnection in the DestinationRule to 1 (this disables keep-alive: connections are not reused, and performance drops).

(4) Change Tomcat's connectionTimeout (server.connectionTimeout in the Spring Boot configuration) to increase the connection timeout of the web container.
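
The note under method (1) boils down to a simple inequality between the retry settings and the route timeout. The Python sketch below checks it. The function name and sample values are illustrative, and it assumes that attempts counts retries on top of the initial request and that there is no backoff between tries.

```python
def retries_fit_timeout(attempts: int, per_try_timeout_s: float, route_timeout_s: float) -> bool:
    """Return True if the first try plus all retries fit within the route timeout."""
    # Assumption: `attempts` is the number of retries after the initial request.
    worst_case_s = (1 + attempts) * per_try_timeout_s
    return worst_case_s < route_timeout_s


if __name__ == "__main__":
    # Hypothetical values, chosen only to show both outcomes.
    print(retries_fit_timeout(attempts=3, per_try_timeout_s=2.0, route_timeout_s=10.0))
    print(retries_fit_timeout(attempts=3, per_try_timeout_s=5.0, route_timeout_s=10.0))
```

If the check fails, the route timeout will probably cut off the later retries before they can finish.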

See also the following article on troubleshooting 503s in Istio: "Istio: 503's with UC's and TCP Fun Times" (available in English and in a Chinese translation).

Overall, the investigation comes down to four main methods:

(1) Inspect traces in the Jaeger UI (filter by the tag error=true);

(2) Inspect metrics (Istio, Envoy);

(3) Inspect the istio-proxy debug log;

(4) Capture network packets.

In practice I used only methods (1), (3), and (4).

Jaeger UI

When troubleshooting with method (1), Jaeger (you can temporarily set PILOT_TRACE_SAMPLING to 100, that is, trace every request), pay attention to the following:

(1) Set the tag filter error=true in the search conditions to quickly find traces with errors;

(2) Look at the response_flags field in the trace details. It indicates the type of response failure and helps pinpoint the cause quickly. The flag values are described in the Envoy documentation.
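
As a quick aid for reading response_flags, the Python sketch below maps a few flags to short descriptions. The table is a small subset written from memory of the Envoy access log documentation, so treat the wording as approximate. The function name is illustrative.

```python
# Partial, approximate table of Envoy response flags (not exhaustive).
RESPONSE_FLAGS = {
    "UF": "upstream connection failure",
    "UC": "upstream connection termination",
    "UH": "no healthy upstream hosts",
    "UO": "upstream overflow (circuit breaking)",
    "UT": "upstream request timeout",
    "URX": "upstream retry limit exceeded",
    "NR": "no route configured",
}


def explain_flags(flags: str):
    """Split a comma-separated flag string and attach a description to each flag."""
    parts = [part.strip() for part in flags.split(",")]
    return [(p, RESPONSE_FLAGS.get(p, "unknown flag")) for p in parts if p and p != "-"]


if __name__ == "__main__":
    for flag, meaning in explain_flags("UF,URX"):
        print(flag, meaning)
```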

istio-proxy log

For method (3), set the istio-proxy log level to debug (or trace) and focus on the following log content:

(1) The HTTP response code, for example '503';

(2) Above the HTTP response code (for example 503), find the matching line "upstream reset: reset reason connection termination" to locate the reason for the failure;

(3) Keep looking further up for how the connection was obtained: "using existing connection" or "creating a new connection".

Usually a failure on an existing connection means that the connection cached in Envoy's connection pool had already become invalid, while a failure on a new connection means you need to look for other causes. Below I describe the new-connection problem I ran into in practice.
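
The three log-reading steps above can be sketched as a small Python scan. For every "upstream reset" line it reports the reset reason and whether the most recent pool line was "using existing connection" or "creating a new connection". The function name is illustrative. The pairing is a simplification: on a busy proxy, lines from different requests interleave, so matching by connection id would be more reliable.

```python
import re

RESET_RE = re.compile(r"upstream reset: reset reason (.+)$")


def classify_resets(log_lines):
    """Return (reset_reason, connection_kind) pairs found in istio-proxy debug log lines."""
    results = []
    last_kind = None  # "existing", "new", or None if no pool line was seen yet
    for line in log_lines:
        if "using existing connection" in line:
            last_kind = "existing"
        elif "creating a new connection" in line:
            last_kind = "new"
        match = RESET_RE.search(line.strip())
        if match:
            results.append((match.group(1), last_kind))
            last_kind = None
    return results


if __name__ == "__main__":
    sample = [
        "[debug][pool] [C26] using existing connection",
        "[debug][router] [C21] upstream reset: reset reason connection termination",
    ]
    print(classify_resets(sample))
```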

Packet capture

You can use the kubectl plugin ksniff, but I could not get it to work in practice (wireshark-gtk failed to start), so I fell back to plain tcpdump. The main steps are:

(1) Open a shell in the application container: kubectl exec -it xxx -c app -n tsp /bin/bash;

(2) Run tcpdump and write the result to a file: sudo tcpdump -ni lo port 8080 -vv -w my-packets.pcap;
-i selects the lo (loopback) interface, so only traffic between the local Envoy and the application is captured (Envoy and the application are on the same host and communicate over localhost)
-n shows IP addresses instead of resolving them to host names
port 8080 captures only traffic on port 8080 (the port the application listens on)
-vv shows verbose output
-w writes the result to the file my-packets.pcap

(3) Log in to the node where the Pod is running and copy the my-packets.pcap file from step (2) to the node with docker cp;

(4) Fetch my-packets.pcap from the node and open it in Wireshark.

Note: the istio-proxy container has a read-only file system and cannot write files, so run tcpdump in the application container.

The root cause of my problem

After all the effort above, I changed my VirtualService and DestinationRule, but the 503s remained. I also considered whether host connection limits and network settings (ulimit, tcp_tw_recycle, and so on) were to blame. I upgraded Istio (from 1.1.7 to 1.1.11; releases after 1.1.7 contain a fix for a 503 bug), but no matter what I tried, the 503s did not go away.

Oddly, everyone on GitHub reported the problem with "using existing connection", whereas mine occurred with "creating a new connection". My full log is as follows:

[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:92] creating a new connection
[2019-07-16 08:59:23.853][31][debug][client] [external/envoy/source/common/http/codec_client.cc:26] [C297] connecting
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:644] [C297] connecting to 127.0.0.1:8080
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:653] [C297] connection in progress
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:94] Called Mixer::Filter : decodeData (84, false)
[2019-07-16 08:59:23.853][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1040] [C93][S18065063288515590867] request end stream
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:94] Called Mixer::Filter : decodeData (0, true)
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:526] [C297] delayed connection error: 111
[2019-07-16 08:59:23.853][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C297] closing socket: 0
[2019-07-16 08:59:23.853][31][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C297] disconnect. resetting 0 pending requests
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:133] [C297] client disconnected, failure reason:
[2019-07-16 08:59:23.853][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:173] [C297] purge pending, failure reason:
[2019-07-16 08:59:23.853][31][debug][router] [external/envoy/source/common/router/router.cc:644] [C93][S18065063288515590867] upstream reset: reset reason connection failure
[2019-07-16 08:59:23.853][31][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-07-16 08:59:23.853][31][trace][http] [external/envoy/source/common/http/conn_manager_impl.cc:1200] [C93][S18065063288515590867] encode headers called: filter=0x5c79f40 status=0
[2019-07-16 08:59:23.853][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1305] [C93][S18065063288515590867] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Tue, 16 Jul 2019 08:59:23 GMT'
'server', 'istio-envoy'

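From the log I found that my failure occurred when Envoy connected to the local application at 127.0.0.1:8080 ("connection failure"), and response_flags in the Jaeger UI was UF (upstream connection failure). The failure was intermittent: sometimes the connection succeeded and sometimes it failed.

On a clear Friday morning (after almost a week of struggling >_<|||), I noticed the following:

Checking my application containers with docker ps | grep app, I wondered why every application container had been up for only 6 or 7 minutes.

That looked like the culprit. So many containers with an uptime of 6 or 7 minutes meant that the application containers were being restarted continuously, and the reason for the restarts was a failing K8s health check. I immediately checked the K8s health check configuration:

The container exposes containerPort = 8080, while tcpSocket.port in the livenessProbe was set to 80. The two values do not match, and with this health check configuration:

Initial delay 300 s (5 minutes) + first failed probe + (3-1) failed retries * 60 s retry interval = 5 minutes + 2 * 1 minute = at least 7 minutes (roughly 7 to 8 minutes)

As a result, the application was declared unhealthy after 7 to 8 minutes, so an application container never ran for more than 8 minutes and was restarted over and over. During each restart, Envoy's attempts to connect to the application inevitably ended in "connection failure", which explains the intermittent 503s. The time windows in which the frontend (a periodic heartbeat) got 503s from the backend service also matched the restart times of the application container, which further confirmed the cause of the connection failures:

A misconfigured health check caused the application container to restart continuously, and the connection failures occurred during the restarts.

After fixing the livenessProbe in all Deployments, the 503 problem disappeared.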
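
The container uptime of 6 to 7 minutes is consistent with the probe settings reported in this write-up (300 s initial delay, 60 s interval, 3 failures before restart). The Python sketch below reproduces that estimate. The parameter names mirror the Kubernetes probe fields initialDelaySeconds, periodSeconds, and failureThreshold; container start-up and shutdown time are ignored, so the real figure is somewhat higher.

```python
def seconds_until_restart(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Earliest moment at which a liveness probe that always fails triggers a restart."""
    # The first probe runs after the initial delay; every further failure
    # needed to reach the threshold costs one more probe period.
    return initial_delay_s + (failure_threshold - 1) * period_s


if __name__ == "__main__":
    total = seconds_until_restart(initial_delay_s=300, period_s=60, failure_threshold=3)
    print(total, "seconds =", total / 60, "minutes")
```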

I can go out and enjoy the weekend again.

Summary

My carelessness led to a misconfigured health check, which in turn caused the Istio 503s. I still do not fully understand the relevant configuration and need to study it further.

Still, by working through the 503 problem I came to understand Istio troubleshooting much better, and I will be able to locate problems faster in the future. Don't give up lightly.

Source

upstream connect error or disconnect/reset before headers. reset reason: connection termination when using Spring Boot

I am using Spring Boot with embedded Tomcat 9.0.36. It runs as a Docker image in Kubernetes. Recently, after an Envoy upgrade, I started getting exceptions.

Some people suggested increasing the connection idle timeout to 60 seconds, but in Spring Boot I could only find the "connection timeout" and the "keep-alive timeout". I increased both to 5 minutes in code.

Even so, I get the same error. Internally, this application calls another service that is also hosted in Kubernetes. I can see the successful response in my service, but after that I see no more logs.

I spent a week analyzing this from the application side. I followed several steps suggested by the Ops team.

Increase the timeout in the Tomcat server to 60 seconds, because they had configured the same value in Envoy.
I increased the timeout, but that did not solve the problem.
I was using Spring Cloud Gateway for the gateway service and suspected it was the problem, so I switched to RestTemplate, but that did not solve the problem either.
Fortunately, the health check APIs worked fine, unlike the ones that talk to other internal services. The health APIs also contacted other services to check their health, but there I did not return the upstream response directly: I built the response body myself and forwarded it to the UI. I applied the same approach here. I created a new response entity, dropped all the headers I had received from the internal APIs, and returned it to the UI. It worked like a charm.
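
The fix described above, building a fresh response instead of relaying the upstream one, can be illustrated with a language-neutral Python sketch. This is not the poster's Spring code. The function name is hypothetical, and keeping Content-Type is an added assumption. One plausible explanation, not confirmed by the source, is that relayed headers such as Transfer-Encoding no longer matched the body that was actually sent.

```python
def rebuild_response(upstream_status: int, upstream_headers: dict, upstream_body: bytes):
    """Build a new response that reuses the upstream body but none of its transport headers."""
    body = upstream_body
    headers = {
        # Assumption: only the media type is worth carrying over.
        "Content-Type": upstream_headers.get("Content-Type", "application/octet-stream"),
        # Recomputed from the body we actually send, never copied from upstream.
        "Content-Length": str(len(body)),
    }
    return upstream_status, headers, body


if __name__ == "__main__":
    upstream = {"Content-Type": "application/json", "Transfer-Encoding": "chunked"}
    print(rebuild_response(200, upstream, b'{"ok": true}'))
```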

Source

#http #kubernetes #devops #load-balancing #envoyproxy

Question:

I'm new to Envoy. I've been struggling with the error below for a week. My downstream (the server that requests some data/update) receives this response:

Status code: 503
Headers:
...
Server: "envoy"
X-Envoy-Response-Code-Details: "upstream_reset_before_response_started{connection_failure}"
X-Envoy-Response-Flags: "UF,URX"
Body: upstream connect error or disconnect/reset before headers. reset reason: connection failure

On the other side, my upstream sees a disconnect (context canceled). The upstream service itself does not return any 503 codes at all.

The whole network runs over HTTP/1.

My envoy.yaml:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: http1
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              response_headers_to_add: # added for debugging
              - header:
                  key: x-envoy-response-code-details
                  value: "%RESPONSE_CODE_DETAILS%"
              - header:
                  key: x-envoy-response-flags
                  value: "%RESPONSE_FLAGS%"
              routes:
              - match: # consistent routing
                  safe_regex:
                    google_re2: { }
                    regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
                route:
                  cluster: consistent_cluster
                  hash_policy:
                    header:
                      header_name: ":path"
                  regex_rewrite:
                    pattern:
                      google_re2: { }
                      regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
                    substitution: "1"
                  retry_policy: # attempt to avoid 503 errors by retries
                    retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
                    retriable_status_codes: [ 503 ]
                    num_retries: 3
                    retriable_request_headers:
                    - name: ":method"
                      exact_match: "GET"

              - match: { prefix: "/" } # default routing (all routes except consistent)
                route:
                  cluster: default_cluster
                  retry_policy: # attempt to avoid 503 errors by retries
                    retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
                    retriable_status_codes: [ 503 ]
                    retry_host_predicate:
                    - name: envoy.retry_host_predicates.previous_hosts
                    host_selection_retry_max_attempts: 3
          http_filters:
          - name: envoy.filters.http.router

  clusters:
  - name: consistent_cluster
    connect_timeout: 0.05s
    type: STRICT_DNS
    dns_refresh_rate: 1s
    dns_lookup_family: V4_ONLY
    lb_policy: MAGLEV
    health_checks:
    - timeout: 1s
      interval: 1s
      unhealthy_threshold: 1
      healthy_threshold: 1
      http_health_check:
        path: "/health"
    load_assignment:
      cluster_name: consistent_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: consistent-host
                port_value: 80

  - name: default_cluster
    connect_timeout: 0.05s
    type: STRICT_DNS
    dns_refresh_rate: 1s
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    health_checks:
    - timeout: 1s
      interval: 1s
      unhealthy_threshold: 1
      healthy_threshold: 1
      http_health_check:
        path: "/health"
    outlier_detection: # attempt to avoid 503 errors by ejecting unhealthy pods
      consecutive_gateway_failure: 1
    load_assignment:
      cluster_name: default_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: default-host
                port_value: 80

I also tried adding access logs:

access_log:
- name: accesslog
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /tmp/http_access.log
    log_format:
      text_format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %CONNECTION_TERMINATION_DETAILS% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n"
  filter:
    status_code_filter:
      comparison:
        op: GE
        value:
          default_value: 500
          runtime_key: access_log.access_error.status

This gave me nothing, because %CONNECTION_TERMINATION_DETAILS% is always empty ("-"), and I had already seen the response flags in the headers of the downstream responses.

I increased connect_timeout twice (0.01s -> 0.02s -> 0.05s). It did not help at all. Other services (with direct routing) work fine with a connect timeout of 10 ms. By the way, after a redeploy everything reliably works well for about 20 minutes.

I hope to hear your ideas about what this could be and where I should dig.

P.S. I also sometimes get health check errors (in the logs), but I have no idea why. Without Envoy everything worked well (no errors, no timeouts): health checks, direct requests, and so on.
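
One way to follow up on the connect_timeout observation in the question is to measure how long a plain TCP connect to the upstream actually takes and compare it with the 0.05s budget. The Python sketch below does that. The function names are illustrative, and the approach is a suggestion rather than anything proposed in the thread. When a host name is used, the measurement also includes name resolution, so it may overestimate the pure connect time.

```python
import socket
import time


def connect_latency_s(host: str, port: int, timeout_s: float = 1.0) -> float:
    """Open one TCP connection and return how long the connect took, in seconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        return time.perf_counter() - start


def share_over_budget(samples, budget_s: float) -> float:
    """Fraction of measured connect times that exceed the given budget."""
    slow = [s for s in samples if s > budget_s]
    return len(slow) / len(samples)


if __name__ == "__main__":
    # Hypothetical target: replace with the real upstream host and port.
    samples = [connect_latency_s("127.0.0.1", 80) for _ in range(20)]
    print(share_over_budget(samples, budget_s=0.05))
```

A non-trivial share of samples above the budget would point at the network path or the upstream's accept queue rather than at Envoy's routing configuration.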

Answer #1:

I ran into a similar problem when running Envoy as a Docker container. In the end, the cause was a missing --network host option in the docker run command, which made the clusters unreachable from the Envoy Docker container. Maybe this helps you too?

1. This all runs in Kubernetes. I have separate Deployments and Services for Envoy and the application. So the networking here is fine, and I hit this error only rarely, but sometimes it breaks the flow.

Source

Here’s what “upstream connect error or disconnect/reset before headers connection failure” means and how to fix it:

If you are an everyday user, and you see this message while browsing the internet, then it simply means that you need to clear your cache and cookies.

If you are a developer and see this message, then you need to check your service routes, destination rules, and/or traffic management with applications.

So if you want to learn all about what this 503 error means exactly and how to fix it, then this article is for you.

Let’s delve deeper into it!

Upstream connect error or disconnect reset before headers reset reason connection failure.

That’s a very specific, yet unclear error message to see.

What is it trying to tell you?

Let’s start with an overview.

This is a 503 error message.

It’s a generic message that actually applies to a lot of different scenarios, and the fix for it will depend on the specific scenario at hand.

In general, this error is telling you that there is a connection error, and that error is linked to routing services and rules.

That leaves an absolute ton of possibilities, but I’ll take you through the most common sources.

Then, we can talk about troubleshooting and fixing the problem.

That covers the very zoomed-out picture of this error message, but if you’re getting it, then you probably want to get it to go away.

To fix the problem, we have to address the root cause.

That’s the essence of troubleshooting, and it definitely applies here.

There’s a problem when it comes to identifying the cause of this error.

There are basically two instances where you’re going to see this error, and they are completely different.

One place where you’ll run into it is when you’re coding specific functions that relate to network connection management.

I’m going to break down the three most common scenarios that lead to this error in the next few sections.

But, the other common time you see this error is when you’re browsing the internet.

That means that I’m really answering this question for two very different groups of people.

One group is developing or coding networking resources.

The other group is just browsing the internet.

As you might imagine, it’s hard to consolidate all of that into a single, concise answer.

So, I’m going to split this up.

First, I’ll tackle the developer problems.

If you’re just trying to browse the internet and don’t want to get deep into networking and how it works, then skip to the section that is clearly labeled as not for developers and programmers.

That said, if you want to take a peek behind the curtain and learn a little more about networking, I’ll try to keep these explanations as light as possible.

#1 Reconfiguring Service Routes

I mentioned before that this is a 503 error.

One common place you’ll find it is when reconfiguring service routes.

The boiled-down essence here is that it’s easy to mix up the order of service routes and rules, so that the system receives routes pointing at subsets before those subsets are defined.

Naturally, the system doesn’t know what to do in that case, and you get a 503 error.

The key to avoiding this problem with service route reconfiguring is to follow what you might call a “make-before-break” rule.

Essentially, the steps force the system to add the new subset first and then update the virtual services.
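To make the “make-before-break” idea concrete, here is a minimal sketch of a pre-flight check, assuming your routing config has already been loaded into plain Python dicts shaped like Istio VirtualService and DestinationRule resources. The function name and the sample data are hypothetical, and host matching is simplified to exact string equality.

```python
def missing_subsets(virtual_service, destination_rules):
    """Return (host, subset) pairs the VirtualService routes to
    that no DestinationRule defines yet."""
    defined = set()
    for rule in destination_rules:
        host = rule["spec"]["host"]
        for subset in rule["spec"].get("subsets", []):
            defined.add((host, subset["name"]))

    referenced = set()
    for http_route in virtual_service["spec"].get("http", []):
        for route in http_route.get("route", []):
            dest = route["destination"]
            if "subset" in dest:
                referenced.add((dest["host"], dest["subset"]))

    return referenced - defined


# Hypothetical example: the route already points at v2,
# but only v1 has been defined as a subset.
vs = {"spec": {"http": [{"route": [
    {"destination": {"host": "reviews", "subset": "v1"}},
    {"destination": {"host": "reviews", "subset": "v2"}},
]}]}}
dr_old = {"spec": {"host": "reviews", "subsets": [{"name": "v1"}]}}
dr_new = {"spec": {"host": "reviews",
                   "subsets": [{"name": "v1"}, {"name": "v2"}]}}

print(missing_subsets(vs, [dr_old]))  # {('reviews', 'v2')}
print(missing_subsets(vs, [dr_new]))  # set()
```

If the check returns anything, the destination rule needs to go out first. Once it comes back empty, updating the virtual service should be safe from this particular failure.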

#2 Setting Destination Rules

Considering the issue above, it should not come as a surprise that you can trigger 503 errors when setting destination rules.

Most commonly, destination rules are the issue if you see the 503 errors right after a request to a service.

This issue goes hand in hand with the one above.

The problem is still that the destination rule is creating the issue.

The difference is that this isn’t necessarily a problem with receiving subsets before they have been defined.

Virtually any destination rule error can lead to a 503 message.

Since there are so many ways these rules can break down and so many ways the problems can manifest, I’m going to cheat a little.

If you noticed that the problem correlates with new destination rules, then you can follow this guide.

It breaks down the most common destination rule problems and shows you how to overcome them.

#3 Traffic Management With Applications

The third primary issue is related to conflicts between applications and any proxy sidecar.

In other words, the applications that work with your traffic management rules might not know those rules, and the application can do things that don’t play well with the traffic management system.

That’s pretty vague because, once again, there are a lot of specific possibilities.

The gist is that you’re trying to offload as much error recovery to the applications as you can.

That will minimize these conflicts and resolve most instances of 503 errors.

Considering the detailed problems we just covered, what can you do about the 503 error?

I included some solutions and linked to even more, but if you’re looking for a general guide, then here’s another way to think about the whole thing.

This specific message is telling you that the proxy either failed to connect to the upstream service or had the connection dropped or reset before it received any response headers.

Often, that traces back to conflicting traffic rules, or to rules that were applied out of order.

The best way to find the specific area is to focus on rules changes as they relate to traffic management.

Essentially, start with what you touched most recently, and work your way backward from there.

Ok, but What if I’m Not a Developer or Programmer? (3 Steps)

Alright. That was a relatively deep walk-through of connection rules development.

If you’re still with me, that’s great.

We’re going to switch gears and look at this from a simple user perspective.

You don’t need to know any coding to run into this problem, and I’m going to show you how to solve it without any coding either.

It’s actually pretty simple.

#1 The Walmart Bug

But, it still makes more sense when you know more about what went wrong.

So, I’m going to cite one of the most prolific examples of everyday 503 errors.

In 2020, Walmart’s website ran into widespread issues.

Users could browse the site just fine, but when they tried to go to a specific product page to make a purchase, they got the 503 error.

It popped up word for word as I mentioned before: upstream connect error or disconnect/reset before headers. reset reason: connection failure.

People were just trying to buy some stuff, and they got hit with this crazy message.

What are you supposed to do with it?

#2 An Easy Fix

Well, the message is actually giving you very specific advice, once you know how to read it.

It’s telling you that your computer and the Walmart servers had a connection failure, and when they tried to automatically fix that connection problem, things broke down.

A quick note: I’m using the famous Walmart bug as an example, but the problems and solutions discussed here will work any time you see this message while browsing the web.

What that means is that there is some piece of information that is tied to your connection to the Walmart site that is messing up the automatic reconnect protocols.

While that might sound a little vague and mysterious, it actually tells us exactly where the problem lies.

The only information that could exist in this space would have to be stored in your browser’s cache.

This is related to your cookies.

Basically, when the connection first went wrong, your computer remembered the problem, and so it just kept doing things the wrong way over and over again.

The solution requires you to make your computer forget the bad rule that it’s following.

To do that, you simply need to clear your cache and cookies.

#3 Clearing the Cache

The famous Walmart problem plagued Chrome users, so I’ll walk you through how to do this on Google Chrome.

If you use a different browser, you can just look up how to clear cache and cookies.

Before we go through the steps, let me explain what is going to happen here.

We’re not deleting anything that is particularly important.

Your internet cache is just storing information related to the websites you visit.

Then, if you go back to that website or reload it, the stored information means that your computer doesn’t actually have to download as much information, and everything can load a little faster and easier.

So, when you delete this cache, it’s going to do a few things.

It’s going to slow down your first visit to any site that no longer has cached files.

But after you visit a site, it will build new cache files, and things will work normally.

This is also going to make your computer forget your sign-in information for any sites that require it.

Sticking with Walmart as an example, if you were signed into the website with your account, then after you clear the cache, you’re going to be automatically signed out again.

Make sure you know your passwords and usernames.

Because of this last issue, some people don’t like to clear their cache.

If you’re worried about that, then you don’t have to clear everything.

Just clear the cache back through the day when the error started.

Ok. With all of that covered, let’s go through the steps:

1. Look for the three dots and click on them (this opens the tools menu).
2. Choose “history” from the list.
3. Choose the time frame on the right that covers the data you want to clear.
4. Click on “Clear browsing data.”
5. Look at the checkboxes. You can choose cookies, cached images and files, and browsing history.
6. To be sure you resolve the 503 error, clear the cookies and cached files.
7. Click on “Clear Data” and you’re done.


I’m having a problem migrating my pure Kubernetes app to an Istio-managed one. I’m using Google Cloud Platform (GCP), Istio 1.4, Google Kubernetes Engine (GKE), Spring Boot and Java 11.

I had the containers running in a pure GKE environment without a problem. Now I started the migration of my Kubernetes cluster to use Istio. Since then I’m getting the following message when I try to access the exposed service.

upstream connect error or disconnect/reset before headers. reset reason: connection failure

This error message looks really generic. I found a lot of different problems with the same error message, but none were related to my problem.

Below is the Istio version:

client version: 1.4.10
control plane version: 1.4.10-gke.5
data plane version: 1.4.10-gke.5 (2 proxies)

Below are my YAML files:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    account: tree-guest
  name: tree-guest-service-account
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tree-guest
    service: tree-guest
  name: tree-guest
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  selector:
    app: tree-guest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tree-guest
    version: v1
  name: tree-guest-v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tree-guest
      version: v1
  template:
    metadata:
      labels:
        app: tree-guestaz
        version: v1
    spec:
      containers:
      - image: registry.hub.docker.com/victorsens/tree-quest:circle_ci_build_00923285-3c44-4955-8de1-ed578e23c5cf
        imagePullPolicy: IfNotPresent
        name: tree-guest
        ports:
        - containerPort: 8080
      serviceAccount: tree-guest-service-account
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: tree-guest-gateway
spec:
  selector:
    istio: ingressgateway # use istio default controller
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: tree-guest-virtual-service
spec:
  hosts:
    - "*"
  gateways:
    - tree-guest-gateway
  http:
    - match:
        - uri:
            prefix: /v1
      route:
        - destination:
            host: tree-guest
            port:
              number: 8080

To apply the YAML file I used the following command:

kubectl apply -f <(istioctl kube-inject -f ./tree-guest.yaml)

Below is the result of the Istio proxy status command, after deploying the application:

istio-ingressgateway-6674cc989b-vwzqg.istio-system SYNCED SYNCED SYNCED SYNCED istio-pilot-ff4489db8-2hx5f 1.4.10-gke.5
tree-guest-v1-774bf84ddd-jkhsh.default SYNCED SYNCED SYNCED SYNCED istio-pilot-ff4489db8-2hx5f 1.4.10-gke.5

If anyone has a tip about what is going wrong, please let me know. I’ve been stuck on this problem for a couple of days.

Thanks.
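One thing that stands out in the manifests as posted: the Service and Deployment selectors use app: tree-guest, while the pod template label reads app: tree-guestaz. This may simply be a typo in the post, and it is not confirmed as the cause of the error. A small sketch of an equality-based selector check (matchLabels only, not matchExpressions) shows the mismatch. The helper name is hypothetical.

```python
def selector_matches(selector, labels):
    """True if every key/value in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Values copied from the YAML in the question.
service_selector = {"app": "tree-guest"}
deployment_selector = {"app": "tree-guest", "version": "v1"}
pod_template_labels = {"app": "tree-guestaz", "version": "v1"}

print(selector_matches(service_selector, pod_template_labels))     # False
print(selector_matches(deployment_selector, pod_template_labels))  # False
```

If the real manifests carry the same mismatch, the Service would select no pods, which is one way to end up with a connection failure at the gateway.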


upstream connect error or disconnect/reset before headers. reset reason: connection termination #19966

Comments

rnkhouse commented Jan 7, 2020 •

Bug description
I upgraded Istio from 1.3.6 to 1.4.2 and suddenly started getting the error below. Are there any changes that I need to make on version 1.4.2 to run previous applications? How can I debug this error to find the actual issue? In the logs there is no info other than error code 503.

upstream connect error or disconnect/reset before headers. reset reason: connection termination

I checked that the service is up and running with a valid endpoint.

service.yaml

Application istio-proxy logs

ingress gateway logs

Extra info

Expected behavior
The application should run without error message over ingress gateway.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
1.4.2

How was Istio installed?
helm template

Environment where bug was observed (cloud vendor, OS, etc)
AKS


rnkhouse commented Jan 9, 2020 •

Not sure why is this happening but when I added name in Service ports it worked.

bishtawi commented Jan 25, 2020 •

Just commenting here to say that I encountered this same error ( upstream connect error or disconnect/reset before headers. reset reason: connection termination ) when I upgraded from 1.3 to 1.4 and wasted a ton of time trying to debug and figure out what exactly was causing it. I was able to downgrade to 1.3.x with no issue so it was not a huge blocker or anything but just had no idea how to fix it.

Your solution of adding names to the ports in the Kubernetes Services worked for me and I am very grateful.

This should be documented somewhere as it is not obvious. Kubernetes Service port names are optional if you only have a single port and I am sure a lot of other people are hitting this wall. Here for example.

baocang commented Mar 5, 2020

Thx @rnkhouse, it works for me too

JoeJasinski commented Mar 6, 2020

I had this same issue too when I upgraded to Istio 1.4.6, but I did NOT see it with Istio 1.4.3. However, simply giving the port a name did not work. I had previously named it interface , but that resulted in the above error. When I named it http , then it worked fine.

trieszklr commented Mar 17, 2020

see it too with istio 1.4.4

Krenair commented Mar 19, 2020 •

I’ve just run into this as well (tested in 1.4.0 — same symptom was observed on 1.4.6) — this feels like something that should’ve been mentioned at https://istio.io/news/releases/1.4.x/announcing-1.4/upgrade-notes/
It looks like things like https://github.com/helm/charts/blob/master/stable/concourse/templates/web-svc.yaml#L36 are incompatible with this requirement?

Krenair commented Mar 19, 2020 •

Setting PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND=false in the istio-pilot deployment environment and deleting the istio-ingressgateway/concourse-web pods has also done the trick, with an atc ServicePort name.
I’ve also found that skipping 1.4.x entirely and going to 1.5 is fine.

fatimariaz17 commented Mar 26, 2020

Had same issue for jaeger service. Having istio 1.4.3 version.
Changed port name from query-http to http-query and it worked!
Please fix it.

sourabhparsekar commented Nov 20, 2020

Not sure why is this happening but when I added name in Service ports it worked.

This one worked for us too.. phew.. great save.

xh3b4sd commented Dec 23, 2020

FWIW I had the same problem with the service port names. Though in my case it was that grpcurl could talk to the gRPC server backend behind envoy, where some webapp could not. So I changed name from grpc to grpc-web and made it work for both the webapp and grpcurl . There is something about upgrading HTTP 1.1 to HTTP 2 that I do not fully understand why the Kubernetes service name would have such an effect. grpcurl speaks HTTP 2 natively whereas the gRPC web magic does not.

eooall commented Feb 26, 2021

ports:
  - port: 12306
    name: web-http
    targetPort: 12306

ports:
  - port: 12306
    name: grpc-web-http
    targetPort: 12306

jonaseicher commented Apr 26, 2021

said-saifi commented Feb 18, 2022 •

Just to make it clear for others: the name is not free text.
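The port names reported in this thread are consistent with a `<protocol>[-<suffix>]` naming convention, where the protocol must come first. Below is a rough sketch of a checker for that convention. The list of protocol prefixes is an assumption based on my reading of the Istio documentation and may differ between Istio versions. The function is illustrative and is not Istio’s own code.

```python
from typing import Optional

# Assumed set of recognized protocol prefixes; verify against the docs
# for the Istio version you run.
ISTIO_PROTOCOL_PREFIXES = {
    "grpc", "grpc-web", "http", "http2", "https",
    "mongo", "mysql", "redis", "tcp", "tls", "udp",
}


def declared_protocol(port_name: str) -> Optional[str]:
    """Return the protocol a Service port name declares, or None if the
    name does not start with a recognized protocol."""
    name = port_name.lower()
    # Try longer prefixes first so "grpc-web" wins over "grpc".
    for proto in sorted(ISTIO_PROTOCOL_PREFIXES, key=len, reverse=True):
        if name == proto or name.startswith(proto + "-"):
            return proto
    return None


# Port names taken from the comments in this thread.
for name in ["http", "interface", "query-http", "http-query",
             "web-http", "grpc-web-http"]:
    print(name, "->", declared_protocol(name))
```

Under these assumptions, interface, query-http and web-http declare no protocol, while http, http-query and grpc-web-http do, which lines up with what commenters reported as failing and working.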


upstream connect error or disconnect/reset before headers #25734

Comments

colt-rex commented Jul 22, 2020

Bug description
Large requests over http frequently give an error upstream connect error or disconnect/reset before headers. reset reason: connection termination . With the bookinfo application but no sidecar, sending a 3MB file fails roughly 3% of the time. With the sidecar proxy enabled, sending the same 3MB file fails roughly 10% of the time.

The detailed output from curl on a failed request is:

Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ X ] Networking
[ X ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Expected behavior
The expected behavior is that I should be able to send a file to an API a thousand times in a row with zero errors. When I run the API without istio’s functionality, I can do that.

Steps to reproduce the bug
The easiest way to reproduce the bug is using the standard "bookinfo" application.

Run istioctl install --set profile=demo
Run kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml; kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml
Repeatedly run curl -F 'foo=@/path/to/large/file' $/productpage , where you pass in some large file of a few MB. Some will succeed and some will fail. NOTE*

I ran the above experiment 1,000 times WITHOUT sidecar injection enabled. In that experiment, 29 of the 1,000 requests failed to complete and returned the upstream connect error .

I ran the experiment 1,000 times WITH sidecar injection enabled. Interestingly, the error rate INCREASED with the proxy enabled: 96 of 1,000 requests failed to go through; and the other 904 returned the expected response (in this case, a 405).

***NOTE: A "successful" request here should return a 405 response, as we are POSTing to a GET-only endpoint. A failure is when we get the upstream connection error. I know it’s not proper to test this way, but it’s the easiest way to replicate. Just pretend for a minute that a 405 is like a 200, and trust (or verify) that if you want to, you can replicate the same behavior with a POST endpoint, but you’ll have to deploy a different container.
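For anyone repeating the experiment, a small sketch of how the error rate can be tallied from recorded status codes. It assumes each failed request surfaced as an HTTP 503, which the report does not state explicitly, and it treats 405 as the success case as described in the note above. The lists are reconstructed from the reported counts, not from raw data.

```python
from collections import Counter


def failure_rate(status_codes, expected=405):
    """Fraction of responses whose status code differs from the expected one."""
    if not status_codes:
        raise ValueError("no responses recorded")
    counts = Counter(status_codes)
    failures = len(status_codes) - counts[expected]
    return failures / len(status_codes)


# Reconstructed from the counts above: 29/1,000 and 96/1,000 failures.
without_sidecar = [405] * 971 + [503] * 29
with_sidecar = [405] * 904 + [503] * 96

print(f"{failure_rate(without_sidecar):.1%}")  # 2.9%
print(f"{failure_rate(with_sidecar):.1%}")     # 9.6%
```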

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

How was Istio installed?
Istio was installed as per documentation: https://istio.io/latest/docs/setup/getting-started/

Environment where bug was observed (cloud vendor, OS, etc)
Docker-Desktop on MacOS

Additionally, please consider attaching a cluster state archive by attaching
the dump file to this issue.


upstream connect error or disconnect/reset before headers #2852

Comments

rileyjbauer commented Mar 28, 2019

I and others have recently been seeing the "upstream connect error or disconnect/reset before headers" error with some frequency.

It doesn’t seem to be deterministic, for example, only one of the below requests failed.

and upon refreshing the page, a different one, or more, of those same requests may fail.

The errors seem to dissipate after refreshing the page a few times, and I have not yet encountered this while port-forwarding, as opposed to using the "cluster.endpoints.project.cloud.goog" URL for my deployment.

I wasn’t sure if this should be its own issue, or should be added to #1710.


jlewi commented Mar 31, 2019

I think upstream errors are an issue indicating that Ambassador thinks the backend it’s forwarding traffic to is unhealthy.

Are there particular backends you are seeing this error with?

IronPan commented Apr 4, 2019

I have seen the same error. This is happening to me when loading runs for a scheduled pipeline in pipeline UI. @jlewi Do you think this can be caused by pipeline?

IronPan commented Apr 4, 2019

FWIW this is happening among a batch of requests. the rest of the requests succeeded indicating the backend should be running.

Ark-kun commented Apr 8, 2019

I think upstream errors are an issue indicating that Ambassador thinks the backend it’s forwarding traffic to is unhealthy.

Are there particular backends you are seeing this error with?

This is happening with the root Kubeflow UX on Kubeflow deployments with IAM enabled.
It seems to be happening more and more. Previously it was happening after waiting for several hours. Now it can happen after a few minutes.

jlewi commented Apr 16, 2019

@Ark-kun @IronPan @rileyjbauer when you observe this error can you take a look and provide your Ambassador pod logs?

I noticed this and when I looked at the logs (see below) I saw errors like the following

If you observe this I might suggest trying to kill all your Ambassador pods.

Ambassador tries to setup a K8s watch on the APIServer to be notified about service changes. It looks like it is having a problem establishing a connection to the APIServer.

The problem might be dependent on Ambassador as well as your APIServer; is your APIServer under a lot of load?

We are using
quay.io/datawire/ambassador:0.37.0

It might be worth trying a newer version of Ambassador.

@ellis-bigelow Do you recall what the performance issues with Ambassador you saw were?
ambassador-5cf8cd97d5-pqrsw.pods.txt

mcminis1 commented Apr 16, 2019

I ran into this problem while installing seldon to my cluster. I added it in twice, once as seldon and another time as seldon-core. This might have been the root cause for this issue, as well as argocd not syncing.

rileyjbauer commented Apr 16, 2019

Thanks for the direction @jlewi

I tried killing the pods, but after the new ones were up I continued to see the errors, and there didn’t seem to be anything notable in the Ambassador or API server pod logs.

pdmack commented Apr 18, 2019

Seeing this too from recent master in EC2. Went down to 1 ambassador replica but no joy.

ChrisMagnuson commented May 3, 2019 •

Re posting as this thread seems more recent and active.

Upstream Envoy had an issue, only recently fixed in dev and not yet in any stable version, where if the service it was proxying to ended the connection with a FIN/ACK, Envoy would respond with only an ACK, leave the connection in its connection pool, and send the next request to that service over that connection.

The service would receive it, say a GET request, and then send a RST , since having already sent its FIN/ACK it has no way to reply to the request.

It’s a roll of the dice whether your request lands on an HTTP connection in the pool that is already dead without Envoy knowing it, or on a live one, which is why the symptoms of this issue are so intermittent.

This may be related to what you’re seeing. To confirm, if you have a way to capture packets on the service side, you should see the weird behavior of the service doing a FIN/ACK , Envoy responding with only an ACK , and then sometime later sending another request on that TCP stream, triggering the service to send a RST .

In Envoy 1.10 they improved the message you get back, so after upstream connect error or disconnect/reset before headers you will get more information; in my case I got a message like connection terminated . So if you upgrade to the latest Envoy, you may at least get additional information to confirm the source of the problem, even if it isn’t this specific Envoy issue.
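To illustrate why a half-closed connection lingering in a pool produces intermittent failures, here is a toy model. It is not Envoy’s implementation: the class, the random connection choice and the pool size of four are all assumptions made for the illustration.

```python
import random


class PooledConnection:
    def __init__(self, name):
        self.name = name
        # True once the upstream has closed its side, while the proxy
        # still believes the connection is usable.
        self.closed_by_peer = False

    def send(self, request):
        if self.closed_by_peer:
            raise ConnectionResetError(f"{self.name}: upstream answered with RST")
        return f"200 OK for {request}"


def send_via_pool(pool, request, rng):
    """Pick any pooled connection and translate a reset into a 503."""
    conn = rng.choice(pool)
    try:
        return conn.send(request)
    except ConnectionResetError:
        return "503 upstream connect error or disconnect/reset before headers"


rng = random.Random(0)  # fixed seed so the run is repeatable
pool = [PooledConnection(f"conn-{i}") for i in range(4)]
pool[2].closed_by_peer = True  # one stale connection the pool does not know about

results = [send_via_pool(pool, "GET /", rng) for _ in range(1000)]
failed = sum(r.startswith("503") for r in results)
print(failed, "of", len(results), "requests failed")
```

Only requests that happen to land on the stale connection fail, so the failures look random from the client’s side even though the cause is deterministic.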


upstream connect error or disconnect/reset before headers. reset reason: connection failure #1269

Comments

cristianmtr commented Aug 17, 2020 •

Version

cli version: 0.18.1

Description

Intermittent 503 errors on AWS cluster.

Configuration

Steps to reproduce

Spin up instances on AWS.
Wait a couple of days / hours (varies).
Notice sudden 503 errors

Expected behavior

Actual behavior

503 errors with the message

Screenshots

NOTE: The endpoint stopped responding around 15:30 in the graphs below.

Monitoring nr of bytes in:

Stack traces

Nothing useful, just:

Additional context

All the load balancer targets are marked as "unhealthy", even when they work (i.e. I can send requests and receive 2XX responses)

The load balancer healthcheck endpoint returns the following

upstream connect error or disconnect/reset before headers. #4999

Comments

johnzheng1975 commented Apr 17, 2018 •

[Environment]
Kubernetes v1.9.2 + istio 0.7.1 + cilium rs6, installed by kubespray
Kubernetes v1.9.2 + istio 0.5.1 + cilium rs6, installed by kubespray

[Steps]
1 Deploy a new serviceA
2 Wait for the new pods of ServiceA to start, and make sure all containers have started and are running successfully.
3 Use the Istio ingress URL to invoke https://xxx.ing.xxx.com/serviceA/health

[Expect Result]
The api will return 200

[Actual Result]

The api will return 503
The error message is: upstream connect error or disconnect/reset before headers.
In the logs of istio-proxy and service A, there are no messages at all. (I think this means the request never reached the container.)

[Debug message]
• curl localhost:9090/servicePath/v1/companies (In container of this svc) ok
• curl serviceA.namespace/servicePath/v1/companies (In container of this svc) ok
• curl serviceA.namespace/servicePath/v1/companies (In container of other svc) fail or ok
• curl -k https://xxx.ing.dev.XXXXX.com/servicePath/v1/companies fail

[Temp solution]
Way 1: Wait 10 minutes after the new svc is deployed. (This issue only arises sometimes.)

Way 2: If way 1 does not work, try the below (Recovery rate: 95%)
kubectl delete pod istio-ingress-67ff757554-8wlmd -n istio-system
kubectl delete pod istio-pilot-67d6ddbdf6-kfjlp -n istio-system

Way 3: If way 2 does not work, try the below (Recovery rate: 80%)
kubectl delete pod kube-dns-79d99cdcd5-t8kv5 -n kube-system
kubectl delete pod kube-dns-79d99cdcd5-zgzg8 -n kube-system

Way 4: If way 3 still does not work, try the below (Recovery rate: 100% so far)
Delete all pods of serviceA and let them be recreated.
Do way 2 again.
Do way 3 again.


thoslin commented Apr 17, 2018

imjoey commented Apr 17, 2018 •

The same issue +1, right after I ran kubectl apply -f to update my services. I use Kubernetes v1.9.3, Istio v0.7.1 and cloud provider Aliyun.

I tried all four of your ways but none worked. Meanwhile I found that the pod IPs in the access log of istio-ingress were not the same as the ones in the kubectl get pods -o wide output. So this may be caused by kube-dns , as it still held outdated data.
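The comparison described here, upstream IPs seen in the ingress access log versus the pod IPs Kubernetes currently reports, can be sketched as a simple set difference. The IP addresses below are hypothetical, and extracting them from the log and from kubectl output is left out.

```python
def stale_upstreams(upstream_ips_in_log, current_pod_ips):
    """Return IPs the ingress is still sending traffic to that no longer
    belong to any running pod."""
    return sorted(set(upstream_ips_in_log) - set(current_pod_ips))


# Hypothetical values for illustration only.
ips_from_access_log = ["10.0.1.12", "10.0.1.12", "10.0.2.7"]
ips_from_kubectl = ["10.0.2.7", "10.0.3.21"]

print(stale_upstreams(ips_from_access_log, ips_from_kubectl))  # ['10.0.1.12']
```

A non-empty result suggests the ingress is working from outdated endpoint data, which matches the observation in this comment.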

imjoey commented Apr 17, 2018

@johnzheng1975 You should file issues in https://github.com/istio/issues/issues, not here. Only confirmed, triaged and labelled issues should be filed here.

johnzheng1975 commented Apr 18, 2018

@imjoey
Since your pod IP is incorrect, restarting kube-dns and istio-ingress should work, in my experience.
And thanks for the reminder about filing the issue in the correct place.

imjoey commented Apr 18, 2018

@johnzheng1975 Thank you, it works! I also encountered an issue, rds: fetch failure ; after I restarted the istio-pilot pods, everything was OK. Maybe restarting the istio-pilot pods is also a helpful solution.

johnzheng1975 commented Apr 18, 2018

Cool!
One thing to confirm: are you using Istio 0.7.1, and does this issue still arise?
And I just want to know whether your issue will arise again. We need to resolve it completely.

imjoey commented Apr 18, 2018

@johnzheng1975 Yes, istio v0.7.1 still has the upstream connect error or disconnect/reset before headers. and rds: fetch failure issues. Thanks for your highly appreciated work. 😄

johnzheng1975 commented Apr 18, 2018 •

In my istio 0.5.1, there is no http2_protocol_options at all.

johnzheng1975 commented Apr 19, 2018 •

add envoy configure file in istio-proxy

johnzheng1975 commented Apr 20, 2018 •

Today I reproduced this issue again on another Istio 0.5.1 platform. I recorded all the details and logs, as below:
After a new version of the ham service was deployed, the pod was running well.

[40 minutes later]
Access https://xxx.ing.dev.xxxx.com/ham/api/v1/health still shows “upstream connect error or disconnect/reset before headers”.
Note: No new logs appear in the ham container, and no new logs appear in the istio-proxy container (same pod).

From its own ham container, or from another service’s container (order container), API access all succeeds.
root@hp-order-deploy-6c7d84fb55-h6rk6:/go# curl hp-ham-service.hp/ham/api/v1/health
{"health":{"status":"UP"},"service_name":"L2 HAM"}
Note: Logs showed up in both the ham container and the istio-proxy container (same pod).

[48 minutes later]
Access https://xxx.ing.dev.xxxx.com/ham/api/v1/health works. It recovered automatically.

[Logs attached]
• Ham service log
• Ham service istio-proxy container log
• Istio-ingress log
• Cilium log
istio-ingress.log
ham-istio-proxy.log
ham-service.log
cilium.log

manalibhutiyani commented May 2, 2018

@johnzheng1975 : Did you try removing http2_protocol_options: {} in Istio 0.7.1? Does it work if you remove this option from the Envoy config file for 0.7.1? I see this option in the 0.7.1_envoy_configure_in_istio_proxy.txt you attached.

johnzheng1975 commented May 3, 2018

@manalibhutiyani , after investigation, this is not the root cause.
And it is difficult to change the Envoy configuration file without creating a new image. FYI

johnzheng1975 commented May 3, 2018

Here are the comments from Romain (Cilium engineer)

discovery fatal error: concurrent map read and map write #4903 Istio Pilot crashes and sometimes enters CrashLoopBackOff.
One symptom is that Istio Ingress returns 503s, because it tries to connect to the upstream pods that have been deleted.

This is a known upstream issue, for which we just submitted a fix: #5373
The fix will likely be merged into the Istio 0.8.0 release.

This bug causes all proxies, including Istio Ingress and sidecars, to not get configured until Pilot recovers.
We have observed that this can take up to 6 minutes.

There is no workaround for that issue.
However, the severity is not as high as the 2nd issue, since Pilot seems to recover on its own.

Pilot 0.7.1 sometimes never configures in.80 cluster in sidecar proxy #5376 (we filed that issue)
The symptom is that an app’s sidecar proxy always returns 503s for all inbound traffic.
We verified that, in this situation:

Istio Ingress is healthy and correctly configured. It tries to connect to the right upstream pod.
Istio Ingress has no network connectivity issues. Especially, it can connect with the upstream pod.
Istio Ingress can correctly establish TCP connections to the backend pod, incl. HTTP connections.
The upstream pod has no network connectivity issues. It can connect to Ingress, Pilot, other services, DNS, www.google.com, etc.

The root cause is that Pilot never pushes the configuration of the in.80 cluster to the sidecar. That cluster is the one that handles all inbound traffic to port 80 in the application’s pod. Without that cluster configured, no inbound connection can be established between the pod’s Envoy proxy and the server within the same pod, so the sidecar proxy returns a 503 for every HTTP request.
When in that state, the sidecar is stuck forever, and never recovers.

The workaround is to restart Pilot.
Immediately after restarting, it pushes the right configuration to the app’s sidecar, which immediately becomes healthy and stops returning 503s.
