Passive down error during connection collision

Troubleshoot Border Gateway Protocol Basics

Introduction

This document describes how to troubleshoot the most common issues in Border Gateway Protocol (BGP) and provides basic guidelines.

Prerequisites

Requirements

There are no specific prerequisites for this document. Basic knowledge of BGP is useful; refer to the BGP configuration guide for details.

Components used

This document is not restricted to specific software and hardware versions, but commands are applicable for Cisco IOS® and Cisco IOS-XE®.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Background Information

This document is a basic guide to troubleshoot the most common issues in Border Gateway Protocol (BGP). It gives corrective actions, useful commands and debugs to detect the root cause of the problems, and best practices to avoid potential issues. Keep in mind that not every variable and scenario can be covered here, and a deeper analysis by Cisco TAC could be required.

Topology

Use this diagram as reference for outputs provided in this document.

Scenarios and Problems

Adjacency Down

If a BGP session is down and does not come up, issue the command «show ip bgp all summary». It shows the current status of the session:

  • If the session is not up, the state can vary between IDLE and ACTIVE (it depends on where the Finite State Machine process is).
  • If the session is up, you see the number of prefixes received.

No Connectivity

The first requirement is connectivity between both peers, whether or not they are directly connected, so that the TCP session on port 179 can be established. A simple ping is useful for this. If peering is established between loopback interfaces, a loopback-to-loopback ping must be done. If the ping test is performed without the loopback specified as the source interface, the IP address of the outgoing physical interface is used as the source of the packet instead of the loopback IP address of the router.

If ping is not successful, consider these causes:

  • No connected route to the peer, or no route at all: «show ip route peer_IP_address» can be used.
  • Layer 1 issue: the physical interface, SFP (connector), cable, or an external issue (transport and provider, if applicable) needs to be considered.
  • Firewall or access lists: check for any that can block the connection.

If ping is successful, consider this:

Configuration Issues

  • Wrong IP address or AS configured: For a wrong IP address, no message is displayed, so verify that the configuration is correct. For a wrong AS, «show logging» displays a BGP notification that reports the peer in the wrong AS.

Check BGP configuration on both ends to correct AS numbers or peer IP address.

  • Duplicate router ID: Check the BGP identifier on both ends via «show ip bgp all summary» and correct the duplicate. The router ID can be set manually with the command «bgp router-id X.X.X.X» under the BGP router configuration. As a best practice, set the router ID manually to a unique value.

  • BGP source and TTL: Most iBGP sessions are configured over loopback interfaces that are reachable via an IGP. The loopback interface must be explicitly defined as the source with the command «neighbor ip-address update-source interface-id».

For an eBGP peer, directly connected interfaces are usually used for peering, and Cisco IOS/Cisco IOS-XE checks that the peer is directly connected; if it is not, the router does not even try to establish the session. If eBGP is attempted from loopback to loopback on directly connected routers, this check can be disabled for a specific neighbor on both ends via «neighbor ip-address disable-connected-check».

However, if there are multiple hops between the eBGP peers, configure «neighbor ip-address ebgp-multihop [hop-count]» with the correct hop count so that the session can be established.

If this command is not configured, the default TTL value for iBGP sessions is 255, while the default TTL value for eBGP sessions is 1.

TCP Session Issues

A useful way to test port 179 is a manual telnet from one peer to the other, for example «telnet 198.51.100.2 179» from R1.

Either Open followed by connection closed, or Connection refused by remote host, indicates that packets reach the remote end; in that case, verify that there are no problems with the control plane at the far end. Otherwise, if the result is Destination unreachable, check for any firewall or access list that can block TCP port 179 or BGP packets, and for packet loss on the path.

In case of an authentication problem, «show logging» displays %TCP-6-BADAUTH messages that report an invalid or missing MD5 digest.

Check the authentication method, password, and related configuration. To troubleshoot further, refer to this page: MD5 Authentication Between BGP Peers Configuration Example

If the TCP session does not come up, you can use these commands for isolation: «show tcp brief all», «show control-plane host open-ports», and «debug ip tcp transactions».

Adjacency Bounces

If the session goes up and down, check «show log» for one of these scenarios.

Interface Flap

The log message «%BGP-5-ADJCHANGE: neighbor 198.51.100.2 Down Interface flap» indicates that the reason for this failure is an interface down event. Look for any physical issues on the port/SFP, cable, or disconnections.

Hold Timer Expired

This is a very common situation: the router did not receive or process a keepalive message or any update message before the hold timer expired. The device sends a notification message (4/0, hold time expired) and closes the session. The most common reasons for this issue are listed here:

  • Interface issues: Look for any input errors, input queue drops, or physical issues on the connected interfaces of both peers; «show interface» can be used for this purpose.
  • Packet loss in transit: Sometimes keepalive packets are dropped in transit; the best way to confirm this is a packet capture at the interface level.
    • You can use Embedded Packet Capture (EPC) on Cisco IOS and Cisco IOS-XE devices.
    • If packets are seen at the interface level, make sure they reach the control plane; EPC on the control plane or «debug bgp [vrf name] ipv4 unicast keepalives» is useful.
  • High CPU: A high CPU condition can cause drops on the control plane. «show processes cpu [sorted|history]» is useful to identify the problem; the next troubleshooting steps depend on the platform. See the CPU Reference document.
  • CoPP policy issues: The troubleshooting methodology varies for each platform and is out of scope for this document.
  • MTU mismatch: If there are MTU discrepancies in the path and ICMP messages are blocked between source and destination, PMTUD does not function, which can result in a session flap. Updates are sent with the negotiated MSS value and the DF bit set. If a device in the path, or even the destination, cannot accept packets of that size, it sends an ICMP error message back to the BGP speaker; if that message is blocked, the sender never learns about it. Meanwhile, the destination router keeps waiting for a BGP keepalive or update packet to reset its hold timer.
    • You can check the negotiated MSS with «show ip bgp neighbors ip_address».

A ping test to a specific neighbor with the DF bit set can show whether that MTU is valid along the path, for example «ping 198.51.100.2 size max_seg_size df».

If MTU issues are found, review the configuration carefully to ensure that the MTU values are consistent throughout the network.

Note: More information on MTU can be found here: BGP Neighbor Flaps with MTU Troubleshooting

AFI/SAFI Issues

The address-family identifier (AFI) is part of the capability extension added by Multi-Protocol BGP (MP-BGP). It correlates to a specific network protocol, such as IPv4 or IPv6, and the subsequent address-family identifier (SAFI) adds granularity, such as unicast and multicast. MP-BGP achieves this separation with the BGP path attributes (PAs) MP_REACH_NLRI and MP_UNREACH_NLRI. These attributes are carried inside BGP update messages and carry network reachability information for different address families.

When the session goes down with the reason AFI/SAFI not supported, the log message gives you the numbers of these AFI/SAFI, which are registered by IANA (see IANA Address Family Numbers and Subsequent Address Family Identifiers (SAFI) Parameters). To correct the problem:

  • Check the BGP configuration on both sides for the intended address families and correct any undesired ones.
  • Use «neighbor ip-address dont-capability-negotiate» on both ends. For further information, see Unsupported Capabilities Cause BGP Peer Malfunction.

Path Install and Selection

For a better explanation of how BGP works and selects the best path, refer to the document BGP Best Path Selection Algorithm.

Next Hop

For a route to be installed into the routing table, the next hop needs to be reachable; otherwise, even if the prefix is in the Loc-RIB BGP table, it does not get into the RIB. As a loop avoidance rule, on Cisco IOS/Cisco IOS-XE, iBGP does not change the next-hop attribute and leaves AS_PATH alone, while eBGP rewrites the next hop and prepends its own AS to AS_PATH.

You can check the next hop via «show ip bgp [prefix]»; the output gives you the next hop and, if it is not reachable, the word inaccessible. In the example, the prefix is announced by R1 via eBGP to R2 and learnt by R3 via the iBGP connection from R2.

In the example output, the next hop is the outgoing interface of R1, which is not known by R3. To fix this situation, either advertise the next hop via an IGP or a static route, or use the «neighbor ip-address next-hop-self» command on the iBGP peer to change the next-hop IP address to its own (which is directly connected). In the diagram example, this configuration needs to be on R2, on the neighbor towards R3 («neighbor 10.0.23.3 next-hop-self»).

As a result, the next hop changes (after a «clear ip bgp 10.0.23.2 soft») to the directly connected interface, which is reachable, and the prefix is installed.

RIB failure

A RIB failure happens when a route cannot be installed into the global RIB. A common reason is that the same prefix is already in the RIB from another routing protocol with a lower administrative distance, but the exact reason for a RIB failure is shown by the command «show ip bgp rib-failure».

Note: You can identify and correct such an issue as explained in Understand BGP RIB-failure and The Command bgp suppress-inactive

Race condition

The most common issue is that an IGP route is preferred over eBGP in a mutual redistribution scenario. When an IGP route is redistributed into BGP, it is considered locally generated by BGP and gets a weight of 32768 by default. All prefixes received from a BGP peer are assigned a local weight of 0 by default. Therefore, when both paths for the same prefix are compared, the BGP best path selection process prefers the path with the higher weight, and this is why the IGP route is installed in the RIB.

The solution for this problem is to set a higher weight for all routes received from the BGP peer under the router bgp configuration, for example «neighbor ip-address weight 40000».

Other Issues

BGP Slow Peer

A slow peer is a peer that cannot keep up with the rate at which the sender generates update messages. There are many reasons for a peer to exhibit this problem: high CPU on one of the peers, excess traffic or traffic loss on a link, and bandwidth constraints, among others.

Note: Refer to this page to help identify and correct slow peers issues Use the BGP «Slow Peer» Feature to Resolve Slow Peer Issues

Memory Issues

BGP uses memory assigned to the Cisco IOS process to maintain network prefixes, best paths, policies, and all related configuration. The overall processes are seen with the command «show processes memory sorted».

The Processor Pool is the memory pool examined here; its total is around 2.1 GB in the example. Next, look at the Holding column to identify the process that holds most of it. Then check the BGP sessions, how many routes are received, and the configuration used.

Common steps to reduce memory holding by BGP:

  • BGP filtering: If it is not necessary to receive a full BGP table, use policies to filter routes and install only the prefixes you need.
  • Soft reconfiguration: Look for «neighbor ip_address soft-reconfiguration inbound» under the BGP configuration; this command allows you to see all prefixes received before any inbound policy is applied (Adj-RIB-In). However, this table needs around half the size of the current BGP Local RIB table to store this information, so avoid this configuration unless it is strictly required or you receive few prefixes.

Note: For further information on how to optimize BGP refer to this page Configure BGP Routers for Optimal Performance and Reduced Memory Consumption

High CPU

Routers use different processes for BGP to operate. To verify that a BGP process is the cause of high CPU utilization, use the «show processes cpu sorted» command.

Here are the common processes, causes, and general steps to overcome high CPU utilization due to BGP:

  • BGP Router: Runs once per second to ensure fast convergence. It is one of the most important threads: it reads the BGP update messages, validates the prefixes/networks and attributes, updates the per-AFI/SAFI network/prefix table and attribute table, and performs the best-path calculation, among many other tasks.
    Huge route churn is a very common scenario that leads to high CPU in this process.
  • BGP Scanner: Low-priority process that runs every 60 seconds by default. This process checks the entire BGP table to verify next-hop reachability and updates the BGP table accordingly if there is any change for a path. It also runs through the Routing Information Base (RIB) for redistribution purposes.
    Check the platform scale: the more prefixes and routes installed and the more TCAM used, the more resources are needed, and an overloaded device usually leads to such situations.

Note: For further information on how to troubleshoot these two processes reference this page Troubleshoot High CPU Caused by the BGP Scanner or Router Process

  • BGP I/O: Runs when BGP control packets are received and manages the queuing and processing of BGP packets. If excessive packets are received in the BGP queue for a long period, or if there is a problem with TCP, the router shows symptoms of high CPU due to the BGP I/O process. (Usually, BGP Router is also high in this situation.) Look at the message counts to identify the peer, and capture packets to identify the source of these messages.
  • BGP Open: Process used for session establishment. It is not a common cause of high CPU unless the session is stuck in the Open state.
  • BGP Event: Responsible for next-hop processing. Look for next-hop flaps on the prefixes received.

Source

    Introduction

    This document describes how to troubleshoot the most common issues in Border Gateway Protocol (BGP) and provides basic guidelines.

    Prerequisites

    Requirements

    There are no specific prerequisites for this document. Basic knowledge of BGP is useful; refer to the BGP configuration guide for details.

    Components used

    This document is not restricted to specific software and hardware versions, but commands are applicable for Cisco IOS® and Cisco IOS-XE®.

    The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

    Background Information

    This document is a basic guide to troubleshoot the most common issues in Border Gateway Protocol (BGP). It gives corrective actions, useful commands and debugs to detect the root cause of the problems, and best practices to avoid potential issues. Keep in mind that not every variable and scenario can be covered here, and a deeper analysis by Cisco TAC could be required.

    Topology

    Use this diagram as reference for outputs provided in this document.

    Output Reference Diagram

    Scenarios and Problems

    Adjacency Down

    If a BGP session is down and does not come up, issue the command «show ip bgp all summary». It shows the current status of the session:

    • If the session is not up, the state can vary between IDLE and ACTIVE (it depends on where the Finite State Machine process is).
    • If the session is up, you see the number of prefixes received.
    R2#show ip bgp all summary 
    For address family: IPv4 Unicast
    BGP router identifier 198.51.100.2, local AS number 65537
    BGP table version is 19, main routing table version 19
    18 network entries using 4464 bytes of memory
    18 path entries using 2448 bytes of memory
    1/1 BGP path/bestpath attribute entries using 296 bytes of memory
    0 BGP route-map cache entries using 0 bytes of memory
    0 BGP filter-list cache entries using 0 bytes of memory
    BGP using 7208 total bytes of memory
    BGP activity 18/0 prefixes, 18/0 paths, scan interval 60 secs
    18 networks peaked at 11:21:00 Jun 30 2022 CST (00:01:35.450 ago)
    
    Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
    10.0.23.3       4        65537       6       5       19    0    0 00:01:34       18
    198.51.100.1    4        65536       0       0        1    0    0 never    Idle
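
The check described above can also be scripted. The following is a minimal sketch, assuming the neighbor table has the ten-column layout shown in the sample output; the function name `parse_bgp_summary` and the label "Established" for numeric values are illustrative choices, not part of any Cisco tool.

```python
# Rows copied from the sample «show ip bgp all summary» output in this document.
SUMMARY = """\
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.23.3       4        65537       6       5       19    0    0 00:01:34       18
198.51.100.1    4        65536       0       0        1    0    0 never    Idle
"""


def parse_bgp_summary(text):
    """Return (neighbor, state, prefixes_received) for each neighbor row.

    A numeric State/PfxRcd value means the session is up, so it is reported
    as "Established" (an illustrative label) with the prefix count.
    Any other value is reported as the state, with no prefix count.
    """
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Neighbor":
            in_table = True  # data rows follow the header line
            continue
        if not in_table or len(fields) < 10:
            continue
        # Join the tail so multi-word states stay in one string.
        last = " ".join(fields[9:])
        if last.isdigit():
            rows.append((fields[0], "Established", int(last)))
        else:
            rows.append((fields[0], last, None))
    return rows


for neighbor, state, prefixes in parse_bgp_summary(SUMMARY):
    print(neighbor, state, prefixes)
```

Any neighbor whose state is not numeric (Idle or Active, for example) is a candidate for the checks in the next sections.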

    No Connectivity

    The first requirement is connectivity between both peers, whether or not they are directly connected, so that the TCP session on port 179 can be established. A simple ping is useful for this. If peering is established between loopback interfaces, a loopback-to-loopback ping must be done. If the ping test is performed without the loopback specified as the source interface, the IP address of the outgoing physical interface is used as the source of the packet instead of the loopback IP address of the router.

    If ping is not successful, consider these causes:

    • No connected route to the peer, or no route at all: «show ip route peer_IP_address» can be used.
    • Layer 1 issue: the physical interface, SFP (connector), cable, or an external issue (transport and provider, if applicable) needs to be considered.
    • Firewall or access lists: check for any that can block the connection.

    If ping is successful, consider this:

    Configuration Issues

    • Wrong IP address or AS configured: For a wrong IP address, no message is displayed, so verify that the configuration is correct. For a wrong AS, a message like this appears in «show logging»:
    %BGP-3-NOTIFICATION: sent to neighbor 198.51.100.1 passive 2/2 (peer in wrong AS) 2 bytes 1B39

    Check BGP configuration on both ends to correct AS numbers or peer IP address.

    • Duplicate router ID:
    %BGP-3-NOTIFICATION: sent to neighbor 198.51.100.1 passive 2/3 (BGP identifier wrong) 4 bytes 0A0A0A0A

    Check the BGP identifier on both ends via «show ip bgp all summary» and correct the duplicate. The router ID can be set manually with the command «bgp router-id X.X.X.X» under the BGP router configuration. As a best practice, set the router ID manually to a unique value.
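
The notification messages in this document share one layout: direction, neighbor, an optional active/passive keyword, error code/subcode, a reason in parentheses, and a data field in hex. The sketch below pulls those fields apart with a regular expression written only against the sample lines in this document; other software versions may format the message differently. The helper `data_as_router_id` assumes that the four data bytes of a 2/3 notification carry the offending BGP identifier, which matches the sample but is an assumption here.

```python
import re

# Pattern derived from the sample log lines in this document (an assumption,
# not an official message grammar).
NOTIFICATION_RE = re.compile(
    r"%BGP-3-NOTIFICATION: (?P<direction>sent to|received from) neighbor "
    r"(?P<neighbor>\S+)(?: (?:active|passive))? "
    r"(?P<code>\d+)/(?P<subcode>\d+) \((?P<reason>[^)]*)\) "
    r"(?P<length>\d+) bytes ?(?P<data>[0-9A-Fa-f]*)"
)


def parse_notification(line):
    """Return the fields of a BGP NOTIFICATION log line, or None if no match."""
    m = NOTIFICATION_RE.search(line)
    if m is None:
        return None
    return {
        "direction": m["direction"],
        "neighbor": m["neighbor"],
        "code": int(m["code"]),
        "subcode": int(m["subcode"]),
        "reason": m["reason"],
        "data": m["data"].upper(),
    }


def data_as_router_id(hex_data):
    """Render four data bytes as a dotted-quad identifier (assumed meaning)."""
    raw = bytes.fromhex(hex_data)
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes of data")
    return ".".join(str(b) for b in raw)


line = ("%BGP-3-NOTIFICATION: sent to neighbor 198.51.100.1 passive "
        "2/3 (BGP identifier wrong) 4 bytes 0A0A0A0A")
info = parse_notification(line)
print(info["reason"], data_as_router_id(info["data"]))
```

For the sample line, the data bytes 0A0A0A0A render as 10.10.10.10, which is the identifier to look for on both routers.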

    • BGP source and TTL:

    Most iBGP sessions are configured over loopback interfaces that are reachable via an IGP. The loopback interface must be explicitly defined as the source with the command «neighbor ip-address update-source interface-id».

    For an eBGP peer, directly connected interfaces are usually used for peering, and Cisco IOS/Cisco IOS-XE checks that the peer is directly connected; if it is not, the router does not even try to establish the session. If eBGP is attempted from loopback to loopback on directly connected routers, this check can be disabled for a specific neighbor on both ends via «neighbor ip-address disable-connected-check».

    However, if there are multiple hops between the eBGP peers, configure «neighbor ip-address ebgp-multihop [hop-count]» with the correct hop count so that the session can be established.

    If this command is not configured, the default TTL value for iBGP sessions is 255, while the default TTL value for eBGP sessions is 1.

    TCP Session Issues

    A useful way to test port 179 is a manual telnet from one peer to the other:

    R1#telnet 198.51.100.2 179
    Trying 198.51.100.2, 179 ... Open
    
    [Connection to 198.51.100.2 closed by foreign host]

    Either Open followed by connection closed, or Connection refused by remote host, indicates that packets reach the remote end; in that case, verify that there are no problems with the control plane at the far end. Otherwise, if the result is Destination unreachable, check for any firewall or access list that can block TCP port 179 or BGP packets, and for packet loss on the path.
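
When the test has to be run from a management host rather than from the router, the same three outcomes can be told apart with a plain TCP connection attempt. This is a minimal sketch, assuming Python's standard `socket` module on the host; the function name `probe_tcp` and the result labels are illustrative, and a host outside the peering is normally not an accepted BGP neighbor, so "open" followed by an immediate close is expected there as well.

```python
import socket


def probe_tcp(host, port=179, timeout=3.0):
    """Classify a TCP connection attempt to host:port.

    "open" or "refused": packets reach the remote end (check its control plane).
    "timeout" or "unreachable": check firewalls, access lists, and the path.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        # Covers errors such as "no route to host" or "network unreachable".
        return "unreachable"


if __name__ == "__main__":
    # 198.51.100.2 is the R2 address used in the examples of this document.
    print(probe_tcp("198.51.100.2"))
```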

    In case of an authentication problem, you can see messages like these:

    %TCP-6-BADAUTH: Invalid MD5 digest from 198.51.100.1(179) to 198.51.100.2(20062) tableid - 0
    %TCP-6-BADAUTH: No MD5 digest from 198.51.100.1(179) to 198.51.100.2(20062) tableid - 0

    Check the authentication method, password, and related configuration. To troubleshoot further, refer to this page: MD5 Authentication Between BGP Peers Configuration Example

    If the TCP session does not come up, you can use these commands for isolation:

    show tcp brief all
    show control-plane host open-ports
    debug ip tcp transactions

    Adjacency Bounces

    If the session goes up and down, check «show log» for one of these scenarios.

    Interface Flap

    %BGP-5-ADJCHANGE: neighbor 198.51.100.2 Down Interface flap

    As the message indicates, the reason for this failure is an interface down event. Look for any physical issues on the port/SFP, cable, or disconnections.

    Hold Timer Expired

    %BGP-3-NOTIFICATION: sent to neighbor 198.51.100.2 4/0 (hold time expired) 0 bytes

    This is a very common situation: the router did not receive or process a keepalive message or any update message before the hold timer expired. The device sends a notification message and closes the session. The most common reasons for this issue are listed here:

    • Interface issues: Look for any input errors, input queue drops, or physical issues on the connected interfaces of both peers; «show interface» can be used for this purpose.
    • Packet loss in transit: Sometimes keepalive packets are dropped in transit; the best way to confirm this is a packet capture at the interface level.
      • You can use Embedded Packet Capture (EPC) on Cisco IOS and Cisco IOS-XE devices.
      • If packets are seen at the interface level, make sure they reach the control plane; EPC on the control plane or «debug bgp [vrf name] ipv4 unicast keepalives» is useful.
    • High CPU: A high CPU condition can cause drops on the control plane. «show processes cpu [sorted|history]» is useful to identify the problem; the next troubleshooting steps depend on the platform. See the CPU Reference document.
    • CoPP policy issues: The troubleshooting methodology varies for each platform and is out of scope for this document.
    • MTU mismatch: If there are MTU discrepancies in the path and ICMP messages are blocked between source and destination, PMTUD does not function, which can result in a session flap. Updates are sent with the negotiated MSS value and the DF bit set. If a device in the path, or even the destination, cannot accept packets of that size, it sends an ICMP error message back to the BGP speaker; if that message is blocked, the sender never learns about it. Meanwhile, the destination router keeps waiting for a BGP keepalive or update packet to reset its hold timer.
      • You can check the negotiated MSS with «show ip bgp neighbors ip_address».

    A ping test to a specific neighbor with the DF bit set can show whether that MTU is valid along the path:

    ping 198.51.100.2 size max_seg_size df

    If MTU issues are found, review the configuration carefully to ensure that the MTU values are consistent throughout the network.
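
The relationship between the negotiated MSS and the MTU is simple arithmetic, sketched here under the assumption of IPv4 with a 20-byte IP header and a 20-byte TCP header and no options (TCP options such as MD5 authentication enlarge the TCP header, so the real payload per segment can be smaller). The function names are illustrative.

```python
IPV4_HEADER_BYTES = 20  # assumption: no IP options
TCP_HEADER_BYTES = 20   # assumption: no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def packet_size_for_mss(mss):
    """IP packet size produced by a full-size TCP segment with this MSS."""
    return mss + IPV4_HEADER_BYTES + TCP_HEADER_BYTES


print(mss_for_mtu(1500))          # MSS expected on a 1500-byte MTU path
print(packet_size_for_mss(1460))  # packet size of a full segment at MSS 1460
```

If a DF-bit ping at the packet size of a full segment fails while a smaller one succeeds, a link in the path has a smaller MTU than the endpoints assume.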

    Note: More information on MTU can be found here: BGP Neighbor Flaps with MTU Troubleshooting

    AFI/SAFI Issues

    %BGP-5-ADJCHANGE: neighbor 198.51.100.2 passive Down AFI/SAFI not supported
    %BGP-3-NOTIFICATION: received from neighbor 198.51.100.2 active 2/8 (no supported AFI/SAFI) 3 bytes 000000

    The address-family identifier (AFI) is part of the capability extension added by Multi-Protocol BGP (MP-BGP). It correlates to a specific network protocol, such as IPv4 or IPv6, and the subsequent address-family identifier (SAFI) adds granularity, such as unicast and multicast. MP-BGP achieves this separation with the BGP path attributes (PAs) MP_REACH_NLRI and MP_UNREACH_NLRI. These attributes are carried inside BGP update messages and carry network reachability information for different address families.

    The message gives you the numbers of these AFI/SAFI, which are registered by IANA:

    IANA Address Family Numbers

    Subsequent Address Family Identifiers (SAFI) Parameters

    • Check the BGP configuration on both sides for the intended address families and correct any undesired ones.
    • Use «neighbor ip-address dont-capability-negotiate» on both ends. For further information, see Unsupported Capabilities Cause BGP Peer Malfunction.
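
To read the numbers in the notification, the sketch below splits the three data bytes into a 2-byte AFI and a 1-byte SAFI and looks them up in a small table. The split is an assumption based on the field sizes that MP-BGP uses for AFI and SAFI; the tables list only a few well-known IANA values, and the function name `decode_afi_safi` is illustrative.

```python
# A few well-known IANA values; the full registries contain many more.
AFI_NAMES = {1: "IPv4", 2: "IPv6", 25: "L2VPN"}
SAFI_NAMES = {1: "unicast", 2: "multicast", 4: "MPLS labels", 128: "MPLS-labeled VPN"}

UNKNOWN = "unknown or reserved"


def decode_afi_safi(hex_data):
    """Split 3 bytes of hex into (afi, safi, afi_name, safi_name).

    Assumes the layout is a 2-byte AFI followed by a 1-byte SAFI.
    """
    raw = bytes.fromhex(hex_data)
    if len(raw) != 3:
        raise ValueError("expected exactly 3 bytes of data")
    afi = int.from_bytes(raw[:2], "big")
    safi = raw[2]
    return afi, safi, AFI_NAMES.get(afi, UNKNOWN), SAFI_NAMES.get(safi, UNKNOWN)


print(decode_afi_safi("000000"))  # data field from the sample log line
print(decode_afi_safi("000101"))  # IPv4 unicast, for comparison
```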

    Path Install and Selection

    For a better explanation of how BGP works and selects the best path, refer to the document BGP Best Path Selection Algorithm.

    Next Hop

    For a route to be installed into the routing table, the next hop needs to be reachable; otherwise, even if the prefix is in the Loc-RIB BGP table, it does not get into the RIB. As a loop avoidance rule, on Cisco IOS/Cisco IOS-XE, iBGP does not change the next-hop attribute and leaves AS_PATH alone, while eBGP rewrites the next hop and prepends its own AS to AS_PATH.

    You can check the next hop via «show ip bgp [prefix]»; the output gives you the next hop and, if it is not reachable, the word inaccessible. In the example, the prefix is announced by R1 via eBGP to R2 and learnt by R3 via the iBGP connection from R2.

    R3#show ip bgp 192.0.2.1
    BGP routing table entry for 192.0.2.1/32, version 0
    Paths: (1 available, no best path)
      Not advertised to any peer
      Refresh Epoch 1
      65536
        198.51.100.1 (inaccessible) from 10.0.23.2 (10.2.2.2)
          Origin incomplete, metric 0, localpref 100, valid, internal
          rx pathid: 0, tx pathid: 0
          Updated on Jul 1 2022 13:44:19 CST

    In the output, the next hop is the outgoing interface of R1, which is not known by R3. To fix this situation, either advertise the next hop via an IGP or a static route, or use the «neighbor ip-address next-hop-self» command on the iBGP peer to change the next-hop IP address to its own (which is directly connected). In the diagram example, this configuration needs to be on R2, on the neighbor towards R3 («neighbor 10.0.23.3 next-hop-self»).

    As a result, the next hop changes (after a «clear ip bgp 10.0.23.2 soft») to the directly connected interface, which is reachable, and the prefix is installed.

    R3#show ip bgp 192.0.2.1
    BGP routing table entry for 192.0.2.1/32, version 24
    Paths: (1 available, best #1, table default)
      Not advertised to any peer
      Refresh Epoch 1
      65536
        10.0.23.2 from 10.0.23.2 (10.2.2.2)
          Origin incomplete, metric 0, localpref 100, valid, internal, best
          rx pathid: 0, tx pathid: 0x0
          Updated on Jul 1 2022 13:46:53 CST
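
The before and after states can be compared programmatically by looking at the path line that carries the next hop. The sketch below is written against the two sample outputs in this section only; the regular expression and the function name `next_hops` are illustrative, and other releases may print the line differently.

```python
import re

# Matches lines such as "198.51.100.1 (inaccessible) from 10.0.23.2 (10.2.2.2)".
NEXT_HOP_RE = re.compile(
    r"^\s*(?P<next_hop>\d+\.\d+\.\d+\.\d+)"
    r"(?: \((?P<note>[^)]*)\))? from (?P<peer>\d+\.\d+\.\d+\.\d+)",
    re.MULTILINE,
)

BEFORE = """\
  65536
    198.51.100.1 (inaccessible) from 10.0.23.2 (10.2.2.2)
      Origin incomplete, metric 0, localpref 100, valid, internal
"""

AFTER = """\
  65536
    10.0.23.2 from 10.0.23.2 (10.2.2.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
"""


def next_hops(text):
    """Return (next_hop, reachable, advertising_peer) for each path line."""
    return [
        (m["next_hop"], m["note"] != "inaccessible", m["peer"])
        for m in NEXT_HOP_RE.finditer(text)
    ]


print(next_hops(BEFORE))
print(next_hops(AFTER))
```

A path whose second element is False is the one to fix with an IGP route, a static route, or next-hop-self on the advertising peer.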

    RIB failure

    A RIB failure happens when a route cannot be installed into the global RIB. A common reason is that the same prefix is already in the RIB from another routing protocol with a lower administrative distance, but the exact reason for a RIB failure is shown by the command «show ip bgp rib-failure».

    Note: You can identify and correct such an issue as explained in Understand BGP RIB-failure and The Command bgp suppress-inactive

    Race condition

    The most common issue is that an IGP route is preferred over eBGP in a mutual redistribution scenario. When an IGP route is redistributed into BGP, it is considered locally generated by BGP and gets a weight of 32768 by default. All prefixes received from a BGP peer are assigned a local weight of 0 by default. Therefore, when both paths for the same prefix are compared, the BGP best path selection process prefers the path with the higher weight, and this is why the IGP route is installed in the RIB.

    The solution for this problem is to set a higher weight for all routes received from the BGP peer under the router bgp configuration:

    neighbor ip-address weight 40000

    Note: Refer to this page for a deeper explanation Understand the Importance of BGP Weight Path Attribute in Network Failover Scenarios
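
The effect of the weight change can be illustrated with a toy comparison. This sketch models only the weight step of best path selection, using the default values stated above (32768 for locally generated routes, 0 for routes received from a peer) and the value 40000 from the example command; the `Path` class and `preferred` function are illustrative and ignore every later tie-breaker.

```python
from dataclasses import dataclass


@dataclass
class Path:
    source: str  # free-text label for where the path came from
    weight: int  # Cisco weight attribute, local to the router


def preferred(paths):
    """Return the path with the highest weight (first step only)."""
    return max(paths, key=lambda p: p.weight)


redistributed = Path("redistributed IGP route", 32768)

# Default behavior: the locally generated path wins on weight.
print(preferred([redistributed, Path("eBGP peer", 0)]).source)

# With «neighbor ip-address weight 40000», the path from the peer wins.
print(preferred([redistributed, Path("eBGP peer", 40000)]).source)
```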

    Other Issues

    BGP Slow Peer

    A slow peer is a peer that cannot keep up with the rate at which the sender generates update messages. There are many reasons for a peer to exhibit this problem: high CPU on one of the peers, excess traffic or traffic loss on a link, and bandwidth constraints, among others.

    Note: Refer to this page to help identify and correct slow peers issues Use the BGP «Slow Peer» Feature to Resolve Slow Peer Issues

    Memory Issues

    BGP uses memory assigned to the Cisco IOS process to maintain network prefixes, best paths, policies, and all related configuration. The overall processes are seen with the command «show processes memory sorted».

    R1#show processes memory sorted 
    Processor Pool Total: 2121414332 Used: 255911152 Free: 1865503180
    reserve P Pool Total: 102404 Used: 88 Free: 102316
    lsmpi_io Pool Total: 3149400 Used: 3148568 Free: 832

     PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
       0   0  266231616   81418808  160053760          0          0 *Init*
     662   0   34427640      51720   34751920          0          0 SBC main process
      85   0    9463568          0    8982224          0          0 IOSD ipc task
       0   0   34864888   25213216    8513400    8616279          0 *Dead*
     504   0     696632          0     738576          0          0 QOS_MODULE_MAIN
     518   0     940000       8616     613760          0          0 BGP Router
     228   0     856064     345488     510080          0          0 mDNS
      82   0     547096     118360     417520          0          0 SAMsgThread
       0   0          0          0     395408          0          0 *MallocLite*

    The Processor Pool is the memory pool examined here; in the example its total is around 2.1 GB, of which about 256 MB is used. Next, look at the Holding column to identify the process that holds most of it. Then check the BGP sessions, how many routes are received, and the configuration used.
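
The pool figures can be turned into a utilization percentage with a few lines of code. This is a minimal sketch that reads only the Processor Pool line as it appears in the sample output; the function name `processor_pool_usage` is illustrative.

```python
import re

POOL_RE = re.compile(
    r"Processor Pool Total:\s*(\d+)\s+Used:\s*(\d+)\s+Free:\s*(\d+)"
)

# Line copied from the sample «show processes memory sorted» output.
SAMPLE = "Processor Pool Total: 2121414332 Used: 255911152 Free: 1865503180"


def processor_pool_usage(text):
    """Return (total, used, free, percent_used) for the Processor Pool line."""
    m = POOL_RE.search(text)
    if m is None:
        raise ValueError("Processor Pool line not found")
    total, used, free = (int(value) for value in m.groups())
    return total, used, free, 100.0 * used / total


total, used, free, percent = processor_pool_usage(SAMPLE)
print(f"{used} of {total} bytes used ({percent:.1f}%)")
```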

    Common steps to reduce memory holding by BGP:

    • BGP filtering: If it is not necessary to receive a full BGP table, use policies to filter routes and install only the prefixes you need.
    • Soft reconfiguration: Look for «neighbor ip_address soft-reconfiguration inbound» under the BGP configuration; this command allows you to see all prefixes received before any inbound policy is applied (Adj-RIB-In). However, this table needs around half the size of the current BGP Local RIB table to store this information, so avoid this configuration unless it is strictly required or you receive few prefixes.

    Note: For further information on how to optimize BGP refer to this page Configure BGP Routers for Optimal Performance and Reduced Memory Consumption

    High CPU

    Routers use different processes for BGP to operate. To verify that a BGP process is the cause of high CPU utilization, use the «show processes cpu sorted» command.

    R3#show processes cpu sorted 
    CPU utilization for five seconds: 0%/0%; one minute: 0%; five minutes: 0%
     PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
     163          36        1463         24  0.07%  0.00%  0.00%   0 ADJ background   
      62          28         132        212  0.07%  0.00%  0.00%   0 Exec             
       2          39         294        132  0.00%  0.00%  0.00%   0 Load Meter       
       1           0           4          0  0.00%  0.00%  0.00%   0 Chunk Manager    
       3          27        1429         18  0.00%  0.00%  0.00%   0 BGP Scheduler    
       4           0           1          0  0.00%  0.00%  0.00%   0 RO Notify Timers
      63           4          61         65  0.00%  0.00%  0.00%   0 BGP I/O          
      83         924          26      35538  0.00%  0.03%  0.04%   0 BGP Scanner      
      96         142       11651         12  0.00%  0.00%  0.00%   0 Tunnel BGP  
       7           0           1          0  0.00%  0.00%  0.00%   0 DiscardQ Backgro 

    Here are the common processes, causes, and general steps to overcome high CPU utilization due to BGP:

    • BGP Router: Runs once per second to ensure fast convergence. It is one of the most important threads: it reads the BGP update messages, validates the prefixes/networks and attributes, updates the per-AFI/SAFI network/prefix table and attribute table, and performs best-path calculation, among many other tasks.
      Heavy route churn is a very common cause of high CPU in this process.
    • BGP Scanner: Low-priority process that runs every 60 seconds by default. This process checks the entire BGP table to verify the next-hop reachability and updates the BGP table accordingly in case there is any change for a path. It runs through the Routing Information Base (RIB) for redistribution purposes.
      Check the platform scale: the more prefixes and routes installed and the more TCAM used, the more resources are needed, and an overloaded device usually ends up in this situation.

    Note: For further information on how to troubleshoot these two processes, refer to Troubleshoot High CPU Caused by the BGP Scanner or Router Process

    • BGP I/O: Runs when BGP control packets are received and manages the queuing and processing of BGP packets. If excessive packets sit in the BGP queue for a long period, or if there is a problem with TCP, the router shows symptoms of high CPU due to the BGP I/O process. (Usually, BGP Router is also high in this situation. Look at the message counts to identify the peer, and capture packets to identify the source of these messages.)
    • BGP Open: Process used on session establishment. Not a common cause of high CPU unless the session is stuck in the Open state.
    • BGP Event: Responsible for next-hop processing. Look for next-hop flaps on the prefixes received.
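
    To pick the BGP processes out of a long «show processes cpu sorted» listing, a small parser can help. The sketch below assumes the nine-column layout shown in the sample output above; the `bgp_cpu` and `busy` helpers and the 50% threshold are illustrative choices, not Cisco recommendations.

```python
SAMPLE = """\
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 163          36        1463         24  0.07%  0.00%  0.00%   0 ADJ background
   3          27        1429         18  0.00%  0.00%  0.00%   0 BGP Scheduler
  63           4          61         65  0.00%  0.00%  0.00%   0 BGP I/O
  83         924          26      35538  0.00%  0.03%  0.04%   0 BGP Scanner
"""


def bgp_cpu(text):
    """Return {process: (five_sec, one_min, five_min)} for BGP processes."""
    usage = {}
    for line in text.splitlines():
        # Strip first so trailing padding does not end up in the name.
        fields = line.strip().split(None, 8)
        if len(fields) < 9 or not fields[0].isdigit():
            continue  # header or summary line
        name = fields[8]
        if "BGP" in name:
            usage[name] = tuple(float(v.rstrip("%")) for v in fields[4:7])
    return usage


def busy(usage, threshold=50.0):
    """Names of processes whose 5-second CPU is at or above the threshold."""
    return [name for name, (five_sec, _, _) in usage.items()
            if five_sec >= threshold]


print(bgp_cpu(SAMPLE))
print(busy(bgp_cpu(SAMPLE)))
```

    Which process shows up in `busy` then points to the matching bullet above (route churn for BGP Router, table size for BGP Scanner, and so on).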

    Related Information

    • Technical Support & Documentation — Cisco Systems
    • BGP configuration guide
    • MD5 Authentication Between BGP Peers Configuration Example
    • Embedded Packet Capture
    • BGP Neighbor Flaps with MTU Troubleshooting
    • IANA Address Family Numbers
    • Subsequent Address Family Identifiers (SAFI) Parameters
    • Unsupported Capabilities Cause BGP Peer Malfunction
    • BGP Best Path Selection Algorithm
    • Understand BGP RIB-failure and The Command bgp suppress-inactive
    • Understand the Importance of BGP Weight Path Attribute in Network Failover Scenarios
    • Use the BGP «Slow Peer» Feature to Resolve Slow Peer Issues
    • Configure BGP Routers for Optimal Performance and Reduced Memory Consumption
    • Troubleshoot High CPU Caused by the BGP Scanner or Router Process


    By default, both neighbours try to open a TCP connection to port 179 on the remote peer, from a random source port, so either side can end up as the initiator. We can modify this behaviour with a few simple commands.

    For example, suppose we want R1 to be a passive peer. That means R2 and R3 will actively try to establish the session.

    If we leave everything at the defaults, R1 (router identifier 10.4.1.1, courtesy of a loopback interface) will actively try to establish the connection with any configured peers.

    R1#sh ip bgp summary
    BGP router identifier 10.4.1.1, local AS number 500

    We can verify this with the following command.

    R1#sh ip bgp neighbors | i host
    Local host: 150.1.1.2, Local port: 57717
    Foreign host: 150.1.1.1, Foreign port: 179
    Local host: 150.1.1.6, Local port: 63542
    Foreign host: 150.1.1.5, Foreign port: 179

    Let's modify this behaviour with the following commands.

    router bgp 500
     neighbor 150.1.1.1 transport connection-mode passive
     neighbor 150.1.1.5 transport connection-mode passive

    Then clear the BGP sessions with clear ip bgp * and run the same command as before.

    R1#sh ip bgp neighbors | i host
    Local host: 150.1.1.2, Local port: 179
    Foreign host: 150.1.1.1, Foreign port: 34121
    Local host: 150.1.1.6, Local port: 179
    Foreign host: 150.1.1.5, Foreign port: 32711

    This shows us that the foreign hosts have established the BGP connections, sourcing from a random port to port 179 on our local router.
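
    The same check can be scripted. The sketch below reads the «Local host / Foreign host» line pairs and reports, per peer, whether the local router ended up on the passive side (local port 179) or the active side (foreign port 179). The `session_roles` helper is hypothetical, and the sample deliberately mixes one pair from each of the two outputs above.

```python
import re

SAMPLE = """\
Local host: 150.1.1.2, Local port: 179
Foreign host: 150.1.1.1, Foreign port: 34121
Local host: 150.1.1.6, Local port: 63542
Foreign host: 150.1.1.5, Foreign port: 179
"""

# One match per Local/Foreign line pair.
PAIR = re.compile(
    r"Local host: (?P<local>\S+), Local port: (?P<lport>\d+)\s+"
    r"Foreign host: (?P<peer>\S+), Foreign port: (?P<fport>\d+)"
)


def session_roles(text):
    """Map each peer address to the local router's role in the TCP session."""
    roles = {}
    for m in PAIR.finditer(text):
        if int(m["lport"]) == 179:
            roles[m["peer"]] = "passive"   # the peer connected to us
        elif int(m["fport"]) == 179:
            roles[m["peer"]] = "active"    # we connected to the peer
        else:
            roles[m["peer"]] = "unknown"
    return roles


print(session_roles(SAMPLE))
```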

    RFC 4271, which is the holy grail for BGP-4, states:

    8.2.1
    When a BGP speaker is configured as active,
    it may end up on either the active or passive side of the connection
    that eventually gets established.  Once the TCP connection is
    completed, it doesn’t matter which end was active and which was passive.
    The only difference is in which side of the TCP connection has port number 179.

    There exists a period in which the identity of the peer on the other
    end of an incoming connection is known, but the BGP identifier is not
    known.  During this time, both an incoming and outgoing connection
    may exist for the same configured peering.  This is referred to as a
    connection collision.

    Interesting.

    RH

    In this blog post we’ll be looking at BGP errors. For that, our first question should be: is there an error, or is everything working? To determine this, we need to be familiar with BGP’s finite-state machine (FSM). An FSM is a way to model the operation of a system. As we’re not building a BGP implementation today, we’ll look at a slightly simplified version of the BGP FSM:

    [Figure: simplified BGP finite-state machine]

    In this simplified version the Connect, OpenSent and OpenConfirm states have been merged together.

    BGP sessions start in the Idle state. At that point, the router doesn’t try to connect to the BGP neighbor in question, and it will ignore incoming connection attempts. So seeing a BGP neighbor in the Idle state is normal, but if the neighbor stays in the Idle state for a long time, this is usually the result of some kind of error condition. Under some circumstances, the neighbor may progress immediately from Idle to OpenSent or OpenConfirm, but usually, the progression is from Idle to Active.

    The Active state is a bit of a misnomer, because there is no active BGP session at this point—active means that the router is actively trying to connect to the BGP neighbor in question. When error conditions arise, the state may go back to Idle. But the normal progression is from Active to Connect / OpenSent / OpenConfirm.

    Connect, OpenSent and OpenConfirm are three stages in the establishment of a BGP session. There are many opportunities for errors in these states, as we’ll discuss below. Most of these errors result in the state returning to Idle and a few mean the session goes back to Active. But if everything goes according to plan, the session enters the Established state.
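
    The simplified state machine described above can be written down as a small transition table. This is only a sketch of the merged model from this post, not the full RFC 4271 FSM. The Established to Idle transition is an assumption added for completeness (a fatal error ends the session), and the `can_transition` and `check_path` helpers are illustrative names.

```python
SETUP = "Connect/OpenSent/OpenConfirm"  # the three merged setup states

TRANSITIONS = {
    "Idle": {"Active", SETUP},
    "Active": {"Idle", SETUP},
    SETUP: {"Idle", "Active", "Established"},
    # Assumption: an error in Established drops the session back to Idle.
    "Established": {"Idle"},
}


def can_transition(current, target):
    """True if the simplified model allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def check_path(states):
    """Return the first illegal step as (from, to), or None if all are legal."""
    for current, target in zip(states, states[1:]):
        if not can_transition(current, target):
            return (current, target)
    return None


print(check_path(["Idle", "Active", SETUP, "Established"]))
print(check_path(["Idle", "Established"]))
```

    A table like this is handy when reading debug output: a sequence of states that the model rejects is a hint that a log line was missed or misread.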

    As soon as a session is in the Established state, the router starts to scan its BGP table and see if it needs to send BGP update messages to its neighbor. There is no further indicator that shows if the router is still in that initial scanning/update sending state, or that a full copy of the BGP table (in so far as permitted by filters) has been transmitted to the neighbor and now only keepalive messages are sent until something changes.

    You can see which state a BGP neighbor is in with the command that shows an overview of the BGP neighbors. On a Cisco router, that would be the show ip bgp summary or show bgp ipv4 unicast summary command to display information about neighbors for which IPv4 is enabled, and show bgp ipv6 unicast summary to display information about neighbors for which IPv6 is enabled. For instance:

    Router# show ip bgp summary
    Neighbor      V     AS  MsgRcvd MsgSent  TblVer  InQ OutQ   Up/Down  State/PfxRcd
    198.51.100.1  4  65001        0       0       0    0    0     never          Idle
    198.51.100.2  4  65002        0       0       0    0    0     never        Active        
    198.51.100.3  4  65003        3       8       0    0    0  00:01:18             1
    198.51.100.4  4  65004        6       5       0    0    0  00:03:43        Idle (Admin)
    198.51.100.5  4  65005        6       5       0    0    0  00:00:53        Idle (PfxCt)
    198.51.100.6  4  65006        1       2       0    0    0     never          Idle   
    

    The first neighbor is in the Idle state and the second in the Active state. In both cases, there hasn’t been any contact with the neighbor yet, as indicated by no messages received and sent (the MsgRcvd and MsgSent columns).

    The third neighbor is in the Established state. However, in that state the router doesn’t say “Established”, but rather, it shows the number of prefixes received and accepted from the neighbor. If this number is zero, either the neighbor hasn’t sent any prefixes (yet), or the ones sent by the neighbor were filtered out.

    The fourth neighbor is in the Idle state because we’ve configured neighbor 198.51.100.4 shutdown, as indicated by (Admin). It will remain in this state until we change the configuration for this neighbor with no neighbor 198.51.100.4 shutdown. Note that it’s also possible to configure shutdown for a peer group, and then all neighbors that are part of the peer group are shut down.

    The fifth neighbor is in the Idle state, but here we also know why: the neighbor has exceeded the configured maximum prefix limit. (With the neighbor 198.51.100.5 maximum-prefix … command.) The neighbor will remain in the Idle state until the session is manually restarted with the command clear ip bgp 198.51.100.5. Of course if the neighbor is still sending the same number of prefixes and the maximum prefixes limit hasn’t been changed, it will quickly return to the Idle (PfxCt) state.

    The sixth neighbor is again in the Idle state, but there is a difference here with the first neighbor: the router has exchanged some packets with the sixth neighbor, as we can see under the MsgRcvd and MsgSent columns. That means it’s very likely that the session won’t reach the Established state and remains in Idle because an error has occurred.
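
    The reading of the summary table above can be sketched in code. The example assumes the ten-column IOS layout shown in the sample output; the `classify` helper and its wording are illustrative only.

```python
SAMPLE = """\
Neighbor      V     AS  MsgRcvd MsgSent  TblVer  InQ OutQ   Up/Down  State/PfxRcd
198.51.100.1  4  65001        0       0       0    0    0     never          Idle
198.51.100.3  4  65003        3       8       0    0    0  00:01:18             1
198.51.100.4  4  65004        6       5       0    0    0  00:03:43        Idle (Admin)
198.51.100.6  4  65006        1       2       0    0    0     never          Idle
"""


def classify(text):
    """Map each neighbor to a short interpretation of its State/PfxRcd column."""
    result = {}
    for line in text.splitlines():
        # The 10th field is State/PfxRcd and may contain a space: "Idle (Admin)".
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[3].isdigit():
            continue  # header or unrelated line
        neighbor, state = fields[0], fields[9].strip()
        rcvd, sent = int(fields[3]), int(fields[4])
        if state.isdigit():
            result[neighbor] = f"Established, {state} prefixes accepted"
        elif state == "Idle" and (rcvd or sent):
            result[neighbor] = "Idle after exchanging messages: check for an error"
        else:
            result[neighbor] = state
    return result


for neighbor, verdict in classify(SAMPLE).items():
    print(neighbor, "->", verdict)
```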

    When something goes wrong with a BGP session, the neighbor that detects the error condition will send a BGP “notification” message and then terminate the BGP TCP session. With the command show ip bgp neighbors [neighbor address] we can see detailed information about BGP sessions. (Or, as before, show bgp ipv4 unicast neighbors or show bgp ipv6 unicast neighbors as appropriate.)

    Towards the end, there’s a line “Last reset”. This line can be very informative:

    Router# show ip bgp neighbors | incl (BGP neighbor is|Last reset)
    BGP neighbor is 192.0.2.1,  remote AS 65501, external link
      Last reset never
    BGP neighbor is 192.0.2.2,  remote AS 65502, external link
      Last reset 1w6d, due to BGP Notification received, hold time expired
    BGP neighbor is 192.0.2.3,  remote AS 65503, external link
      Last reset 7w0d, due to Active open failed
    BGP neighbor is 192.0.2.4,  remote AS 65504, external link
      Last reset 1w6d, due to BGP Notification received, Connection Rejected
    BGP neighbor is 192.0.2.5,  remote AS 65505, external link
      Last reset 2d18h, due to BGP Notification received, Connection Collision Resolution
    BGP neighbor is 192.0.2.6,  remote AS 65506, external link
      Last reset 28w4d, due to Peer closed the session
    

    The IANA registry for BGP parameters lists a large number of BGP error codes and subcodes for the BGP message types that may appear in a notification message. Most of these should be rare, as they indicate a problem with the implementation of the BGP protocol. The most relevant ones to BGP operation are probably the Cease subcodes:

    • Maximum Number of Prefixes Reached
    • Administrative Shutdown
    • Peer De-configured
    • Administrative Reset
    • Connection Rejected
    • Other Configuration Change
    • Connection Collision Resolution
    • Out of Resources
    • Hard Reset

    The first eight are described in RFC 4486, which is only a few pages long; Hard Reset was added later by RFC 8538. Note that Connection Collision Resolution is not an error. It happens when two routers connect to each other at the same time, so there are initially two BGP sessions and one of them needs to be terminated.
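
    Which of the two sessions gets terminated is decided by the collision rule in RFC 4271, section 6.8: the BGP identifiers are compared as 4-octet unsigned integers, and only the connection initiated by the speaker with the higher identifier is kept. A minimal sketch of that comparison follows; the `surviving_initiator` function name is made up for illustration, and equal identifiers are deliberately left out of scope.

```python
from ipaddress import IPv4Address


def surviving_initiator(local_id, remote_id):
    """Return 'local' or 'remote': whose outgoing connection survives a collision.

    Identifiers are compared numerically, not as strings.
    """
    local = int(IPv4Address(local_id))
    remote = int(IPv4Address(remote_id))
    if local == remote:
        raise ValueError("equal BGP identifiers are outside this sketch")
    return "local" if local > remote else "remote"


# 10.4.1.1 is lower than 10.4.1.2, so the remote side's connection is kept.
print(surviving_initiator("10.4.1.1", "10.4.1.2"))
# A string comparison would get this one wrong: 9.x is numerically below 10.x.
print(surviving_initiator("9.255.255.255", "10.0.0.1"))
```

    The router that closes its own connection sends the Cease notification with the Connection Collision Resolution subcode, which is why the message shows up in «Last reset» on otherwise healthy sessions.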

    Most of these errors are rather obvious. The main cause of many errors is misconfiguration on one side. This is especially likely if the connection hasn’t yet been established successfully after configuring a neighbor. So check the settings and parameters such as IP addresses and AS numbers on both sides—nobody is immune from typos.

    If the session worked before, and neither side has made changes, then perhaps there is an issue with the router that sent the notification. For instance, the router may have run out of memory, or there may be a bug in the BGP implementation. It’s also possible that an AS further upstream is sending BGP updates that trigger an issue in the local router or the neighboring router. An example of this is the issue triggered by AS 47868 back in 2009. Due to unfortunate circumstances, AS 47868 sent out an AS path with 252 prepends. This triggered a bug in Cisco routers with the handling of paths with 255 ASes or more a few hops further downstream. So when the errors really don’t make any sense, check what’s happening in ASes multiple hops away.

    Last but not least, it can be useful to check the router’s log. The command that does this is show logging. This will display recent log messages in the router’s log buffer in RAM. If this feature isn’t enabled, or you want to change the log level, use the configuration command logging buffered <level>. You can also log to other destinations, such as a syslog server.

    It’s also possible to turn on displaying log messages and debugging information live on a command line session to the router with the command terminal monitor. Turn this back off again with terminal no monitor. Two messages to look out for in the log are these ones:

    %TCP-6-BADAUTH: No MD5 digest from 192.0.2.1(179) to 192.0.2.2(64444)
    %TCP-6-BADAUTH: Invalid MD5 digest from 192.0.2.1(11004) to 192.0.2.2(179)

    (The 64444 and 11004 numbers are TCP port numbers that will be different and the 179 port number may follow the first or the second IP address based on which router happened to initiate the BGP session.)

    As these issues are at the TCP level and not the BGP level, and the BGP process doesn’t get to see the messages created with a missing or incorrect password, this issue is easy to overlook without checking the log.
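
    Because these messages are easy to miss, it can help to scan a saved log for them. The sketch below assumes the two message formats shown above; the `md5_failures` helper and the "likely cause" strings are an interpretation, not official wording.

```python
import re

BADAUTH = re.compile(
    r"%TCP-6-BADAUTH: (?P<kind>No|Invalid) MD5 digest "
    r"from (?P<src>[\d.]+)\((?P<sport>\d+)\) "
    r"to (?P<dst>[\d.]+)\((?P<dport>\d+)\)"
)

# Interpretation of the two variants (hedged, see lead-in).
LIKELY_CAUSE = {
    "No": "sender has no password configured",
    "Invalid": "passwords on the two sides do not match",
}


def md5_failures(log_text):
    """Return one dict per BADAUTH message found in the log text."""
    found = []
    for m in BADAUTH.finditer(log_text):
        # The side using an ephemeral port (not 179) opened the TCP connection.
        initiator = m["dst"] if int(m["sport"]) == 179 else m["src"]
        found.append({
            "sender": m["src"],
            "initiator": initiator,
            "likely_cause": LIKELY_CAUSE[m["kind"]],
        })
    return found


LOG = """\
%TCP-6-BADAUTH: No MD5 digest from 192.0.2.1(179) to 192.0.2.2(64444)
%TCP-6-BADAUTH: Invalid MD5 digest from 192.0.2.1(11004) to 192.0.2.2(179)
"""
for failure in md5_failures(LOG):
    print(failure)
```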

    When debugging is enabled along with terminal monitor, debugging messages are displayed through the command line session. Be careful with enabling different types of debugging, as debugging too much can easily lock up a command line session, and sometimes even bring down the router. The most useful type of debugging is debug bgp events. This provides some useful information on BGP events, without being too overwhelming. Turn it back off again with undebug bgp events or undebug all.

    With the above in mind, you should be able to find virtually all issues with non-working BGP sessions.


    Our network had a brief outage yesterday when one of our BGP routes went down for a short time. Fortunately, within a few minutes our connections failed over to our secondary BGP route, and the primary route came back up after the provider did a shut / no shut on their side.

    We are running 2 stacked Cisco 3750e switches on IOS 12.2 58.

    In my conversations with our provider, they could not give any definitive answers about the cause. Is there anything we can do to determine the cause on our side, so that we can avoid this problem in the future?

    Log at the time of the error

    172258: May  6 14:43:06: %BGP-5-ADJCHANGE: neighbor xxx.xxx.12.34 Down BGP Notification sent
    172259: May  6 14:43:06: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.12.34 4/0 (hold time expired) 0 bytes
    172260: May  6 14:43:06: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.12.34 IPv4 Multicast topology base removed from session  BGP Notification sent
    172261: May  6 14:43:06: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.12.34 IPv4 Unicast topology base removed from session  BGP Notification sent
    

    Log when the ISP did a shut / no shut to reset BGP on their side

    172542: May  6 15:04:15: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/49, changed state to down
    172543: May  6 15:04:16: %LINK-3-UPDOWN: Interface GigabitEthernet2/0/49, changed state to down
    172544: May  6 15:04:16: %PIM-5-NBRCHG: neighbor xxx.xxx.12.34 DOWN on interface GigabitEthernet2/0/49 non DR
    172545: May  6 15:04:16: %PIM-5-NBRCHG: neighbor xxx.xxx.12.34 UP on interface GigabitEthernet2/0/49 
    172546: May  6 15:04:16: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to xxx.xxx.12.35 on interface GigabitEthernet2/0/49
    172547: May  6 15:04:18: %LINK-3-UPDOWN: Interface GigabitEthernet2/0/49, changed state to up
    172548: May  6 15:04:19: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/0/49, changed state to up
    

    Log when the BGP connection finally went from Idle to Up

    172828: May  6 15:27:33: %BGP-5-ADJCHANGE: neighbor xxx.xxx.12.34 Up
    

    BGP interface on our side (note: no CRC errors, drops, collisions …)

    GigabitEthernet2/0/49 is up, line protocol is up (connected)
    Hardware is Gigabit Ethernet, address is xxxx.xxxx
    Internet address is xxx.xxx.12.35/31
    MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
    reliability 255/255, txload 1/255, rxload 3/255
    Encapsulation ARPA, loopback not set
    Keepalive not set
    Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseLX SFP
    input flow-control is off, output flow-control is unsupported
    ARP type: ARPA, ARP Timeout 04:00:00
    Last input 00:00:09, output 00:00:12, output hang never
    Last clearing of "show interface" counters never
    Input queue: 0/75/52/0 (size/max/drops/flushes); Total output drops: 0
    Queueing strategy: fifo
    Output queue: 0/40 (size/max)
    5 minute input rate 14536000 bits/sec, 1655 packets/sec
    5 minute output rate 1010000 bits/sec, 640 packets/sec
    413176726 packets input, 428902543141 bytes, 0 no buffer
    Received 143495 broadcasts (0 IP multicasts)
    0 runts, 0 giants, 0 throttles
    0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
    0 watchdog, 139275 multicast, 0 pause input
    0 input packets with dribble condition detected
    125748632 packets output, 42915625632 bytes, 0 underruns
    0 output errors, 0 collisions, 0 interface resets
    0 unknown protocol drops
    0 babbles, 0 late collision, 0 deferred
    0 lost carrier, 0 no carrier, 0 pause output
    0 output buffer failures, 0 output buffers swapped out
    






    Answers:


    172259: May  6 14:43:06: %BGP-3-NOTIFICATION: sent to neighbor xxx.xxx.12.34 4/0 (hold time expired) 0 bytes

    Generally, this means the other side of the connection did not respond to any keepalive messages within the hold timer (180 seconds by default). A whole host of problems could cause this. Usually it is a layer 3 reachability problem. If it happens again, rule out a layer 3 problem by testing the peer with ping and telnet (telnet to port 179 and see whether it answers).

    If it is not a layer 3 reachability problem, then there was a problem with one end of the adjacency (more likely the far side in this case).
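
    The «4/0» in the NOTIFICATION log line is the BGP error code and subcode. As a sketch, the pair can be decoded with the code names from RFC 4271 and the Cease subcodes from RFC 4486 (plus Hard Reset from RFC 8538). The `decode` helper is hypothetical, and matching a «received from» variant of the message is an assumption about the log format.

```python
import re

ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",
    2: "Administrative Shutdown",
    3: "Peer De-configured",
    4: "Administrative Reset",
    5: "Connection Rejected",
    6: "Other Configuration Change",
    7: "Connection Collision Resolution",
    8: "Out of Resources",
    9: "Hard Reset",
}

# Assumption: the same format is used for "received from" messages.
NOTIFICATION = re.compile(
    r"%BGP-3-NOTIFICATION: (?P<direction>sent to|received from) "
    r"neighbor (?P<peer>\S+) (?P<code>\d+)/(?P<subcode>\d+)"
)


def decode(line):
    """Return peer, direction and meaning for a NOTIFICATION log line, or None."""
    m = NOTIFICATION.search(line)
    if m is None:
        return None
    code, subcode = int(m["code"]), int(m["subcode"])
    meaning = ERROR_CODES.get(code, "unknown error code")
    if code == 6:
        meaning += ": " + CEASE_SUBCODES.get(subcode, "unknown subcode")
    return {"peer": m["peer"], "direction": m["direction"], "meaning": meaning}


LINE = ("172259: May  6 14:43:06: %BGP-3-NOTIFICATION: sent to neighbor "
        "xxx.xxx.12.34 4/0 (hold time expired) 0 bytes")
print(decode(LINE))
```

    Here the direction is «sent to», so it was the local switch whose hold timer ran out, which matches the reading that the far side stopped sending keepalives or that they were lost on the way.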


    If you are just looking for the "root cause" of this problem:

    You may want to ask your provider whether any configuration changes were made on their end just before this happened. There are cases on Cisco routers (I am not 100% sure which code versions at this point) where BGP sessions drop when one side removes and re-adds a "route-map" with "mpls-ip" and/or "mtu" configuration on the BGP peering. Although this sort of maintenance should not cause problems with the peering session, I have heard stories of it happening.

    Also, I am not sure they needed to go so far as to bounce the interface to "fix" the problem. I would think that simply resetting the peering session would have been enough, but if no traffic was passing at the time of the failure, you could argue that it does not matter that they bounced the interface to get things working again.







    It could be an MTU problem. I ran into this some time ago. The session comes up fine, but when an UPDATE with a large number of routes is received, it is lost because of an MTU mismatch. Also, if you have L2 devices (a switch? a media converter?) between your two routers, connectivity can be lost without the interface going down.


    Not from what I can see. Your ISP's router stopped responding to the hello messages from your router, so you lost the BGP connection. It is also possible that your router stopped listening to the hello messages from the ISP, but I do not see anything obvious in the messages that would help pin down the problem. Maybe someone more focused on the provider track can comment and shed some light?


