Thursday, March 24, 2016

Services Are Interrupted Although the PW on an ATN910, ATN910I, or ATN950B Is Up

Summary: In a PW redundancy scenario for PWE3 services, a software processing error can occur when multiple points of failure arise. The PW may then go Up, but services, both low-rate and high-rate, are interrupted.
Product Family: ATN device
Product Model: ATN910, ATN910I, ATN950B
[Problem Description]
Application scenario:
The following network topology is used as an example for low-rate services.


The HVPN or Layer 2+Layer 3 services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.
The following network topology is used as an example for high-rate services.


The high-rate services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.
Trigger conditions:
Configuration examples:
The following example shows how to configure PW redundancy (independent mode) for low-rate services.
interface Serial0/2/0:0
 mpls l2vc 3.3.3.3 pw-template tdm 100
 mpls l2vc 4.4.4.4 pw-template tdm 101
 mpls l2vpn redundancy independent
 mpls l2vpn stream-dual-receiving
 mpls l2vpn oam-mapping

The following example shows how to configure PW redundancy (master/slave mode) for high-rate services.

interface Ethernet0/3/0.2
 vlan-type dot1q 14
 mpls l2vc 3.3.3.3 500 control-word raw
 mpls l2vc 4.4.4.4 501 control-word raw secondary
 mpls l2vpn redundancy master
 mpls l2vpn reroute delay 500
 mpls l2vpn stream-dual-receiving
 mpls l2vpn arp-dual-sending


The problem occurs when any of the following conditions is met. Within each condition, the sub-conditions must be met in the listed order:
Condition 1:
PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.
An E-APS switchover is performed on the RSGs, and service traffic on CSG1 is switched from the primary PW to the secondary PW.
The secondary PW goes Down.
The BFD session that monitors the primary PW flaps.
Condition 2:
PW redundancy (master/slave mode or independent mode) is deployed on the CSGs, and the primary PW goes Up but the secondary PW goes Down.
The BFD session that monitors the primary PW flaps.
Condition 3:
PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.
The BFD session that monitors the secondary PW goes Down.
The BFD session that monitors the primary PW goes Down.
The BFD session that monitors the primary PW goes Up.
The BFD session that monitors the secondary PW goes Up.
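Because the order of the sub-conditions matters, the three trigger conditions can be sketched as ordered event patterns that are matched against a time-ordered event log. The following Python sketch is only an illustration: the event labels (such as secondary_pw_down) are hypothetical names invented here, not strings produced by the device, and the deployment preconditions (redundancy mode, E-APS on the RSGs, initial PW states) are not modeled.

```python
from typing import Dict, Iterable, List

# Hypothetical event labels chosen for this sketch; the device does not emit them.
CONDITIONS: Dict[str, List[str]] = {
    "Condition 1": [
        "eaps_switchover",      # traffic on CSG1 moves from the primary to the secondary PW
        "secondary_pw_down",
        "primary_bfd_flap",
    ],
    "Condition 2": [
        "secondary_pw_down",
        "primary_bfd_flap",
    ],
    "Condition 3": [
        "secondary_bfd_down",
        "primary_bfd_down",
        "primary_bfd_up",
        "secondary_bfd_up",
    ],
}


def occurs_in_order(events: Iterable[str], pattern: List[str]) -> bool:
    """Return True if every step of pattern appears in events, in that order.

    Other events may be interleaved between the steps.
    """
    remaining = iter(events)
    # "step in remaining" advances the iterator, so each step is searched
    # only in the part of the log that follows the previous match.
    return all(step in remaining for step in pattern)


def matched_conditions(events: List[str]) -> List[str]:
    """List the trigger conditions whose event order is found in the log."""
    return [name for name, pattern in CONDITIONS.items()
            if occurs_in_order(events, pattern)]


if __name__ == "__main__":
    log = ["secondary_bfd_down", "primary_bfd_down",
           "primary_bfd_up", "secondary_bfd_up"]
    print(matched_conditions(log))
```

Note that a log matching Condition 1 also matches Condition 2 in this sketch, because the last two steps of Condition 1 are the same events as Condition 2.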
Symptom:
Symptom 1: The primary PW is Up, the BFD session that monitors the primary PW flaps, and services are interrupted.
Symptom 2: The primary PW is Up, the BFD session that monitors the primary PW goes Down, and services are interrupted.
Identification method:
The following command output, taken in normal conditions, is used as an example in this document.
[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0
Check BEARER-GROUP                      1                 success
Check BEARER                         1024                 success
Check NHI                             225                 success
Check INTF                            132                 success
Check VC_AND_SWAP                       2                 success
Check INSEGMENT                        24                 success
Check SUBCARD_NHLFE                    24
                                     Card:1               success
Check FW_OS2OS3CFT                      0
                                     Card:1               success
//Note: Serial 0/2/0:0 is the interface to which the faulty PW is bound. In normal conditions, each field in the command output is displayed as "success".

If the command returns no output, or the output contains an error message, the problem has occurred.
Example 1: The command output contains no information. Run the display status pw interface interface number command in the diagnostic view to view the PW status.
[HUAWEI]diagnose
[HUAWEI-diagnose]display  status  pw  interface  Serial 0/2/0:0
[HUAWEI-diagnose]
//Note: When the PW that is bound to the interface Serial 0/2/0:0 is faulty, the command returns no output.

Example 2: An error message is displayed in the command output. Run the display status pw interface interface number command in the diagnostic view to view the PW status.
[HUAWEI]diagnose
[HUAWEI-diagnose]display  status  pw  interface  Serial 0/2/0:0
Check BEARER-GROUP                      1
                     ulBasePtr       2047(!=      1024)     ERROR
                       ulCount        254(!=         0)     ERROR
                                                             fail
//Note: When the PW that is bound to the interface Serial 0/2/0:0 is faulty, the command output contains error information marked "ERROR" and "fail".
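The identification rule above (no output, or output containing "ERROR" or "fail", means the problem has occurred) can be applied to captured command output with a small script. The following Python sketch assumes the output has already been collected as text, for example from a session log; the function name classify_pw_check and the three result labels are hypothetical and not part of any Huawei tool.

```python
def classify_pw_check(output: str) -> str:
    """Classify captured 'display status pw interface' output.

    Returns one of three labels invented for this sketch:
      "faulty-empty"  - the command printed nothing (Example 1)
      "faulty-error"  - the output contains ERROR or fail (Example 2)
      "normal"        - output is present and shows no error marker
    """
    content = []
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blank lines, echoed prompts such as "[HUAWEI-diagnose]...",
        # and "//Note:" annotations copied from the document.
        if not line or line.startswith("[") or line.startswith("//"):
            continue
        content.append(line)

    if not content:
        return "faulty-empty"

    for line in content:
        fields = line.split()
        if "ERROR" in fields or "fail" in fields:
            return "faulty-error"
    return "normal"


if __name__ == "__main__":
    sample = (
        "[HUAWEI-diagnose]display  status  pw  interface  Serial 0/2/0:0\n"
        "Check BEARER-GROUP                      1\n"
        "                     ulBasePtr       2047(!=      1024)     ERROR\n"
        "                       ulCount        254(!=         0)     ERROR\n"
        "                                                             fail\n"
    )
    print(classify_pw_check(sample))
```

The check matches whole whitespace-separated fields rather than substrings, so a check name that merely contains the letters "fail" would not be misread as a failure.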


[Root Cause]
In PW redundancy scenarios, the ATN software implementation does not take multiple points of failure into account. If any of the preceding trigger conditions is met, the bottom-layer PW entries fail to be delivered, which interrupts services.
[Impact and Risk]
When the problem occurs, the PW services are interrupted.

[Measures and Solutions]
Recovery measures:
Run the shutdown command and then the undo shutdown command on the AC interfaces of the CSGs so that the PW entries are redelivered. The PW services then return to normal.
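When several AC interfaces are affected, the recovery command sequence can be generated rather than typed by hand. The following Python sketch only builds the list of CLI lines; it assumes the commands are entered from the system view (the [HUAWEI] prompt shown in this document), and the function name recovery_commands is hypothetical. It does not connect to any device.

```python
from typing import Iterable, List


def recovery_commands(ac_interfaces: Iterable[str]) -> List[str]:
    """Build the shutdown / undo shutdown sequence for each AC interface.

    Assumption: the lines are pasted in the system view, so each interface
    is entered with "interface <name>" and left with "quit".
    """
    commands: List[str] = []
    for name in ac_interfaces:
        commands.extend([
            f"interface {name}",
            "shutdown",       # take the AC interface down
            "undo shutdown",  # bring it back up so the PW entries are redelivered
            "quit",
        ])
    return commands


if __name__ == "__main__":
    for line in recovery_commands(["Serial0/2/0:0", "Ethernet0/3/0.2"]):
        print(line)
```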
Preventive action:
When a CSG is connected to an ASG, do not deploy BFD for PW if the secondary PW is unavailable. Deploy BFD for PW only after the secondary PW becomes available.
Rectify any link fault immediately to prevent multiple points of failure from occurring.
Solutions:
For ATN950B V200R001C02SPC300, install ATN950B V200R001SPH008 that is to be released in late September of 2013.
For ATN950B V200R002C00SPC300, install ATN950B V200R002SPH002 that is to be released in late September of 2013.
For ATN910I V200R002C00SPC300, install ATN910I V200R002SPH005 that is to be released in late September of 2013.
For ATN910 V200R002C00SPC300, install ATN910 V200R002SPH005 that is to be released in late September of 2013.
Upgrade ATN910I V200R002C00SPC100 to ATN910I V200R002C00SPC300, and install ATN910I V200R002SPH005.
Upgrade ATN910 V200R002C00SPC100 to ATN910 V200R002C00SPC300, and install ATN910 V200R002SPH005.
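The solution list above maps each product model and software version to an optional upgrade target and a patch. As a convenience, the same mapping can be written as a lookup table. The following Python sketch copies the version strings from the list; the function name remediation_steps and the wording of the returned steps are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (model, running version) -> (version to upgrade to first, or None; patch to install)
SOLUTIONS: Dict[Tuple[str, str], Tuple[Optional[str], str]] = {
    ("ATN950B", "V200R001C02SPC300"): (None, "V200R001SPH008"),
    ("ATN950B", "V200R002C00SPC300"): (None, "V200R002SPH002"),
    ("ATN910I", "V200R002C00SPC300"): (None, "V200R002SPH005"),
    ("ATN910", "V200R002C00SPC300"): (None, "V200R002SPH005"),
    ("ATN910I", "V200R002C00SPC100"): ("V200R002C00SPC300", "V200R002SPH005"),
    ("ATN910", "V200R002C00SPC100"): ("V200R002C00SPC300", "V200R002SPH005"),
}


def remediation_steps(model: str, version: str) -> Optional[List[str]]:
    """Return the ordered steps for a listed model/version, or None if unlisted."""
    entry = SOLUTIONS.get((model, version))
    if entry is None:
        return None  # combination not covered by this notice
    upgrade_to, patch = entry
    steps: List[str] = []
    if upgrade_to is not None:
        steps.append(f"upgrade to {upgrade_to}")
    steps.append(f"install patch {patch}")
    return steps


if __name__ == "__main__":
    print(remediation_steps("ATN910", "V200R002C00SPC100"))
```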
