Thunder-link.com: ATN

Summary: In the PW redundancy scenario for PWE3 services, when multiple points of failure occur, a software processing error may leave a PW in the Up state while the services it carries, both low-rate and high-rate, are interrupted.

Product Family: ATN device

Product Model: ATN910, ATN910I, and ATN950B

[Problem Description]

Application scenario:

The following network topology is used as an example for low-rate services.

The HVPN or Layer 2+Layer 3 services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.

The following network topology is used as an example for high-rate services.

The high-rate services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.

Trigger conditions:

Configuration examples:

The following example shows how to configure PW redundancy (independent mode) for low-rate services.

interface Serial0/2/0:0

mpls l2vc 3.3.3.3 pw-template tdm 100

mpls l2vc 4.4.4.4 pw-template tdm 101

mpls l2vpn redundancy independent

mpls l2vpn stream-dual-receiving

mpls l2vpn oam-mapping

The following example shows how to configure PW redundancy (master/slave mode) for high-rate services.

interface Ethernet0/3/0.2

vlan-type dot1q 14

mpls l2vc 3.3.3.3 500 control-word raw

mpls l2vc 4.4.4.4 501 control-word raw secondary

mpls l2vpn redundancy master

mpls l2vpn reroute delay 500

mpls l2vpn stream-dual-receiving

mpls l2vpn arp-dual-sending

The problem occurs when any of the following conditions is met. Within each condition, the sub-conditions must occur in the listed order:

Condition 1:

PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.

An E-APS switchover is performed on the RSGs, and service traffic on CSG1 is switched from the primary PW to the secondary PW.

The secondary PW goes Down.

The BFD session that monitors the primary PW flaps.

Condition 2:

PW redundancy (master/slave mode or independent mode) is deployed on the CSGs, and the primary PW goes Up but the secondary PW goes Down.

The BFD session that monitors the primary PW flaps.

Condition 3:

PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.

The BFD session that monitors the secondary PW goes Down.

The BFD session that monitors the primary PW goes Down.

The BFD session that monitors the primary PW goes Up.

The BFD session that monitors the secondary PW goes Up.
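The ordering requirement in the three conditions can be illustrated with a minimal sketch. The event labels below are hypothetical names invented for this example (they are not ATN log or alarm strings), the deployment prerequisites (redundancy mode, E-APS) are not modeled, and "in order" is assumed to mean that the steps appear as an ordered subsequence of the observed events.

```python
# Hypothetical event labels for illustration only; not device log strings.
CONDITIONS = {
    1: ["both_pw_up", "eaps_switchover", "secondary_pw_down", "primary_bfd_flap"],
    2: ["primary_up_secondary_down", "primary_bfd_flap"],
    3: ["both_pw_up", "secondary_bfd_down", "primary_bfd_down",
        "primary_bfd_up", "secondary_bfd_up"],
}


def occurs_in_order(steps, events):
    """Return True if every step appears in events, in the given order."""
    remaining = iter(events)
    # "step in remaining" consumes the iterator up to the first match,
    # so later steps can only match later events.
    return all(step in remaining for step in steps)


def matched_conditions(events):
    """Return the numbers of the trigger conditions that the events satisfy."""
    return [number for number, steps in CONDITIONS.items()
            if occurs_in_order(steps, events)]
```

For example, an observed sequence of "primary_up_secondary_down" followed by "primary_bfd_flap" matches condition 2 only, while the same two events in the opposite order match nothing.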

Symptom:

Symptom 1: The primary PW is Up, the BFD session that monitors the primary PW flaps, and services are interrupted.

Symptom 2: The primary PW is Up, the BFD session that monitors the primary PW goes Down, and services are interrupted.

Identification method:

The following configuration is used as an example in this document.

[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0

Check BEARER-GROUP 1 success

Check BEARER 1024 success

Check NHI 225 success

Check INTF 132 success

Check VC_AND_SWAP 2 success

Check INSEGMENT 24 success

Check SUBCARD_NHLFE 24

Card:1 success

Check FW_OS2OS3CFT 0

Card:1 success

//Note: Serial 0/2/0:0 is the interface to which the faulty PW is bound. In normal conditions, each field in the command output is displayed as "success".

If the command output contains no information or an error message is displayed in the command output, the problem has occurred.

Example 1: The command output contains no information. Run the display status pw interface interface number command in the diagnostic view to view the PW status.

[HUAWEI]diagnose

[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0

[HUAWEI-diagnose]

//Note: When the PW that is bound to the interface Serial 0/2/0:0 goes faulty, no information is displayed in the command output.

Example 2: An error message is displayed in the command output. Run the display status pw interface interface number command in the diagnostic view to view the PW status.

[HUAWEI]diagnose

[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0

Check BEARER-GROUP 1

ulBasePtr 2047(!= 1024) ERROR

ulCount 254(!= 0) ERROR

fail

//Note: When the PW that is bound to the interface Serial 0/2/0:0 goes faulty, an error message containing "ERROR" and "fail" is displayed in the command output.
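The identification rule can be sketched as a small text check, assuming the command output has already been captured as a string (how it is collected from the device is outside this sketch). The function name is hypothetical. Note that an empty capture can also result from a failed collection, so a "faulty" result on empty text should be confirmed on the device.

```python
def classify_pw_status(output: str) -> str:
    """Classify captured output of 'display status pw interface ...'.

    Returns "faulty" if there is no check output at all, or if any line
    contains "ERROR" or the word "fail"; otherwise returns "normal".
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop prompt and command-echo lines such as "[HUAWEI-diagnose]display ...".
    lines = [line for line in lines if not line.startswith("[")]
    if not lines:
        return "faulty"  # Example 1: the command output contains no information
    for line in lines:
        if "ERROR" in line or "fail" in line.split():
            return "faulty"  # Example 2: an error message is displayed
    return "normal"
```

Applied to the three outputs shown above, the first (all checks report "success") is classified as "normal", and Example 1 and Example 2 are both classified as "faulty".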

[Root Cause]

In PW redundancy scenarios, the ATN device software does not take multiple points of failure into consideration. If any of the preceding trigger conditions is met, the PW entries fail to be delivered to the bottom layer, which interrupts services.

[Impact and Risk]

When the problem occurs, the PW services are interrupted.

[Measures and Solutions]

Recovery measures:

Run the shutdown and undo shutdown commands on the AC interfaces of the CSGs so that the PW entry can be redelivered. After that, the PW service can be restored to normal.

Preventive action:

When a CSG is connected to an ASG, do not deploy BFD for PW if the secondary PW is unavailable. Deploy BFD for PW only after the secondary PW becomes available.

Rectify any link fault as soon as it occurs to prevent multiple points of failure.

Solutions:

For ATN950B V200R001C02SPC300, install ATN950B V200R001SPH008 that is to be released in late September of 2013.

For ATN950B V200R002C00SPC300, install ATN950B V200R002SPH002 that is to be released in late September of 2013.

For ATN910I V200R002C00SPC300, install ATN910I V200R002SPH005 that is to be released in late September of 2013.

For ATN910 V200R002C00SPC300, install ATN910 V200R002SPH005 that is to be released in late September of 2013.

Upgrade ATN910I V200R002C00SPC100 to ATN910I V200R002C00SPC300, and install ATN910I V200R002SPH005.

Upgrade ATN910 V200R002C00SPC100 to ATN910 V200R002C00SPC300, and install ATN910 V200R002SPH005.
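The version-to-patch mapping above can be written as a lookup table, for example when checking an inventory list. This is only a sketch: the table copies the model, version, and patch identifiers listed above, while the function name and the wording of the returned steps are invented for illustration.

```python
# (model, running version) -> (version to upgrade to first, or None; patch to install)
PATCHES = {
    ("ATN950B", "V200R001C02SPC300"): (None, "V200R001SPH008"),
    ("ATN950B", "V200R002C00SPC300"): (None, "V200R002SPH002"),
    ("ATN910I", "V200R002C00SPC300"): (None, "V200R002SPH005"),
    ("ATN910", "V200R002C00SPC300"): (None, "V200R002SPH005"),
    ("ATN910I", "V200R002C00SPC100"): ("V200R002C00SPC300", "V200R002SPH005"),
    ("ATN910", "V200R002C00SPC100"): ("V200R002C00SPC300", "V200R002SPH005"),
}


def remediation_steps(model, version):
    """Return the ordered steps for a listed model/version, or [] if not listed."""
    entry = PATCHES.get((model, version))
    if entry is None:
        return []
    upgrade_to, patch = entry
    steps = []
    if upgrade_to is not None:
        steps.append(f"upgrade {model} to {upgrade_to}")
    steps.append(f"install patch {patch} on {model}")
    return steps
```

An empty result means only that the combination is not listed in this notice, not that the device is unaffected.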

Abstract: For an SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of any static version, binding timeslots on new VCTRUNKs with LCAS enabled may interrupt existing services after the board has been warm reset.

Product Family: MSTP

Product Model: NG SDH, OptiX OSN 9500

[Problem Description]

Trigger conditions:

The problem is triggered when all of the following conditions are met:

1. A board of a preceding version is used.

2. On an SSN4EGS4 board, the fifth VC-4 timeslot or the 13th VC-4 timeslot is occupied by services.

On an SSN2EGT2/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2/SSJ6EGT6A board, the first VC-4 timeslot is occupied by services.

3. The last reset performed on the board is a warm reset.

4. After the board is warm reset, VC-3 timeslots are configured for a VCTRUNK and LCAS is enabled for the board. The issue does not involve VC-12 or VC-4 timeslots, and the order in which LCAS is enabled and the timeslots are bound does not matter.

For an SSN4EGS4 board:

If the VC-3 timeslot corresponding to any of the first to eighth VC-4 timeslots is newly bound with services, services occupying the fifth VC-4 timeslot on the board are interrupted. If the VC-3 timeslot corresponding to any of the 9th to 16th VC-4 timeslots is newly bound with services, services occupying the 13th VC-4 timeslot on the board are interrupted.

For an SSN2EGT2/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2/SSJ6EGT6A board:

If the VC-3 timeslot corresponding to any of the first to eighth VC-4 timeslots is newly bound with services, services occupying the first VC-4 timeslot on the board are interrupted.
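The mapping from a newly bound timeslot to the timeslot at risk can be sketched as follows. The function name is hypothetical, and the result is only meaningful when the other trigger conditions hold (affected board version, last reset was a warm reset, LCAS enabled, and the returned VC-4 timeslot actually carries services).

```python
# Boards for which the first VC-4 timeslot is the one at risk.
FIRST_VC4_BOARDS = {
    "SSN2EGT2", "SSN5EFS0", "SSN3EFS4", "SSN3EGS2",
    "SSN1EFS0A", "SSN1EMS2", "SSJ6EGT6A",
}


def vc4_at_risk(board, new_vc4):
    """Return the VC-4 timeslot whose services may be interrupted, or None.

    new_vc4 is the number of the VC-4 timeslot that contains the newly
    bound VC-3 timeslot.
    """
    if board == "SSN4EGS4":
        if 1 <= new_vc4 <= 8:
            return 5
        if 9 <= new_vc4 <= 16:
            return 13
        return None
    if board in FIRST_VC4_BOARDS and 1 <= new_vc4 <= 8:
        return 1
    return None
```

For example, binding a VC-3 timeslot in the 12th VC-4 timeslot of an SSN4EGS4 board puts services in the 13th VC-4 timeslot at risk.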

Symptom:

After certain VC-3 timeslots on one of the preceding boards (which has recently been warm reset) are newly bound with services, some existing services on the board are interrupted. The VCTRUNK corresponding to the interrupted services reports an ALM_GFP_DLFD alarm.

Identification method:

1. An SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of any version is used.

2. The board has been warm reset, including warm reset after the board software is upgraded.

3. Timeslot binding is configured for the board and the configuration interrupts existing services.

The VCTRUNK corresponding to the interrupted services reports an ALM_GFP_DLFD alarm.
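The first identification step (board type and version) can be sketched as a simple check. It assumes that board versions are written as "major.minor" and compared numerically part by part, which is an assumption about the version format, and it ignores hot patches. All names are hypothetical.

```python
# Highest affected board software version per board type, as (major, minor).
MAX_AFFECTED = {
    "SSN4EGS4": (2, 44), "SSN5EFS0": (2, 44), "SSN3EFS4": (2, 44),
    "SSN3EGS2": (2, 44), "SSN1EFS0A": (2, 44), "SSN1EMS2": (2, 44),
    "SSN2EGT2": (2, 19),
}
# Board types listed as affected regardless of version.
ALL_VERSIONS_AFFECTED = {"SSJ6EGT6A"}


def parse_version(text):
    """Turn "2.44" into (2, 44). Assumes a "major.minor" format."""
    major, minor = text.split(".")
    return int(major), int(minor)


def board_is_affected(board, version):
    """Return True if the board type and version fall in the affected range."""
    if board in ALL_VERSIONS_AFFECTED:
        return True
    limit = MAX_AFFECTED.get(board)
    if limit is None:
        return False
    return parse_version(version) <= limit
```

A True result covers only step 1; the warm reset history and the ALM_GFP_DLFD alarm in steps 2 and 3 still have to be confirmed on the NE.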

[Root Cause]

For an SSN4EGS4 board of version 2.44 or earlier, the software has a defect. After a warm reset, the MST that is not used by any VCTRUNK extracts VC-4 timeslots that are mistakenly numbered as the fifth or 13th VC-4 timeslot after initialization. As a result, if another timeslot is bound, the configuration of the original fifth or 13th VC-4 timeslot is changed and the corresponding services are interrupted.

For an SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of any static version, the software has the same defect. If a new timeslot is bound, the configuration of the original first VC-4 timeslot is changed and the corresponding services are interrupted.

[Impact and Risk]

After the board is warm reset, configuring new services may interrupt existing services. Deleting the new services does not recover the existing services.

[Measures and Solutions]

Recovery measures:

The existing services can be recovered by using the following methods.

Delete and re-configure the timeslot binding for the interrupted services. The LCAS function does not need to be disabled.

Perform a cold reset on the board.

Workarounds:

After a board configured with timeslot binding is warm reset, do not enable the LCAS function for new VCTRUNKs if new timeslots need to be bound with services.

Solution:

Upgrade the NG SDH NE to V100R010C03SPC203 or a later version to eliminate the software defect on an SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board. To eliminate the software defect on an SSN2EGT2 board, upgrade the NG SDH NE to V100R010C03SPC215 or a later version, or to V1R10C03SPC208 with the SPH217 or a later hot patch.

V1R10C03SPH217 will be released in Aug 2014.

For OptiX OSN 9500, the defect has not been rectified on any static version and will be rectified on versions later than V100R006C05SPC208. For an online issue, install V100R006C03SPH206 or a later hot patch, or install V100R006C05SPH209 or a later hot patch, to solve the issue.

[Inspector Applicable or Not]

Applicable.

Inspection case title: Pre-Warning Risk on Inspecting SD598 Data Boards

Thursday, March 24, 2016

A Service Interruption When the PW on an ATN910, ATN910I, or ATN950B Still Goes Up

Monday, March 14, 2016

About Interruption of Existing Services on Some Data Boards of OSN Products
