Showing posts with label ATN. Show all posts

Thursday, March 24, 2016

A Service Interruption When the PW on an ATN910, ATN910I, or ATN950B Still Goes Up

Summary: In PW redundancy scenarios for PWE3 services, when multiple points of failure occur, a software processing error can leave a PW in the Up state while services, both low-rate and high-rate, are interrupted.
Product Family:  ATN device          Product Model: ATN910&910I&950B
[Problem Description]
Application scenario:
The following network topology is used as an example for low-rate services.


The HVPN or Layer 2+Layer 3 services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.
The following network topology is used as an example for high-rate services.


The high-rate services of the IP RAN standard solution are configured on the ATN devices that function as CSG1 and CSG2.
Trigger conditions:
Configuration examples:
The following example shows how to configure PW redundancy (independent mode) for low-rate services.
interface Serial0/2/0:0
mpls l2vc 3.3.3.3 pw-template tdm 100
mpls l2vc 4.4.4.4 pw-template tdm 101
mpls l2vpn redundancy independent
mpls l2vpn stream-dual-receiving
mpls l2vpn oam-mapping

The following example shows how to configure PW redundancy (master/slave mode) for high-rate services.

interface Ethernet0/3/0.2
vlan-type dot1q 14
mpls l2vc 3.3.3.3 500 control-word raw
mpls l2vc 4.4.4.4 501 control-word raw secondary
mpls l2vpn redundancy master
mpls l2vpn reroute delay 500
mpls l2vpn stream-dual-receiving
mpls l2vpn arp-dual-sending


The problem occurs when any one of the following conditions is met. Within each condition, the sub-conditions must occur in the order listed:
Condition 1:
PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.
An E-APS switchover is performed on the RSGs, and service traffic on CSG1 is switched from the primary PW to the secondary PW.
The secondary PW goes Down.
The BFD session that monitors the primary PW flaps.
Condition 2:
PW redundancy (master/slave mode or independent mode) is deployed on the CSGs, and the primary PW goes Up but the secondary PW goes Down.
The BFD session that monitors the primary PW flaps.
Condition 3:
PW redundancy (independent mode) is deployed on the CSGs, and E-APS is deployed on the RSGs. Both the primary and secondary PWs go Up.
The BFD session that monitors the secondary PW goes Down.
The BFD session that monitors the primary PW goes Down.
The BFD session that monitors the primary PW goes Up.
The BFD session that monitors the secondary PW goes Up.
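The ordering requirement in the three conditions can be illustrated with a small Python sketch. The event names (`eaps_switchover`, `secondary_pw_down`, and so on) are hypothetical labels invented for this example, not device log keywords. The sketch checks only the order of events and does not check the deployment prerequisites (redundancy mode, E-APS on the RSGs).

```python
# Hypothetical event labels for the ordered sub-conditions of each trigger condition.
CONDITIONS = {
    "Condition 1": ["eaps_switchover", "secondary_pw_down", "primary_bfd_flap"],
    "Condition 2": ["secondary_pw_down", "primary_bfd_flap"],
    "Condition 3": [
        "secondary_bfd_down",
        "primary_bfd_down",
        "primary_bfd_up",
        "secondary_bfd_up",
    ],
}


def occurs_in_order(events, pattern):
    """Return True if every step of pattern appears in events in the same order.

    Other events may be interleaved; only the relative order matters.
    """
    remaining = iter(events)
    # "step in remaining" consumes the iterator up to the first match,
    # so each later step is searched only after the previous one.
    return all(step in remaining for step in pattern)


def matching_conditions(events):
    """List the trigger conditions whose sub-conditions occur in order."""
    return [name for name, pattern in CONDITIONS.items()
            if occurs_in_order(events, pattern)]


if __name__ == "__main__":
    log = ["secondary_bfd_down", "primary_bfd_down",
           "primary_bfd_up", "secondary_bfd_up"]
    print(matching_conditions(log))
```

Because the sub-conditions of Condition 2 also appear at the end of Condition 1, an event sequence that satisfies Condition 1 is reported under both names.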
Symptom:
Symptom 1: The primary PW is Up, the BFD session that monitors the primary PW flaps, and services are interrupted.
Symptom 2: The primary PW is Up, the BFD session that monitors the primary PW goes Down, and services are interrupted.
Identification method:
The following configuration is used as an example in this document.
[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0
Check BEARER-GROUP                     1                success
Check BEARER                         1024                 success
Check NHI                              225                 success
Check INTF                             132                 success
Check VC_AND_SWAP                     2                  success
Check INSEGMENT                        24                 success
Check SUBCARD_NHLFE                   24
                                     Card:1                 success
Check FW_OS2OS3CFT                      0
                                     Card:1                 success
//Note: Serial 0/2/0:0 is the interface to which the faulty PW is bound. In normal conditions, each field in the command output is displayed as "success".

If the command output is empty or contains an error message, the problem has occurred.
Example 1: The command output is empty. Run the display status pw interface interface number command in the diagnostic view to view the PW status.
[HUAWEI]diagnose
[HUAWEI-diagnose]display  status  pw  interface  Serial 0/2/0:0
[HUAWEI-diagnose]
//Note: When the PW bound to interface Serial 0/2/0:0 is faulty, no information is displayed in the command output.

Example 2: The command output contains an error message. Run the display status pw interface interface number command in the diagnostic view to view the PW status.
[HUAWEI]diagnose
[HUAWEI-diagnose]display  status  pw  interface  Serial 0/2/0:0
Check BEARER-GROUP                      1
                     ulBasePtr       2047(!=      1024)     ERROR
                       ulCount        254(!=         0)     ERROR
                                                             fail
//Note: When the PW bound to interface Serial 0/2/0:0 is faulty, the command output contains error information marked "ERROR" and "fail".
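The identification rule above can be sketched in Python for captured command output. This is a minimal sketch that assumes the output has been saved as plain text. The function name `classify_pw_status` and the line filtering (skipping echoed prompts that start with "[" and notes that start with "//") are assumptions of this example, not part of any Huawei tool.

```python
import re


def classify_pw_status(output: str) -> str:
    """Classify captured `display status pw interface ...` output.

    Returns "faulty" if no check lines are present or if any line is
    marked ERROR or fail; otherwise returns "normal".
    """
    # Keep only device output: drop blank lines, echoed prompts such as
    # "[HUAWEI-diagnose]...", and "//" annotation lines.
    lines = [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.lstrip().startswith(("[", "//"))
    ]
    if not lines:
        # Example 1: the command printed nothing.
        return "faulty"
    if any(re.search(r"\b(ERROR|fail)\b", line) for line in lines):
        # Example 2: at least one check reported ERROR or fail.
        return "faulty"
    return "normal"


if __name__ == "__main__":
    sample = (
        "[HUAWEI-diagnose]display status pw interface Serial 0/2/0:0\n"
        "Check BEARER-GROUP                      1\n"
        "                     ulBasePtr       2047(!=      1024)     ERROR\n"
        "                                                             fail\n"
    )
    print(classify_pw_status(sample))
```

The check is deliberately conservative: an empty capture is treated as faulty, so make sure the capture did not simply fail before relying on the result.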


[Root Cause]
In PW redundancy scenarios, the ATN software does not account for multiple points of failure. If any of the preceding trigger conditions is met, the bottom-layer PW entries fail to be delivered, which interrupts services.
[Impact and Risk]
When the problem occurs, the PW services are interrupted.

[Measures and Solutions]
Recovery measures:
Run the shutdown and undo shutdown commands on the AC interfaces of the CSGs so that the PW entry can be redelivered. After that, the PW service can be restored to normal.
Preventive action:
When a CSG is connected to an ASG, do not deploy BFD for PW if the secondary PW is unavailable. Deploy BFD for PW only after the secondary PW becomes available.
Rectify any link fault as soon as it occurs so that multiple points of failure do not build up.
Solutions:
For ATN950B V200R001C02SPC300, install ATN950B V200R001SPH008 that is to be released in late September of 2013.
For ATN950B V200R002C00SPC300, install ATN950B V200R002SPH002 that is to be released in late September of 2013.
For ATN910I V200R002C00SPC300, install ATN910I V200R002SPH005 that is to be released in late September of 2013.
For ATN910 V200R002C00SPC300, install ATN910 V200R002SPH005 that is to be released in late September of 2013.
Upgrade ATN910I V200R002C00SPC100 to ATN910I V200R002C00SPC300, and install ATN910I V200R002SPH005.
Upgrade ATN910 V200R002C00SPC100 to ATN910 V200R002C00SPC300, and install ATN910 V200R002SPH005.


Monday, March 14, 2016

About Interruption of Existing Services on Some Data Boards of OSN Products

Abstract: For an SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of any static version, binding timeslots on new VCTRUNKs with LCAS enabled may interrupt existing services after the board is warm reset.
Product Family: MSTP          Product Model: NG SDH OptiX OSN 9500

[Problem Description]

Trigger conditions:
The problem occurs when all of the following conditions are met:
1. A board of one of the preceding versions is used.
2. On an SSN4EGS4 board, the fifth or the 13th VC-4 timeslot is occupied by services. On an SSN2EGT2/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2/SSJ6EGT6A board, the first VC-4 timeslot is occupied by services.
3. The last reset performed on the board was a warm reset.
4. After the warm reset, VC-3 timeslots are configured for a VCTRUNK and LCAS is enabled for the board. The order in which LCAS is enabled and the timeslots are bound does not matter. The issue does not involve VC-12 or VC-4 timeslots.
For an SSN4EGS4 board:
If the VC-3 timeslot corresponding to any of the first to eighth VC-4 timeslots is newly bound with services, services occupying the fifth VC-4 timeslot on the board are interrupted. If the VC-3 timeslot corresponding to any of the 9th to 16th VC-4 timeslots is newly bound with services, services occupying the 13th VC-4 timeslot on the board are interrupted.
For an SSN2EGT2/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2/SSJ6EGT6A board:
If the VC-3 timeslot corresponding to any of the first to eighth VC-4 timeslots is newly bound with services, services occupying the first VC-4 timeslot on the board are interrupted.
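The mapping between a newly bound VC-3 timeslot and the VC-4 timeslot at risk can be written as a short Python lookup. This is an illustrative sketch only: the function name `vc4_at_risk` is hypothetical, VC-4 timeslots are assumed to be numbered from 1, and the function covers only the mapping, not the other trigger conditions (affected version, warm reset, LCAS enabled, and the at-risk VC-4 timeslot actually carrying services).

```python
# Boards on which the first VC-4 timeslot is the one at risk.
FIRST_VC4_BOARDS = {
    "SSN2EGT2", "SSN5EFS0", "SSN3EFS4", "SSN3EGS2",
    "SSN1EFS0A", "SSN1EMS2", "SSJ6EGT6A",
}


def vc4_at_risk(board: str, new_vc4: int):
    """Return the VC-4 timeslot whose services may be interrupted.

    board   -- board name as used in this notice
    new_vc4 -- number (1-based) of the VC-4 timeslot that contains the
               newly bound VC-3 timeslot
    Returns None if the notice describes no impact for that range.
    """
    if board == "SSN4EGS4":
        if 1 <= new_vc4 <= 8:
            return 5
        if 9 <= new_vc4 <= 16:
            return 13
        return None
    if board in FIRST_VC4_BOARDS:
        return 1 if 1 <= new_vc4 <= 8 else None
    raise ValueError(f"board {board!r} is not covered by this notice")


if __name__ == "__main__":
    print(vc4_at_risk("SSN4EGS4", 10))
    print(vc4_at_risk("SSN2EGT2", 3))
```

A check like this could be run before binding new VC-3 timeslots on a recently warm-reset board to see which existing services deserve attention.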
Symptom:
After certain VC-3 timeslots on one of the preceding boards (which has recently been warm reset) are newly bound with services, some existing services on the board are interrupted. The VCTRUNK corresponding to the interrupted services reports an ALM_GFP_DLFD alarm.
Identification method:
1. An SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or 
earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of all versions is used.
2. The board has been warm reset, including warm reset after the board software is upgraded.
3. Timeslot binding is configured for the board, and the configuration interrupts existing services.
The VCTRUNK corresponding to the interrupted services reports an ALM_GFP_DLFD alarm.
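The version check in step 1 can be sketched in Python as follows. The sketch assumes that board versions are dotted numeric strings such as "2.44" and that they are compared field by field as integers. That comparison rule and the function names are assumptions of this example. The thresholds are copied from this notice.

```python
# Highest affected version per board, from this notice.
# None means every version of the board is affected.
MAX_AFFECTED_VERSION = {
    "SSN4EGS4": (2, 44),
    "SSN5EFS0": (2, 44),
    "SSN3EFS4": (2, 44),
    "SSN3EGS2": (2, 44),
    "SSN1EFS0A": (2, 44),
    "SSN1EMS2": (2, 44),
    "SSN2EGT2": (2, 19),
    "SSJ6EGT6A": None,
}


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as "2.44" into (2, 44)."""
    return tuple(int(part) for part in text.strip().split("."))


def board_version_affected(board: str, version: str) -> bool:
    """Return True if the board and version fall in the affected range."""
    if board not in MAX_AFFECTED_VERSION:
        return False
    limit = MAX_AFFECTED_VERSION[board]
    if limit is None:
        return True
    return parse_version(version) <= limit


if __name__ == "__main__":
    print(board_version_affected("SSN4EGS4", "2.44"))
    print(board_version_affected("SSN2EGT2", "2.20"))
```

A board that passes this check is only a candidate. Steps 2 and 3 (a warm reset followed by timeslot binding that interrupts services) still have to be confirmed.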

[Root Cause]

  For an SSN4EGS4 board of version 2.44 or earlier, the software has a defect. After a warm reset, the MSTs that are not used by any VCTRUNK extract VC-4 timeslots that are mistakenly numbered as the fifth or 13th VC-4 timeslot after initialization. As a result, if another timeslot is bound, the configuration of the original fifth or 13th VC-4 timeslot is changed and the corresponding services are interrupted.
  For an SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board of version 2.44 or earlier, an SSN2EGT2 board of version 2.19 or earlier, or an SSJ6EGT6A board of any static version, the software has the same defect. If a new timeslot is bound, the configuration of the original first VC-4 timeslot is changed and the corresponding services are interrupted.

[Impact and Risk]

After the board is warm reset, configuring new services may interrupt existing services. Deleting the new 
services cannot recover the existing services.

[Measures and Solutions]

Recovery measures:
The existing services can be recovered by using the following methods.
  Delete and re-configure the timeslot binding for the interrupted services. The LCAS function does not need to be disabled.
  Perform a cold reset on the board.
Workarounds:
After a board configured with timeslot binding is warm reset, do not enable the LCAS function for new 
VCTRUNKs if new timeslots need to be bound with services.
Solution:
  Upgrade an NG SDH NE to V100R010C03SPC203 or a later version to eliminate the software defect on an SSN4EGS4/SSN5EFS0/SSN3EFS4/SSN3EGS2/SSN1EFS0A/SSN1EMS2 board.

  Upgrade an NG SDH NE to V100R010C03SPC215 or a later version, or upgrade the NG SDH NE to V1R10C03SPC208+SPH217 or a later hot patch, to eliminate the software defect on an SSN2EGT2 board. V1R10C03SPH217 will be released in Aug 2014.

  For OptiX OSN 9500, the defect has not been rectified on any static version and will be rectified on 
versions later than V100R006C05SPC208. For an online issue, install V100R006C03SPH206 or a later hot 
patch or install V100R006C05SPH209 or a later hot patch to solve the issue.

[Inspector Applicable or Not]

Applicable.
Inspection case title: Pre-Warning Risk on Inspecting SD598 Data Boards


