Sunday, May 22, 2016

Many BD_STATUS Alarms Occur Due to the ECC Storm

Many BD_STATUS alarms occur due to the ECC storm.

Product

OptiX BWS 1600G, Huawei OSN3500, Huawei MA5683T, MA5608T

Fault Type

NE offline
ECC
BD_STATUS

Symptom

In an OptiX BWS 1600G network, many NEs become unreachable from the T2000. The NE icons turn gray and then return to normal. During this process, every board reports the BD_STATUS alarm. The problem occurs several times a week and lasts one to two hours each time.
 NOTE:
Many NEs were recently expanded by adding boards or subracks. All of the new boards are LBE and TMX boards.

Cause Analysis

  • ID conflicts cause the BD_STATUS alarm and make NEs unreachable.
  • In a complex network, the combined use of the OSC, ESC, and extended ECC may make more than one route available for inter-network communication. As a result, the ECC data overflows, which leads to incorrect IDs and incorrect ECC routes.
  • The DCN is configured improperly when an OptiX BWS 1600G NE is added.
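The first two causes can be checked offline against a DCN plan. The following is a minimal sketch, not a Huawei tool: the NE names, IDs, and link list are hypothetical sample data, and the functions `duplicate_ids` and `redundant_link_count` are illustrative names. It reports NE IDs that are assigned more than once and counts how many management links (OSC, ESC, or extended ECC) exist beyond the minimum needed to connect the NEs, since each such link adds an alternative ECC route.

```python
from collections import Counter

def duplicate_ids(ne_ids):
    """Return the NE IDs that are assigned to more than one NE."""
    counts = Counter(ne_ids.values())
    return sorted(i for i, n in counts.items() if n > 1)

def redundant_link_count(nodes, links):
    """Count links beyond a spanning forest: E - V + C (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = len(nodes)
    for a, b, _kind in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(links) - len(nodes) + components

# Hypothetical plan data, for illustration only.
ne_ids = {"OTM-A": 1, "OLA-1": 2, "OTM-B": 2}
links = [
    ("OTM-A", "OLA-1", "OSC"),
    ("OLA-1", "OTM-B", "OSC"),
    ("OTM-A", "OTM-B", "extended ECC"),
    ("OTM-A", "OLA-1", "ESC"),
]

conflicts = duplicate_ids(ne_ids)                    # IDs used more than once
extra = redundant_link_count(list(ne_ids), links)    # links that create alternative routes
print("conflicting IDs:", conflicts)
print("redundant management links:", extra)
```

A nonzero count of redundant links is not a fault by itself, but in this sketch it marks the places where OSC, ESC, and extended ECC overlap and where disabling one channel type is worth considering.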

Procedure

  1. Extract the log file of an OptiX BWS 1600G NE and check the ECC routes. The number of ECC routes found exceeds the number of NEs, which is 20. When a new ID is created on the T2000, the T2000 reports that the ID is out of range.
  2. Check the OptiX BWS 1600G subrack. Both the extended ECC and the ECC are found to be enabled. As planned, the extended ECC communication of the gateway NE at site OTM is to be disabled, which terminates the communication over the ECC route between the gateway NEs. Many NEs, however, are now connected, so the ECC communication is terminated for most NEs in the network.
    The previous analysis shows that the problem may be due to an ECC storm.
  3. Disable all extended ECC communication and most ESC communication on the OptiX BWS 1600G. Observation over several weeks shows that the problem does not recur.
  4. The ID conflict was suspected of causing the BD_STATUS alarm when new boards are configured. A final check on the subrack shows that the SCC resets frequently whenever the BD_STATUS alarm occurs. The alarm is now cleared.
    An ECC storm keeps the SCC so busy (as confirmed by the reset log) that it fails to respond to the signals from in-service boards. As a result, the BD_STATUS alarm is reported.
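The checks in steps 1 and 4 can be sketched in a few lines once the data has been exported. This is only an illustration under stated assumptions: the route destination IDs, alarm times, and SCC reset times are hypothetical values that would have to be extracted by hand from the NE log and the T2000, and `check_route_table` and `alarms_near_resets` are illustrative names, not functions of any Huawei software. The first function flags a route table that holds more destinations than there are planned NEs; the second lists the BD_STATUS alarms that fall close to an SCC reset.

```python
from datetime import datetime, timedelta

def check_route_table(route_dest_ids, known_ne_ids):
    """Compare ECC route destinations against the planned NE IDs."""
    known = set(known_ne_ids)
    dests = set(route_dest_ids)
    return {
        "route_count": len(dests),
        "exceeds_ne_count": len(dests) > len(known),
        "unknown_ids": sorted(dests - known),
    }

def alarms_near_resets(alarm_times, reset_times, window=timedelta(minutes=5)):
    """Return the alarm times that lie within `window` of any SCC reset."""
    return [a for a in alarm_times
            if any(abs(a - r) <= window for r in reset_times)]

# Hypothetical data: 20 planned NEs, plus two stray route entries.
planned = range(1, 21)
routes = list(range(1, 21)) + [4097, 4098]
report = check_route_table(routes, planned)

# Hypothetical timestamps copied from an alarm list and a reset log.
alarms = [datetime(2016, 5, 1, 10, 2), datetime(2016, 5, 1, 14, 0)]
resets = [datetime(2016, 5, 1, 10, 0)]
matched = alarms_near_resets(alarms, resets)

print(report)
print(matched)
```

The five-minute window is an arbitrary choice for the sketch. If most BD_STATUS alarms match a reset, that supports the conclusion that the alarms follow from an overloaded SCC and not from the boards themselves.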
