Tuesday, July 26, 2016

An NE Is Frequently Unreachable to the NMS Due to Insufficient Processing Capacity of a Router

An NE is frequently unreachable to the NMS due to insufficient processing capacity of a router.

Product

OptiX BWS 1600G

Fault Type

NEs are unreachable.

Symptom

At a certain site, there are four WDM networks consisting of 44 OptiX BWS 1600G subracks, two iManager T2000 servers, and iManager T2000 clients running on computers.
All the equipment is set up as GNEs. On the T2000, each network is configured with two gateways, and extended ECC is used for communication inside a network.
Each network is connected by a network cable to a HUB, together with the T2000 servers and clients. The 44 subracks are monitored by the T2000. The T2000 shows that NE communication is abnormal: NEs become unreachable at random, and the NE_COMMU_BREAK and NE_NOT_LOGIN alarms are reported and then cleared automatically after a while. After multiple GNEs are configured on the T2000, the T2000 can monitor the NEs again temporarily. The problem then occurs less frequently, but it is not completely resolved.

Cause Analysis

The possible causes of the preceding problem are as follows:
  • Equipment problems, such as a fault on the SCC board or improper ECC settings, which may result in abnormal data flow.
  • NMS problems, such as an abnormal database, network card problems, or improper NMS settings.
  • DCN networking problems, such as an incorrectly built DCN, a fault on a router or switch, or network cable problems.
The analysis concludes that, before the IP address of Server 2 is changed, the 44 subracks of the four WDM systems cannot communicate with Server 2 directly through the switch. The IP addresses of the subracks are 132.37.23.**, while that of Server 2 is 132.37.5.**, so the subracks and Server 2 are not in the same segment. The four WDM networks are monitored by Server 2. In this case, the data flow direction is: the equipment<--->the switch<--->the router<--->the switch<--->the server.
There are 140 NEs in the four networks, so the data flow is heavy, and all of it is forwarded by the router. The router, however, is an early, low-end 2630E that is configured with only one FE port. Traffic between the two preceding segments is forwarded through the IP addresses (one for each segment) configured on the same EO port. This bottleneck in the processing capacity of the router causes the abnormal network communication. Communication may be normal when the data volume is small, but once the data volume grows, packet congestion becomes serious and NEs become unreachable.
After the IP address of Server 2 is changed, Server 2 and the four added DWDM systems are in the same segment. In this case, the data flow direction is: the equipment<--->the switch<--->the server. Communication between the four systems and Server 2 goes through the switch only and no longer requires router forwarding. This greatly eases the processing load of the router and avoids its capacity bottleneck, so the networks run more smoothly and the preceding problem is resolved.
As a network grows and more equipment is added, the network structure becomes more and more complicated. If a network lacks overall planning at the early stage, communication problems are very likely at a later stage. In addition, some causes of communication problems are difficult to detect. They may lie in the equipment, the NMS, or the network environment, and after each processing step the network must be observed for a period to see whether the step is effective. This consumes a lot of time and effort. Therefore, at the early stage of project planning, not only the service requirements but also the DCN network environment (such as the configuration, model, and processing capacity of the router) should be taken into account.
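The two data flow directions described above follow from a simple same-segment test. The following is a minimal sketch of that test using Python's standard `ipaddress` module, not a tool used in the original case. The source masks the last octet of the subrack addresses and of the old Server 2 address, so the values `132.37.23.130` and `132.37.5.10/24` are hypothetical placeholders chosen only for illustration. The new Server 2 address and mask are the ones given in Step 5 of the procedure.

```python
import ipaddress


def data_path(server_iface, subrack_ip):
    """Return the hops between a subrack and the server.

    server_iface: server address with mask, e.g. "10.0.0.1/24".
    subrack_ip:   address of the subrack (GNE).
    """
    iface = ipaddress.ip_interface(server_iface)
    peer = ipaddress.ip_address(subrack_ip)
    if peer in iface.network:
        # Same segment: frames are delivered by the switch alone.
        return ["equipment", "switch", "server"]
    # Different segments: every packet has to be forwarded by the router.
    return ["equipment", "switch", "router", "switch", "server"]


# Hypothetical placeholders: the source only gives 132.37.23.** and 132.37.5.**.
SUBRACK = "132.37.23.130"
OLD_SERVER = "132.37.5.10/24"  # host part and mask are assumptions

# Values taken from Step 5 of the procedure.
NEW_SERVER = "132.37.23.254/255.255.255.128"

before = data_path(OLD_SERVER, SUBRACK)
after = data_path(NEW_SERVER, SUBRACK)
print(" <---> ".join(before))
print(" <---> ".join(after))
```

Running the same check against every planned NMS server and gateway NE address during DCN planning shows in advance which traffic will depend on the router.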

Procedure

  1. In the early DCN structure, the equipment and servers are connected through a HUB. In the HUB, data packets are broadcast, which makes the HUB a processing bottleneck. In addition, when a ping command is run to the equipment, obvious packet loss occurs at the HUB. Therefore, it is preliminarily suspected that the HUB is faulty. After the HUB is replaced with a Layer 2 24-port switch, the equipment can be reached by a remote ping command, large packets are transmitted normally, and CML tools can log in to NEs. Operation improves; however, NEs are still unreachable at random on Server 2. In System A, the problems are serious and certain sites can hardly be logged in to. It is therefore preliminarily concluded that the HUB has a certain impact on NE communication but is not the main cause of the problems. Problems in System A are much more serious than those in the other three systems. Therefore, it is suspected that certain ECC settings or a network cable connection in System A is faulty.
  2. After the ECC settings and the network cable routing in System A are checked, the T2000 resumes monitoring the NEs. Over a few days of observation, the cases in which NEs are unreachable for a long time drop noticeably. However, the alarms reported on the T2000, such as NE_COMMU_BREAK (NE communication is interrupted) and NE_NOT_LOGIN (an NE is not logged in to), show that NEs of all systems are still transiently unreachable on Server 2. It is therefore concluded that the ECC settings and network cable routing in System A have a certain impact on NE communication but are not the main cause of the problems. Problems are rare on Server 1, which is located in the same equipment room as Server 2. Therefore, it is suspected that Server 2 is faulty.
  3. Upload the data of certain NEs of the four systems to Server 1 and observe the operation. In addition, re-install the operating system on Server 2 and re-install the T2000. Over the next few days of observation, however, the alarms indicating transiently unreachable NEs are not cleared on Server 2, and alarms indicating that the added NEs are unreachable are reported on Server 1. It is inferred that Server 2 is not faulty and is not the main cause of the problems. On Server 1, which is located in the same equipment room, the original NEs are reachable and only the newly added NEs are frequently unreachable. In addition, the IP addresses of the four systems are not in the same segment as those of the two servers or the old equipment. Therefore, it is suspected that the router is faulty in forwarding.
  4. Data communication specialists analyze the problem on site, and the related data in the T2000 logs is analyzed together with optical network development engineers. The main cause of the problems (NEs are unreachable) is found to be network congestion. The data that the T2000 transmits to NEs cannot be sent out in time, which supports the inference in Step 3. The forwarding capacity of the router may be insufficient, resulting in congestion. Packets cannot be sent out in time, and therefore NEs are unreachable.
  5. Change the IP address of Server 2 to 132.37.23.254, the subnet mask to 255.255.255.128, and the gateway to 132.37.23.129, so that the server and the added equipment are in the same segment. In this case, communication between the equipment and the T2000 goes directly through the switch and no router forwarding is required. After the modification, all the NEs in the four systems can be monitored normally again. During a week of observation, the alarms indicating transiently unreachable NEs are completely cleared.
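The new settings in Step 5 can be checked for consistency before they are applied. The following is a minimal sketch, again using Python's standard `ipaddress` module. The helper name `check_host_config` is hypothetical and is not part of the T2000 or of the original procedure. The address, mask, and gateway are the values given in Step 5.

```python
import ipaddress


def check_host_config(address, netmask, gateway):
    """Summarize the segment implied by an address/mask pair
    and check that the gateway is reachable on that segment."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    net = iface.network
    gw = ipaddress.ip_address(gateway)
    return {
        "network": str(net),
        # The gateway must be in the same segment as the host.
        "gateway_on_link": gw in net,
        # The host must not use the network or broadcast address.
        "address_is_usable": iface.ip
        not in (net.network_address, net.broadcast_address),
        # Addresses left after excluding network and broadcast.
        "usable_hosts": net.num_addresses - 2,
    }


# Values from Step 5 of the procedure.
result = check_host_config("132.37.23.254", "255.255.255.128", "132.37.23.129")
for key, value in result.items():
    print(f"{key}: {value}")
```

With these values, the segment is 132.37.23.128/25, which offers 126 usable addresses and contains both the server address and the gateway.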

Result

The problem is resolved.

Reference Information

None.

