When a large number of users are connected to an H806GPBH or H806GPBD board, DDR3 read operations occasionally become abnormal and the external DDR3 cache returns incorrect data, which corrupts service packets. As a result, users may experience slow Internet access or dialup failures, and the board may reset.
[Problem Description]
Trigger conditions:
1. A large number of users are connected to the H806GPBH/H806GPBD boards, traffic is heavy, or a traffic burst occurs. The problem is more likely to be triggered when the number of users exceeds 300, and the probability increases with the number of users.
2. The device runs a patch version earlier than V800R008SPC321, V800R010SPC111, V800R011SPC109, V800R012SPC106, or V800R013C00SPC205.
Symptom:
Users connected to the H806GPBH/H806GPBD boards encounter slow Internet access or dialup failures. The board may even reset.
Location method 1 (manual):
When an H806GPBH or H806GPBD board encounters any fault mentioned above, check the DDR cache through the transparent channel. Then, determine the problem based on the read/write result.
Step 1 Check whether the OLT and board versions are earlier than V800R008SPC321, V800R010SPC111, V800R011SPC109, V800R012SPC106, or V800R013C00SPC205.
MA5600T(config)#display patch all
Software Version:MA5600V800R011C00
SPC100
SPH103
HP1102
------------------------------------------------------------------------
Current Patch State:
------------------------------------------------------------------------
Patch Name Patch State Delivery Attribute Dependency
------------------------------------------------------------------------
SPC100 running common cold patch NO
SPH103 running common hot patch NO
HP1102 running common hot patch NO
------------------------------------------------------------------------
Total:3
Patches in the system cannot be rolled back
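The version comparison in this step can be sketched in Python. This is a minimal illustration, not a vendor tool: the function names are hypothetical, and it assumes that SPC numbers within one R release are ordered numerically and that only SPC cold patches (not SPH or HP hot patches) determine the patch level. Releases not listed in this notice return None rather than a guess.

```python
import re

# Fixed patch levels listed in this notice, keyed by R release.
FIXED_SPC = {"R008": 321, "R010": 111, "R011": 109, "R012": 106, "R013": 205}


def parse_release(software_version):
    """Extract the R release (e.g. 'R011') from a string such as 'MA5600V800R011C00'."""
    m = re.search(r"V800(R\d{3})", software_version)
    if m is None:
        raise ValueError(f"unrecognized version string: {software_version!r}")
    return m.group(1)


def highest_spc(patch_names):
    """Return the highest SPC number among names like 'SPC100'; 0 if there is none."""
    nums = [int(m.group(1)) for name in patch_names
            if (m := re.fullmatch(r"SPC(\d+)", name))]
    return max(nums, default=0)


def needs_patch(software_version, patch_names):
    """True if the SPC level is below the fixed level for this release.

    Returns None when the release is not listed in the notice.
    Assumption: SPC numbers compare numerically within a release.
    """
    fixed = FIXED_SPC.get(parse_release(software_version))
    if fixed is None:
        return None
    return highest_spc(patch_names) < fixed


if __name__ == "__main__":
    # Values taken from the display patch all output above.
    print(needs_patch("MA5600V800R011C00", ["SPC100", "SPH103", "HP1102"]))
```

For the sample output above (R011 with SPC100), the check reports that the device is below SPC109 and therefore meets trigger condition 2.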
Step 2 Enter the transparent channel of the board.
MA5680T(config)#diagnose
MA5680T(diagnose)%%su
Challenge:E8BUH36K
Please input password: --- password (can be obtained using a password generation tool)
MA5680T(su)%%transparent on 0/slotid ---slotid indicates the slot ID of the board.
Serial redirect function is enabled now!
Step 3 Run the following three groups of commissioning commands consecutively. Each group writes three test patterns to one address and reads each one back. If all three values read back in a group are incorrect, the DDR3 partition tested by that group is faulty. If one or more groups indicate a fault, the problem may occur. To ensure accuracy, run the three groups in quick succession.
MA5680T(su)%%tm set indirect-reg 0x48000040 0x55555555
Write register 0x48000040 0x55555555 successfully!
MA5680T(su)%%tm dis indirect-reg 0x48000040 0x1
0x40000040: 55555555
MA5680T(su)%%tm set indirect-reg 0x48000040 0xaaaaaaaa
Write register 0x48000040 0xaaaaaaaa successfully!
MA5680T(su)%%tm dis indirect-reg 0x48000040 0x1
0x40000040: aaaaaaaa
MA5680T(su)%%tm set indirect-reg 0x48000040 0xffffffff
Write register 0x48000040 0xffffffff successfully!
MA5680T(su)%%tm dis indirect-reg 0x48000040 0x1
0x40000040: ffffffff
MA5680T(su)%%tm set indirect-reg 0x58000040 0x55555555
Write register 0x58000040 0x55555555 successfully!
MA5680T(su)%%tm dis indirect-reg 0x58000040 0x1
0x50000040: 55555555
MA5680T(su)%%tm set indirect-reg 0x58000040 0xaaaaaaaa
Write register 0x58000040 0xaaaaaaaa successfully!
MA5680T(su)%%tm dis indirect-reg 0x58000040 0x1
0x50000040: aaaaaaaa
MA5680T(su)%%tm set indirect-reg 0x58000040 0xffffffff
Write register 0x58000040 0xffffffff successfully!
MA5680T(su)%%tm dis indirect-reg 0x58000040 0x1
0x50000040: ffffffff
MA5680T(su)%%tm set indirect-reg 0x68000040 0x55555555
Write register 0x68000040 0x55555555 successfully!
MA5680T(su)%%tm dis indirect-reg 0x68000040 0x1
0x60000040: 55555555
MA5680T(su)%%tm set indirect-reg 0x68000040 0xaaaaaaaa
Write register 0x68000040 0xaaaaaaaa successfully!
MA5680T(su)%%tm dis indirect-reg 0x68000040 0x1
0x60000040: aaaaaaaa
MA5680T(su)%%tm set indirect-reg 0x68000040 0xffffffff
Write register 0x68000040 0xffffffff successfully!
MA5680T(su)%%tm dis indirect-reg 0x68000040 0x1
0x60000040: ffffffff
If "Read register fail errorcode" is displayed during the test, the problem is confirmed.
MA5680T(su)%%tm display indirect-reg 0x68000040 1
0x68000040:
Read register fail errorcode=1082982587! ---This output alone is sufficient to identify the problem.
----End
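The read-back check in the procedure above can be applied to a captured session log. The following Python sketch is only an illustration under stated assumptions: the function name and regular expressions are hypothetical and derived from the sample output in this document, and each read-back line is paired with the most recent write by sequence, because the address echoed in the read output (for example 0x40000040) differs from the address written (0x48000040).

```python
import re

SET_RE = re.compile(r"tm set indirect-reg (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)")
READ_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s*([0-9a-fA-F]{8})\s*$")
FAIL_RE = re.compile(r"Read register fail errorcode")


def evaluate(transcript):
    """Return (faulty_addresses, read_failed) for a captured session log.

    A write address is reported as faulty only if every value read back
    after a write to it was wrong, which follows the rule in Step 3.
    """
    results = {}     # write address -> list of bool (read-back matched)
    pending = None   # (address, value) of the most recent write
    read_failed = False
    for raw in transcript.splitlines():
        line = raw.strip()
        if FAIL_RE.search(line):
            read_failed = True
            continue
        m = SET_RE.search(line)
        if m:
            pending = (m.group(1).lower(), int(m.group(2), 16))
            continue
        m = READ_RE.match(line)
        if m and pending is not None:
            addr, written = pending
            results.setdefault(addr, []).append(int(m.group(2), 16) == written)
            pending = None
    faulty = [addr for addr, oks in results.items() if oks and not any(oks)]
    return faulty, read_failed


if __name__ == "__main__":
    sample = "\n".join([
        "MA5680T(su)%%tm set indirect-reg 0x48000040 0x55555555",
        "Write register 0x48000040 0x55555555 successfully!",
        "MA5680T(su)%%tm dis indirect-reg 0x48000040 0x1",
        "0x40000040: 55555555",
    ])
    print(evaluate(sample))
```

A non-empty list of faulty addresses, or a read failure flag, corresponds to the fault indications described in the procedure; an empty result on a single run does not rule out the problem, since it occurs only occasionally.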
Location method 2 (PMI tool):
If a board shows the symptoms described above, upgrade the preventive maintenance inspection (PMI) tool using the package attached to this document, and then run PMI to identify the problem. Install the required patch if the following PMI result is displayed: "Detected DDR error. Solution: Update to SPC321 if is R8; update to R11SPC109 if is R11. Should you have any question, please contact R&D, Zhouhao 00140882."
[Root Cause]
When packet traffic is heavy, the interval between FPGA read and write accesses to the DDR3 is occasionally too short to meet the DDR3 requirement. As a result, the DDR3 becomes abnormal and packets are corrupted, causing slow Internet access, dialup failure, or even board reset.
[Impact and Risk]
This problem occurs with a low probability and may cause slow Internet access, dialup failure, and occasional board reset, which will affect live-network services.
[Measures and Solutions]
Recovery measures:
This problem occurs occasionally and can be cleared by resetting the affected board.
Because the problem can recur and trigger further DDR3 exceptions and related problems, install the required patch on the faulty board.
Preventive measures:
None