At work last night a server died with no display output. I rebooted it and found that it wasn't POSTing. I checked the power supply voltages and all looked good. I then removed all the RAM sticks, which were 512MB 333MHz DDR ECC, and installed 1 stick. The server booted with 512MB RAM. I held the power button in and it shut down. I then installed a 2nd stick in the same channel (the black plastic bank) and it booted with 1024MB. Once again I held the power button in and shut it down. I added a 3rd stick and it booted successfully with 1536MB RAM. Held the power button in and shut it down. Then I added the 4th stick and tried to boot it up: black screen, no POST.
So I then said, well, maybe it's the stick. I grabbed the stick out of slot 3, which is the blue plastic slot (the colors identify the dual channel pairs), placed the 4th stick into slot 3, and booted it up. It booted with 1536MB RAM.
So OK... maybe there is something wrong with slot 4. I looked it over with a flashlight and blew it out with canned air. Nothing looked wrong. I then thought to myself: I have a spare server just like this one, so I will grab the entire RAM set out of that and place it into this server.
Opened the panel and ejected the RAM carefully, and noticed that the RAM in this other server was 400MHz DDR ECC 512MB sticks and not 333MHz like the others. I figured that the dual 3.2GHz Xeons would probably run better on 400MHz than on 333MHz anyway, if it worked, so I populated the 4 slots, 2 in the black plastic banks and 2 in the neighboring blue plastic banks, and booted it up. The system came up fine. Ran Memtest86 on it for 2 hrs with no errors.
Just out of curiosity, I placed the 4 original 333MHz sticks into the other server and booted that up. That server came up fine on the RAM that the original server rejected with no POST. I ejected the RAM out of both and put each server's RAM back into the 4U chassis it came from. The spare box booted fine with its original RAM. The troubled box once again did not POST and showed a black screen.
I also had 2 spare sticks of 512MB 400MHz DDR ECC RAM from another server that got butchered for parts when its caps leaked. Just out of curiosity, I placed these faster 512MB 400MHz ECC sticks into slots 2 and 3, with the 512MB 333MHz ECC sticks in slots 0 and 1, and booted up the troubled server. The server POSTed. (INTERESTING)
Booted the server off my Knoppix CD and ran Memtest86. It passed 2 complete tests without issues, but Memtest86 thinks that the dual Xeon system is a Pentium 4. Memtest86 is also stating that the server's memory is running at 400MHz FSB, where I expected it to downclock the faster RAM to the speed of the slower RAM.
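To spell out what I expected, here is a minimal Python sketch. It only illustrates the usual rule of thumb that a memory controller clocks every module at the rating of the slowest one installed. The function name and the list layout are made up for this example, and nothing here reads real hardware.

```python
# Hypothetical illustration: expected memory speed with mixed DDR modules.
# Assumption: the controller runs all modules at the slowest installed rating.

def expected_memory_speed(module_speeds_mhz):
    """Return the rating (MHz) of the slowest module, or None if no slot is filled."""
    populated = [s for s in module_speeds_mhz if s is not None]
    return min(populated) if populated else None

# Slots 0 and 1 hold DDR 333 sticks, slots 2 and 3 hold DDR 400 sticks.
mixed_set = [333, 333, 400, 400]
expected = expected_memory_speed(mixed_set)
reported_by_memtest = 400  # the figure Memtest86 displayed on this server

print("expected:", expected, "MHz; reported:", reported_by_memtest, "MHz")
print("matches expectation:", expected == reported_by_memtest)
```

The mismatch between the two numbers is exactly what surprised me.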
Shut down the server. Swapped the 2 pairs between banks so that slots 0 and 1 were now stuffed with 400MHz RAM and slots 2 and 3 were now stuffed with 333MHz RAM. Booted the server and got a black screen. OK... so let's remove the 333MHz stick from slot 3 (the 4th slot). The system booted fine at 1536MB. Now this is getting strange!
I have NEVER seen such oddness with RAM, where a slot is picky and wants the faster RAM or NO RAM... LOL. The only thing I can think of is the memory controller, but if the memory controller were acting up, I would expect it to have an issue with the speed being too fast, not too slow.
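To check that the "picky slot" idea really fits everything I saw, here is a small Python sketch that lists the configurations tried on the troubled server and tests one hypothesis against them. It assumes the slots are numbered 0 to 3 as in the later tests and that the first three sticks went into slots 0 to 2. The names are made up for illustration.

```python
# Each entry: (rating in MHz of the stick in slots 0..3, did the server POST?)
# None means the slot was left empty.
observations = [
    ((333, 333, 333, 333), False),   # original set, all 4 sticks
    ((333, 333, 333, None), True),   # original set, 3 sticks
    ((400, 400, 400, 400), True),    # full set borrowed from the spare server
    ((333, 333, 400, 400), True),    # mixed, faster pair in slots 2 and 3
    ((400, 400, 333, 333), False),   # mixed, slower pair in slots 2 and 3
    ((400, 400, 333, None), True),   # same, with slot 3 emptied
]

def predicts_post(slots):
    """Hypothesis: the board POSTs unless slot 3 holds a DDR 333 stick."""
    return slots[3] != 333

def hypothesis_holds(obs):
    """True if the hypothesis agrees with every recorded outcome."""
    return all(predicts_post(slots) == posted for slots, posted in obs)

print("slot 3 hypothesis fits every observation:", hypothesis_holds(observations))
```

Every configuration above is consistent with slot 3 accepting either a 400MHz stick or nothing at all, which is what makes this so odd.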
Anyone have any ideas as to why this is happening?
Sure, I can stuff it with faster RAM and put it back into operation after proper bench testing with SuperPI, Memtest86, etc., but since I don't understand why this oddity is happening, I had to share it with you all and get your input on it.
Also, to verify that I am not missing a POST beep code, I removed all the RAM and booted the server, and it came up with the proper RAM POST error beeps. So I know it's not as if the server is missing an internal speaker and failing to give beep codes when it's blacked out without POST.
Is there some utility I should run on this server, other than SuperPI and Memtest86, to stress it and test it for errors with the mismatched set?
Wish I had one of those PCI POST cards to stuff into a slot and see exactly at what point the POST is hanging. I think that might be the only way to know where it's getting hung. But that still wouldn't answer why faster RAM works and slower RAM doesn't in the 4th slot, which is slot 3.
The history of this server is that it ran for about 3 years without issues, crunching data related to mailing address recognition from an OCR system. It is one of 12 nodes that crunch addresses so that, between the time the mail is cancelled and the address is lifted by OCR on the AFCS machine, there are results for the automated DBCS sorting machinery to place the mail on the correct route. This system is important for changes of address, among other reasons, so that mail goes where it belongs every day around the world. We found it dead today when it wasn't on the network. It's powered through a huge UPS system with banks of lead acid batteries at the USPS facility where I work, so power is clean to all equipment. The other guys in the shop handed it off to me because I have a degree in this stuff, 25+ years of experience with computer electronics, and am good at getting dead servers back up and running. Generally they fail and get shipped out to be serviced at a repair facility, but I take pride in fixing this stuff myself, down to the component level, when time permits on top of my other duties keeping the mail flowing on time to everyone's mailboxes.
Patio helped me on a different subject, and I believe this comment is directly related to my issue:
Quote: All i know is there have been volumes WRITTEN on how picky Xeon servers are about RAM and it's configurations...Intel has a bunch of info on it.
So I will close this, I guess. I put the server back in operation and it's running fine with mixed ECC RAM pairs.
Bank 0 = 512MB ECC DDR 333MHz
Bank 1 = 512MB ECC DDR 333MHz
Bank 2 = 512MB ECC DDR 400MHz
Bank 3 = 512MB ECC DDR 400MHz
Ran Memtest86 on this server for 7.5 hrs last night and it was happy with mixed RAM pairs, so no need to keep this open I guess.
Thanks, Patio, for solving 2 issues with 1 stone of info, so to speak... LOL