2.21 What happens if I get a memory error with or without parity/ECC?
This item is from the PC Hardware FAQ, by Willie Lim and Ralph Valentino with numerous contributions by others. (v1.25).
[From: gnewman@world.std.com (Gary Newman)]
Memory diagnostics and Power On Self Tests (POSTs) find only hard errors, and only WHEN THE USER LOOKS FOR THEM. The POST reports these errors only when the computer is booted. So unless the user runs a memory diagnostic program, a hard memory error may go undetected until the next reboot, and the effects of the error can spread far and wide during that time. Some system BIOSes allow the user to disable POST to speed up rebooting. Beware that doing this can cause widespread data corruption if a hard error is present on a system without parity memory.
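To make the idea of a memory diagnostic concrete, here is a minimal sketch of a pattern test run against a simulated memory. Everything in it is an assumption made for illustration: the `FaultyMemory` class, the stuck-at-0 fault at address 17, and the four test patterns are hypothetical, and a real POST or diagnostic program works on physical RAM with more thorough patterns.

```python
class FaultyMemory:
    """Simulated byte-wide RAM with one hypothetical stuck-at-0 bit."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr, value):
        value &= 0xFF
        if addr == self.bad_addr:
            value &= ~self.bad_mask  # the stuck bit can never hold a 1
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, collect mismatches."""
    failures = []
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            got = mem.read(addr)
            if got != pattern:
                failures.append((addr, pattern, got))
    return failures


mem = FaultyMemory(size=64, bad_addr=17, bad_bit=3)
failures = pattern_test(mem, 64)
for addr, want, got in failures:
    print(f"addr {addr}: wrote 0x{want:02X}, read 0x{got:02X}")
```

The fault shows up only while `pattern_test` is running, and only for patterns that try to store a 1 in the bad bit. Between test runs the same cell silently returns wrong data, which is the gap parity memory is meant to close.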
The ONLY method of finding hard or soft memory errors during operation is the use of PARITY MEMORY. This is simply the addition of one extra bit for every byte of memory in the computer, which increases SIMM costs by about 10% due to packaging economics. For 16 MB of memory today, parity adds about $50 to the end-user price of the computer system. SOFTWARE CANNOT REPLACE THE FUNCTION OF PARITY MEMORY!
In its simplest form, hardware already present in all computers manufactured today uses the information in the parity memory to detect any single-bit memory error before the computer can make any use of the bad data. Use of parity memory thus prevents the error from propagating and producing side effects. The only user-unfriendly aspect is that computers without ECC (see below) can only halt the running program to prevent use of the bad data. However, that is almost always better, and less costly, than allowing the spread of bad data.
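The parity check described above can be sketched in a few lines. This is a software model for illustration only, since the real check is done by memory hardware on every read. The `ParityMemory` class, the `ParityError` exception, and the choice of even parity are assumptions of this sketch, not details given in the FAQ.

```python
class ParityError(Exception):
    """Raised when a stored byte no longer matches its parity bit."""


def parity_bit(byte):
    """Even parity: return the bit that makes the total count of 1s even."""
    return bin(byte & 0xFF).count("1") % 2


class ParityMemory:
    """Simulated memory that keeps one extra (ninth) bit per byte."""

    def __init__(self, size):
        self.data = [0] * size
        self.parity = [0] * size

    def write(self, addr, byte):
        self.data[addr] = byte & 0xFF
        self.parity[addr] = parity_bit(byte)

    def read(self, addr):
        byte = self.data[addr]
        if parity_bit(byte) != self.parity[addr]:
            # Stop here so the bad byte never reaches the running program.
            raise ParityError(f"parity error at address 0x{addr:08X}")
        return byte

    def flip_bit(self, addr, bit):
        """Simulate a soft error by flipping one stored data bit."""
        self.data[addr] ^= 1 << bit


mem = ParityMemory(16)
mem.write(5, 0x3C)
assert mem.read(5) == 0x3C  # clean read passes the check

mem.flip_bit(5, 2)  # inject a single-bit error
try:
    mem.read(5)
    detected = False
except ParityError as err:
    detected = True
    print(err)
```

Note that the check can say only that something is wrong with the byte, not which bit flipped, which is why a parity-only system has no choice but to halt.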
At its best, the OS on the computer system can display a warning that a memory error occurred in a specific SIMM and that the program is being halted. This is typical for Unix. If the error occurs in the OS itself, the whole system is halted. MS-DOS appears to leave the problem to the system's BIOS to deal with. The better BIOSes will display a message and halt. The worst will simply freeze. All of these alternatives are better, and less costly, than allowing the spread of bad data.
It is interesting to note that Pentium computers access memory 64 bits at a time, which allows the use of Error Correcting Circuits (called ECC) when parity memory is included: the eight parity bits that accompany each 64-bit access are enough to hold a code that corrects any single-bit error. The cost of adding ECC to the memory interface chips is modest, and most server computers have done this. The result is that soft errors can not only be detected, but also corrected on the fly without affecting the running programs. Computers that do this produce warning messages such as:
"soft error corrected at address 0x00343487 pattern 0x0004000"
so you know which SIMM produced the error. Frequent errors in the same SIMM indicate a bad memory chip. That's how we found the SIMM that produced one error a month for three months straight! Single-bit hard errors can also be corrected on the fly. A single burned-out memory bit or bad SIMM pin is "worked around" by the ECC, so there is no need to fix it until a convenient time comes around.
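One way to see how 64 data bits plus 8 extra bits can correct an error is a textbook extended Hamming code: seven check bits locate a single flipped bit and an eighth overall parity bit tells a single error from a double one. The sketch below is an illustration under that assumption. The FAQ does not say which code any particular chipset uses, and the bit layout and the names `encode` and `decode` are hypothetical.

```python
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
# Positions 1..71 that are not powers of two carry the 64 data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]


def encode(word):
    """Return 72 bits: index 0 is overall parity, 1..71 a Hamming codeword."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (word >> i) & 1
    for c in CHECK_POSITIONS:
        # Check bit c covers every position whose index has bit c set.
        parity = 0
        for pos in range(1, 72):
            if pos & c and pos != c:
                parity ^= code[pos]
        code[c] = parity
    code[0] = sum(code[1:]) % 2  # make the total number of 1s even
    return code


def decode(code):
    """Return (word, status): 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        # Odd number of flips: treat as one error located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= code[pos] << i
    return word, status


word = 0x0123456789ABCDEF
stored = encode(word)

single = list(stored)
single[37] ^= 1  # one flipped bit
double = list(stored)
double[37] ^= 1  # two flipped bits
double[50] ^= 1

print(decode(single)[1])
print(decode(double)[1])
```

In this model the nonzero syndrome plays the role of the "pattern" in the warning message: it identifies the failing bit, and the address identifies the SIMM.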
What about errors that parity lets slip by? Those are double-bit errors, which are expected only once every few thousand years. Perhaps double-bit errors will become important when there are billions of computers in use... or gigabytes of DRAM on the average computer.
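Why double-bit errors escape a single parity bit can be checked exhaustively. The sketch below models a 9-bit stored word (8 data bits plus one even-parity bit, an assumption made for illustration) and tries every one-bit and every two-bit flip for all 256 byte values. The helper names are hypothetical.

```python
from itertools import combinations


def parity_bit(byte):
    """Even parity bit for an 8-bit value."""
    return bin(byte & 0xFF).count("1") % 2


def make_word(byte):
    """Pack 8 data bits and their parity bit (bit 8) into one 9-bit word."""
    return byte | (parity_bit(byte) << 8)


def check_fails(word):
    """The parity check fails when the 9-bit word has an odd number of 1s."""
    return bin(word).count("1") % 2 == 1


# Every possible single-bit flip is caught...
single_caught = all(
    check_fails(make_word(b) ^ (1 << i))
    for b in range(256)
    for i in range(9)
)

# ...but every possible two-bit flip restores even parity and is missed.
double_missed = all(
    not check_fails(make_word(b) ^ (1 << i) ^ (1 << j))
    for b in range(256)
    for i, j in combinations(range(9), 2)
)

print(single_caught, double_missed)  # prints: True True
```

Two flips in the same byte change the count of 1s by an even amount, so the parity bit still matches and the error passes unnoticed.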
 