What is ECC memory? What is buffered RAM?

To the question "Explain what 'ECC support' on RAM means," asked by Alyonka, the best answer is: it is an error-correction function. Such memory is installed in servers, because servers cannot be allowed to lag, shut down, or fail because of errors. For a home computer it is not a necessity, although it is useful. If you decide to install it, make sure your motherboard supports RAM with ECC.
Source: ku

Answer from Rainswept[guru]
ECC (Error Correcting Code) - detection and correction of errors (other expansions of the abbreviation also circulate) - an algorithm that replaced the simple parity check. Unlike parity, each bit is included in more than one checksum, so when a single bit fails, the address of the error can be recovered and the error corrected. As a rule, two-bit errors are also detected, although they are not corrected. To implement this, an additional memory chip is installed on the module, making it 72 bits wide instead of the 64 data bits of a conventional module. ECC is supported by all modern motherboards designed for servers, as well as by some general-purpose chipsets. Some types of memory (Registered, Fully Buffered) are available only in ECC versions. Note that ECC is not a cure for defective memory: it corrects random errors, reducing the risk of computer malfunction from accidental changes in memory cell contents caused by external factors such as background radiation.
Registered memory modules are recommended for systems that require (or support) 4 GB of RAM or more. They are always 72 bits wide (that is, they are ECC modules) and contain additional register chips for partial buffering.
PLL (Phase-Locked Loop) - a circuit that automatically locks onto signal frequency and phase; it reduces the electrical load on the memory controller and improves stability when many memory chips are used. It is present in all buffered memory modules.
Buffered - a buffered module. Because of the high total electrical capacitance of modern memory modules, "charging" them takes a long time, which slows write operations. To avoid this, some modules (usually 168-pin DIMMs) are equipped with a special chip (a buffer) that quickly stores incoming data, freeing up the controller. Buffered DIMMs are generally incompatible with unbuffered ones. Partially buffered modules are called "Registered" modules, and fully buffered ones "FB-DIMMs". "Unbuffered" refers to ordinary memory modules without any buffering.
Parity - parity, modules with parity. A rather old principle of data-integrity checking. For each data byte, at write time, a checksum is computed and stored as a parity bit in a separate chip. When the data is read, the checksum is recomputed and compared with the stored parity bit. If they match, the data is considered valid; otherwise a parity error is raised (usually halting the system). The obvious disadvantages of the method are the cost of the extra memory for storing parity bits, blindness to double errors (as well as false alarms when the parity bit itself is in error), and a full system halt even on a trivial error (say, in a video frame). The method is no longer used.
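As a rough illustration of the scheme just described (a minimal Python sketch, not how any real memory controller is implemented), here is even parity computed and checked for one byte:

```python
def parity_bit(byte: int) -> int:
    """Even parity: the stored bit is 1 when the byte has an odd number of 1-bits."""
    return bin(byte & 0xFF).count("1") % 2

def check(byte: int, stored_parity: int) -> bool:
    """True if the recomputed parity matches the stored parity bit."""
    return parity_bit(byte) == stored_parity

data = 0b1011_0010              # 4 one-bits -> parity bit is 0
p = parity_bit(data)

corrupted = data ^ 0b0000_0100  # a single bit flip is detected...
assert not check(corrupted, p)

double = data ^ 0b0001_0100     # ...but a double flip passes unnoticed
assert check(double, p)
```

The sketch shows both properties noted above: detection is cheap, but nothing can be corrected, and an even number of flipped bits slips through.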
SPD - a chip on a DIMM that stores the module's parameters (in particular, speed information) needed for normal operation. This data is read during the computer's power-on self-test, long before the operating system loads, and lets the firmware configure memory access parameters correctly even when different memory modules are installed at the same time. Some motherboards refuse to work with modules that lack an SPD chip, but such modules are now very rare and are mostly old PC-66 parts.


Answer from Mowgley[guru]
checking memory for errors

As I understand it, his (Jeff Atwood's) arguments are as follows:

  1. Google didn't use ECC when they built their servers in 1999.
  2. Most RAM errors are systematic errors, not random ones.
  3. RAM errors are rare because hardware has improved.
  4. If ECC memory were really important, it would be used everywhere, not just in servers. Paying extra for such an optional feature is clearly a dubious deal.
Let's go through these arguments one by one:

1. Google didn't use ECC in 1999

If you are doing something just because Google once did it, then try:

A. Place your servers in shipping containers.

People still write articles about what a great idea this is, even though Google was simply running an experiment that it judged a failure. It turns out that even Google's experiments don't always succeed. In fact, their notorious fondness for "moonshots" means they have more failed experiments than most companies. In my opinion, this is a significant competitive advantage for them. Don't make this advantage bigger than it is by blindly copying failed experiments.

B. Start fires in your own data centers.

Part of Atwood's post discusses how amazing these servers were:

Some might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive, off-the-shelf hardware would shape the modern internet.

The second half of that is right, but the first half contains some truth as well. When Google started designing its own boards, one generation had a growing-pains problem that caused a non-zero number of fires.

By the way, if you go to Jeff's post and look at the photo referenced in the quote, you will see that there are a lot of jumper cables on the boards. This caused problems and was fixed in the next generation of hardware. You can also see some rather sloppy cabling, which additionally caused problems and was also quickly fixed. There were other problems, but I'll leave them as an exercise for the reader.

C. Create servers that injure your employees.

The sharp edges of one generation of Google servers earned them a reputation for being made of "razor blades and hate".

D. Create your own weather in your data centers

After talking to employees of many large technology companies, it seems that most of them have had data centers so aggressively climate-controlled that clouds or fog formed inside. You could call this Google's calculated and devious plan to replicate Seattle weather and poach Microsoft employees. Alternatively, it could be a plan to create literal "cloud computing". Or maybe not.

Note that everything listed above is something Google tried and then changed. Making mistakes and then fixing them is common in any successful engineering organization. If you are going to idolize an engineering practice, at least idolize the modern practice, not what was done in 1999.

When Google used non-ECC servers in 1999, they saw a number of symptoms that were eventually traced to memory corruption, including a search index that returned effectively random results for some queries. The actual failure mode here is instructive. I often hear that ECC can be ignored on such machines because errors in individual results are acceptable. But even if you consider random errors acceptable, ignoring them means risking complete data corruption, unless you carefully analyze the system to make sure that a single error can only slightly distort a single result.

Studies of file systems have repeatedly shown that, despite heroic attempts to build systems resilient to a single error, it is extremely difficult to do. Essentially every heavily tested file system can suffer a major failure from a single bit error. I am not attacking file system developers; they are better at this kind of analysis than 99.9% of programmers. The problem has simply been shown, again and again, to be hard enough that people cannot reason about it reliably, and automated tools for such analysis are still far from a simple push of a button. In their book on warehouse-scale computers, Google discusses error detection and correction, and ECC memory is considered the obvious choice whenever hardware error correction is required.

Google has excellent infrastructure. From what I've heard about infrastructure at other major tech companies, Google's seems to be the best in the world. But that doesn't mean you should copy everything they do. Even considering only their good ideas, it makes no sense for most companies to copy them. They created a replacement for the standard Linux scheduler that uses both hardware run-time information and static traces to take advantage of new features in Intel's server processors that allow dynamic cache partitioning across cores. Used across all of their hardware, this saves Google more money in a week than Stack Exchange has spent on all of its machines in its entire history. Does this mean you should copy Google? No, not unless you've already picked the lower-hanging fruit, such as making sure your core infrastructure is written in highly optimized C++ rather than Java or (God forbid) Ruby. And the fact is that, for the vast majority of companies, writing programs in a language that costs a 20-fold performance penalty is a perfectly reasonable decision.

2. Most RAM Errors Are Systematic Errors

The argument against ECC cites the following passage from a study of DRAM errors (emphasis added by Jeff):
Our study has several main results. First, we found that approximately 70% of DRAM faults are recurring (i.e., permanent) faults, while only 30% are transient. Second, we found that large multi-bit faults, such as faults affecting an entire row, column, or bank, account for over 40% of all DRAM faults. Third, we found that almost 5% of DRAM failures affect board-level circuitry such as the data (DQ) or strobe (DQS) lines. Finally, we found that the Chipkill feature reduced the rate of system failures caused by DRAM faults by a factor of 36.

The quote seems somewhat ironic, since it is not an argument against ECC but an argument for Chipkill, which is a particular class of ECC. Setting that aside, Jeff's post points out that systematic errors are about twice as common as random ones, and then says that they run memtest on their machines when systematic errors occur.

First, a 2:1 ratio is not nearly large enough to justify simply ignoring random errors. Second, the post implies that systematic errors are essentially fixed in place and cannot appear over time. This is wrong. Electronics wear out just as mechanical devices do; the mechanisms differ, but the effects are similar. Indeed, comparing chip reliability analysis with other kinds of reliability analysis shows that they often use the same families of distributions to model failures. Third, Jeff's reasoning implies that ECC cannot help detect or correct such errors, which is not only wrong but directly contradicted by the quote.

So how often are you going to run memtest on your machines to try to catch these systematic errors, and how much data loss are you willing to tolerate? One of the key uses of ECC is not to correct errors but to signal them, so that hardware can be replaced before "silent corruption" occurs. Who would agree to take every machine out of service every day to run memtest? That would be far more expensive than simply buying ECC memory. And even if you could convince me to run regular memory tests, memtest would not catch as many errors as ECC can.

When I worked at a company with a fleet of about a thousand machines, we noticed strange data-integrity check failures, and after about six months we realized that failures were more likely on some machines than on others. The failures were quite rare (perhaps a couple per week on average), so it took a long time to accumulate data and understand what was happening. Without knowing the cause, it was also non-trivial to analyze the logs and determine whether the errors were caused by single-bit flips (which was highly probable). We were fortunate that, as a side effect of the process we used, the checksums were computed in a separate process, on a different machine, at a different time, so an error could not corrupt a result and propagate that corruption into its checksum.

If you are just trying to protect yourself with in-memory checksums, there is a good chance you will run the checksum operation on already-corrupted data and obtain a perfectly valid checksum of bad data, unless you do some genuinely exotic computation that yields self-checking results. And if you are serious about error correction, you are probably using ECC anyway.
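A toy sketch of this pitfall (hypothetical buffer and names, not the pipeline from the story above):

```python
import zlib

data = bytearray(b"important payload")

# A bit flips in RAM *before* any checksum is computed...
data[3] ^= 0x01

# ...so the checksum is a perfectly valid checksum of corrupted data.
checksum = zlib.crc32(data)

# Later verification happily passes: nothing looks wrong.
assert zlib.crc32(data) == checksum
```

The company in the story avoided this only because, by accident of design, the checksums were computed by a separate process on a different machine at a different time.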

Anyway, after completing the analysis, we found that memtest could not detect any problems, but replacing the RAM on the bad machines reduced the error rate by one to two orders of magnitude. Most services don't have the kind of checksums we had; such services simply write corrupted data silently to persistent storage and don't see the problem until a client complains.

3. RAM errors are rare because hardware has improved

The post does not contain enough data to support this claim. Note that RAM usage has been increasing, and continues to increase, exponentially; for the frequency of data corruption to actually fall, RAM failure rates must decrease at an even faster exponential rate. Also, as chips continue to shrink, the cells get smaller, making the wear-out issues discussed in point two more relevant. For example, at 20nm a DRAM capacitor holds somewhere around 50 electrons, and that number will be lower for each following generation of DRAM as scaling continues.

Another note: when you pay for ECC, you are not paying just for ECC memory; you are paying for parts (processors, boards) that are of higher quality overall. This is easy to see in drive failure rates, and I've heard many people note it in their own observations.

To cite publicly available research: as far as I remember, Andrea and Remzi's group released a SIGMETRICS paper a few years ago showing that a SATA drive was 4 times more likely than a SCSI drive to fail a read, and 10 times more likely to suffer silent data corruption. This ratio held even for drives from the same manufacturer. There is no particular reason to think the SCSI interface should be more reliable than the SATA interface; this is not about the interface, but about buying highly reliable server-grade components versus client-grade ones. Perhaps drive reliability as such doesn't concern you, because you checksum everything and corruption is easy to detect, but there are kinds of corruption that are harder to detect.

4. If ECC memory was really important, then it would be used everywhere, not just in servers.

To paraphrase this argument slightly, we can say that "if this characteristic were really important for servers, then it would be used in non-servers as well." You can apply this argument to quite a lot of server hardware. In fact, this is one of the most frustrating problems facing major cloud providers.

They have enough leverage to get most of the components at the right price. But bargaining will only work where there is more than one viable supplier.

One of the few areas with no viable competitors is the production of CPUs and video accelerators. Fortunately for the large buyers, they usually don't need video accelerators; they need CPUs, and lots of them, and that has long been a one-supplier market. There have been several attempts by processor vendors to break into the server market, but every such attempt had fatal flaws from the outset that made it obviously doomed (and these are often projects that take at least five years, i.e., a lot of time spent without confidence of success).

Qualcomm's efforts have generated a lot of noise, but when I talk to my contacts at Qualcomm, they all tell me that the chip currently in production is essentially a test chip: Qualcomm needed to learn how to build a server chip with all the people it poached from IBM, and the next chip will be the first that could plausibly be competitive. I have high hopes for Qualcomm, and also for ARM's efforts to build good server parts, but those efforts have not yet borne fruit.

Why current ARM (and POWER) options are almost completely unsuitable (setting aside hypothetical variants of Apple's impressive ARM chips) for most server workloads in terms of performance per dollar of total cost of ownership (TCO) is a bit off-topic here, so I'll leave it for another post. The point is that Intel's market position lets it charge extra for server features, and Intel does exactly that. Besides, some features really do matter more for servers than for mobile devices with a few gigabytes of RAM and a power budget of a few watts, devices that are expected to crash and reboot periodically anyway.

Conclusion

Should you buy ECC RAM? It depends. For servers it is probably a good option, given the costs, although it is hard to do a real cost/benefit analysis: it is very hard to put a price on latent data corruption, or on the risk of losing half a year of a developer's time chasing intermittent crashes only to discover that they were caused by non-ECC memory.

For desktops I also favor ECC. But if you don't make regular backups, investing in backups will do you more good than ECC memory. And if you have backups but no ECC, you can easily write corrupted data to primary storage and replicate that corruption into the backup.

Thanks to Prabhakar Ragde, Tom Murphy, Jay Weiskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe not thanks) to Leah for convincing me to write up this off-the-cuff rant as a blog post. I apologize for any errors, the lack of references, and the unpolished prose; this is essentially a transcript of half of a discussion, and I haven't explained terms, provided links, or fact-checked to the level of detail I usually do.

One example that is funny (to me, at least) is the magical self-healing fuse. Although there are many implementations, you can think of an on-chip fuse as a kind of resistor. Pass a normal current through it and you have a connection. Pass too much current and the resistor heats up and eventually breaks. This is commonly used to disable features on chips, or for things like setting the clock multiplier. The basic principle is that once a fuse has been blown, there is no way to return it to its original state.

A long time ago there was a semiconductor manufacturer that rushed its manufacturing process a bit and tightened the tolerances slightly too far in one technology generation. After a few months (or years), the connection between the two ends of such a blown fuse could grow back and restore it. If you're lucky, the fuse will be something like the most significant bit of the clock multiplier, and changing it will simply stop the chip from working. If you're unlucky, it will lead to silent data corruption.

I heard about problems in this manufacturer's process generation from many people at different companies, so these were not isolated incidents. When I say it's funny, I mean it's funny to hear the story in a bar. It's less funny to discover, after a year of testing, that some of your chips don't work because their fuse settings are now meaningless, and that you need to respin your chip and delay the release by three months. Incidentally, this fuse-regrowth situation is another example of a class of errors that ECC can mitigate.

This is not a Google problem; I mention it only because many people I talk to are surprised by the ways in which hardware can fail.

If you don't want to dig through the whole book, then here's the snippet:

In a system that can tolerate a number of failures at the software level, the minimum requirement on the hardware is that its faults are always detected and reported to the software in a timely enough manner to allow the software infrastructure to contain them and take appropriate recovery actions. It is not necessary for the hardware to transparently handle all faults. This does not mean that hardware for such systems should be designed without error-correction capability. Whenever error-correction functionality can be offered at reasonable cost or complexity, supporting it often pays off. It does mean that if hardware error correction were extremely expensive, the system could use a cheaper version that provided detection capability only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much harder, because it would mean that every software component would be burdened with checking its own correct execution. At one early point in its history, Google had to deal with servers whose DRAM lacked even parity checking. Producing a web search index consists essentially of a very large sort/merge operation spanning many machines over a long period of time. In 2000, one of the then-monthly updates of Google's web index failed pre-release checks when it was found that a subset of test queries returned seemingly random documents. After some investigation, a pattern was found in the new index files that corresponded to a bit stuck at zero at a consistent place in the data structures, a bad side effect of streaming a large amount of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass, because not all memory positions were being checked; the instructions, for example, were not. It worked because the index data structures were so much larger than all the other data involved in the computation that the presence of these self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following generation of machines at Google did include memory parity detection, and once the price of ECC memory dropped to competitive levels, all subsequent generations have used ECC DRAM.


Also, ECC data-protection schemes can be applied to the memory built into microprocessors: caches and register files. Sometimes checking is added to the computational circuits as well.

Description of the problem

There is concern that the trend toward smaller physical dimensions of memory chips will lead to higher error rates, since lower-energy particles will be able to flip a bit. On the other hand, the compact size of the memory reduces the chance of a particle hitting it at all. In addition, the move to technologies such as silicon-on-insulator may make memory more resilient.

A study conducted on a large number of Google servers showed that the error rate can range from 25,000 to 70,000 errors per billion device hours per megabit (that is, 2.5-7.0 × 10⁻¹¹ errors per bit per hour).
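As a quick sanity check of that unit conversion (a megabit taken as 10⁶ bits, as the quoted figure implies):

```python
errors = 25_000          # lower bound: errors per 1e9 device hours per Mbit
device_hours = 10**9
bits_per_megabit = 10**6

rate = errors / (device_hours * bits_per_megabit)
print(rate)              # 2.5e-11 errors per bit per hour, matching the study
```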

Technology

One solution to this problem is parity: an extra bit that records the parity of the remaining bits. This approach makes it possible to detect errors, but not to correct them; when an error is detected, all you can do is abort the running program.

A more reliable approach uses error-correcting codes. The most commonly used one is the Hamming code. Most error-correcting memory used in modern computers can correct a single-bit error in a 64-bit machine word and detect, but not correct, a two-bit error in the same 64-bit word.
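A minimal Python sketch of such a code (an extended Hamming code over a list of bits; real memory controllers implement this in hardware over 64 data bits with 8 check bits, which is where the 72-bit module width comes from):

```python
def secded_encode(data_bits):
    """Extended Hamming encode: check bits at power-of-two positions
    1, 2, 4, ...; position 0 holds an overall parity bit."""
    n, r = len(data_bits), 0
    while (1 << r) < n + r + 1:
        r += 1
    code = [0] * (n + r + 1)
    it = iter(data_bits)
    for pos in range(1, len(code)):
        if pos & (pos - 1):              # not a power of two: a data bit
            code[pos] = next(it)
    for p in range(r):                   # each check bit covers positions
        mask = 1 << p                    # whose index has bit p set
        code[mask] = sum(code[i] for i in range(1, len(code)) if i & mask) % 2
    code[0] = sum(code) % 2              # overall parity (double-error detection)
    return code

def secded_decode(code):
    """Return 'ok', 'corrected' or 'double_error', plus the codeword."""
    syndrome = 0
    for p in range(len(code).bit_length()):
        mask = 1 << p
        if sum(code[i] for i in range(1, len(code)) if i & mask) % 2:
            syndrome |= mask
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "ok", code
    if overall == 1:                     # odd number of flips: single error
        fixed = code[:]
        fixed[syndrome] ^= 1             # the syndrome names the failed position
        return "corrected", fixed
    return "double_error", code          # even number of flips, nonzero syndrome

word = [1, 0, 1, 1, 0, 0, 1, 0]          # 8 data bits; real DIMMs protect 64
cw = secded_encode(word)
cw[5] ^= 1                               # one flipped bit
status, fixed = secded_decode(cw)
print(status)                            # -> corrected
fixed[3] ^= 1
fixed[9] ^= 1                            # two flipped bits
print(secded_decode(fixed)[0])           # -> double_error
```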

The most effective approach to error correction depends on the kinds of errors expected. It is often assumed that different bits fail independently; in that case the probability of two errors in one word is negligible. However, this assumption does not hold for modern computers. Memory based on IBM's Chipkill error-correction technology can correct multiple errors, including the failure of an entire memory chip. Other memory-correction technologies that do not assume independent bit errors include Extended ECC (Sun Microsystems), Chipspare (Hewlett-Packard), and SDDC (Intel).

Many older systems reported only the errors that were detected but could not be corrected. Modern systems record both corrected errors (CE, correctable errors) and uncorrectable errors (UE). This makes it possible to replace failing memory in time: although a large number of corrected errors, in the absence of uncorrectable ones, does not affect the correctness of memory operation, it may indicate that uncorrectable errors will become more likely on that module in the future.
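On Linux these counters are exposed by the EDAC subsystem (see the notes below); a minimal sketch that reads them, assuming the sysfs layout documented for EDAC (the exact layout can vary by kernel and memory controller):

```python
from pathlib import Path

# Each memory controller appears as /sys/devices/system/edac/mc/mc<N>,
# with cumulative corrected (ce_count) and uncorrected (ue_count) totals.
for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
    ce = (mc / "ce_count").read_text().strip()
    ue = (mc / "ue_count").read_text().strip()
    print(f"{mc.name}: corrected={ce} uncorrected={ue}")
```

A steadily growing ce_count with zero ue_count is exactly the early-warning signal described above.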

Advantages and disadvantages

Error-correcting memory protects the computer system from incorrect operation due to memory corruption and reduces the likelihood of a fatal system failure. However, such memory costs more, and a motherboard, chipset, and processor that support it can also be more expensive, so ECC memory is used in systems where smooth and correct operation matters: file servers, scientific and financial applications.

Error-correcting memory is roughly 2-3% slower than conventional memory, depending on the application (the check often costs one extra memory-controller cycle). The additional logic that computes and verifies the ECC and corrects errors consumes logic resources and time, either in the memory controller itself or in the interface between the CPU and the memory controller.

Notes

  1. Werner Fischer. "RAM Revealed". admin-magazine.com. Retrieved October 20, 2014.
  2. Archived copy (unavailable link). Retrieved November 20, 2016. Archived from the original on April 18, 2016.
  3. Eugene Normand. "Single Event Upset at Ground Level". IEEE; Boeing Defense & Space Group, Seattle, WA 98124-2499.
  4. "A Survey of Techniques for Modeling and Improving Reliability of Computing Systems". IEEE TPDS, 2015.
  5. Kuznetsov V. V. Solar-Terrestrial Physics (lecture course for physics students). Lecture 7: Solar activity; Solar storms. Gorno-Altai State University, 2012.
  6. Gary M. Swift and Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion Laboratory.
  7. Borucki. "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level". 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482-487.
  8. Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich. "DRAM Errors in the Wild: A Large-Scale Field Study". SIGMETRICS/Performance. ACM, 2009. ISBN 978-1-60558-511-6.
  9. "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. Retrieved February 16, 2009. Archived from the original on October 2, 2011.
  10. Doug Thompson, Mauro Carvalho Chehab. "EDAC - Error Detection And Correction". 2005-2009. Archived from the original on September 5, 2009. "The 'edac' kernel module goal is to detect and report errors that occur within the computer system running under linux."
  11. "Discussion of ECC on pcguide". Pcguide.com (April 17, 2001). Retrieved November 23, 2011.


On the Web you can often see questions on forums about error-correcting memory, specifically about its impact on system performance. Today's test will answer this question.

Before reading this material, we recommend that you familiarize yourself with our earlier materials on the LGA1151 platform.

Theory

Before testing, let's talk about memory errors.
Errors that occur in memory can be divided into two types: hardware errors and random (soft) errors. The former are caused by defective DRAM chips; the latter arise from electromagnetic interference, radiation, alpha and other particles, and so on. Accordingly, hardware errors can only be fixed by replacing the DRAM chips, while random errors can be corrected by special technologies such as ECC (Error-Correcting Code). ECC has two mechanisms in its arsenal: SEC (Single Error Correction) and DED (Double Error Detection). The first corrects single-bit errors in a 64-bit word; the second detects (but does not correct) two-bit errors.
In hardware, ECC is implemented by adding memory chips that store an 8-bit checksum for each 64-bit word. Thus a single-sided error-correcting module carries 9 memory chips instead of the 8 of a standard module, and a double-sided one carries 18 instead of 16. The data width of the module grows from 64 to 72 bits.
When data is read from memory, the checksum is recomputed and compared with the stored one. An error in one bit is corrected; an error in two bits is detected.
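A sketch of the decision logic just described (hypothetical helper callables; in reality this is done by the memory controller in hardware, not in software):

```python
def ecc_read(word: int, stored_check: int, compute_check, syndrome_to_bit):
    """Read path: recompute the check bits, compare with the stored ones,
    and act on the difference (the syndrome). compute_check and
    syndrome_to_bit stand in for the controller's ECC circuitry."""
    syndrome = stored_check ^ compute_check(word)
    if syndrome == 0:
        return word                          # clean read
    bit = syndrome_to_bit.get(syndrome)      # SEC: syndrome names one bit
    if bit is not None:
        return word ^ (1 << bit)             # corrected transparently
    raise MemoryError("uncorrectable (double-bit) error")  # DED: report UE
```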

Practice

In theory everything looks good: error-correcting memory increases the reliability of the system, which matters a great deal when building a server or a workstation. But in practice there is also the financial side of the question. A server may require error-correcting memory, but a workstation can get by without ECC (many ready-made workstations from various manufacturers ship with conventional RAM). So how much more expensive is memory with error correction?
A typical 8GB DDR4-2133 module costs around $39, while an ECC module costs $48 (at the time of writing). The difference in cost is about 23%, which looks significant at first glance. But against the total cost of a workstation, the difference will not exceed 5%. Thus buying ECC memory only slightly increases the cost of a workstation. The only question that remains is how ECC memory affects processor performance.
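The percentages work out as follows (the workstation price here is an assumption for illustration; the article does not state one):

```python
non_ecc, ecc = 39, 48           # USD per 8 GB DDR4-2133 module, per the article
premium = (ecc - non_ecc) / non_ecc
print(f"module premium: {premium:.0%}")              # ~23%

workstation = 1500              # hypothetical total system cost, USD
extra = 2 * (ecc - non_ecc)     # the test bed below uses two modules
print(f"share of system cost: {extra / workstation:.1%}")   # ~1.2%, well under 5%
```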
To answer the performance question, the editors took Samsung DDR4-2133 ECC and Kingston DDR4-2133 (non-ECC) modules for testing, with identical 15-15-15-36 timings and 8 GB capacity.

On Samsung M391A1G43DB0-CPB memory modules with error correction, 9 chips are soldered on each side.

While on conventional Kingston KVR21N15D8/8 memory modules, 8 chips are soldered on each side.

Test stand: Intel Xeon E3-1275v5, Supermicro X11SAE-F, Samsung DDR4-2133 ECC 8GB, Kingston DDR4-2133 non-ECC 8GB

Configuration

- Processor: Intel Xeon E3-1275v5 (HT on; TB off);
- Motherboard: Supermicro X11SAE-F;
- RAM: 2x Samsung DDR4-2133 ECC 8GB (M391A1G43DB0-CPB), 2x Kingston DDR4-2133 8GB (KVR21N15D8/8);
- OS: .

Test Methodology

- 3DMark06 1.21;
- 7zip 15.14;
- AIDA64 5.60;
- Cinebench R15;
- Fritz 4.2;
- Geekbench 3.4.1;
- LuxMark v3.1;
- MaxxMEM2 1.99;
- PassMark v8;
- RealBench v2.43;
- SiSoftware Sandra 2016;
- SVPmark v3.0.3b;
- TrueCrypt 7.1a;
- WinRAR 5.30;
- wPrime 2.10;
- x264 v5.0.1;
- x265 v0.1.4;
- Kraken;
- Octane;
- Octane 2.0;
- Peacekeeper;
- SunSpider;
- WebXPRT.
