Motivation posed by Eddie Dong
Today's data centers follow different policies for their servers. Google uses cheap desktop-class machines as servers, even without ECC, and relies on application-level fault tolerance such as MapReduce, as Frans mentioned. Those servers have neither ECC nor HA. Tasks running on a faulty server can be seamlessly taken over by another server (also a cheap one). This works because Google provides only SaaS, where this kind of solution gives a perfectly good answer to the user or to the Google search engine.
Amazon does not work that way, because Amazon provides IaaS, where each VM is critical to its customer. A customer OS cannot easily fail over to another VM using a software solution, except with virtualization technology like this one. In fact, a violation of the service level agreement (SLA) for a VM may be fined at 100X the cost. For Amazon, a 1.29% failure rate matters, so Amazon uses expensive servers with ECC to provide memory redundancy in HW. The error rate increases as the machines age.
On the other hand, a mission-critical server is required to survive for 10 years, during which it will eventually suffer many HW failures. Most such servers have redundancy, including for memory. HW ECC is the first level of protection, and a SW solution is the second. For a mission-critical server, nobody can ignore even a failure probability of 10^-12. In those solutions, the OS kernel and the VMM have dual instances, and the OS/VMM are enhanced to fail over from one instance to the other when a failure happens.
Intel is already doing that business.
ECC is the entry-level MC solution. It costs less memory but gives lower HA (it handles 1 bit only, for example). ECC + HW memory mirroring is the typical MC solution, with higher HA (Intel products have this capability).
Your solution sits between them: it uses an entry-level HW MC platform to achieve the same HA as the typical MC solution. As I said, a paper is just a paper. It can solve only one problem. It is a product that needs to solve all the problems.
Besides the advantage of a flexible mirroring configuration, i.e. the mirroring ratio, I have another selling point for SW mirroring over HW mirroring: support for a hybrid configuration or hybrid usage model, in which mission-critical (MC) VMs and non-mission-critical VMs co-run on top of one hypervisor/host platform. In this case, we can provide memory mirroring to the MC VMs only, at the cost of double the memory size plus up to 20% performance overhead, while the non-MC VMs run as usual.
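The memory side of the hybrid argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the VM sizes are made up, `host_memory_gib` is a hypothetical helper, and it assumes HW mirroring duplicates all memory on the host while SW mirroring duplicates only the MC VMs. The performance overhead of up to 20% is not modeled.

```python
def host_memory_gib(mc_vm_gib, non_mc_vm_gib, mirror_all=False):
    """Physical memory needed to host the given VMs, in GiB.

    mirror_all=True  models HW mirroring (assumed to mirror every VM).
    mirror_all=False models SW mirroring applied to MC VMs only.
    """
    mc_total = sum(mc_vm_gib)
    non_mc_total = sum(non_mc_vm_gib)
    if mirror_all:
        return 2 * (mc_total + non_mc_total)
    return 2 * mc_total + non_mc_total


# Hypothetical host: two 8 GiB MC VMs and three 16 GiB non-MC VMs.
mc_vms = [8, 8]
non_mc_vms = [16, 16, 16]

software = host_memory_gib(mc_vms, non_mc_vms)
hardware = host_memory_gib(mc_vms, non_mc_vms, mirror_all=True)
print(f"SW mirroring (MC only): {software} GiB, HW mirroring (all): {hardware} GiB")
```

With these made-up sizes the hybrid configuration needs 80 GiB instead of 128 GiB.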
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. (In fact, several bits can be disturbed as well.)
It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them.
There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state.
For cosmic-ray interference, the chip itself can be hardened with a technique called silicon on insulator (a common technique in materials science, though it still has serious bottlenecks). The criterion the ECC article uses for a fault is a change in a memory cell's state. The article's view is that although DRAM is prone to such errors, an error is catastrophic once it happens. I think our paper should emphasize this point as well.
The ECC article cites the Cassini–Huygens spacecraft as an example. It contains two identical flight recorders, each of which contains 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. In the vicinity of Earth, and when the sun is "quiet", it reported a nearly constant single-bit error rate of about 280 errors per day. (Supporting evidence: in the spacecraft example the memory error rate is high, but the starting point of our problem is memory errors on cheap deployed machines.)
The maximum hourly error report from Cassini–Huygens in the first month in space was 128 single-bit errors per hour during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year.
EDAC stands for Error Detection And Correction.
Studies from 2007 to 2009 showed widely varying error rates, from 10^-10 to 10^-17 error/bit·h, that is, from roughly one bit error per hour per gigabyte of memory to one bit error per century per gigabyte of memory. In Google's large-scale field study, the actual error rate found was several orders of magnitude higher than in previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7×10^-11 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.
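The units in these figures are easy to mix up, so a quick conversion helps as a sanity check. The sketch below assumes a decimal megabit (10^6 bits) and a binary gigabyte (2^30 bytes); the function names are mine. Dividing the quoted field-study rate through gives roughly 2.5e-11 to 7e-11 errors per bit-hour, and the upper end of the 2007-2009 range comes out close to one error per hour per gigabyte.

```python
BILLION_HOURS = 1e9
BITS_PER_MBIT = 1e6          # assumption: decimal megabit
BITS_PER_GIB = 8 * 2**30     # assumption: "gigabyte" read as 2**30 bytes


def per_bit_hour(errors_per_billion_hours_per_mbit):
    """Convert 'errors per billion device hours per Mbit' to errors per bit-hour."""
    return errors_per_billion_hours_per_mbit / BILLION_HOURS / BITS_PER_MBIT


def errors_per_gib_hour(rate_per_bit_hour):
    """Expected errors per hour in one gigabyte at the given per-bit rate."""
    return rate_per_bit_hour * BITS_PER_GIB


low = per_bit_hour(25_000)
high = per_bit_hour(70_000)
print(f"field study: {low:.1e} to {high:.1e} errors per bit-hour")
print(f"1e-10 per bit-hour: {errors_per_gib_hour(1e-10):.2f} errors per GiB-hour")
```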
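First, the ECC article holds that the consequence of a memory error is system-dependent: without ECC, an error can lead either to a crash or to corruption of data (overstated, in my view). In large-scale production sites memory errors are one of the most common hardware causes of machine crashes (this is the point it gives us). Errors can also cause security problems ("security vulnerabilities" is good wording).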
An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth (least significant) bit; then a change is made to the spreadsheet and it is saved. As a result, the "8" (00111000 binary) has silently become a "9" (00111001).
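The spreadsheet example can be reproduced in a few lines. This is a minimal sketch that models the stuck bit as a bitwise OR on the stored byte; `force_bit_to_one` is a name made up for the illustration.

```python
def force_bit_to_one(byte_value, bit_index):
    """Return byte_value with the given bit stuck at 1 (bit 0 = least significant)."""
    return byte_value | (1 << bit_index)


stored = ord("8")                         # ASCII code of the digit "8"
read_back = force_bit_to_one(stored, 0)   # the faulty cell holds the lowest bit

print(f"{stored:08b} -> {read_back:08b}, digit read back: {chr(read_back)}")
```

Without any checking the program simply sees a valid ASCII digit, which is why the corruption goes unnoticed.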
- DRAM modules that include extra memory bits and memory controllers that exploit these bits
- Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits)
2 Motherboards that support ECC have to disable ECC memory for the user
3 Most Intel-based PC chipsets do not support ECC (those whose motherboards do are often supplied with memory modules that do not support ECC.)
4 It may be that most users opt for non-ECC systems and memory even when ECC is available.
- ECC is too expensive for budget buyers (each bank is 9 memory chips compared to 8 for non-ECC memory, and more importantly there is more volume for non-ECC. In some cases the price ratio reduces to 9/8. As an example, on 2008/11/30, on Crucial.com, an ECC CL=5 unbuffered 2GB DDR2-667 DIMM cost $30 while the corresponding non-ECC part cost $28, a difference of 1/15; however, some ECC modules cost twice as much as their non-ECC equivalents [Crucial CT12872Z40B and CT12864Z40B, Jan 2009])
- When ECC is used for high reliability, performance drops by 2-3% (see the references), depending on application, due to the additional time needed for ECC memory controllers to perform error checking.
- An ECC-capable memory controller as used in many modern PCs can typically detect and correct errors of a single bit per 64-bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word.
- EDAC-protected memory: needs investigating; similar to ECC.
- Some ECC schemes use N-redundancy modules.
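The "correct one bit, detect two bits" behaviour noted above can be shown at small scale. The sketch below uses an extended Hamming (8,4) code, which protects 4 data bits with 4 check bits. This is an assumption made for readability: real ECC controllers protect 64-bit words with 8 check bits, and their exact code is not described in these notes. The function names are mine.

```python
def encode(data_bits):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of 0/1)."""
    d1, d2, d3, d4 = data_bits
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position        # XOR of the positions of all set bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flipped bits: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot repair.
        return "uncorrectable", None
    return status, [code[3], code[5], code[6], code[7]]


word = [1, 0, 1, 1]
stored = encode(word)

one_flip = stored.copy()
one_flip[5] ^= 1

two_flips = stored.copy()
two_flips[5] ^= 1
two_flips[6] ^= 1

print(decode(stored))
print(decode(one_flip))
print(decode(two_flips))
```

The overall parity bit at index 0 plays the role of plain parity checking: on its own it detects any odd number of flipped bits, and combined with the Hamming syndrome it separates single-bit errors (repairable) from double-bit errors (detected only).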
Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber. "DRAM Errors in the Wild: A Large-Scale Field Study". SIGMETRICS '09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, USA, 2009.
Borucki. "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level". 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487.
Gary M. Swift, Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion Laboratory.
Xin Li, Kai Shen, Michael C. Huang. "A Memory Soft Error Measurement on Production Systems". USENIX 2007.
"ECC memory". Wikipedia. http://en.wikipedia.org/wiki/ECC_memory
"pcguide: The Market's Change from Parity to Non-Parity Memory". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
"pcguide: Parity vs. Non-Parity: Pros and Cons".
"Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
HP Co. "Advanced ECC".
- The DIMM continuously refreshes the cells to preserve the data.
- According to HP, the hardware causes of the above are DRAM defects, bad solder joints, and connector issues, which cause hard errors so that the device consistently returns incorrect results. For example, a memory cell may be stuck so that it always returns a "0" bit, even when a "1" bit is written to it.
These trends help to drive manufacturers to build more memory capacity in industry-standard servers:
• Operating system support for increasing amounts of memory
• Availability of low-cost, high-capacity memory modules
• Server virtualization
To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in 1996. HP and most other server manufacturers use this solution in industry-standard products. Advanced ECC can correct a multi-bit error that occurs within a DRAM chip and avoid a complete DRAM chip failure. In Advanced ECC with 4-bit memory devices, each chip contributes four bits of data to the data word. The four bits from each chip are distributed across four ECC devices (one bit per ECC device), so that an error in one chip could produce up to four separate single-bit errors. Figure 7 shows how one ECC device receives four data bits from four DRAM chips.
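The effect of spreading each chip's four bits over four ECC devices can be checked with a small count. This is a sketch under assumptions: bit i of every chip is routed to ECC device i (the notes only say "one bit per ECC device"), and the non-interleaved layout, where all four bits of a chip land in the same ECC device, is a hypothetical comparison case.

```python
BITS_PER_CHIP = 4   # x4 DRAM devices, as in the text
ECC_DEVICES = 4


def errors_per_ecc_device(failed_chip, interleaved=True):
    """Count the bad bits each ECC device sees when one whole DRAM chip fails."""
    counts = [0] * ECC_DEVICES
    for bit in range(BITS_PER_CHIP):
        if interleaved:
            device = bit                        # assumed routing: bit i -> device i
        else:
            device = failed_chip % ECC_DEVICES  # hypothetical: whole chip -> one device
        counts[device] += 1
    return counts


print(errors_per_ecc_device(failed_chip=2))                     # [1, 1, 1, 1]
print(errors_per_ecc_device(failed_chip=2, interleaved=False))  # [0, 0, 4, 0]
```

With interleaving, each ECC device sees at most one bad bit, which single-bit correction can handle; without it, one device would face a four-bit error.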
Replacing a failed DIMM usually raises your operating costs when you take a server down for unscheduled maintenance, and the shutdown loses revenue for your business. Three available memory protection technologies, known as Memory Failure Recovery, give you failover and backup capability. (A good sentence to borrow.)
Type of protection: This prevents data corruption, a server crash, or both, and you can replace the defective DIMM at your convenience during a scheduled shutdown.
2 A single memory channel can be supported, with a condition: it can run on some systems with only one memory channel populated, but a single-channel memory configuration requires dual-rank DIMMs.
3 Multiple channels can be supported, on condition that the operating system has system management and agent support for Advanced Memory Protection. Implementing Online Spare Memory mode over Advanced ECC requires extra DIMMs for the spare memory channel and reduces the system's memory capacity.
- Mirrored Memory mechanism:
(Wording we can borrow.) With Mirrored Memory mode, the memory subsystem writes identical data to two channels simultaneously. If a memory read from one of the channels returns incorrect data due to an uncorrectable memory error, the system automatically retrieves the data from the other channel. A transient or soft error in one channel does not affect mirroring, and operation continues unless there is a simultaneous error in exactly the same location on a DIMM and its mirrored DIMM. Mirrored Memory mode reduces the amount of memory available to the operating system by 50% since only one of the two populated channels provides data.
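The read path described above can be modeled in software as a toy. The class below is a sketch, not HP's implementation: `MirroredMemory`, `UncorrectableError`, and the error-injection method are hypothetical names, and an uncorrectable error is modeled simply as an address marked bad in one channel.

```python
class UncorrectableError(Exception):
    """Raised when the same address is bad in both channels."""


class MirroredMemory:
    def __init__(self, size):
        # Two channels hold identical copies of the data.
        self.channels = [bytearray(size), bytearray(size)]
        # Addresses with an uncorrectable error, tracked per channel.
        self.bad = [set(), set()]

    def write(self, addr, value):
        # Every write goes to both channels.
        for channel in self.channels:
            channel[addr] = value

    def inject_uncorrectable_error(self, channel_index, addr):
        self.bad[channel_index].add(addr)

    def read(self, addr):
        # Try the first channel, then fall back to the mirror.
        for index, channel in enumerate(self.channels):
            if addr not in self.bad[index]:
                return channel[addr]
        raise UncorrectableError(f"address {addr} is bad in both channels")


memory = MirroredMemory(16)
memory.write(3, 0xAB)
memory.inject_uncorrectable_error(0, 3)   # first channel fails at address 3
value = memory.read(3)                    # served from the mirror channel
print(hex(value))
```

Only a simultaneous error at the same address in both channels makes `read` fail, which matches the condition stated in the text.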
- Lockstep Memory mode
Lockstep Memory mode uses two memory channels at a time and offers you an even higher level of protection. In lockstep mode, two channels operate as a single channel—each write and read operation moves a data word two channels wide. Both channels split the cache line to provide 2x 8-bit error detection and 8-bit error correction within a single DRAM. In three-channel memory systems, the third channel is unused and left unpopulated. The Lockstep Memory mode is the most reliable, but it reduces the total system memory capacity by one-third in most systems.
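The capacity cost of the two modes follows directly from the fractions quoted above. A minimal sketch, assuming a hypothetical server that could hold 96 GiB with every channel populated; the lockstep figure applies to the three-channel case described in the text.

```python
from fractions import Fraction

# Fraction of the maximum memory capacity left for the operating system.
USABLE_FRACTION = {
    "mirrored": Fraction(1, 2),   # only one of the two populated channels provides data
    "lockstep": Fraction(2, 3),   # third channel unused in three-channel systems
}


def usable_gib(max_capacity_gib, mode):
    """Memory available to the OS, given the capacity with all channels populated."""
    return max_capacity_gib * USABLE_FRACTION[mode]


for mode in USABLE_FRACTION:
    print(mode, usable_gib(96, mode), "GiB")
```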
- Memory error rates according to Google: DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
- There is also a fear that advancing densities in DRAM technology might lead to increased memory errors, exacerbating this problem in the future.
- What do DRAM errors look like in the real world?
Failure detection system for a mirrored memory dual controller (Google Patents, 1994)
- Focus on the disk storage controller. In a high-reliability disk storage system, there is a desire to have redundancy in all the physical parts.
- Some disk systems use a delayed or massive update process to create the duplicate (drawback: degraded performance and complex management).
- Form a real-time mirrored memory process (advantage: fast and accurate).
- Drawback: a solution with multiple disk storage controllers makes the problems more difficult.
- The motivations for mirrored memory in storage controllers are:
How to effectively and reliably (1) detect controller failure early on in the context of mirrored memory processing so as to reduce potential problems that may occur from later discovery of failure; (2) detect controller failure without significant hardware and/or software overhead requirements; and (3) detect controller failure to separate the controllers and discontinue mirroring of their memories without loss of processing operations and capabilities.
- It deals with detection rather than retrieval.