User:Binqu:Hy reser

From Trusted Cloud Group

Motivation posed by Eddie Dong

Data centers today follow different policies. Google uses cheap desktop-class machines as servers, even without ECC, and relies on application-level fault tolerance such as MapReduce, as Frans mentioned. Those servers have neither ECC nor HA. Tasks running on a faulty server can be seamlessly taken over by another server (also a cheap one). This works because Google provides only SaaS, where this kind of solution can still deliver a correct answer to the user or to the Google search engine.

Amazon does not work that way, because Amazon provides IaaS, where each VM is critical to the customer. A customer OS cannot easily fail over to another VM with a software solution, except through virtualization technology like this one. In fact, a violation of the service level agreement (SLA) for a single VM may be fined at 100X the cost. For Amazon, a 1.29% failure rate matters, so Amazon uses expensive servers with ECC to provide memory redundancy in HW. The error rate increases as the machine ages.

On the other hand, a mission-critical server is required to survive for 10 years, and over that time it will eventually suffer many HW failures. Most such servers have redundancy, including for memory. HW ECC is the first-level guard, and a SW solution is the second. For a mission-critical server, nobody can ignore a failure probability of even 10^(-12). In those solutions, the OS kernel and the VMM run as dual instances, and the OS/VMM are enhanced so that they can fail over from one instance to the other when a failure happens.

Intel is already doing that business.

ECC is the entry-level MC solution. It costs less memory, but gives lower HA (it corrects 1 bit only, for example). ECC + HW memory mirroring is the typical MC solution, with higher HA. (Intel products have this capability.)

Your solution sits between them: it uses an entry-level HW MC platform to achieve the same HA as the typical MC solution. As I said, a paper is just a paper. It can solve one problem only. A product is what needs to solve all the problems.

Besides the advantage of a flexible mirroring configuration (i.e. the mirroring ratio), I have another selling point for SW mirroring over HW mirroring: support for a hybrid configuration or hybrid usage model. SW mirroring can support mission-critical (MC) VMs and non-MC VMs co-running on top of one hypervisor/host platform. In this case, we can provide memory mirroring to the MC VMs only, at the cost of double the memory size plus up to 20% performance overhead, while the non-MC VMs run as usual.
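
The memory cost of the hybrid model can be estimated with a short sketch. The VM names, their sizes, and the function name are illustrative assumptions, not figures from the discussion; only the 2x memory factor for mirrored VMs comes from the text above.

```python
def physical_memory_needed(vm_sizes_gb, mirrored):
    """Return total physical memory (GB) when only selected VMs are mirrored.

    vm_sizes_gb: mapping of VM name -> guest memory size in GB (hypothetical).
    mirrored:    set of VM names that get a software mirror (2x memory).
    """
    total = 0
    for name, size in vm_sizes_gb.items():
        factor = 2 if name in mirrored else 1
        total += factor * size
    return total


# Hypothetical host: one mission-critical VM and two ordinary VMs.
vms = {"mc_db": 8, "web": 4, "batch": 4}

# SW mirroring protects only the MC VM; HW mirroring doubles everything.
sw_hybrid = physical_memory_needed(vms, mirrored={"mc_db"})
hw_mirror_all = physical_memory_needed(vms, mirrored=set(vms))
print(sw_hybrid, hw_mirror_all)
```

With these made-up sizes, mirroring only the MC VM needs 24 GB of physical memory, against 32 GB when every VM is mirrored.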

ECC

Background

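Background in which ECC is applied

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. (Note: electromagnetic interference inside the computer can disturb a single bit of DRAM, and in fact it can disturb several bits as well.)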

It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them.

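(Note: at first people believed the errors were caused by alpha particles emitted by contaminated chip packaging, but the ECC article holds that one-off errors come from cosmic rays, which can corrupt one or more memory cells.)

There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state.

(Note: DRAM chips are characterized by falling operating voltages and rising density as the cells shrink. The ECC article argues that this exposes DRAM chips to the radiation above more often, so that even low-energy particles may cause DRAM bit errors. Against cosmic-ray interference, the chip itself can be hardened with a technique called silicon on insulator, a common technique in materials science that still has serious bottlenecks. The article's criterion for a failure is: "change the memory cells' state". The article's real point is that this situation is error-prone and that, once an error occurs, it is catastrophic. I think our paper should emphasize this as well.)

The ECC article cites the Cassini–Huygens spacecraft as an example. It contains two identical flight recorders, each of which contains 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. In the vicinity of Earth, and when the sun is "quiet", it reported a nearly constant single-bit error rate of about 280 errors per day. (Supporting evidence: in the spacecraft example the probability of memory errors is high, but the starting point of our problem is memory errors on cheap machine deployments.)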

The maximum hourly error report from Cassini–Huygens in the first month in space was 128 single-bit errors per hour during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year.

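EDAC stands for Error Detection And Correction.

In harsh electromagnetic environments, large-scale integrated circuits are often disturbed and fail to work properly. Devices such as RAM, which store data in bistable elements, tend to flip under strong interference, so that a stored "0" becomes "1" or a stored "1" becomes "0". The consequences are often serious: control programs run away, critical stored data is corrupted, and so on. As chip integration increases, the likelihood of errors grows too. In some applications this has become a problem that cannot be ignored. In space electronics, for example, the single-event upset effect has become a hard problem for designers.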


How likely memory errors are, according to the ECC article

The premise is mainframes, and deployments such as space stations.

Studies from 2007 to 2009 reported widely varying error rates, from 10^−10 to 10^−17 error/bit·h: roughly one bit error per hour per gigabyte of memory, down to one bit error per millennium per gigabyte of memory. The actual error rate found in Google's large-scale field study [1] was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7×10^−11 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.
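
The unit conversions behind these figures are easy to get wrong, so here is a small worked sketch. It assumes decimal units (1 Mbit = 10^6 bits, 1 GB = 8×10^9 bits); the function names are made up for illustration. Under those assumptions, 25,000 to 70,000 errors per billion device hours per Mbit works out to about 2.5×10^−11 to 7×10^−11 errors per bit-hour.

```python
BITS_PER_MEGABIT = 10**6        # assumption: decimal megabit
BITS_PER_GIGABYTE = 8 * 10**9   # assumption: decimal gigabyte
HOURS_PER_YEAR = 24 * 365


def fit_per_mbit_to_bit_hour(errors_per_billion_hours_per_mbit):
    """Convert 'errors per 10^9 device hours per Mbit' to errors per bit-hour."""
    return errors_per_billion_hours_per_mbit / 1e9 / BITS_PER_MEGABIT


def hours_between_errors_per_gigabyte(rate_per_bit_hour):
    """Mean number of hours between errors in one gigabyte at the given rate."""
    return 1.0 / (rate_per_bit_hour * BITS_PER_GIGABYTE)


low = fit_per_mbit_to_bit_hour(25_000)
high = fit_per_mbit_to_bit_hour(70_000)
print(f"field study: {low:.1e} to {high:.1e} errors per bit-hour")

# The two ends of the 10^-10 .. 10^-17 range, expressed per gigabyte.
print(f"{hours_between_errors_per_gigabyte(1e-10):.2f} hours per error")
print(f"{hours_between_errors_per_gigabyte(1e-17) / HOURS_PER_YEAR:.0f} years per error")
```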

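Next, the harm done by memory errors.

Why error correction is necessary, according to the ECC article

First, the ECC article holds that the consequence of a memory error depends on the system: in a system without ECC, an error can lead either to a crash or to corruption of data (this sounds exaggerated). In large-scale production sites memory errors are one of the most common hardware causes of machine crashes (reference [1] tells us this). Errors can also cause security vulnerabilities (a good phrase to borrow).


In addition, the ECC article holds that memory errors are hard to notice; there may be no symptom at all from which to tell that a particular bit has gone wrong.

Still, the ECC article does acknowledge some symptoms that are easy to recognize; for example, a memory error is likely to make the parity check fail.

The ECC article gives a vivid example: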

An example[5] of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth bit; then a change is made to the spreadsheet and it is saved. However, the "8" (00111000 binary) has silently become a "9" (00111001).
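
The spreadsheet example can be replayed in a few lines. This is a minimal sketch: the helper names are invented, and the single even-parity bit stands in for whatever parity scheme a real memory controller uses.

```python
def stuck_at_one(byte, bit_index):
    """Model a memory cell whose bit `bit_index` always reads as 1."""
    return byte | (1 << bit_index)


def parity_bit(byte):
    """Even-parity bit: 1 if the byte has an odd number of 1 bits."""
    return bin(byte).count("1") % 2


written = ord("8")                    # 0b00111000
stored_parity = parity_bit(written)   # computed when the byte is written

read_back = stuck_at_one(written, 0)  # the lowest bit is stuck at 1
parity_error = parity_bit(read_back) != stored_parity

print(chr(written), "->", chr(read_back), "| parity error:", parity_error)
```

Without any checking the "9" is silently accepted. With parity the mismatch is detected but cannot be located, which is why such a machine halts. Correcting the bit requires the extra check bits of ECC.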

Requirements for using ECC

  • DRAM modules that include extra memory bits and memory controllers that exploit these bits
  • Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits)

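How widely ECC is used

  • Computers for precise scientific computation
  • Most large servers today come with ECC
Dell, HP, IBM, and Amazon

Most ordinary PCs do not have ECC

1 Most motherboards aimed at desktops rather than servers do not support ECC

2 Motherboards that do support ECC have to ship to users with ECC memory disabled

3 Most Intel-based PC chipsets do not support ECC (those whose motherboards do are often supplied with memory modules that do not support ECC.)

4 It may be that most users opt for non-ECC systems and memory even when ECC is available.

Why ordinary PCs are not given ECC; the reasons are

(This part is important, because HAMA must answer these questions: is there actually a demand? References [6] and [7] deserve a close read.)

  • Memory bit errors cannot be perceived, and nothing can be done about them (how would you even tell?)
  • They almost never happen (but this does not seem to be the key point; a low probability does not mean there is no need)
  • ECC is too expensive for budget users (each bank is 9 memory chips compared to 8 for non-ECC memory, and more importantly there is more volume for non-ECC. In some cases the price ratio reduces to 9/8; as an example, on 2008/11/30, on Crucial.com, an ECC CL=5 unbuffered 2GB DDR2-667 DIMM cost $30 while the corresponding non-ECC part cost $28, a difference of 1/15. However, some ECC modules cost twice as much as their non-ECC equivalents [Crucial CT12872Z40B and CT12864Z40B, Jan 2009])
  • ECC RAM is a luxury as well, and the motherboards alone are expensive
  • When ECC is used for high reliability, performance drops by 2-3%, see reference [8] (depending on application, due to the additional time needed for ECC memory controllers to perform error checking)


So our HAMA slogan is simple: without ECC there is no HAMA. HAMA must be built on a mechanism that can quickly identify memory errors, so that ECC-like high reliability can also be deployed on cheap machines. We do it in software, which makes DRAM recovery affordable for budget users.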

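Characteristics of ECC

  • Difference between ECC and parity: parity adds 1 bit to the original data bits, whereas ECC appends dedicated check bits after the data bits; that is, the ECC bits are not data bits. 8 data bits need 5 check bits, and each time the number of data bits doubles, ECC needs 1 more check bit
  • Parity checking is the mainstream approach, but in practice most PCs only got parity checking after the 1990s
  • An ECC-capable memory controller as used in many modern PCs can typically detect and correct errors of a single bit per 64-bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word.
  • Linux supports restoring memory by continuously writing corrected data back
  • ECC has a dedicated prediction mechanism and can recover memory automatically
  • One premise: errors in individual bits are independent of each other
  • EDAC-protected memory still needs to be investigated; it is similar to ECC
  • Some ECC schemes use N-redundancy modules
  • ECC uses dedicated error-correcting codes to handle errors
  • For multi-bit errors, Hamming codes are used; some schemes use TMR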
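
The check-bit rule in the first bullet (5 check bits for 8 data bits, one more per doubling) matches a Hamming code extended with one overall parity bit (SEC-DED). The sketch below assumes that code; the notes do not say which code a given controller actually uses, and the function name is made up.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming code extended with one overall parity bit."""
    r = 0
    # Hamming bound for single-error correction: 2**r >= data_bits + r + 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1  # the extra parity bit adds double-error detection


for m in (8, 16, 32, 64):
    print(f"{m} data bits -> {secded_check_bits(m)} check bits")
```

For a 64-bit word this gives 8 check bits, which is consistent with the 9-chips-versus-8 ratio mentioned for ECC modules.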

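How ECC works [9]

ECC uses a dedicated algorithm to compute the value of each check bit. When data is read from memory, the algorithm recomputes the checksum over the stored data and compares it with the checksum recorded when the data was written. If the two checksums are equal, execution continues. If they differ, ECC isolates the erroneous bit, notifies the system, and finally corrects the error.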
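
The read path described above (recompute, compare, locate, correct) can be illustrated with a textbook code. The sketch below uses an extended Hamming (8,4) code on 4 data bits as a stand-in; this is an assumption chosen for brevity, not the code that HP's hardware or any particular memory controller uses, and `encode`/`decode` are hypothetical names.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)                   # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected only.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                            # inject a single-bit error
print(decode(word))
word[2] ^= 1                            # inject a second error in the same word
print(decode(word))
```

The first read is repaired transparently. The second is only reported, which mirrors the single-error-correct, double-error-detect behaviour described for 64-bit words.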

Since each ECC device can correct single-bit errors, Advanced ECC can correct a multi-bit error that occurs within one DRAM chip. As a result, Advanced ECC gives you protection from device failure.

References

[1] Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS '09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, USA, 2009.

[2] Borucki, "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level", 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487.

[3] Gary M. Swift and Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion Laboratory.

[4] Xin Li, Kai Shen, Michael C. Huang. A Memory Soft Error Measurement on Production Systems. USENIX 2007.

[5] ECC memory, Wikipedia. http://en.wikipedia.org/wiki/ECC_memory

[6] "pcguide; The Market's Change from Parity to Non-Parity Memory". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.

[7] pcguide: Parity vs. Non-Parity: Pros and Cons

[8] "Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.

[9] HP Co. Advanced ECC

Remus

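Memory error detection and correction in DIMMs

  • The DIMM continuously refreshes the cells to preserve the data.
  • HP holds that DIMM memory errors arise when the voltage of a capacitor cell in the DRAM exceeds its normal rated voltage
  • HP attributes this, on the hardware side, to DRAM defects, bad solder joints, and connector issues, which cause hard errors so that the device consistently returns incorrect results. For example, a memory cell may be stuck so that it always returns a "0" bit, even when a "1" bit is written to it.
  • Soft errors are more common. They show up as electromagnetic-compatibility problems: two capacitors placed too close together tend to interfere with each other

Advanced ECC

Causes of memory errors, from the Advanced ECC perspective

1 Growing memory capacity

2 Growing memory density; this trend is driven by software applications becoming increasingly memory-intensive and large-capacity

3 DIMM voltage has not changed while chip size keeps shrinking

Limitations of ECC

ECC can detect and correct 1-bit memory errors, but it cannot correct errors of 2 or more bits (2-bit errors per word are detected but not corrected).

Manufacturers' motivation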

These trends help to drive manufacturers to build more memory capacity in industry-standard servers:

• Operating system support for increasing amounts of memory

• Availability of low-cost, high-capacity memory modules

• Server virtualization

How Advanced ECC works

To improve memory protection beyond standard ECC, HP introduced Advanced ECC technology in 1996. HP and most other server manufacturers use this solution in industry-standard products. Advanced ECC can correct a multi-bit error that occurs within a DRAM chip and avoid a complete DRAM chip failure. In Advanced ECC with 4-bit memory devices, each chip contributes four bits of data to the data word. The four bits from each chip are distributed across four ECC devices (one bit per ECC device), so that an error in one chip could produce up to four separate single-bit errors. Figure 7 shows how one ECC device receives four data bits from four DRAM chips.

Since each ECC device can correct single-bit errors, Advanced ECC can correct a multi-bit error that occurs within one DRAM chip. As a result, Advanced ECC gives you protection from device failure.
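
The effect of spreading each chip's bits across ECC devices can be checked with a toy model. The sketch assumes four x4 DRAM chips and four ECC devices, following the figure description above; the "contiguous" layout is a made-up counter-example for comparison, not something HP describes, and all names are hypothetical.

```python
NUM_CHIPS = 4          # assumption: four x4 DRAM chips, as in the figure description
BITS_PER_CHIP = 4
NUM_ECC_DEVICES = 4


def interleaved(chip, bit):
    """Advanced-ECC-style layout: bit i of every chip goes to ECC device i."""
    return bit


def contiguous(chip, bit):
    """Counter-example layout: all four bits of a chip go to the same ECC device."""
    return chip


def errors_per_device(layout, failed_chip):
    """Count the bad bits each ECC device sees when one whole chip fails."""
    counts = [0] * NUM_ECC_DEVICES
    for bit in range(BITS_PER_CHIP):
        counts[layout(failed_chip, bit)] += 1
    return counts


print("interleaved:", errors_per_device(interleaved, failed_chip=2))
print("contiguous: ", errors_per_device(contiguous, failed_chip=2))
```

With interleaving, every ECC device sees at most one bad bit and can correct it. In the contiguous layout one device sees four bad bits, which a single-bit-correcting code cannot repair.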

Recovering from system failure

Advanced ECC alone does not guarantee recovery from system failures.

Replacing a failed DIMM usually raises your operating costs when you take a server down for unscheduled maintenance, and the shutdown loses revenues for your business. Three available memory protection technologies, known as Memory Failure Recovery, give you failover and backup capability. (A good sentence to borrow.)


HP currently has 3 recovery mechanisms

  • Channel redundancy (Online Spare Memory mode)

A redundant memory channel is set aside as a spare. Once an error is found in the regular physical memory, the data is recovered by copying it from the redundant channel.

Type of protection: This prevents data corruption, a server crash, or both, and you can replace the defective DIMM at your convenience during a scheduled shutdown.

Advantage: it requires no hardware intervention on the server and no server interruption

Disadvantages:

1 It cannot provide complete protection

2 A single memory channel can be supported, but with a condition: it can run on some systems with only one memory channel populated, but a single-channel memory configuration requires dual-rank DIMMs.

3 Multiple channels can be supported, but with a condition: the operating system must have system management and agent support for Advanced Memory Protection. Implementing Online Spare Memory mode over Advanced ECC requires extra DIMMs for the spare memory channel and reduces the system's memory capacity.


  • Mirrored Memory mode:

(Wording worth borrowing.) With Mirrored Memory mode, the memory subsystem writes identical data to two channels simultaneously. If a memory read from one of the channels returns incorrect data due to an uncorrectable memory error, the system automatically retrieves the data from the other channel. A transient or soft error in one channel does not affect mirroring, and operation continues unless there is a simultaneous error in exactly the same location on a DIMM and its mirrored DIMM. Mirrored Memory mode reduces the amount of memory available to the operating system by 50% since only one of the two populated channels provides data.

  • Lockstep Memory mode

Lockstep Memory mode uses two memory channels at a time and offers you an even higher level of protection. In lockstep mode, two channels operate as a single channel—each write and read operation moves a data word two channels wide. Both channels split the cache line to provide 2x 8-bit error detection and 8-bit error correction within a single DRAM. In three-channel memory systems, the third channel is unused and left unpopulated. The Lockstep Memory mode is the most reliable, but it reduces the total system memory capacity by one-third in most systems.
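
The Mirrored Memory behaviour described above (write to both channels, fall back to the other channel on an uncorrectable read) is the same idea a software mirror would follow. A minimal sketch, with invented class names and an injected-fault set standing in for real error detection:

```python
class UncorrectableError(Exception):
    """Raised by a channel when ECC cannot repair the data it read."""


class Channel:
    def __init__(self):
        self.cells = {}
        self.bad_addresses = set()   # addresses with an injected uncorrectable error

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        if addr in self.bad_addresses:
            raise UncorrectableError(addr)
        return self.cells[addr]


class MirroredMemory:
    def __init__(self):
        self.primary = Channel()
        self.mirror = Channel()

    def write(self, addr, value):
        # Identical data goes to both channels.
        self.primary.write(addr, value)
        self.mirror.write(addr, value)

    def read(self, addr):
        try:
            return self.primary.read(addr)
        except UncorrectableError:
            # Fall back to the mirror; this fails only if both copies are bad.
            return self.mirror.read(addr)


mem = MirroredMemory()
mem.write(0x10, 42)
mem.primary.bad_addresses.add(0x10)   # simulate an uncorrectable error in one channel
print(mem.read(0x10))
```

The read still succeeds because the mirror holds an intact copy. Only a fault at the same address in both channels would surface as an error, which matches the "simultaneous error in exactly the same location" condition above.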

Google Research

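  • Memory error rates according to Google: DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
  • There is also a fear that advancing densities in DRAM technology might lead to increased memory errors, exacerbating this problem in the future
  • Google notes that existing studies mostly consider what happens when memory fails under extreme conditions, such as high temperature or strong magnetic fields.
  • What do DRAM errors look like in the real world?

Tying memory into other high-availability schemes

High availability is essentially a real-time backup technique. Many such techniques provide high availability at whole-system granularity and treat memory as a single unit, Remus for example. But Remus's recovery still does not achieve true recovery inside memory.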

Mirrored Memory

Failure detection system for a mirrored memory dual controller (Google patent, 1994)

  • Focus on the disk storage controller. In a high-reliability disk storage system, there is a desire to have redundancy in all the physical parts.
  • Some disks use a delayed or massive update process to create the duplicate (drawback: degraded performance and complex management)
  • Form a real-time mirrored memory process (advantage: fast and accurate)
  • Drawback: with multiple disk storage controllers, the problems become more difficult to solve
  • The motivations for mirrored memory in the storage controller are:

How to effectively and reliably (1) detect controller failure early on in the context of mirrored memory processing so as to reduce potential problems that may occur from later discovery of failure; (2) detect controller failure without significant hardware and/or software overhead requirements; and (3) detect controller failure to separate the controllers and discontinue mirroring of their memories without loss of processing operations and capabilities.

  • The focus is on detection rather than retrieval.