李曉維 鄢貴海 韓銀和
摘 要:高通量計算系統由海量的計算節點、存儲節點通過網絡互連而成。由于規模巨大,系統的可靠性成為一個非常嚴重的問題,部件失效已經成為一種常態,系統設計必須考慮容錯的問題。我們需要建立新的高通量計算系統的可靠性保障框架,來適應高通量計算中不同層次的可靠性需求,研究從芯片級到系統級跨層次的可靠計算技術。圍繞該目標,該研究從高通量處理芯片的故障檢測和容錯設計方法,高通量計算系統的失效檢測和恢復方法和從芯片級到系統級的故障自預測、自檢測、自定位、自隔離和自愈合(5S)支撐環境3方面展開研究。截至2013年各項工作按照任務書原定計劃正在穩步推進,部分工作取得階段性成果。在(1)針對NBTI老化故障的在線預測技術;(2)深度學習等系統故障預測技術;(3)寄存器故障診斷;(4)片上網絡通信隔離技術等技術點上取得了突破,共發表錄用了IEEE Transactions論文6篇,其他期刊論文1篇。從研究點覆蓋來看,部署到研究點已經全部覆蓋了任務書規定的所有研究計劃,并對某些研究點進行了細化。
關鍵詞:可靠性設計 故障檢測 深度學習 在線預測 通信隔離
Abstract: A high-throughput computing system consists of massive numbers of computing nodes and storage nodes connected by an interconnection network. At this scale, reliability becomes a severe problem: component failure is the norm rather than the exception, so system design must take fault tolerance into account. We need to build a new reliability assurance framework for high-throughput computing systems that accommodates the reliability requirements of the different layers of high-throughput computing, and to study cross-layer reliable computing techniques from the chip level to the system level. To this end, the study addresses three aspects: (1) fault detection and fault-tolerant design methods for high-throughput processing chips; (2) failure detection and recovery methods for high-throughput computing systems; and (3) a supporting environment for self-prediction, self-detection, self-localization, self-isolation and self-healing (5S) of faults from the chip level to the system level. As of 2013, the work was progressing steadily according to the plan set out in the task specification, and parts of it had reached milestone results. Breakthroughs were made on (1) online prediction of NBTI aging faults, (2) system fault prediction techniques such as deep learning, (3) register fault diagnosis, and (4) on-chip network communication isolation; six IEEE Transactions papers and one other journal paper were published or accepted. In terms of coverage, the deployed research topics cover all research plans specified in the task specification, and some topics have been further refined.
Key Words: Reliability design; Fault detection; Deep learning; Online prediction; Communication isolation
Full-text link (real-name registration required): http://www.nstrs.cn/xiangxiBG.aspx?id=50730&flag=1