







Abstract: In big-data cluster environments, random access is inefficient, so approximate query processing methods based on row-level sampling are slow to construct their samples. This paper exploits the block-based storage of data in such clusters and samples at the block level instead. Experiments on benchmark and real-world datasets show that the method reduces the fraction of data read and improves query response time while maintaining high query accuracy. In the experiments, reading less than 20% of the data was enough to obtain a query error below 5%, and the precomputed feature data for the blocks of a dataset required less than 0.04% of the space occupied by the dataset itself.
Keywords: approximate query processing; clustering; block sampling; data skipping; feature computation
CLC number: TP274  Document code: A  Article ID: 2095-2945(2024)24-0019-05
With the exponential growth of stored data over recent decades, single-machine databases have gradually become unable to meet the demands of data storage and querying, and more and more users choose to store their data in distributed big-data clusters. Yet even with a large-scale data analytics engine, the time needed to compute an exact result in full over several terabytes of data is often unacceptable. Approximate query processing [1] sacrifices some accuracy in the query result in exchange for a faster query response.
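Among approximate query processing methods, sampling is the most common strategy: it answers queries using a portion of the dataset as a sample. Constructing a row-level sample for a dataset stored on a big-data cluster incurs a high read cost, no different from scanning the entire dataset. In the HDFS file system, data is stored in blocks, and row-level random access is very inefficient. Once the time spent constructing the sample is taken into account, approximate queries based on row-level sampling bring no speed-up in many scenarios. …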
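The claim that row-level sampling reads about as much as a full scan can be checked with a small calculation. The sketch below is illustrative and not taken from the paper: it assumes Bernoulli row sampling at rate `row_rate`, equally sized blocks of `rows_per_block` rows, and that a block must be read in full if any of its rows is sampled. All function names are hypothetical.

```python
import random


def expected_block_read_fraction(rows_per_block: int, row_rate: float) -> float:
    """Expected fraction of blocks holding at least one sampled row.

    Assumption: each row is sampled independently with probability row_rate,
    so a block is skipped only if none of its rows is chosen.
    """
    return 1.0 - (1.0 - row_rate) ** rows_per_block


def simulate_row_sampling(num_blocks: int, rows_per_block: int,
                          row_rate: float, seed: int = 0) -> float:
    """Simulate row-level sampling; return the fraction of blocks touched."""
    rng = random.Random(seed)
    touched = 0
    for _ in range(num_blocks):
        # The block must be read if any one of its rows is sampled.
        if any(rng.random() < row_rate for _ in range(rows_per_block)):
            touched += 1
    return touched / num_blocks


def simulate_block_sampling(num_blocks: int, block_rate: float,
                            seed: int = 0) -> float:
    """Pick whole blocks uniformly at random; return the fraction read."""
    rng = random.Random(seed)
    k = round(num_blocks * block_rate)
    chosen = rng.sample(range(num_blocks), k)
    return len(chosen) / num_blocks


if __name__ == "__main__":
    print(expected_block_read_fraction(1000, 0.01))
    print(simulate_row_sampling(200, 1000, 0.01))
    print(simulate_block_sampling(200, 0.01))
```

Under these assumptions, a 1% row sample over blocks of 1,000 rows touches essentially every block (1 - 0.99^1000 ≈ 0.99996), whereas selecting 1% of the blocks reads only 1% of the data. This is the read-cost gap that motivates sampling at the block level.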