







Maximum probability subset repair algorithm for inconsistent data
XIA Xiu-feng, SI Jia-yu, ZHANG An-zhen
(College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China)
Abstract: For inconsistency errors in relational data, existing subset repair methods usually take the minimum number of deleted tuples as the optimization objective when solving for the optimal repair, so as to reduce changes to the original data. However, when the data contain many errors, the accuracy of such methods is greatly reduced. To address this, a maximum probability subset repair method is proposed. It uses the correlations between attributes, together with probabilistic and statistical information, to model the correctness probability of each tuple, and it takes minimizing the sum of the correctness probabilities of the deleted tuples as the optimization objective for optimal subset repair. An efficient approximation algorithm for maximum probability subset repair is also given. Experimental results on real and synthetic datasets show that the maximum probability subset repair method achieves higher accuracy than the current state-of-the-art method.
Key words: inconsistent data; maximum probability; subset repair; data cleaning; machine learning
CLC number: TP301 Document code: A
doi:10.3969/j.issn.2095-1248.2023.01.007
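The two objectives contrasted in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy table, the functional dependency zip -> city, and the per-tuple correctness probabilities are hypothetical, and exhaustive search stands in for the paper's approximation algorithm, which is not reproduced here.

```python
from itertools import combinations

# Hypothetical toy relation: (tuple id, zip, city, assumed correctness probability).
ROWS = [
    ("t1", "110136", "Shenyang", 0.9),
    ("t2", "110136", "Shenyan", 0.2),
    ("t3", "110136", "Shenyan", 0.3),
    ("t4", "116000", "Dalian", 0.8),
]


def conflicts(rows):
    """Return pairs of tuple ids that violate the assumed FD zip -> city."""
    return [
        (a[0], b[0])
        for a, b in combinations(rows, 2)
        if a[1] == b[1] and a[2] != b[2]
    ]


def best_repair(rows, cost):
    """Exhaustively find the cheapest set of tuples to delete.

    A deletion set is valid when every conflicting pair loses at least
    one of its tuples, so the remaining tuples are consistent.
    """
    ids = [r[0] for r in rows]
    pairs = conflicts(rows)
    best_cost, best_set = None, None
    for k in range(len(ids) + 1):
        for deleted in combinations(ids, k):
            chosen = set(deleted)
            if all(a in chosen or b in chosen for a, b in pairs):
                total = sum(cost[i] for i in deleted)
                if best_cost is None or total < best_cost:
                    best_cost, best_set = total, chosen
    return best_set


# Objective 1: minimize the number of deleted tuples (unit cost per tuple).
min_count_repair = best_repair(ROWS, {r[0]: 1 for r in ROWS})

# Objective 2: minimize the summed correctness probability of deleted tuples.
max_prob_repair = best_repair(ROWS, {r[0]: r[3] for r in ROWS})

print("minimum-count deletion:", sorted(min_count_repair))
print("probability-weighted deletion:", sorted(max_prob_repair))
```

On this toy data the count-based objective deletes the single tuple t1, even though it is the most likely to be correct, whereas the probability-weighted objective deletes t2 and t3, whose combined correctness probability (0.5) is lower than that of t1 (0.9). Exhaustive search is exponential in the number of tuples and is used here only to keep the example exact.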
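In recent years, the rapid development of information acquisition technology, cyber-physical systems, the Internet of Things, social networks, the metaverse, and related technologies has triggered explosive growth in data volume, and the TB-, PB-, and even EB-scale data accumulated across industries have become an important asset of the information society. However, data accumulation is often accompanied by various types of data errors, which greatly reduce data usability. Surveys by authoritative institutions show that data errors have had serious impacts on all aspects of production and daily life [1-2]. For example, in industry, data errors cost US industrial enterprises about 611 billion US dollars per year; in healthcare, deaths from medical accidents caused by data errors account for about 50% of all medical-accident deaths in the United States; in finance, credit card fraud resulting from data inconsistency cost the US banking industry 4.8 billion US dollars in a single year [3]. Ensuring data quality and improving data usability are therefore prerequisites for effectively discovering and exploiting the value of data [4-5]. ……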