How to deduplicate a table holding 800 million rows

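Background:

A scheduled application (started every day at 02:00) reads all flat txt data files in a specified directory and loads their data into an Oracle database. After loading the data, the application is supposed to back up the txt files from that directory to another directory, but a bug caused the backup to fail. The first run had to load the full data set of 100 million rows. The application then ran for 8 consecutive days, and because of the bug the same data was loaded 8 times, leaving 700 million duplicate rows in the table. The application runs on a schedule because about 1 million rows of incremental data need to be loaded every day.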

Structure of table t_test:

Column       Data type
col_id1      number(11)
col_id2      number(5)
col_3        varchar2(32)
col_4        number(10)
col_5        varchar2(256)
updatetime   timestamp

Note: the columns col_id1 and col_id2 together determine whether a record is a duplicate; updatetime is the time the record was last updated.

Problem:

For each group of duplicate records, keep the most recently updated record and delete the others.
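Before choosing an approach, it helps to measure how many rows are surplus. The sketch below is only an illustration, not the author's procedure: it uses Python's built-in sqlite3 module and a tiny in-memory stand-in for t_test (three made-up keys, each loaded 8 times) to show the group-by query that counts total rows, distinct keys, and surplus rows. The function name count_surplus_rows and the sample keys are hypothetical. On the real table the same inner query would run against Oracle.

```python
import sqlite3


def count_surplus_rows(keys):
    """Return (total_rows, distinct_keys, surplus_rows) for (col_id1, col_id2) pairs."""
    conn = sqlite3.connect(":memory:")
    # Reduced stand-in for t_test: only the two key columns matter here.
    conn.execute("create table t_test (col_id1 integer, col_id2 integer)")
    conn.executemany("insert into t_test values (?, ?)", keys)
    total, distinct_keys = conn.execute(
        """
        select coalesce(sum(cnt), 0), count(*)
          from (select count(*) as cnt
                  from t_test
                 group by col_id1, col_id2)
        """
    ).fetchone()
    conn.close()
    # Every key should survive exactly once, so the rest are surplus rows.
    return total, distinct_keys, total - distinct_keys


# Hypothetical sample: 3 keys, each loaded 8 times (mirrors 8 daily reloads).
keys = [(k, 1) for k in (101, 102, 103)] * 8
print(count_surplus_rows(keys))  # (24, 3, 21)
```

The 24 / 3 / 21 split has the same proportions as the 800 million / 100 million / 700 million figures in the background section.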

Solution:

1. Use create table ... as select to rebuild the non-duplicate records into a new table, t_test_1

create table t_test_1 nologging tablespace &tablespace_name as
select col_id1, col_id2, col_3, col_4, col_5,
       updatetime  -- kept so that t_test_1 has the same columns as t_test
  from (select col_id1,
               col_id2,
               col_3,
               col_4,
               col_5,
               updatetime,
               -- number the rows of each key from newest to oldest
               row_number() over(partition by col_id1, col_id2 order by updatetime desc) rn
          from t_test)
 where rn = 1;
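To see what the row_number() filter does, here is a minimal runnable sketch using Python's built-in sqlite3 module instead of Oracle. It assumes SQLite 3.25 or later for window function support, uses a reduced column set, and the sample rows and the function name dedup_keep_latest are made up for illustration. Oracle-specific clauses such as nologging and tablespace have no SQLite equivalent and are left out.

```python
import sqlite3


def dedup_keep_latest(rows):
    """Return one row per (col_id1, col_id2), keeping the latest updatetime."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t_test (col_id1 integer, col_id2 integer, "
        "col_3 text, updatetime text)"
    )
    conn.executemany("insert into t_test values (?, ?, ?, ?)", rows)
    # Number the rows of each key from newest to oldest, then keep only
    # the row numbered 1 when building the new table.
    conn.execute(
        """
        create table t_test_1 as
        select col_id1, col_id2, col_3, updatetime
          from (select col_id1, col_id2, col_3, updatetime,
                       row_number() over (partition by col_id1, col_id2
                                          order by updatetime desc) as rn
                  from t_test)
         where rn = 1
        """
    )
    result = conn.execute(
        "select col_id1, col_id2, col_3, updatetime from t_test_1 "
        "order by col_id1, col_id2"
    ).fetchall()
    conn.close()
    return result


# Made-up sample: key (1, 1) loaded three times, key (2, 1) loaded twice.
sample = [
    (1, 1, "load 1", "2000-01-01 02:00:00"),
    (1, 1, "load 2", "2000-01-02 02:00:00"),
    (1, 1, "load 3", "2000-01-03 02:00:00"),
    (2, 1, "load 1", "2000-01-01 02:00:00"),
    (2, 1, "load 2", "2000-01-02 02:00:00"),
]
for row in dedup_keep_latest(sample):
    print(row)
# (1, 1, 'load 3', '2000-01-03 02:00:00')
# (2, 1, 'load 2', '2000-01-02 02:00:00')
```

If two rows of the same key share an identical updatetime, row_number() still keeps exactly one of them, but which one is arbitrary unless a tie-breaking column is added to the order by.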

2. Build the indexes on the new table: create the same indexes on it that the original table has

create index ind_t_test_1 on t_test_1(col_id1, col_id2)
  nologging tablespace &ind_tablespace_name;

3. Gather statistics on the new table so that select queries get a correct and efficient execution plan

begin
  dbms_stats.gather_table_stats(ownname          => '&user',
                                tabname          => 't_test_1',
                                estimate_percent => dbms_stats.auto_sample_size,
                                cascade          => true,
                                method_opt       => 'for all columns size 1',
                                granularity      => 'all');
end;
/

4. Switch the new table and the new index back to logging mode

alter table t_test_1 logging;

alter index ind_t_test_1 logging;

5. Keep the old table as a backup and bring the new table online

alter table t_test rename to t_test_bak0902;

alter table t_test_1 rename to t_test;
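It is worth checking the rebuilt table before the rename, since the swap is what makes the new data live. The sketch below, again using Python's sqlite3 on a tiny in-memory stand-in rather than Oracle, shows one possible guard: rename only if t_test_1 has no duplicate keys and contains exactly one row per key of t_test. The function swap_if_clean and the sample data are hypothetical, and the checks are a suggestion rather than part of the original procedure.

```python
import sqlite3


def swap_if_clean(conn):
    """Rename t_test_1 to t_test only if the rebuilt table passes two checks."""
    # Check 1: no (col_id1, col_id2) key appears more than once in t_test_1.
    dup_keys = conn.execute(
        "select count(*) from (select 1 from t_test_1 "
        "group by col_id1, col_id2 having count(*) > 1)"
    ).fetchone()[0]
    if dup_keys:
        raise ValueError(f"{dup_keys} duplicate keys remain in t_test_1")
    # Check 2: no key was lost, i.e. one row in t_test_1 per key in t_test.
    old_keys = conn.execute(
        "select count(*) from (select 1 from t_test group by col_id1, col_id2)"
    ).fetchone()[0]
    new_rows = conn.execute("select count(*) from t_test_1").fetchone()[0]
    if old_keys != new_rows:
        raise ValueError(f"t_test has {old_keys} keys, t_test_1 has {new_rows} rows")
    # Same two renames as in step 5.
    conn.execute("alter table t_test rename to t_test_bak0902")
    conn.execute("alter table t_test_1 rename to t_test")


conn = sqlite3.connect(":memory:")
conn.execute("create table t_test (col_id1 integer, col_id2 integer, updatetime text)")
conn.executemany(
    "insert into t_test values (?, ?, ?)",
    [(1, 1, "2000-01-01"), (1, 1, "2000-01-02"), (2, 1, "2000-01-01")],
)
# Stand-in for the rebuilt table: one row per key with the latest updatetime.
conn.execute(
    "create table t_test_1 as "
    "select col_id1, col_id2, max(updatetime) as updatetime "
    "from t_test group by col_id1, col_id2"
)
swap_if_clean(conn)
print(conn.execute("select count(*) from t_test").fetchone()[0])          # 2
print(conn.execute("select count(*) from t_test_bak0902").fetchone()[0])  # 3
```

On the real system the nightly load must not run between the rebuild and the rename, otherwise rows loaded into the old t_test in that window would be missing from the new table.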

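Approach not recommended:

Running delete directly against the original table t_test is not recommended: removing 700 million of its 800 million rows in place would generate a very large amount of undo and redo, whereas the rebuild above only writes the roughly 100 million rows that are kept.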
