CSVReader: Missing Data When Reading a CSV File

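In a recent project, a program that imports CSV files was losing a large amount of data: a 2.4 GB report with more than 6 million rows ended up as only a little over 2 million rows in the database. I eventually tracked down the cause and fixed it, so I am recording it here.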

I will skip the earlier detours, such as suspecting that the worker threads could not keep up and were silently dropping rows.

The program reads the file in a loop: while ((data = csvReader.readNext()) != null).

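Looking at the source of readNext, it also reads the file line by line through BufferedReader.readLine and only adds handling for quoting and escaping. By default the CSV library uses DEFAULT_SEPARATOR = ',' (comma) as the column separator and DEFAULT_QUOTE_CHARACTER = '"' (double quote) as the quote character. Quoting is how the parser tells whether a special character such as a comma is part of a column's content or a boundary between columns. For example, if a column contains a,b,c, the file writes it as "a,b,c" to show that the commas belong to the content. Finally, DEFAULT_ESCAPE_CHARACTER = '\\' (backslash) is the escape character.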
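The quoting rule described above can be sketched in a few lines. This is only an illustration under simplifying assumptions (no escape handling, quotes are always dropped); the class QuotedSplitSketch and its split method are hypothetical names, not part of the CSV library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is not the library's implementation.
class QuotedSplitSketch {

    // Splits one line on `separator`, ignoring separators that appear
    // between a pair of `quote` characters. The quotes themselves are dropped.
    static List<String> split(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;            // entering or leaving a quoted field
            } else if (c == separator && !inQuotes) {
                fields.add(current.toString());  // separator outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);               // ordinary character, or a quoted separator
            }
        }
        fields.add(current.toString());          // the last field has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("\"a,b,c\",d", ',', '"')); // prints [a,b,c, d]
        System.out.println(split("a,b,c,d", ',', '"'));     // prints [a, b, c, d]
    }
}
```

With quotes, "a,b,c",d yields two fields; without them, the same characters yield four.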

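In the quote handling, if a column starts with a double quote but no matching quote is found on the same line (that is, the line contains an odd number of double quotes), the parser reads the next line and keeps going until it finds the matching quote. For example, line 151 of our report contains a line break after "Geometry Dash" and another one after "xStep".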

Opened in vim, this row shows up as three lines. The program treats those three lines as a single row, which in itself is fine.
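A rough sketch of that multi-line behaviour, assuming the simplest possible rule (a record stays open while the number of quotes seen so far is odd). RecordJoinerSketch and the sample rows are made up for illustration; the real parser does more work per character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; this is not the library's implementation.
class RecordJoinerSketch {

    // Groups physical lines into logical records: a record stays open
    // while the number of quote characters seen so far is odd.
    static List<String> join(List<String> physicalLines, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean inQuotes = false;
        for (String line : physicalLines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == quote) {
                    inQuotes = !inQuotes;
                }
            }
            pending.append(line);
            if (inQuotes) {
                pending.append('\n');            // the line break belongs to the field
            } else {
                records.add(pending.toString()); // quotes are balanced: record complete
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            records.add(pending.toString());     // unterminated quote at end of input
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample: one record spread over three physical lines, then a normal row.
        List<String> lines = Arrays.asList(
                "1,\"Geometry Dash", "xStep", "level\",ok", "2,plain,ok");
        System.out.println(join(lines, '"').size()); // prints 2
    }
}
```

Four physical lines become two logical records, which is the intended result as long as every opening quote really has a matching closing quote.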

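The problem is that the program uses the backslash as the escape character, while the CSV file escapes with the double quote. A sequence such as \" in the file is therefore treated as an escaped quote: that double quote gets no special handling and is not counted as a closing quote. The quotes no longer match, the program keeps pulling the following lines into the same field, and rows are lost or scrambled.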
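The mismatch can be reproduced with a small state tracker. This is a simplified sketch, not the library's parser: EscapeMismatchSketch, NO_ESCAPE, and the sample row are assumptions made for illustration, and the escape rule is reduced to "an escape character followed by a quote or another escape character consumes that next character".

```java
// Hypothetical helper for illustration only; this is not the library's implementation.
class EscapeMismatchSketch {

    static final char NO_ESCAPE = '\0'; // assumption: '\0' stands for "no escape character"

    // Returns true if the parser would still be inside a quoted field
    // when it reaches the end of `line`.
    static boolean endsInsideQuotes(String line, char quote, char escape) {
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean escapesNext = escape != NO_ESCAPE && c == escape
                    && i + 1 < line.length()
                    && (line.charAt(i + 1) == quote || line.charAt(i + 1) == escape);
            if (escapesNext) {
                i++;                     // the escaped character is skipped, even if it is a quote
            } else if (c == quote) {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    public static void main(String[] args) {
        // Made-up row whose second field ends with a literal backslash: 1,"C:\temp\",ok
        String row = "1,\"C:\\temp\\\",ok";
        System.out.println(endsInsideQuotes(row, '"', '\\'));      // prints true
        System.out.println(endsInsideQuotes(row, '"', NO_ESCAPE)); // prints false
    }
}
```

With the backslash as escape character the row ends with the quote still open, so a multi-line reader would go on to swallow the rows that follow. Without an escape character the same row closes normally.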

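CSVReader defaults to the backslash, and the escape character cannot simply be set to the double quote: it would collide with the quote character, confuse the parser, and the library throws the exception "The separator, quote, and escape characters must be different!". In the end I wrote a CSVReader constructor that uses no escape character and rebuilt the jar. The import then read 6,340,034 rows, and the problem was solved.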

Appendix: key code from readNext (abridged):

    public String[] readNext() throws IOException {
        // ...
        do {
            // ...
            String[] r = parser.parseLineMulti(nextLine);
            if (r.length > 0) {
                if (/* ... */) {
                    // ...
                } else {
                    // ...
                }
            }
        } while (parser.isPending()); // keep reading while a quoted field is still open
        return result;
    }

    private String[] parseLine(String nextLine, boolean multi) throws IOException {
        // ...
        if (nextLine == null) {
            if (/* ... */) {
                // ...
            } else {
                // ...
            }
        }

        List<String> tokensOnThisLine = new ArrayList<String>();
        StringBuilder sb = new StringBuilder(INITIAL_READ_SIZE);
        boolean inQuotes = false;
        if (pending != null) {
            // ...
        }
        for (int i = 0; i < nextLine.length(); i++) {
            // ...
            if (/* ... */) {
                // ...
            } else if (c == quotechar) {
                if (/* ... */) {
                    // ...
                } else {
                    // ...
                    inQuotes = !inQuotes;
                }
                inField = !inField;
            } else if (c == separator && !inQuotes) {
                // ...
            } else {
                // ...
            }
        }
        // line is done - check status
        if (inQuotes) {
            if (/* ... */) {
                // ...
            } else {
                // ...
            }
        }
        if (sb != null) {
            // ...
        }
        return tokensOnThisLine.toArray(new String[tokensOnThisLine.size()]);
    }
