Incremental Reading of Large Files in Python

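Some time ago I was running an algorithm test whose results came from analyzing data in a log. The log file was large and I wanted to plot how the data changed over time, so reading the file incrementally was the best approach.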

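Many tech blogs suggest reading incrementally with a for loop over readline plus a counter. With a large file, a single pass like that takes too long. For incremental reads of large files, walking every line and comparing it against previously recorded output, or loading everything into memory and looking up a recorded index, also wastes a lot of resources.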

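The basic theory of file handles already includes pointer operations. On Linux, the kernel structure behind an open file descriptor (struct file) has a field called f_pos that stores the file's current read position. From that position, a series of mappings through the VFS leads to the location on disk, so the lookup is direct and fast.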
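As a small illustration of that per-descriptor position, the sketch below asks the operating system for the current offset of a raw file descriptor through os.lseek. The helper names current_offset and demo are made up for this example, and the temporary file only stands in for a real log.

```python
import os
import tempfile


def current_offset(fd):
    """Return the current read/write offset of an open file descriptor."""
    # Seeking 0 bytes relative to the current position moves nothing
    # and returns the offset the OS keeps for this descriptor.
    return os.lseek(fd, 0, os.SEEK_CUR)


def demo():
    """Hypothetical usage: check the offset before and after reading 9 bytes."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"line one\nline two\n")
        path = tmp.name
    fd = os.open(path, os.O_RDONLY)
    try:
        before = current_offset(fd)
        os.read(fd, 9)  # consume the 9 bytes of "line one\n"
        after = current_offset(fd)
    finally:
        os.close(fd)
        os.remove(path)
    return before, after
```

The offset advances by exactly the number of bytes read, which is why saving it is enough to resume later.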

Python's file objects provide methods that expose a similar position.

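Function    Purpose
tell()      Returns the current position in the file
seek()      Moves the file position to a given offset, so the next read starts there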

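seek(offset, whence) has three modes, selected by whence: 0 (the default) measures the offset from the start of the file, 1 from the current position, and 2 from the end of the file.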
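The three whence values can be sketched as follows. This is only an illustration: seek_demo is a made-up name, and io.BytesIO stands in for a file opened in binary mode. In Python 3, files opened in text mode only accept offsets returned by tell() with whence 0 (or a zero offset with whence 1 or 2), which is why the sketch works on bytes.

```python
import io
import os


def seek_demo(data=b"0123456789"):
    """Return the positions reached with each of the three whence modes."""
    buf = io.BytesIO(data)  # in-memory stand-in for open(path, "rb")
    positions = []

    buf.seek(3, os.SEEK_SET)  # whence=0: 3 bytes from the start
    positions.append(buf.tell())

    buf.seek(2, os.SEEK_CUR)  # whence=1: 2 bytes forward from the current position
    positions.append(buf.tell())

    buf.seek(-4, os.SEEK_END)  # whence=2: 4 bytes before the end
    positions.append(buf.tell())

    return positions
```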

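#!/usr/bin/python
# Record a position with tell(), then return to it with seek()
fd = open("test.txt", 'r')  # get a file handle
for i in range(3):  # read three lines
    fd.readline()
label = fd.tell()  # record the position reached
fd.close()  # close the file
# read the file again
fd = open("test.txt", 'r')  # get a file handle
fd.seek(label, 0)  # move the read pointer to the recorded position
fd.readline()  # continue reading from where the last read stopped
fd.close()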
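The same tell/seek pattern can be wrapped in a reusable helper. The following is a minimal sketch, not the original author's code: read_new_lines and demo are hypothetical names, the file is assumed to be UTF-8, and a trailing line without a newline is treated as still being written and left for the next call.

```python
import os
import tempfile


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for complete lines found after `offset`."""
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)  # whence defaults to 0: absolute position
        while True:
            raw = f.readline()
            if not raw.endswith(b"\n"):
                break  # end of file, or a last line that is still being written
            lines.append(raw.decode("utf-8").rstrip("\r\n"))
            offset = f.tell()  # remember the end of the last complete line
    return lines, offset


def demo():
    """Hypothetical usage: read, append to the file, then read only the new part."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"a\nb\n")
        path = tmp.name
    try:
        first = read_new_lines(path)
        with open(path, "ab") as f:
            f.write(b"c\nd")  # "d" has no newline yet, so it is not returned
        second = read_new_lines(path, first[1])
    finally:
        os.remove(path)
    return first, second
```

The caller only has to keep the returned offset between calls; nothing that was already read is scanned again.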

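How to find the line count of the large file, and how it changes

My ideas:

Method 1: scan for '\n' characters.

Method 2: count fd.readline() calls in a for loop from the start. For the part that changed (located with the seek and tell functions described above), run another for loop over fd.readline() and add its count.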
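Both ideas can be combined in one small function. The sketch below counts newline bytes in fixed-size blocks starting from a given offset, so a later call only scans what was appended. The names count_newlines and demo and the 1 MiB default block size are assumptions made for this example.

```python
import os
import tempfile


def count_newlines(path, start=0, chunk_size=1024 * 1024):
    """Count newline bytes from `start` to the end; return (count, end_offset)."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            count += block.count(b"\n")
        return count, f.tell()


def demo():
    """Hypothetical usage: count once, append two lines, count only the new part."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"1\n2\n3\n")
        path = tmp.name
    try:
        total, offset = count_newlines(path)
        with open(path, "ab") as f:
            f.write(b"4\n5\n")
        added, offset = count_newlines(path, start=offset)
    finally:
        os.remove(path)
    return total, added, offset
```

Adding each new count to the running total gives the current line count without rereading the whole file.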

How to avoid running out of memory when reading a file

def read_in_chunks(file_path, chunk=100 * 100):  # chunk sets how much is read per call, to keep memory use bounded
    with open(file_path, "r") as file_object:
        while True:
            data = file_object.read(chunk)
            if not data:
                break
            # As a generator, this holds only one chunk in memory at a time, as the caller iterates
            yield data
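A fixed-size chunk usually ends in the middle of a line. One way to still get whole lines, sketched here under the assumption of a UTF-8 file with "\n" line endings, is to carry the unfinished tail of each chunk over to the next one. The names iter_lines_chunked and demo are made up for this example.

```python
import os
import tempfile


def iter_lines_chunked(path, chunk_size=64 * 1024):
    """Yield decoded lines while reading the file in fixed-size binary chunks."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            parts = (leftover + block).split(b"\n")
            leftover = parts.pop()  # last piece may be an unfinished line
            for part in parts:
                yield part.decode("utf-8")
    if leftover:
        yield leftover.decode("utf-8")  # final line without a trailing newline


def demo():
    """Hypothetical usage with a tiny chunk size so lines span several chunks."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"alpha\nbeta\ngamma")
        path = tmp.name
    try:
        return list(iter_lines_chunked(path, chunk_size=4))
    finally:
        os.remove(path)
```

Splitting on the newline byte before decoding avoids cutting a multi-byte UTF-8 character at a chunk boundary.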

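Added 2019-11-29: application code written for a friend's real-world problem. It handles abnormal program exits and resumes reading from the last saved position: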

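#!/usr/bin/python
# coding: utf-8
"""
@author: bingo.he
@file: 20191129-file.py
@time: 2019/11/29
"""
import glob
import os


class opened(object):
    """Context manager that resumes reading a file from its last saved position."""

    def __init__(self, filename):
        self.filename = filename
        self.handle = open(filename, encoding="utf-8")
        read_info = get_read_info()
        if filename in read_info:
            self.handle.seek(read_info[filename], 0)

    def __enter__(self):
        return self.handle

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # Save the position on every exit, including exits caused by an exception
        seek_num = self.handle.tell()
        set_read_info(self.filename, seek_num)
        self.handle.close()
        if exc_traceback is None:
            print(f"File [{self.filename}] was read and exited normally.")
        else:
            print(f"File [{self.filename}] exited with an error while reading!")


def get_read_info():
    """Load the saved read position of every file read so far.

    :return: dict mapping file name to position
    """
    file_info = {}
    # Create an empty record file if it does not exist yet
    if not os.path.exists("temp"):
        with open("temp", 'w', encoding="utf-8") as f:
            pass
        return file_info
    with open("temp", 'r', encoding="utf-8") as f:
        datas = f.readlines()
    for data in datas:
        name, line = data.split("===")
        file_info[name] = int(line)
    return file_info


def set_read_info(filename, seek_num):
    """Save the read position of a file.

    :param filename: file name
    :param seek_num: position of the file handle
    :return:
    """
    flag = True  # stays True while no record exists for this file
    with open("temp", 'r', encoding="utf-8") as f:
        datas = f.readlines()
    for num, data in enumerate(datas):
        # Compare the whole name, so "a.py" does not match the record of "aa.py"
        if data.split("===")[0] == filename:
            flag = False
            datas[num] = f"{filename}==={seek_num}\n"
    if flag:
        datas.append(f"{filename}==={seek_num}\n")
    with open("temp", 'w', encoding="utf-8") as f:
        f.writelines(datas)


# Test code
# Note: once a file has been read to the end, its position is saved in the temp file and a second run reads nothing more from it; delete temp or edit its entries to read again
pys = glob.glob("*.py")  # all files in the current directory ending in .py
for py in pys:
    with opened(py) as fp:  # read mode by default
        # readline() instead of "for line in fp": in Python 3, tell() raises OSError while iteration over a text file is still in progress
        for line_data in iter(fp.readline, ""):
            print(line_data)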
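A saved position is only useful while the file still contains it. The sketch below is one possible guard, not part of the original code: it assumes that a file which is now shorter than the saved position was truncated or replaced, and restarts from the beginning in that case. The names resume_offset and demo are hypothetical.

```python
import os
import tempfile


def resume_offset(path, saved_offset):
    """Return saved_offset if it still fits inside the file, otherwise 0."""
    size = os.path.getsize(path)
    if 0 <= saved_offset <= size:
        return saved_offset
    # Assumption: a shorter file means it was truncated or replaced,
    # so the old position is meaningless and reading restarts at 0.
    return 0


def demo():
    """Hypothetical usage: the offset survives until the file is rewritten shorter."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    try:
        still_valid = resume_offset(path, 4)
        with open(path, "wb") as f:  # simulate a rotated log: shorter content
            f.write(b"new")
        after_truncate = resume_offset(path, 4)
    finally:
        os.remove(path)
    return still_valid, after_truncate
```

A check like this could run before the seek() call in a resuming reader, so a rotated log does not leave it positioned past the end of the file.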
