Processing Large Files with pandas

Processing large JSON and CSV files with pandas

Cleaning the JSON data (extracting only the fields needed; the business file is relatively small)

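import jsonlines

def getbusinessjson():
    file_name = 'yelp_academic_dataset_business.json'
    file_name2 = "business.json"
    # The with statement closes both files, so no explicit close() is needed.
    with open(file_name, 'r', encoding='utf-8') as f, \
            jsonlines.open(file_name2, "a") as f2:
        for item in jsonlines.Reader(f):
            business_id = item["business_id"]
            latitude = item["latitude"]
            longitude = item["longitude"]
            categories = item["categories"]
            # The original dict literal was lost in extraction; keeping the
            # four extracted fields under their original keys is an assumption.
            business = {
                "business_id": business_id,
                "latitude": latitude,
                "longitude": longitude,
                "categories": categories,
            }
            f2.write(business)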
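The same field extraction can also be done with pandas alone. The following is a minimal sketch, assuming the input is line-delimited JSON (one object per line, which is what jsonlines reads) and that every record contains all four fields. The function name `extract_business_fields` and the chunk size are illustrative and do not come from the original post.

```python
import pandas as pd

FIELDS = ["business_id", "latitude", "longitude", "categories"]

def extract_business_fields(src_path, dst_path, chunksize=10000):
    """Copy only FIELDS from a JSON Lines file into a new JSON Lines file."""
    total = 0
    # lines=True together with chunksize makes read_json return an iterator
    # of DataFrames instead of loading the whole file at once.
    reader = pd.read_json(src_path, lines=True, chunksize=chunksize)
    with open(dst_path, "w", encoding="utf-8") as out:
        for chunk in reader:
            subset = chunk[FIELDS]
            # orient="records" with lines=True writes one JSON object per line.
            text = subset.to_json(orient="records", lines=True, force_ascii=False)
            out.write(text)
            # Whether to_json ends with a newline differs between pandas
            # versions, so add one only when it is missing.
            if not text.endswith("\n"):
                out.write("\n")
            total += len(subset)
    return total
```

Calling `extract_business_fields('yelp_academic_dataset_business.json', 'business.json')` would then return the number of records written.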

Converting the 4 GB JSON file to CSV (the review file is very large)

import csv
import jsonlines

# Convert the JSON file to a CSV file
def jsontocsv():
    jsonname = 'yelp_academic_dataset_review.json'
    csvname = "review.csv"
    i = 1  # progress counter (was never initialised in the original)
    # newline='' stops the csv module from writing blank rows on Windows;
    # the with statement closes both files even if an error occurs.
    with open(csvname, 'a', encoding='utf-8', newline='') as fc, \
            open(jsonname, 'r', encoding='utf-8') as f:
        csv_writer = csv.writer(fc)
        csv_writer.writerow(["business_id", "text"])
        for item in jsonlines.Reader(f):
            business_id = item["business_id"]
            text = item["text"]
            data = [business_id, text]
            csv_writer.writerow(data)
            print(i)
            i = i + 1
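If installing jsonlines is not an option, the conversion can be sketched with the standard library only. This assumes one JSON object per line; the function name `json_lines_to_csv` and its parameters are illustrative and not part of the original post.

```python
import csv
import json

def json_lines_to_csv(src_path, dst_path, fields=("business_id", "text")):
    """Stream a JSON Lines file into a CSV file, keeping only `fields`."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # extrasaction="ignore" silently drops keys that are not in `fields`.
        writer = csv.DictWriter(dst, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            writer.writerow(json.loads(line))
            count += 1
    return count
```

Only one line is held in memory at a time, so the size of the input file does not matter. Review text that contains commas or line breaks is quoted by the csv module and can be read back with `pandas.read_csv`.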

Merging the review text by business_id

This step uses groupby to concatenate strings and converts the resulting Series to a DataFrame.

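import pandas

reader = pandas.read_csv(r"review.csv", iterator=True)
out_csv = "review1.csv"
status = True
i = 1
while status:
    try:
        review = reader.get_chunk(100000)
        # review.columns = ['business_id', 'text']
        # data = review[review['business_id'] == "ucpuotvqr-nbwbnvmzjlea"]['text']
        # Group by business_id and concatenate the strings
        data = review.groupby('business_id')['text'].apply(lambda x: x.str.cat(sep=". "))
        # Convert the Series to a DataFrame
        # (the original dict literal was lost in extraction; this form is an assumption)
        dict_review = {'business_id': data.index, 'text': data.values}
        df_review = pandas.DataFrame(dict_review)
        print(i)
        # Write the header only for the first chunk and omit the row index
        df_review.to_csv(out_csv, mode='a', header=(i == 1), index=False)
        i = i + 1
    except StopIteration:
        status = False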
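Because each chunk is grouped on its own, a business_id whose reviews fall into more than one chunk still appears several times in review1.csv. One way to finish the merge is to accumulate the text across chunks before writing. This is a sketch only: it assumes the merged text fits in memory, and the function name `merge_reviews` and its parameters are illustrative rather than taken from the original post.

```python
from collections import defaultdict

import pandas as pd

def merge_reviews(in_csv, out_csv, chunksize=100000, sep=". "):
    """Concatenate all review text per business_id across every chunk."""
    parts = defaultdict(list)
    for chunk in pd.read_csv(in_csv, chunksize=chunksize):
        # Drop rows without text so the join below only sees strings.
        chunk = chunk.dropna(subset=["text"])
        for business_id, texts in chunk.groupby("business_id", sort=False)["text"]:
            parts[business_id].extend(texts.astype(str))
    merged = pd.DataFrame(
        {
            "business_id": list(parts.keys()),
            "text": [sep.join(t) for t in parts.values()],
        }
    )
    merged.to_csv(out_csv, index=False)
    return merged
```

With this approach each business_id ends up on exactly one row, at the cost of holding all review text in memory until the final write. If that is too much, running groupby a second time over the smaller review1.csv is another option.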

