Kaggle Competition Beginner Notes

2022-08-11 12:00:23

1. Bike Sharing Demand

kaggle: 

Goal: predict the number of bike rentals from features such as date, time, weather, and temperature.

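Preprocessing: 1. From the timestamp (year, month, day, hour, minute, second), extract the year, month, day of the week, and hour.

2. season and weather are categorical codes, so encode them as dummy variables.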
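The two preprocessing steps can be sketched on a few made-up rows before touching the real files. This is only an illustration: the three rows below are invented, and `pd.to_datetime` with the `.dt` accessor is used here as one convenient way to split a timestamp.

```python
import pandas as pd

# Toy rows shaped like the competition's train.csv (values are made up).
df = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00", "2012-12-19 08:00:00"],
    "season": [1, 3, 4],
    "weather": [1, 2, 3],
})

# Step 1: parse the timestamp and split it into calendar parts.
ts = pd.to_datetime(df["datetime"])
features = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "dayofweek": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
    "hour": ts.dt.hour,
})

# Step 2: dummy (one-hot) encode the categorical codes.
# Passing columns= makes pandas prefix each new column with its source name.
dummies = pd.get_dummies(df[["season", "weather"]], columns=["season", "weather"])
features = pd.concat([features, dummies], axis=1)

print(features.columns.tolist())
```

Each categorical column becomes one indicator column per observed value (for example `season_3`), which keeps a tree model from reading the codes as an ordered quantity.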

Model selection:

This is a regression problem: 1. RandomForestRegressor

2. GradientBoostingRegressor
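To get a feel for how the two regressors compare, a hold-out check is a reasonable first step. The sketch below uses synthetic data, not the competition files: the `hour` and `temp` columns, the formula behind `count`, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bike data: demand depends on hour and temperature.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
temp = rng.uniform(0, 35, size=500)
signal = 50 + 40 * np.sin(hour / 24 * 2 * np.pi) ** 2 + 3 * temp
count = np.round(signal + rng.normal(0, 5, size=500)).clip(min=0)
X = np.column_stack([hour, temp])

# Hold out a quarter of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, count, test_size=0.25, random_state=0)

models = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(
        n_estimators=200, max_depth=3, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    scores[name] = model.score(X_val, y_val)  # R^2 on the held-out rows
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

On the real data the same loop would take the engineered feature table in place of `X`; which model wins there is not something this toy example can tell you.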

# -*- coding: utf-8 -*-
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Select the features
selected_features = ['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']
# x_train = train[selected_features]
y_train = train["count"]
result = test["datetime"]

# Feature engineering
month = pd.DatetimeIndex(train.datetime).month
day = pd.DatetimeIndex(train.datetime).dayofweek
hour = pd.DatetimeIndex(train.datetime).hour
# Prefix the dummy columns so that season and weather do not both yield columns named 1 to 4
season = pd.get_dummies(train.season, prefix='season')
weather = pd.get_dummies(train.weather, prefix='weather')
x_train = pd.concat([season, weather], axis=1)
x_test = pd.concat([pd.get_dummies(test.season, prefix='season'), pd.get_dummies(test.weather, prefix='weather')], axis=1)
x_train['month'] = month
x_test['month'] = pd.DatetimeIndex(test.datetime).month
x_train['day'] = day
x_test['day'] = pd.DatetimeIndex(test.datetime).dayofweek
x_train['hour'] = hour
x_test['hour'] = pd.DatetimeIndex(test.datetime).hour
# Copy the remaining numeric and flag columns unchanged
for col in ['holiday', 'workingday', 'temp', 'humidity', 'windspeed']:
    x_train[col] = train[col]
    x_test[col] = test[col]
# Make sure the test columns match the training columns in name and order
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)

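from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(n_estimators=200, max_depth=3)
# Fit on log1p(count) so that the expm1 below maps predictions back to rental counts
clf.fit(x_train, np.log1p(y_train))
result = clf.predict(x_test)
result = np.expm1(result)
# Build the submission from the test timestamps and the predictions
df = pd.DataFrame({'datetime': test['datetime'], 'count': result})
df.to_csv('results1.csv', index=False, columns=['datetime', 'count'])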

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)
y_predict = rfr.predict(x_test).astype(int)
# Build the submission from the test timestamps and the predictions
df = pd.DataFrame({'datetime': test['datetime'], 'count': y_predict})
df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])

# Earlier csv-module version of the export, left disabled (Python 2 style file mode)
# predictions_file = open("randomforestregssor.csv", "wb")
# open_file_object = csv.writer(predictions_file)
# open_file_object.writerow(["datetime", "count"])
# open_file_object.writerows(zip(res_time, y_predict))

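The `np.expm1` call on the gradient boosting predictions only makes sense if the model was trained on `np.log1p(count)`. Working in log space also lines up with the competition metric, root mean squared logarithmic error (RMSLE). The helper below is a minimal sketch of that metric and of the log round trip; the function name `rmsle` and the sample counts are illustrative choices.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two count arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


counts = np.array([0.0, 3.0, 16.0, 250.0])

log_target = np.log1p(counts)    # what the model would be trained on
restored = np.expm1(log_target)  # what gets written to the submission

print(np.allclose(restored, counts))
print(rmsle(counts, counts))
```

`log1p` and `expm1` are exact inverses of each other and are safe at zero, which matters because some hours have no rentals.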

2. Daily News for Stock Market Prediction

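Using historical data (the 25 most-clicked news headlines of each day, together with whether the stock market rose or fell that day), predict future stock market movement.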

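Approach 1:

1. Merge the 25 headlines into a single document, then preprocess every word (strip special characters, drop words that contain digits, remove stop words, convert to lowercase, and stem). Extract features with TF-IDF and train an SVM.

2. Extract features with word2vec.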
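Step 1 can be sketched with scikit-learn as follows. Everything specific here is an assumption for illustration: the headlines and labels are invented, the stop-word list is a tiny stand-in, stemming is omitted (a stemmer such as NLTK's `PorterStemmer` could be applied where the comment indicates), and a linear-kernel `SVC` is just one possible SVM.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "as", "after"}


def preprocess(headlines):
    """Merge one day's headlines into a single cleaned document string."""
    tokens = []
    for word in " ".join(headlines).lower().split():
        word = re.sub(r"[^a-z0-9]", "", word)  # strip special characters
        if not word or any(ch.isdigit() for ch in word):
            continue  # drop empty tokens and words containing digits
        if word in STOP_WORDS:
            continue
        tokens.append(word)  # a stemmer would be applied here
    return " ".join(tokens)


# Made-up days: each has a list of headlines and a label (1 = up, 0 = down).
days = [
    ["Markets rally as growth beats forecasts", "Exports surge in 2015!"],
    ["Banks slump after the crisis deepens", "Oil prices crash to $30"],
    ["Strong growth lifts markets", "Exports and jobs surge"],
    ["Crisis fears deepen", "Prices crash as banks slump"],
]
labels = [1, 0, 1, 0]

docs = [preprocess(day) for day in days]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = SVC(kernel="linear")
clf.fit(X, labels)

new_doc = preprocess(["Growth and exports surge", "Markets rally"])
prediction = clf.predict(vectorizer.transform([new_doc]))
print(prediction)
```

Note that the vectorizer is fitted on the training documents only and then reused for new text, so unseen days are mapped into the same vocabulary.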

Implementation:
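The notes do not say how the word2vec vectors are turned into one feature vector per day. A common choice, assumed here, is to average the vectors of the words in the document. The sketch below uses a hypothetical hand-written 3-dimensional embedding table in place of a trained word2vec model; `EMBEDDINGS` and `document_vector` are illustrative names.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained word2vec model
# would supply learned vectors of much higher dimension.
EMBEDDINGS = {
    "markets": np.array([0.9, 0.1, 0.0]),
    "rally": np.array([0.8, 0.3, 0.1]),
    "crash": np.array([-0.7, 0.2, 0.4]),
    "banks": np.array([0.1, 0.9, 0.2]),
}
DIM = 3


def document_vector(tokens, embeddings=EMBEDDINGS, dim=DIM):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known words: fall back to the zero vector
    return np.mean(vectors, axis=0)


doc_vec = document_vector(["markets", "rally", "today"])
print(doc_vec)
```

The resulting fixed-length vector can be fed to the same kind of classifier as the TF-IDF features.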

3.
