Kaggle Competitions: Beginner Notes

1. Bike Sharing Demand

Kaggle:

Goal: predict bicycle rental demand from features such as date, time, weather, and temperature

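Preprocessing: 1. From the datetime column (year, month, day, hour, minute, second), extract the year, month, day of week, and hour

2. season and weather are categorical codes, so encode them as dummy variables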
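The two preprocessing steps can be tried on a couple of rows first. This is a minimal sketch on made-up data that only mimics the layout of the competition's train.csv; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy rows shaped like train.csv (the values are made up).
toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00"],
    "season": [1, 3],
    "weather": [1, 2],
})

# Parse the timestamp strings once, then read the parts off the index.
dt = pd.DatetimeIndex(toy["datetime"])
features = pd.DataFrame({
    "year": dt.year,
    "month": dt.month,
    "dayofweek": dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "hour": dt.hour,
})

# Dummy-encode the categorical codes; the prefix keeps column names unique,
# since season and weather both use the codes 1-4.
season = pd.get_dummies(toy["season"], prefix="season")
weather = pd.get_dummies(toy["weather"], prefix="weather")
features = pd.concat([features, season, weather], axis=1)

print(features.columns.tolist())
```

Without the prefixes, the season and weather dummies would share the column names 1 to 4, which makes the combined frame ambiguous.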

Model selection:

This is a regression problem: 1. RandomForestRegressor

2. GradientBoostingRegressor
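One way to choose between the two regressors is to compare cross-validated scores. The sketch below uses synthetic data as a stand-in for the real features; the rush-hour formula, the sample size, and the choice of R^2 as the score are all assumptions made for illustration, not part of the competition setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: demand peaks around a hypothetical 17:00 rush hour.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=300)
temp = rng.uniform(0, 35, size=300)
X = np.column_stack([hour, temp])
y = 50 + 100 * np.exp(-((hour - 17) ** 2) / 8) + 2 * temp + rng.normal(0, 5, 300)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(
        n_estimators=50, max_depth=3, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    # Mean R^2 over 3 folds; higher is better.
    scores[name] = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

On the real data the same loop would take the engineered feature matrix and the rental counts in place of `X` and `y`.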

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Target, plus the datetime column needed for the submission files
y_train = train['count']
test_datetime = test['datetime']


def build_features(df):
    """Dummy-encode season/weather, then add date parts and numeric columns."""
    dt = pd.DatetimeIndex(df['datetime'])
    # Prefixes keep the dummy column names unique (both use codes 1-4)
    x = pd.concat([pd.get_dummies(df['season'], prefix='season'),
                   pd.get_dummies(df['weather'], prefix='weather')], axis=1)
    x['month'] = dt.month
    x['day'] = dt.dayofweek
    x['hour'] = dt.hour
    # atemp is present in the data but is not used as a feature here
    for col in ['holiday', 'workingday', 'temp', 'humidity', 'windspeed']:
        x[col] = df[col]
    return x


x_train = build_features(train)
# Align the test columns with train in case a category is missing from one file
x_test = build_features(test).reindex(columns=x_train.columns, fill_value=0)

# Model 1: gradient boosting, fitted on log1p(count) so that expm1 below
# maps the predictions back to rental counts
clf = GradientBoostingRegressor(n_estimators=200, max_depth=3)
clf.fit(x_train, np.log1p(y_train))
result = np.expm1(clf.predict(x_test))
df = pd.DataFrame({'datetime': test_datetime, 'count': result})
df.to_csv('results1.csv', index=False, columns=['datetime', 'count'])

# Model 2: random forest, fitted on the raw counts
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)
y_predict = rfr.predict(x_test).astype(int)
df = pd.DataFrame({'datetime': test_datetime, 'count': y_predict})
df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])


2. Daily News for Stock Market Prediction

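Goal: use historical data, namely the 25 most-clicked news headlines for each day together with that day's market movement (up or down), to predict future market movement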

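Method 1:

1. Merge the 25 headlines into a single document, preprocess each word (remove special characters and words containing digits, drop stop words, lowercase, stem), then extract features with TF-IDF and train an SVM

2. Extract features with word2vec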
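Step 1 of this method can be sketched roughly as follows. The headlines and labels are made up, `clean_day` is a hypothetical helper, and `LinearSVC` stands in for the SVM as one possible choice. Stemming is left out to avoid an extra dependency, and stop-word removal is delegated to `TfidfVectorizer`'s built-in English list.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_day(headlines):
    """Merge one day's headlines into one document and normalise the tokens."""
    text = " ".join(headlines).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    # Drop every token that contains a digit
    tokens = [t for t in text.split() if not any(c.isdigit() for c in t)]
    return " ".join(tokens)


# Made-up toy data: one list of headlines per day, label 1 = up, 0 = down.
days = [
    ["Markets rally as profits surge!", "Tech shares gain 5%"],
    ["Bank collapse sparks panic", "Stocks plunge on recession fears"],
    ["Strong jobs report lifts shares", "Profits surge at retailers"],
    ["Recession fears deepen", "Panic selling hits stocks"],
]
labels = [1, 0, 1, 0]

docs = [clean_day(d) for d in days]

# TF-IDF features (with English stop words removed) feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(docs, labels)

print(model.predict([clean_day(["Profits surge, shares rally"])]))
```

With the real dataset, each day's 25 headlines would replace the toy lists, and the model should be evaluated on later dates than it was trained on, since the goal is to predict future movement.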

Implementation:

3.
