Applying Bloom Filters to Web Page Deduplication

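// Applying a Bloom filter to URL deduplication
// Problem background
// This example starts with 200,000 URLs; for data of a different scale, adjust the parameters accordingly.
// The test set contains 186,083 URLs.
// Compile as standard C++.
// For more hash functions, see 淚下的天空.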
#include <iostream>
#include <fstream>
#include <string>
#include <cassert>
#include <ctime>
using namespace std;
#define func_num 8
#define bit_max 3999949 // this is a prime number; why? A prime modulus tends to spread hash values more evenly over the bit array
const int hash_size = bit_max / 8 + 1;
char hash[hash_size];
int strint[func_num];
// The functions marked <<1>> through <<8>> below are string hash functions; this program uses eight of them.
//<<1>>
unsigned int rshash(const std::string& str)
{
    // ...
    return hash;
}
//<<2>>
unsigned int jshash(const std::string& str)
{
    // ...
    return hash;
}
//<<3>>
unsigned int pjwhash(const std::string& str)
{
    // ...
    return hash;
}
//<<4>>
unsigned int aphash(const std::string& str)
{
    // ...
    return hash;
}
//<<5>>
unsigned int bkdrhash(const std::string& str)
{
    // ...
    return hash;
}
//<<6>>
unsigned int sdbmhash(const std::string& str)
{
    // ...
    return hash;
}
//<<7>>
unsigned int fnvhash(const std::string& str)
{
    // ...
    return hash;
}
//<<8>>
unsigned int hflp(string str)
{
    // ...
}
// Check whether the URL exists in the url.dat file
bool find(string url)
{
    // ...
    return res;
}
int main(int argc, char* argv[])
{
    // ...
    time_t con_end = time(NULL);
    url_in.close();
    // Read the test data from the file
    ifstream test_in("test_url.dat");
    assert(test_in);
    int count(0), size(0);
    time_t test_start = time(NULL);
    while (getline(test_in, url))
    {
        // ...
    }
    time_t test_end = time(NULL);
    cout << "Number of test URLs: " /* ... */;
}
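
The function bodies in the listing above did not survive, so what follows is a minimal sketch and not the author's code. It assumes the commonly published forms of two of the named hashes (BKDR and SDBM) and introduces hypothetical helpers `set_bit`, `test_bit`, `bloom_add` and `bloom_contains`. Only the bit count 3999949 and the byte-array layout come from the listing.

```cpp
#include <cassert>
#include <string>

const unsigned int BIT_MAX = 3999949;             // number of bits, as in the listing
const unsigned int HASH_SIZE = BIT_MAX / 8 + 1;   // bytes needed to hold BIT_MAX bits
static unsigned char bits[HASH_SIZE];             // static storage: every bit starts at 0

// BKDR hash in its commonly published form (seed 131); assumed, not taken from the source.
unsigned int bkdr_hash(const std::string& str) {
    unsigned int seed = 131, h = 0;
    for (unsigned char c : str) h = h * seed + c;
    return h & 0x7FFFFFFF;
}

// SDBM hash in its commonly published form; assumed, not taken from the source.
unsigned int sdbm_hash(const std::string& str) {
    unsigned int h = 0;
    for (unsigned char c : str) h = c + (h << 6) + (h << 16) - h;
    return h & 0x7FFFFFFF;
}

// Bit pos lives in byte pos / 8, at offset pos % 8 inside that byte.
void set_bit(unsigned int pos) {
    assert(pos < BIT_MAX);
    bits[pos / 8] |= static_cast<unsigned char>(1u << (pos % 8));
}

bool test_bit(unsigned int pos) {
    assert(pos < BIT_MAX);
    return (bits[pos / 8] & (1u << (pos % 8))) != 0;
}

// Record a URL: set one bit per hash function.
void bloom_add(const std::string& url) {
    set_bit(bkdr_hash(url) % BIT_MAX);
    set_bit(sdbm_hash(url) % BIT_MAX);
}

// true means "possibly seen before"; false means "definitely not seen".
bool bloom_contains(const std::string& url) {
    return test_bit(bkdr_hash(url) % BIT_MAX) && test_bit(sdbm_hash(url) % BIT_MAX);
}
```

The sketch uses two hash functions to stay short. The original program uses eight, which lowers the false-positive rate at the cost of more hashing per URL.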

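Bitmap Deduplication and Bloom Filters

A single bit is used to record each address, so very little memory is needed. A Bloom filter allocates a bit array of m bits, all initially set to 0. When an element arrives, several hash functions (h1, h2, h3, ...) compute different hash values; each value is used as an index into the bit array, and the bit at that index is changed from 0 to 1. In Python, using a Bloom...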
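
To make the memory claim concrete, the sketch below evaluates the standard approximation for the false-positive rate of a Bloom filter with m bits, k hash functions and n inserted elements, (1 - e^(-kn/m))^k. The function names `false_positive_rate` and `optimal_hash_count` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cmath>

// Approximate false-positive rate for m bits, n inserted elements
// and k hash functions: (1 - e^(-k*n/m))^k.
double false_positive_rate(double m, double n, int k) {
    assert(m > 0 && n >= 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Number of hash functions that minimises the rate for given m and n: (m / n) * ln 2.
double optimal_hash_count(double m, double n) {
    assert(m > 0 && n > 0);
    return m / n * std::log(2.0);
}
```

With the figures from the C++ listing earlier on this page (m = 3999949 bits, about 0.5 MB; n = 200,000 URLs; k = 8), `false_positive_rate` comes out on the order of 10^-4, and `optimal_hash_count` suggests that about 14 hash functions would minimise the rate for that array size.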

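A Simple Bloom Filter Implementation for URL Deduplication

How do you avoid crawling the same page twice? A Bloom filter can handle the deduplication. Each thread uses one bit array that records the hash values of the pages fetched the last time this batch of source pages was crawled. After a source page is fetched and its links are extracted, each link is checked against the bit array: if the page has not been fetched before, it is fetched; if it has, it is skipped. Suppose a source page has 30 links and a batch contains 100,000 source pages, 3 million...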
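
The check-then-fetch flow described above might look like the following sketch. The class `SeenFilter` and the function `select_unseen` are hypothetical names, and deriving the probe positions from `std::hash` by double hashing is an assumption of this example, not something the text prescribes. Each crawler thread would own its own `SeenFilter`, so no locking is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Small Bloom filter; probe i uses position (h1 + i * h2) mod m.
class SeenFilter {
public:
    SeenFilter(std::size_t bit_count, int hash_count)
        : bits_(bit_count, false), k_(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }

    // Marks the URL as seen; returns true if it was possibly seen before.
    bool test_and_set(const std::string& url) {
        const std::size_t h1 = std::hash<std::string>{}(url);
        const std::size_t h2 = std::hash<std::string>{}(url + "#") | 1;  // force an odd step
        bool seen = true;
        for (int i = 0; i < k_; ++i) {
            const std::size_t pos = (h1 + static_cast<std::size_t>(i) * h2) % bits_.size();
            if (!bits_[pos]) {
                seen = false;       // at least one bit was still 0: definitely new
                bits_[pos] = true;
            }
        }
        return seen;
    }

private:
    std::vector<bool> bits_;
    int k_;
};

// Returns the links of one source page that have not been fetched yet.
std::vector<std::string> select_unseen(const std::vector<std::string>& links,
                                       SeenFilter& filter) {
    std::vector<std::string> to_fetch;
    for (const std::string& link : links) {
        if (!filter.test_and_set(link)) to_fetch.push_back(link);
    }
    return to_fetch;
}
```

Because a Bloom filter can report false positives, a small fraction of new links may be skipped. It never reports a fetched link as new.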

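Deduplicating Data with a Bloom Filter in Python

1. An intuitive approach to data deduplication
To deduplicate a data set D of length n, we usually use the following algorithm:
S1. Take the x-th item from the data (1 ≤ x < n).
S2. Take the y-th item from the data (x < y ≤ n).
S3. Compare D[x] and D[y]; if they are equal, discard D[y]. Repeat S2 and S3 until y = n.
S4. Repeat S1, S2 and S3 until x = n - 1.
The time complexity of this algorithm is approximately T(n) = O(1 ...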
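
Steps S1 to S4 above can be sketched as a pair of nested loops. This is an illustrative version in C++ with a hypothetical function name, `dedup_pairwise`; it uses 0-based indices where the steps count from 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive pairwise deduplication: compares each element with every later one
// and drops the later duplicates. When all n elements are distinct this
// performs n * (n - 1) / 2 comparisons.
std::vector<std::string> dedup_pairwise(std::vector<std::string> data) {
    for (std::size_t x = 0; x + 1 < data.size(); ++x) {        // S1, repeated as in S4
        for (std::size_t y = x + 1; y < data.size(); ) {       // S2
            if (data[x] == data[y]) {
                // S3: discard D[y]; do not advance y, the next element shifted into its place
                data.erase(data.begin() + static_cast<std::ptrdiff_t>(y));
            } else {
                ++y;
            }
        }
    }
    return data;
}
```

The quadratic number of comparisons is what makes this approach impractical for large data sets and motivates a hash-based structure such as a Bloom filter.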
