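Sphinx Real-Time Indexing

When a database holds a large amount of data and new rows keep arriving that must also be searchable, rebuilding the whole index every time wastes resources. The "main index + delta index" approach solves this. Its basic principle is to define two data sources and two indexes.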

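1. Create a counter

A simple implementation is to add a counter table to the database. It records the document ID that divides the document set into two parts, and it is updated every time the main index is rebuilt.

First create the counter table in MySQL: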

create table sph_counter( counter_id integer primary key not null, max_doc_id integer not null);

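2. Modify the configuration file

Four pieces are involved: the main source, an inherited source, the main index, and an inherited index (the inherited index is the delta index).

In the main source, change the pre-query (sql_query_pre) to the statement shown in the source block that follows: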

vi /usr/local/coreseek/etc/csft.conf

source main
{
    # Change sql_query_pre to the following statement
    sql_query_pre = replace into sph_counter select 1, ifnull(max(id),0) from post
    sql_query = \
        select id, title, content from post \
        where id<=(select max_doc_id from sph_counter where counter_id=1)
    # remaining settings unchanged
}

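Inherited source:

source delta : main
{
    # Override sql_query_pre so the inherited counter update does not run for the delta
    sql_query_pre = set names utf8
    sql_query = \
        select id, title, content from post \
        where id>(select max_doc_id from sph_counter where counter_id=1)
}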
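The way the counter splits the rows between the two sources can be tried out without Sphinx. The following is a minimal sketch in Python, assuming an in-memory SQLite database stands in for MySQL (SQLite happens to accept the same `replace into` and `ifnull` syntax). The sample rows and the helper names `ids` and `demo` are made up for illustration; only the three SQL statements come from the configuration.

```python
import sqlite3

# The three statements from the configuration (only the id column is selected here).
MAIN_PRE = "replace into sph_counter select 1, ifnull(max(id), 0) from post"
MAIN_QUERY = (
    "select id from post "
    "where id <= (select max_doc_id from sph_counter where counter_id = 1) "
    "order by id"
)
DELTA_QUERY = (
    "select id from post "
    "where id > (select max_doc_id from sph_counter where counter_id = 1) "
    "order by id"
)


def ids(conn, sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


def demo():
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute(
        "create table post (id integer primary key, title text, content text)"
    )
    conn.execute(
        "create table sph_counter ("
        "counter_id integer primary key not null, max_doc_id integer not null)"
    )
    insert = "insert into post (id, title, content) values (?, ?, ?)"
    conn.executemany(insert, [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")])

    # Rebuilding main: the pre-query snapshots the current maximum id ...
    conn.execute(MAIN_PRE)
    # ... and the main query indexes everything up to that id.
    main_ids = ids(conn, MAIN_QUERY)

    # New rows arrive after the main index was built.
    conn.executemany(insert, [(4, "d", "new"), (5, "e", "new")])

    # Rebuilding delta: only rows above the stored id are picked up.
    delta_ids = ids(conn, DELTA_QUERY)
    return main_ids, delta_ids


if __name__ == "__main__":
    print(demo())
```

Because the delta source overrides sql_query_pre, rebuilding the delta never moves the counter; only a rebuild of the main index does.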

Main index:

Change the names to the matching ones.

index main
{
    source = main
    path = /usr/local/coreseek/var/data/main
    # remaining settings unchanged
}

Inherited index (this is the delta index):

index delta : main
{
    source = delta
    path = /usr/local/coreseek/var/data/delta
}

The remaining settings mostly need no changes.

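Note: if the delta source selects only three columns (id, title, content) while the main source selects four (id, title, date, content), merging the two indexes fails with a mismatch in the number of attributes. For example:

delta: sql_query = select id, title, content from post

main: sql_query = select id, title, date, content from post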
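A mismatch like the one in this note can be caught before running indexer. The sketch below is a hypothetical helper, not part of Sphinx: it assumes both queries are simple `select ... from ...` statements with a plain comma-separated column list (no functions or subqueries in the select list) and compares the two lists.

```python
import re


def select_columns(sql_query):
    """Return the column names between SELECT and FROM, in order."""
    match = re.search(
        r"select\s+(.*?)\s+from\s", sql_query, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("not a simple SELECT ... FROM query")
    return [column.strip() for column in match.group(1).split(",")]


def same_schema(main_query, delta_query):
    """True when both queries select the same columns in the same order."""
    return select_columns(main_query) == select_columns(delta_query)


if __name__ == "__main__":
    main_query = "select id,title,date,content from post"
    delta_query = "select id, title,content from post"
    print(select_columns(main_query))
    print(select_columns(delta_query))
    print(same_schema(main_query, delta_query))
```

With the two queries from the note, the column lists differ, so the check returns False.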

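3. Test the delta index together with the main index

To check that the delta index works, insert some rows into the table and search for them. At this point the search should return nothing. Then rebuild the delta index on its own: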

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf delta

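Check whether the new records were indexed. If they were, searching with the /usr/local/coreseek/bin/search tool now returns 0 results from the main index and finds matches in the delta index. This assumes, of course, that the search term occurs only in the rows inserted afterwards.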

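4. Keep the indexes up to date

This takes two scripts and a scheduled task (cron).

Create one script for the main index and one for the delta index:

main.sh delta.sh

Put the following in delta.sh: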

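#!/bin/bash
# delta.sh: rebuild the delta index and rotate it into the running searchd
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf delta --rotate >> /usr/local/coreseek/var/log/delta.log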

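Put the following in main.sh. It rebuilds the main index; because the rebuild also updates the counter, the rows that were in the delta index are folded into the main index:

#!/bin/bash
# main.sh: rebuild the main index and rotate it into the running searchd
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf main --rotate >> /usr/local/coreseek/var/log/merge.log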

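Finally, the scripts must run automatically, so that the delta index is rebuilt every 5 minutes and the main index only at 2:30 a.m.

With the scripts written, create the scheduled tasks: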

crontab -e

*/5 * * * * /usr/local/coreseek/etc/delta.sh

30 2 * * * /usr/local/coreseek/etc/main.sh

The first entry runs every 5 minutes.

The second entry runs every day at 2:30 a.m.
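The step value in the minute field decides how often the delta script fires, so it is worth checking against the interval you intend. The sketch below is a minimal illustration, not a cron parser: the hypothetical helper `cron_minutes` handles only the two minute-field forms used here, `*/N` and a single number.

```python
def cron_minutes(field):
    """Expand a cron minute field of the form '*/N' or a single number.

    Only these two forms are handled; ranges and lists are out of scope.
    """
    if field.startswith("*/"):
        step = int(field[2:])
        # '*/N' matches every minute from 0 to 59 that is a multiple of N.
        return list(range(0, 60, step))
    return [int(field)]


if __name__ == "__main__":
    print(cron_minutes("*/5"))   # every 5 minutes: 12 runs per hour
    print(cron_minutes("*/10"))  # every 10 minutes: 6 runs per hour
    print(cron_minutes("30"))    # minute 30 only, as in "30 2 * * *"
```

A minute field of `*/5` therefore gives the 5-minute rebuild described in the text, while `*/10` would halve the frequency.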

Make the scripts executable:

chmod a+x delta.sh

chmod a+x main.sh

To verify that everything works, check the log files.

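Distributed indexes

Distributed searching improves query latency and raises throughput in multi-server, multi-CPU or multi-core environments. It is essential for search applications over large data sets (billions of records and terabytes of text).

The idea is to partition the data horizontally (HP, horizontal partitioning) and then process the partitions in parallel.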

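When searchd receives a query against a distributed index, it does the following:

1. Connects to the remote agents.

2. Issues the query to them.

3. Searches the local indexes.

4. Retrieves the search results from the remote agents.

5. Merges all the results and removes duplicates.

6. Returns the merged results to the client.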

index dist
{
    type = distributed
    local = chunk1
    # agent on the same machine
    agent = localhost:9312:chunk2
    # remote agents
    agent = 192.168.100.2:9312:chunk3
    agent = 192.168.100.3:9312:chunk4
}

chunk1 through chunk4 are index names.
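The six query steps listed above can be sketched in a few lines of Python. This is only an illustration of the flow, not how searchd is implemented: the agents and the local index are modelled as plain callables that return (doc_id, weight) pairs, the name `search_distributed` is made up, and keeping the highest weight for a duplicated document is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def search_distributed(query, local_search, agents):
    """Query local and remote partitions, then merge the results.

    local_search and each agent are callables taking the query and
    returning a list of (doc_id, weight) pairs.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        # Steps 1-2: contact every remote agent and send it the query.
        futures = [pool.submit(agent, query) for agent in agents]
        # Step 3: search the local index while the agents are working.
        results = list(local_search(query))
        # Step 4: collect the results from the remote agents.
        for future in futures:
            results.extend(future.result())

    # Step 5: merge and drop duplicates (assumption: keep the best weight).
    best = {}
    for doc_id, weight in results:
        if doc_id not in best or weight > best[doc_id]:
            best[doc_id] = weight

    # Step 6: hand back the merged list, highest weight first.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    local = lambda q: [(1, 5), (2, 3)]
    agent_a = lambda q: [(2, 4), (3, 1)]
    agent_b = lambda q: [(4, 9)]
    print(search_distributed("test", local, [agent_a, agent_b]))
```

In the example run, document 2 is reported by both the local partition and one agent, and appears once in the merged output.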
