Named Entity Recognition in Practice (ALBERT + CRF)

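This project recognizes entities in queries. Because of the nature of the business, the entities in a query are dense and contiguous.

What follows is a trial of ALBERT on that project. The goal was only to get familiar with the ALBERT workflow, and the results turned out acceptable.

It uses the bert4keras package; thanks to its author.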

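Two points to watch:

(1) Check the version of the Chinese ALBERT pretrained weights carefully; it must match the version that ** requires.

(2) When encoding, remember that a prefix and a suffix token ([CLS], [SEP]) are added, and the tag for each of them is O.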
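Point (2) can be sketched in plain Python, independent of bert4keras. The function name `align_tags`, the `[PAD]` filler, the sample sentence, and the tag names are assumptions made for illustration only; the padding with O mirrors what the implementation in this post does for tags.

```python
def align_tags(tokens, tags, max_len, outside_tag="O"):
    """Wrap tokens with [CLS]/[SEP], tag both as outside, then pad or truncate."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    wrapped_tokens = ["[CLS]"] + list(tokens) + ["[SEP]"]
    wrapped_tags = [outside_tag] + list(tags) + [outside_tag]
    # Pad both sequences to max_len; padded positions also get the outside tag.
    pad = max_len - len(wrapped_tokens)
    if pad > 0:
        wrapped_tokens += ["[PAD]"] * pad
        wrapped_tags += [outside_tag] * pad
    # Truncate both sequences identically so they stay aligned.
    return wrapped_tokens[:max_len], wrapped_tags[:max_len]


# Hypothetical character-level sample with made-up tag names.
tokens = list("北京烤鸭")
tags = ["B-LOC", "I-LOC", "B-FOOD", "I-FOOD"]
aligned_tokens, aligned_tags = align_tags(tokens, tags, max_len=8)
print(list(zip(aligned_tokens, aligned_tags)))
```

The key property is that the token and tag sequences always have the same length, so position i of the model output lines up with position i of the labels.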

The implementation is as follows:

import numpy as np
from bert4keras.backend import keras, set_gelu
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from keras.layers import Lambda, Dense
from keras import layers
from keras_contrib.layers import CRF
from keras.models import Model, model_from_json
from keras.utils.np_utils import to_categorical
import random
import math
import os
import json
from keras.optimizers import Adam, SGD  # overrides the bert4keras Adam imported above
import keras.backend.tensorflow_backend as KTF
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # do not grab all GPU memory; allocate on demand
sess = tf.Session(config=config)
KTF.set_session(sess)  # make Keras use the session configured above

set_gelu('tanh')  # switch the GELU variant

config_path = 'albert_large_google_zh/albert_config.json'
checkpoint_path = 'albert_large_google_zh/albert_model.ckpt'
dict_path = 'albert_large_google_zh/vocab.txt'

# build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)


class ner_model_resume(object):
    def __init__(self):
        self.max_sentence_len = 15 + 2  # 15 tokens plus [CLS] and [SEP]
        self.class_num = 0  # number of tag classes
        self.word_num = 0  # vocabulary size
        self.word2id = None
        self.tag2id = None
        self.id2tag = {}
        self.batch_size = 128
        self.conv_size = 256
        self.model_load = None

    def build_model(self):
        # word2id and class_num must be set before this method is called
        model_bert = build_transformer_model(
            config_path,
            checkpoint_path,
            model='albert',
        )
        model_bert.trainable = False  # freeze the ALBERT weights
        # ALBERT shares one set of layers, so pick the output by call index
        output_layer = 'Transformer-FeedForward-Norm'
        output = model_bert.get_layer(output_layer).get_output_at(12 - 1)
        # dense = layers.TimeDistributed(Dense(len(self.word2id), activation="relu"), name="time_distributed")(output)
        output = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(output)
        # dense = layers.TimeDistributed(Dense(len(self.word2id), activation="softmax"), name="time_distributed")(output)
        dense = layers.Dense(len(self.word2id), activation="relu")(output)
        crf = CRF(self.class_num, sparse_target=False)
        crf_res = crf(dense)
        model = Model(model_bert.input, crf_res)
        adam = Adam(lr=0.000005)
        model.compile(optimizer=adam, loss=crf.loss_function, metrics=[crf.accuracy])
        model.summary()
        return model

    def gene_batch_data(self, sent_list, tag_list, word2id, tag2id):
        # convert tokens and tags to ids
        sent_list_id = []
        seg_list_id = []
        tag_list_id = []
        for sent in sent_list:
            # sent is a list of tokens, so the special tokens are added by hand
            sent = ["[CLS]"] + sent + ["[SEP]"]
            # sent = "".join(sent)
            # first_length is the keyword used in the original code; it depends on the bert4keras version
            token_ids, segment_ids = tokenizer.encode(sent, first_length=self.max_sentence_len)
            # guard in case encode pads but does not truncate
            sent_list_id.append(token_ids[:self.max_sentence_len])
            seg_list_id.append(segment_ids[:self.max_sentence_len])
        for tag in tag_list:
            tmp_tag_list_id = []
            tag = ["O"] + tag + ["O"]  # tags for [CLS] and [SEP]
            for t in tag:
                tmp_tag_list_id.append(tag2id[t])
            if len(tmp_tag_list_id) < self.max_sentence_len:
                tmp_tag_list_id = tmp_tag_list_id + [tag2id["O"]] * (self.max_sentence_len - len(tmp_tag_list_id))
            if len(tmp_tag_list_id) >= self.max_sentence_len:
                tmp_tag_list_id = tmp_tag_list_id[0:self.max_sentence_len]
            tag_list_id.append(tmp_tag_list_id)
        train_x = np.stack(sent_list_id, axis=0)
        train_x1 = np.stack(seg_list_id, axis=0)
        train_y = np.stack(tag_list_id, axis=0)
        train_y = to_categorical(train_y, num_classes=self.class_num)
        # # shuffle and split
        # cc = list(zip(train_x, train_y))
        # random.shuffle(cc)
        # train_x[:], train_y[:] = zip(*cc)
        return [train_x, train_x1], train_y
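The `to_categorical` call in `gene_batch_data` turns integer tag ids into one-hot vectors, which is the label format the CRF layer expects when `sparse_target=False`. The sketch below reproduces that shape change with NumPy only, so it runs without Keras. The helper name `one_hot_tags` and the sample `tag2id` mapping are hypothetical, and the helper is an illustration of the effect rather than the Keras implementation.

```python
import numpy as np


def one_hot_tags(tag_ids, class_num):
    """Map integer tag ids of shape (batch, seq_len) to one-hot (batch, seq_len, class_num)."""
    tag_ids = np.asarray(tag_ids, dtype="int64")
    # Indexing the identity matrix by id selects the matching one-hot row.
    return np.eye(class_num, dtype="float32")[tag_ids]


# Hypothetical tag vocabulary and a batch of two padded tag sequences.
tag2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}
batch = [
    [0, 1, 2, 0],
    [0, 0, 0, 0],
]
train_y = one_hot_tags(batch, class_num=len(tag2id))
print(train_y.shape)
```

Each position carries exactly one 1.0, at the index of its tag id, so the last axis has length `class_num`.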
