Perl HTML parsing modules

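HTML::TreeBuilder

This parser is built on the powerful HTML::Element module. When parsing, HTML::TreeBuilder converts the whole HTML document into a Perl data structure (a tree of HTML::Element objects) that you can manipulate freely.

Start by creating an HTML::TreeBuilder object.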

use Data::Dumper qw(Dumper);
$Data::Dumper::Indent = 1;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

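Passing a file name directly to HTML::TreeBuilder seems to convert Chinese text into Unicode characters, so it is usual to pass a filehandle instead. The string or filehandle you pass should also be decoded as UTF-8 first; otherwise you get a warning:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/ywb/temp/t.pl line 16, <DATA> line 5.

The only difference between parsing a file and parsing a string is that the former uses the parse_file method and the latter uses parse. Here is an example that parses a filehandle: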

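binmode DATA, ":encoding(UTF-8)";
$tree->parse_file(\*DATA);
print Dumper($tree), "\n";

# Sample input: the cell values x, y, 1, 2 come from the original listing;
# the table markup around them is an assumed reconstruction.
__DATA__
<html>
<body>
<table>
<tr><td>x</td><td>y</td></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</body>
</html>

To extract the contents of the table, you can do this: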

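foreach my $row ( $tree->find_by_tag_name("tr") ) {
    # Print the text of every cell in this row, separated by tabs.
    foreach my $cell ( $row->find_by_tag_name("td") ) {
        print $cell->as_text, "\t";
    }
    print "\n";
}

HTML::Element forces all tag names to lowercase, so you do not need to worry about the case of tags.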
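For the string case mentioned above, a minimal sketch might look like this. It assumes the input arrives as raw UTF-8 bytes (the $bytes literal is made up for illustration) and decodes it with the core Encode module before calling parse. The eof call tells the parser that the document is complete, and delete releases the tree afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder;

# Hypothetical input: raw UTF-8 bytes, as they might come from a socket or file.
my $bytes = "<html><body><p>\xe4\xb8\xad\xe6\x96\x87</p></body></html>";

# Decode to Perl characters first to avoid the "undecoded UTF-8" warning.
my $html = decode('UTF-8', $bytes);

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;                              # signal end of input when using parse()

my $p    = $tree->find_by_tag_name('p'); # scalar context: first match only
my $text = $p->as_text;

binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";

$tree->delete;                           # release the tree when done
```

Unlike parse_file, parse can be called several times with successive chunks of a document, which is why the explicit eof is needed.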

HTML::TokeParser

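Unlike modules such as HTML::Parser, HTML::TokeParser parses an HTML file as a stream of tokens. During parsing, the HTML text is converted into these six kinds of token (start tag, end tag, text, comment, declaration, and processing instruction):

["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]

This example should show some characteristics of how the module parses: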

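use HTML::TokeParser;
use Data::Dumper qw(Dumper);

my $file = \*DATA;
my $parser = HTML::TokeParser->new($file)
    or die "Can't open $file: $!\n";

my (@table, @row, $inrow);
while (my $token = $parser->get_token) {
    my ($type, $value) = @$token;   # $value is the tag name or the text
    if ( $type eq 'T' ) {
        # Keep non-blank text that appears inside a row.
        push @row, $value if $inrow && $value =~ /\S/;
    } elsif ( $type eq 'S' ) {
        if ( $value eq 'tr' ) { $inrow = 1; @row = () }
    } elsif ( $type eq 'E' ) {
        if ( $value eq 'tr' ) { $inrow = 0; push @table, [@row] }
    }
}
print Dumper(\@table), "\n";

# Sample input: the cell values come from the original listing;
# the table markup is an assumed reconstruction.
__DATA__
<table>
<tr><td>x</td><td>y</td></tr>
<tr><td>1</td><td>2</td></tr>
</table>

Compared with the earlier HTML::TreeBuilder example this may look a little cumbersome, but in many cases you only need to handle one token at a time, and then this module is very convenient. For example, to collect all the <img> tags or all the links in an HTML document, you can write: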

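my @images;
# $parser is an HTML::TokeParser object created on \*DATA, as in the previous example.
while (my $token = $parser->get_token) {
    my ($type, $tag, $attr) = @$token;
    push @images, $attr->{src} if $type eq 'S' && $tag eq 'img';
}
print Dumper(\@images), "\n";

# Illustrative sample input; the data of the original listing was not preserved.
__DATA__
<p><img src="a.png"> <img src="b.png"></p>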
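For the link case, HTML::TokeParser also provides the get_tag and get_trimmed_text convenience methods, so a sketch could look like the following. The sample HTML string and its example.com URLs are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample document; a scalar reference makes the parser read a string.
my $html = '<p><a href="http://www.example.com/a">first</a> and '
         . '<a href="http://www.example.com/b">second</a></p>';

my $parser = HTML::TokeParser->new(\$html)
    or die "Can't create parser\n";

my @links;
# get_tag('a') skips ahead to the next <a> start tag and returns
# [$tag, $attr, $attrseq, $text], or undef at the end of the document.
while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href};
    next unless defined $href;
    my $label = $parser->get_trimmed_text('/a');   # text up to </a>
    push @links, [$href, $label];
}

print "$_->[0]\t$_->[1]\n" for @links;
```

Because get_tag discards everything between the tags it is asked for, this form is shorter than the get_token loop when only one kind of tag matters.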

HTML::LinkExtor

If you want to extract the links from an HTML file, there is no need to write the code yourself: the dedicated module HTML::LinkExtor does it. A simple example:

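require HTML::LinkExtor;
use Data::Dumper qw(Dumper);

my $p = HTML::LinkExtor->new();
$p->parse_file(\*DATA);
print Dumper($p->links), "\n";

# Sample input: the link text "x" is from the original listing;
# the markup and the example.com URL are assumed.
__DATA__
<a href="http://www.example.com/">x</a>

The new method of HTML::LinkExtor accepts a callback function, which is called each time a link is found. The first argument passed to the callback is the kind of link, that is the tag name, such as 'a' or 'img'; the remaining arguments are the link attributes as name/value pairs. If you supply a callback, HTML::LinkExtor no longer accumulates links, which means you can no longer use the links method to get them all. To collect every link, you have to save them yourself inside the callback.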

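require HTML::LinkExtor;

my $p = HTML::LinkExtor->new(\&cb);
$p->parse_file(\*DATA);

# Called once per link: the tag name, then attribute name/value pairs.
sub cb {
    my ($tag, %attr) = @_;
    print "$tag @{[ %attr ]}\n";
}

# Sample input: the link text "x" is from the original listing;
# the markup and the example.com URL are assumed.
__DATA__
<a href="http://www.example.com/">x</a>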
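Saving the links yourself can be sketched with a closure over an array, as below. The sample markup passed to parse is made up for illustration, and the hash layout used for each stored link is just one possible choice.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @links;   # our own storage: links are not accumulated when a callback is given

my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;              # e.g. ('a', href => '...')
    push @links, { tag => $tag, %attr };
});

# Made-up sample input for illustration.
$p->parse('<a href="http://www.example.com/">x</a><img src="logo.png">');
$p->eof;

for my $link (@links) {
    my $url = $link->{href} // $link->{src};
    print "$link->{tag}: $url\n";
}
```

Keeping the tag name next to the attributes makes it easy to filter afterwards, for example to keep only the 'a' entries.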

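HTML::HeadParser

If you only need the title of an HTML document or other content inside the head element, there is no need for a heavyweight module such as HTML::TreeBuilder. HTML::HeadParser handles this job and is quite simple to use.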

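require HTML::HeadParser;
use Data::Dumper qw(Dumper);

my $p = HTML::HeadParser->new;
my $text = join('', <DATA>);
$p->parse($text) and print "not finished";

# to access <title>....</title>
print "Title: ", $p->header('Title'), "\n";
# to access <base href="http://...">
print "Base: ", $p->header('Content-Base'), "\n";
# to access <meta http-equiv="Content-Type" content="...">
print "Content type: ", $p->header('Content-Type'), "\n";
# to access <meta name="author" content="...">
print "Author: ", $p->header('X-Meta-Author'), "\n";

print Dumper($p->header), "\n";

# Illustrative sample input; the data of the original listing was not preserved.
__DATA__
<html>
<head>
<title>Sample page</title>
<base href="http://www.example.com/">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="author" content="A. N. Author">
<link rel="stylesheet" href="style.css">
<style type="text/css">body { margin: 0 }</style>
<script type="text/javascript">var x = 1;</script>
</head>
<body></body>
</html>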

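As the output shows, it does not extract the contents of tags such as script, link, and style (recent versions of HTML::HeadParser may report link elements as a Link header).

HTML::TableExtract

HTML::TableExtract only extracts the contents of tables from HTML. If that is all you need, the module is very easy to use.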

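use HTML::TableExtract;
use Data::Dumper qw(Dumper);

my $html_string = join("", <DATA>);
my $te = HTML::TableExtract->new();
$te->parse($html_string);
print Dumper($te), "\n";

# Print every table found, one row per line.
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

# Sample input: the cell values come from the original listing;
# the markup and the example.com URLs are assumed.
__DATA__
<table>
<tr><td><a href="http://www.example.com/x">x</a></td><td><a href="http://www.example.com/y">y</a></td></tr>
<tr><td>1</td><td>2</td></tr>
</table>

As this example shows, link information is lost after parsing, but the contents of the table are easy to obtain.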
