[1] LIU Shaotao, LI Hongsheng. Topic Crawler Algorithm With Link Structure[J]. Journal of Huaqiao University (Natural Science), 2017, 38(2): 195-200. [doi:10.11830/ISSN.1000-5013.201702012]

Topic Crawler Algorithm With Link Structure

Journal of Huaqiao University (Natural Science) [ISSN: 1000-5013 / CN: 35-1079/N]

Volume:
Vol. 38
Issue:
No. 2, 2017
Pages:
195-200
Publication date:
2017-03-20

Article Info

Title:
Topic Crawler Algorithm With Link Structure
Article ID:
1000-5013(2017)02-0195-06
Author(s):
LIU Shaotao, LI Hongsheng
College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China
Keywords:
Best-First algorithm; link structure; HITS algorithm; crawling strategy
CLC number:
TP311
DOI:
10.11830/ISSN.1000-5013.201702012
Document code:
A
Abstract:
By analyzing Best-First, a content-based link selection algorithm, and introducing the HITS (hyperlink-induced topic search) algorithm, which reflects the value of a link, a new link selection strategy is proposed. By combining the two algorithms, the new crawler considers not only page content but also link structure, so that both topic relevance and authority are maintained during downloading and the "short-sighted" behavior of the crawler in the crawling stage is alleviated. Experimental results show that the new crawling strategy performs better than the Best-First algorithm alone.
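The abstract does not give the exact scoring formula, so the following is only a minimal sketch of the general idea under stated assumptions: each candidate link receives a Best-First content score (cosine similarity between the topic and the page on which the link was found) and a HITS authority score computed over the link graph seen so far, and the two are blended linearly. The weight `alpha`, the bag-of-words similarity, and the function names `cosine_similarity`, `hits`, and `rank_frontier` are illustrative assumptions, not the authors' implementation.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hits(graph, iterations=50):
    """Iterative HITS on {page: [pages it links to]}; returns (authority, hub)."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = dict.fromkeys(nodes, 0.0)
        for u, targets in graph.items():
            for v in targets:
                new_auth[v] += hub[u]
        # Hub score of a page: sum of the authority scores of pages it links to.
        new_hub = {u: sum(new_auth[v] for v in graph.get(u, ())) for u in nodes}
        # L2-normalise both vectors so the scores stay bounded.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
        auth, hub = new_auth, new_hub
    return auth, hub


def rank_frontier(candidates, graph, topic, alpha=0.5):
    """Order candidate URLs by alpha * content score + (1 - alpha) * authority.

    candidates maps each URL to the text of the page it was found on
    (the Best-First convention). alpha is an assumed blending weight.
    """
    auth, _ = hits(graph)
    heap = []
    for url, parent_text in candidates.items():
        content = cosine_similarity(parent_text, topic)
        score = alpha * content + (1 - alpha) * auth.get(url, 0.0)
        heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With `alpha=1.0` the ranking reduces to plain Best-First. Lowering `alpha` lets a link that is pointed to by many good hubs move ahead of a link found on an equally relevant page, which is one way to read the abstract's claim about easing the crawler's "short-sighted" behavior.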


Memo:
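Received: 2015-06-24
Corresponding author: LIU Shaotao (b. 1969), male, associate professor; research interests: software architecture and software reuse. E-mail: shaotaol@hqu.edu.cn
Funding: Supported by the Scientific Research Fund of the Fujian Provincial Department of Science and Technology (2011H6016)
Last Update: 2017-03-20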