Research Article

Identifying Informative Web Content Blocks using Web Page Segmentation

by Stevina Dias, Jayant Gadge
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 7 - Number 1
Year of Publication: 2014
Authors: Stevina Dias, Jayant Gadge
10.5120/ijais14-451129

Stevina Dias, Jayant Gadge. Identifying Informative Web Content Blocks using Web Page Segmentation. International Journal of Applied Information Systems. 7, 1 (April 2014), 37-41. DOI=10.5120/ijais14-451129

@article{ 10.5120/ijais14-451129,
author = { Stevina Dias and Jayant Gadge },
title = { Identifying Informative Web Content Blocks using Web Page Segmentation },
journal = { International Journal of Applied Information Systems },
issue_date = { April 2014 },
volume = { 7 },
number = { 1 },
month = { April },
year = { 2014 },
issn = { 2249-0868 },
pages = { 37-41 },
numpages = {9},
url = { https://www.ijais.org/archives/volume7/number1/614-1129/ },
doi = { 10.5120/ijais14-451129 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T18:54:31.309218+05:30
%A Stevina Dias
%A Jayant Gadge
%T Identifying Informative Web Content Blocks using Web Page Segmentation
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 7
%N 1
%P 37-41
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Information extraction has become an important task for discovering useful knowledge on the Web. A crawler, which gathers information from the Web, is one of the fundamental necessities of information extraction. A search engine uses a crawler to crawl and index web pages, and it takes only the informative content into account for indexing. Besides informative content, web pages commonly contain blocks that are not part of the main content; these are called non-informative blocks, or noise. Noise is generally unrelated to the main content of the page and affects two major parameters of a search engine: the precision of search and the size of the index. Cleaning web pages is therefore critical for improving the performance of information retrieval. The main objective of the proposed technique is to eliminate the non-informative content blocks from a web page. The extraction of informative content blocks and the elimination of non-informative blocks are based on web page segmentation: a web page is divided into n blocks, and an importance value is calculated for each block. Blocks whose importance is greater than or equal to a threshold are kept as important blocks, and the remaining blocks are eliminated as noise. The proposed approach saves significant space and time.
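The thresholding step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how block importance is computed, so the `importance` function below uses a stand-in heuristic (the share of a block's text that is not link text), and the `Block` structure, the function names, and the sample threshold are all hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One segment of a web page (hypothetical structure, not from the paper)."""
    text: str        # all visible text in the block
    link_text: str   # the part of that text that sits inside hyperlinks


def importance(block: Block) -> float:
    """Assumed stand-in score: share of the block's text that is not link text.

    The abstract does not define block importance; this heuristic is only
    a placeholder so that the thresholding step can be demonstrated.
    """
    total = len(block.text)
    if total == 0:
        return 0.0
    return 1.0 - min(len(block.link_text), total) / total


def split_blocks(blocks: list[Block], threshold: float) -> tuple[list[Block], list[Block]]:
    """Keep blocks with importance >= threshold; treat the rest as noise."""
    informative, noisy = [], []
    for block in blocks:
        if importance(block) >= threshold:
            informative.append(block)
        else:
            noisy.append(block)
    return informative, noisy


# A page already segmented into n = 3 blocks (segmentation itself is not shown).
page = [
    Block(text="Home News Sports Contact", link_text="Home News Sports Contact"),
    Block(text="The council approved the new budget on Monday.", link_text=""),
    Block(text="Read more about related stories", link_text="related stories"),
]
informative, noisy = split_blocks(page, threshold=0.6)
```

With this sample page and a threshold of 0.6, only the article sentence is kept as informative; the navigation bar and the "related stories" block fall below the threshold and are discarded as noise. Any other scoring function could be substituted for `importance` without changing the filtering logic.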

Index Terms

Computer Science
Information Sciences

Keywords

Search engine, information extraction, web content mining, web segmentation, repetition detection, informative blocks, non-informative blocks, noise