Identifying Informative Web Content Blocks using Web Page Segmentation

Stevina Dias; Jayant Gadge

Research Article

International Journal of Applied Information Systems

Foundation of Computer Science (FCS), NY, USA

Volume 7 - Number 1

Year of Publication: 2014

Authors: Stevina Dias, Jayant Gadge

10.5120/ijais14-451129

Stevina Dias, Jayant Gadge. Identifying Informative Web Content Blocks using Web Page Segmentation. International Journal of Applied Information Systems. 7, 1 (April 2014), 37-41. DOI=10.5120/ijais14-451129

@article{ 10.5120/ijais14-451129,
author = { Stevina Dias and Jayant Gadge },
title = { Identifying Informative Web Content Blocks using Web Page Segmentation },
journal = { International Journal of Applied Information Systems },
issue_date = { April 2014 },
volume = { 7 },
number = { 1 },
month = { April },
year = { 2014 },
issn = { 2249-0868 },
pages = { 37-41 },
numpages = {9},
url = { https://www.ijais.org/archives/volume7/number1/614-1129/ },
doi = { 10.5120/ijais14-451129 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2023-07-05T18:54:31.309218+05:30
%A Stevina Dias
%A Jayant Gadge
%T Identifying Informative Web Content Blocks using Web Page Segmentation
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 7
%N 1
%P 37-41
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Information extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers information from the Web, is one of the fundamental necessities of information extraction. A search engine uses a crawler to crawl and index web pages, and it takes into account only the informative content for indexing. In addition to informative content, web pages commonly contain blocks that are not part of the main content; these are called non-informative blocks or noise. Noise is generally unrelated to the main content of the page and affects two major parameters of search engines: the precision of search and the size of the index. To improve the performance of information retrieval, cleaning of web pages therefore becomes critical. The main objective of the proposed technique is to eliminate the non-informative content blocks from a web page. In the proposed technique, the extraction of informative content blocks and the elimination of non-informative blocks are based on the idea of web page segmentation. A web page is divided into n blocks, and the block importance is calculated for each block. Blocks with importance >= threshold are considered important blocks, and the remaining blocks are eliminated as noisy blocks. The proposed approach saves significant space and time.
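The thresholding step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how block importance is computed, so the `importance` function below is a hypothetical stand-in (the share of a block's text that lies outside hyperlinks), and the `Block` type, its fields, and the threshold value are likewise assumptions made only for illustration. The segmentation of the page into n blocks is assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One segment of a web page (hypothetical structure, not from the paper)."""
    block_id: str
    text: str       # all visible text in the block
    link_text: str  # the portion of that text that sits inside hyperlinks


def importance(block: Block) -> float:
    """Hypothetical importance score in [0, 1].

    Assumption: blocks dominated by link text (menus, footers) are noise,
    so the score is the fraction of text that is NOT inside links.
    The paper's actual importance measure may differ.
    """
    total = len(block.text)
    if total == 0:
        return 0.0
    linked = min(len(block.link_text), total)
    return 1.0 - linked / total


def split_blocks(blocks, threshold):
    """Keep blocks with importance >= threshold; the rest are treated as noise."""
    informative = [b for b in blocks if importance(b) >= threshold]
    noisy = [b for b in blocks if importance(b) < threshold]
    return informative, noisy


# Usage: a navigation bar made entirely of links versus an article body.
page = [
    Block("nav", "Home About Contact", "Home About Contact"),
    Block("article", "Noise removal improves index quality and search precision.", ""),
    Block("half", "abcd", "ab"),
]
informative, noisy = split_blocks(page, threshold=0.5)
```

With this stand-in score, the all-link navigation block falls below the threshold and is discarded, while a block whose score equals the threshold is kept, matching the `>=` comparison in the abstract.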

Index Terms

Computer Science

Information Sciences

Keywords

Search engine, information extraction, web content mining, web segmentation, repetition detection, informative blocks, non-informative blocks and noise