International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 2 - Number 2
Year of Publication: 2012
Authors: Shine N. Das, Pramod K. Vijayaraghavan, Midhun Mathew
DOI: 10.5120/ijais12-450272
Shine N. Das, Pramod K. Vijayaraghavan, Midhun Mathew. Eliminating Noisy Information in Web Pages using featured DOM tree. International Journal of Applied Information Systems. 2, 2 (May 2012), 27-34. DOI=10.5120/ijais12-450272
Retrieving exact information from the Web is now a great challenge that pushes researchers to devise new methodologies for web mining. The number and size of web pages appear to be growing at an exponential rate, and this growth is often accompanied by a large amount of noise such as banner advertisements, navigation bars, and copyright notices. Although such information items are functionally useful for human viewers and necessary for web site owners, they often hamper automated information gathering and web data mining. The presence of such noisy information degrades the efficiency of feature extraction and, ultimately, classification accuracy. Cleaning web pages before mining is therefore critical for improving the mining results. In this work, we focus on identifying and removing local noise in web pages to improve mining performance. We propose a novel and simple idea for detecting and removing local noise using a new tree structure called the featured DOM tree. A three-stage algorithm is proposed: feature selection is done in the first stage, a featured DOM tree is created in the second stage, and noise is marked and pruned in the third stage. The experimental results show that our algorithm performs favorably in terms of various benchmark measures, and that automatic web page classification achieves a higher F score and accuracy as a result.
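The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the selection criterion, the node annotations, or the pruning rule, so everything below is an assumption made for illustration: features are taken to be the most frequent terms in a few topic texts, each DOM node is annotated with the number of feature terms in its subtree, and a node is marked as noise when that count is zero. All names (`select_features`, `build_featured_tree`, `prune`, `text_of`) are hypothetical, and the parser assumes well-nested markup in which every tag is closed.

```python
from collections import Counter
from html.parser import HTMLParser
import re

WORD = re.compile(r"[a-z]{3,}")  # assumed tokenizer: lowercase words of 3+ letters


def select_features(texts, k):
    """Stage 1 (assumed criterion): keep the k most frequent terms."""
    counts = Counter(w for t in texts for w in WORD.findall(t.lower()))
    return {w for w, _ in counts.most_common(k)}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""
        self.score = 0       # feature terms found in this subtree
        self.noisy = False   # set during the pruning stage


class TreeBuilder(HTMLParser):
    """Builds a plain DOM-like tree; assumes every tag is closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data


def annotate(node, features):
    """Store on each node the number of feature terms in its subtree."""
    own = sum(1 for w in WORD.findall(node.text.lower()) if w in features)
    node.score = own + sum(annotate(c, features) for c in node.children)
    return node.score


def build_featured_tree(html, features):
    """Stage 2 (assumed form): a DOM tree annotated with feature counts."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root, features)
    return builder.root


def prune(node):
    """Stage 3 (assumed rule): mark zero-score children as noise, drop them."""
    for child in node.children:
        child.noisy = child.score == 0
    node.children = [c for c in node.children if not c.noisy]
    for child in node.children:
        prune(child)


def text_of(node):
    """Concatenate the text that survives pruning."""
    parts = [node.text.strip()] + [text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


topic_texts = ["dom tree noise removal for web mining",
               "web page noise and dom tree pruning"]
page = ("<html><body><div>Home About Contact</div>"
        "<p>Pruning the DOM tree removes noise</p>"
        "<div>Copyright 2012</div></body></html>")

features = select_features(topic_texts, 4)
root = build_featured_tree(page, features)
prune(root)
print(text_of(root))
```

In this toy page the navigation and copyright blocks contain none of the selected terms, so they are the nodes dropped by the assumed rule. A real implementation would need the criteria defined in the body of the paper and a parser that tolerates malformed HTML.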