A probabilistic reasoning approach for discovering web crawler sessions

Stassopoulou, Athena; Dikaiakos, Marios D.

Article

Date

2007

Author

Stassopoulou, Athena

Dikaiakos, Marios D.

ISSN

0302-9743

Source

Joint 9th Asia-Pacific Web Conference on Advances in Data and Web Management, APWeb 2007 and 8th International Conference on Web-Age Information Management, WAIM 2007

Volume

4505 LNCS

Pages

265-272

Keyword(s):

Robustness (control systems)

Websites

Probabilistic logics

Classification (of information)

Data mining

Learning systems

Machine learning techniques

Classification accuracy

Case based reasoning

Web crawler sessions

Abstract

In this paper we introduce a probabilistic-reasoning approach to distinguish Web robots (crawlers) from human visitors of Web sites. Our approach employs a Naive Bayes network to classify the HTTP sessions of a Web-server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic. The parameters of the Bayesian network are determined with machine learning techniques, and each session is assigned to the class with the maximum posterior probability, given the available evidence. Our method is applied to real Web logs and provides a classification accuracy of 95%. The high accuracy with which our system detects crawler sessions demonstrates the robustness and effectiveness of the proposed methodology. © Springer-Verlag Berlin Heidelberg 2007.
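
The classification step described in the abstract can be illustrated with a minimal Naive Bayes sketch. It is not the authors' implementation: the abstract does not list the evidence variables or the learning procedure, so the feature names (`robots_txt`, `image_ratio`), the Laplace smoothing parameter `alpha`, and the toy training sessions below are all assumptions made for illustration. The sketch estimates class priors and per-feature conditional probabilities from labeled sessions, then assigns a new session to the class with the highest posterior.

```python
import math
from collections import Counter, defaultdict


def train(sessions, labels, alpha=1.0):
    """Count class and feature-value frequencies from labeled sessions.

    `alpha` is a Laplace smoothing constant (an assumption, not from the paper).
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> values seen in training
    for session, label in zip(sessions, labels):
        for feature, value in session.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "feature_values": dict(feature_values),
        "alpha": alpha,
    }


def posteriors(model, session):
    """Return P(class | evidence) for every class, computed in log space."""
    alpha = model["alpha"]
    total = sum(model["class_counts"].values())
    log_scores = {}
    for label, n in model["class_counts"].items():
        score = math.log(n / total)  # log prior
        for feature, value in session.items():
            if feature not in model["feature_values"]:
                continue  # ignore evidence never seen during training
            k = len(model["feature_values"][feature])
            count = model["value_counts"][(label, feature)][value]
            score += math.log((count + alpha) / (n + alpha * k))
        log_scores[label] = score
    # Normalize the log scores into probabilities.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


def classify(model, session):
    """Pick the class with the maximum posterior probability."""
    post = posteriors(model, session)
    return max(post, key=post.get)


# Hypothetical toy data: two made-up evidence variables per session.
training_sessions = [
    {"robots_txt": True, "image_ratio": "low"},
    {"robots_txt": True, "image_ratio": "low"},
    {"robots_txt": False, "image_ratio": "low"},
    {"robots_txt": False, "image_ratio": "high"},
    {"robots_txt": False, "image_ratio": "high"},
    {"robots_txt": False, "image_ratio": "low"},
]
training_labels = ["crawler", "crawler", "crawler", "human", "human", "human"]

model = train(training_sessions, training_labels)
print(classify(model, {"robots_txt": True, "image_ratio": "low"}))
print(classify(model, {"robots_txt": False, "image_ratio": "high"}))
```

With this toy data, a session that requests `robots.txt` and fetches few images is labeled `crawler`, while one that skips `robots.txt` and fetches many images is labeled `human`. Working in log space avoids numerical underflow when many pieces of evidence are multiplied together.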

Links

https://www.scopus.com/inward/record.uri?eid=2-s2.0-38049025300&partnerID=40&md5=a4d707466c56826cc944d80d36f73e96