Many valuable text databases on the web have non-crawlable
contents that are "hidden" behind search interfaces. Hence traditional search
engines do not index this valuable information. One way to facilitate access
to "hidden-web" databases is through commercial Yahoo!-like directories, which
organize these databases manually into categories that users can browse. Our
QProber system automates the classification of searchable text
databases (whether their contents are "hidden" or not) by adaptively
probing the databases with queries derived from document classifiers, without
retrieving any documents. A large-scale experimental evaluation over 130 real
web databases indicates that our technique produces highly accurate database
classification results using, on average, fewer than 200 queries of four words
or fewer per database (TOIS'03 paper; SIGMOD'01 paper). Interestingly, our
technique is attractive for classifying
even crawlable text databases (i.e., databases whose contents are not
"hidden"), as long as search interfaces for the databases are available
(IEEE Data Engineering Bulletin'02 paper).
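
The probing idea described above can be illustrated with a minimal sketch. This is not the QProber implementation. It assumes that the search interface reports a number of matches for each query, and it uses a made-up category tree, hand-written probe queries (in the real system these are derived from document classifiers), and two hypothetical thresholds, min_share and min_matches. The sketch sums the match counts per category, descends only into categories whose probes account for a large enough share of the matches, and never retrieves a document.

```python
from typing import Callable, Dict, List

# Hypothetical category tree and probe queries, for illustration only.
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Health", "Sports"],
    "Health": ["Diseases", "Fitness"],
    "Sports": [],
    "Diseases": [],
    "Fitness": [],
}
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer treatment", "blood pressure"],
    "Sports": ["soccer league", "baseball season"],
    "Diseases": ["aids hiv", "diabetes insulin"],
    "Fitness": ["aerobic exercise", "weight training"],
}


def classify(
    match_count: Callable[[str], int],
    node: str = "Root",
    min_share: float = 0.5,   # assumed threshold on a category's share of matches
    min_matches: int = 10,    # assumed threshold on a category's absolute matches
) -> List[str]:
    """Return the categories assigned to a database.

    match_count(query) must return the number of matches the database
    reports for the query; no documents are fetched.
    """
    # Probe only the children of the current node (adaptive probing).
    counts = {
        child: sum(match_count(q) for q in PROBES[child])
        for child in CHILDREN[node]
    }
    total = sum(counts.values())

    assigned: List[str] = []
    for child, n in counts.items():
        if total > 0 and n >= min_matches and n / total >= min_share:
            assigned.extend(classify(match_count, child, min_share, min_matches))

    # If no child qualifies, the database stays at the current node.
    return assigned or [node]


# A stand-in for a real search interface: query -> reported match count.
fake_counts = {
    "cancer treatment": 900, "blood pressure": 700,
    "soccer league": 20, "baseball season": 10,
    "aids hiv": 500, "diabetes insulin": 400,
    "aerobic exercise": 30, "weight training": 20,
}
print(classify(lambda q: fake_counts.get(q, 0)))  # ['Diseases']
```

Because only the subtrees that pass the thresholds are probed further, the number of queries grows with the depth of the chosen path rather than with the size of the whole category tree.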