Discovering, Annotating and Modeling Information Sources

Integrating new services into existing integration systems (or "mash-ups") is an important problem. The diagram below shows the three phases of automated service discovery and integration. We first search for new services (using a search engine), then annotate the discovered services (with the semantic types of the parameters being passed), and finally build a model of each new source (in terms of the relationships between those parameters).

An Example

In the first phase, we perform a keyword-based search over a service registry or Web index (e.g., UDDI or Google) to find relevant services. In the example, the keyword "hotels" is given to a search engine, which returns a service called HotelLookup.

In the second phase, we determine what type of data the service takes as input and returns as output. This is done by assigning semantic types such as HotelName and PhoneNumber (as opposed to syntactic types like string and integer) to each attribute (Name, Num, etc.) of the service as seen in the diagram.

In the third phase, we model the service by discovering how the input and output parameters relate to one another. This relationship is described by a database view definition (a conjunctive query in Datalog), as shown in the diagram. In our example, the source definition (at the bottom right of the figure) states that the service returns the addresses and phone numbers of all hotels that lie within a certain distance of a given zipcode, where the distance and zipcode are given as input.
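To make the idea of a source definition concrete, here is a minimal Python sketch of how such a conjunctive query could be evaluated. The exact definition appears only in the figure, so the rule in the docstring, the relation names hotel and distance, and all data other than the Ritz-Carlton name and phone number are assumptions made for illustration. The sketch also returns the hotel name alongside the address and phone number.

```python
# Toy stand-ins for relations in a mediated schema. Every tuple except the
# Ritz-Carlton name and phone number is invented for illustration.
HOTELS = [
    # (name, address, phone, zipcode)
    ("Ritz-Carlton", "1 Placeholder Way", "(310) 823-1700", "90292"),
    ("Example Inn", "2 Placeholder Ave", "(310) 555-0100", "90291"),
    ("Sample Hotel", "3 Placeholder St", "(212) 555-0199", "10001"),
]

# distance(zip1, zip2, miles): hypothetical relation, stored as a lookup table.
DISTANCE = {
    ("90292", "90292"): 0.0,
    ("90292", "90291"): 2.0,
    ("90292", "10001"): 2450.0,
}


def hotel_lookup(zipcode, max_dist):
    """Evaluate an assumed source definition for bound inputs.

    HotelLookup($zip, $dist, name, addr, phone) :-
        hotel(name, addr, phone, hzip),
        distance(zip, hzip, d),
        d <= dist.
    """
    results = []
    for name, addr, phone, hzip in HOTELS:
        d = DISTANCE.get((zipcode, hzip))  # join on the hotel's zipcode
        if d is not None and d <= max_dist:  # comparison predicate d <= dist
            results.append((name, addr, phone))
    return results


if __name__ == "__main__":
    for row in hotel_lookup("90292", 5):
        print(row)
```

The two inputs play the role of bound variables in the query head, and each loop iteration corresponds to one way of satisfying the body.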

Note that in general, the services will have many more attributes and far more complicated definitions than is the case for our simple example.


Research on the first phase of service discovery has concentrated on improving search performance by first classifying services into semantic domains (e.g., travel) [1] or clustering similar services together [2].

For the second phase of service discovery, researchers have demonstrated that classifiers can be used to assign semantic types to input/output parameters using metadata labels ("Zip", "Name", "Num", etc.) as features [1]. More recently, Lerman et al. [3] extended the feature set to also include the data ("Ritz-Carlton", "(310) 823-1700", etc.) generated by the service. The approach requires active invocation of the service using example input tuples (e.g., <"90292","5">), but the resulting classification outperforms that based on metadata alone.
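As a rough illustration of classifying parameters from metadata labels, the sketch below uses a nearest-neighbour rule over character bigrams. This is a deliberately simple stand-in, not the classifier used in [1] or [3]; the training pairs and the semantic type ZipCode are assumptions made for the example.

```python
def bigrams(label):
    """Return the set of lowercase character bigrams in a metadata label."""
    s = label.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity between two non-empty feature sets."""
    return len(a & b) / len(a | b)


# Hypothetical labelled examples: (metadata label, semantic type).
TRAINING = [
    ("Zip", "ZipCode"),
    ("PostalCode", "ZipCode"),
    ("Name", "HotelName"),
    ("HotelName", "HotelName"),
    ("Num", "PhoneNumber"),
    ("Phone", "PhoneNumber"),
    ("PhoneNumber", "PhoneNumber"),
]


def classify(label):
    """Assign the semantic type of the most similar training label."""
    feats = bigrams(label)
    _, best_type = max(TRAINING, key=lambda ex: jaccard(feats, bigrams(ex[0])))
    return best_type


if __name__ == "__main__":
    for label in ["zip", "PhoneNum", "hotel_name"]:
        print(label, "->", classify(label))
```

A data-based variant in the spirit of [3] would compute features from the values returned by the service rather than from the label text.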

Research on the third phase of service discovery is more recent; its goal is to learn view definitions automatically. In [4,5] we describe a system capable of inducing declarative source definitions automatically from examples of the data produced by a service. Our system actively invokes the new service to generate the example data and then searches the space of plausible source definitions (conjunctive queries) until it finds one that produces data similar to that observed.
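The generate-and-test idea can be sketched in a few lines of Python. This is not the system described in [4,5]: the candidate definitions are a small hand-written list rather than a searched space of conjunctive queries, the black-box service is simulated by the hypothetical function new_service, Jaccard overlap is an assumed similarity measure, and all data other than the Ritz-Carlton name and phone number is invented.

```python
# Known relations (toy data): hotel(name, phone, zipcode).
HOTELS = [
    ("Ritz-Carlton", "(310) 823-1700", "90292"),
    ("Example Inn", "(310) 555-0100", "90291"),
    ("Sample Hotel", "(212) 555-0199", "10001"),
]

_MILES = {
    frozenset(("90292", "90291")): 2.0,
    frozenset(("90292", "10001")): 2450.0,
    frozenset(("90291", "10001")): 2451.0,
}


def distance(a, b):
    """Hypothetical distance relation between two zipcodes, in miles."""
    return 0.0 if a == b else _MILES[frozenset((a, b))]


def new_service(zipcode, max_dist):
    """Stand-in for the black-box service whose definition is unknown."""
    return {(n, p) for n, p, z in HOTELS if distance(zipcode, z) <= max_dist}


# Candidate source definitions, each written as a function of the inputs.
CANDIDATES = {
    "all hotels": lambda z, d: {(n, p) for n, p, hz in HOTELS},
    "hotels in the given zipcode": lambda z, d: {
        (n, p) for n, p, hz in HOTELS if hz == z
    },
    "hotels within distance of zipcode": lambda z, d: {
        (n, p) for n, p, hz in HOTELS if distance(z, hz) <= d
    },
}


def jaccard(a, b):
    """Overlap between two sets of output tuples (1.0 if both are empty)."""
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def score(candidate, inputs):
    """Average similarity between candidate output and observed output."""
    return sum(jaccard(candidate(*i), new_service(*i)) for i in inputs) / len(inputs)


def induce(inputs):
    """Return the name of the best-scoring candidate definition."""
    return max(CANDIDATES, key=lambda name: score(CANDIDATES[name], inputs))


if __name__ == "__main__":
    examples = [("90292", 5), ("90292", 1), ("10001", 5)]
    print(induce(examples))
```

Using several example inputs matters here: on the input ("90292", 1) alone, the zipcode-only candidate and the distance-based candidate produce the same output and cannot be told apart.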


[1] Andreas Heß and Nicholas Kushmerick. Learning to Attach Semantic Metadata to Web Services. In 2nd International Semantic Web Conference (ISWC), 2003.
[2] Xin Dong, Alon Y. Halevy, Jayant Madhavan, Ema Nemes, and Jun Zhang. Similarity Search for Web Services. In Proceedings of VLDB, 2004.
[3] Kristina Lerman, Anon Plangprasopchok, and Craig Knoblock. Automatically Labeling the Inputs and Outputs of Web Services. In Proceedings of AAAI-2006, Boston, MA, USA, 2006.
[4] Mark James Carman and Craig A. Knoblock. Learning Semantic Descriptions of Web Information Sources. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India, January 2007.
[5] Mark James Carman. Learning Semantic Definitions of Information Sources on the Internet. Doctoral thesis (advisors: Paolo Traverso and Craig A. Knoblock), Department of Information and Communication Technologies, University of Trento, August 2006.