************************************************************************************
README file for the EIDOS Source Induction System
************************************************************************************
Author:
Mark James Carman
Information Sciences Institute,
University of Southern California
Updated: 6/3/2007
EIDOS: Efficiently Inducing Definitions for Online Sources
----------------------------------------------------------
1. Abstract:
------------
This README provides a simple tutorial on how to install and use the source
induction system. A description of the theory behind the workings of the system
can be found in the following publications:
Learning Semantic Descriptions of Web Information Sources,
Mark James Carman and Craig A. Knoblock.
In Proceedings of the Twentieth International Joint Conference on Artificial
Intelligence (IJCAI-07). Hyderabad, India, January 2007.
Learning Semantic Definitions of Information Sources on the Internet,
Mark James Carman.
Doctorate Thesis, (Advisors: Paolo Traverso and Craig A. Knoblock),
Department of Information and Communication Technologies,
University of Trento, August 2006.
These publications are available from Mark Carman's webpage at:
http://bradipo.net/mark/
2. Installation:
----------------
The system can be downloaded from the ISI website at:
http://www.isi.edu/publications/licensed-sw/eidos/index.html
2.1 Prerequisites:
----------------------------------------
Installing the Source Induction system is quite simple. You will, however, need
to have the following installed on your machine:
- Java Runtime Environment 1.5 or higher
You will also need to have access to a machine running:
- MySQL 5.0 or higher (Other SQL compatible relational database implementations
should work - see discussion below)
2.2 Configuring the Database Connection:
----------------------------------------
In order to induce source definitions the system will need access to a
relational database implementation. You must select the database implementation
that will be used by the system by changing the parameters of the file
"dbProperties.txt". You will need to set the following properties:
- jdbcDriver (the java class that implements the JDBC driver for your
implementation)
- dbURL (the URL of the database instance, in standard JDBC format)
- dbUser (the username for accessing the database server)
- dbPassword (the password)
We recommend that you use MySQL 5.0 or later as a database. While the system has
been designed to work with other relational database implementations, (in
particular SQLServer), it has not been tested on them and inconsistencies in the
SQL used by each may cause problems.
As an example, in order to use a MySQL database running on a server called
"ragnarok.isi.edu", with a username "root" and password "secret", (and setting
the default schema to be "db4"), you would need to add the following to the
"dbProperties.txt" file:
jdbcDriver=com.mysql.jdbc.Driver
dbURL=jdbc\:mysql\://ragnarok.isi.edu/db4
dbUser=root
dbPassword=secret
3. Using the Source Induction System:
-------------------------------------
Once you have installed/configured the source induction system you can start
using it. To use the system you must create a problem specification file and you
must write wrapper classes for all the sources (both known and unknown) in the
specification file. (In the future we hope to automate the wrapper generating
process.) The problem specification file is discussed in section 4 and the
wrapper writing process in section 5.
To run the system, call:
java -jar sourceInduction.jar <problem_filename>
Sample problem files are available in "problems" folder. If no problem file is
given, the system will execute the default problem "problems/example.txt".
Note that in order to run this and other sample problems you will first need to
fill the database with example data found in the file "data/examples.sql". The
file contains a series of SQL commands that create a set of tables containing
examples of different Semantic Types (that are needed for the induction
process). To load the database just execute the SQL script using any database
management software (such as the "MySQL Query Browser").
4. The Problem Specification File:
-----------------------------------
In EIDOS, problems are specified in a text file. A typical problem specification
file will contain declarations for semantic types, domain relations, comparison
predicates, known sources and target predicates (usually in that order).
Declarations begin with one of the following keywords:
{type, relation, comparison, source, function, target,
import}.
Each declaration must be on a separate line. (Declarations cannot run across
lines.)
The "#" character is used to escape comments. The rest of the line following the
"#" character is ignored by the parser.
4.1 Semantic Types:
---------------------
The first thing to define in a problem specification file is the set of semantic
types that will be used to describe the data that is required as input and
produced as output by different sources. Semantic types (like "phoneNumber" and
"hotelName") differ from syntactic types (like "string" and "integer") in that
the domain of their values is typically much smaller, and comparing their values
makes sense. (For example, it makes sense to check if two phone numbers are
equal, but not whether a phone number equals a zipcode, even though they are
both just integers.) Syntactically, a type declaration is written as follows:
type name [primitive_type] {additional_type_parameters}
Each declaration starts with the "type" keyword, followed by the name of the
type, then in square brackets the primitive type (that is used to define columns
in the database), followed by additional type parameters in curly-braces (that
define the domain and/or provide examples of the type). Two example semantic
type declarations are shown below:
type latitude [varchar(20)] {numeric: -90.0, 90.0, 0.002}
type speedKmph [varchar(45)] {numeric: 0.0, 1000.0, 1.0%}
The first type "latitude" is a numeric type, taking values from the range -90.0
to +90.0. The third parameter 0.002 is the tolerance. The induction system will
consider two latitude values to be the same if they fall within 0.002 of each
other. The tolerance can also be given as a relative value (percentage) as is
the case for the second numeric type "speedKmph".
Semantic types can also be nominal, meaning that the set of possible values do
not come from a range:
type countryName [varchar(100)] {examples:
examples.countryname.val}
In this case, a set of example values can be given to the system for use when
querying sources. For the "countryName" type shown above, the example values are
found in a database called "examples" in the "val" column of a table called
"countryname".
One does not necessarily need to provide example values for all types. Any types
declared without additional parameters will be assumed to be nominal and exact
string matching will be used for checking equality between values. E.g.:
type direction [varchar(30)]
We do not recommend declaring types without additional information however, as
it may lead to errors during the induction process. Basically, if the system
ever needs to generate input values of type "direction" an error will be
returned and the induction process will fail.
For some (nominal) semantic types, certain values from their domain are far more
more common than others. In such cases it usually makes sense for the induction
system to invoke services using the more common values than the less common
ones. Consider for example a service providing classified listings of used cars
for sale in a given area. If the system invokes the service using a common car
manufacturer like "Toyota", the service is likely to return some cars, while if
it invokes it with a less common manufacturer like "Ferrari", it is unlikely
that any cars will be found. Thus it makes sense for some semantic types (like
"carMake") to provide information regarding the frequency of different example
values:
type carMake [varchar(100)] {examples:
examples.cars.manufacturer[frequency]}
The above declaration states that example values for type "carMake" can be found
in the "examples" database in the "manufacturer" column of the "cars" table. In
addition, the relative frequency (or count) of each example is found in the
column "frequency" of the same table. (Note that the sum of values in the
"frequency" doesn't need to be 1.)
In the same way that we allow flexibility in the exact value of a numeric
attribute like "Temperature" (by giving a tolerance), we would like to allow
some flexibility in the value of a nominal attribute like "company". In other
words, we would like strings like "Google Inc." and "Google Incorporated" to
match one another even though they are not exactly the same.
Deciding which strings match and which don't can be difficult (consider "Google
Incorporated" and "Yahoo Incorporated" for instance - both contain the same
substring). To decide for a given type whether two strings refer to the same
entity we use string similarity scores (that are usually related to the edit
distance between strings). A commonly used similarity score is the JaroWinkler
score:
type company [varchar(80)] {examples: examples.company.val;
equality: JaroWinkler > 0.85}
The above declaration states that the system will consider two strings of type
"company" to refer to the same entity if the JaroWinkler score for the two
strings is greater than 0.85.
String distance scores are calculated using the SecondString package
(http://secondstring.sourceforge.net/), so you can choose to use other
similarity scores provided by that software. (Note however, that JaroWinkler is
the only similarity score that has so far been tested in EIDOS. Furthermore,
some of the similarity scores such as TFIDF, require examples of different
strings to be given apriori and are not as yet supported by the EIDOS code).
Some semantic types (like "date") have complex structure which needs to be taken
into account when deciding if two values are equal. In that case the user can
write special code for deciding on equality between strings and have the system
invoke that code as it would a string similarity measure:
type date [varchar(80)] {equality: Date = 1.0}
The above declaration states that two strings can only be considered equal if a
handwritten procedure called "Date" returns 1. (Currently "Date" is the only
handwritten procedure available in EIDOS.)
4.2 Domain Relations:
-----------------------
Having declared the types, we need to declare the relations that can be used to
express semantic relationships between variables of different types. These
relations define the mediated schema. (The language that the user uses to write
queries for the information mediator.) Each declaration consists of a relation
name and a set of semantic types, (which must have been declared previously in
the problem specification), such as the following:
relation centroid(zipcode, latitude, longitude)
Relations may be very long and contain multiple occurrences of the same type:
relation
forecast(latitude,longitude,date,temperatureF,temperatureF,sky,time,time,humidity)
In such cases (of multiple occurances of the same type - such as the high and
low temperatures above), it can be hard to remember which slot refers to what,
so the syntax allows for the use of labels as follows:
relation
forecast(latitude,longitude,date,temperatureF:high,temperatureF:low,sky,time:sunrise,time:sunset,humidity)
Labels are separated by colons and must follow the semantic type. Since the
labels do not affect the semantics of the relation declaration, they are ignored
by the system.
Note that there are no built in predicates in the system, so relations like the
sum and product may need to be added to the problem specification. E.g.:
relation sum(price,price,price)
4.3 Comparison Predicates:
----------------------------
Comparison predicates (like "less-than") are declared in a similar way to
relations. Here are some examples:
comparison <($distanceMi,$distanceMi)
comparison >=($price,$price)
comparison <($timestamp,$timestamp)
Unlike relations, their meaning is understood (interpreted automatically) by the
system. Thus they are also treated like source predicates in the sense that they
are used directly to generate definitions during the learning process.
4.4 Known Sources:
--------------------
Once the relations and comparison predicates have been defined, we can use them
to write source definitions for a set of known sources. (These sources will be
accessed by the system whenever it needs to check the validity of a new
definition for the target predicate.) An example source declaration for a source
called "GetZipcode" is shown below:
source GetZipcode($city,$state,zip) :-
municipality(city,state,zip,_). {wrappers.Ragnarok; getZipcode}
The variables "city" and "state" are prefixed with the "$"-symbol to distinguish
them as input attributes, ("zip" is the only output attribute). The relations
(predicates) appearing in the body of the rule (after the ":-"-symbol) must have
already been defined in the problem specification. All head variables
(input/output attributes) must appear in the body of the clause (rule). The
system will work out the semantic types of the input/output attributes using the
definitions of the relations in the body. The underscore character ("_")
represents a fresh / "don't care" variable (i.e. a variable that doesn't appear
anywhere else in the clause).
The semi-colon (;) separated parameters given in the curly braces ({}) are used
by the system to locate the appropriate wrapper class and operation for
accessing the source (see section 5 for details).
In general source definitions can be far more complex and contain (be
conjunctions of) multiple different relations, as is the case for the
declaration below:
source YahooFinance($tkr,ls,dt,tm,chg,op,mx,mn,vol) :-
trade(dt,tm,tkr,ls), market(dt,tkr,cl,op,mx,mn,vol), sum(cl,chg,ls).
{wrappers.YahooFinance; GetQuote}
4.5 Functional Sources:
----------------------
Certain sources are functional in the sense that they produce exactly one output
tuple for any given (possible) input tuple. Such sources are usually the result
of some (user provided) calculation such as the adding of two numbers together
or calculating the distance between two coordinates. By letting the system know
that a source is functional we can improve the performance of source induction.
(Note that functional sources must produce exactly one output tuple for ALL
POSSIBLE input tuples.) Functional sources are declared in the same way as
normal sources but using "function" keyword instead of "source", as shown in the
declarations below:
function ConvertKm2Mi($distKm,distMi) :-
convertDist(distKm,distMi). {wrappers.Ragnarok; convertKm2Mi}
function Add($price1,$price2,price3) :-
sum(price1,price2,price3). {invocation.Local; add}
4.6 Target Predicates:
------------------------
Once all the types, relations and sources have been defined in the problem
specification, the only thing left to declare is one or more target sources. The
system will try to induce a definition for each of the targets listed. Here are
some example target declarations:
target CurrencySource($currency,currency,price)
{wrappers.CurrencySource; getRates}
target
Autosite($make,$model,$zipcode,$distanceMi,year,color,price,mileage,distanceMi,datetime)
{wrappers.Autosite; getCars}
Note that the semantic types of the input/output attributes must be listed for
each target. The wrapper parameter in the curly braces are the same as for
source declarations.
4.7 Import Keyword:
---------------------
In order to allow for better organization of problem specification files, there
is an additional keyword "import" which allows you to import all of the
declarations contained in another file:
import problems/sources.txt
If a particular entity (type, relation, etc.) is declared twice (with the same
name) then the system will only import the first declaration for that entity.
5. Making Wrappers:
--------------------
To use the system you will need to write wrapper classes for all the sources
(both known and unknown) in the specification file. In the future we hope to
automate the wrapper generating process.
Wrappers must extend the class "invocation.Wrapper", which means that they need
to implement an "invoke" method:
relational.Table invoke(String[] endpoint,
java.util.ArrayList inputTuple);
This method will be called by EIDOS every time it invokes the wrapper. The
parameters "endpoint" are the semi-colon (";") separated source invocation
parameters discussed above.
I.e. for the target declaration:
target WebContinuum($currency,$currency,$price,price)
{wrappers.WebContinuum; calcExcRate}
The wrapper would be called "wrappers.WebContinuum" and the "endpoint"
parameters would be:
endpoint[0] = "wrappers.WebContinuum";
endpoint[1] = "calcExcRate";
The second parameter of the "invoke" method is the input tuple being sent to the
source. The method returns a relational table as output. If no tuples are
returned by the source for a particular input then the source should return an
empty table.
5.1 Local Functions:
-------------------
Certain common functions like addition and multiplication are provided by the
system as a local procedure, which is a little more efficient than needing to
access a remote source. Such functions are currently being provided as methods
of the class "invocation.Local", e.g.:
function Add($price1,$price2,price3) :-
sum(price1,price2,price3). {invocation.Local; add}
6. Output Data:
----------------
The results of service induction are written to a database, the name of which is
defined in the properties file:
outputDbName=output
The table generated will have the same name as the problem specification file
(without the ".txt" extension). Columns of the table will be as follows:
timestmp, target, definition, unfolding, score,
normalisedScore, candidates, totalCandidates, accesses,
totalAccesses, time, totalTime, timeout, maxClauseLength,
maxPredRepitition, maxVarLevel, noVarRepetition, heuristic
The newly discovered definition for each target predicate will be in the column
"definition". The rest of the columns describe different search parameters and
metrics (see the cited publications for a description).
7. Advanced Settings:
--------------------
The system while running generates a lot of data which it stores in the database
you provided. The system will create a new database instance (schema) using the
name given in the "dbproperties" file:
cacheDbName=cache
In this new schema (called "cache" above) the system will create two tables to
represent each source it accesses.
Many of the search parameters used for constraining the search space can be
altered by editing the file "Tester.java" and recompiling the sources. For more
information contact Mark Carman at http://bradipo.net/mark/