draft-wiley-joseph-swartz-tristero-search-01
Interface authors: Brandon Wiley, Sam Joseph, Aaron Swartz
Document author: Brandon Wiley
The Tristero Search Interface is intended as a generic search interface for collections of RDF statements. It is designed with system interoperability in mind, and provides a framework for accessing and manipulating RDF information over XML-RPC as well as via native APIs.
The search service comprises several interfaces for generating, processing, and fetching statements, resources, and meta-statements about statements. The interfaces are orthogonal to one another, and a service can provide any combination of them.
Search
{
string search(string subject, string predicate, string object)
string search(string subject, string predicate, string object, string database)
string search(string subject, string predicate, string object, string database, string matchType)
string search(string subject, string subjectMatchType, string predicate, string predicateMatchType, string object, string objectMatchType, string database)
}
The first three methods are convenience versions of the last, with default values filled in for the missing fields. The default value of database is "default". The default value of matchType is "exact". The search method uses its subject, predicate, and object arguments to match components in the metadata database. The database consists of RDF statements. Since an RDF statement contains three parts, each argument matches one part of the statement.
The matching strategy depends on the value of the matchType field. Currently, only the "exact" matchType is defined. Others will be defined as new search service implementations are created.
For example, the NeuroGrid implementation of the Tristero search interface supports the following matchTypes: =, <>, >, <, >=, <=, LIKE, NOT LIKE. The equals sign corresponds to the "exact" keyword, and the relational operators use an ASCII ordering on the RDF elements. LIKE and NOT LIKE are used in conjunction with wildcards, where * represents any number of characters and ? represents any single character; e.g. LIKE *red? would match redreds and fredredt but not fredredred.
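The wildcard behaviour described above can be approximated with Python's standard fnmatch module. This is only an illustrative sketch, not the NeuroGrid implementation: the helper names like_match and not_like_match are hypothetical, and fnmatch additionally treats [seq] as a character class, which the description above does not mention.

```python
from fnmatch import fnmatchcase


def like_match(value: str, pattern: str) -> bool:
    """Case-sensitive wildcard match: * is any run of characters, ? is one character.

    Assumption: fnmatchcase is used as a stand-in for LIKE; it also
    interprets [seq] character classes, which may differ from NeuroGrid.
    """
    return fnmatchcase(value, pattern)


def not_like_match(value: str, pattern: str) -> bool:
    """Negation of like_match, standing in for NOT LIKE."""
    return not like_match(value, pattern)


candidates = ["redreds", "fredredt", "fredredred"]
matched = [c for c in candidates if like_match(c, "*red?")]
print(matched)  # ['redreds', 'fredredt']
```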
The "exact" matching strategy matches only if the argument is the same string as the corresponding part of the RDF statement. Since no part of an RDF statement can be the empty string, an empty string matches all RDF statements. An RDF statement is considered to match the query only if all three arguments match the corresponding parts of the RDF statement.
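The "exact" rule can be sketched in a few lines of Python. The function names and the sample data below are hypothetical, and the sketch returns the matching statements directly for brevity rather than a reference to a result set.

```python
def part_matches(argument: str, part: str) -> bool:
    """An empty argument matches anything; otherwise the strings must be equal."""
    return argument == "" or argument == part


def statement_matches(query, statement) -> bool:
    """A statement matches only if all three arguments match their parts."""
    return all(part_matches(arg, part) for arg, part in zip(query, statement))


def search_exact(database, subject, predicate, obj):
    """Return every (subject, predicate, object) triple matching the query."""
    query = (subject, predicate, obj)
    return [s for s in database if statement_matches(query, s)]


# Hypothetical sample data, for illustration only.
database = [
    ("blanu", "Author", "Tristero"),
    ("Britney", "Author", "Memoir"),
    ("Britney", "Sings", "Song"),
]

print(search_exact(database, "Britney", "", ""))
print(search_exact(database, "", "Author", ""))
```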
The last method allows a matchType to be specified separately for each of the subject, predicate, and object. In the third method, the single matchType is applied to all elements of the RDF statement.
Here is an example set of arguments:
search("Britney", "", "") -- Find all statements about the subject "Britney".
search("", "Author", "Tristero") -- Find every author for the object "Tristero".
search("", "", "") -- Return all statements in the database.
search("blanu", "Author", "Tristero") -- Return this statement if it is in the database. Otherwise, return an empty set. (checks to see if a specific statement is in the database)
search("blan*", "LIKE", "Author", "=", "Tristero", "=", "default") -- Return all "Author" statements for the object "Tristero" whose subject starts with "blan". If there are none, return an empty set.
The search method does not actually return the matched statements. Instead, it returns a URL from which the matched statements can be fetched. This URL can also be used to perform various operations on the result set (such as unions or intersections with other result sets) without having to transfer sets of statements between different machines. The result set can be retrieved either with the fetch method or with a simple HTTP GET of the URL returned from the search method.
SearchSet
{
string union(string a, string b)
string intersection(string a, string b)
string difference(string a, string b)
string symmetric_difference(string a, string b)
}
This interface provides operations for manipulating sets of statements.
Each method takes two URLs representing the sets of statements to be
used in creation of the new set. The method returns a URL referencing
the newly created set.
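A minimal in-memory sketch of these four operations follows. The class name, the register and statements helpers, and the urn:example: URL scheme are all assumptions made for illustration; a real service would hand out URLs of its own choosing.

```python
import itertools


class SearchSetSketch:
    """Keeps result sets in memory, keyed by a made-up URL."""

    def __init__(self):
        self._sets = {}
        self._ids = itertools.count(1)

    def register(self, statements):
        """Store a result set and return the (hypothetical) URL naming it."""
        url = f"urn:example:resultset:{next(self._ids)}"
        self._sets[url] = frozenset(statements)
        return url

    def statements(self, url):
        """Look up the statements behind a URL (stands in for fetch)."""
        return self._sets[url]

    def union(self, a, b):
        return self.register(self._sets[a] | self._sets[b])

    def intersection(self, a, b):
        return self.register(self._sets[a] & self._sets[b])

    def difference(self, a, b):
        return self.register(self._sets[a] - self._sets[b])

    def symmetric_difference(self, a, b):
        return self.register(self._sets[a] ^ self._sets[b])


service = SearchSetSketch()
a = service.register([("s1", "p", "o"), ("s2", "p", "o")])
b = service.register([("s2", "p", "o"), ("s3", "p", "o")])
print(sorted(service.statements(service.intersection(a, b))))
```

Only URLs cross the interface boundary, so the statements themselves never need to be transferred between machines to combine result sets.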
Fetch
{
list fetch(string uri)
list fetchSubjects(string uri)
list fetchPredicates(string uri)
list fetchObjects(string uri)
dict fetchSubjectTable(string uri)
}
This interface provides a way to retrieve statements from the given URI.
The fetch returns all of the full statements contained in the database. The statements are fully parsed and returned in a list. Each element in the list represents a parsed statement. A parsed statement is represented as a list with three elements, representing the subject, predicate, and object of the statement, respectively. Each element of the statement is represented as a list of two elements representing the type and value of the element, respectively.
The type of an element is represented as an integer. The possible types are as follows:
int | type |
0 | Literal |
1 | URI |
2 | Node |
The value of an element is a string.
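As an illustration of this layout, the following sketch builds one parsed statement as nested lists. The constant names, the helper functions, and the example.org URIs are assumptions introduced here and are not defined by the interface.

```python
# Integer type codes from the table above.
LITERAL, URI, NODE = 0, 1, 2
TYPE_NAMES = {LITERAL: "Literal", URI: "URI", NODE: "Node"}


def make_statement(subject, predicate, obj):
    """Build a parsed statement: three [type, value] pairs in a list."""
    return [list(subject), list(predicate), list(obj)]


def describe(statement):
    """Render each element as 'TypeName:value' for display."""
    return [f"{TYPE_NAMES[kind]}:{value}" for kind, value in statement]


# Hypothetical statement, for illustration only.
statement = make_statement(
    (URI, "http://example.org/blanu"),
    (URI, "http://example.org/terms/Author"),
    (LITERAL, "Tristero"),
)
print(statement)
print(describe(statement))
```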
The fetchSubjects, fetchPredicates, and fetchObjects methods each return a list containing the relevant element of each statement.
The fetchSubjectTable method returns a mapping. For every unique subject contained in the database there is a key in the mapping. The associated value is also a mapping. For each unique predicate in the database that is associated with the given subject there is a key in the secondary mapping. The associated value is a list of all objects in the database associated with that subject and predicate pair. This hierarchical view of the database allows the user to iterate through the statements in a structured way rather than dealing with them as an unordered list of statements.
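The nested mapping can be sketched as follows. For brevity this sketch uses plain string triples rather than the typed [type, value] elements, and the function name build_subject_table is hypothetical.

```python
def build_subject_table(statements):
    """Group statements as {subject: {predicate: [objects, ...]}}."""
    table = {}
    for subject, predicate, obj in statements:
        table.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return table


statements = [
    ("John", "eats", "apples"),
    ("John", "drinks", "soda"),
    ("Fred", "eats", "oranges"),
    ("Fred", "drinks", "soda"),
]
table = build_subject_table(statements)

# Iterate in a structured way: subject first, then predicate, then objects.
for subject, predicates in table.items():
    for predicate, objects in predicates.items():
        print(subject, predicate, objects)
```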
Each of these fetch operations can be passed an additional pair of arguments, no_statements and offset like so:
list fetch(string uri, int no_statements, int offset)
which specifies that a certain number of statements should be returned and that they should be offset from the starting statement by a particular number. Thus the command:
fetch("some_uri",10,10)
would return statements 11-20, assuming that the query matches at least 20 statements.
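In terms of a zero-based list, the two arguments behave like a slice. The sketch below assumes the result set is already held in memory as a list; the function name fetch_page and the placeholder statements are hypothetical.

```python
def fetch_page(statements, no_statements, offset):
    """Return at most no_statements items, skipping the first offset items."""
    return statements[offset:offset + no_statements]


# Twenty placeholder statements, numbered from 1 as in the text.
statements = [f"statement {n}" for n in range(1, 21)]

page = fetch_page(statements, 10, 10)
print(page[0], "...", page[-1])  # statement 11 ... statement 20
```

If fewer statements remain than were requested, the slice simply returns what is left; whether a real service behaves this way is not stated in the interface.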
In the case where the URI specifies the "http" protocol or something else fetchable by external means, the result set can be retrieved directly from the URL. The canonical format for encoding the statements for transmission is N-Triples, with metastatements encoded in the RDF-URI format. For proponents of the REST architectural style, fetching the statements via HTTP GET allows use of the search interface in an entirely REST-compatible style. For those without such concerns, the Fetch interface provides a convenient way to access result sets without writing HTTP-level code.
Add
{
add(string uri, list statement)
addAll(string uri, list statements)
}
This interface allows new statements to be added to the database whose URI is specified. The add method takes a statement in the list form specified in the Fetch interface. The addAll method takes a list of these lists and adds each statement in the list.
Update
{
remove(string uri, list statement)
removeAll(string uri, list statements)
replace(string uri, list old, list new)
replaceAll(string uri, list oldList, list newList)
}
This interface allows statements to be removed from the database specified by the URI. The remove method takes a statement in list form, as used in the Add interface, and removes it from the given database if it exists. The removeAll method takes a list of these lists and removes each statement from the database.
The replace method takes two statements in list form. The first statement is removed from the database. The second statement is added. The replaceAll method takes two lists of statements and performs the replace operation on each statement in the lists.
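The Add and Update operations can be sketched against an in-memory store. Several points here are assumptions rather than part of the interface: the class name, the snake_case method names, ignoring duplicate additions, and pairing old and new statements by position in replace_all.

```python
from collections import defaultdict


class StatementStore:
    """Holds one list of statements per database URI."""

    def __init__(self):
        self.databases = defaultdict(list)

    def add(self, uri, statement):
        # Assumption: adding a statement that is already present is a no-op.
        if statement not in self.databases[uri]:
            self.databases[uri].append(statement)

    def add_all(self, uri, statements):
        for statement in statements:
            self.add(uri, statement)

    def remove(self, uri, statement):
        # Removes the statement only if it exists, as described above.
        if statement in self.databases[uri]:
            self.databases[uri].remove(statement)

    def remove_all(self, uri, statements):
        for statement in statements:
            self.remove(uri, statement)

    def replace(self, uri, old, new):
        self.remove(uri, old)
        self.add(uri, new)

    def replace_all(self, uri, old_list, new_list):
        # Assumption: statements are paired by position in the two lists.
        for old, new in zip(old_list, new_list):
            self.replace(uri, old, new)


store = StatementStore()
old = [[0, "blanu"], [0, "Author"], [0, "Tristero"]]
new = [[0, "blanu"], [0, "Editor"], [0, "Tristero"]]
store.add("default", old)
store.replace("default", old, new)
print(store.databases["default"])
```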
MetastatementHandler
{
string encodeMetastatement(list statement)
list decodeMetastatement(string rdfUri)
}
The encodeMetastatement and decodeMetastatement methods handle statements about statements. The encodeMetastatement method takes a statement in list form and encodes it in a string using the RDF-URI format. This string is a valid URI and so can be used as the subject in other statements. The decodeMetastatement method takes a string in the RDF-URI format and turns it back into the list-form statement that was used to create it.
It might be nice to have some more set operations. For instance, you might want to divide the database into just meta-statements or just normal statements. It would also be nice to make more complex queries about meta-statements server-side. For instance, you might want to make the query "all statements X with predicate Y which also have a meta-statement X A B asserted." Currently you can make the two separate queries, but there is no way to combine them, since the first query returns statements and the second query returns meta-statements. Some interface for combining these would be nice, but I'm not sure what it should be.
Sorting of search results
Handling situations where one wishes to find things like all the subjects in the database that are associated with "eats apples" and "drinks soda", e.g. say we have a set of statements like
1. John eats apples
2. John drinks soda
3. Fred eats oranges
4. Fred drinks soda
and say we want to find out who eats apples and drinks soda. Performing search
operations for "eats apples" and "drinks soda" and then taking the union returns statements
1, 2 and 4, while taking the intersection returns nothing. In order to return the result
"John" we need a more fine-grained interface.
The ability to specify the number of statements and an offset within a set of statements raises the issue that one might wish to perform set operations on subsets. This would be difficult to support, since currently the search operations cannot be subsetted individually. While the interface might be modified to support this, implementing it against an underlying SQL database would raise issues with caching queries, since the current SQL-based implementation (NeuroGrid) stores intermediate search URIs purely as SQL query tokens, and the offset/row operations used to get subsets of results can only be applied during the final fetch operation.
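Returning to the "eats apples" and "drinks soda" example above, the following sketch shows why statement-level set operations fall short and what a subject-level intersection would return. The search and subjects helpers are hypothetical and are not part of the current interface; they illustrate one possible finer-grained operation, not a proposed design.

```python
# The four numbered statements from the example.
statements = {
    1: ("John", "eats", "apples"),
    2: ("John", "drinks", "soda"),
    3: ("Fred", "eats", "oranges"),
    4: ("Fred", "drinks", "soda"),
}


def search(predicate, obj):
    """Return the numbers of the statements with this predicate and object."""
    return {n for n, (_, p, o) in statements.items() if p == predicate and o == obj}


def subjects(statement_numbers):
    """Project a set of statement numbers onto their subjects."""
    return {statements[n][0] for n in statement_numbers}


eats_apples = search("eats", "apples")
drinks_soda = search("drinks", "soda")

# Statement-level operations: no whole statement is shared by both searches.
print(sorted(eats_apples | drinks_soda))
print(sorted(eats_apples & drinks_soda))

# Subject-level intersection finds who appears in both result sets.
print(subjects(eats_apples) & subjects(drinks_soda))
```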