The D2RML Information Sources (D2RML‑IS) Specification

This document describes the D2RML Information Sources (D2RML‑IS) vocabulary.

Introduction

This document describes D2RML-IS [[D2RMLISVoc]], a vocabulary for describing information sources.

D2RML-IS descriptions are written in Turtle syntax [[TURTLE]].

Document Conventions

In this document, examples assume the following namespace prefix bindings unless otherwise stated:

Table 1: Namespaces used by this document.
Namespace prefix Namespace URI
cenc http://islab.ntua.gr/ns/cenc#
cnt http://www.w3.org/2011/content#
dris http://islab.ntua.gr/ns/d2rml-is#
drop http://islab.ntua.gr/ns/d2rml-op#
ff http://islab.ntua.gr/ns/file-formats#
formats https://www.w3.org/ns/formats/
http http://www.w3.org/2011/http#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

Overview of D2RML-IS

The classes defined by D2RML-IS are shown in Figure 1.

Figure 1: D2RML-IS classes.

Information Sources

In the context of D2RML-IS, an information source represents a data container. The description of an information source specifies exactly how the data residing in that container can be accessed. Information sources eventually provide their data, or parts of it, as data blocks. A data block may be a chunk of binary or textual data, and the information source may or may not know the form of the data blocks it provides. The abstract class of all information sources is dris:InformationSource.

In general, the description of an information source may involve parameters; any such parameters are instances of drop:Parameter and are declared using the drop:parameter property. Any parameters should be named parameters and they may appear using their name in values of properties of type xsd:string within braces, i.e. as {parameter-name} where parameter-name is the value of the drop:name property of the parameter.
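
The placeholder mechanism described above can be illustrated with a minimal Python sketch. The function name substitute_parameters is hypothetical and not part of D2RML-IS. The sketch assumes that a placeholder is the parameter name within braces and, since some examples in this document write placeholders as {@@name@@}, it accepts that form as well; leaving unknown names untouched is also an assumption.

```python
import re

# Matches {NAME} and, as an assumption, the {@@NAME@@} form used in some examples.
_PLACEHOLDER = re.compile(r"\{(?:@@)?([A-Za-z0-9_-]+)(?:@@)?\}")


def substitute_parameters(template, values):
    """Replace each named placeholder in template with its value.

    Placeholders whose name is not in values are left untouched (an assumption).
    """
    def replace(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return _PLACEHOLDER.sub(replace, template)


print(substitute_parameters(
    "http://www.example.com/data/countries/{COUNTRY}", {"COUNTRY": "GR"}))
```

Braces that do not enclose a plain name, such as those of a JSON body, are not matched by this pattern and pass through unchanged.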

Information sources are divided into two main classes: data sources and service sources.

Data Sources

A data source is an information source that represents one or more data blocks, typically a single data block in a particular file, which are obtained every time they are requested from the data source. A specific data source is a subclass of dris:DataSource, which is the abstract class of all data sources.

An instance of a data source should contain the necessary information for interpreting the data blocks it provides. If the file format of the data block is important and cannot be determined correctly from a file extension, it may be specified using the dris:fileFormat property; appropriate values for this property include e.g. those defined in the file formats and FF vocabularies. The character encoding of a textual data block may be specified using a dris:characterEncoding property; appropriate values for this property include e.g. those defined in the CENC vocabulary. If access to the data block is secured by a password, the password may be specified by a dris:fileCredentials property.

The current version of D2RML-IS defines two data sources: file sources, which represent local files, and HTTP sources, which represent remote files or data obtained through an HTTP request.

File Sources

A file source represents one or more files in an assumed local file system. It is an instance of dris:FileSource. The physical location of the file is a path and is specified using a dris:path property. More than one path may be provided, in which case the file source represents the set of data blocks corresponding to the individual files, rather than a single data block. If the ordering of data blocks is important, the paths should be provided using the dris:paths property. A path value may be a URI in the file scheme, or a string. Strings allow for parametric paths.

If a path is provided as a string, it MUST be specified using / as separator, regardless of the underlying file system.


<#CSVFile>
   a dris:FileSource ;
   dris:path "c:/data/part-1.csv" ;
   dris:path "c:/data/part-2.csv" ;
   dris:characterEncoding cenc:Windows-1253 .
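
The separator rule for string paths can be met by normalizing a native path before writing it into a description. The following Python sketch assumes a Windows-style input path; the helper name to_slash_path is hypothetical.

```python
from pathlib import PureWindowsPath


def to_slash_path(native_path):
    """Return native_path written with / as separator, as required for string paths."""
    # PureWindowsPath only manipulates the string, so this works on any platform.
    return PureWindowsPath(native_path).as_posix()


print(to_slash_path(r"c:\data\part-1.csv"))
```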
					

HTTP Sources

An HTTP source represents a single data block obtainable by an HTTP request. It is an instance of dris:HTTPSource. If the request is a GET request, the full request URL can be specified using the dris:uri property. More complex requests, such as POST requests, can be specified using the dris:httpRequest property whose object is an instance of http:Request described in the HTTP Vocabulary in RDF. If access to the data block represented by the HTTP source is secured, the necessary credentials can be provided through the dris:credentials property.


<#HTTPSource>
   a dris:HTTPSource ;
   dris:uri "http://www.example.com/data/countries" .
					

<#HTTPSource>
   a dris:HTTPSource ;
   dris:httpRequest [
      http:absoluteURI "https://www.example.com/api/analyze?type=image&language=en" ;
      http:methodName "POST" ;
      http:headers ( 
        [ 
          http:fieldName "Content-Type" ;
          http:fieldValue "application/json" ; 
        ]
        [ 
          http:fieldName "APIKey" ;
          http:fieldValue "A$2@3KZa" ; 
        ] 
      ) ;
      http:body [ 
        a cnt:ContentAsText ;
        cnt:chars "{\"url\" : \"http://www.example.com/data/image.jpg\" }" ; 
      ] ;
   ] .				
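
To illustrate how a processor might turn such an http:Request description into an actual request, the following Python sketch builds the equivalent request object with the standard urllib library, without sending it. The function name build_post_request is hypothetical, and the mapping from the RDF description to the request is an assumption rather than behavior defined by D2RML-IS.

```python
import json
import urllib.request


def build_post_request():
    """Build (but do not send) the POST request described in the example above."""
    # Corresponds to the cnt:chars value of the http:body.
    body = json.dumps({"url": "http://www.example.com/data/image.jpg"}).encode("utf-8")
    return urllib.request.Request(
        "https://www.example.com/api/analyze?type=image&language=en",  # http:absoluteURI
        data=body,
        headers={"Content-Type": "application/json", "APIKey": "A$2@3KZa"},  # http:headers
        method="POST",  # http:methodName
    )


request = build_post_request()
print(request.get_method(), request.full_url)
```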
					

<#HTTPSource>
   a dris:HTTPSource ;
   dris:uri "http://www.example.com/data/countries/{COUNTRY}" ;
   drop:parameter [
      drop:name "COUNTRY" ;
   ] .
					

Container Sources

A data source may act as a container source, i.e. its data block can be interpreted as a container of other data sources, in particular file sources. This means that a file source may reside within the data blocks represented by another data source, e.g. a file contained inside a zipped file. In such cases, the container source of a file source may be specified by the dris:containerSource property, and the respective path is the location of the file inside the container source. The path value should then be a string; if the path equals *, the source represents the set of data blocks corresponding to all files in the container source.


<#RemoteZipFile>  
   a dris:HTTPSource ;
   dris:uri "http://www.example.com/data/all.zip" ;
   dris:fileCredentials [
      dris:password "A$2@3KZa" 
   ] .

<#CompaniesSource>  
   a dris:FileSource ;
   dris:containerSource <#RemoteZipFile> ;
   dris:path "content/companies.csv" .
   
<#PersonsSource>  
   a dris:FileSource ;
   dris:containerSource <#RemoteZipFile> ;
   dris:path "content/persons.csv" .
					

In the above example, the container source <#RemoteZipFile> (which is a password-protected file) is used twice to provide access to two of the contained files.
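
How a processor might resolve a path inside a zip container, including the * value, can be sketched in Python as follows. The function read_from_container is hypothetical, the container is assumed to be a zip archive already available as bytes, and the in-memory archive merely stands in for the remote file.

```python
import io
import zipfile


def read_from_container(container_bytes, path, password=None):
    """Return {name: bytes} for the file at path, or for all files if path is "*".

    password, if given, must be bytes and is used to read encrypted entries.
    """
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        if path == "*":
            names = [info.filename for info in archive.infolist() if not info.is_dir()]
        else:
            names = [path]
        return {name: archive.read(name, pwd=password) for name in names}


# Build a small in-memory archive to stand in for the remote zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content/companies.csv", "id;name\n1;Acme")
    archive.writestr("content/persons.csv", "id;name\n1;John")

blocks = read_from_container(buffer.getvalue(), "*")
print(sorted(blocks))
```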

If the container is going to be used only once, to access a single file within it, the respective data block can be written as a single data source using the dris:inContainerPath property. As before, the value of dris:inContainerPath may equal *. dris:inContainerPath is a shortcut property. Formally, assuming that p1, ..., pn are predicates other than dris:containerSource, the following table defines the shortcut.

The dris:inContainerPath shortcut property.

Shortcut:
   ?x ?p1 ?y1 . ... . ?x ?pn ?yn .
   ?x dris:inContainerPath ?v

Shortcut for:
   ?z ?p1 ?y1 . ... . ?z ?pn ?yn .
   ?w a dris:FileSource .
   ?w dris:containerSource ?z .
   ?w dris:path ?v

<#CompaniesSource>  
   a dris:HTTPSource ;
   dris:uri "http://www.example.com/data/all.zip" ;
   dris:filePassword "A$2@3KZa" ;
   dris:inContainerPath "content/companies.csv" .
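
The shortcut expansion can be read as a rewriting of triples. The following Python sketch applies it to triples represented as plain tuples; the function expand_in_container_path and the node names _:z and _:w are hypothetical, and treating both ?z and ?w as fresh nodes is an assumption about how the table is meant to be read.

```python
def expand_in_container_path(triples, x, z, w):
    """Expand the dris:inContainerPath shortcut for subject x.

    z and w name the container source and the file source of the expanded form.
    Triples about other subjects are kept unchanged.
    """
    expanded = []
    for s, p, o in triples:
        if s != x:
            expanded.append((s, p, o))
        elif p == "dris:inContainerPath":
            expanded.append((w, "rdf:type", "dris:FileSource"))
            expanded.append((w, "dris:containerSource", z))
            expanded.append((w, "dris:path", o))
        else:
            # Every other statement about x describes the container source.
            expanded.append((z, p, o))
    return expanded


shortcut = [
    ("<#CompaniesSource>", "rdf:type", "dris:HTTPSource"),
    ("<#CompaniesSource>", "dris:uri", "http://www.example.com/data/all.zip"),
    ("<#CompaniesSource>", "dris:inContainerPath", "content/companies.csv"),
]
for triple in expand_in_container_path(shortcut, "<#CompaniesSource>", "_:z", "_:w"):
    print(triple)
```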
					

Container sources may be nested.

Service Sources

A service source is an information source that represents a data repository, from which data blocks are eventually obtained by issuing a specific request to the service source, usually in the form of a query in a language that the service source understands. Usually, the data blocks obtained from a service source have a specific format that depends on the type of the source. A specific service source is a subclass of dris:ServiceSource, which is the abstract class of all service sources. The description of a service source contains the information necessary for establishing the connection to the service source, but not the queries by which the data blocks are obtained.

The current version of D2RML-IS defines two service sources: SPARQL endpoints, which represent SPARQL endpoints answering SPARQL queries, and RDBMSs, which represent relational database management systems answering SQL queries.

SPARQL Endpoints

A SPARQL endpoint is an instance of dris:SPARQLEndpoint and is specified only by the URI the endpoint listens on, given through the dris:uri property.


<#WikidataEndpoint>
   a dris:SPARQLEndpoint ;
   dris:uri "https://query.wikidata.org/bigdata/namespace/wdq/sparql" .
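
As an illustration of how the dris:uri value is used once a query is available, the following Python sketch builds the URL of a SPARQL query sent via GET, where the query is passed in the query parameter as in the SPARQL Protocol. The function sparql_get_url and the sample query are hypothetical; the query itself is not part of the endpoint description.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint_uri, query):
    """Return the GET URL that submits query to the endpoint at endpoint_uri."""
    return endpoint_uri + "?" + urlencode({"query": query})


url = sparql_get_url(
    "https://query.wikidata.org/bigdata/namespace/wdq/sparql",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
)
print(url)
```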
					

Relational Databases

An RDBMS is an instance of dris:RDBMSSource and is specified by its type and access details. The RDBMS type determines the underlying database system. It is specified by the dris:rdbmsType property, and the possible values are listed in the RDBMS types.

The access details are specified using the dris:host, dris:port, dris:path, dris:databaseName and dris:databaseInstanceName properties, and the necessary username and password are provided through the dris:credentials property. The currently provided options cover several simple RDBMS connection cases. A separate, implementation-independent RDF vocabulary for fully specifying RDBMS access is required.


<#MySQLDatabase>
   a dris:RDBMSSource ;
   dris:rdbmsType dris:MySQL ;
   dris:host "http://database.org" ;
   dris:port 3006 ;
   dris:databaseName "companies" ;
   dris:credentials [
      a dris:StandardUserCredentials ;
      dris:username "root" ;
      dris:password "r@@Tx#" 
   ] .
					

<#MSAccessDatabase>
   a dris:RDBMSSource ;
   dris:rdbmsType dris:MicrosoftAccess ;
   dris:path "c:/data/database.mdb" .
					

String Sources

An information source may also be a string source. A string source represents a single data block that is a directly provided user-defined string. It is an instance of dris:StringSource, and the string is provided through the dris:string property, which typically will make use of a parameter.


<#StringSource>
   a dris:StringSource ;
   dris:string "name;surname\nJohn;Smith" .
				

<#StringSource>
   a dris:StringSource ;
   dris:string "{@@VALUE@@}" ;
   drop:parameter [
      drop:name "VALUE" 
   ] .
				

Credentials

Credentials provide authorization information to access a resource. The abstract class of credentials is dris:Credentials.

The current version of D2RML-IS defines the following credentials: standard user credentials.

Standard User Credentials

Standard user credentials allow the identification of a user by means of a username and a password, specified by the dris:username and dris:password properties respectively.

Request Iterators

A request iterator is a parameter that is used to obtain a series of data blocks from an information source.

The current version of D2RML-IS defines the following request iterators: enumerate request iterators, simple count request iterators and simple key request iterators.

Enumerate Request Iterators

An enumerate request iterator is a request iterator that takes successively all values from a predefined list of string values. The list of values is provided by the dris:values property.


<#Source>
a dris:HTTPSource ;
dris:httpRequest [ 
  http:absoluteURI "https://www.museum.org/items/{@@id@@}" ;
  http:methodName "GET" ;
] ;
drop:parameter [ 
  a dris:EnumerateRequestIterator;
  drop:name "id" ;
  dris:values ( "photo-49" "photo-34" "document-19" "document-92" ) ;
] .
				

Simple Count Request Iterators

A simple count request iterator is a request iterator that takes successively all values from an initial value up to a maximum value by the specified increment.


<#Source>
a dris:HTTPSource ;
dris:httpRequest [ 
  http:absoluteURI "https://www.museum.org/items?page={@@page@@}&size=20" ;
  http:methodName "GET" ;
] ;
drop:parameter [ 
  a dris:SimpleCountRequestIterator;
  drop:name "page" ;
  dris:initialCount 1 ;
  dris:maxCount 100 ;
  dris:increment 20 ;
] .
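
The series of values produced by a simple count request iterator can be sketched in Python as follows. The function count_values is hypothetical; treating the maximum value as inclusive and rejecting non-positive increments are assumptions.

```python
def count_values(initial, maximum, increment):
    """Yield initial, initial + increment, ... while the value does not exceed maximum."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    value = initial
    while value <= maximum:  # the maximum is assumed to be inclusive
        yield value
        value += increment


# With the settings of the example above (initial 1, maximum 100, increment 20):
print(list(count_values(1, 100, 20)))  # prints [1, 21, 41, 61, 81]
```

Each yielded value would replace the iterator's placeholder in one request.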
				

Simple Key Request Iterators

A simple key request iterator is a request iterator whose successive values come from the data itself: each new value is read from an element of the data most recently returned by the information source. For this to be possible, the returned data must be interpreted.

If an HTTP source returns paginated results, e.g. in the form:


{
"itemsCount": 20,
"totalResults": 567,
"nextCursor": "K2FSDA48DSA93GH6B97F5D767KD3B08HJ",
"items": [
  ...
]
}
				

where the value of nextCursor is a value that should be used in the URL to retrieve the next page, all pages can be obtained iteratively by using the following request iterator:


<#Source>
a dris:HTTPSource ;
dris:httpRequest [
  http:absoluteURI "http://www.example.org/search?cursor={@@cursor@@}&rows=20" ;
  http:methodName "GET" ;
] ;
drop:parameter [ 
  a dris:SimpleKeyRequestIterator ;
  drop:name "cursor" ;
  dris:initialValue "*" ;
  dr:column "$.nextCursor" ;
  dr:columnFormulation dris:JSONPath ; 
] .
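
The iteration driven by such a key can be sketched in Python as follows. The function iterate_by_key and the PAGES dictionary, which stands in for the remote service, are hypothetical. Reading the key as a top-level field instead of evaluating a full JSONPath expression, and stopping when the key is absent, are assumptions.

```python
def iterate_by_key(fetch, initial_value, key):
    """Yield pages, starting with initial_value and following the key found in each page."""
    value = initial_value
    while value is not None:
        page = fetch(value)
        yield page
        # The next request value is read from the page just returned;
        # iteration is assumed to end when the key is missing.
        value = page.get(key)


# Hypothetical stand-in for the HTTP source: maps a cursor value to a parsed response.
PAGES = {
    "*": {"items": ["a", "b"], "nextCursor": "K2"},
    "K2": {"items": ["c"], "nextCursor": "K3"},
    "K3": {"items": ["d"]},
}

pages = list(iterate_by_key(PAGES.__getitem__, "*", "nextCursor"))
print([page["items"] for page in pages])
```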