SPARQL Notes

Published 2018-03-26, updated 2023-11-22. Length: 4.5k characters. Reading time ≈ 16 minutes

Jumping Right In

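The dataset specified with FROM is overridden by the dataset specified when the SPARQL processor is invoked (if both are given).
RDF is not a data format but a data model; you can choose which syntax to use when storing it in files.
If we combine our data with other data, the subjects and predicates of RDF triples must belong to specific namespaces to prevent confusion between similar names, so we use URIs to represent them.
When you write a full URI, put it in angle brackets to show the processor that it is a URI.
In semantic web development, a vocabulary is a set of terms stored in a standard format for people to reuse.
The ability to discover connections between triples from different sources is one of SPARQL's best features.
Convention: subject, predicate, and object are written as ?s, ?p, and ?o.
Only data that satisfies every triple pattern in the graph pattern is returned.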

The Semantic Web, RDF, and Linked Data (and SPARQL)

the semantic web isn’t about the query language or about the model—it’s about the data
The basic unit of information in RDF is the triple.
The technical term for saving RDF on disk as a string of bytes is serialization.
RDFS gives people a way to describe vocabularies. It is itself a vocabulary with a schema whose triples declare facts.
Linked Data:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
- Include links to other URIs so that they can discover more things.

SPARQL Queries

This part introduces more of the useful features of the SPARQL query language.

Data That Might Not Be There

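Putting a triple pattern inside an OPTIONAL graph pattern means "retrieve this value if it is there."
OPTIONAL { t1 . t2 . t3 } contains three triple patterns, and all three must match together.
OPTIONAL { t1 } OPTIONAL { t2 } OPTIONAL { t3 } is three independent conditions.

The order of OPTIONAL graph patterns matters.

# Prefer ab:nick as ?first; fall back to ab:firstName only when there is no nick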
SELECT ?first ?last
WHERE
{
    ?s ab:lastName ?last.
    OPTIONAL {?s ab:nick ?first}
    OPTIONAL {?s ab:firstName ?first}
}
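
Why the order matters can be sketched outside SPARQL. The Python below is only an analogy, not how a SPARQL processor is implemented: each OPTIONAL is modeled as a step that fills in ?first only while it is still unbound. The record layout, the sample names, and the function `apply_optionals` are assumptions made for illustration.

```python
# Analogy only: each OPTIONAL binds ?first only when it is still unbound,
# so the first OPTIONAL that matches wins.
def apply_optionals(person, optional_keys):
    """Return the value for ?first given an ordered list of optional properties."""
    first = None  # ?first starts unbound
    for key in optional_keys:
        if first is None and key in person:
            first = person[key]
    return first

# Hypothetical sample data, keyed by property name.
people = [
    {"lastName": "Mutt", "firstName": "Richard", "nick": "Dick"},
    {"lastName": "Marshall", "firstName": "Cindy"},
]

nick_first = [apply_optionals(p, ["nick", "firstName"]) for p in people]
name_first = [apply_optionals(p, ["firstName", "nick"]) for p in people]
print(nick_first)  # ['Dick', 'Cindy']
print(name_first)  # ['Richard', 'Cindy']
```

Swapping the two OPTIONAL patterns in the query corresponds to swapping the list order here: the nickname is then used only when no first name exists.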

Finding Data That Doesn’t Meet Certain Conditions

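SPARQL 1.0

# Use FILTER with bound(): !bound(?v) is true when ?v has no binding, so only those results are kept
OPTIONAL {?s ?p ?v}
FILTER(!bound(?v))

SPARQL 1.1

# True when the given pattern has no match
FILTER NOT EXISTS {?s ?p ?v}

# Another way: subtract the resources that match the pattern (behaves the same in most cases)
MINUS {?s ?p ?v}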
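
As a rough analogy for the NOT EXISTS and MINUS forms, the idea can be pictured as a set difference. This is a sketch in Python over made-up triples, not a description of how a processor evaluates either keyword; the identifiers, property names, and the helper `subjects_without` are hypothetical.

```python
# Hypothetical triples as (subject, predicate, object) tuples.
triples = {
    ("i0432", "lastName", "Mutt"),
    ("i0432", "workTel", "(245) 315-5486"),
    ("i9771", "lastName", "Marshall"),
}

def subjects_without(triples, predicate):
    """Subjects that have a lastName but no triple with the given predicate."""
    candidates = {s for s, p, _ in triples if p == "lastName"}
    matching = {s for s, p, _ in triples if p == predicate}
    # The set difference plays the role of MINUS / FILTER NOT EXISTS here.
    return candidates - matching

print(subjects_without(triples, "workTel"))  # {'i9771'}
```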

Searching Further In The Data

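The object of an RDF triple can be a string or a URI. String values are easier to read, but URIs make it easier to link the data with other data.
If the URIs of resources in one dataset can be matched with URIs in another dataset, SPARQL can retrieve more information, even when the datasets come from different places and you don't know how they are organized.
A comma means "the next triple has the same subject and predicate as the previous one."

Using property paths: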

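# With the regex-style operator + (one or more), return the papers that cite A directly or indirectly
SELECT ?s
WHERE
{?s c:cites+ :paperA.}
# Specify the number of citation hops (the {n} form comes from SPARQL 1.1 drafts, so not every processor accepts it)
#{?s c:cites{3} :paperA.}
# Same effect
#{?s c:cites/c:cites/c:cites :paperA.}
# ^ reverses the direction: returns the papers that A cites
#{?s ^c:cites :paperA.}
# ^ combined with a property path: papers that cite a paper that F also cites
#{
#    ?s c:cites/^c:cites :paperF .
#    FILTER(?s != :paperF)
#}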
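
What `c:cites+` asks for, "one or more hops", amounts to a transitive closure over the citation links. A minimal sketch in Python, assuming a small made-up citation graph; the paper names and the function `citing_transitively` are illustrative only.

```python
from collections import deque

# Hypothetical citation graph: paper -> papers it cites.
cites = {
    "paperB": ["paperA"],
    "paperC": ["paperB"],
    "paperD": ["paperC", "paperA"],
    "paperE": [],
}

def citing_transitively(cites, target):
    """Papers that reach `target` through one or more `cites` links."""
    # Invert the edges so we can walk from the target back to its citers.
    cited_by = {}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)
    found, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for citer in cited_by.get(current, ()):
            if citer not in found:
                found.add(citer)
                queue.append(citer)
    return found

print(sorted(citing_transitively(cites, "paperA")))  # ['paperB', 'paperC', 'paperD']
```

paperC never cites paperA directly; it is found because it cites paperB, which does.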

Searching With Blank Nodes

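Any name given to a blank node is temporary; in a query it is usually represented by a variable.
Blank nodes are intermediate nodes used to connect triples.
The final SELECT list usually does not include blank nodes.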

Eliminating Redundant Output

The DISTINCT keyword
DISTINCT does not make the query structure any more complex: just put DISTINCT in front of the SELECT list.

Combining Different Search Conditions

The UNION keyword: specify multiple graph patterns and get back a combination of all the data that fits any of those patterns.
Some SPARQL processors use the declared prefixes in their results, which makes the results more readable.

Filtering Data Based On Conditions

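FILTER() takes a single argument; any expression that returns a boolean value can be that argument.
When an RDF parser reads input data, it maps the prefixes to the appropriate namespace URIs and then hands the data to the query processor.
!isURI(?city): returns true if ?city is not a proper URI.
The IN keyword: match triples where a variable's value is in a list (put NOT in front for the opposite).
# ask for data where the ?v is either A or B
FILTER(?v IN (A,B))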

Retrieving A Specific Number Of Results

The LIMIT keyword
It goes outside the curly braces.
The OFFSET keyword: skips a given number of results.

Querying Named Graphs

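Named graph: a name given to a set of triples so that it is easier to manage (for example, to replace it as a unit).
The dataset to query can be specified inside the query with FROM or outside it on the ARQ command line; the latter overrides the former.

SELECT *
FROM <xx.ttl>
FROM <yy.ttl>
WHERE
{...}

The specified datasets together form the default graph, which does not belong to any named graph.

FROM NAMED: the data is not added to the default graph, and the query must refer to it by its graph name (ARQ's convention is to use the URI as the name). Chapter 6 covers the related parts of the SPARQL 1.1 standard.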

SELECT ?lname ?courseName
FROM <ex069.ttl>
FROM NAMED <ex125.ttl>
FROM NAMED <ex122.ttl> # Even though it is listed here, its data is not searched unless the query references it with GRAPH (as is done for <ex125.ttl>)
WHERE
{
    { ?student ab:lastName ?lname }
    UNION
    { GRAPH <ex125.ttl> { ?course ab:courseTitle ?courseName } }
}

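The GRAPH keyword: used in a query to refer to data in a specific named graph.

GRAPH can also be followed by a variable, letting the SPARQL processor look for graphs that match a pattern.

# Only named graphs are queried here; GRAPH ?g does not match triples in a default graph, so any such triples are not returned
SELECT *
FROM NAMED <yyy.ttl>
FROM NAMED <xxx.ttl>
WHERE
{GRAPH ?g {?s ?p ?o}}

A SPARQL processor may have predefined named graphs, which you can reference with GRAPH without declaring them first.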

SELECT ?graph ?email 
FROM <ex134.ttl> 
FROM NAMED <ex125.ttl> 
FROM NAMED <ex122.ttl> 
WHERE 
{
    ?graph dc:date "2011-09-24" . 
    { GRAPH ?graph { ?s ab:email ?email } }
}

Queries In Your Queries

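Subqueries let you break a complex query into smaller parts and also combine the results of different queries.
Each subquery must be enclosed in its own curly braces.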

Combining Values And Assigning Values To Variables

Values retrieved by a SPARQL query can be used in arithmetic and in function calls.

The BIND keyword

# Arithmetic example
# ?tip and ?total are created from ?amount
SELECT ?description 
?amount 
((?amount * .2) AS ?tip) 
((?amount + ?tip) AS ?total)
WHERE
{
    ?meal e:description ?description ;
          e:amount ?amount .
}

# Function call example
SELECT 
(UCASE(SUBSTR(?description,1,3))as ?mealCode) 
?amount 
WHERE
{
    ?meal e:description ?description ;
          e:amount ?amount .
}

# Improved version: the expression is computed inside the WHERE clause, using BIND to assign the value to a variable
SELECT ?mealCode ?amount
WHERE
{
    ?meal e:description ?description ;
          e:amount ?amount .
    BIND (UCASE(SUBSTR(?description,1,3)) as ?mealCode)
}

Sorting, Aggregating, Finding The Biggest And Smallest And…

SPARQL uses ORDER BY to sort (ascending by default).

# Sort by amount
SELECT ?description ?date ?amount
WHERE
{
    ?meal e:description ?description ;
    e:date ?date ;
    e:amount ?amount .
}
ORDER BY ?amount

Descending order: use DESC(), with the sort key inside the parentheses

# Sort by amount, largest first
SELECT ?description ?date ?amount 
WHERE
{
    ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount .
}
ORDER BY DESC(?amount)

To sort on several keys, separate them with spaces

# Sort by description (alphabetically) first, then by amount, largest first
SELECT ?description ?date ?amount 
WHERE
{
    ?meal e:description ?description ;
        e:date ?date ;
        e:amount ?amount .
}
ORDER BY ?description DESC(?amount)
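
For comparison, the same two-key ordering can be sketched in Python. The rows below are made-up sample data, and this only mirrors the ordering rule rather than anything a SPARQL processor does internally.

```python
# Hypothetical meal rows: (description, date, amount).
meals = [
    ("lunch", "2011-10-14", 11.13),
    ("dinner", "2011-10-14", 28.30),
    ("lunch", "2011-10-15", 9.45),
    ("breakfast", "2011-10-14", 6.53),
]

# Ascending on description, then descending on amount:
# negating the numeric key reverses only that part of the ordering.
ordered = sorted(meals, key=lambda row: (row[0], -row[2]))
for row in ordered:
    print(row)
```

The two lunch rows stay together because description is the primary key, and within that group the larger amount comes first.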

Finding the largest or smallest value
- SPARQL 1.0: sort, then LIMIT 1
- SPARQL 1.1: MAX(), MIN()

Average: AVG()

SELECT (AVG(?amount) as ?avgAmount)
WHERE
{
    ?meal e:amount ?amount .
}

Sum with SUM(), count with COUNT()

GROUP_CONCAT(): concatenates many values into a single variable; the default separator is a space

# Returns an amountlist such as "25.05,10.00,6.65,31.45"
SELECT (GROUP_CONCAT(?amount; SEPARATOR = ',') AS ?amountlist)
WHERE { ?meal e:amount ?amount.}

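GROUP BY: groups results by a value; combined with SUM() it gives a total per group, and other aggregate functions can be used the same way

# Total amount for each of breakfast, lunch, and dinner
SELECT ?description (SUM(?amount) AS ?mealTotal)
WHERE
{...}
GROUP BY ?description

The HAVING keyword: restricts the displayed results to the groups that meet a condition

# We are only interested in totals above 20
SELECT ?description (SUM(?amount) AS ?mealTotal)
WHERE
{...}
GROUP BY ?description
HAVING (SUM(?amount)>20)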
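
The GROUP BY, SUM(), and HAVING steps can be traced by hand in a few lines of Python. This is a sketch over invented amounts, and the function name `totals_over` is hypothetical; it shows the order of operations (group, aggregate, then filter on the aggregate).

```python
from collections import defaultdict

# Hypothetical (description, amount) pairs.
meals = [
    ("lunch", 11.5),
    ("dinner", 28.0),
    ("lunch", 9.5),
    ("breakfast", 6.5),
    ("dinner", 31.5),
]

def totals_over(rows, threshold):
    """Sum amounts per description and keep only totals above the threshold."""
    totals = defaultdict(float)
    for description, amount in rows:   # GROUP BY ?description
        totals[description] += amount  # SUM(?amount)
    # HAVING filters on the aggregated value, after grouping.
    return {d: t for d, t in totals.items() if t > threshold}

print(totals_over(meals, 20))  # {'lunch': 21.0, 'dinner': 59.5}
```

Neither lunch row exceeds 20 on its own, yet the lunch group passes because the condition applies to the group total. A FILTER inside WHERE would test individual rows instead.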

Query A Remote SPARQL Service

Querying a remote SPARQL service.

The FROM keyword (RDF file)

SELECT ?title
FROM <xxx://xxx.xxx.xxxx/xxx>
{?s dc:title ?title}

The SERVICE keyword (SPARQL endpoint)

# Run a query at the SPARQL endpoint, then use what it returns to produce the results
SELECT ?p ?o
WHERE
{
    SERVICE <xxx://xxx.xxx/xxx>
    {
        SELECT ?p ?o
        WHERE {xxx ?p ?o}
    }
}

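ARQ requires the --data argument even when the query does nothing with that data (the query above only names the endpoint to query, not any local data).
With D2RQ you can use SPARQL to query a relational database (RDB).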

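Federated Queries: Searching Multiple Datasets With One Query

Federated queries: one query searches multiple datasets
- Variables bound in the first subquery remain available in later subqueries
- This is especially useful when the datasets are closely related (for cross-referencing)
- If a query consists of two subqueries that share no variables, and subquery 1 returns a results while subquery 2 returns b results, the whole query returns a*b results (cross-product)
- The subqueries run one after another, so this can take some time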
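
The difference between a cross-reference and a cross-product comes down to whether the two result sets share a variable. A minimal sketch in Python, with solutions modeled as dicts; the `join` helper and all sample values are assumptions for illustration, not a processor's actual join algorithm.

```python
from itertools import product

def join(left, right):
    """Combine two lists of solutions (dicts of variable -> value).

    Two solutions are merged when every shared variable has the same value.
    With no shared variables every pair is compatible: a cross-product.
    """
    results = []
    for a, b in product(left, right):
        if all(a[v] == b[v] for v in a.keys() & b.keys()):
            results.append({**a, **b})
    return results

# Hypothetical results from two services.
s1 = [{"s": "x1", "name": "Ann"}, {"s": "x2", "name": "Bob"}]
s2_shared = [{"s": "x1", "email": "ann@example.org"}]
s2_unrelated = [{"city": "Oslo"}, {"city": "Lima"}, {"city": "Pune"}]

print(len(join(s1, s2_unrelated)))  # 6
print(join(s1, s2_shared))  # [{'s': 'x1', 'name': 'Ann', 'email': 'ann@example.org'}]
```

With the shared variable s, only the matching pair survives; without one, 2 results times 3 results gives 6.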

Copying, Creating, And Converting Data

You can do more than just retrieve results.

Query Forms: SELECT, DESCRIBE, ASK, and CONSTRUCT

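CONSTRUCT returns triples. It can return the original data or extract values to create new triples, so it can be used to copy, create, and convert data.
ASK asks the processor whether the given graph pattern describes a set of triples in a particular dataset and returns a boolean. It can be used for quality control in automated data processing pipelines.
DESCRIBE asks for the triples that describe a particular resource.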

Copying Data

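Use CONSTRUCT to extract triples; combined with the GRAPH keyword, it can extract them from a specific named graph. SELECT is followed by a list of variables, whereas CONSTRUCT is followed by the triples you want to build.

# CONSTRUCT is followed by triples enclosed in curly braces, which may contain any number of triple patterns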
CONSTRUCT
{?person ?p ?o}
WHERE
{...}

Creating New Data

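Use functions to process the data and produce new fields.
Stating the class a resource belongs to makes inferencing easier.
So-called creating information is really making implicit information explicit.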

Converting Data

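Convert properties from one namespace into the namespace you need.
This means normalizing URIs so that data is easier to combine.
owl:sameAs is how DBpedia connects resources from different sources.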

Finding Bad Data

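A schema is a set of rules about data structure and datatypes.
If the data conforms to a schema, programmers don't need to write code to cope with situations like "1 was added to a string."
Semantic web applications take a different approach: adding more metadata.
Here the constraints are added with SPARQL rather than with OWL.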

Defining Rules with SPARQL

Rules expressed as queries
Some commonly used checks:
- isURI
- datatype(?amount) != xsd:integer
- !(bound(?grade))
- ?grade < 5

Generating Data About Broken Rules

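Replace ASK with CONSTRUCT.
Model the problem: the problem type and the related properties.
You can use UNION to combine different rules, but as the rules multiply there'd be greater and greater room for error.
A better approach: store the rules separately and run them as a pipeline when needed.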

Using Existing SPARQL Rules Vocabularies

Schemarama
SPIN
Relational database -> API -> check for rule compliance using SPARQL

Asking for a Description of a Resource

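DESCRIBE plus a URI: returns some information about the resource; exactly what comes back depends on the SPARQL query engine.
CONSTRUCT can do the same thing with better control, so DESCRIBE is not really recommended for serious application development.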

Datatypes And Functions

Datatypes and Queries

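Storing datatype metadata is one of the earliest ways of recording semantic information.
Declaring the datatype makes the data easier to interpret; when it is not declared, a default applies.
str() forces a conversion to a string: FILTER (str(?o) = "two") # returns everything whose value is "two"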

Representing Strings

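Single quotes, double quotes, or three single or three double quotes
When ARQ prints results, it delimits strings with double quotes, shows a carriage return as \r and a newline as \n, and uses \ as the escape character; the order of the output doesn't matter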

Comparing Values and Doing Arithmetic

Values explicitly typed with different numeric types can still be used together in arithmetic: for example, an integer and a decimal can be multiplied.

Functions

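The SPARQL 1.0 specification provides some basic functions; SPARQL 1.1 offers a much wider selection, nearly all of them based on XPath functions.
A SPARQL processor can offer whatever extension functions its implementers want to include.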

Program Logic Functions

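IF() takes three arguments. If the first argument is true, the function returns the value of the second; otherwise it returns the third.

COALESCE(): takes any number of arguments and returns the first one that doesn't result in an error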

SELECT ?first ?last 
WHERE 
{
?s ab:lastName ?last; 
   ab:firstName ?firstname . 
OPTIONAL{ ?s ab:nick ?nickname . } 
BIND (COALESCE(?nickname,?firstname) AS ?first)
}
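
The fallback behavior of COALESCE() resembles "first usable value wins". A small Python sketch, under the assumption that None stands in for an unbound variable; the `coalesce` helper and the sample rows are illustrative, and SPARQL's error handling is richer than this.

```python
def coalesce(*values):
    """Return the first value that is not None; None stands in for 'unbound'."""
    for value in values:
        if value is not None:
            return value
    return None  # every argument was unbound

# Hypothetical rows: nickname is optional, firstname is always present.
rows = [
    {"nickname": "Dick", "firstname": "Richard", "last": "Mutt"},
    {"nickname": None, "firstname": "Cindy", "last": "Marshall"},
]

firsts = [coalesce(r["nickname"], r["firstname"]) for r in rows]
print(firsts)  # ['Dick', 'Cindy']
```

Compared with stacking two OPTIONAL patterns that bind the same variable, this makes the preference order explicit in one expression.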

Node Type and Datatype Checking Functions

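A function may require arguments of a particular type, and a data field may also need to be of a particular type.
The datatype() function can be used to check the type.
isBlank(), isLiteral(), isNumeric(), isIRI(), and isURI()
Numbers, strings, and the keywords true and false (written in all lowercase) are all literals; only URIs and blank nodes are not.
datatype(params) returns a URI that identifies the type of params; it returns nothing when params is a blank node or a URI.
bound() tells us whether a variable has a bound value.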

Node Type Conversion Functions

URI() function lets you convert values to URIs if possible
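Before passing a value to URI() or IRI(), preprocessing it with ENCODE_FOR_URI() is a recommended practice, but note that it only accepts simple literals or xsd:string values.
str(): returns the string form of the resource passed in; it returns no value for a blank node.

An example: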

CONSTRUCT {?s ?p ?testURI.} 
WHERE
{
    ?s ?p ?o . 
    BIND( IF(isURI(?o), ?o, URI(ENCODE_FOR_URI(str(?o))) ) AS ?testURI)
}
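
To see what the escaping step does to a value, Python's `urllib.parse.quote` with `safe=""` is a close stand-in for ENCODE_FOR_URI(): both percent-encode everything except unreserved characters. This is an approximation for illustration; the `BASE` namespace and the helper `to_uri` are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/resource/"  # hypothetical namespace for the new URIs

def to_uri(value, is_uri):
    """Keep URIs as they are; escape anything else before turning it into a URI."""
    if is_uri:
        return value
    # safe="" also escapes "/", which quote() would otherwise leave alone.
    return BASE + quote(str(value), safe="")

print(to_uri("Los Angeles", False))  # http://example.org/resource/Los%20Angeles
print(to_uri("http://example.org/x", True))  # http://example.org/x
```

Without the escaping step, a value containing spaces or slashes would produce a malformed or misleading URI.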

Datatype Conversion

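Conversion to boolean is pickier: xsd:boolean(?o) cannot convert "True" but can convert "true".
xsd:dateTime() cannot convert "2011-11-12" but can convert "2011-11-12T14:30:00".

STRDT() takes two arguments, a literal value and a datatype URI, and creates a typed literal (which allows custom datatypes).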

CONSTRUCT { ?s u:amount ?newAmount . } 
WHERE {
    ?s im:product ?prodName ;
       im:amount ?amount ; 
       im:units ?units .
       BIND (STRDT(?amount, URI(CONCAT("http://learningsparql.com/ns/units#",?units))) AS ?newAmount)
}

Checking, Adding, and Removing Spoken Language Tags

FILTER ( lang(?label) = "en" ) # return only the English labels
BIND (str(?label) AS ?strippedLabel) # strip the @en tag and keep only the text
STRLANG() adds a language tag to a value, for example: STRLANG(?USTerm,"en-US")

String Functions

STRLEN(), SUBSTR(), UCASE(), and LCASE()
STRSTARTS(), STRENDS(), CONTAINS(), regex() # these return a boolean
The regex() function expects its first argument to be either an xsd:string or a simple literal with no language tag. Wrapping the argument in str() takes care of this.

Numeric Functions

abs(), round(), ceil(), floor()
rand() combined with CONSTRUCT: generates sample data

Date and Time Functions

now(),timezone(),tz()

Hash Functions

MD5(), SHA1(), SHA224(), SHA256(), SHA384(), SHA512() (SHA224() appeared in working drafts but is not part of the final SPARQL 1.1 Recommendation)
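
These functions return the checksum as a lowercase hexadecimal string. The same values can be reproduced with Python's `hashlib`, which is handy for checking query output; the helper name `hex_digest` is an assumption for illustration.

```python
import hashlib

def hex_digest(algorithm, text):
    """Hex digest of the UTF-8 encoding of `text` for a hashlib algorithm name."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()

print(hex_digest("md5", "abc"))   # 900150983cd24fb0d6963f7d28e17f72
print(hex_digest("sha1", "abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(hex_digest("sha256", "abc")))  # 64
```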

Extension Functions

Different SPARQL processors support different extension functions
Queries that use them are less portable

Updating Data With SPARQL

query the data with the SPARQL query language and manage it with the update language.

Getting Started with Fuseki

Download and install.

Adding Data to a Dataset

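Update files use the .ru extension, which marks them as update requests, not queries

# INSERT DATA is followed by the triples to insert
# A quick and simple way to add data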
INSERT DATA 
{d:i8301 ab:homeTel "(718) 440-9821" . ab:Person a rdfs:Class .}

# WHERE is followed by triple patterns; the variables it binds can be used in the INSERT template above
# A flexible way to create data
INSERT {d:i8301 ab:homeTel "(718) 440-9821" . ab:Person a rdfs:Class .}
WHERE {}

A triple pattern is a triple in which any position can be replaced by a variable.

Deleting Data

DELETE DATA {} and DELETE {} WHERE {}
DELETE WHERE {}: deletes the triples that match the WHERE pattern
CLEAR DEFAULT: removes all triples from the default graph

Changing Existing Data

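A single update operation can both delete and insert.
Even though the deletion happens before the insertion, the INSERT graph pattern still has access to everything the WHERE clause bound.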
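
The ordering can be sketched as: evaluate WHERE once against the unchanged data, then delete, then insert, with both templates filled from the same bindings. A minimal Python model over a set of triples; the function `delete_insert`, the property name `phone`, and the lambda-based templates are assumptions for illustration, not how a triplestore implements updates.

```python
def delete_insert(graph, match, delete_template, insert_template):
    """Apply a DELETE/INSERT-style update to a set of triples.

    `match` yields bindings computed against the graph before any change;
    both templates are then instantiated from those same bindings.
    """
    bindings = list(match(graph))                      # 1. evaluate WHERE once
    to_delete = {delete_template(b) for b in bindings}
    to_insert = {insert_template(b) for b in bindings}
    return (graph - to_delete) | to_insert             # 2. delete, 3. insert

graph = {("i8301", "homeTel", "(718) 440-9821")}
updated = delete_insert(
    graph,
    match=lambda g: ({"s": s, "tel": o} for s, p, o in g if p == "homeTel"),
    delete_template=lambda b: (b["s"], "homeTel", b["tel"]),
    insert_template=lambda b: (b["s"], "phone", b["tel"]),
)
print(updated)  # {('i8301', 'phone', '(718) 440-9821')}
```

The phone number survives the rename because it was captured in the bindings before the old triple was removed.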

Named Graphs

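SPARQL Update lets you chain multiple operations with semicolons
When triples are inserted into a graph that does not exist, the SPARQL processor creates that graph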

Dropping Graphs

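DROP GRAPH d:g1 # drops graph g1
DROP DEFAULT # clears the default graph (the default graph always exists, even when it is empty)
DROP NAMED: drops all named graphs
DROP ALL: drops all graphs
SPARQL Update has no UNDO operation, so DROP ALL should be used with care
Replacing DROP with CLEAR in these commands removes the triples from the graph rather than the graph itself
CREATE GRAPH: creates an empty graph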

Named Graph Syntax Shortcuts: WITH and USING

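WITH names the graph to operate on, which is more concise than repeating GRAPH
USING plays a role similar to FROM in a SELECT query

USING NAMED is the counterpart of FROM NAMED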

USING NAMED d:g2 
WHERE
{ 
    # GRAPH d:g2 must be stated explicitly here
    GRAPH d:g2 {?s dm:tag "five" . ?s dm:tag "six" .}
}

If you use USING, don't use WITH

Deleting and Replacing Triples in Named Graphs

DELETE DATA { GRAPH d:g2{ d:x dm:tag "six" }}

# GRAPH followed by a graph name or a variable
DELETE { GRAPH ?g { ?s ?p "three" } } 
WHERE { GRAPH ?g { ?s ?p "three" } }

WITH d:g1 
DELETE { ?s ?p "four"}
WHERE { ?s ?p "four"}

Building Applications With SPARQL

The most common way to send a query to an endpoint is to append an escaped version of the query as a parameter to the endpoint's URI
D2RQ
SPARQLWrapper for Python
ARQ source code for Java
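
Building the escaped request URL can be sketched with the Python standard library alone. The endpoint address below is a placeholder and `build_query_url` is a hypothetical helper; `query` is the parameter name defined by the SPARQL Protocol. The sketch only constructs the URL and sends nothing over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_query_url(endpoint, query):
    """Append the URL-escaped query to the endpoint as the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})

sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
url = build_query_url("http://example.org/sparql", sparql)  # placeholder endpoint
print(url)

# Decoding the parameter gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == sparql)  # True
```

Libraries such as SPARQLWrapper take care of this escaping and of parsing the response, so hand-built URLs are mostly useful for quick tests.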

Glossary

blank node: A subject or object in an RDF graph that has no identity. These are typically used to group together other values
default graph: The triples in an RDF dataset that don’t belong to a named graph
IRI: Internationalized Resource Identifier: a URI that allows a wider choice of characters, making it “internationalized.”
literal: A value, as opposed to a URI, which is a name for something. A literal may have a datatype or a spoken language tag associated with it, but not both. A simple literal is a literal with no language tag or datatype
N3: A non-XML RDF serialization format developed by Tim Berners-Lee. Turtle is a simplified version of N3
N-Triples: A very simple RDF serialization format that shows complete URIs with no abbreviation and a triple on each line. Often used as a graph dump format
named graph: A set of triples, typically within a larger collection of them, that can be referenced with a particular name. The name is a URI
RDF/XML: RDF’s original serialization format, based on XML
RDFS: the RDF Schema (RDFS) specification lets you specify classes, properties, and metadata about those classes and properties. These serve as metadata to let you infer new facts about your data, not as validation rules to indicate correct versus incorrect data
triplestore: A specialized database manager designed for storing triples
Turtle: An increasingly popular RDF serialization format based on N3
URI: “URI” is used more often to refer to an identifier, and “URL” to refer to a locator, or address. We use URIs to identify resources and property names in RDF