GitHub - AColonnaDistria/rdf-sparql-engine: RDF graph engine (SparQL queries) in Java, benchmark of the engines

Star Query RDF Engine (HAI914I)

A high-performance Java-based RDF engine designed to evaluate Star Queries using the Hexastore indexing approach. This project was developed as part of the HAI914I module (NoSQL) at the University of Montpellier.

Authors

Each file has an author specified.

Antoine Colonna d'Istria antoine.colonnadistria@yahoo.com for the main RDF engine
Federico Ulliana federico.ulliana@inria.fr & Guillaume Pérution-Khili for the skeleton
Mehdi Bahktar & Théo Foutel-Rodier for the concurrent engine

Supervisor

Federico Ulliana federico.ulliana@inria.fr

Overview

The primary objective of this project is to implement an efficient system for querying RDF data. Unlike traditional relational databases, this engine uses a specialized Hexastore index—which maintains six separate indices for all permutations of Subject (S), Predicate (P), and Object (O) in order to provide O(1) or O(log N) lookups for RDF triple patterns.

Requirements

Java 21
Maven
RDF4J parser
Oracle: InteGraal 1.7.2

Build & Run

Clone the repository

git clone https://github.com/acolonnadistria/rdf-sparql-engine.git
cd rdf-sparql-engine

Build the project

mvn clean install

Create the dataset

mvn exec:java -Dexec.mainClass="qengine.watdiv.DatasetPreparer"

Run the performance test

mvn exec:java -Dexec.mainClass="qengine.benchmark.PerformanceTests"

Run the statistics analysis

cp data/output/statistics/*.json analysis
cd analysis && python3 analysis.py

Key Features

Dictionary Encoding: Maps long RDF URIs/Literals to compact integer IDs to reduce memory footprint and speed up comparisons.
Hexastore Indexing: Implementation of the six core indices (SPO, SOP, PSO, POS, OSP, OPS).
Giant-Table Implementation: A baseline Row-Store storage for performance benchmarking.
Star Query Evaluation: Optimized evaluation of SPARQL-style star queries (multiple triple patterns sharing a single variable).
Oracle Verification: Integration with the InteGraal library to verify result correctness and completeness.
Comparaison to Concurrent RDF Hexastores
Unit tests: each RDF engine is tested by a series of JUnit tests

Project structure

qengine.store : Core logic for the Dictionary and Hexastore index structures.
qengine.model : Internal representations of RDF Triples and Star Queries.
qengine.parser : Logic to parse .nt (N-Triples) and .queryset files.
qengine.program : Main entry points and usage examples.
qengine.statistics : Components for tracking query selectivity and engine performance.

Results

The engine was evaluated using the WatDiv (Waterloo SPARQL Diversity Test Suite) benchmark. The tests were conducted on a Framework Laptop 13 (AMD Ryzen 7 7840U, 32 GiB RAM) running Ubuntu 24.04.3 LTS.

Dataset Distribution (500k triplets & 2M triplets) / Data preparation

If we just run WatDiv without any smapling.

After removing duplicates and sampling it by number of answers. (Queries with 100+ answers will be unreliable as the dataset is much smaller).

Evaluating performance (500K triplets)

The primary performance benchmark compared the specialized Hexastore index against a baseline Giant Table implementation using 500k RDF triplets.

Query Selectivity (Bin)	Hexastore Avg Response (ms)	Giant Table Avg Response (ms)
0–1 answers (High)	1.373 ± 0.918	68.387 ± 12.798
1–20 answers (Medium)	0.063 ± 0.078	36.621 ± 2.714
20–100 answers (Low)	0.052 ± 0.004	35.881 ± 2.953
100+ answers (Very Low)	0.285 ± 0.346	22.800 ± 23.198

Key Finding: The Hexastore is significantly faster than the Giant Table, with speed improvements ranging from 80x to over 600x.

Evaluating performance (2M triplets)

Giant Table was not tested as it was too slow.

Bin	Hexastore (ms)
0–1	8.099 ± 2.778
1–20	2.503 ± 3.117
20–100	0.086 ± 0.039
100+	1.482 ± 1.780

Loading times

While the Hexastore provides superior query speeds, it requires building six distinct B-tree indices. However, in our benchmarks, the Hexastore maintained a much lower loading time compared to the Giant Table. The Giant Table loading exceeded 140 seconds for the test dataset, while the Hexastore completed in only a few seconds

Comparison to the concurrent engine (Théo Foutel-Rodier & Mehdi Bakhtar)

Factor Analysis (Two-Squared Experiment)

We performed a regression analysis to identify the main factors affecting system performance. The number of triplets is the most important factor (≈ 0.616), with JVM heap size (4 GB vs. 16 GB) also having a strong impact (≈ 0.355).

Correctness & Completeness In all scenarios, correctness and completeness were achieved and are at 100%. Verified against the InteGraal 1.7.2 library (Oracle) to ensure 100% Correctness and Completeness.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.metadata		.metadata
.settings		.settings
analysis		analysis
readme-images		readme-images
src		src
watdiv-mini-projet-partie-2		watdiv-mini-projet-partie-2
.classpath		.classpath
.gitattributes		.gitattributes
.gitignore		.gitignore
.project		.project
README.md		README.md
README.old.md		README.old.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Star Query RDF Engine (HAI914I)

Authors

Supervisor

Overview

Requirements

Build & Run

Key Features

Project structure

Results

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Star Query RDF Engine (HAI914I)

Authors

Supervisor

Overview

Requirements

Build & Run

Key Features

Project structure

Results

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages