Showing 2 changed files with 30 additions and 0 deletions.
- +24 −0 README.md
- +6 −0 download_wd_history.sh
@@ -0,0 +1,24 @@
SPARQL endpoint for Wikidata history
====================================

This repository provides a SPARQL endpoint for Wikidata history, allowing queries such as "count the number of humans in Wikidata in 2015" or "how many contributors have added values for the sex or gender property".

Warning: This is a work in progress and is not ready yet.


## User documentation

A public endpoint should be available soon, together with some example queries.
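In the meantime, here is a minimal sketch of how such an endpoint could be queried over HTTP with `curl` once it is running. The endpoint URL is hypothetical, and the history-specific syntax (e.g. restricting the count to 2015) is not shown because it is not documented yet; the query itself is a plain Wikidata-style count of items that are instances of human (`wd:Q5`).

```bash
# Sketch only: the endpoint URL below is hypothetical, and the history-specific
# vocabulary (e.g. "as of 2015") is not documented here.
ENDPOINT="http://localhost:8080/sparql"   # assumption: adjust to the real endpoint

# Count the items that are instances of human (wd:Q5), sent via the standard
# SPARQL protocol "query" parameter.
curl -G "$ENDPOINT" \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT (COUNT(?item) AS ?humans) WHERE { ?item wdt:P31 wd:Q5 }'
```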


## Developer documentation

To set up a working endpoint, do the following (a consolidated sketch follows the list):

* Compile the Java program: `mvn package`
* Download the Wikidata history dumps to a directory: `mkdir dumps && cd dumps && bash ../download_wd_history.sh`. Warning: it requires around 600GB of disk space.
* Preprocess the dumps to get all revision metadata and triples annotated with their insertions and deletions (takes a few days and all your CPU cores): `java -server -jar target/sparql-endpoint-0.1-SNAPSHOT.jar -preprocess`
* Build the database indexes: `java -server -jar target/sparql-endpoint-0.1-SNAPSHOT.jar -load`. You may use the `--wdt-only` argument to only load wdt: triples.
* Start the web server: `java -server -classpath target/sparql-endpoint-0.1-SNAPSHOT.jar org.wikidata.history.web.Main`
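Taken together, and assuming all commands are run from the repository root, the steps above amount to roughly the following sketch (the commands simply mirror the list; paths and options are taken from it, not verified independently):

```bash
# Consolidated sketch of the setup steps listed above, run from the repository root.
mvn package                                     # compile the Java program

mkdir dumps && cd dumps                         # needs around 600GB of disk space
bash ../download_wd_history.sh                  # download the Wikidata history dumps
cd ..

# Preprocess the dumps (takes a few days and all CPU cores), then build the indexes.
java -server -jar target/sparql-endpoint-0.1-SNAPSHOT.jar -preprocess
java -server -jar target/sparql-endpoint-0.1-SNAPSHOT.jar -load      # add --wdt-only to load only wdt: triples

# Start the web server.
java -server -classpath target/sparql-endpoint-0.1-SNAPSHOT.jar org.wikidata.history.web.Main
```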
@@ -0,0 +1,6 @@
#!/usr/bin/env bash

# List the latest Wikidata dump directory, keep the paths of the
# pages-meta-history dump files, and download each of them.
# wget -c resumes a partial download if the script is re-run.
curl https://dumps.wikimedia.org/wikidatawiki/latest/ | grep -Po "wikidatawiki/[0-9]+/wikidatawiki-[0-9]+-pages-meta-history[0-9]+\.xml-[p0-9]+\.bz2" | while read -r url ; do
    echo "$url"
    wget -c "https://dumps.wikimedia.org/$url"
done