Making a simple Datalog Engine in Clojure

Making a simple datalog engine in Clojure.

I started reading about Datalog and watched this conference talk Domain Modeling With Datalog by Norbert Wojtowicz. The talk models Github using datalog. It starts out simple, but by the end models a significant portion of Github’s core features.

I was only familiar with SQL like databases, with data being conformed to a schema. This idea of modelling data using entity, attribute, value triples was very new to me. What I felt was more interesting was how it actually figured out the result. For e.g. the relations specified in the query can be in any order, can include variables, refer to each other etc. I wanted to implement this query engine.

The goal for me was to implement some of the examples presented in the talk. It’s not optimized, doesn’t have much features and probably has some bugs, but I had a lot of fun building it. I did have some classes on Prolog and logic programming in college, but didn’t pay that much attention. I only ever did basic things like, mary’s father is bob, bob’s father is dan, then is dan an ancestor of mary ? I didn’t realize it could be used as a query language for a database.

This post mentions the examples given in the talk, and how I implemented them.

Source Code: kaepr/dedukt

References #

Throughout this project, I took references from the below articles/talks/book.

Triples #

The data for this system is described by using only 3 things,

Entity, Attribute, Value

The datastructure for the database is a list of the above triples. Anytime some new information needs to be added, a new triple is added to this database. It’s append only.

For my usecase, I will just store it in memory, as a list of list.


(def db
  [[11 :attr/name "some-value"]
   [22 :attr/name "some-value-2"]
   ;; more entries
   [33 :attr/name "other-value-3"]])

Above is an example of such a database, with some random entities, attribute and values. Each row is called a datum, which is basically a fact we have in our the system. The database is a list of facts about the data.

Each row, in the entity column has unique ids. Different numbers means they are different things.

Below is the example from the talk, which is much more clear. It models user’s in Github.

(def db 
  [[11 :user/name "richhickey"]
   [22 :user/name "tonsky"]
   [33 :user/name "pithyless"]])

Each entity is a separate user ( based on the id ). Entity has some attribute attached to it, which is the keyword. And that attribute has a value to it. The current example only shows strings, but it can be anything.

Query #

The query looks similar to SQL queries.

;; syntax
[:find <var-name> <var-name>
 :where [e a v]
         ...
        [e a v]]

;; example
[:find ?name
 :where 
 [11 :user/name ?name]]

There’s a find clause, which finds ?name. This ?name is a user defined variable. The question mark in front of the variable name is sort of a standard practice. This is the unknown value, for which we are trying to find the values for. where clause expects a list of triples. Notice in this triple, ?name is used inside the where clause.

So this query will return all the values in all the triples, for which 11 is the entity, and :user/name is the attribute. So all usernames for a user with id 11.

The output is

#{["richhickey"]}

All outputs are returned as a set of tuples.

For now, let’s just implement this query, and use the db defined above.

There’s a small helper function to convert the query in vector form to map.

(defn parse-query
  "Returns query in map form.

  Each clause becomes a key inside map."
  [q]
  (let [find-index (.indexOf q :find)
        where-index (.indexOf q :where)
        vars (subvec q (inc find-index) where-index)
        wheres (subvec q (inc where-index))]
    {:find vars
     :where wheres}))

The naive solution ( just enough to make the example work ) is

(defn match
  "Returns the first matching fact and where clause."
  [[e a v :as _fact] where-clauses]
  (first
   (filter
    (fn [[we wa]]
      (and (= e we)
           (= a wa)))
    where-clauses)))

(defn run [db q]
  (let [{:keys [find where]} (parse-query q)
        rf (fn [acc fact]
             (if-let [matched (match fact where)]
               (conj acc [(nth fact 2)])
               acc))]
    (reduce rf #{} db)))

(run db q)
;; #{["richhickey"]}

It simply takes in the list of facts and where clauses, checks whether the entity and attribute match. If yes, then the value is a possible answer. This will work for different values of entity or attribute.

Let’s see the second query from the talk.


[:find ?id
 :where [?id :user/name "tonsky"]]

Now the query will return all the id’s with the given attribute and value. It’s definitely possible to modify the above code to handle this case. But even with just two queries it’s possible to see that is not the correct way forward. The implementation will become way too complex if I continued implementing in such a way.

The above solution made a lot of assumptions. For e.g. it assumed that value is the only thing we might query, what if there are multiples entities and attributes with same values ? What about multiple where clauses ?

Below are some more example queries.

[:find ?id ?name
 :where [?id _ "tonsky"]] ;; find all id? and ?name for any attributes, but value "tonsky"
 
[:find ?attr
 :where [_ ?attr _]] ;; find all attributes 

There are many more such variations, and it’s not possible to implement this by hand. The query runner must somehow figure out what to query, what to filter on, the constraints etc on it’s own.

Datalog #

Here is datalog’s wikipedia definition

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases.

Example


# datalog program consists of facts
edge(a, b); # there'a edge from a->b
edge(b, c); # edge from b->c
edge(d, e); # edge from d->e

# and also rules, which deduce new facts
path(x, y) :- edge(x, y)
path(x, y) :- edge(x, z), path(z, y)

The above example has a list of facts, e.g. edges in a graph. It also has rules, which define a path. A path is either a edge between two nodes, or through intermediary nodes.

Based on this information, a datalog query engine can figure out the path.

For e.g. if we ask the datalog system

# is there is path, using the facts and rules defined above ?
?- path(a, c)

The result will be true, as it figured out that there does exist a path from a to c based on the given facts. It will return false for path(a, d)

Datalog will also return the possible answers. E.g.

# find all possible values for Y, which you can figure out a path ( from a )
?- path(a, Y)

The Y above is a variable. Now the datalog engine returns all possible values of Y such that a path to a exists. In this case, it would be b, c.

Compare this to querying a database, and we can see the similarities.

For the query

[:find ?id
 :where [?id :user/name "tonsky"]]

the datalog engine will have to figure out the relations between the data. Then it will do something like

?- users(X, :user/name, "tonsky")

So it will return all possible values of X, such that the attribute was :user/name and name is "tonsky".

But how to make this engine, and how to represent the facts which this engine can understand.

How does the triple store translate to facts above?

The triple store defined in db above gets transformed as such.

[e a v] -> a(e, v)

The attributes inside the triple store becomes the predicates.

So the database of triples becomes something like

# triple store translated to datalog facts
user_name(11, "richhickey").
user_name(22, "tonsky").
user_name(33, "pithyless").
user_email(11, "[email protected]").

# query
user_name(UserId, "richhickey")?
user_email(UserId, Email)?

Above example has facts of user names and email and a query. Notice the UserId common in both queries. Datalog engine figures out this common variable and returns the email for the user who is named “richhickey”.

There’s a browser based datalog engine at Online Datalog Engine.

Query engine #

Let’s start with simple queries.

(def db 
  [[11 :user/name "richhickey"]
   [22 :user/name "tonsky"]
   [33 :user/name "pithyless"]])

[:find ?attr
 :where [22 ?attr "tonsky"]] ;; #{[:user/name]}

[:find ?id
 :where [?id :user/name "tonsky"]] ;; #{[22]}

[:find ?id ?name 
 :where [?id :user/name ?name]] ;; all entries

All of these are similar, just with different variables. There is only a single entry in the where clause. Let’s call the clause inside where a pattern. The pattern, which is itself a triple has either constants or variables. Variables are Clojure symbols, starting with ?. Constants are any valid Clojure value. This pattern will be checked against the database, which is a list of facts.

The pattern is checked against every fact. This can be handled generically by using pattern matching. This is unlike the first solution, where I just check for the values.

(defn match
  "Returns the fact, if it succesfully matches to the pattern.
   Otherwise nil."
  [pattern fact]
  (let [term-match? (fn [p f] (or (= '_ p) (variable? p) (= p f)))
        matched? (reduce (fn [acc [p f]]
                           (and acc (term-match? p f)))
                         true
                         (map vector pattern fact))]
    (when matched? fact)))

(match ['?id '_ "tonsky"] [22 :user/name "tonsky"])
;; [22 :user/name "tonsky"]

Above function tries to match a pattern against a fact. It simply checks for each item in the triple. If it’s a variable, then it’s matches. If it’s a constant, then it’s checked by value. There’s a special match all case, using the _ symbol in the pattern.

This match is run against all the facts in the database. If a value is found, great, otherwise there are no solutions.

This works, but will return the entire triple. We only want the values for symbols specified in the find clause.

This is done by the select function.

(defn select [find-vars pattern fact]
  (let [pattern-to-fact (map vector pattern fact)
        f (fn [fv]
            (some (fn [p-f]
                    (when (= (first p-f) fv) (second p-f))) pattern-to-fact))]
    (mapv f find-vars)))

It uses the pattern and the provided fact, and the list of variable names. Then merges to return the values of the variables.

Finally it’s the run function. Lets run’s this on the database and queries above.

(defn run [db q]
  (let [{:keys [find where]} (parse-query q)
        fact (some #(match (first where) %) db)
        selected (select find (first where) fact)]
    (set [selected])))

(def db 
  [... same as above ... ])

(run db [:find '?attr
         :where [22 '?attr "tonsky"]])
;; => #{[:user/name]} correct !

(run db [:find '?id
         :where ['?id :user/name "tonsky"]])
;; => #{[22]} correct !

(run db [:find '?id '?name
         :where ['?id :user/name '?name]])
;; => #{[11 "richhickey"]} wrong :(

The first two examples are correct. By using pattern matching, it was possible to offload the logic for figuring out which variable and it’s value we are trying to find.

The third is also technically correct, but we are only returning the first matching result. Ideally it should return all values. It’s a simple change, just return all matching facts for now.

Below is the modified run.

(defn run [db q]
  (let [{:keys [find where]} (parse-query q)
        facts (filter #(match (first where) %) db)
        values (mapv #(select find (first where) %) facts)]
    (set values)))

(run db [:find '?id
         :where ['?id :user/name '_]])
;; => #{[11] [22] [33]}

(run db [:find '?id '?name
         :where ['?id :user/name '?name]])
;; => #{[11 "richhickey"] [22 "tonsky"] [33 "pithyless"]}

So at this point, we are able to support multiple variables. Multiple facts in the database. A catch all variable _. And the result returns all possible values which satisfy the single where predicate we provide.

Multiple where predicates #

Let’s complicate the domain model and the possible queries. Each user will now have an email. We will update the database definition, and see an example query.

(def db
  [[11 :user/name "richhickey"]
   [22 :user/name "tonsky"]
   [33 :user/name "pithyless"]
   [11 :user/email "[email protected]"]
   [22 :user/email "[email protected]"]
   [33 :user/email "[email protected]"]])

[:find ?email
 :where [?id :user/name "richhickey"]
        [?id :user/email ?email]] ;; another where predicate 

Each user, identified by their numeric id, now has two attributes. name and now an email. The query includes another predicate. The interesting part is the ?id variable, which is used in both predicates.

The query is asking, find me the email of some user. This user has the name “richhickey”. The query engine will figure out, that user with id 11 has name as “richhickey”. Now using this updated information, it will find the triple, where entity id is 11 and the attribute is email, and will returns “[email protected]”.

Datalog queries also have no concept of order. Even if the order of where predicates is switched, the query engine will still figure it out.

There could be multiple ways of solving this. I will keep things simple and sort of brute force the solution.

For each pattern (where predicate), let’s find all possible answers. Instead of simply returning the result, it will be returned with the variable symbol used in the pattern. This will be used later to join the possible values together.

(defn find-variable-facts
  "Returns subset of fact, which are assigned as variables in the pattern."
  [pattern fact]
  (keep (fn [[p f]] (when (variable? p) f))
        (map vector pattern fact)))

(defn variable->fact
  "Return a map from variable to fact, with a given pattern and fact."
  [pattern fact]
  (zipmap (filter variable? pattern)
          (find-variable-facts pattern fact)))

(variable->fact
   '[?id :user/name ?name]
   '[11 :user/name "richhickey"])
;; => {?id 11, ?name "richhickey"}

(defn all-matches
  "Returns variable->fact mapping from all facts, with a given pattern."
  [pattern facts]
  (map #(variable->fact pattern %)
       (filter #(match pattern %) facts))) ;; reuses the `match` defined before

(all-matches
  '[?id :user/email ?email]
  db) ;; db defined before
;;=>({?id 11, ?email "[email protected]"}
;    {?id 22, ?email "[email protected]"}
;    {?id 33, ?email "[email protected]"})

So we have for each pattern ( variables ), all the facts which satisfy it. Till now it’s similar to the old solution ( apart from having a variable symbol map). As we have multiple patterns now, these list of variable bindings must be merged.

(defn merge-variable->facts
  [variable->facts-1 variable->facts-2]
  (for [vf1 variable->facts-1
        vf2 variable->facts-2
        :when (every? (fn [k] (= (get vf1 k)
                                 (get vf2 k)))
                      (set/intersection (set (keys vf1))
                                        (set (keys vf2))))]
    (merge vf1 vf2)))

(comment

  (merge-variable->facts
   '[{?id 11}]
   '[{?id 11 ?email "[email protected]"}
     {?id 22 ?email "[email protected]"}])
;; => ({?id 11, ?email "[email protected]"})
  ())

This is the core logic, and what I meant sort of by brute force. Given two patterns, we generate solutions for those two patterns. These solutions are just list of the variable bindings, generated using all-matches above.

Now to join these two lists of variable bindings, we make sure to keep only the valid combinations. E.g. if two patterns share a variable, then we keep only the variable bindings, where the values are common between them.

First is generating all solutions, by taking the cartesian product. This combines every variable binding map from the first result, with all the bindings from the other result. This is what I meant by brute force.

Now we filter these combinations. If there’s a common variable in both results, then it’s checked by value. Only keep the combination if the values match. If a variable exists in only one of the variable binding map, keep it. It might be used by the subsequent patterns .

Let’s combine all the above functions inside the process-patterns function.

(defn process-patterns [patterns facts]
  (let [initial-matches (all-matches (first patterns) facts)
        result (reduce (fn [acc pattern]
                         (merge-variable->facts acc
                                                (all-matches pattern facts)))
                       initial-matches
                       (rest patterns))]
    result))

(process-patterns
  '[[?id :user/name "richhickey"]
    [?id :user/email ?email]]
  db)
;; ({?id 11, ?email "[email protected]"})

And with the modified select function, we update the run function.

(defn select
  [variables result]
  (->> result
       (mapv #(vec (vals (select-keys % variables))))
       set))

(defn run [db q]
  (let [{:keys [find where]} (parse-query q)
        result (process-patterns where db)]
    (->> result
         (select find))))

We can now run queries which includes multiple variables across multiple clauses.

(def db
  [[11 :user/name "richhickey"]
   [22 :user/name "tonsky"]
   [33 :user/name "pithyless"]
   [11 :user/email "[email protected]"]
   [22 :user/email "[email protected]"]
   [33 :user/email "[email protected]"]
   [44 :org/name "clojure"]
   [55 :repo/slug "clojure/clojure"]
   [55 :repo/owner 44]
   [66 :repo/slug "tonsky/datascript"]
   [66 :repo/owner 22]])

(run db
     '[:find ?email
       :where
       [?id :user/email ?email]
       [?id :user/name "richhickey"]])
;; #{["[email protected]"]}

(run db
     '[:find ?user ?email
       :where
       [?id :user/email ?email]
       [?id :user/name ?user]])
;;#{["pithyless" "[email protected]"]
;   ["tonsky" "[email protected]"]
;   ["richhickey" "[email protected]"]}

(run db
     '[:find ?repo
       :where
       [?p :user/name "tonsky"]
       [?r :repo/owner ?p]
       [?r :repo/slug ?repo]])
;; #{["tonsky/datascript"]}

Limitations #

The current implementation is very basic. It generates all possible options for each clause, and filters the valid ones. It works for simple data, but it does not support the major feature of datalog engine which is rules, and recursively querying data.

For example, querying all repositories

[:find ?name ?repo
 :where 
   (or [?p :org/name ?name]
       [?p :user/name ?name])
   [?r :repo/owner ?p]
   [?r :repo/slug ?repo]]

This will return names of the owner and the repository. But owners, can either be user, or organizations. Hence the or. This kind of query is not supported.

The other major feature is recursively querying.

For e.g. let’s say we define a relation between repositories using fork.

["repo-2" :fork "repo-1"] ;; repo 2 is a fork of repo 1
["repo-3" :fork "repo-2"] ;; repo 3 is a fork of repo 2
["repo-4" :fork "repo-3"] ;; repo 4 is a fork of repo 3

Now we can have a query like finding all parents of “repo-4”. It will only return “repo-3”, but it should return all of the repos. (I don’t think forks have this behavior in Github, but you get the point)

This requires the query engine to use rules ( i.e. fork ), and query all the possible solutions.

I am yet to implement these features. Studying more about logic programming for now, and will try to implement the rest.

There’s also Learn Datalog Today. It’s a tutorial to using Datalog. It’s broken down into chapters, adding more features in each chapter. Quite similar to the conference talk I mentioned in the beginning. I might use this website and it’s challenges next time when trying to implement more features.