Last updated March 18, 2024

Complete Guide to PyKeen: Python KnowlEdge EmbeddiNgs for Knowledge Graphs

PyKeen is a Python package for training knowledge graph embeddings that abstracts away the training loop and evaluation. The embeddings obtained with PyKeen are reproducible and capture the semantics of the entities and relations in the knowledge graph.


Published on March 21, 2021

by Pavan Kandru

Knowledge Graphs

A knowledge graph is a graph data structure that captures multimodal and multilateral information as relationships between concepts. Concepts are represented as entities (the nodes of the graph), and relationships between concepts are represented as edges.

Such a graph can be used for tasks like search and information retrieval. We can also predict new relations between two concepts, which makes knowledge graphs an excellent choice for augmenting sparse data for ML and DL algorithms.

Knowledge Graph Embeddings

Knowledge graph embeddings (KGEs) are vector space representations of the entities and relationships in a knowledge graph.

These embeddings are produced by a KGE model. Such models essentially try to position entities in the vector space so that the distance between each pair is commensurate with the relation that connects them. The following models are available as part of the PyKeen package.

Name	Reference	Citation
ComplEx	pykeen.models.ComplEx	Trouillon et al., 2016
ComplExLiteral	pykeen.models.ComplExLiteral	Kristiadi et al., 2018
ConvE	pykeen.models.ConvE	Dettmers et al., 2018
ConvKB	pykeen.models.ConvKB	Nguyen et al., 2018
DistMult	pykeen.models.DistMult	Yang et al., 2014
DistMultLiteral	pykeen.models.DistMultLiteral	Kristiadi et al., 2018
ERMLP	pykeen.models.ERMLP	Dong et al., 2014
ERMLPE	pykeen.models.ERMLPE	Sharifzadeh et al., 2019
HolE	pykeen.models.HolE	Nickel et al., 2016
KG2E	pykeen.models.KG2E	He et al., 2015
MuRE	pykeen.models.MuRE	Balažević et al., 2019
NTN	pykeen.models.NTN	Socher et al., 2013
PairRE	pykeen.models.PairRE	Chao et al., 2020
ProjE	pykeen.models.ProjE	Shi et al., 2017
RESCAL	pykeen.models.RESCAL	Nickel et al., 2011
RGCN	pykeen.models.RGCN	Schlichtkrull et al., 2018
RotatE	pykeen.models.RotatE	Sun et al., 2019
SimplE	pykeen.models.SimplE	Kazemi et al., 2018
StructuredEmbedding	pykeen.models.StructuredEmbedding	Bordes et al., 2011
TransD	pykeen.models.TransD	Ji et al., 2015
TransE	pykeen.models.TransE	Bordes et al., 2013
TransH	pykeen.models.TransH	Wang et al., 2014
TransR	pykeen.models.TransR	Lin et al., 2015
TuckER	pykeen.models.TuckER	Balažević et al., 2019
UnstructuredModel	pykeen.models.UnstructuredModel	Bordes et al., 2014

Let’s look closely at TransE, one of the simplest and best-known models for generating KGEs.

Translation Based Embedding

This model generates vectors for relations and entities in the same vector space. Following is the pseudocode for the algorithm behind this model.

Pseudocode of the TransE learning algorithm.

The distance d mentioned in the algorithm is the L1 or L2 norm of h + l − t. For vectors, the L2 norm coincides with the Frobenius norm.

Here h is the head, or source entity, of a triple in the knowledge graph.
Here l is the link, or relation, that connects the head to the tail.
Here t is the tail, or destination entity, of the triple.
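
The idea behind this model can be sketched in a few lines of plain Python. This is a minimal illustration, not PyKeen's implementation: the toy vectors, the helper names `distance` and `margin_loss`, and the margin of 1.0 are assumptions chosen for the example. It computes the distance between h + l and t with the L2 norm (the same as the Frobenius norm for vectors) and the margin-based loss that compares a true triple with a corrupted one.

```python
import math


def distance(h, l, t):
    """L2 norm of (h + l - t): small when the triple is plausible."""
    return math.sqrt(sum((hi + li - ti) ** 2 for hi, li, ti in zip(h, l, t)))


def margin_loss(pos_dist, neg_dist, margin=1.0):
    """Hinge loss: zero once the corrupted triple is at least `margin` farther away."""
    return max(0.0, margin + pos_dist - neg_dist)


# Toy 2-dimensional embeddings (made-up values for illustration only).
h = [1.0, 0.0]          # head entity
l = [0.0, 1.0]          # link (relation)
t = [1.0, 1.0]          # true tail: h + l lands exactly on t
t_corrupt = [4.0, 5.0]  # corrupted tail

pos = distance(h, l, t)          # 0.0
neg = distance(h, l, t_corrupt)  # sqrt(3**2 + 4**2) = 5.0
loss = margin_loss(pos, neg)     # max(0, 1 + 0 - 5) = 0.0
```

A loss of zero means this pair needs no update: the true triple is already closer than the corrupted one by more than the margin. Training only moves embeddings when that is not the case.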

This plot shows the structure of embeddings obtained using the TransE model.

Now let’s see how to use pykeen to extract these embeddings.

PyKeen Installation

Installation of pykeen is quite simple. You can just do a pip install.

! pip install pykeen==1.0.5

This package runs on top of PyTorch, so install PyTorch as well.

Data

PyKeen provides many open-source datasets as classes that integrate seamlessly with the rest of the package. In this article, let’s use the OpenBioLink knowledge graph.

 from pykeen.datasets import OpenBioLink
 dataset = OpenBioLink()
 dataset.training.triples

Each triple is stored in (head, link, tail) order. The first triple in the output is a gene-phenotype relation. A phenotype is an observable trait, such as the colour of the eyes, hair, or skin. The first triple states that the gene NCBIGENE:11200 is responsible for the phenotype HP:0009919 (retina tumor). Use this link to look up what the gene, phenotype, and anatomy identifiers in these triples mean.
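
Models do not work on these string labels directly. They work on integer ids, which is why the evaluation code later in this article uses `mapped_triples` instead of `triples`. The sketch below shows the general idea in plain Python. It is an illustration only: the second triple and its `GENE:example` identifier are invented, and the real mapping is handled by PyKeen's triples factory.

```python
# Labeled triples in (head, link, tail) order. The first is the triple
# discussed above; the second is a made-up triple for illustration.
labeled = [
    ("NCBIGENE:11200", "GENE_PHENOTYPE", "HP:0009919"),
    ("GENE:example", "GENE_PHENOTYPE", "HP:0009919"),
]

# Assign every distinct entity and relation label a stable integer id.
entities = sorted({e for h, _, t in labeled for e in (h, t)})
relations = sorted({r for _, r, _ in labeled})
entity_to_id = {label: i for i, label in enumerate(entities)}
relation_to_id = {label: i for i, label in enumerate(relations)}

# Integer version of the triples, ready to index into embedding tables.
mapped = [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in labeled]
```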

Model, Optimizer and Training Approach

Next, we need to pick an embedding model to extract embeddings from the OpenBioLink Knowledge graph. Following is the code to load TransE model in pykeen:

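 # Pick a model
 from pykeen.models import TransE
 # The model is built from the training triples of the dataset loaded above
 training_triples_factory = dataset.training
 model = TransE(triples_factory=training_triples_factory)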

We can choose optimizers from torch to train the model.

 # Pick an optimizer from Torch
 from torch.optim import Adam
 optimizer = Adam(params=model.get_grad_params())

Next, we select a training approach for the model.

 # Pick a training approach (sLCWA or LCWA)
 from pykeen.training import SLCWATrainingLoop
 training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)

Whenever we have a knowledge graph, we need to make certain assumptions to draw inferences from it. The Closed World Assumption (CWA) is one of them. It assumes that if a link is not present between two entities, then that link is false, i.e., the probability of a relationship between these entities is always zero. The problem is immediate: once we assume this, we cannot predict any new links in the knowledge graph. Collecting real-world data is challenging, and many true relationships are never captured in the knowledge graph. This assumption turns all of that missing data into false values.

The Local Closed World Assumption (LCWA) relaxes this by treating the knowledge graph as complete only locally: a predicate specifies which areas of the graph are complete, and only missing links in those areas are considered false. The Stochastic Local Closed World Assumption (sLCWA) is a stochastic version of the LCWA, in which a random sample of the missing triples is treated as false during training.
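
The stochastic variant can be pictured as follows: each known triple is corrupted by swapping its head or tail for a random entity, and the result is used as a negative example. The snippet is a rough sketch in plain Python, not PyKeen's negative sampler. The toy triples, the `corrupt` helper, and the step that skips corruptions already present in the graph are all assumptions made for illustration.

```python
import random

# Toy knowledge graph (invented labels).
triples = [
    ("gene_a", "GENE_PHENOTYPE", "phenotype_x"),
    ("gene_b", "GENE_PHENOTYPE", "phenotype_y"),
    ("gene_a", "GENE_GENE", "gene_b"),
]
entities = sorted({e for h, _, t in triples for e in (h, t)})
known = set(triples)


def corrupt(triple, rng):
    """Replace the head or the tail with a random entity, skipping known triples."""
    h, r, t = triple
    while True:
        candidate = rng.choice(entities)
        negative = (candidate, r, t) if rng.random() < 0.5 else (h, r, candidate)
        if negative not in known:
            return negative


rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [corrupt(triple, rng) for triple in triples]
```

Each negative keeps the relation and one side of the original triple, so the model learns to tell a true triple apart from a closely related false one.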

Training and Evaluation

We are all set to train the model now. The following command trains the model.

training_loop.train(num_epochs=5, batch_size=256)

Following is the code to evaluate the trained model using a test set.

 # Pick an evaluator
 from pykeen.evaluation import RankBasedEvaluator
 evaluator = RankBasedEvaluator()
 # Get triples to test
 mapped_triples = dataset.testing.mapped_triples
 # Evaluate
 results = evaluator.evaluate(model, mapped_triples, batch_size=1024)
 print(results)

To evaluate the embedding model, we use a test set of triples.

The model assigns each of these triples a score based on how plausible it considers the triple.

We want the test triples to be highly plausible, so we expect them to receive high scores. But the score given by the model is not bounded, so we cannot say what a good absolute value is. Instead, each test triple is ranked against candidate triples in descending order of score, and these ranks are used to evaluate the model.
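
The ranking idea can be sketched without any library. The snippet below is an illustration with invented scores and entity names, not the internals of `RankBasedEvaluator`. For each test triple it ranks the true tail among all candidate tails by descending score, then summarises the ranks as mean rank, mean reciprocal rank, and hits@1.

```python
def rank_of_true(scores, true_entity):
    """1-based rank of the true entity by descending score (ties resolved optimistically)."""
    true_score = scores[true_entity]
    return 1 + sum(1 for s in scores.values() if s > true_score)


# Hypothetical model scores for the candidate tails of two test triples.
test_cases = [
    ({"a": -0.2, "b": -1.5, "c": -3.0}, "a"),  # true tail scored highest: rank 1
    ({"a": -2.0, "b": -0.5, "c": -1.0}, "a"),  # two candidates score higher: rank 3
]

ranks = [rank_of_true(scores, true) for scores, true in test_cases]
mean_rank = sum(ranks) / len(ranks)
mrr = sum(1.0 / r for r in ranks) / len(ranks)
hits_at_1 = sum(r <= 1 for r in ranks) / len(ranks)
```

Because only the order of the scores matters, these metrics stay comparable across models whose raw scores live on very different scales.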

Pipeline

PyKeen provides a high-level entry point called the pipeline. We pass it the dataset, the model, and the training and evaluation settings, and the pipeline takes care of everything required for training.

Example:

 from pykeen.pipeline import pipeline
 pipeline_result = pipeline(
     dataset='Nations',
     model='TransE',
     evaluator='RankBasedEvaluator',
     training_loop='sLCWA',
     negative_sampler='basic',
     model_kwargs=dict(
         scoring_fct_norm=2,
     ),
 )
 pipeline_result.save_to_directory('nations_transe')

Hyper Parameter Optimization

PyKeen provides a hyperparameter optimization function, pykeen.hpo.hpo_pipeline(). It uses Optuna in the backend to run the optimization. The following code snippet shows how to optimize the hyperparameters.

 from pykeen.hpo import hpo_pipeline
 hpo_pipeline_result = hpo_pipeline(
     n_trials=30,
     dataset='Nations',
     model='TransE',
     loss='MarginRankingLoss',
     model_kwargs_ranges=dict(
         embedding_dim=dict(type=int, low=100, high=500, q=100),
     ),
     loss_kwargs_ranges=dict(
         margin=dict(type=float, low=1.0, high=2.0),
     ),
 )

hpo_pipeline works much like the simple pipeline: it runs a search (grid search, random search, or a similar strategy) over the specified hyperparameter ranges and returns an hpo_pipeline_result.

Ranges for model hyperparameters are provided through the model_kwargs_ranges argument.

hpo_pipeline accepts ranges for other kinds of hyperparameters as well, such as the loss_kwargs_ranges used above, and these can also be tuned to improve the model fit.
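
To see what such a range describes, the sketch below draws random trials from the two ranges in the example above. It is only an illustration: the `sample` helper and the use of plain random search are assumptions, and it does not reproduce how Optuna actually picks its trials.

```python
import random

# The same ranges as in the hpo_pipeline example above.
ranges = {
    "embedding_dim": dict(type=int, low=100, high=500, q=100),
    "margin": dict(type=float, low=1.0, high=2.0),
}


def sample(spec, rng):
    """Draw one value: integers on a grid of step q, floats uniformly in [low, high]."""
    if spec["type"] is int:
        step = spec.get("q", 1)
        return rng.randrange(spec["low"], spec["high"] + 1, step)
    return rng.uniform(spec["low"], spec["high"])


rng = random.Random(42)  # fixed seed so the sketch is reproducible
trials = [{name: sample(spec, rng) for name, spec in ranges.items()} for _ in range(30)]
```

With q=100, the embedding dimension can only take the values 100, 200, 300, 400, and 500, while the margin can be any float between 1.0 and 2.0.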

Saving and Restoring Model

PyKeen models are torch models with utility functions on top. We can use torch’s own functionality to save and reload a model.

 import torch
 # Save the whole trained model object to disk
 torch.save(model, 'trained_model.pkl')
 # Load it back later
 my_pykeen_model = torch.load('trained_model.pkl')

We can also save model checkpoints during training so that training can resume if it fails due to a crash. This functionality is enabled through the training_kwargs argument of the pipeline.

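 from pykeen.pipeline import pipeline
 pipeline_result = pipeline(
     dataset='Nations',
     model='TransE',
     training_kwargs=dict(
         num_epochs=2000,
         checkpoint_name='my_checkpoint.pt',
         checkpoint_directory='doctests/checkpoint_dir',
         checkpoint_frequency=5,
     ),
 )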

To start the training process from a checkpoint, we simply need to use the same checkpoint name in the code.

Results

We have taken a knowledge graph and converted all the entities and relations into embeddings. Let’s see some of the interesting information we can extract from these embeddings.

What are the possible phenotypes observed due to the presence of the gene NCBIGENE:534?

 predicted_tails_df = model.predict_tails('NCBIGENE:534', 'GENE_PHENOTYPE')
 predicted_tails_df

The top predicted phenotype for the gene is HP:0001337, which corresponds to tremors. This information was not present in the original knowledge graph. We inferred it from the phenotypes of closely related genes, where closeness is decided using only the information present in the KG.

We can also ask other questions, such as: What is the head, given a relation and a tail? Or: Which triples are the most plausible?