
Complete Guide to PyKeen: Python KnowlEdge EmbeddiNgs for Knowledge Graphs

PyKeen is a Python package that generates knowledge graph embeddings while abstracting away the training loop and evaluation. The embeddings obtained using PyKeen are reproducible, and they convey precise semantics in the knowledge graph.

Knowledge Graphs

A knowledge graph is a graph data structure that captures multimodal and multilateral information in terms of relationships between concepts. Concepts are represented as entity nodes, and the relationships between them are represented as the edges of the graph.

 Example of a Knowledge Graph

This graph can be used for various tasks such as search and information retrieval. We can also predict new relations between two concepts, making knowledge graphs an excellent choice for augmenting sparse data for ML and DL algorithms.

Knowledge Graph Embeddings

KGEs are vector-space representations of the entities and relations in a knowledge graph.

These embeddings are learned by a KGE model, which essentially tries to preserve the pairwise distances between entities, commensurate with the relations that connect them. The following models are available as part of the PyKeen package:

  • ComplEx: pykeen.models.ComplEx (Trouillon et al., 2016)
  • ComplExLiteral: pykeen.models.ComplExLiteral (Kristiadi et al., 2018)
  • ConvE: pykeen.models.ConvE (Dettmers et al., 2018)
  • ConvKB: pykeen.models.ConvKB (Nguyen et al., 2018)
  • DistMult: pykeen.models.DistMult (Yang et al., 2014)
  • DistMultLiteral: pykeen.models.DistMultLiteral (Kristiadi et al., 2018)
  • ERMLP: pykeen.models.ERMLP (Dong et al., 2014)
  • ERMLPE: pykeen.models.ERMLPE (Sharifzadeh et al., 2019)
  • HolE: pykeen.models.HolE (Nickel et al., 2016)
  • KG2E: pykeen.models.KG2E (He et al., 2015)
  • MuRE: pykeen.models.MuRE (Balažević et al., 2019)
  • NTN: pykeen.models.NTN (Socher et al., 2013)
  • PairRE: pykeen.models.PairRE (Chao et al., 2020)
  • ProjE: pykeen.models.ProjE (Shi et al., 2017)
  • RESCAL: pykeen.models.RESCAL (Nickel et al., 2011)
  • RGCN: pykeen.models.RGCN (Schlichtkrull et al., 2018)
  • RotatE: pykeen.models.RotatE (Sun et al., 2019)
  • SimplE: pykeen.models.SimplE (Kazemi et al., 2018)
  • StructuredEmbedding: pykeen.models.StructuredEmbedding (Bordes et al., 2011)
  • TransD: pykeen.models.TransD (Ji et al., 2015)
  • TransE: pykeen.models.TransE (Bordes et al., 2013)
  • TransH: pykeen.models.TransH (Wang et al., 2014)
  • TransR: pykeen.models.TransR (Lin et al., 2015)
  • TuckER: pykeen.models.TuckER (Balažević et al., 2019)
  • UnstructuredModel: pykeen.models.UnstructuredModel (Bordes et al., 2014)

Let’s look closely at TransE, one of the most basic and well-known models for generating KGEs.

Translation Based Embedding

This model generates vectors for relations and entities in the same vector space. Following is the pseudocode for the algorithm behind this model.

Pseudocode of the TransE learning algorithm.

The distance mentioned in the algorithm is the L1 or L2 norm of h + l − t (the original paper allows either); a toy sketch follows the definitions below.

  • h is the head, or source entity, of a triple in the knowledge graph.
  • l is the link, or relation, between the two entities.
  • t is the tail, or destination entity, of a triple in the knowledge graph.
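
As a minimal sketch of this scoring idea, with made-up toy vectors rather than learned embeddings:

 import torch

 # Hypothetical toy embeddings; real ones are learned during training
 h = torch.tensor([0.2, 0.1, 0.7])   # head entity vector
 l = torch.tensor([0.5, -0.1, 0.0])  # relation vector
 t = torch.tensor([0.7, 0.0, 0.7])   # tail entity vector

 # TransE considers a triple plausible when h + l lands close to t;
 # the distance is the L1 or L2 norm of (h + l - t), smaller is better
 distance = torch.norm(h + l - t, p=2)
 print(distance)  # tensor(0.) here, a perfectly plausible toy triple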

TransE thus arranges the embeddings so that adding a relation vector to a head entity’s vector lands near the corresponding tail entity’s vector.

Now let’s see how to use pykeen to extract these embeddings.

PyKeen Installation

Installation of pykeen is quite simple. You can just do a pip install.

! pip install pykeen==1.0.5

This package runs on top of PyTorch, so install PyTorch as well.
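
A quick sanity check that both packages are importable (a minimal sketch; pykeen exposes a get_version() helper):

 import torch
 import pykeen

 print(torch.__version__)
 print(pykeen.get_version())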

Data

PyKeen provides many open-source datasets as classes for seamless integration with the rest of the module. Let’s check out the OpenBioLink knowledge graph in this article.

 # Load the OpenBioLink knowledge graph and inspect its training triples
 from pykeen.datasets import OpenBioLink
 dataset = OpenBioLink()
 dataset.training.triples

Each triple contains (head, link, tail), in that order. The first triple above is a gene-phenotype relation: a phenotype is an observable human trait such as the colour of the eyes, hair, or skin. It states that the gene NCBIGENE:11200 is responsible for the phenotype HP:0009919 (retina tumor). The OpenBioLink documentation explains the gene, phenotype, and anatomy identifiers used in these triples.
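
To get a feel for the graph’s size, the training split’s TriplesFactory exposes basic counts:

 # Number of triples, entities, and relations in the training split
 print(dataset.training.num_triples)
 print(dataset.training.num_entities)
 print(dataset.training.num_relations)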

Model, Optimizer and Training Approach

Next, we need to pick an embedding model to extract embeddings from the OpenBioLink knowledge graph. Following is the code to load the TransE model in PyKeen:

 # Pick a model; the training TriplesFactory supplies the entity/relation mappings
 from pykeen.models import TransE
 model = TransE(triples_factory=dataset.training)

We can choose optimizers from torch to train the model.

 # Pick an optimizer from Torch
 from torch.optim import Adam
 optimizer = Adam(params=model.get_grad_params()) 

We also need to select a training approach to train the model:

 # Pick a training approach (sLCWA or LCWA)
 from pykeen.training import SLCWATrainingLoop
 training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer) 

Whenever we have a knowledge graph, we need to make certain assumptions to draw inferences from it. The Closed World Assumption (CWA) is one such assumption: if a link is not present between two entities, it is assumed to be false, i.e. the probability of a relationship between those entities is zero. The problem with this assumption is immediate: once we adopt it, we can’t predict any new links in the knowledge graph. Collecting real-world data is challenging, and many true relationships are never captured in the graph, so the CWA turns all of that missing data into false values.

The Local Closed World Assumption (LCWA) addresses this by marking only particular areas of the knowledge graph as complete, via a predicate that says whether a given area is complete or not. The Stochastic Local Closed World Assumption (sLCWA) is a stochastic variant of the LCWA.
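
Both assumptions are available as training loops, so swapping in the LCWA-based loop is a one-line change, mirroring the sLCWA code above:

 # LCWA-based alternative to the sLCWA training loop
 from pykeen.training import LCWATrainingLoop
 training_loop = LCWATrainingLoop(model=model, optimizer=optimizer)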

Training and Evaluation

We are all set to train the model now. The following command trains it:

training_loop.train(num_epochs=5, batch_size=256)

The following code evaluates the trained model on the test set:

 # Pick an evaluator
 from pykeen.evaluation import RankBasedEvaluator
 evaluator = RankBasedEvaluator()
 # Get triples to test
 mapped_triples = dataset.testing.mapped_triples
 # Evaluate
 results = evaluator.evaluate(model, mapped_triples, batch_size=1024)
 print(results) 

To evaluate the embedding model, we use a test set of triples. The model assigns each triple a score reflecting its plausibility. Since the test triples are known true facts, we expect them to receive high scores. However, the scores produced by the model are unbounded, so there is no absolute threshold for what counts as a good score. Instead, we rank all candidate triples by score in descending order and use the ranks of the test triples to evaluate the model.
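
To make the rank-based idea concrete, here is a tiny pure-Python illustration with made-up ranks (not actual OpenBioLink results) of the standard metrics computed from ranks:

 # Suppose the model ranked three test triples at positions 1, 4 and 12
 ranks = [1, 4, 12]

 mean_rank = sum(ranks) / len(ranks)                    # lower is better
 mrr = sum(1.0 / r for r in ranks) / len(ranks)         # mean reciprocal rank, higher is better
 hits_at_10 = sum(r <= 10 for r in ranks) / len(ranks)  # fraction ranked in the top 10

 print(mean_rank, mrr, hits_at_10)  # approximately 5.67, 0.44, 0.67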

Pipeline

PyKeen provides a high-level entry point to access the models. It is called a pipeline. We should provide all the information about the model to the pipeline, and the pipeline takes care of everything required for training.

Example:

 from pykeen.pipeline import pipeline
 pipeline_result = pipeline(
     dataset='Nations',
     model='TransE',
     evaluator='RankBasedEvaluator',
     training_loop='sLCWA',
     negative_sampler='basic',
     model_kwargs=dict(
         scoring_fct_norm=2,
     ),
 )
 pipeline_result.save_to_directory('nations_transe') 
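
The returned PipelineResult bundles everything produced by the run; for example, the trained model and its evaluation metrics are available as attributes:

 # Access the trained model and its evaluation results
 trained_model = pipeline_result.model
 print(pipeline_result.metric_results)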

Hyperparameter Optimization

PyKeen provides a hyperparameter optimization pipeline, pykeen.hpo.hpo_pipeline(). It uses Optuna in the backend to perform the optimization. The following snippet shows how to optimize the hyperparameters:

 from pykeen.hpo import hpo_pipeline
 hpo_pipeline_result = hpo_pipeline(
     n_trials=30,
     dataset='Nations',
     model='TransE',
     loss='MarginRankingLoss',
     model_kwargs_ranges=dict(
         embedding_dim=dict(type=int, low=100, high=500, q=100),
     ),
     loss_kwargs_ranges=dict(
         margin=dict(type=float, low=1.0, high=2.0),
     ),
 )

hpo_pipeline works much like the simple pipeline: it searches over the specified hyperparameter ranges (using Optuna’s samplers, with random and grid search also available) and returns an hpo_pipeline_result.
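
Since the result wraps an Optuna study, the best trial can be inspected through standard Optuna accessors, for example:

 # Best hyperparameter assignment found across the 30 trials
 print(hpo_pipeline_result.study.best_params)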

Ranges for hyperparameters can be provided using the model_kwargs_ranges argument.

hpo_pipeline offers many other arguments controlling the search, which can be used to further optimize the model fit.

Saving and Restoring Model

PyKeen models are torch models with utility functions on top, so we can use torch’s own functionality to save and reload a model.

 import torch
 torch.save(model,'trained_model.pkl')
 my_pykeen_model = torch.load('trained_model.pkl') 

We can also save model checkpoints during training to restore the training process if it fails due to a crash. This functionality is enabled through the training_kwargs argument:

 training_kwargs=dict(
     num_epochs=2000,
     checkpoint_name='my_checkpoint.pt',
     checkpoint_directory='doctests/checkpoint_dir',
     checkpoint_frequency=5,
 )

To start the training process from a checkpoint, we simply need to use the same checkpoint name in the code. 
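
For example, re-running the pipeline with the same checkpoint settings resumes from the last saved state (a sketch reusing the names from the snippet above):

 from pykeen.pipeline import pipeline

 # The matching checkpoint_name makes training resume instead of restarting
 pipeline_result = pipeline(
     dataset='Nations',
     model='TransE',
     training_kwargs=dict(
         num_epochs=2000,
         checkpoint_name='my_checkpoint.pt',
         checkpoint_directory='doctests/checkpoint_dir',
         checkpoint_frequency=5,
     ),
 )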

Results

We have taken a knowledge graph and converted all the entities and relations into embeddings. Let’s see some of the interesting information we can extract from these embeddings.

What are the possible phenotypes observed due to the presence of the gene NCBIGENE:534? 

 # Most plausible tails for (head='NCBIGENE:534', relation='GENE_PHENOTYPE')
 predicted_tails_df = model.predict_tails('NCBIGENE:534', 'GENE_PHENOTYPE')
 predicted_tails_df

The top predicted phenotype for this gene is HP:0001337, which corresponds to tremor. This information was not present in the original knowledge graph; it is inferred from the phenotypes of closely related genes (where closeness is determined using only the information already present in the KG).

We can ask other questions as well, such as “Which heads are most plausible given a relation and tail?” or “What are the most plausible triples overall?”, as sketched below.
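
These queries use the same prediction interface; for instance, assuming this PyKeen version’s predict_heads counterpart takes (relation, tail) labels:

 # Which genes most plausibly exhibit the tremor phenotype? (sketch)
 predicted_heads_df = model.predict_heads('GENE_PHENOTYPE', 'HP:0001337')
 predicted_heads_df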

Pavan Kandru

AI enthusiast with a flair for NLP. I love playing with exotic data.