Hyperbolic embedding paper and code implementation -- 1. Introduction to data set

Hyperbolic embedding has produced a series of follow-up articles, each with its own code. The main purpose of this blog series is to reproduce the original hyperbolic embedding paper: derive in detail some results that the paper states without derivation, and implement the corresponding code along the way.

Some of the existing code is hard to run, some is hard to read (it is heavily encapsulated), and some contains bugs. We therefore reimplement the method in Python following the settings in the paper, run it end to end, and plot the results.

1. Objectives

We have hierarchical, network-structured data. How can we represent each word as a vector that reflects its position in the hierarchy, that is, the chain of parent and child concepts it sits on? In other words, we want to map each word to a real-valued vector (a word embedding). The simplest idea is to use one-hot word vectors, which are easy to construct but usually not a good choice: a one-hot vector cannot express how similar two words are, and it cannot describe the hierarchical structure between words. Among other methods, the most popular is word2vec, which embeds words in Euclidean space. Such embeddings capture similarity between words well, but they still struggle to represent hierarchy.
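To make the limitation of one-hot vectors concrete, here is a minimal sketch. The four-word vocabulary and the helper names (`one_hot`, `dot`, `squared_distance`) are made up for illustration and are not part of the article's code.

```python
# Toy vocabulary; the words are illustrative, not taken from the article's dataset.
vocab = ["mammal", "dog", "cat", "elephant"]

def one_hot(word):
    """Return the one-hot vector of `word` over `vocab` as a list of 0/1."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

dog, cat, mammal = one_hot("dog"), one_hot("cat"), one_hot("mammal")

# Any two distinct words get the same dot product and the same distance,
# so neither similarity nor parent/child structure is visible.
print(dot(dog, cat), dot(dog, mammal))
print(squared_distance(dog, cat), squared_distance(dog, mammal))
```

Whatever pair of distinct words is chosen, the two printed lines show identical values, which is why one-hot vectors carry no information about how words relate to each other.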

To capture both the similarity between words and the hierarchy between them, the words are instead embedded in hyperbolic space. Hyperbolic embeddings represent hierarchical structure much better than Euclidean embeddings, and they need fewer dimensions to do so.
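As a rough illustration of why hyperbolic space suits hierarchies, the sketch below compares distances in the Poincaré ball model with Euclidean distances. This section does not say which model of hyperbolic space the paper uses, so the Poincaré ball is an assumption here, and the function names and sample points are illustrative only. The distance used is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))).

```python
from math import acosh, sqrt

def poincare_distance(u, v):
    """Poincare-ball distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

def euclidean_distance(u, v):
    """Ordinary Euclidean distance, for comparison."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two pairs with the same Euclidean gap: one near the origin, one near the boundary.
near_origin = ([0.0, 0.0], [0.1, 0.0])
near_boundary = ([0.85, 0.0], [0.95, 0.0])

for name, (u, v) in [("near origin", near_origin), ("near boundary", near_boundary)]:
    print(name, euclidean_distance(u, v), poincare_distance(u, v))
```

The same Euclidean gap corresponds to a larger hyperbolic distance near the boundary of the ball. Under this model there is more "room" toward the boundary, which is the property that lets a tree with many leaves fit in few dimensions.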

Python dependencies

The code below relies on the following libraries:

import nltk
# nltk.download('wordnet') # Run this command for the first time to install wordnet dataset
from nltk.corpus import wordnet as wn
from math import *
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import networkx as nx

2. Dataset

The training data comes from WordNet. The dataset was described in last week's document, so the description is not repeated here.

Because the full WordNet dataset is relatively large, we only use mammal and its branches to test the code. First, let's look at what the dataset looks like. Since only the hierarchy matters, we only need to read and store the parent-child relationship of each mammal-related noun in the dataset.

network = {} # Building hierarchical networks
last_level = 8 # The deepest layer is set to 8 layers
levelOfNode = {} # The hierarchical information of the data. 0 is the mammal (root node) and 1 is the next structure of the mammal

# Building network recursively
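def get_hyponyms(synset, level):
    if level == last_level:
        # Deepest allowed layer: record the level and stop expanding this branch
        levelOfNode[str(synset)] = level
        return
    if str(synset) not in network:
        # First visit: store the direct children and the level of this node
        network[str(synset)] = [str(s) for s in synset.hyponyms()]
        levelOfNode[str(synset)] = level
    for hyponym in synset.hyponyms():
        get_hyponyms(hyponym, level + 1)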

# Construct a hierarchical data set with mammals as the root node
mammal = wn.synset('mammal.n.01')
get_hyponyms(mammal, 0)
levelOfNode[str(mammal)] = 0

# Add the terminal leaf node to the network dictionary
for a in levelOfNode:
    if not a in network:
        network[a] = []
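The text above says that only the child-parent relationships are needed. As a minimal sketch of how such pairs could be read off a dictionary shaped like `network`, the example below uses a small hand-written `toy_network` so it runs without WordNet. The node names and the helper `child_parent_pairs` are hypothetical, and later parts of the series may organize the training pairs differently.

```python
# Toy stand-in with the same shape as `network`: node -> list of direct children.
# The node names are illustrative placeholders, not actual WordNet output.
toy_network = {
    "mammal": ["placental", "monotreme"],
    "placental": ["elephant", "dog"],
    "monotreme": [],
    "elephant": [],
    "dog": [],
}

def child_parent_pairs(net):
    """Flatten a node -> children dictionary into a list of (child, parent) tuples."""
    return [(child, parent) for parent, children in net.items() for child in children]

pairs = child_parent_pairs(toy_network)
print(pairs)
```

Each tuple is one edge of the hierarchy, and leaf nodes (those with an empty child list) contribute no pairs of their own.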

Data display

After running the above code, you can get the corresponding node level and the overall network branches.

Node level (the value is the depth of the node: 0 is the root node, and the deepest layer is capped by last_level, which is 8 in the code above)

Network branches
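One quick way to inspect the level information is to count how many nodes sit at each level. The sketch below does this on a hand-written `toy_levels` dictionary, a hypothetical stand-in with the same shape as `levelOfNode`; the helper `nodes_per_level` is not part of the original code.

```python
from collections import Counter

# Toy stand-in for `levelOfNode`: node -> depth below the root (illustrative values).
toy_levels = {"mammal": 0, "placental": 1, "monotreme": 1, "elephant": 2, "dog": 2}

def nodes_per_level(levels):
    """Return {depth: number of nodes at that depth}, ordered by depth."""
    return dict(sorted(Counter(levels.values()).items()))

summary = nodes_per_level(toy_levels)
for depth, count in summary.items():
    print(f"level {depth}: {count} node(s)")
```

Passing the real `levelOfNode` instead of `toy_levels` should give the same kind of summary for the mammal subtree.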

To show the tree structure more clearly, the following code draws the hierarchy as a graph.

def norm(x):
    # Note: this is the squared Euclidean norm of x (no square root is taken)
    return np.dot(x, x)

def traverse(graph, start, node):
    node_name = node.name().split(".")[0]
    graph.depth[node_name] = node.shortest_path_distance(start)

    for child in node.hyponyms():
        child_name = child.name().split(".")[0]
        graph.add_edge(node_name, child_name) # Add edge
        traverse(graph, start, child) # Recursive construction

def hyponym_graph(start):
    G = nx.Graph() # Define a graph
    G.depth = {}
    traverse(G, start, start)
    return G

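import os

def graph_draw(graph):
    plt.figure(figsize=(10, 10)) # Show the overall network
    # plt.figure(figsize=(3, 3)) # Show elephant network
    # The drawing call itself was missing from the listing; nx.draw is assumed here
    nx.draw(graph,
         node_size = [10 * graph.degree(n) for n in graph],
         node_color = [graph.depth[n] for n in graph],
         alpha = 0.8,
         font_size = 4,
         width = 0.5,
         with_labels = True)
    def get_keys(d, value):
        return [k for k,v in d.items() if v == value]
    root_name = get_keys(graph.depth, 0)[0] # The root is the node with depth 0
    # Expand "~" to the home directory explicitly; the target directory must already exist
    plt.savefig(os.path.expanduser("~/hyperE/fig/" + root_name + ".png"), dpi = 300)

graph = hyponym_graph(mammal)
graph_draw(graph)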

The full mammal hierarchy is drawn below. At this point the layout carries no spatial information, only hierarchical information:
[Figure: hyponym graph rooted at mammal]
Node color encodes depth, so darker nodes are closer to the root node (mammal). Node size scales with degree, so larger nodes have more direct neighbors.
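The two quantities the plot encodes, depth and degree, can also be computed without drawing anything. The sketch below does so on a small hand-written tree using only the standard library. The edge list and the helpers `build_adjacency` and `bfs_depth` are illustrative assumptions, standing in for what `shortest_path_distance` and `graph.degree` provide in the code above.

```python
from collections import deque

# Toy undirected tree as an edge list (illustrative names, not WordNet output).
edges = [("mammal", "placental"), ("mammal", "monotreme"),
         ("placental", "elephant"), ("placental", "dog")]

def build_adjacency(edge_list):
    """Undirected adjacency sets built from a list of (u, v) edges."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_depth(adj, root):
    """Number of edges on the shortest path from `root` to each reachable node."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth

adj = build_adjacency(edges)
depth = bfs_depth(adj, "mammal")                  # would drive node_color
degree = {n: len(nbs) for n, nbs in adj.items()}  # would drive node_size
print(depth)
print(degree)
```

In this toy tree the node with the highest degree is "placental", not the root, which shows that node size and closeness to the root are separate properties.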

Because the dataset is large, the full plot is hard to read. To take a closer look at the data, we extract just the elephant subtree.

elephant = wn.synset('elephant.n.01')
graph = hyponym_graph(elephant)
graph_draw(graph)

Later, we will use this data set to introduce the method and train the hyperbolic embedding model.

Posted on Sun, 31 Oct 2021 18:29:18 -0400 by sharpnova