{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nStochastic Training of GNN for Link Prediction\n==============================================\n\nThis tutorial will show how to train a multi-layer GraphSAGE for link\nprediction on ogbn-arxiv provided by Open Graph Benchmark\n(OGB) __. The dataset\ncontains around 170 thousand nodes and 1 million edges.\n\nBy the end of this tutorial, you will be able to\n\n- Train a GNN model for link prediction on a single GPU with DGL's\n neighbor sampling components.\n\nThis tutorial assumes that you have read the :doc:Introduction of Neighbor\nSampling for GNN Training  and :doc:Neighbor\nSampling for Node Classification .\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Link Prediction Overview\n------------------------\n\nLink prediction requires the model to predict the probability of\nexistence of an edge. This tutorial does so by computing a dot product\nbetween the representations of both incident nodes.\n\n\\begin{align}\\hat{y}_{u\\sim v} = \\sigma(h_u^T h_v)\\end{align}\n\nIt then minimizes the following binary cross entropy loss.\n\n\\begin{align}\\mathcal{L} = -\\sum_{u\\sim v\\in \\mathcal{D}}\\left( y_{u\\sim v}\\log(\\hat{y}_{u\\sim v}) + (1-y_{u\\sim v})\\log(1-\\hat{y}_{u\\sim v})) \\right)\\end{align}\n\nThis is identical to the link prediction formulation in :doc:the previous\ntutorial on link prediction <../blitz/4_link_predict>.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading Dataset\n---------------\n\nThis tutorial loads the dataset from the ogb package as in the\n:doc:previous tutorial .\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n\nos.environ[\"DGLBACKEND\"] = \"pytorch\"\nimport dgl\nimport numpy as np\nimport torch\nfrom ogb.nodeproppred import DglNodePropPredDataset\n\ndataset = DglNodePropPredDataset(\"ogbn-arxiv\")\ndevice = \"cpu\" # change to 'cuda' for GPU\n\ngraph, node_labels = dataset\n# Add reverse edges since ogbn-arxiv is unidirectional.\ngraph = dgl.add_reverse_edges(graph)\nprint(graph)\nprint(node_labels)\n\nnode_features = graph.ndata[\"feat\"]\nnode_labels = node_labels[:, 0]\nnum_features = node_features.shape\nnum_classes = (node_labels.max() + 1).item()\nprint(\"Number of classes:\", num_classes)\n\nidx_split = dataset.get_idx_split()\ntrain_nids = idx_split[\"train\"]\nvalid_nids = idx_split[\"valid\"]\ntest_nids = idx_split[\"test\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Defining Neighbor Sampler and Data Loader in DGL\n------------------------------------------------\n\nDifferent from the :doc:link prediction tutorial for full\ngraph <../blitz/4_link_predict>, a common practice to train GNN on large graphs is\nto iterate over the edges\nin minibatches, since computing the probability of all edges is usually\nimpossible. For each minibatch of edges, you compute the output\nrepresentation of their incident nodes using neighbor sampling and GNN,\nin a similar fashion introduced in the :doc:large-scale node classification\ntutorial .\n\nDGL provides dgl.dataloading.as_edge_prediction_sampler to\niterate over edges for edge classification or link prediction tasks.\n\nTo perform link prediction, you need to specify a negative sampler. DGL\nprovides builtin negative samplers such as\ndgl.dataloading.negative_sampler.Uniform. Here this tutorial uniformly\ndraws 5 negative examples per positive example.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "negative_sampler = dgl.dataloading.negative_sampler.Uniform(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After defining the negative sampler, one can then define the edge data\nloader with neighbor sampling. To create an DataLoader for\nlink prediction, provide a neighbor sampler object as well as the negative\nsampler object created above.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sampler = dgl.dataloading.NeighborSampler([4, 4])\nsampler = dgl.dataloading.as_edge_prediction_sampler(\n sampler, negative_sampler=negative_sampler\n)\ntrain_dataloader = dgl.dataloading.DataLoader(\n # The following arguments are specific to DataLoader.\n graph, # The graph\n torch.arange(graph.num_edges()), # The edges to iterate over\n sampler, # The neighbor sampler\n device=device, # Put the MFGs on CPU or GPU\n # The following arguments are inherited from PyTorch DataLoader.\n batch_size=1024, # Batch size\n shuffle=True, # Whether to shuffle the nodes for every epoch\n drop_last=False, # Whether to drop the last incomplete batch\n num_workers=0, # Number of sampler processes\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can peek one minibatch from train_dataloader and see what it\nwill give you.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "input_nodes, pos_graph, neg_graph, mfgs = next(iter(train_dataloader))\nprint(\"Number of input nodes:\", len(input_nodes))\nprint(\n \"Positive graph # nodes:\",\n pos_graph.num_nodes(),\n \"# edges:\",\n pos_graph.num_edges(),\n)\nprint(\n \"Negative graph # nodes:\",\n neg_graph.num_nodes(),\n \"# edges:\",\n neg_graph.num_edges(),\n)\nprint(mfgs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example minibatch consists of four elements.\n\nThe first element is an ID tensor for the input nodes, i.e., nodes\nwhose input features are needed on the first GNN layer for this minibatch.\n\nThe second element and the third element are the positive graph and the\nnegative graph for this minibatch.\nThe concept of positive and negative graphs have been introduced in the\n:doc:full-graph link prediction tutorial <../blitz/4_link_predict>. In minibatch\ntraining, the positive graph and the negative graph only contain nodes\nnecessary for computing the pair-wise scores of positive and negative examples\nin the current minibatch.\n\nThe last element is a list of :doc:MFGs \nstoring the computation dependencies for each GNN layer.\nThe MFGs are used to compute the GNN outputs of the nodes\ninvolved in positive/negative graph.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Defining Model for Node Representation\n--------------------------------------\n\nThe model is almost identical to the one in the :doc:node classification\ntutorial . The only difference is\nthat since you are doing link prediction, the output dimension will not\nbe the number of classes in the dataset.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch.nn as nn\nimport torch.nn.functional as F\nfrom dgl.nn import SAGEConv\n\n\nclass Model(nn.Module):\n def __init__(self, in_feats, h_feats):\n super(Model, self).__init__()\n self.conv1 = SAGEConv(in_feats, h_feats, aggregator_type=\"mean\")\n self.conv2 = SAGEConv(h_feats, h_feats, aggregator_type=\"mean\")\n self.h_feats = h_feats\n\n def forward(self, mfgs, x):\n h_dst = x[: mfgs.num_dst_nodes()]\n h = self.conv1(mfgs, (x, h_dst))\n h = F.relu(h)\n h_dst = h[: mfgs.num_dst_nodes()]\n h = self.conv2(mfgs, (h, h_dst))\n return h\n\n\nmodel = Model(num_features, 128).to(device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Defining the Score Predictor for Edges\n--------------------------------------\n\nAfter getting the node representation necessary for the minibatch, the\nlast thing to do is to predict the score of the edges and non-existent\nedges in the sampled minibatch.\n\nThe following score predictor, copied from the :doc:link prediction\ntutorial <../blitz/4_link_predict>, takes a dot product between the\nincident nodes\u2019 representations.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dgl.function as fn\n\n\nclass DotPredictor(nn.Module):\n def forward(self, g, h):\n with g.local_scope():\n g.ndata[\"h\"] = h\n # Compute a new edge feature named 'score' by a dot-product between the\n # source node feature 'h' and destination node feature 'h'.\n g.apply_edges(fn.u_dot_v(\"h\", \"h\", \"score\"))\n # u_dot_v returns a 1-element vector for each edge so you need to squeeze it.\n return g.edata[\"score\"][:, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating Performance with Unsupervised Learning (Optional)\n------------------------------------------------------------\n\nThere are various ways to evaluate the performance of link prediction.\nThis tutorial follows the practice of GraphSAGE\npaper __.\nBasically, it first trains a GNN via link prediction, and get an embedding\nfor each node. Then it trains a downstream classifier on top of this\nembedding and compute the accuracy as an assessment of the embedding\nquality.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the representations of all the nodes, this tutorial uses\nneighbor sampling as introduced in the :doc:node classification\ntutorial .\n\n

#### Note

If you would like to obtain node representations without\n neighbor sampling during inference, please refer to this user\n guide .

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def inference(model, graph, node_features):\n with torch.no_grad():\n nodes = torch.arange(graph.num_nodes())\n\n sampler = dgl.dataloading.NeighborSampler([4, 4])\n train_dataloader = dgl.dataloading.DataLoader(\n graph,\n torch.arange(graph.num_nodes()),\n sampler,\n batch_size=1024,\n shuffle=False,\n drop_last=False,\n num_workers=4,\n device=device,\n )\n\n result = []\n for input_nodes, output_nodes, mfgs in train_dataloader:\n # feature copy from CPU to GPU takes place here\n inputs = mfgs.srcdata[\"feat\"]\n result.append(model(mfgs, inputs))\n\n return torch.cat(result)\n\n\nimport sklearn.metrics\n\n\ndef evaluate(emb, label, train_nids, valid_nids, test_nids):\n classifier = nn.Linear(emb.shape, num_classes).to(device)\n opt = torch.optim.LBFGS(classifier.parameters())\n\n def compute_loss():\n pred = classifier(emb[train_nids].to(device))\n loss = F.cross_entropy(pred, label[train_nids].to(device))\n return loss\n\n def closure():\n loss = compute_loss()\n opt.zero_grad()\n loss.backward()\n return loss\n\n prev_loss = float(\"inf\")\n for i in range(1000):\n opt.step(closure)\n with torch.no_grad():\n loss = compute_loss().item()\n if np.abs(loss - prev_loss) < 1e-4:\n print(\"Converges at iteration\", i)\n break\n else:\n prev_loss = loss\n\n with torch.no_grad():\n pred = classifier(emb.to(device)).cpu()\n label = label\n valid_acc = sklearn.metrics.accuracy_score(\n label[valid_nids].numpy(), pred[valid_nids].numpy().argmax(1)\n )\n test_acc = sklearn.metrics.accuracy_score(\n label[test_nids].numpy(), pred[test_nids].numpy().argmax(1)\n )\n return valid_acc, test_acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Defining Training Loop\n----------------------\n\nThe following initializes the model and defines the optimizer.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = Model(node_features.shape, 128).to(device)\npredictor = DotPredictor().to(device)\nopt = torch.optim.Adam(list(model.parameters()) + list(predictor.parameters()))\n\n\nimport sklearn.metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is the training loop for link prediction and\nevaluation, and also saves the model that performs the best on the\nvalidation set:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import tqdm\n\nbest_accuracy = 0\nbest_model_path = \"model.pt\"\nfor epoch in range(1):\n with tqdm.tqdm(train_dataloader) as tq:\n for step, (input_nodes, pos_graph, neg_graph, mfgs) in enumerate(tq):\n # feature copy from CPU to GPU takes place here\n inputs = mfgs.srcdata[\"feat\"]\n\n outputs = model(mfgs, inputs)\n pos_score = predictor(pos_graph, outputs)\n neg_score = predictor(neg_graph, outputs)\n\n score = torch.cat([pos_score, neg_score])\n label = torch.cat(\n [torch.ones_like(pos_score), torch.zeros_like(neg_score)]\n )\n loss = F.binary_cross_entropy_with_logits(score, label)\n\n opt.zero_grad()\n loss.backward()\n opt.step()\n\n tq.set_postfix({\"loss\": \"%.03f\" % loss.item()}, refresh=False)\n\n if (step + 1) % 500 == 0:\n model.eval()\n emb = inference(model, graph, node_features)\n valid_acc, test_acc = evaluate(\n emb, node_labels, train_nids, valid_nids, test_nids\n )\n print(\n \"Epoch {} Validation Accuracy {} Test Accuracy {}\".format(\n epoch, valid_acc, test_acc\n )\n )\n if best_accuracy < valid_acc:\n best_accuracy = valid_acc\n torch.save(model.state_dict(), best_model_path)\n model.train()\n\n # Note that this tutorial do not train the whole model to the end.\n break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating Performance with Link Prediction (Optional)\n------------------------------------------------------\n\nIn practice, it is more common to evaluate the link prediction\nmodel to see whether it can predict new edges. There are different\nevaluation metrics such as\nAUC __\nor various metrics from information retrieval __.\nUltimately, they require the model to predict one scalar score given\na node pair among a set of node pairs.\n\nAssuming that you have the following test set with labels, where\ntest_pos_src and test_pos_dst are ground truth node pairs\nwith edges in between (or *positive* pairs), and test_neg_src\nand test_neg_dst are ground truth node pairs without edges\nin between (or *negative* pairs).\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Positive pairs\n# These are randomly generated as an example. You will need to\n# replace them with your own ground truth.\nn_test_pos = 1000\ntest_pos_src, test_pos_dst = (\n torch.randint(0, graph.num_nodes(), (n_test_pos,)),\n torch.randint(0, graph.num_nodes(), (n_test_pos,)),\n)\n# Negative pairs. Likewise, you will need to replace them with your\n# own ground truth.\ntest_neg_src = test_pos_src\ntest_neg_dst = torch.randint(0, graph.num_nodes(), (n_test_pos,))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First you need to compute the node representations for all the nodes\nwith the inference method above:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "node_reprs = inference(model, graph, node_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the predictor is a dot product, you can now easily compute the\nscore of positive and negative test pairs to compute metrics such\nas AUC:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "h_pos_src = node_reprs[test_pos_src]\nh_pos_dst = node_reprs[test_pos_dst]\nh_neg_src = node_reprs[test_neg_src]\nh_neg_dst = node_reprs[test_neg_dst]\nscore_pos = (h_pos_src * h_pos_dst).sum(1)\nscore_neg = (h_neg_src * h_neg_dst).sum(1)\ntest_preds = torch.cat([score_pos, score_neg]).cpu().numpy()\ntest_labels = (\n torch.cat([torch.ones_like(score_pos), torch.zeros_like(score_neg)])\n .cpu()\n .numpy()\n)\n\nauc = sklearn.metrics.roc_auc_score(test_labels, test_preds)\nprint(\"Link Prediction AUC:\", auc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n----------\n\nIn this tutorial, you have learned how to train a multi-layer GraphSAGE\nfor link prediction with neighbor sampling.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Thumbnail credits: Link Prediction with Neo4j, Mark Needham\n# sphinx_gallery_thumbnail_path = '_static/blitz_4_link_predict.png'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }