# 8์ฅ: Mixed Precision ํ์ตยถ


DGL์ mixed precision ํ์ต์ ์ํด์ PyTorchโs automatic mixed precision package ์ ํธํ๋๋ค. ๋ฐ๋ผ์, ํ์ต ์๊ฐ ๋ฐ GPU ๋ฉ๋ชจ๋ฆฌ ์ฌ์ฉ๋์ ์ ์ฝํ  ์ ์๋ค.

## Message Passing with Half Precision

fp16์ ์ง์ํ๋ DGL์ UDF(User Defined Function)์ด๋ ๋นํธ์ธ ํจ์(์, `dgl.function.sum`, `dgl.function.copy_u`)๋ฅผ ์ฌ์ฉํด์ `float16` ํผ์ณ์ ๋ํ ๋ฉ์์ง ์ ๋ฌ์ ํ์ฉํ๋ค.

๋ค์ ์์ ๋ DGL ๋ฉ์์ง ์ ๋ฌ API๋ฅผ half-precision ํผ์ณ๋ค์ ์ฌ์ฉํ๋ ๋ฐฉ๋ฒ์ ๋ณด์ฌ์ค๋ค.

```
>>> import torch
>>> import dgl
>>> import dgl.function as fn
>>> g = dgl.rand_graph(30, 100).to(0)  # Create a graph on GPU w/ 30 nodes and 100 edges.
>>> g.ndata['h'] = torch.rand(30, 16).to(0).half()  # Create fp16 node features.
>>> g.edata['w'] = torch.rand(100, 1).to(0).half()  # Create fp16 edge features.
>>> # Use DGL's built-in functions for message passing on fp16 features.
>>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
>>> g.ndata['x'][0]
tensor([0.3391, 0.2208, 0.7163, 0.6655, 0.7031, 0.5854, 0.9404, 0.7720, 0.6562,
        0.4028, 0.6943, 0.5908, 0.9307, 0.5962, 0.7827, 0.5034],
       device='cuda:0', dtype=torch.float16)
>>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
>>> g.edata['hx'][0]
tensor([5.4570], device='cuda:0', dtype=torch.float16)
>>> # Use UDF(User Defined Functions) for message passing on fp16 features.
>>> def message(edges):
...     return {'m': edges.src['h'] * edges.data['w']}
...
>>> def reduce(nodes):
...     return {'y': torch.sum(nodes.mailbox['m'], 1)}
...
>>> def dot(edges):
...     return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)}
...
>>> g.update_all(message, reduce)
>>> g.ndata['y'][0]
tensor([0.3394, 0.2209, 0.7168, 0.6655, 0.7026, 0.5854, 0.9404, 0.7720, 0.6562,
        0.4028, 0.6943, 0.5908, 0.9307, 0.5967, 0.7827, 0.5039],
       device='cuda:0', dtype=torch.float16)
>>> g.apply_edges(dot)
>>> g.edata['hy'][0]
tensor([5.4609], device='cuda:0', dtype=torch.float16)
```

## End-to-End Mixed Precision ํ์ตยถ

DGL์ PyTorch์ AMP package๋ฅผ ์ฌ์ฉํด์ mixed precision ํ์ต์ ๊ตฌํํ๊ณ  ์์ด์, ์ฌ์ฉ ๋ฐฉ๋ฒ์ PyTorch์ ๊ฒ ๊ณผ ๋์ผํ๋ค.

GNN ๋ชจ๋ธ์ forward ํจ์ค(loss ๊ณ์ฐ ํฌํจ)๋ฅผ `torch.cuda.amp.autocast()` ๋ก ๋ํํ๋ฉด PyTorch๋ ๊ฐ op ๋ฐ ํ์์ ๋ํด์ ์ ์ ํ ๋ฐ์ดํฐ ํ์์ ์๋์ผ๋ก ์ ํํ๋ค. Half precision ํ์๋ ๋ฉ๋ชจ๋ฆฌ ํจ์จ์ ์ด๊ณ , half precision ํ์์ ๋ํ ๋๋ถ๋ถ ์ฐ์ฐ๋ค์ GPU tensorcore๋ค์ ํ์ฉํ๊ธฐ ๋๋ฌธ์ ๋ ๋น ๋ฅด๋ค.

`float16` ํฌ๋ฉง์ ์์ graident๋ค์ ์ธ๋ํ๋ก์ฐ(underflow) ๋ฌธ์ ๋ฅผ ๊ฐ๋๋ฐ (0์ด ๋๋ฒ๋ฆผ), PyTorch๋ ์ด๋ฅผ ํด๊ฒฐํ๊ธฐ ์ํด์ `GradScaler` ๋ชจ๋์ ์ ๊ณตํ๋ค. `GradScaler` ๋ loss ๊ฐ์ factor๋ฅผ ๊ณฑํ๊ณ , ์ด scaled loss์ backward pass๋ฅผ ์ํํ๋ค. ๊ทธ๋ฆฌ๊ณ  ํ๋ผ๋ฉํฐ๋ค์ ์๋ฐ์ดํธํ๋ optimizer๋ฅผ ์ํํ๊ธฐ ์ ์ unscale ํ๋ค.

๋ค์์ 3-๋ ์ด์ด GAT๋ฅผ Reddit ๋ฐ์ดํฐ์(1140์ต๊ฐ์ ์์ง๋ฅผ ๊ฐ๋)์ ํ์ต์ ํ๋ ์คํฌ๋ฆฝํธ์ด๋ค. `use_fp16` ๊ฐ ํ์ฑํ/๋นํ์ฑํ๋์์ ๋์ ์ฝ๋ ์ฐจ์ด๋ฅผ ์ดํด๋ณด์.

```
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
import dgl
from dgl.data import RedditDataset
from dgl.nn import GATConv

use_fp16 = True


class GAT(nn.Module):
    def __init__(self,
                 in_feats,
                 n_hidden,
                 n_classes,
                 heads):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(GATConv(in_feats, n_hidden, heads[0], activation=F.elu))
        self.layers.append(GATConv(n_hidden * heads[0], n_hidden, heads[1], activation=F.elu))
        self.layers.append(GATConv(n_hidden * heads[1], n_classes, heads[2], activation=None))

    def forward(self, g, h):
        for l, layer in enumerate(self.layers):
            h = layer(g, h)
            if l != len(self.layers) - 1:
                h = h.flatten(1)
            else:
                h = h.mean(1)
        return h

# Load data
data = RedditDataset()
device = torch.device(0)
g = data[0]
g = g.int().to(device)
train_mask = g.ndata['train_mask']
features = g.ndata['feat']
labels = g.ndata['label']
in_feats = features.shape[1]
n_hidden = 256
n_classes = data.num_classes
n_edges = g.num_edges()
heads = [1, 1, 1]
model = GAT(in_feats, n_hidden, n_classes, heads)
model = model.to(device)

# Create optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# Create gradient scaler
scaler = GradScaler()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()

    # Wrap forward pass with autocast
    with autocast(enabled=use_fp16):
        logits = model(g, features)
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

    if use_fp16:
        # Backprop w/ gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        optimizer.step()

    print('Epoch {} | Loss {}'.format(epoch, loss.item()))
```

NVIDIA V100 (16GB) ํ๊ฐ๋ฅผ ๊ฐ๋ ์ปดํจํฐ์์, ์ด ๋ชจ๋ธ์ fp16์ ์ฌ์ฉํ์ง ์๊ณ  ํ์ตํ  ๋๋ 15.2GB GPU ๋ฉ๋ชจ๋ฆฌ๊ฐ ์ฌ์ฉ๋๋๋ฐ, fp16์ ํ์ฑํํ๋ฉด, ํ์ต์ 12.8G GPU ๋ฉ๋ชจ๋ฆฌ๊ฐ ์ฌ์ฉ๋๋ฉฐ, ๋ ๊ฒฝ์ฐ loss๊ฐ ๋น์ทํ ๊ฐ์ผ๋ก ์๋ ดํ๋ค. ๋ง์ฝ head์ ๊ฐฏ์๋ฅผ `[2, 2, 2]` ๋ก ๋ฐ๊พธ๋ฉด, fp16๋ฅผ ์ฌ์ฉํ์ง ์๋ ํ์ต์ GPU OOM(out-of-memory) ์ด์๊ฐ ์๊ธธ ๊ฒ์ด์ง๋ง, fp16๋ฅผ ์ฌ์ฉํ ํ์ต์ 15.7G GPU ๋ฉ๋ชจ๋ฆฌ๋ฅผ ์ฌ์ฉํ๋ฉด์ ์ํ๋๋ค.

DGL์ half-precision ์ง์์ ๊ณ์ ํฅ์ํ๊ณ  ์๊ณ , ์ฐ์ฐ ์ปค๋์ ์ฑ๋ฅ์ ์์ง ์ต์ ์ ์๋๋ค. ์์ผ๋ก์ ์๋ฐ์ดํธ๋ฅผ ๊ณ์ ์ง์ผ๋ณด์.