4.2 Download raw data (optional)¶
If a dataset is already in local disk, make sure it’s in directory
raw_dir
. If one wants to run the code anywhere without bothering to
download and move data to the right directory, one can do it
automatically by implementing function download()
.
If the dataset is a zip file, make MyDataset
inherit from
dgl.data.DGLBuiltinDataset
class, which handles the zip file extraction for us. Otherwise,
one needs to implement download()
like in QM7bDataset
:
import os
from dgl.data.utils import download
def download(self):
# path to store the file
file_path = os.path.join(self.raw_dir, self.name + '.mat')
# download file
download(self.url, path=file_path)
The above code downloads a .mat file to directory self.raw_dir
. If
the file is a .gz, .tar, .tar.gz or .tgz file, use extract_archive()
function to extract. The following code shows how to download a .gz file
in BitcoinOTCDataset
:
from dgl.data.utils import download, check_sha1
def download(self):
# path to store the file
# make sure to use the same suffix as the original file name's
gz_file_path = os.path.join(self.raw_dir, self.name + '.csv.gz')
# download file
download(self.url, path=gz_file_path)
# check SHA-1
if not check_sha1(gz_file_path, self._sha1_str):
raise UserWarning('File {} is downloaded but the content hash does not match.'
'The repo may be outdated or download may be incomplete. '
'Otherwise you can create an issue for it.'.format(self.name + '.csv.gz'))
# extract file to directory `self.name` under `self.raw_dir`
self._extract_gz(gz_file_path, self.raw_path)
The above code will extract the file into directory self.name
under
self.raw_dir
. If the class inherits from dgl.data.DGLBuiltinDataset
to handle zip file, it will extract the file into directory self.name
as well.
Optionally, one can check SHA-1 string of the downloaded file as the example above does, in case the author changed the file in the remote server some day.