Commit 4a91be3c by zlj

add sample doc

parent 849da82c
Distributed Feature Fetching
============================
Introduction
------------
In this tutorial, we will explore how to perform feature fetching in the data loader using StarryGL. StarryGL provides convenient methods for fetching node or edge features during the data loading process. We will demonstrate how to define a data loader and utilize StarryGL's features to fetch the required features.
Defining the Data Loader
------------------------
To use feature fetching in the data loader, we need to define a data loader and configure it with the necessary parameters. We can use the `DistributedDataLoader` class from the `starrygl.sample.data_loader` module.
Here is an example of how to define a data loader for feature fetching:
.. code-block:: python

    from starrygl.sample.data_loader import DistributedDataLoader

    # Define the data loader
    trainloader = DistributedDataLoader(graph, data, sampler=sampler, sampler_fn=sampler_fn,
                                        neg_sampler=neg_sampler, batch_size=batch_size, mailbox=mailbox)
In the code snippet above, we import the `DistributedDataLoader` class and initialize it with the following parameters:
- `graph`: The distributed graph store.
- `data`: The graph data.
- `sampler`: A parallel sampler, such as the `NeighborSampler`.
- `sampler_fn`: The sampling type, i.e. a member of `SAMPLE_TYPE` such as `SAMPLE_TYPE.SAMPLE_FROM_TEMPORAL_EDGES`.
- `neg_sampler`: The negative sampler.
- `batch_size`: The batch size.
- `mailbox`: The mailbox used for communication and memory sharing.
Examples:
.. code-block:: python

    import torch

    from starrygl.sample.data_loader import DistributedDataLoader
    from starrygl.sample.part_utils.partition_tgnn import partition_load
    from starrygl.sample.graph_core import DataSet, DistributedGraphStore, TemporalNeighborSampleGraph
    from starrygl.sample.memory.shared_mailbox import SharedMailBox
    from starrygl.sample.sample_core.neighbor_sampler import NeighborSampler
    from starrygl.sample.sample_core.base import NegativeSampling
    from starrygl.sample.batch_data import SAMPLE_TYPE

    pdata = partition_load("PATH/{}".format(dataname), algo="metis_for_tgnn")
    graph = DistributedGraphStore(pdata=pdata, uvm_edge=False, uvm_node=False)
    sample_graph = TemporalNeighborSampleGraph(sample_graph=pdata.sample_graph, mode='full')
    mailbox = SharedMailBox(pdata.ids.shape[0], memory_param,
                            dim_edge_feat=pdata.edge_attr.shape[1] if pdata.edge_attr is not None else 0)
    sampler = NeighborSampler(num_nodes=graph.num_nodes, num_layers=1, fanout=[10],
                              graph_data=sample_graph, workers=15, policy='recent',
                              graph_name="wiki_train")
    neg_sampler = NegativeSampling('triplet')
    train_data = torch.masked_select(graph.edge_index,
                                     pdata.train_mask.to(graph.edge_index.device)).reshape(2, -1)
    trainloader = DistributedDataLoader(graph, train_data, sampler=sampler,
                                        sampler_fn=SAMPLE_TYPE.SAMPLE_FROM_TEMPORAL_EDGES,
                                        neg_sampler=neg_sampler, batch_size=1000, shuffle=False,
                                        drop_last=True, chunk_size=None, train=True,
                                        queue_size=1000, mailbox=mailbox)
Internally, the data loader calls `graph_sample` from `starrygl.sample.batch_data`, and the `to_block` function invoked by `graph_sample` performs the actual feature fetching. If no cache is configured, node and edge features are fetched directly from the graph data; otherwise, `starrygl.sample.cache.FetchFeatureCache` is used for feature fetching.
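The snippet below is a minimal sketch of that dispatch logic, assuming a hypothetical helper `fetch_node_features`, a plain feature tensor, and a cache object exposing a `fetch` method; it illustrates the idea rather than StarryGL's internal implementation.

.. code-block:: python

    import torch

    def fetch_node_features(node_ids: torch.Tensor, node_feat: torch.Tensor, cache=None) -> torch.Tensor:
        """Return the features of `node_ids`, going through the cache when one is configured."""
        if cache is None:
            # No cache: read the requested rows directly from the feature tensor.
            return node_feat[node_ids]
        # With a cache: serve hot rows from the cache, which falls back to the
        # underlying storage for misses (`fetch` is an illustrative method name).
        return cache.fetch(node_ids)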
Distributed Memory Updater
==========================
Introduction
------------
In this tutorial, we will explore the concept of a distributed memory updater in the context of StarryGL. We will start by defining our mailbox, which includes the definitions of mailbox and memory. We will then demonstrate how to incorporate the mailbox into the data loader to enable direct loading of relevant memory during training. Finally, we will discuss the process of updating the relevant storage using the `get_update_memory` and `get_update_mail` functions.
Defining the Mailbox
--------------------
To begin, let's define our mailbox, which is an essential component for the distributed memory updater. We will use the `SharedMailBox` class from the `starrygl.sample.memory.shared_mailbox` module.
Here is an example of how to define the mailbox:
.. code-block:: python

    from starrygl.sample.memory.shared_mailbox import SharedMailBox

    # Define the mailbox
    mailbox = SharedMailBox(num_nodes=num_nodes, memory_param=memory_param, dim_edge_feat=dim_edge_feat)
In the code snippet above, we import the `SharedMailBox` class and initialize it with the following parameters:
- `num_nodes`: The number of nodes in the graph.
- `memory_param`: The memory parameters specified in the YAML configuration file, following the Temporal Graph Network (TGN) framework (a hypothetical example of this dictionary is sketched after this list).
- `dim_edge_feat`: The dimension of the edge feature.
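The exact contents of `memory_param` depend on your YAML configuration. The dictionary below is only a hypothetical illustration of the kind of TGN-style fields such a configuration typically carries, not a definitive schema.

.. code-block:: python

    # Hypothetical memory configuration after loading the YAML file; the keys
    # and values are illustrative assumptions, not a required schema.
    memory_param = {
        'type': 'node',            # one memory vector per node, as in TGN
        'dim_out': 100,            # dimension of each node's memory vector
        'dim_time': 100,           # dimension of the time encoding
        'mailbox_size': 1,         # number of mails kept per node
        'mail_combine': 'last',    # keep only the most recent mail
        'memory_update': 'gru',    # recurrent cell used to update the memory
        'combine_node_feature': True,
    }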
Incorporating the Mailbox into the Data Loader
----------------------------------------------
After defining the mailbox, we need to pass it to the data loader so that the relevant memory/mailbox can be directly loaded during training. This ensures efficient access to the required memory for updating.
Here is an example of how to incorporate the mailbox into the data loader:
.. code-block:: python

    from starrygl.sample.part_utils.partition_tgnn import partition_load
    from starrygl.sample.memory.shared_mailbox import SharedMailBox

    # Load the partitioned data
    pdata = partition_load("PATH/{}".format(dataname), algo="metis_for_tgnn")

    # Initialize the mailbox with the required parameters
    mailbox = SharedMailBox(pdata.ids.shape[0], memory_param,
                            dim_edge_feat=pdata.edge_attr.shape[1] if pdata.edge_attr is not None else 0)
In the code snippet above, we import the necessary modules and load the partitioned data using the `partition_load` function. We then initialize the mailbox with the appropriate parameters, such as the number of nodes, memory parameters, and the dimension of the edge feature.
Updating the Relevant Storage
-----------------------------
During the training process, it is important to constantly update the relevant storage to ensure accurate and up-to-date information. In StarryGL, this is achieved by calling the `get_update_memory` and `get_update_mail` functions.
These functions follow the Temporal Graph Network (TGN) framework, in which the relevant storage is updated based on the current state of the graph.
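As a conceptual illustration of what this update amounts to, the sketch below combines each destination node's newest mail with its current memory through a GRU cell. The variable names, shapes, and the standalone function are assumptions made for illustration; they do not reflect the actual signatures of `get_update_memory` or `get_update_mail`.

.. code-block:: python

    import torch

    # Illustrative dimensions; a real configuration takes these from memory_param.
    num_nodes, dim_mail, dim_memory = 1000, 100, 100
    memory = torch.zeros(num_nodes, dim_memory)       # per-node memory state
    memory_ts = torch.zeros(num_nodes)                # last update time per node
    updater = torch.nn.GRUCell(dim_mail, dim_memory)  # learned memory update function

    @torch.no_grad()
    def apply_memory_update(node_ids, mails, mail_ts):
        # Only the nodes that received a mail in this batch are updated.
        memory[node_ids] = updater(mails, memory[node_ids])
        memory_ts[node_ids] = mail_ts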
Conclusion
----------
In this tutorial, we explored the concept of a distributed memory updater in StarryGL. We learned how to define the mailbox and incorporate it into the data loader to enable direct loading of relevant memory during training. We also discussed the process of updating the relevant storage using the `get_update_memory` and `get_update_mail` functions.
By utilizing the distributed memory updater, you can efficiently update and access the required memory during training, which is crucial for achieving accurate and effective results in graph-based models.
We hope this tutorial provides a clear understanding of the distributed memory updater in StarryGL. If you have any further questions or need additional assistance, please don't hesitate to ask.
Distributed Temporal Sampling
=============================
In this tutorial, we will explore the concept of parallel sampling in the context of large-scale graph data. We'll discuss the benefits of parallel sampling, the hybrid CPU-GPU approach we adopt, and how to use the provided functions for parallel sampling.
Introduction
------------
Parallel sampling plays a crucial role in training models on large amounts of data. Traditional serial sampling methods can be inefficient and waste computing and storage resources when dealing with complex graph data. Parallel sampling, on the other hand, improves efficiency and overall computational speed by simultaneously sampling from multiple nodes or neighbors. This approach accelerates the training and inference process of the model, making it more scalable and practical for large-scale graph data.
Hybrid CPU-GPU Approach
-----------------------
Our parallel sampling approach combines the strengths of CPUs and GPUs: the entire graph structure is stored on the CPU, sampling is performed on the CPU, and the sampled results are then uploaded to the GPU. Each trainer has its own sampler for parallel training, ensuring efficient utilization of computing resources.
Using the Parallel Sampler
--------------------------
To easily use the parallel sampler, follow these steps:
1. Import the required Python packages::

       from starrygl.sample.sample_core.neighbor_sampler import NeighborSampler

2. Initialize the parallel sampler with the desired parameters::

       sampler = NeighborSampler(num_nodes=num_nodes, num_layers=num_layers, fanout=fanout,
                                 graph_data=graph_data, workers=workers, is_distinct=is_distinct,
                                 policy=policy, edge_weight=edge_weight, graph_name=graph_name)
In the code snippet above, we import the ``NeighborSampler`` class from the ``starrygl.sample.sample_core.neighbor_sampler`` module and create an instance of it, providing the necessary parameters: the number of nodes, the number of layers to sample, the fanout (the maximum number of neighbors chosen per layer), the graph data to sample from, the number of workers (threads), the distinct multi-edge flag, the sampling policy, the initial edge weights, and the graph name.
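For instance, mirroring the configuration used in the feature-fetching example above, a single-hop recent-neighbor sampler can be configured as follows (the concrete values are illustrative, not recommendations)::

       sampler = NeighborSampler(num_nodes=graph.num_nodes,  # nodes in this partition
                                 num_layers=1,               # single-hop sampling
                                 fanout=[10],                # at most 10 neighbors per layer
                                 graph_data=sample_graph,    # TemporalNeighborSampleGraph to sample from
                                 workers=15,                 # CPU sampling threads
                                 policy='recent',            # prefer the most recent temporal neighbors
                                 graph_name='wiki_train')    # name identifying this graph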
3. Perform the parallel sampling::

       # Perform parallel sampling
       sampler.sample()
After initializing the sampler, you can call the ``sample()`` method to perform the parallel sampling. This method internally handles the sampling process, leveraging the hybrid CPU-GPU approach. The sampled data can then be used for further training or analysis.
Directly Calling Parallel Sampling Functions
--------------------------------------------
If you prefer to directly call the parallel sampling functions, you can use the following methods:
1. Import the required Python package::

       from starrygl.lib.libstarrygl_sampler import ParallelSampler, get_neighbors
2. Retrieve neighbor information and create a neighbor information table::

       # Get the neighbor information table
       tnb = get_neighbors(graph_name, row.contiguous(), col.contiguous(), num_nodes,
                           is_distinct, graph_data.eid, edge_weight, timestamp)
The ``get_neighbors`` function retrieves the neighbor information table based on the provided parameters, such as the graph name, the row and column indices (from ``graph_data.edge_index``), the number of nodes, the distinct multi-edge flag, the edge IDs, the edge weights, and the timestamp.
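For example, ``row`` and ``col`` are simply the two rows of ``graph_data.edge_index``::

       # Split the edge index into source (row) and destination (col) node IDs
       row, col = graph_data.edge_index[0, :], graph_data.edge_index[1, :]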
3. Call the parallel sampler::

       # Call the parallel sampler
       p_sampler = ParallelSampler(tnb, num_nodes, graph_data.num_edges, workers,
                                   fanout, num_layers, policy)
The ``ParallelSampler`` class is used to perform the parallel sampling. It takes the neighbor information table (``tnb``) and other parameters, such as the number of nodes, the number of edges, the number of workers, the fanout, the number of layers, and the sampling policy.
Additional Resources
--------------------
For complete usage details and more information, please refer to the ``starrygl.sample.sample_core.neighbor_sampler`` module.
We hope this tutorial provides a comprehensive understanding of distributed temporal sampling and how to use the provided functions for parallel sampling. If you have any further questions or need additional assistance, please don't hesitate to ask.
starrygl.sample.data_loader
===========================

.. note::
   Distributed Data Loader

.. currentmodule:: starrygl.sample.data_loader

.. autoclass:: DistributedDataLoader
@@ -2,4 +2,7 @@ Package References
==================
.. toctree::

   distributed
   neighbor_sampler
   memory
   data_loader
starrygl.sample.memory.shared_mailbox
=====================================

.. note::
   Distributed Shared MailBox

.. currentmodule:: starrygl.sample.memory.shared_mailbox

.. autoclass:: SharedMailBox
starrygl.sample.sample_core.neighbor_sampler
============================================

.. note::
   Structure sampler function

.. currentmodule:: starrygl.sample.sample_core.neighbor_sampler

.. autoclass:: NeighborSampler
@@ -16,8 +16,10 @@ extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "sphinx.ext.duration",
    "sphinx.ext.viewcode",
]

templates_path = ['_templates']
exclude_patterns = []