Function to convert Program Graph to PyTorch Geometric Graph #174

Open
zehanort opened this issue Jul 23, 2021 · 4 comments
Labels
Enhancement New feature or request

Comments

@zehanort

🚀 Feature

It would be nice to have a programl.to_pyg function to convert one or more Program Graphs to torch_geometric.data.Data, i.e. to PyTorch Geometric graphs.

Motivation

This would be extremely helpful for setting up ML/DL pipelines with custom GNNs using PyTorch Geometric. The library offers many utilities for machine and deep learning tasks on graphs, and it seems to be gaining a lot of popularity lately, especially in research.

Pitch

My idea is a 1-1 mapping from the nodes, edges and node features of the Program Graph to the PyG graph, with the Program Graph edge type (i.e., the CONTROL / DATA / CALL enum values) becoming a single edge feature of the PyG graph. Unfortunately, PyTorch Geometric does not (yet) explicitly support graph-level features; for the time being it seems to support only node-level features, node-level targets and graph-level targets. A reasonable workaround is therefore to extend the torch_geometric.data.Data object with an additional attribute, as proposed in the documentation. Extending the first introductory example from the docs:

>>> import torch
>>> from torch_geometric.data import Data
>>> 
>>> edge_index = torch.tensor([[0, 1, 1, 2],
...                            [1, 0, 2, 1]], dtype=torch.long)
>>> x = torch.tensor([[-1], [0], [1]], dtype=torch.float)
>>> 
>>> data = Data(x=x, edge_index=edge_index)
>>> data
Data(edge_index=[2, 4], x=[3, 1])
>>>
>>> data.graph_y = torch.tensor([42]) # adding a graph-level target
>>> data
Data(edge_index=[2, 4], graph_y=[1], x=[3, 1])
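
The 1-1 mapping pitched above could look roughly like the sketch below. It is only an illustration under stated assumptions: `Edge` is a hypothetical stand-in for a Program Graph edge (only `source`, `target` and `flow` are used), the flow constants are assumed to follow the CONTROL / DATA / CALL enum order, and `build_edge_lists` / `to_data` are made-up helper names, not part of ProGraML. The torch imports are lazy so the list-building part runs without PyTorch installed.

```python
from collections import namedtuple

# Hypothetical stand-in for a Program Graph edge (not the real protobuf class).
Edge = namedtuple("Edge", ["source", "target", "flow"])

# Assumed to mirror the CONTROL / DATA / CALL enum order.
CONTROL, DATA, CALL = 0, 1, 2


def build_edge_lists(edges):
    """Return (edge_index, edge_attr) as nested Python lists.

    edge_index has two rows (sources, targets); edge_attr holds one
    single-element row per edge containing that edge's flow value.
    """
    sources = [e.source for e in edges]
    targets = [e.target for e in edges]
    edge_attr = [[e.flow] for e in edges]
    return [sources, targets], edge_attr


def to_data(node_features, edges, graph_y=None):
    """Wrap the lists in a torch_geometric Data object (lazy imports)."""
    import torch
    from torch_geometric.data import Data

    edge_index, edge_attr = build_edge_lists(edges)
    data = Data(
        x=torch.tensor(node_features, dtype=torch.float),
        # reshape keeps the expected [2, E] and [E, 1] shapes even when E == 0
        edge_index=torch.tensor(edge_index, dtype=torch.long).reshape(2, -1),
        edge_attr=torch.tensor(edge_attr, dtype=torch.long).reshape(-1, 1),
    )
    if graph_y is not None:
        # Extra attribute for a graph-level value, as in the example above.
        data.graph_y = torch.tensor(graph_y)
    return data


edges = [Edge(0, 1, CONTROL), Edge(1, 2, DATA), Edge(2, 0, CALL)]
edge_index, edge_attr = build_edge_lists(edges)
```

With PyTorch Geometric installed, `to_data([[-1.0], [0.0], [1.0]], edges, graph_y=[42])` would build the corresponding `Data` object; what the node features should actually contain is left open here.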

I believe I am not forgetting anything (feel free to remind me if I am!).

If you don't have something like that in the works and you are interested, I would love to work on it and send a PR eventually. I intend to write such a tool anyway (i.e. Program Graph -> PyG Graph), so I would love to contribute it to the project as well.

@zehanort zehanort added the Enhancement New feature or request label Jul 23, 2021
@ChrisCummins
Owner

Hi @zehanort, a PyTorch Geometric converter would be great! I would very happily review a patch for that, thanks a lot :)

Just thinking ahead - I'm a little wary of adding large dependencies like pytorch-geometric. Perhaps stick the converter in its own module like programl.torch_geometric_converter.to_pytorch_geometric() to simplify things? I'm thinking of doing something like that for the to_dgl() converter as that pulls in a lot of extras.
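
One common way to keep a heavy dependency out of the core package, in the spirit of the separate-module idea above, is to import it only when the converter is actually called. The sketch below is an assumption about how that could be done, not how ProGraML does it; `require_optional` is a hypothetical helper name.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency on demand.

    Raises an ImportError that names the feature needing the module, so
    users who never call the converter never need the extra package.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"{feature} requires the optional dependency {module_name!r}"
        ) from error


# A standard-library module is used here so the example runs anywhere;
# a converter module would pass "torch_geometric" instead.
json_module = require_optional("json", "demo feature")
```

A converter module could call this helper at the top of its conversion function, so that importing the rest of the package stays cheap.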

CC'ing @Zacharias030 as I believe he has some experience working with ProGraML using pytorch geometric

Cheers,
Chris

@Zacharias030
Collaborator

Zacharias030 commented Jul 24, 2021

Hi @zehanort, I think this is a great pitch and I would welcome such an addition to the codebase very much!
Especially given that #107 is incomplete, it would be great to interface with PyTorch Geometric so that training a range of models becomes very easy!

Note that in #107 we were also willing to introduce a dependency on PyTorch Geometric's Data and Batch classes.

@igabirondo16

igabirondo16 commented May 16, 2024

Hi @ChrisCummins, @Zacharias030 and @zehanort ,

For my Master's thesis I have been working with ProGraML and Pytorch-Geometric, and I have an implementation of the to_pyg() method that you are mentioning.

My approach has been to use the torch_geometric.data.HeteroData data structure, which provides more flexibility for heterogeneous graphs than torch_geometric.data.Data.

The method I have been using takes as input a ProgramGraph and optionally the dictionary of the ProGraML vocabulary and makes the following conversion:

from typing import Dict, Optional

import torch
from torch_geometric.data import HeteroData

# Import path assumed; adjust it to your ProGraML version.
from programl.proto import ProgramGraph


def to_pyg(graph: ProgramGraph, vocabulary: Optional[Dict[str, int]] = None) -> HeteroData:
    # One list per edge flow, indexed by the edge.flow enum value
    # (control, data, call and type edges)
    flows = ['control', 'data', 'call', 'type']
    adjacencies = [[] for _ in flows]
    edge_positions = [[] for _ in flows]

    # Create the adjacency lists
    for edge in graph.edge:
        adjacencies[edge.flow].append([edge.source, edge.target])
        edge_positions[edge.flow].append(edge.position)

    # Create the graph structure
    hetero_graph = HeteroData()
    hetero_graph['nodes']['text'] = [node.text for node in graph.node]

    # Vocabulary index of each node; unknown text maps to len(vocabulary).
    # Without a vocabulary, x is left unset rather than stored as None.
    if vocabulary is not None:
        hetero_graph['nodes'].x = torch.tensor(
            [vocabulary.get(node.text, len(vocabulary)) for node in graph.node],
            dtype=torch.long,
        )

    # Add the adjacency lists and the edge positions, one edge type per flow
    for flow, adjacency, positions in zip(flows, adjacencies, edge_positions):
        edge_type = ('nodes', flow, 'nodes')
        # Shape [2, num_edges]; the reshape yields an integer [2, 0] tensor
        # when a flow has no edges (torch.tensor([[], []]) would be float)
        hetero_graph[edge_type].edge_index = (
            torch.tensor(adjacency, dtype=torch.long).reshape(-1, 2).t().contiguous()
        )
        hetero_graph[edge_type].edge_attr = torch.tensor(positions, dtype=torch.long)

    return hetero_graph

It first gathers the adjacency lists of the graph, the position attribute of the edges and the text of the nodes. If the vocabulary is given, it converts the text tokens to their respective vocabulary indices. After that, the lists are transformed into tensors and stored in their respective attributes. As you can see, the HeteroData class provides more flexibility than Data, as it allows adding as many different types of nodes and edges as required.
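
The vocabulary lookup described above sends every unknown node text to a single extra index, `len(vocabulary)`. The small sketch below illustrates that behaviour in isolation. It assumes the vocabulary indices are contiguous from 0 to `len(vocabulary) - 1`, so the extra index does not collide with a real token; `encode_node_texts` and the toy vocabulary are made up for the example and are not ProGraML's.

```python
def encode_node_texts(texts, vocabulary):
    """Map node texts to vocabulary indices, with one shared index for unknowns."""
    oov_index = len(vocabulary)  # assumes indices 0..len(vocabulary)-1 are taken
    return [vocabulary.get(text, oov_index) for text in texts]


# Toy vocabulary for illustration only.
vocabulary = {"add": 0, "br": 1, "ret": 2}
ids = encode_node_texts(["add", "ret", "mystery"], vocabulary)

# An embedding table fed with these ids needs one extra row for unknown text.
num_embeddings = len(vocabulary) + 1
```

Under that assumption, any embedding layer consuming the node indices would need `len(vocabulary) + 1` rows.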

I will create a pull request in the following days so that you can do further testing.

@ChrisCummins
Owner

That's great, thank you @igabirondo16! Looking forward to your PR.
