GSoC'23 Chronicles: Coding Week Two(2)

Building GNNs to perform Graph Regression

Hey guys!!

Welcome to the blog post where I share my exciting journey in the Google Summer of Code (GSoC) program, this time focusing on my second week of progress. I hope you all had a fantastic week!

In my previous update, I mentioned that my primary focus would be on building the Graph Neural Network (GNN) model. In line with this objective, my efforts this week were directed towards structuring the merged dataset obtained from cBioPortal and Pathway Commons to align with the PyTorch Geometric (PyG) standard. This crucial step allows us to use the dataset with PyG to build a basic GNN that predicts the Overall Survival (in months) of patients using graph regression.

Graph regression involves predicting a continuous numerical value for an entire graph, rather than for individual nodes or edges. Unlike classification tasks that focus on discrete labels, graph regression uses the structural and attribute information of the graph to estimate a quantitative value. In this project, each patient is one graph and the target is that patient's overall survival in months. Now, let's delve into the progress I made during this week.

My Progress this Week

My main focus was on developing a basic GNN model to work with the dataset I had preprocessed in the previous week.

After a productive call with my mentor, we decided to adopt graph regression instead of graph classification, which was the approach I had in mind while processing my dataset earlier. This required me to preprocess my dataset again, creating an individual graph for each patient. Since the dataset consisted of 78 patients, with 62 in the training set and 16 in the test set, I needed to generate a list of 62 graphs for the training set and a list of 16 graphs for the test set.
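
To make the idea of one graph per patient concrete, here is a minimal sketch. It assumes that every patient shares the same pathway topology and differs only in the value attached to each node; the post does not state this, so treat it as an illustration. The gene names, interactions, node values and survival numbers are all made-up placeholders, and the tuple layout (node values, edge_index, target) simply mirrors how the graphs are indexed in the conversion code later in this post.

```python
import numpy as np
import torch

# Hypothetical toy inputs (placeholders, not the real dataset)
genes = ["TP53", "MDM2", "CDKN1A"]
gene_to_idx = {gene: i for i, gene in enumerate(genes)}
interactions = [("TP53", "MDM2"), ("TP53", "CDKN1A")]

# PyG expects edges as a [2, num_edges] tensor of node indices.
# Each undirected interaction is stored in both directions.
src = [gene_to_idx[a] for a, b in interactions] + [gene_to_idx[b] for a, b in interactions]
dst = [gene_to_idx[b] for a, b in interactions] + [gene_to_idx[a] for a, b in interactions]
edge_index = torch.tensor([src, dst], dtype=torch.long)

# One row per patient, one column per gene (assumed: a single numeric value per node)
node_values = np.array([[0.1, 2.3, 1.1],
                        [1.5, 0.2, 0.7]])
survival_months = np.array([34.0, 12.5])

# One (node values, edge_index, target) tuple per patient
graphs = [
    (node_values[i], edge_index, survival_months[i:i + 1])
    for i in range(len(node_values))
]
```

Keeping the target as a length-1 array means each graph carries exactly one regression label, which is what graph-level regression needs.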

These lists were then converted into lists of Data objects using the PyG class torch_geometric.data.Data. During another insightful call, my mentors helped me fix an error I had encountered while creating the patient-specific graphs. With that resolved, the next stage involved creating batches from the data and building the model.

Summary of progress made

  • Dataset Preparation: I successfully created two distinct lists of patient-specific graphs, one for the training set and the other for the test set. These lists were then converted into a format compatible with PyTorch Geometric (PyG) by transforming them into a list of Data objects.
import torch
from torch_geometric.data import Data

# Convert graphs_train to a list of Data objects.
# Each graph is indexed as (node values, edge_index, target); reshape(-1, 1) gives one feature per node.
data_train = [Data(x=torch.tensor(graph[0].reshape(-1, 1), dtype=torch.float), edge_index=graph[1], y=torch.tensor(graph[2], dtype=torch.float)) for graph in graphs_train]

# Convert graphs_test to a list of Data objects
data_test = [Data(x=torch.tensor(graph[0].reshape(-1, 1), dtype=torch.float), edge_index=graph[1], y=torch.tensor(graph[2], dtype=torch.float)) for graph in graphs_test]
  • Batch Creation: I generated batches from the list of data objects for both the training and test sets. This step was crucial for efficient model training and evaluation.

  • GNN Model Building: Using the PyG module GCNConv, I constructed a basic Graph Neural Network (GNN) model. The model consisted of 1 input layer, 2 hidden layers, and 1 output layer, for a total of 4 layers. However, the initial evaluation showed room for improvement: the mean-squared error (MSE) was approximately 850, which indicates suboptimal performance.

Challenges encountered

The past week presented me with several challenges, each of which I approached with determination and perseverance. Firstly, I encountered difficulties while attempting to split my data into batches using the Batch class from PyG. Additionally, I faced issues with the patient-specific graphs I had created, as they were yielding unexpected results. Furthermore, during the model training phase, I encountered errors related to data type compatibility. However, with the guidance and support of my mentors, I was able to troubleshoot and resolve these issues, ultimately getting my model up and running successfully.
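
For readers who hit similar data type errors, one common cause (an assumption here, since I have not listed the exact error messages) is that NumPy arrays default to float64 while PyTorch layers hold float32 parameters. The sketch below uses a plain Linear layer and made-up values to show the mismatch and the cast that avoids it.

```python
import numpy as np
import torch

values = np.array([0.5, 1.2, 3.3])  # NumPy defaults to float64

x_double = torch.tensor(values.reshape(-1, 1))                    # keeps float64
x_float = torch.tensor(values.reshape(-1, 1), dtype=torch.float)  # cast to float32

layer = torch.nn.Linear(1, 4)  # parameters are float32 by default

try:
    layer(x_double)            # float64 input against float32 weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True     # PyTorch typically rejects the mixed dtypes

out = layer(x_float)           # works: input and weights are both float32

# Edge indices need an integer type; Python ints become int64 (torch.long) by default
edge_index = torch.tensor([[0, 1], [1, 0]])
```

Casting node features and targets to torch.float when the Data objects are created keeps the check in one place instead of scattering casts through the training code.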

Overcoming these challenges has provided me with valuable learning experiences and reinforced the importance of seeking guidance and collaboration within the GSoC community. I am excited to tackle the upcoming tasks with renewed confidence and continue making progress in my project.

My Plans for next week

Next week, my focus will be on two major things:

  1. Improving Model Performance: To enhance the model's efficacy, I will implement the following strategies: increasing the number of training epochs to 100 or more, exploring the addition of more layers to the model, and experimenting with training the model without using the pre-created batches. I will also consult with my mentors to explore alternative methods for enhancing GNN models.

  2. Dataset Contribution to PyTorch Geometric: I will actively engage with the PyG community to understand the necessary steps for contributing the preprocessed basic dataset I have generated during these initial project stages. This process will familiarize me with the requirements and guidelines for preparing datasets for contribution to PyG.

    Additionally, I will dedicate time to refining my notebooks by adding informative sections and comments. These edits will enhance clarity and enable readers to follow the step-by-step process involved in creating the merged dataset, constructing the graph structure, and building the basic GNN model.

In conclusion, I'm excited to share the remarkable progress I have made this week and the developments in my project.

Looking ahead, this new week holds great promise as I dive deeper into refining the model's performance and embarking on the initial steps to contribute the dataset to PyTorch Geometric. These endeavours will bring me closer to achieving the goals of my GSoC journey.

I warmly invite you to join me next week as I continue to share the latest updates and insights. Until then, thank you for being a part of this journey!

(P.S. If you are interested in knowing more about my project, feel free to check it out on GitHub. https://github.com/cannin/gsoc_2023_pytorch_pathway_commons)