GSoC'23 Chronicles: Coding Weeks Three (3) and Four (4)

Hey guys!!

Welcome to my latest blog post, where I'll be sharing the highlights of my journey in the Google Summer of Code (GSoC) program. Today, I'll be covering the progress I made during the third and fourth weeks.

In my previous report for coding week 2, I mentioned my focus on enhancing the performance of the GNN model and exploring the steps required to contribute the dataset to PyTorch Geometric (PyG). Continuing from there, during coding week 3, I had an insightful discussion with Matthias Fey, the founding engineer and creator of PyTorch Geometric, about the essential steps for contributing a dataset to PyG. According to Matthias, there are two key stages: first, make sure the dataset has the attributes of a typical PyG dataset; then, submit a pull request on the PyG GitHub page, after which the community takes it from there.

My Progress in Week 4

My primary focus last week was making the dataset conform to the PyG dataset interface. To do this, I used the InMemoryDataset class from the torch_geometric.data package, which is designed for graph datasets that fit entirely in CPU memory. A subclass of it needs to define four members: raw_file_names, processed_file_names, download, and process.

However, after a productive meeting with my mentors, we decided to pause the contribution of this particular dataset due to its small size. Instead, we discussed new avenues and established fresh objectives for me to tackle in the upcoming week.

Challenges Encountered

This week I ran into a problem with the download function of the InMemoryDataset class. This function is responsible for downloading the raw dataset from a specified link to the user's system. However, the file it fetched was not in the expected format, even though the same file came out correct when I downloaded it manually. I aim to investigate and resolve this error before contributing a larger dataset to the PyG community.
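The cause of the mismatch is not known yet, so the following is only a possible first diagnostic step, not a fix. It looks at the first bytes of a file to guess whether it is an HTML page, a gzip or zip archive, plain text, or some other binary. The helper name sniff_format is hypothetical, and the idea that the download might be returning a web page or a compressed file is an assumption.

```python
import codecs


def sniff_format(path, n_bytes=512):
    """Guess a coarse file type from the first bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(n_bytes)

    # Well-known magic numbers for compressed archives.
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"PK\x03\x04"):
        return "zip"

    # An HTML page often means the link points at a web view of the file
    # rather than at the raw file itself.
    start = head.lstrip().lower()
    if start.startswith(b"<!doctype html") or start.startswith(b"<html"):
        return "html"

    # Otherwise, check whether the bytes decode as UTF-8 text. The
    # incremental decoder tolerates a multi-byte character cut off at
    # the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return "binary"
    return "text"
```

Running it on both the manually downloaded file and the one fetched by the download function, and comparing the two results, would show whether the files differ in kind or only in content.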

My Plans for Next Week

Here are my plans for next week, which I discussed with my mentors:

  1. Create a baseline model for comparison: I will explore other machine learning algorithms and experiment with AutoML models to establish a baseline, using the previously formatted dataset.

  2. Work with a larger dataset: I will focus on processing the breast cancer dataset obtained from the cBio datahub to make it compatible with PyG. This dataset will be the target of my future contribution to PyG, once the necessary steps have been followed.

As we wrap up this article, I'm excited to share the progress I've made over the past weeks. The upcoming week holds tremendous potential as I delve into a larger dataset and evaluate the performance of the Graph Neural Network (GNN) by comparing it with other models.

I cordially invite you to join me next week for more updates and insights. Feel free to give it a read, provide your valuable feedback, and don't forget to clap if you enjoyed the post! Your support means a lot to me. Thank you for being a part of this journey!

(P.S. If you're interested in learning more about my project, feel free to check it out on GitHub. https://github.com/cannin/gsoc_2023_pytorch_pathway_commons)