GSoC'23 Chronicles: Unveiling the First Steps of My Project

My Path from Acceptance to the First Week of Coding

The acceptance email brought me a mix of happiness, nervousness, fear, and shock. I looked at it countless times, lol. With an overwhelming surge of excitement, I embarked on this incredible journey as a Google Summer of Code 2023 participant. As I write this blog post, I am thrilled to share the first weeks of my GSoC experience.

The organization I will be working with is the National Resource for Network Biology (NRNB), a renowned institution at the forefront of network biology research. Guiding me through this exciting endeavour are my mentors, Augustin Luna and Yoshitaka Inoue, esteemed experts in the field. Together, we are embarking on the "Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype" project.

This project aims to integrate the Pathway Commons and cBioPortal datasets to build and analyze graph-based models using PyTorch Geometric (PyG). The goal is to create an example dataset for PyG based on the integrated dataset and use it to train and evaluate Graph Neural Networks (GNNs) for various downstream tasks.

Pathway Commons is a resource for collecting and sharing biological pathway and interaction data using the BioPAX standard. cBioPortal for Cancer Genomics (cBioPortal) is an open-access resource that facilitates the exploration of complex cancer genomics data by providing access to molecular profiles and clinical attributes from large-scale projects.

Community Bonding Period

During the community bonding period, which spanned from May 4 to May 28, GSoC participants like myself were tasked with getting to know their mentors, studying documentation, and gaining the knowledge needed to kickstart their projects. During this time, I engaged in extensive discussions with my mentors to gain deeper insights into the datasets I would be working with.

Additionally, I used the community bonding period to explore various techniques, such as parsing the datasets using pandas and creating visualizations using NetworkX. To enhance my understanding of Graph Neural Networks, a fundamental aspect of my project, I delved into the PyG documentation and supplemented my learning by watching instructional videos. Furthermore, I installed the essential libraries required for development, ensuring that I was well-prepared for the coding phase.

Lastly, I reached out to a helpful member of the PyG community on Slack to discuss my project and ask about the requirements for contributing a dataset to PyG. He kindly informed me that there were no specific guidelines in place, but emphasized the importance of ensuring the dataset is implemented correctly. To assist me in this process, he directed me to a useful GitHub repository that offers code guidance. He also assured me that once I have completed the implementation, he would be available to provide further assistance and handle the next steps.

Coding Period Commences

Following the community bonding phase, the coding period officially commenced on May 29th. The first week of coding was an exciting time filled with valuable learning experiences and a fair share of errors along the way. During this week, my main focus revolved around two primary tasks.

  1. The first task involved matching selected datasets, namely PathwayCommons12.reactome.hgnc.sif.gz and acc_tcga_pan_can_atlas_2018, based on their shared features. This matching process aimed to integrate relevant data and ensure compatibility for subsequent analysis.

  2. The second task was to build a basic GNN model using the integrated dataset.
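
The Pathway Commons SIF file named in the first task is a tab-separated list of source, interaction type, and target rows. As a rough sketch, and not the project's actual code, the rows can be turned into an integer index per gene and a two-row edge list in the layout PyG expects for edge_index. The function name sif_to_edge_index and the toy gene rows are illustrative assumptions; only the standard library is used, so the conversion to a tensor is left as a comment.

```python
import csv
import io


def sif_to_edge_index(handle):
    """Read SIF rows (source, interaction, target) from a text handle.

    Returns (node_to_idx, edge_index), where edge_index is [sources, targets].
    """
    node_to_idx = {}
    sources, targets = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        src, _interaction, dst = row[0], row[1], row[2]
        for gene in (src, dst):
            # setdefault assigns the next free index the first time a gene is seen
            node_to_idx.setdefault(gene, len(node_to_idx))
        sources.append(node_to_idx[src])
        targets.append(node_to_idx[dst])
    return node_to_idx, [sources, targets]


# Toy rows standing in for PathwayCommons12.reactome.hgnc.sif.gz (hypothetical sample).
toy_sif = io.StringIO(
    "TP53\tcontrols-state-change-of\tMDM2\n"
    "MDM2\tin-complex-with\tTP53\n"
    "TP53\tcontrols-expression-of\tCDKN1A\n"
)
node_to_idx, edge_index = sif_to_edge_index(toy_sif)
print(node_to_idx)
print(edge_index)

# For the real file one would open it with gzip.open(path, "rt") and, with
# PyTorch installed, wrap the result: torch.tensor(edge_index, dtype=torch.long)
```

Keeping the gene-to-index mapping around matters later, because any per-gene features have to be placed in the same row order as these indices.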

My Progress This Week

  • I used PyG to create a graph structure from the Pathway Commons data. This step was necessary to ensure compatibility with PyG and lay the foundation for building the GNN model.

  • I focused on matching the files data_clinical_patient.txt and data_clinical_sample.txt from acc_tcga_pan_can_atlas_2018. I preprocessed the data and then merged the two tables on the patient identifier column. This integration brought the patient-level and sample-level information together in one table.

  • To align the graph structure with the merged dataset, I created a Data object using PyG. This step aimed to establish a coherent connection between the graph structure derived from Pathway Commons and the merged dataset.

  • Lastly, I generated training and testing sets from the integrated data.
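
The merge and split steps above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the code from my repository: the toy tables stand in for data_clinical_patient.txt and data_clinical_sample.txt, the column names PATIENT_ID, SAMPLE_ID, and OS_MONTHS are assumed to match the cBioPortal files, and the 75/25 split ratio is arbitrary.

```python
import io

import pandas as pd

# Toy stand-ins for the two clinical files (hypothetical values).
# cBioPortal clinical files begin with metadata lines prefixed by '#'.
patient_txt = (
    "#Patient Identifier\tOverall Survival (Months)\n"
    "PATIENT_ID\tOS_MONTHS\n"
    "P1\t12.5\n"
    "P2\t40.0\n"
    "P3\t7.1\n"
    "P4\t63.2\n"
)
sample_txt = (
    "#Patient Identifier\tSample Identifier\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "P1\tP1-01\n"
    "P2\tP2-01\n"
    "P3\tP3-01\n"
    "P4\tP4-01\n"
)

# comment="#" makes pandas skip the metadata lines at the top of each file.
patients = pd.read_csv(io.StringIO(patient_txt), sep="\t", comment="#")
samples = pd.read_csv(io.StringIO(sample_txt), sep="\t", comment="#")

# Inner join keeps only patients that appear in both tables.
merged = samples.merge(patients, on="PATIENT_ID", how="inner")

# Reproducible split: sample the test rows, keep the remainder for training.
test_df = merged.sample(frac=0.25, random_state=0)
train_df = merged.drop(test_df.index)
print(len(train_df), len(test_df))
```

One caveat with comment="#": pandas also truncates any data line at a '#' character, so a file whose values contain '#' would need the header rows skipped another way.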

My Plans for Next Week

In the upcoming week, my primary focus will be on the development of the GNN model. I intend to dedicate my time and efforts to building the model, training it using the designated training set, and subsequently using the trained model to predict outcomes on the test set. Specifically, I will be utilizing the model to predict the overall survival in months for the patients.

Challenges Encountered

  1. While attempting to split the data, I had difficulty using PyG's DataLoader to create batches. As a result, I have temporarily set this task aside and will revisit it at a later stage.

  2. I encountered errors due to an incorrect data structure while building the model. This issue arose during the process of creating the Data object with PyG. Moving forward, I am determined to troubleshoot and rectify this problem to ensure successful training of the GNN model.
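
Structural errors of the kind described in the second challenge often come from a mismatch between the node feature rows and the edge indices. The helper below is a hypothetical sketch of the checks that can be run on plain Python lists before anything is handed to PyG; check_graph_inputs is a name made up for illustration and is not part of PyG, and these checks are only a guess at what might be going wrong in my case.

```python
def check_graph_inputs(x, edge_index):
    """Return a list of structural problems; an empty list means none found.

    x: list of per-node feature lists; edge_index: [sources, targets].
    """
    problems = []
    if len(edge_index) != 2:
        problems.append("edge_index must have exactly two rows")
        return problems
    sources, targets = edge_index
    if len(sources) != len(targets):
        problems.append("source and target rows differ in length")
    num_nodes = len(x)
    # Every index must point at an existing row of x.
    bad = [i for i in list(sources) + list(targets) if not 0 <= i < num_nodes]
    if bad:
        problems.append(f"node indices out of range: {sorted(set(bad))}")
    if len({len(row) for row in x}) > 1:
        problems.append("node feature rows have inconsistent lengths")
    return problems


x = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]]  # 3 nodes, 2 features each
print(check_graph_inputs(x, [[0, 1, 0], [1, 0, 2]]))
print(check_graph_inputs(x, [[0, 1], [1, 3]]))
```

The second call reports a problem because index 3 refers to a fourth node while x only has three rows.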

As we conclude this article, I would like to express my gratitude for your readership and support. I am thrilled to share my progress this week and the remarkable developments in my project.

The upcoming week holds great promise as I delve deeper into building the Graph Neural Network model, training it, and evaluating its performance. I am eager to witness the insights and predictions this model can provide, particularly in predicting the overall survival of patients.

I invite you to join me next week as I continue to share the latest updates and insights from my GSoC journey. Thank you for your continued interest and encouragement. Until next time!

(P.S. If you are interested in knowing more about my project, feel free to check it out on GitHub and read about it on the GSoC projects list. github.com/cannin/gsoc_2023_pytorch_pathway.., summerofcode.withgoogle.com/programs/2023/p..)