GSoC'23 Chronicles: Coding Week Eight (8)


Baseline Models with AutoML

Hi guys!!

Welcome to my latest blog post, where I'm thrilled to share the progress I made during the eighth week of the Google Summer of Code (GSoC) program. I apologise for not posting an update in week 7; I was under the weather. Don't worry, I'll cover the updates from both weeks in this post.

In my previous blog post for week 6, I mentioned my focus on creating baseline models for the two datasets and working on the InMemoryDataset class for the breast cancer dataset. While these tasks began in week 7, I'm happy to report that they were completed in week 8! Let's dive into the progress I made during these two weeks.

My Progress in Week 8

During this week, my primary focus was on two tasks: creating the InMemoryDataset class for the breast cancer dataset and building baseline models using AutoKeras and FLAML.

To start, I looked into FLAML for building the baseline model. My mentor suggested this lightweight Python library, which automates machine learning tasks such as model selection and hyperparameter tuning. I used flaml.AutoML, its task-oriented AutoML class, in my project and found that the resulting model outperformed both AutoKeras and the basic GNN model, with particularly strong performance on the test set.

[Image: FLAML parameters used]

[Image: FLAML performance on the test set]
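For readers new to FLAML, a fit call generally looks like the sketch below. This is not the project's actual configuration: the regression task (assumed only because this post reports an MSE), the time budget, the seed, and the helper names baseline_settings and fit_baseline are all placeholders chosen for illustration.

```python
def baseline_settings(time_budget_s=30):
    """Illustrative FLAML settings; NOT the values used in the project."""
    return {
        "task": "regression",          # assumption: the post reports an MSE
        "metric": "mse",               # metric FLAML minimises during the search
        "time_budget": time_budget_s,  # seconds FLAML may spend searching
        "seed": 0,                     # placeholder seed for repeatable searches
    }


def fit_baseline(X_train, y_train, settings):
    """Run FLAML's model and hyperparameter search on the training data."""
    # Imported here so baseline_settings() can be used without FLAML installed.
    from flaml import AutoML

    automl = AutoML()
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    return automl
```

With training data in hand, `automl = fit_baseline(X_train, y_train, baseline_settings())` runs the search, `automl.predict(X_test)` produces predictions, and `automl.best_estimator` and `automl.best_config` report which learner and hyperparameters were selected.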

Moving on, I tackled the creation of the InMemoryDataset class for the breast cancer dataset. Following the same procedure I used for the ACC dataset, I wrapped the breast cancer dataset in this class. By implementing the four essential functions as I explained in the week 6 report, I was able to create the class seamlessly. I leveraged the existing functions from the ACC dataset class, making only minor adjustments to ensure the correct dataset was downloaded.

Summary of Progress Made

During this week, I made significant strides in my project:

  1. Created a baseline model for the ACC dataset using FLAML. The model achieved a prediction MSE of 694, better than the other modelling techniques I tried.

  2. Developed the InMemoryDataset class for the breast cancer dataset. As a result, I obtained a list of data objects for both the train set and the test set of the breast cancer dataset.
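For context on the MSE figure in item 1: mean squared error averages the squared differences between predictions and targets, so its square root (RMSE) is in the same units as the target. An MSE of 694 corresponds to an RMSE of roughly 26.3. The sketch below uses made-up numbers, not project data.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_from_mse(mse):
    """Root mean squared error, expressed in the target's own units."""
    return math.sqrt(mse)


# Made-up example: errors of -1, 0 and 3 give squares 1, 0 and 9.
example_mse = mean_squared_error([3, 5, 7], [2, 5, 10])  # (1 + 0 + 9) / 3
```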

Challenges Encountered

Throughout the week, I encountered minimal challenges. The only minor obstacle was understanding how to use FLAML for model building at first. Fortunately, the well-documented tutorials on FLAML's GitHub page provided clear guidance, and I got past this quickly.

My Plans for Next Week

After a productive weekly call with my mentors, we have outlined the next steps to finalize the dataset submission to PyG:

  1. Refine the InMemoryDataset class for the breast cancer dataset: the class will expose only the essential features and labels as a single dataset, instead of shipping separate train and test sets pre-split at an 80:20 ratio. This gives users the flexibility to split the dataset however they prefer, and keeps it consistent with other datasets in PyTorch Geometric.

  2. Establish a validation set and retrain models for the breast cancer dataset, using the widely adopted 60:20:20 ratio for train, validation, and test splits. Having a validation set lets me try alternative methods to improve model performance without touching the test set. Because the dataset is large, each of the three splits still has enough samples for robust experimentation.
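The 60:20:20 split in item 2 can be sketched by shuffling sample indices and slicing them, as below. The function name split_indices, the fixed seed, and the sample count in the usage note are illustrative assumptions, not the project's actual code.

```python
import random


def split_indices(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/val/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable

    n_train = round(n * train_frac)
    n_val = round(n * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains, about 20%
    return train, val, test
```

For example, `train_idx, val_idx, test_idx = split_indices(100)` returns three disjoint index lists of sizes 60, 20, and 20, which can then be used to index a unified dataset.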

Overall, these past weeks have been incredibly productive, and the outcomes have been promising. I'm looking forward to sharing more updates and insights in the coming weeks. Thank you for reading to the end.

I'm thrilled to share my progress with you all, and I invite you to join me for my next blog post. I'll delve into the final stages of preparing the dataset for submission to PyTorch Geometric. Your support and engagement have been invaluable throughout this exciting adventure!

(P.S. If you are interested in knowing more about my project, feel free to check it out on GitHub. github.com/cannin/gsoc_2023_pytorch_pathway..)