GSOC_Report

GSoC with TensorFlow (Keras Team)

Hey!

Thanks for checking out my work

First things first I’m really thankful to Google for organizing this wonderful event every year and also huge thanks to the TensorFlow-Keras Team for having me on the team and to my mentors Matthew Watson and Chen Qian who helped me throughout the journey.

My work mainly focused on contributing to KerasNLP a new library which is currently pre-release and aims to build “Industry-strength Natural Language Processing workflows with Keras”

I started contributing to the library in March 2022 and really liked the codebase as it was pretty easy to navigate through and the maintainers were really helpful in guiding beginners!

I mainly worked towards adding Data Augmentation Techniques, tokenizers and tokenizer training utilities, fixing bugs, adding new options to pre-existing utilities and writing tutorials for keras.io

My PRs:

Data Augmentation Techniques

PR Link	Status	Description
Random Deletion Layer	Merged	Adds the Random Deletion operation as a Keras Layer described in the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Random Swap Layer	Merged	Adds the Random Swap operation as a Keras Layer described in the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Random Replacement Layer	In Review	Adds the Random Replacement operation as a Keras Layer described in the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Random Insertion Layer	In Review	Adds the Random Insertion operation as a Keras Layer described in the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Minor fixes to the Random Deletion Layer	Merged	Fixed some minor bugs in the Deletion Layer
Docstring and Test Fixes for Random Deletion Layer	Merged	Made fixes in the Random Deletion Layer to improve docstring and remove redundancy in tests

The work was majorly aimed towards adding support for Data Augmentation Techniques in the form of Keras Pre-Processing Layers. The Layers are graph mode compatible as well and hence work with tf datasets which are more efficient. The layers also provide granular control such as deciding which tokens to skip using a list, a tf function or even a native python function. Insertion and Replacement layers also have fine grained control over how to choose the new token using either a list, a tf function or a native python function.

Major portion of my GSoC timeline went towards this work as this included several API redesigns to make it usable for the end users and also needed to be graph mode compatible to be usable with tf datasets

Tokenizers

PR Link	Status	Description
Fixes for the WordPieceTrainer	Merged	These tests removed dependency between tests in the docstring and those in the test files for file handling
Created a trainer for SentencePiece Tokenizer	Merged	Added a trainer which Trains a SentencePiece vocabulary from an input dataset or a list of filenames.
Fixed Bug in Unicode Tokenizer Vocab Size	Merged	Fixed bug caused by mistake in argument name
Added a vocabulary_size argument to UnicodeCharacterTokenizer	Merged	Incorporated capping OOV tokens in the UnicodeCharacterTokenizer by setting vocabulary_size
Adding Utility to Detokenize as list of Strings to Tokenizer Base Class	Merged	Added a utility which decodes tensors into list of strings over bytestring recursively
UnicodeCharacterTokenizer	Merged	Added a new tokenizer for tokenization into Unicode Characters
Fixing rank 1 outputs for WordPieceTokenizer	Merged	Fixed issue with Rank 1 outputs in WordPieceTokenizer

Tokenizers are an essential part of any NLP Library. KerasNLP also has its share of tokenizers. I majorly contributed to the building of the UnicodeCharacterTokenizer, some fixes for WordPieceTokenizer and Trainer and creating a utility to train create proto files for SentencePiece Tokenizer.

PR Link	Status	Description
Adding Eval Script for BERT on SQUAD Dataset	In Works	Aims to add an Eval Script for BERT on SQUAD Dataset
Migrating from Datasets to TFDS for GLUE Example	Merged	Removed dependency on Datasets for GLUE and instead migrated to TFDS

This work is mainly towards adding more features to the BERT model present in the repository which makes it easy for the end user to rebuild models by modifying the code. I worked towards changing dataset dependency and also I’m currently working on adding an Eval Script for SQUAD Dataset.

Guides

PR Link	Status	Description
Guide on Open Ended Text Generation Guide KerasNLP	In Review	Added a guide for `keras.io` which showcases tradeoff between Byte and Unicode Tokenizer

This guide aims to showcase our tokenizers to the end-users and also attract attention towards the library

General Work

PR Link	Status	Description
Added Debug Info for Line Ending Issues	Merged	Added some documentation to address issues caused while running linters in wrong file ending mode
Fixed Import Error	Merged	Fixed error caused by missing init file
Fixed Import for top_p_search util	Merged	Fixed error caused by missing import in init file for top_p_search
Added Kernel and Bias Initializers	Merged	Added Kernel and Bias Initializers to Encoder and Decoder classes

These are minor bugs and fixes along with some basic features which I worked towards fixing/adding.