This series of articles aims to describe the use of the GenX tool as introduced in “Evaluating and Explaining Natural Language Generation with GenX” by K. Duskin, S. Sharma, J.Y. Yun, E. Saldanha and D. Arendt.
“The GenX tool is designed to enable interactive exploration and explanation of natural language generation outputs with a focus on the detection of memorization.”
We begin this series of articles by explaining the process of extracting and processing the ACL Anthology dataset. We use this processed textual data to train a generative transformer model, which we then use to generate a new set of textual data. In the following parts we will process the training, validation and generated data for use by the GenX tool, and then explain the workings of the toolkit.
For a better understanding of the process, this article is split into the following sections:
Extracting and Processing Abstracts from ACL Anthology Data
“The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics.”
ACL provides a list of all the articles available in its database in the form of a BibTeX file, both with and without abstracts. We use these abstracts as the textual data for generation and, subsequently, evaluation with the GenX toolkit.
First we download the “Full Anthology as BibTeX with abstracts” file from the ACL Anthology website.
After downloading the dataset, we use the “bibtexparser” Python library to read the BibTeX file and extract its data.
The above code generates the “acl_entries” list, which contains all the data from the downloaded BibTeX file, “anthology+abstracts.bib”. Each element of this list is a dictionary containing all the data available for a single publication, such as URL, publisher, author, title, abstract, etc.
We extract the abstracts from these dictionaries into a list and add Beginning-of-Sentence (“[BOS]”) and End-of-Sentence (“[EOS]”) tags before and after each abstract, respectively. These tags are required by our generation model to learn where a document begins and ends. We will provide the “[BOS]” tag to the trained generation model as a prompt to generate text, and will stop the generation at the first “[EOS]” tag generated.
The following code extracts and processes the ACL Anthology abstracts from the entries obtained above, splits them into training and validation sets, and then stores them in their respective text files.
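A sketch of that step is below; the 90/10 split ratio, the file names “train.txt” and “valid.txt”, and the `prepare_abstracts` helper name are illustrative assumptions, not fixed by the article.

```python
import random

def prepare_abstracts(acl_entries, train_path="train.txt",
                      valid_path="valid.txt", train_fraction=0.9, seed=42):
    """Tag, split and save the abstracts found in acl_entries."""
    # Wrap each abstract in the [BOS]/[EOS] tags the model will learn from;
    # entries without an abstract are skipped.
    abstracts = ["[BOS] " + e["abstract"] + " [EOS]"
                 for e in acl_entries if "abstract" in e]

    # Shuffle reproducibly, then split into training and validation sets.
    random.Random(seed).shuffle(abstracts)
    split = int(len(abstracts) * train_fraction)
    train, valid = abstracts[:split], abstracts[split:]

    # One tagged abstract per line in each output file.
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(train))
    with open(valid_path, "w", encoding="utf-8") as f:
        f.write("\n".join(valid))
    return len(train), len(valid)
```

Keeping one abstract per line lets the training script treat each line as an independent document.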
Training and Generation
After extracting and processing the ACL abstracts, we now move on to training our generation model. We will use GPT2 as our generation model, training it with HuggingFace’s Transformers library and then using the fine-tuned model to generate new abstracts.
HuggingFace provides example scripts for training GPT2 and other transformer models, as well as a script for generating text with custom or pre-trained models. We will modify these scripts slightly to work with our data. First, clone their GitHub repository.
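Cloning and installing from source (the editable install is one reasonable option) can be done as follows:

```shell
# Clone the HuggingFace Transformers repository, which contains the
# example training (run_clm.py) and generation (run_generation.py) scripts.
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .
```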
This repository contains the custom scripts for training and generating abstracts. Replace “run_clm.py” in “transformers/examples/pytorch/language-modeling” and “run_generation.py” in “transformers/examples/pytorch/text-generation” with them.
To use these scripts, you can either create a bash script or enter the following commands directly in the terminal.
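A hedged sketch of the two invocations, based on the flags of the stock HuggingFace example scripts (the output directory, generation length and number of sequences are illustrative; the customized scripts may accept additional options):

```shell
# Fine-tune GPT2 on the ACL abstracts prepared earlier.
python transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt \
    --validation_file valid.txt \
    --do_train --do_eval \
    --output_dir ./gpt2-acl

# Generate new abstracts: prompt with [BOS] and stop at the first [EOS].
python transformers/examples/pytorch/text-generation/run_generation.py \
    --model_type gpt2 \
    --model_name_or_path ./gpt2-acl \
    --prompt "[BOS]" \
    --stop_token "[EOS]" \
    --length 200 \
    --num_return_sequences 5
```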
This concludes this part of the series. In the next article we will look at how to process the generated text so that it can be used by the GenX toolkit.