Generate Summaries using Google’s Pegasus library
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models
PEGASUS stands for Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. It uses a self-supervised objective, Gap Sentences Generation (GSG), to train a transformer encoder-decoder model. The paper can be found on arXiv. In this article, we will focus only on generating state-of-the-art abstractive summaries using Google’s Pegasus library.
As of now, there is no easy way to generate summaries using the Pegasus library. However, Hugging Face is already working on implementing this, and they expect to release it around September 2020. In the meantime, we can follow the steps mentioned in the Pegasus GitHub repository and explore Pegasus. So let’s get started.
Based on the request from many readers, I have added the full code at the end of the article. Hope this helps you!!
This step will clone the library on GitHub, create /content/pegasus folder, and install requirements.
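The clone-and-setup step can be sketched as below. The commands follow the Pegasus README; treat the exact sequence as an assumption and run it from /content in Colab:

```shell
# clone the repo into /content/pegasus and install its dependencies
cd /content
git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt
```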
Next, follow the instructions to install gsutil. The below steps worked well for me in Colab.
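Recent Colab images usually ship with gsutil already available, so checking first avoids an unnecessary install. A minimal sketch (the pip fallback is an assumption; the official Cloud SDK installer also works):

```shell
# use the preinstalled gsutil if present; otherwise fall back to pip
gsutil version || pip install gsutil
```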
This will create a folder named ckpt under /content/pegasus/ and then download all the necessary files (fine-tuned models, vocab etc.) from Google Cloud to /content/pegasus/ckpt.
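The download step can be sketched as below; the gs://pegasus_ckpt bucket path follows the Pegasus README, but verify it against the repo before running, as this is a large download:

```shell
# create the checkpoint folder and copy the fine-tuned models + vocab from GCS
cd /content/pegasus
mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/
```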
If all the above steps complete successfully, we will see the below folder structure in Google Colab. Under each downstream dataset, we can see fine-tuned models that we can use for generating extractive/abstractive summaries.
Though it’s not mentioned in the Pegasus GitHub repository’s README instructions, the below Pegasus installation step is necessary; otherwise you will run into errors. Also, make sure you are in the root folder /content before executing this step.
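One plausible form of this step is an editable install of the cloned repo (this assumes the repo ships a setup.py; the exact command is an assumption, not from the README):

```shell
# run from /content; installs the cloned pegasus package so its modules resolve
cd /content
pip3 install -e pegasus/
```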
Now, let us try to understand the pre-training corpus and downstream datasets of Pegasus. Pegasus is pre-trained on the C4 and HugeNews corpora, and it is then fine-tuned on 12 downstream datasets. The evaluation results on the downstream datasets are mentioned on GitHub and also in the paper. Some of these datasets are extractive and some are abstractive, so the choice of dataset depends on whether we are looking for extractive or abstractive summaries.
Once all the above steps are taken care of, we could jump straight to the evaluate.py step mentioned below, but it would take a long time to complete, as it would try to make predictions on all the data in the evaluation set of the respective fine-tuned dataset being used. Since we are interested in summaries of custom or sample text, we need to make minor changes to the public_params.py file found under /content/pegasus/pegasus/params/public_params.py, as shown below.
Here I am making changes to reddit_tifu, as I am trying to use the reddit_tifu dataset for generating an abstractive summary. If you are experimenting with aeslc or other downstream datasets, you should make similar changes there.
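The edit itself is small: inside the reddit_tifu entry of public_params.py, point the evaluation patterns at our own TFRecord instead of the TFDS dataset. A hedged sketch of the changed lines only; the "tfrecord:" prefix and the file path are assumptions for illustration, so adapt them to the actual file:

```python
# sketch of the edit inside the reddit_tifu entry of public_params.py;
# the "tfrecord:" prefix and the path are illustrative assumptions
"dev_pattern": "tfrecord:/content/pegasus/test.tfrecord",
"test_pattern": "tfrecord:/content/pegasus/test.tfrecord",
```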
Here we are passing the text from this news article as inp, which is then copied to inputs. Note that an empty string is passed to targets, as this is what we are going to predict. Then both inputs and targets are used to create a tfrecord, which Pegasus expects.

inp = "replace this with text from the above news article"
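Putting the pieces together, the TFRecord can be written with TensorFlow's tf.train.Example. This is a sketch under the assumption that Pegasus reads string features named inputs and targets; the output path is illustrative:

```python
import tensorflow as tf

# custom text to summarize; the target is empty because it is what we predict
inp = "replace this with text from the above news article"
input_dict = {"inputs": [inp], "targets": [""]}

save_path = "/tmp/test.tfrecord"  # illustrative path; match it to public_params.py
with tf.io.TFRecordWriter(save_path) as writer:
    for inputs, targets in zip(input_dict["inputs"], input_dict["targets"]):
        features = tf.train.Features(feature={
            "inputs": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[inputs.encode("utf-8")])),
            "targets": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[targets.encode("utf-8")])),
        })
        writer.write(tf.train.Example(features=features).SerializeToString())
```

Because the dict values are lists, the same loop also works if you want to batch several custom texts into one file.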
As the final step, when evaluate.py is run, the model makes a prediction, i.e. generates a summary of the above news article’s text. This will generate 4 output files in the respective downstream dataset’s folder. In this case, the text_metric text files will be created under the reddit_tifu checkpoint folder.
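The invocation can be sketched as below. The flag names and vocab filename follow the pattern in the Pegasus README, but treat the exact params name and overrides as assumptions and adjust them to your checkout:

```shell
# run from /content/pegasus; params name and overrides are assumptions
python3 pegasus/bin/evaluate.py \
  --params=reddit_tifu_long_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
  --model_dir=ckpt/pegasus_ckpt/reddit_tifu
```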
Abstractive summary (prediction):
“India and Afghanistan on Monday discussed the evolving security situation in the region against the backdrop of a spike in terrorist violence in the country.”
This looks like a very well generated abstractive summary when we compare it with the news article we passed as input. By using different downstream datasets, we can generate extractive or abstractive summaries. We can also play around with different parameter values and see how they change the summaries.
In order to save some time and also some space on Google Colab, the below code downloads only the reddit_tifu fine-tuned model instead of downloading all the fine-tuned models as mentioned in the article.
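A sketch of the selective download; the bucket layout (per-dataset subfolder plus shared vocab files at the top level) is an assumption based on the full-download step, so verify the paths with gsutil ls first:

```shell
# fetch only the reddit_tifu checkpoint plus the shared vocab/model files
cd /content/pegasus
mkdir -p ckpt/pegasus_ckpt
gsutil cp -r gs://pegasus_ckpt/reddit_tifu ckpt/pegasus_ckpt/
gsutil cp gs://pegasus_ckpt/c4.unigram.newline.10pct.96000000.model ckpt/pegasus_ckpt/
gsutil cp gs://pegasus_ckpt/c4.unigram.newline.10pct.96000000.vocab ckpt/pegasus_ckpt/
```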
You can verify the predictions in the output of the last cell of the above code, as shown below:
PREDICTIONS: : India and Afghanistan on Monday discussed the evolving security situation in the region against the backdrop of a spike in terrorist violence in the country.