NLP Star Sky Intelligent Dialogue Robot series: an in-depth understanding of Transformers for natural language processing: summarizing documents with T5-large
Summarizing documents with T5-large
We will create a summarization function that can be called with any text. We will then summarize general, legal, and corporate law samples, which will test the limits of this approach.
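The summarization code in this section uses three objects, model, tokenizer and device, whose creation is not shown here. The following is a minimal sketch of one way to obtain them, assuming PyTorch and the Hugging Face transformers library (with sentencepiece) are installed. The helper name load_t5 and its defaults are hypothetical and are not part of the original code.

```python
def load_t5(model_name="t5-large", device_name="cpu"):
    """Load a T5 checkpoint and its tokenizer (hypothetical helper).

    The imports sit inside the function so that merely defining it does not
    require the heavy dependencies; the first call downloads the checkpoint.
    """
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    device = torch.device(device_name)
    # Assumption: the "t5-large" checkpoint from the Hugging Face hub is used.
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    return model, tokenizer, device
```

Calling model, tokenizer, device = load_t5() once, before the first summary is requested, would make the three names available to the rest of the code.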
Creating a summarization function
First, we create a summarization function called summarize, which takes two parameters. The first parameter, text, is the text to be summarized; the function stores its cleaned-up form in preprocess_text. The second parameter, ml, is the maximum length of the summary, counted in tokens.
We apply the T5 task prefix "summarize: " to the input text. The T5 model has a unified structure whatever the task: a prefix followed by the input sequence. This may seem simple, but it brings NLP transformer models closer to universal training and zero-shot downstream tasks.
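To make the prefix + input sequence idea concrete, here is a small sketch that builds the string T5 receives for a few tasks. The prefixes are the ones published with the original T5 checkpoints; the dictionary and the helper name build_t5_input are hypothetical and serve only as an illustration.

```python
# Task prefixes used by the original T5 checkpoints.
# The dictionary keys and the helper name are illustrative assumptions.
T5_PREFIXES = {
    "summarization": "summarize: ",
    "english_to_german": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def build_t5_input(task, text):
    """Return the single string T5 receives: task prefix + input sequence."""
    return T5_PREFIXES[task] + text.strip()


print(build_t5_input("summarization", "The corporation has perpetual succession."))
print(build_t5_input("english_to_german", "The house is wonderful."))
```

Only the prefix changes from one task to the next; the model and the way the input is tokenized stay the same.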
def summarize(text, ml):
    # model, tokenizer and device are assumed to be defined before the call
    # Strip the text and delete newlines (words split across lines are joined)
    preprocess_text = text.strip().replace("\n", "")
    # Prepend the T5 task prefix
    t5_prepared_Text = "summarize: " + preprocess_text
    print("Preprocessed and prepared text: \n", t5_prepared_Text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # Summarize with beam search; ml is the maximum summary length in tokens
    summary_ids = model.generate(tokenized_text,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output
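One detail of the function is worth a closer look: replace("\n","") deletes newlines without inserting a space, so when the input text is wrapped over several lines, the last word of one line is glued to the first word of the next (the "Etextreleased" in the first sample's output further down). The sketch below contrasts that behavior with an alternative that collapses whitespace instead. The helper normalize_whitespace is hypothetical and is not part of the original code.

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    return " ".join(text.split())


sample = "was the first Etext\nreleased by Project Gutenberg"

# Behavior of the function above: the newline is deleted outright.
print(sample.strip().replace("\n", ""))

# Alternative: the newline becomes a single space, so the words stay apart.
print(normalize_whitespace(sample))
```

Swapping this in would change the text the model sees, so the summaries it produces may differ from the ones reported in this section.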
It seems simple, doesn't it? Yet it took more than 35 years to go from RNNs and CNNs to transformers. It then took some of the world's brightest research teams to move from transformers designed for specific tasks to multi-task models that need little or no fine-tuning. Finally, Google's research team created a standard format for a transformer's input text, with a prefix that indicates the NLP problem to be solved. That is quite a feat!
A general topic sample
text = """
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971. The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval. The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%. Two tape backups were kept plus one on
paper tape. The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)
The output is as follows:
Number of characters: 534
Preprocessed and prepared text:
 summarize: The United States Declaration of Independence was the first Etextreleased by Project Gutenberg, early in 1971. The title was storedin an emailed instruction set which required a tape or diskpack behand mounted for retrieval. The diskpack was the size of a largecake in a cake carrier, cost $1500, and contained 5 megabytes, ofwhich this file took 1-2%. Two tape backups were kept plus one onpaper tape. The 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in 2001.

Summarized text:
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in
The Bill of Rights sample
#Bill of Rights,V
text = """
No person shall be held to answer for a capital, or otherwise infamous crime,
unless on a presentment or indictment of a Grand Jury, except in cases arising
in the land or naval forces, or in the Militia, when in actual service
in time of War or public danger; nor shall any person be subject for
the same offense to be twice put in jeopardy of life or limb;
nor shall be compelled in any criminal case to be a witness against himself,
nor be deprived of life, liberty, or property, without due process of law;
nor shall private property be taken for public use without just compensation.
"""
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)
The output is as follows:
Number of characters: 591
Preprocessed and prepared text:
 summarize: No person shall be held to answer for a capital, or otherwise infamous crime,unless on a presentment or indictment of a Grand Jury, except in cases arisingin the land or naval forces, or in the Militia, when in actual servicein time of War or public danger; nor shall any person be subject forthe same offense to be twice put in jeopardy of life or limb;nor shall be compelled in any criminal case to be a witness against himself,nor be deprived of life, liberty, or property, without due process of law;nor shall private property be taken for public use without just compensation.

Summarized text:
 no person shall be held to answer for a capital, or otherwise infamous crime, unless ona presentment or indictment ofa Grand Jury. nor shall any person be subject for the same offense to be twice put
A corporate law sample
#Montana Corporate Law
#https://corporations.uslegal.com/state-corporation-law/montana-corporation-law/#:~:text=Montana%20Corporation%20Law,carrying%20out%20its%20business%20activities.
text = """The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose. In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities. The corporation can sue and be sued in its corporate name. It has perpetual succession. The corporation can buy, sell or otherwise acquire an interest in a real or personal property. It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country. It can appoint officers and agents of the corporation for various duties and fix their compensation.
The name of a corporation must contain the word "corporation" or its abbreviation "corp." The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state. It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.
The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing. The qualifications for directors are fixed either by articles of incorporation or bylaws. The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation. The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators. The shareholders have the power to change the size of board of directors.
"""
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\n\nSummarized text: \n", summary)
Number of characters: 1816
Preprocessed and prepared text:
 summarize: The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose. In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities. The corporation can sue and be sued in its corporate name. It has perpetual succession. The corporation can buy, sell or otherwise acquire an interest in a real or personal property. It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country. It can appoint officers and agents of the corporation for various duties and fix their compensation.The name of a corporation must contain the word "corporation" or its abbreviation "corp." The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state. It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing. The qualifications for directors are fixed either by articles of incorporation or bylaws. The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation. The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators. The shareholders have the power to change the size of board of directors.

Summarized text:
 a corporation can be incorporated in the state of Montana to serve any lawful purpose. the corporation has perpetual succession and can sue and be sued in its corporate name. it can conduct business, carry on operations, and have offices
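All three summaries above stop in the middle of a sentence, most likely because generation ended at the max_length limit of 50 tokens passed to summarize. One possible way to tidy such output is to drop the unfinished trailing fragment. The sketch below is a post-processing idea only, not part of the original code; the helper name trim_to_last_sentence is hypothetical.

```python
import re


def trim_to_last_sentence(summary):
    """Drop a trailing fragment that does not end with '.', '!' or '?'."""
    # Positions just after each sentence-ending mark that is followed by
    # whitespace or by the end of the string.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", summary)]
    # If no complete sentence is found, return the text unchanged.
    return summary[:ends[-1]] if ends else summary


truncated = ("a corporation can be incorporated in the state of Montana to serve "
             "any lawful purpose. the corporation has perpetual succession and can "
             "sue and be sued in its corporate name. it can conduct business, "
             "carry on operations, and have offices")
print(trim_to_last_sentence(truncated))
```

This simple rule treats any period followed by a space as a sentence end, so abbreviations such as "corp." would be misread. Raising ml is the other obvious option when a longer summary is acceptable.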