Sentiment Analysis of Tweets with Streamlit in Action

Streamlit is an excellent library for building machine learning tools. In this tutorial, you will learn how to use Streamlit and Flair to develop a Twitter sentiment analysis application.

Related links: Streamlit Development Manual

1. Streamlit overview

Not everyone is a data scientist, but everyone needs the power of data science. Streamlit helps solve this problem: deploying a machine learning model with Streamlit takes just a few function calls.

For example, if you run the following code:

import streamlit as st

x = st.slider('Select a value')
st.write(x, 'squared is', x * x)

Streamlit creates a slider input like this:

Installing Streamlit is also simple:

pip3 install streamlit

Then you can run the app:

streamlit run <FILE>

Note that you cannot run a Streamlit script directly with python; you must launch it with streamlit run.

The code for this article can be downloaded here.

2. Sentiment classification

Sentiment classification is a classic problem in natural language processing (NLP): the goal is to judge whether the sentiment of a sentence is positive or negative.

For example, "I love Python!" should be classified as Positive, while "Python is the worst!" should be classified as Negative.
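
To make the task concrete, here is a toy, hand-written lexicon classifier. This is purely illustrative and not part of the tutorial's approach; the word lists are made up, and the real classifier below is learned from data.

```python
# Toy lexicon-based sentiment classifier, for illustration only.
# The word lists here are invented; a real classifier is trained later.
POSITIVE_WORDS = {'love', 'great', 'excellent'}
NEGATIVE_WORDS = {'worst', 'hate', 'terrible'}

def toy_sentiment(text):
    # Normalize: strip surrounding punctuation and lowercase each word
    words = {w.strip('!.,?').lower() for w in text.split()}
    pos_hits = len(words & POSITIVE_WORDS)
    neg_hits = len(words & NEGATIVE_WORDS)
    return 'Positive' if pos_hits >= neg_hits else 'Negative'
```

For instance, toy_sentiment('I love Python!') returns 'Positive'. A lexicon approach breaks down quickly on real tweets, which is why we train a model instead.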

3. The Flair library

Many popular machine learning libraries provide sentiment classifier implementations. In this tutorial we use Flair, a state-of-the-art NLP library, for its simplicity and effectiveness.

You can install Flair by executing the following command:

pip3 install flair

4. Sentiment140 data set

Any data science project needs a data set, and the Sentiment140 data set is a perfect match for ours. It contains 1.6 million tweets labeled 0 for negative and 4 for positive.

You can download the Sentiment140 data set here.

5. Data loading and preprocessing

Once the Sentiment140 data set has been downloaded, you can load the data using the following code:

import pandas as pd

col_names = ['sentiment','id','date','query_string','user','text']
data_path = 'training.1600000.processed.noemoticon.csv'

tweet_data = pd.read_csv(data_path, header=None, names=col_names, encoding="ISO-8859-1").sample(frac=1) # .sample(frac=1) shuffles the data
tweet_data = tweet_data[['sentiment', 'text']] # Disregard other columns
print(tweet_data.head())

Running the above code will output the following results:

        sentiment                                               text
1459123          4  @minalpatel Any more types of glassware you'd...
544833           0  I was a bit puzzled as to why it seemed to it...
398665           0  Yay...my car is ready....Was about 2500 miles...
708548           0               @JoshEJosh How ya been? I MISS you! 
264000           0  @MrFresh0587 yeah i know. well...i'm going to...

However, because we use .sample(frac=1) to shuffle the data randomly, you may get slightly different results.

The data is still messy; let's preprocess it first:

import re

allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

def preprocess(text):
    return ''.join([' ' + char + ' ' if char in punct else char for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE) if char in allowed_chars]])[:maxlen]

The above function is a bit dense, but in short it removes links and all characters outside an allowed set, spaces out punctuation, and truncates the text to 280 characters. There are better ways to clean up links and do other preprocessing, but we use the simplest method here.
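
For readers who find the one-liner hard to parse, here is an equivalent step-by-step sketch with the same behavior, using the same allowed_chars, punct, and maxlen values as above (the name preprocess_readable is just for illustration):

```python
import re

allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

def preprocess_readable(text):
    # 1. Replace every URL with the placeholder 'http'
    text = re.sub(r'http\S+', 'http', text, flags=re.MULTILINE)
    # 2. Drop any character outside the allowed set
    text = ''.join(c for c in text if c in allowed_chars)
    # 3. Surround punctuation with spaces so it becomes separate tokens
    text = ''.join(' ' + c + ' ' if c in punct else c for c in text)
    # 4. Truncate to Twitter's 280-character limit (after padding, like the original)
    return text[:maxlen]
```

For example, preprocess_readable('Check https://t.co/x out!') yields 'Check http out ! ' (note the padded exclamation mark).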

Flair has specific requirements for data format, which looks like this:

__label__<LABEL>    <TEXT>

In our tweet sentiment analysis application, the data should be formatted as follows:

__label__4    <PRE-PROCESSED TWEET>
__label__0    <PRE-PROCESSED TWEET>
...
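
As a quick illustration, here is a hypothetical helper (not used in the rest of the tutorial) that formats one row in this tab-separated style:

```python
def to_flair_row(sentiment, text):
    # Flair's classification loader expects '__label__<LABEL>' followed by
    # the text, which is why we later save the CSVs with sep='\t'
    return f"__label__{sentiment}\t{text}"
```

For example, to_flair_row(4, 'i love python !') produces the string '__label__4', a tab, then the tweet text.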

To do this, we need three steps:

1. Apply the preprocessing function

tweet_data['text'] = tweet_data['text'].apply(preprocess)

2. Prefix each sentiment label with __label__

tweet_data['sentiment'] = '__label__' + tweet_data['sentiment'].astype(str)

3. Save data

import os

# Create directory for saving data if it does not already exist
data_dir = './processed-data'
if not os.path.isdir(data_dir):
    os.mkdir(data_dir)

# Save a percentage of the data (you could also only load a fraction of the data instead)
amount = 0.125

tweet_data.iloc[0:int(len(tweet_data)*0.8*amount)].to_csv(data_dir + '/train.csv', sep='\t', index=False, header=False)
tweet_data.iloc[int(len(tweet_data)*0.8*amount):int(len(tweet_data)*0.9*amount)].to_csv(data_dir + '/test.csv', sep='\t', index=False, header=False)
tweet_data.iloc[int(len(tweet_data)*0.9*amount):int(len(tweet_data)*1.0*amount)].to_csv(data_dir + '/dev.csv', sep='\t', index=False, header=False)

In the above code, you may notice two problems:

  • We only save part of the data. This is because the Sentiment140 data set is so large that loading the complete data set into Flair requires too much memory.
  • We split the data into training, test, and development sets, since Flair requires the data to be split this way when loading. By default the split ratio is 8-1-1: 80% of the data goes into the training set, 10% into the test set, and 10% into the development set.
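
As a sanity check on the slice boundaries used above, here is the plain arithmetic, assuming the dataset's known size of 1.6 million rows:

```python
n = 1_600_000        # rows in Sentiment140
amount = 0.125       # fraction of the data we keep

train_end = int(n * 0.8 * amount)  # end of the 80% training slice
test_end = int(n * 0.9 * amount)   # end of the 10% test slice
dev_end = int(n * 1.0 * amount)    # end of the 10% dev slice

# Slices [0, train_end), [train_end, test_end), [test_end, dev_end)
# are contiguous and non-overlapping.
print(train_end, test_end, dev_end)  # 160000 180000 200000
```

So with amount = 0.125 we keep 200,000 tweets in total: 160,000 for training and 20,000 each for test and dev.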

Now, the data is ready!

6. Text sentiment classification with Flair

In this tutorial, we only cover the basics of Flair. If you need more details, it is recommended that you check Flair's official documentation.

First, we use Flair's NLPTaskDataFetcher class to load data:

from flair.data_fetcher import NLPTaskDataFetcher
from pathlib import Path

corpus = NLPTaskDataFetcher.load_classification_corpus(Path(data_dir), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')

Then we construct a label dictionary that records all the labels assigned to the texts in the corpus:

label_dict = corpus.make_label_dictionary()

Now you can load Flair's built-in GloVe word embeddings:

from flair.embeddings import WordEmbeddings, FlairEmbeddings

word_embeddings = [WordEmbeddings('glove'),
#                    FlairEmbeddings('news-forward'),
#                    FlairEmbeddings('news-backward')
                  ]

The two commented-out lines are options Flair provides for better results, but my machine's memory is limited, so I could not test them.

After loading the word embeddings, combine them into a document embedding with the following code:

from flair.embeddings import DocumentRNNEmbeddings

document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

Now combine the document embeddings and the label dictionary to build a TextClassifier model:

from flair.models import TextClassifier

classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

Next, we can create a model trainer instance to train the model with our corpus:

from flair.trainers import ModelTrainer

trainer = ModelTrainer(classifier, corpus)

Once we start training, we will need to wait a while:

trainer.train('model-saves',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=8,
              max_epochs=200)

After the model is trained, you can test it with the following code:

from flair.data import Sentence

classifier = TextClassifier.load('model-saves/final-model.pt')

pos_sentence = Sentence(preprocess('I love Python!'))
neg_sentence = Sentence(preprocess('Python is the worst!'))

classifier.predict(pos_sentence)
classifier.predict(neg_sentence)

print(pos_sentence.labels, neg_sentence.labels)

You should get something like this:

[4 (0.9758405089378357)] [0 (0.8753706812858582)]

It seems that the prediction is correct!
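
Each bracketed pair in the output is a label value and its confidence score. A small hypothetical helper, mirroring the dataset's labels (0 for negative, 4 for positive), turns a prediction into readable text:

```python
label_names = {'0': 'Negative', '4': 'Positive'}

def describe(value, score):
    # Format a predicted label and its confidence as a readable phrase
    return f"{label_names[value]} ({score:.1%} confidence)"

print(describe('4', 0.9758405089378357))  # Positive (97.6% confidence)
```

The Streamlit app below uses the same label mapping when displaying predictions.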

7. Scraping tweets

Now we can predict whether the sentiment of a single tweet is positive or negative. But that alone is not very useful, so how can we improve it?

My idea is to scrape the latest tweets matching a given query, classify the sentiment of each one, and then compute the positive/negative ratio.

I personally like to use twitterscraper to scrape tweets. Although it is not fast, it lets you bypass Twitter's request limits. Install twitterscraper with the following command:

pip3 install twitterscraper

With that installed, we will do the actual scraping later.

8. The Streamlit script

Create a new file main.py and start by importing some modules:

import datetime as dt
import re

import pandas as pd
import streamlit as st
from flair.data import Sentence
from flair.models import TextClassifier
from twitterscraper import query_tweets

Next, we can do some basic processing, such as setting the page title and loading the classification model:

# Set page title
st.title('Twitter Sentiment Analysis')

# Load classification model
with st.spinner('Loading classification model...'):
    classifier = TextClassifier.load('models/best-model.pt')

The with st.spinner block shows the user a progress indicator while the classification model loads.

Next, copy over the preprocessing function we wrote earlier:

import re

allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

def preprocess(text):
    return ''.join([' ' + char + ' ' if char in punct else char for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE) if char in allowed_chars]])[:maxlen]

We first implement the classification of a single tweet:

st.subheader('Single tweet classification')

tweet_input = st.text_input('Tweet:')

As long as the input text is not empty, we do the following:

  • Preprocess the tweet
  • Make a prediction
  • Display the prediction result

if tweet_input != '':
    # Pre-process tweet
    sentence = Sentence(preprocess(tweet_input))

    # Make predictions
    with st.spinner('Predicting...'):
        classifier.predict(sentence)

    # Show predictions
    label_dict = {'0': 'Negative', '4': 'Positive'}

    if len(sentence.labels) > 0:
        st.write('Prediction:')
        st.write(label_dict[sentence.labels[0].value] + ' with ',
                sentence.labels[0].score*100, '% confidence')

With st.write you can output any text, and even display Pandas data frames directly.

Now you can run:

streamlit run main.py

The result looks like this:

Next, we can implement the earlier idea: search tweets on a topic and compute the ratio of positive to negative sentiment.

st.subheader('Search Twitter for Query')

# Get user input
query = st.text_input('Query:', '#')

# As long as the query is valid (not empty or equal to '#')...
if query != '' and query != '#':
    with st.spinner(f'Searching for and analyzing {query}...'):
        # Get English tweets from the past 4 weeks
        tweets = query_tweets(query, begindate=dt.date.today() - dt.timedelta(weeks=4), lang='en')

        # Initialize empty dataframe
        tweet_data = pd.DataFrame({
            'tweet': [],
            'predicted-sentiment': []
        })

        # Keep track of positive vs. negative tweets
        pos_vs_neg = {'0': 0, '4': 0}

        # Add data for each tweet
        for tweet in tweets:
            # Skip iteration if tweet is empty
            if tweet.text in ('', ' '):
                continue
            # Make predictions
            sentence = Sentence(preprocess(tweet.text))
            classifier.predict(sentence)
            sentiment = sentence.labels[0]
            # Keep track of positive vs. negative tweets
            pos_vs_neg[sentiment.value] += 1
            # Append new data
            tweet_data = tweet_data.append({'tweet': tweet.text, 'predicted-sentiment': sentiment}, ignore_index=True)

Finally, we show the collected data:

try:
    st.write(tweet_data)
    # Show positive to negative tweet ratio
    try:
        st.write('Positive to negative tweet ratio:', pos_vs_neg['4']/pos_vs_neg['0'])
    except ZeroDivisionError: # if no negative tweets
        st.write('All positive tweets')
except NameError: # if no queries have been made yet
    pass
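
The nested try/except above can also be captured in a small hypothetical helper, which is easier to reason about in isolation (returning None when there are no negative tweets):

```python
def pos_neg_ratio(counts):
    # counts maps the raw labels ('0' negative, '4' positive) to tweet counts.
    # Return None instead of raising ZeroDivisionError when no negatives exist.
    if counts['0'] == 0:
        return None
    return counts['4'] / counts['0']
```

For example, pos_neg_ratio({'0': 2, '4': 6}) gives 3.0, while an all-positive result gives None.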

Run the application again, and the result is as follows:

Here is our complete Streamlit application script:

import datetime as dt
import re

import pandas as pd
import streamlit as st
from flair.data import Sentence
from flair.models import TextClassifier
from twitterscraper import query_tweets

# Set page title
st.title('Twitter Sentiment Analysis')

# Load classification model
with st.spinner('Loading classification model...'):
    classifier = TextClassifier.load('models/best-model.pt')

# Preprocess function
allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

def preprocess(text):
    # Delete URLs, space out punctuation, remove disallowed characters, and cut to maxlen
    return ''.join([' ' + char + ' ' if char in punct else char for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE) if char in allowed_chars]])[:maxlen]

### SINGLE TWEET CLASSIFICATION ###
st.subheader('Single tweet classification')

# Get sentence input, preprocess it, and convert to flair.data.Sentence format
tweet_input = st.text_input('Tweet:')

if tweet_input != '':
    # Pre-process tweet
    sentence = Sentence(preprocess(tweet_input))

    # Make predictions
    with st.spinner('Predicting...'):
        classifier.predict(sentence)

    # Show predictions
    label_dict = {'0': 'Negative', '4': 'Positive'}

    if len(sentence.labels) > 0:
        st.write('Prediction:')
        st.write(label_dict[sentence.labels[0].value] + ' with ',
                sentence.labels[0].score*100, '% confidence')

### TWEET SEARCH AND CLASSIFY ###
st.subheader('Search Twitter for Query')

# Get user input
query = st.text_input('Query:', '#')

# As long as the query is valid (not empty or equal to '#')...
if query != '' and query != '#':
    with st.spinner(f'Searching for and analyzing {query}...'):
        # Get English tweets from the past 4 weeks
        tweets = query_tweets(query, begindate=dt.date.today() - dt.timedelta(weeks=4), lang='en')

        # Initialize empty dataframe
        tweet_data = pd.DataFrame({
            'tweet': [],
            'predicted-sentiment': []
        })

        # Keep track of positive vs. negative tweets
        pos_vs_neg = {'0': 0, '4': 0}

        # Add data for each tweet
        for tweet in tweets:
            # Skip iteration if tweet is empty
            if tweet.text in ('', ' '):
                continue
            # Make predictions
            sentence = Sentence(preprocess(tweet.text))
            classifier.predict(sentence)
            sentiment = sentence.labels[0]
            # Keep track of positive vs. negative tweets
            pos_vs_neg[sentiment.value] += 1
            # Append new data
            tweet_data = tweet_data.append({'tweet': tweet.text, 'predicted-sentiment': sentiment}, ignore_index=True)

# Show query data and sentiment if available
try:
    st.write(tweet_data)
    try:
        st.write('Positive to negative tweet ratio:', pos_vs_neg['4']/pos_vs_neg['0'])
    except ZeroDivisionError: # if no negative tweets
        st.write('All positive tweets')
except NameError: # if no queries have been made yet
    pass

Original link: Streamlit+Flair tweet sentiment analysis application, huizhi.com


Posted on Wed, 29 Jan 2020 23:21:09 -0500 by wolfraider