Python topic modeling, LDA model, t-SNE dimension reduction clustering, word cloud visualization, text mining, newsgroup dataset

Introduction
Import newsgroup dataset
Tokenize sentences and clean them up
Build bigram and trigram models and lemmatize
Build topic model
What is the dominant topic and its percentage contribution in each document
The most representative sentence of each topic
Frequency distribution of the number of words in a document
Word clouds of the top N keywords in each topic
Word counts of topic keywords
Sentence chart colored by topic
What are the most discussed topics in the documents?
t-SNE (t-distributed stochastic neighbor embedding) clustering chart
pyLDAvis
Conclusion
Original link: http://tecdat.cn/?p=24376

In this article, we discuss how to visualize the output and results of a topic model (LDA) built with the gensim package.

Introduction

We follow a structured workflow to build a topic model based on the Latent Dirichlet Allocation (LDA) algorithm.

In this article, we will explore a variety of strategies for effectively visualizing the results of the topic model with matplotlib plots.

I will use a subset of the 20 Newsgroups dataset, since the focus is more on the methods for visualizing the results.

Let's start by importing the packages and the 20 Newsgroups dataset.

import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
import spacy
import matplotlib.pyplot as plt

# NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Import newsgroup dataset

Let's import the newsgroup dataset and keep only 4 categories.

# Import the dataset and keep only 4 categories
df = df.loc[df.target_names.isin(['soc.religion.christian', 'rec.sport.hockey',
                                  'talk.politics.mideast', 'rec.motorcycles']), :]
print(df.shape)  #> (2361, 3)
df.head()
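The snippet above assumes a DataFrame df is already loaded. As one possibility, here is a minimal sketch that builds such a DataFrame from scikit-learn's bundled copy of the corpus; the column names content, target and target_names are assumptions chosen to match the filtering and tokenization code in this article.

from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Fetch the training split of 20 Newsgroups and arrange it as a DataFrame
newsgroups = fetch_20newsgroups(subset='train')
df = pd.DataFrame({
    'content': newsgroups.data,
    'target': newsgroups.target,
    'target_names': [newsgroups.target_names[i] for i in newsgroups.target],
})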

Tokenize sentences and clean them up

Remove the emails, newline characters and single quotes, and finally split each sentence into a list of words using gensim's simple_preprocess(). Setting the deacc=True option removes punctuation.

def sent_to_words(sentences):
    for sent in sentences:
        sent = re.sub(r'\S*@\S*\s?', '', sent)  # remove emails
        sent = re.sub(r'\s+', ' ', sent)        # remove newline characters
        sent = re.sub(r"\'", "", sent)          # remove single quotes
        sent = gensim.utils.simple_preprocess(str(sent), deacc=True)  # convert to a list of tokens
        yield(sent)

# Convert the text column to a list and tokenize it
data = df.content.values.tolist()
data_words = list(sent_to_words(data))

Build bigram and trigram models and lemmatize

Let's form bigrams and trigrams using the Phrases model. To improve execution speed, the models are passed to Phraser().

Next, lemmatize each word to its root form, keeping only nouns, adjectives, verbs and adverbs.

We keep only these POS tags because they contribute the most to the meaning of the sentences. Here, I use spaCy for lemmatization.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # a higher threshold gives fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# !python3 -m spacy download en_core_web_sm  # run once in a terminal

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove stop words, form bigrams and trigrams, and lemmatize."""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # Remove stop words once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    return texts_out

data_ready = process_words(data_words)  # processed text data!
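As a quick, purely illustrative sanity check on the output of process_words(), you can peek at the first processed document:

# Peek at the first processed document (lemmatized, stop words removed)
print(data_ready[0][:10])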

Build topic model

To build an LDA topic model, you need a corpus and a dictionary. Let's create them first, and then build the model. The trained topics (keywords and weights) are also printed below.

If you examine the topic keywords, together they represent the topics we originally selected: Christianity, hockey, the Middle East and motorcycles. Good!

# Create the dictionary
id2word = corpora.Dictionary(data_ready)

# Create the corpus: term-document frequency
corpus = [id2word.doc2bow(text) for text in data_ready]

# Build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=10,
                                            passes=10,
                                            alpha='symmetric',
                                            iterations=100,
                                            per_word_topics=True)

pprint(lda_model.print_topics())
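Because the model is trained with per_word_topics=True, indexing it with a document's bag-of-words returns a tuple of (topic distribution, per-word topic assignments, per-word phi values); the later sections unpack exactly this structure. A small, illustrative inspection of the first document:

# With per_word_topics=True, each document maps to three pieces of information
topic_percs, wordid_topics, wordid_phivalues = lda_model[corpus[0]]
print(topic_percs)        # [(topic_id, probability), ...] for the first document
print(wordid_topics[:5])  # [(word_id, [most likely topic ids]), ...] for its first words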

What is the dominant topic and its percentage contribution in each document

In the LDA model, each document is composed of multiple topics, but typically only one of them dominates. The code below extracts the dominant topic for each document and displays its weight and keywords in a nicely formatted output.

In this way, you will know which document belongs primarily to which topic.

def format_topics_sentences(ldamodel, corpus, texts):
    # Get the dominant topic, its percentage contribution and its keywords for each document
    rows = []
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        topic_num, prop_topic = row[0]  # dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the original text at the end of the output
    contents = pd.Series(texts)
    return pd.concat([sent_topics_df, contents], axis=1)

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)

The most representative sentence of each topic

Sometimes you want to get a sample of sentences that best represent a given topic. This code gets the most typical sentences for each topic.

# Display setting to show more characters in a column
pd.options.display.max_colwidth = 100

sent_topics_sorteddf = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf,
                                      grp.sort_values(['Perc_Contribution'], ascending=False).head(1)],
                                     axis=0)

# Reset index
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf.columns = ['Topic_Num', 'Topic_Perc_Contrib', 'Keywords', 'Representative Text']

# Display
sent_topics_sorteddf.head(10)

Frequency distribution of the number of words in a document

When working with a large number of documents, you want to know how big the documents are as a whole and per topic. Let's plot the distribution of document word counts.

doc_lens = [len(d) for d in df_dominant_topic.Text]

# Plot the distribution of document word counts
plt.figure(figsize=(16, 7), dpi=160)
plt.hist(doc_lens, bins=1000, color='navy')
plt.gca().set(xlim=(0, 1000), xlabel='Document Word Count', ylabel='Number of Documents')
plt.xticks(np.linspace(0, 1000, 9))
plt.title('Distribution of Document Word Counts', fontdict=dict(size=22))
plt.show()
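A per-topic version of this plot is also useful. A minimal sketch of that variant, assuming the df_dominant_topic frame built earlier and four topics, and using seaborn for a density overlay:

import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(16, 14), dpi=160, sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    # Word counts of the documents whose dominant topic is i
    df_sub = df_dominant_topic.loc[df_dominant_topic.Dominant_Topic == i, :]
    topic_doc_lens = [len(d) for d in df_sub.Text]
    ax.hist(topic_doc_lens, bins=100, color='steelblue')
    sns.kdeplot(topic_doc_lens, color='black', ax=ax.twinx())
    ax.set(xlim=(0, 1000), xlabel='Document Word Count')
    ax.set_title('Topic: ' + str(i), fontdict=dict(size=16))

fig.tight_layout()
fig.subplots_adjust(top=0.90)
fig.suptitle('Distribution of Document Word Counts by Dominant Topic', fontsize=22)
plt.show()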

Word clouds of the top N keywords in each topic

Although you have already seen the keywords in each topic, a word cloud in which the size of each word is proportional to its weight is a nice way to visualize them.

# Word clouds of the top N keywords in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  prefer_horizontal=1.0)
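The snippet above only configures the word cloud object. A minimal sketch of one way to render one cloud per topic, reusing that cloud object and pulling the keyword weights from the trained lda_model:

import matplotlib.colors as mcolors

topics = lda_model.show_topics(num_topics=4, num_words=10, formatted=False)
colors = [color for name, color in mcolors.TABLEAU_COLORS.items()]

fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
    # Keyword weights of topic i as {word: weight}
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16), color=colors[i])
    ax.axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.tight_layout()
plt.show()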

Word counts of topic keywords

When it comes to the keywords in a topic, their importance (weight) matters. It is also interesting to see how frequently those words appear in the documents.

Let's plot the word counts and the weights of each topic's keywords in the same chart.

Watch out for words that appear in multiple topics, or whose relative frequency is much greater than their weight; such words usually become less important. The chart below was produced after adding several of these words to the stop word list at the beginning and rerunning the training.

topics = lda_model.show_topics(formatted=False)

# Plot the word counts and weights of the topic keywords
fig, axes = plt.subplots(2, 2, figsize=(16, 10), sharey=True, dpi=160)
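A sketch of one way to fill the four panels set up above, assuming collections.Counter for the corpus word counts and a twin y-axis for the keyword weights (the names counter, df_kw and cols are illustrative):

from collections import Counter
import matplotlib.colors as mcolors

# Count every word in the processed corpus
counter = Counter(w for doc in data_ready for w in doc)

# One row per (topic, keyword): weight within the topic and raw count in the corpus
rows = []
for topic_id, keywords in topics:
    for word, weight in keywords:
        rows.append([word, topic_id, weight, counter[word]])
df_kw = pd.DataFrame(rows, columns=['word', 'topic_id', 'importance', 'word_count'])

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
for i, ax in enumerate(axes.flatten()):
    sub = df_kw.loc[df_kw.topic_id == i, :]
    ax.bar(x='word', height='word_count', data=sub, color=cols[i], width=0.5, alpha=0.3, label='Word Count')
    ax_twin = ax.twinx()
    ax_twin.bar(x='word', height='importance', data=sub, color=cols[i], width=0.2, label='Weights')
    ax.set_title('Topic: ' + str(i), color=cols[i], fontsize=16)
    ax.tick_params(axis='x', labelrotation=30)
    ax.legend(loc='upper left')
    ax_twin.legend(loc='upper right')

fig.tight_layout(pad=2)
plt.show()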

Sentence chart colored by topic

Each word in a document is associated with one of the four topics. Let's color each word in a given document by the id of the topic it belongs to.

from matplotlib.patches import Rectangle
import matplotlib.colors as mcolors

# Color the words of the first N documents by their topic
def sentences_chart(lda_model=lda_model, corpus=corpus, start=0, end=13):
    corp = corpus[start:end]
    mycolors = [color for name, color in mcolors.TABLEAU_COLORS.items()]
    fig, axes = plt.subplots(end - start, 1, figsize=(20, (end - start) * 0.95), dpi=160)
    axes[0].axis('off')
    for i, ax in enumerate(axes):
        if i > 0:
            corp_cur = corp[i - 1]
            topic_percs, wordid_topics, wordid_phivalues = lda_model[corp_cur]
            word_dominanttopic = [(lda_model.id2word[wd], topic[0]) for wd, topic in wordid_topics]
            # Draw a rectangular area colored by the document's dominant topic
            topic_percs_sorted = sorted(topic_percs, key=lambda x: (x[1]), reverse=True)
            ax.add_patch(Rectangle((0.0, 0.05), 0.99, 0.90, fill=None,
                                   color=mycolors[topic_percs_sorted[0][0]], linewidth=2))
            # Write each word, colored by the topic it is assigned to
            word_pos = 0.06
            for j, (word, topic) in enumerate(word_dominanttopic):
                if j < 14:
                    ax.text(word_pos, 0.5, word, horizontalalignment='left', verticalalignment='center',
                            fontsize=16, color=mycolors[topic], transform=ax.transAxes, fontweight=700)
                    word_pos += .009 * len(word)  # shift the position for the next word
            ax.axis('off')
    plt.subplots_adjust(wspace=0, hspace=0)
    plt.show()

sentences_chart()

What are the most discussed topics in the documents?

Let's calculate the total number of documents attributed to each topic.

# Compute the dominant topics and topic weights across the corpus
def topics_per_document(model, corpus, start=0, end=1):
    corpus_sel = corpus[start:end]
    dominant_topics = []
    topic_percentages = []
    for i, corp in enumerate(corpus_sel):
        topic_percs, wordid_topics, wordid_phivalues = model[corp]
        dominant_topic = sorted(topic_percs, key=lambda x: x[1], reverse=True)[0][0]
        dominant_topics.append((i, dominant_topic))
        topic_percentages.append(topic_percs)
    return dominant_topics, topic_percentages

dominant_topics, topic_percentages = topics_per_document(model=lda_model, corpus=corpus, end=-1)

# Distribution of dominant topics in each document
df_dom = pd.DataFrame(dominant_topics, columns=['Document_Id', 'Dominant_Topic'])
dominant_topic_in_each_doc = df_dom.groupby('Dominant_Topic').size()
df_dominant_topic_in_each_doc = dominant_topic_in_each_doc.to_frame(name='count').reset_index()

# Total topic distribution by actual weight
topic_weightage_by_doc = pd.DataFrame([dict(t) for t in topic_percentages])
df_topic_weightage_by_doc = topic_weightage_by_doc.sum().to_frame(name='count').reset_index()

# Top 3 keywords for each topic
topic_top3words = [(i, topic) for i, topics in lda_model.show_topics(formatted=False)
                              for j, (topic, wt) in enumerate(topics) if j < 3]

Let's make two figures:

  1. The number of documents for each topic, counted by assigning each document to the topic that has the largest weight in it.
  2. The number of documents for each topic, counted by summing up the actual weight contribution of each topic across the documents.
from matplotlib.ticker import FuncFormatter

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), dpi=120, sharey=True)

# Topic distribution by dominant topics
ax1.bar(x='Dominant_Topic', height='count', data=df_dominant_topic_in_each_doc, width=.5, color='firebrick')
ax1.set_xticks(range(len(df_dominant_topic_in_each_doc)))
ax1.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: 'Topic ' + str(int(x))))
ax1.set_title('Number of Documents by Dominant Topic', fontdict=dict(size=10))
ax1.set_ylabel('Number of Documents')

# Topic distribution by topic weights
ax2.bar(x='index', height='count', data=df_topic_weightage_by_doc, width=.5, color='steelblue')
ax2.set_title('Number of Documents by Topic Weightage', fontdict=dict(size=10))

plt.show()

t-SNE (t-distributed stochastic neighbor embedding) clustering chart

Let's use the t-SNE (t-distributed stochastic neighbor embedding) algorithm to visualize the document clusters in 2D space.

from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_notebook, show
import matplotlib.colors as mcolors

# Get the topic weights of each document
topic_weights = []
for i, row_list in enumerate(lda_model[corpus]):
    topic_weights.append([w for _, w in row_list[0]])

# Array of topic weights
arr = pd.DataFrame(topic_weights).fillna(0).values

# Keep only well-separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number of each document
topic_num = np.argmax(arr, axis=1)

# t-SNE dimensionality reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

# Plot the topic clusters with Bokeh
output_notebook()
n_topics = 4
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics))
plot.scatter(x=tsne_lda[:, 0], y=tsne_lda[:, 1], color=mycolors[topic_num])
show(plot)

pyLDAvis

Finally, pyLDAvis is the most commonly used and a very good way to visualize the information contained in a topic model.

import pyLDAvis
pyLDAvis.enable_notebook()
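A minimal sketch of preparing and displaying the visualization for the model trained above; note that the submodule is pyLDAvis.gensim_models in recent pyLDAvis releases (older versions used pyLDAvis.gensim):

import pyLDAvis.gensim_models

# Build the interactive visualization from the trained model, corpus and dictionary
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary=id2word)
vis  # in a notebook, this renders the interactive topic map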
