Python topic modeling, LDA model, t-SNE dimension reduction clustering, word cloud visualization, text mining, newsgroup dataset

Original link:

In this article, we use the gensim package to visualize the output and results of a topic model (LDA).


We follow a structured workflow and build a topic model based on the Latent Dirichlet Allocation (LDA) algorithm.

In this article, we will explore a variety of strategies for effectively visualizing the results of the topic model using matplotlib.

I will use part of the 20 Newsgroups dataset, since the focus here is more on the methods for visualizing the results.

Let's start by importing the package and the 20 News Groups dataset.

import re
import gensim
import pandas as pd
import matplotlib.pyplot as plt

#  NLTK stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Import newsgroup dataset

Let's import the newsgroups dataset and keep only 4 categories.

#  Import dataset (loading the newsgroup posts into df is omitted in the original)
df = df.loc[df.target_names.isin(['soc.religion.christian', 'rec.sport.hockey',
                                  'talk.politics.mideast', 'rec.motorcycles']), :]
print(df.shape)  #> (2361, 3)

Tokenize sentences and clean them up

Remove emails, newline characters and single quotes, then split each sentence into a list of words with gensim's simple_preprocess(). Setting the deacc=True option removes punctuation.

def sent_to_words(sentences):
    for sent in sentences:
        sent = re.sub(r'\S*@\S*\s?', '', sent)  #  Remove emails
        sent = re.sub(r'\s+', ' ', sent)        #  Remove newline characters
        sent = re.sub(r"\'", "", sent)          #  Remove single quotes
        sent = gensim.utils.simple_preprocess(str(sent), deacc=True)
        yield sent

#  Convert to list
data = df.content.values.tolist()
data_words = list(sent_to_words(data))

Building bigram and trigram models and lemmatizing

Let's form bigrams and trigrams using the Phrases model. For faster execution, the trained Phrases model is passed to Phraser().

Next, lemmatize each word to its root form, keeping only nouns, adjectives, verbs and adverbs.

We keep only these POS tags because they contribute the most to the meaning of the sentence. Here I use spaCy for lemmatization.

#  Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  #  A higher threshold yields fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

import spacy
from gensim.utils import simple_preprocess

# !python3 -m spacy download en  #  Run once at the terminal

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove stop words, form bigrams and trigrams, and lemmatize"""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load('en', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    #  After lemmatization, remove the stop words once more
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    return texts_out

data_ready = process_words(data_words)  #  Processed text data!

Build topic model

To build an LDA topic model with gensim, you need a corpus and a dictionary. Let's create them first, and then build the model. The trained topics (keywords and weights) are printed below as well.

If you examine the topic keywords, together they represent the topics we originally selected: church, hockey, the Mideast and motorcycles. Good!

from gensim import corpora
from pprint import pprint

#  Create dictionary
id2word = corpora.Dictionary(data_ready)

#  Create corpus: term-document frequency
corpus = [id2word.doc2bow(text) for text in data_ready]

#  Build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            random_state=100,
                                            per_word_topics=True)
pprint(lda_model.print_topics())

What is the dominant topic and its percentage contribution in each document?

In the LDA model, each document is composed of multiple topics, but typically only one of them dominates. The code below extracts the dominant topic of each document and displays its weight and keywords in a nicely formatted output.

In this way, you will know which document belongs primarily to which topic.
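The core of that extraction is just sorting each document's (topic, probability) pairs by probability. A minimal stand-alone sketch with made-up distributions:

```python
# Hypothetical per-document topic distributions: (topic_id, probability) pairs
doc_topics = [
    [(0, 0.61), (1, 0.29), (2, 0.10)],
    [(2, 0.55), (0, 0.30), (1, 0.15)],
]

def dominant_topic(row):
    # Sort by probability (descending) and keep the top entry
    return sorted(row, key=lambda x: x[1], reverse=True)[0]

dominants = [dominant_topic(row) for row in doc_topics]
print(dominants)  # [(0, 0.61), (2, 0.55)]
```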

def format_topics_sentences(ldamodel, corpus, texts):
    #  Init output
    sent_topics = []

    #  Get the dominant topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        #  Get the dominant topic, percentage contribution and keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics.append([int(topic_num), round(prop_topic, 4), topic_keywords])
            else:
                break
    sent_topics_df = pd.DataFrame(sent_topics, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    #  Add the original text at the end of the output
    sent_topics_df = pd.concat([sent_topics_df, pd.Series(texts, name='Text')], axis=1)
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)

#  Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

The most representative sentence of each topic

Sometimes you want a sample of the sentences that best represent a given topic. The following code retrieves the most typical sentences for each topic.

#  Display setting to show more characters in the columns
pd.options.display.max_colwidth = 100

sent_topics_sorteddf = pd.DataFrame()
sent_topics_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf,
                                      grp.sort_values(['Perc_Contribution'], ascending=False).head(1)],
                                     axis=0)

#  Reset index
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

#  Format
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Representative Text"]

#  Display
sent_topics_sorteddf.head(10)

Frequency distribution of the number of words in a document

When working with a large number of documents, you want to know their overall size and the size per topic. Let's plot the distribution of document word counts.

doc_lens = [len(d) for d in df_dominant_topic.Text]

#  Plot
plt.figure(figsize=(16, 7), dpi=160)
plt.hist(doc_lens, bins=1000, color='navy')
plt.show()

import seaborn as sns
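A self-contained sketch of the same histogram; here random synthetic lengths stand in for the real document word counts taken from the Text column:

```python
import random
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

random.seed(0)
# Synthetic stand-in for [len(d) for d in df_dominant_topic.Text]
doc_lens = [random.randint(20, 500) for _ in range(1000)]

plt.figure(figsize=(8, 5), dpi=100)
plt.hist(doc_lens, bins=50, color="navy")
plt.xlabel("Document word count")
plt.ylabel("Number of documents")
plt.title("Distribution of document word counts")
plt.savefig("doc_word_counts.png")
```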

Word clouds of the top N keywords in each topic

Although you have already seen the keywords in each topic, a word cloud whose word sizes are proportional to the weights is a good way to visualize them.


#  1. Word cloud of the top N words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  max_words=10)  #  further styling options were cut off in the original


Number of words of subject keywords

When it comes to the keywords in a topic, the importance (weight) of each keyword matters. In addition, the frequency of these words in the documents is interesting as well.

Let's plot the number of words and the weight of each keyword in the same chart.

You should focus on words that appear in multiple topics and words whose relative frequency is greater than their weight. Usually such words become less important. The chart I draw below is the result of adding several such words to the stop word list at the beginning and rerunning the training.

topics = lda_model.show_topics(formatted=False)

#  Plot word counts and weights of topic keywords
fig, axes = plt.subplots(2, 2, figsize=(16, 10), sharey=True, dpi=160)
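The word counts behind that chart can be computed with a plain Counter over the processed corpus. A toy version with hypothetical tokens and keyword weights:

```python
from collections import Counter

# Toy stand-ins: tokenized documents and one topic's (keyword, weight) pairs
data_ready = [["bike", "engine", "ride"], ["bike", "road", "ride", "ride"]]
topic_keywords = [("ride", 0.12), ("bike", 0.10), ("engine", 0.04)]

# Corpus-wide frequency of every token
counter = Counter(word for doc in data_ready for word in doc)

# One row per keyword: (word, importance in topic, frequency in corpus)
rows = [(word, weight, counter[word]) for word, weight in topic_keywords]
print(rows)  # [('ride', 0.12, 3), ('bike', 0.1, 2), ('engine', 0.04, 1)]
```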

Sentence chart colored by topic

Each word in a document represents one of the four topics. Let's color each word in a given document according to its topic id.

#  Coloring of N sentences
for i, ax in enumerate(axes):
    corp_cur = corp[i-1]
    topic_percs, wordid_topics, wordid_phivalues = lda[corp_cur]
    word_dominanttopic = [(lda.id2word[wd], topic[0]) for wd, topic in wordid_topics]
    #  Draw the rectangle area
    topic_percs_sorted = sorted(topic_percs, key=lambda x: (x[1]), reverse=True)

    word_pos = 0.06
    #  ... (drawing of the individual word rectangles was cut off in the original)

plt.subplots_adjust(wspace=0, hspace=0)

What is the most discussed topic in the document?

Let's calculate the total number of documents attributed to each topic.

def topics_per_document(model, corpus, start=0, end=1):
    corpus_sel = corpus[start:end]
    dominant_topics = []
    topic_percentages = []
    for i, corp in enumerate(corpus_sel):
        topic_percs, wordid_topics, wordid_phivalues = model[corp]
        dominant_topic = sorted(topic_percs, key=lambda x: x[1], reverse=True)[0][0]
        dominant_topics.append((i, dominant_topic))
        topic_percentages.append(topic_percs)
    return dominant_topics, topic_percentages

dominant_topics, topic_percentages = topics_per_document(model=lda_model, corpus=corpus, end=-1)

#  Distribution of dominant topics in each document
df = pd.DataFrame(dominant_topics, columns=['Document_Id', 'Dominant_Topic'])
dominant_topic_in_each_doc = df.groupby('Dominant_Topic').size()
df_dominant_topic_in_each_doc = dominant_topic_in_each_doc.to_frame(name='count').reset_index()

#  Total topic distribution by actual weight
topic_weightage_by_doc = pd.DataFrame([dict(t) for t in topic_percentages])
df_topic_weightage_by_doc = topic_weightage_by_doc.sum().to_frame(name='count').reset_index()

#  Top 3 keywords for each topic
topic_top3words = [(i, topic) for i, topics in lda_model.show_topics(formatted=False)
                              for j, (topic, wt) in enumerate(topics) if j < 3]

Let's make two figures:

  1. The number of documents per topic, obtained by assigning each document to the topic carrying the largest weight in it.
  2. The number of documents per topic, obtained by summing the actual weight contribution of each topic to its respective documents.
from matplotlib.ticker import FuncFormatter

#  Plot
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

#  Topic distribution by dominant topics
ax1.bar(x='Dominant_Topic', height='count', data=df_dominant_topic_in_each_doc, width=.5)

#  Topic distribution by topic weights
ax2.bar(x='index', height='count', data=df_topic_weightage_by_doc, width=.5)
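FuncFormatter is what lets the x ticks show the top keywords instead of bare topic ids. A self-contained sketch with invented keyword strings and document counts:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Invented top-3 keyword strings per topic id
top3 = {0: "god, church, faith", 1: "hockey, game, team",
        2: "israel, arab, jew", 3: "bike, ride, engine"}

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(top3), [820, 610, 540, 391], width=0.5)  # invented document counts
ax.set_xticks(list(top3))

# Turn each tick position into "Topic N" plus its keywords
fmt = FuncFormatter(lambda x, pos: f"Topic {int(x)}\n{top3.get(int(x), '')}")
ax.xaxis.set_major_formatter(fmt)
fig.savefig("docs_per_topic.png")
```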


t-SNE (t-distributed stochastic neighbor embedding) clustering chart

Let's use the t-SNE (t-distributed stochastic neighbor embedding) algorithm to visualize the document clusters in 2D space.

import numpy as np
from sklearn.manifold import TSNE

#  Get topic weights and dominant topics ------------

#  Get topic weights
topic_weights = []
for i, row_list in enumerate(lda_model[corpus]):
    topic_weights.append([w for i, w in row_list[0]])

#  Array of topic weights
arr = pd.DataFrame(topic_weights).fillna(0).values

#  Keep the well-separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

#  Dominant topic number in each document
topic_num = np.argmax(arr, axis=1)

#  t-SNE dimensionality reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

#  Plot the topic clusters using Bokeh
n_topics = 4
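The Bokeh plotting code was cut off above. If Bokeh isn't available, the same cluster picture can be sketched with a plain matplotlib scatter; here random 2D points stand in for tsne_lda, and a synthetic topic_num provides the colors:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_topics = 4

# Synthetic stand-ins for topic_num (dominant topic per doc) and tsne_lda (2-D embedding)
topic_num = rng.integers(0, n_topics, 200)
tsne_lda = rng.normal(size=(200, 2)) + topic_num[:, None] * 3.0  # offset per topic

plt.figure(figsize=(6, 6))
plt.scatter(tsne_lda[:, 0], tsne_lda[:, 1], c=topic_num, cmap="tab10", s=10)
plt.title(f"t-SNE clustering of {n_topics} LDA topics")
plt.savefig("tsne_clusters.png")
```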


Finally, pyLDAVis is the most commonly used, and an excellent, way to visualize the information contained in a topic model.



We imported, cleaned and processed the newsgroups dataset from scratch to build an LDA model. We then saw a variety of ways to visualize the output of topic models, including word clouds, which intuitively tell you which keywords dominate each topic. t-SNE clustering and pyLDAVis provide further detail about the topic clusters.


Tags: Algorithm Machine Learning Deep Learning Data Mining

Posted on Fri, 19 Nov 2021 03:50:55 -0500 by mduran