Using Python and AI to Automatically Categorize Content

Original Link: https://my.oschina.net/u/165676/blog/1836301

In the second half of 2017 my work involved artificial intelligence, so I studied it for a short while. This article presents only a preliminary research result and is just a starting point.

Previous articles described web crawling, which is in fact closely related to AI: before intelligent analysis can happen, AI needs data to model, so crawling can improve the efficiency and accuracy of AI processing.

Here is the business requirement: suppose you need to automatically categorize the disease-related questions that users ask into departments such as respiratory medicine, cardiology, gastroenterology, and so on.

The processing steps are:
1. Crawl pre-categorized questions from some medical websites
2. Train a classifier on these questions
3. Verify the recognition quality by entering a new disease question

1. Data crawling
This example uses data from the Ask a Doctor website (https://www.jiankang.com); each question is crawled into a separate file.
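The crawler itself is not included in the original post. Below is a minimal sketch (not the author's code) of how each crawled question could be written to disk in the one-file-per-question, one-directory-per-category layout that sklearn's load_files expects later; the save_question helper, the question id, and the sample text are hypothetical.

import os

def save_question(base_dir, category, question_id, text):
    # Write one crawled question to base_dir/<category>/<question_id>.txt,
    # the directory layout that sklearn.datasets.load_files reads.
    target_dir = os.path.join(base_dir, category)
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, str(question_id) + '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

# Hypothetical usage; the category name must match the ones passed to load_files below.
save_question('category/train', 'Respiratory Medicine', '0001',
              'I have had a persistent cough and shortness of breath for two weeks...')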

2. Data Processing Code

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nerutils import *  # author's custom module; ner() segments Chinese text into tokens
from sklearn.linear_model import SGDClassifier

# Select the text categories that take part in the analysis
categories = ['Respiratory Medicine', 'Internal Medicine-Cardiovascular Department', 'GI Medicine']

train_path = 'category/train'

# Load the raw training data from disk (one sub-directory per category)
twenty_train = load_files(train_path, categories=categories, load_content=True,
                          encoding='utf-8',
                          decode_error='strict', shuffle=True, random_state=42)

# Count word occurrences (not used directly; the Pipeline below builds its own vectorizer)
count_vect = CountVectorizer()

# Segment each question into space-separated tokens so the vectorizer can count them
for index in range(len(twenty_train.data)):
    twenty_train.data[index] = ' '.join(ner(twenty_train.data[index]))

from sklearn.pipeline import Pipeline
# Set up the pipeline: bag-of-words -> TF-IDF -> linear SVM trained with SGD
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                           max_iter=5,  # n_iter=5 in older scikit-learn
                                           random_state=42)),
                     ])

# Train the classifier
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
# Print classifier information
print(text_clf)

# Read the test data
categories = ['Respiratory Medicine']

test_path = 'category/test'

test_train = load_files(test_path, categories=categories, load_content=True,
                        encoding='utf-8',
                        decode_error='strict', shuffle=True, random_state=42)

for index in range(len(test_train.data)):
    test_train.data[index] = ' '.join(ner(test_train.data[index]))

# All test samples belong to the single test category, so label them all 0
test_train.target = [0] * len(test_train.target)

docs_test = test_train.data

# Predict categories for the test data
predicted = text_clf.predict(docs_test)
print("Classified data:" + str(predicted))
score = text_clf.score(docs_test, test_train.target)  # mean accuracy, same as computed below

# Calculate the prediction accuracy
import numpy as np
print("Accuracy:")
print(np.mean(predicted == test_train.target) * 100)

Below is the test output; the accuracy is 100%, which was unexpected!

Classified data: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
Accuracy:
100.0
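
Step 3 of the workflow above (verifying the result for a single question typed by a user) is not shown in the original code. Here is a minimal sketch that assumes the trained text_clf pipeline, the twenty_train data set, and the ner() helper from the training code are still in scope, and uses a made-up example question.

# Classify one user-entered question with the trained pipeline.
question = 'I have been coughing and wheezing for a week, what should I do?'
tokens = ' '.join(ner(question))  # apply the same segmentation as the training data
predicted_label = text_clf.predict([tokens])[0]
print('Predicted department:', twenty_train.target_names[predicted_label])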

Because this work only lasted about a month, no further applications followed, but from my own industry experience AI can genuinely complement many areas. This kind of classification alone has many business uses. For example, an automotive research project that needs to collect vehicle information from various websites and then categorize it could have AI do the pre-categorization, followed by BI analysis by displacement, quality, engine, and so on.

You are welcome to join the discussion on further applications.
