2021 China University Big Data Challenge, Problem A: Full Walkthrough

First, sort out the key data according to the problem requirements.

Since the problem singles out three key indicators, the anomaly detection below must focus on them. In theory these three indicators are highly correlated, so strictly speaking anomaly detection should not be done on each one separately.

Although the problem only requires a few indicators, other indicators can also be used if you want to describe the operating status of the equipment. Whether to use them depends on the anomalies detected in the first question: compute the correlation between the changes in the detected abnormal data and each additional indicator. If the correlation is clear, include that indicator in the later analysis; if not, leave it out.
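
The correlation check described above can be sketched roughly as follows. This is only an illustration: the indicator names, the sample values, and the 0.6 cutoff are assumptions, not part of the problem data, and Pearson correlation is just one possible choice.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def select_indicators(key_series, candidates, threshold=0.6):
    """Keep candidates whose |r| with the key series reaches the threshold."""
    return [name for name, series in candidates.items()
            if abs(pearson(key_series, series)) >= threshold]

# Hypothetical data: one key indicator and two candidate equipment indicators
key = [10, 12, 30, 11, 9, 28]
candidates = {
    "cpu_load": [1.0, 1.1, 2.9, 1.0, 0.9, 2.7],   # moves with the key indicator
    "temperature": [20, 25, 21, 24, 20, 25],      # mostly unrelated
}
selected = select_indicators(key, candidates)
print(selected)
```

Only the indicators returned by `select_indicators` would be carried into the later analysis.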

The problem states that the data is periodic and also mentions anomalies, so treating this as a time series problem is clearly the right direction. Use an anomaly detection algorithm to find the abnormal data, then correct it. Because the three key indicators are strongly correlated, the anomaly detection here should look at the trends of all three together.

How to analyze the trend? First put the three indicators on a comparable scale. One method is to sort each indicator's values in ascending order, take the values at the 30% and 70% positions, and use them in place of the minimum and maximum in the min-max normalization formula. A second method is to normalize using the mean of the indicator data. After this processing the three series keep their original trends but have similar magnitudes. Then traverse the time points and compute the variance across the three indicators at each one. A point is flagged as abnormal when the variance exceeds a threshold, which you set yourself. Remember that each cell must be analyzed separately.
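
A minimal sketch of this idea for one cell, under my own reading of the two scaling methods: the percentile version uses (x - p30) / (p70 - p30), and the mean version divides each value by the series mean. The function names, the sample series, and the threshold of 1.0 are all assumptions for illustration.

```python
import statistics

def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def scale_by_percentiles(values, low_q=30, high_q=70):
    """Min-max style scaling with the 30% and 70% values as the bounds."""
    low, high = percentile(values, low_q), percentile(values, high_q)
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

def scale_by_mean(values):
    """Divide by the series mean (assumes the mean is not zero)."""
    mean = statistics.fmean(values)
    return [v / mean for v in values]

def divergence_flags(indicators, threshold, scale=scale_by_percentiles):
    """Flag time points where the scaled indicators disagree strongly."""
    scaled = [scale(series) for series in indicators]
    return [statistics.pvariance(point) > threshold for point in zip(*scaled)]

# Hypothetical cell: the third indicator breaks the common trend at index 4
a = [10, 20, 30, 40, 50, 60]
b = [100, 200, 300, 400, 500, 600]
c = [1, 2, 3, 4, 50, 6]
flags = divergence_flags([a, b, c], threshold=1.0)
print(flags)
```

Using the 30% and 70% values instead of the true minimum and maximum keeps a single extreme value from compressing the rest of the scaled series.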

Although the problem does not mention public opinion, user activity comes down to what people do online, so it must be related to entertainment, movies, COVID-19, stocks and futures, pork prices, and so on. This does not mean you should crawl comments from Weibo, Zhihu, East Money, and similar sites and run hot-word analysis; there is not enough time for that. Instead I recommend the Baidu Index, Weibo Index, Google Trends, and 360 Trends.

Here is a Baidu Index crawler I wrote earlier. It collects the daily search index, split into PC and mobile.

import requests

# Thumbnail endpoint from the original script (not used below)
word_url = 'http://index.baidu.com/api/SearchApi/thumbnail?area=0&word={}'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Cookie': '',  # use your own cookie
    'DNT': '1',
    'Host': 'index.baidu.com',
    'Pragma': 'no-cache',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://index.baidu.com/v2/main/index.html',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

words = [[{"name": "pork", "wordType": 1}]]
keyword = str(words).replace("'", '"')
start_date = '2021-08-01'  # example range only; replace with your own dates
end_date = '2021-09-25'
url = ('http://index.baidu.com/api/SearchApi/index'
       f'?area=0&word={keyword}&startDate={start_date}&endDate={end_date}')  # search index interface
resp = requests.get(url, headers=headers)
data = resp.json().get('data')  # the returned index values are encrypted and must be decoded
user_indexes = data.get('userIndexes')[0]
uniqid = data.get('uniqid')  # uniqid is needed to request the decoding key

ptbk_url = 'http://index.baidu.com/Interface/ptbk?uniqid={}'
resp = requests.get(ptbk_url.format(uniqid), headers=headers)
ptbk = resp.json().get('data')  # decoding key for this response

# Build the decoding dictionary: each character in the first half of ptbk
# maps to the character at the same position in the second half.
half = len(ptbk) // 2
decode_map = dict(zip(ptbk[:half], ptbk[half:]))


def decode(encrypted):
    """Decode character by character, join, then split on commas."""
    return ''.join(decode_map[ch] for ch in encrypted).split(',')


# Total index = PC search index + mobile search index
result1 = decode(user_indexes.get('all').get('data'))
print('Total index:', result1)
# PC search index
result2 = decode(user_indexes.get('pc').get('data'))
print('PC search index:', result2)
# Mobile search index
result3 = decode(user_indexes.get('wise').get('data'))
print('Mobile search index:', result3)

Epidemic data, both domestic and international, are also available.

So are stock market trends and timelines of major events.

Or, closer to everyday life, the price trends of pork and vegetables.

Later we need to make predictions. A time series alone cannot really predict the future trend of user activity, but combined with public opinion data the result may be more realistic. The same applies to correcting abnormal data. For example, build an indicator system for the three key indicators from the public opinion indicators found above, then use those indicators as training input and the key indicator data as output, and correct the detected anomalies with a machine learning method. How to make your work stand out is up to you; any reasonable algorithm will do.

Now the first question. Each cell must be analyzed separately, and we have already discussed how to approach anomaly detection. Do not simply throw a LOF algorithm at each indicator to find its outliers. Remember that the problem emphasizes three indicators and that they are strongly correlated, so look at the common trend of all three.

To get higher accuracy, the data can be smoothed slightly. Smoothing distorts the data, so do not overdo it: the abnormal segments should at least still stand out after smoothing. The specific parameters you need to explore yourself.

Think about anomalies in real life too. Leave aside 0:00 to 7:00 and take just 1:00 to 6:00: if values are high in this window, something is almost certainly wrong with the equipment. This window therefore needs separate analysis. Never take an anomaly detection algorithm you picked up online and apply it blindly; read the problem background first.

Next, Table 1. Choose a time length to decide whether an anomaly is an isolated point or part of an anomaly period: if another anomaly falls within that time length of it, the neighboring anomalies form an anomaly period; otherwise it is an isolated point. The magnitude of the abnormal values should also be checked: as shown in the figure below, the points in one period are basically at the same height.

Two methods are recommended for finding the period: the average period from the Fourier transform, and the time delay from chaos theory (common methods include the autocorrelation method, the mutual information method, and the average displacement method, all available in the MATLAB chaotic time series toolbox).
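
The split into isolated points and anomaly periods can be sketched like this. It assumes anomalies are given as integer time indices (for example hours since the start of the data) and that `max_gap` is the time length you chose; both names are mine, and the check on the magnitude of the values is left out.

```python
def group_anomalies(anomaly_times, max_gap):
    """Split anomaly time indices into isolated points and anomaly periods.

    Anomalies no more than max_gap apart are chained into one cluster.
    A cluster of one is an isolated point; a larger cluster is a period,
    reported as (first_time, last_time).
    """
    clusters = []
    for t in sorted(anomaly_times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # close enough to the previous anomaly
        else:
            clusters.append([t])     # start a new cluster
    isolated = [c[0] for c in clusters if len(c) == 1]
    periods = [(c[0], c[-1]) for c in clusters if len(c) > 1]
    return isolated, periods

# Hypothetical anomaly hours for one cell
isolated, periods = group_anomalies([3, 40, 41, 43, 90], max_gap=3)
print(isolated, periods)
```

A stricter version could also require the values inside one period to be at a similar height before merging them.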

function T_mean = period_mean_fft(data)
% Compute the average period of a sequence with the fast Fourier transform (FFT)
% data:   time series
% T_mean: average period of the sequence, estimated from the FFT
Y = fft(data);                 % FFT of the series
N = length(Y);                 % length of the transformed data
Y(1) = [];                     % drop the first element, which is the sum of all data
half = floor(N/2);             % number of frequency bins kept (also works for odd N)
power = abs(Y(1:half)).^2;     % power spectrum
nyquist = 1/2;
freq = (1:half)/(N/2)*nyquist; % frequency of each bin, in cycles per sample
figure
plot(freq, power); grid on     % draw the power spectrum
title('Power spectrum')
period = 1./freq;              % period of each bin, in samples
figure                         % new figure so the first plot is not overwritten
plot(period, power); grid on   % draw power against period
title('Periodic power spectrum')
[mp, index] = max(power);      % index of the highest spectral line
T_mean = period(index);        % average period at that index
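
For the time delay, here is a rough Python sketch of the autocorrelation method mentioned above. It is not the toolbox implementation. It assumes one common convention, taking the delay as the first lag at which the autocorrelation falls below 1 - 1/e, and the sample series is synthetic.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of the mean-centered series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0  # constant series
    num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return num / denom

def delay_by_autocorrelation(series, max_lag=None):
    """First lag where the autocorrelation drops below 1 - 1/e, else None."""
    threshold = 1 - 1 / math.e
    if max_lag is None:
        max_lag = len(series) // 2
    for lag in range(1, max_lag + 1):
        if autocorrelation(series, lag) < threshold:
            return lag
    return None

# Synthetic hourly series with a 24-hour cycle, 10 days long
hourly = [math.sin(2 * math.pi * t / 24) for t in range(240)]
delay = delay_by_autocorrelation(hourly)
print(delay)
```

The mutual information and average displacement methods follow the same pattern of scanning lags, only with a different criterion.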

That nearly covers the first question: determine the period parameter, then detect and handle the anomalies. Now the second question.

Note that each cell is analyzed separately. Report anomalies where they exist; where there are none, do not manufacture them.

If you collected public opinion data as described above, this question becomes easier. Why did I say earlier that equipment status data can be used? Because equipment operation is mainly affected by weather and load. Since the organizers said region need not be considered, weather can be ignored. You can analyze the correlation between the equipment parameters and the three key indicators; any equipment status indicator with high correlation can be considered in this question. The load of the base station can also be considered, for example the activity data (the three key indicators) of all users within the service range of the same base station.

The second question is anomaly prediction. From the historical data, take the three key indicators, some public opinion indicators, and the overall values of the three key indicators for the cell's base station as inputs; the output is 1 for an abnormal point and 0 for a normal one. This gives a binary classification model.

Why bring in the base station? A base station serves multiple cells. Even if one cell's data looks normal, there is no guarantee that high overload in other cells will not cause anomalies, such as network fluctuations. When solving problems, always tie the analysis to real life: besides getting good results from the algorithm, the logic should hold together.

The F value (F-measure) is used to evaluate the model.

Precision = samples correctly predicted as positive / all samples predicted as positive = TP / (TP + FP)
Recall = samples correctly predicted as positive / all truly positive samples = TP / (TP + FN)
F value = 2 * Precision * Recall / (Precision + Recall) (the F value is the harmonic mean of precision and recall)
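
These three quantities are easy to compute by hand. The sketch below assumes labels are coded as 1 for abnormal and 0 for normal, as in the classification setup above; the sample labels are made up.

```python
def precision_recall_f(y_true, y_pred):
    """Return (precision, recall, F value) for binary labels, 1 = abnormal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f_value = precision_recall_f(y_true, y_pred)
print(precision, recall, f_value)
```

Because anomalies are rare, plain accuracy would look high even for a model that never predicts an anomaly, which is why precision and recall are the more useful measures here.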

The third question is prediction: forecast the three key indicators after the anomalies have been handled. Again each cell must be analyzed separately. The period from the first question is needed here: in chaotic time series methods, the time delay and the period are input parameters of the algorithm, which ties in neatly with the first question.

Start with a chaotic time series prediction (RBF neural network one-step or multi-step prediction, Volterra series one-step or multi-step prediction, and so on). This can only predict the time series itself. Then, using the public opinion indicators for September 26 to 28, make a second prediction for that period following the approach of the second question.

Why two predictions? The first mainly captures the periodicity of the data; the second predicts actual user activity during those three days. Next the two results need to be combined. The simplest method is to average them. A more rigorous method is to make a copy of each result and smooth both copies; the smoothing can be heavier here, since only the two main trends need to show. Subtract the smoothed time series prediction from the smoothed public opinion prediction to get the change caused by public opinion relative to the periodic data, then add that change to the time series prediction. If public opinion is not considered, a plain time series or chaotic time series prediction can also give good results.
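
The more rigorous combination can be sketched as below. The moving average and the window size are my own choices for the smoothing step, and the two forecast lists are invented; any smoother that leaves only the main trend would fit the description.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def combine_forecasts(periodic, opinion_based, window=5):
    """Shift the periodic forecast by the smoothed public opinion effect."""
    smooth_periodic = moving_average(periodic, window)
    smooth_opinion = moving_average(opinion_based, window)
    # Change of the opinion-based trend relative to the periodic trend
    change = [o - p for o, p in zip(smooth_opinion, smooth_periodic)]
    return [x + c for x, c in zip(periodic, change)]

# Hypothetical forecasts: the opinion-based one sits 5 units higher
periodic = [10, 20, 10, 20, 10, 20]
opinion_based = [v + 5 for v in periodic]
combined = combine_forecasts(periodic, opinion_based)
print(combined)
```

The result keeps the fine periodic shape of the time series forecast and takes only the overall level from the public opinion forecast.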


Tags: Big Data Algorithm Data Analysis

Posted on Fri, 29 Oct 2021 14:10:15 -0400 by cyberdwarf