Guangdong Province and eight other provinces held their first "new college entrance examination" this year. The change is really new soup with the same old medicine: the total score is still 750, and subjects are chosen under a "3 + 1 + 2" scheme. That is an examination-subject tuple made of three compulsory subjects (Chinese, mathematics, and a foreign language; note it says foreign language, not English, English being just one of the options), one first-choice subject of either physics or history, and two re-selected subjects out of politics, geography, biology, and chemistry. This visualization takes the physics track corresponding to the "1" (i.e., "physics candidates") as the score-distribution sample. First, I found online the "one-point-one-segment table" (score segment table) of physics candidates in the 2021 national college entrance examination in Guangdong Province, and abstracted the data into a relational model for the discussion below.
In this relational model, G is the score, Gr is the real score value, and Q is the number of candidates. The relation is in at least third normal form, so the chance of error during analysis is low. Since that risk is low, there is no need for a second round of filtering with System.Linq in C#/.NET; Python alone is enough for the analysis.
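For concreteness, the segment table file (read later as test_guangdong2021.csv) is a comma-separated text file whose three columns the code below reads as G, Q, and Gr in that order. A few made-up rows, purely to illustrate the layout (these numbers are not real data):

0,12,700
1,15,699
2,21,698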
import numpy
import matplotlib.pyplot

# let the axes display Chinese characters
matplotlib.pyplot.rcParams['font.family'] = 'sans-serif'
matplotlib.pyplot.rcParams['font.sans-serif'] = 'SimHei'
Those are the libraries used; needless to say, the pyplot module does the visualization.
We have the table file, but how are scores and head counts actually distributed? First, draw a scatter plot of the table:
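The scatter snippet itself is not reproduced in the post; a minimal sketch, assuming the column layout illustrated above, might look like this:

import matplotlib.pyplot

x_list, y_list = [], []
with open("test_guangdong2021.csv", encoding="utf-8") as file:
    for line in file:
        s = line.split(",")
        x_list.append(float(s[0]))  # score column
        y_list.append(float(s[1]))  # head-count column

# scatter of head count against score
matplotlib.pyplot.scatter(x_list, y_list, s=4)
matplotlib.pyplot.xlabel('score')
matplotlib.pyplot.ylabel('number of candidates')
matplotlib.pyplot.show()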
Take the part of the figure to the right of abscissa 500 and divide the y-axis values (the head-count axis) by 337,988, and you get a curve that looks like a normal distribution. Appearance alone cannot establish that the scores are normally distributed, so the chi-square and F distributions have to be considered and ruled out as well. The peaks of the chi-square and F densities sit at the hundredths order of magnitude, while the largest relative frequency in the score table is only at the thousandths order, so the normal distribution is the better fit. Hence a function for the sample mean and variance, where the mean is the count-weighted average Σ(score × count) / Σ(count):
# grade_file is the path to the segment table; this block is encapsulated below
file = open(grade_file, encoding="utf-8")
x_list = []      # first column (G)
y_list = []      # head count at each score (Q)
grade_list = []  # real scores (Gr); grade_list[0] is the highest score
for line in file:
    s = line.split(",")
    x_list.append(float(s[0]))
    y_list.append(float(s[1]))
    grade_list.append(float(s[2]))
average = 0
length = len(x_list)
y_sum = sum(y_list)
# head-count-weighted mean of the first column
for i in range(0, length):
    average += x_list[i] * y_list[i] / y_sum
# actual mean score, recovered as highest score minus the weighted mean
real_average = grade_list[0] - average
# variance over the distinct score values (unweighted by head count)
difference = 0
for i in x_list:
    difference += (i - average) ** 2 / length
# squared error between observed frequencies and the fitted normal density
error = 0
for i in range(0, length):
    error += (y_list[i] / y_sum
              - numpy.exp(-(x_list[i] - average) ** 2 / (2 * difference))
              / (numpy.sqrt(2 * numpy.pi * difference))) ** 2
file.close()
Here real_average and average differ: the former is the actual mean score, while the latter is the raw mean accumulated when reading the file; grade_list[0] is the highest score. Encapsulated:
def analysing_grade(grade_file):
    file = open(grade_file, encoding="utf-8")
    x_list = []
    y_list = []
    grade_list = []
    for line in file:
        s = line.split(",")
        x_list.append(float(s[0]))
        y_list.append(float(s[1]))
        grade_list.append(float(s[2]))
    average = 0
    length = len(x_list)
    y_sum = sum(y_list)
    for i in range(0, length):
        average += x_list[i] * y_list[i] / y_sum
    real_average = grade_list[0] - average
    difference = 0
    for i in x_list:
        difference += (i - average) ** 2 / length
    error = 0
    for i in range(0, length):
        error += (y_list[i] / y_sum
                  - numpy.exp(-(x_list[i] - average) ** 2 / (2 * difference))
                  / (numpy.sqrt(2 * numpy.pi * difference))) ** 2
    file.close()
    return real_average, difference, error, y_sum, x_list, y_list, grade_list
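As a quick sanity check (a sketch, not from the original post), the function can be called and its first three return values printed:

result = analysing_grade("test_guangdong2021.csv")
print("mean score:", result[0])
print("variance:", result[1])
print("squared fitting error:", result[2])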
The final result is:
We can see that Guangdong is not that competitive ("内卷") at all. Honestly, for a college entrance exam in the Pearl River Delta I don't believe this data one bit, but the normal distribution is what it is, so I accept it. Back to business: with the distribution in hand, the second function needs its data, namely the 2021 table of each university's admission cutoff (minimum admitted score) for physics candidates in Guangdong. I found it online and turned it into a table. Because admission units in Guangdong in 2021 are called "professional groups" (major groups), a school typically has several or even a dozen of them; for example, Jilin University set up 12 professional groups when recruiting in Guangdong this year, and Suzhou University set up 9. Over each school's professional groups I took the average of the minimum scores weighted by enrollment numbers, a second round of weighting: Jilin University comes out at 612 and Suzhou University at 607. I put these averages into a new table and abstracted a relation schema from the relation instance.
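A minimal sketch of that second weighting step (the scores and enrollment numbers below are hypothetical, not the real Jilin University figures):

# each tuple: (minimum admitted score of one professional group, students enrolled)
groups = [(615, 40), (610, 55), (605, 30)]

total_enrolled = sum(n for _, n in groups)
# enrollment-weighted average of the groups' minimum scores
weighted_average = sum(score * n for score, n in groups) / total_enrolled
print(round(weighted_average))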
In this relation schema, the first attribute is the university and the second is the enrollment-weighted average of its professional groups' scores. The original table's relation is only in second normal form, and processing it directly would require stacking and querying; that is where the difficulty lies, so I weighted the professional-group scores first and then rebuilt the cutoff table. The function built around this table layout is as follows:
def plot_normal_distribution_with_school(average, difference, school_file,
                                         initial_person_list, initial_grade_list):
    matplotlib.pyplot.figure(figsize=(32, 24))
    x = numpy.arange(550, 700, 10 ** (-2))
    # set border
    border = matplotlib.pyplot.gca()
    border.spines['right'].set_color('none')
    border.spines['top'].set_color('none')
    # plot the fitted normal density over the top segment of scores
    matplotlib.pyplot.xlim(550, 700)
    matplotlib.pyplot.ylim(0.001, 0.0019)
    matplotlib.pyplot.plot(x, numpy.exp(-(x - average) ** 2 / (2 * difference))
                           / (numpy.sqrt(2 * numpy.pi * difference)),
                           linewidth=1.5,
                           label='$f(x)=N({},{})$'.format(average, difference))
    matplotlib.pyplot.xlabel('x', fontsize=24)
    matplotlib.pyplot.ylabel('y', fontsize=24)
    matplotlib.pyplot.legend(prop={'size': 24})
    # read the school/cutoff file
    school = open(school_file, encoding="utf-8")
    school_list = []
    grade_list = []
    for line in school:
        ls = line.split(",")
        school_list.append(ls[0])
        grade_list.append(int(ls[1]))
    school.close()
    # mark every school at its cutoff score on the x-axis
    matplotlib.pyplot.xticks(grade_list, school_list, rotation=270)
    grade_probability_list = []
    person_probability_list = []
    for item in grade_list:
        # density of the fitted normal at this cutoff score
        grade_probability_list.append(numpy.exp(-(item - average) ** 2 / (2 * difference))
                                      / (numpy.sqrt(2 * numpy.pi * difference)))
        # cumulative head count above this score, i.e. the rank of the cutoff
        for i in range(0, len(initial_grade_list)):
            if initial_grade_list[i] == item:
                person_probability_list.append(sum(initial_person_list[:i]))
                break
    matplotlib.pyplot.yticks(grade_probability_list, person_probability_list)
    # drop guide lines from each school's point to both axes
    for i in range(0, len(grade_probability_list)):
        matplotlib.pyplot.plot([grade_list[i], grade_list[i]],
                               [0, grade_probability_list[i]], '*-')
        matplotlib.pyplot.plot([0, grade_list[i]],
                               [grade_probability_list[i], grade_probability_list[i]], '*-')
    matplotlib.pyplot.show()
You can also see from the function that I am only skimming the top of the distribution (third line: x ranges only over 550 to 700). Running the two functions together looks like this:
tuple1 = analysing_grade("test_guangdong2021.csv")
# tuple1[5] is the head-count list and tuple1[6] the real-score list,
# matching the initial_person_list and initial_grade_list parameters
plot_normal_distribution_with_school(tuple1[0], tuple1[1],
                                     "accept_guangdong_2021.csv",
                                     tuple1[5], tuple1[6])
The final plot is as follows (the image is 3200 pixels wide and 2400 tall):
On the x-axis of the figure, the darker a tick label appears, the more universities (possibly four or five) admit students at that exact score, since their labels overprint one another. There is one more picture:
The point of this picture is to show that Guangdong is, after all, still competitive.
If I have time, I will collect more data and draw the same charts for a few other provinces, to see how this situation comes about.