The new epidemic has changed the fate of millions of people, including me

In the face of such a brutal natural disaster, I have made a prediction model, hoping to contribute a little to the natural disaster and let it pass early

There is a good song:

Let me be sad, let me regret, let the epidemic pass without trouble, let me be bitter, let me be tired, let me see everyone's smile

Now, that's the end of the crap, let's get to the point:

Before that, there was a prediction project on github:

https://github.com/YiranJing/Coronavirus-Epidemic-2019-nCov

So I down ed his source code and looked at it carefully

The author of this project has implemented his own SIR and SEIR model

(two classical infectious disease models: https://baike.baidu.com/item/%E4%BC%A0%E6%9F%93%E7%97%85%E6%A8%A1%E5%9E%8B/5130035?fr=aladdin)

The key part of the model is to define a distribution function:

`def Binomial_log_like(n: int, p: float, k: int): loglik = lgamma(n+k+1) - lgamma(k+1) - lgamma(n+1) + k*np.log(p) + n*np.log(1-p)`

The above formula defines a hypothetical formula for the trend of the number of people infected by the epidemic

As the name suggests, he regards the curve of the number of people infected as a binary logarithm distribution

Is it appropriate to build models using SIR and SEIR?

In fact, the factors of isolation measures are not considered in the dynamic models of these two infectious diseases

Both are based on natural communication and natural rehabilitation, so if we want to make more objective prediction, we need to consider the impact of human isolation activities

Does the actual distribution of epidemic data meet this distribution? Let's take out the historical data:

Trend of new cases every day in China

As can be seen from the above figure, the curve of new cases is more similar to gamma distribution in general

The incomplete part of the curve, it is speculated, is due to the strong isolation measures taken by the relevant government departments, resulting in the interruption of virus transmission

Therefore, from a macro point of view, the curve of new cases is the result of continuous confrontation between two forces:

Number of new infections = natural transmission force of virus - human blocking friction force

If you have to make an analogy with a physical phenomenon, it's like stepping on the brake for a snowball rolling down

Suppose that the human blocking force is a constant after the closure of 23 (it has reached the highest level of blocking that human can achieve)

In fact, we can approximate the curve after the 23rd with gamma curve

So here I customize a control function:

`def control_curve(x, i0, missing, latency=7, infection_num=2): """ //Infection curve, assuming isolation after incubation period //Number of people infected today = (actual number of people infected yesterday - number of people quarantined yesterday) * average number of people infected //Number of quarantined people yesterday = actual number of infected people seven days ago //Number of people infected today = (actual number of people infected yesterday - actual number of people infected seven days ago) * average number of people infected :param x: Infection time :param i0: Number of initial infections :param missing: Missing rate :param latency: Incubation period (specifically, this represents the average time of onset, assuming that it is infectious every day during the onset) :param infection_num: Average number of infected persons """ if x >= latency: isolated = i0 * (infection_num ** (x - latency)) else: isolated = 0 day_num = i0 * missing * (infection_num ** x) - isolated if day_num < 0: day_num = 0 return day_num`

There are four unknown parameters in this function. We can use the optimizer of scipy to solve the optimal value automatically:

`from scipy.optimize import curve_fit def control_curve_list(x, i0, missing, latency, infection_num): y = list(map(control_curve, x, [i0] * len(x), [missing] * len(x), [latency] * len(x), [infection_num] * len(x))) return y popt,pcov = curve_fit(f=control_curve_list, xdata=x_data[:-1], ydata=y_data[:-1])`

Through curve fit function, we can get a group of approximate optimal solutions quickly

After fitting out, look at the effect:

120 day results predicted by the model with optimal parameters

It looks very smooth. I'm afraid the actual data will be much rougher than this

So does the prediction match the reality?

It seems that the variance is still large. Is there any way to get a more accurate prediction?

We can boldly assume that the strength of human isolation may be different in every stage and region

Parameter values may vary slightly for each period

Therefore, we will continue to explore the optimal parameters with the Bayesian optimizer

JS divergence is introduced here to facilitate the comparison of results by Bayesian optimizer

`from bayes_opt import BayesianOptimization def JS_divergence(p,q): M=(p+q)/2 return 0.5*scipy.stats.entropy(p, M)+0.5*scipy.stats.entropy(q, M)`

Build another objective function to return the comparable score

`def result_function(nu, close_contacts, k, I0, E0, T, death_rate): N = 14 * (10 ** 8) econ = len(t) R0 = I0 * (death_rate) * 2 try: Est = Estimate_parameter(nu=nu, k=close_contacts, t=t, I=I) eso = Estimate_Wuhan_Outbreak(Est, k=k, N=N, E0=E0, I0=I0, R0=R0, T=T, econ=econ) result = eso._run_SIER('Forecat', 'China', 'Days after 2019-12-1', death_rate=death_rate, show_Plot=False) score = JS_divergence(I,result['Infected']) except Exception as e: score = 999 print(e) try: score = float(score) except: score = 999 if score > 999: score = 999 return -score`

Then, set some search intervals manually according to the previous optimal parameters:

`rf_bo = BayesianOptimization( result_function, {'nu': (1/20, 1/10), 'close_contacts': (5, 15), 'k': (1, 3), 'I0': (1, 10000), 'E0': (0,10000), 'T':(3,14), 'death_rate':(0.2,0.3) } )`

Finally, run our optimizer and let it find out for itself

`rf_bo.maximize(init_points=5, n_iter=1000,)`

Finally, we get a relatively reliable result:

If we use this parameter to predict, the gap between the actual result and the prediction is not large

After the whole set of logic is written, the process can also be written as a pipeline, which can be fitted automatically every day

In this way, we can constantly approach the most objective results!

These are the ideas I share. If there is something wrong, please comment more!