Recently I have been studying DNA methylation and deep learning for my school thesis (thesis GitHub address). While learning deep learning I found that I had forgotten much of my machine-learning knowledge, so I wrote this series of articles mainly as notes.
Preface
DNA methylation is a form of chemical modification of DNA that can change gene expression without changing the DNA sequence. It refers to the covalent binding of a methyl group to the carbon-5 position of cytosine in a CpG dinucleotide of the genome, under the action of DNA methyltransferase. Simply put, a methyl group is added to the C base at a CpG site. A CpG site is a site on a single strand consisting of a C base, a phosphate group (p), and a G base.
1. Methods for splitting a data set
- Hold-out method
- Cross-validation
- Bootstrap method
2. Principles
1. Hold-out method
The hold-out method simply divides the data set into two mutually exclusive sets. When splitting, we need to keep the data distribution of the two sets consistent, so stratified sampling is generally used; the random split is repeated several times and the average of the results is reported.
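The steps above (stratified split, repeated with different random seeds, results averaged) can be sketched with scikit-learn. The toy data and the dummy "always predict class 0" scorer are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with a 70/30 class ratio (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 70 + [1] * 30)

accuracies = []
for seed in range(10):  # repeat the split with different random seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # stratify=y keeps the 70/30 class ratio in both subsets
    acc = (y_te == 0).mean()  # dummy "always predict class 0" scorer
    accuracies.append(acc)

print(np.mean(accuracies))  # average of the repeated hold-out results
```

Because of `stratify=y`, each 30-sample test set keeps exactly 21 class-0 and 9 class-1 samples, so the dummy scorer reports 0.7 on every split.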
2. Cross validation
We divide the data set $D$ into $k$ mutually exclusive subsets of the same size, each of which preserves the data distribution as much as possible. The union of $k-1$ subsets is then used as the training set and the remaining subset as the test set, yielding $k$ training results. When $k$ equals the number of samples, this becomes the leave-one-out method, which is often closer to the true result but has a high time cost.
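For the $k = $ number-of-samples case, scikit-learn provides `LeaveOneOut` directly; a minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(5, 2)      # 5 toy samples (assumed)
y = np.array([0, 1, 0, 1, 0])

loo = LeaveOneOut()
n_rounds = 0
for train_index, test_index in loo.split(X):
    # each round holds out exactly one sample as the test set
    n_rounds += 1

print(n_rounds)  # one training round per sample
```

With 5 samples this produces 5 rounds, which is why leave-one-out becomes expensive on large data sets.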
3. Bootstrap method
We randomly draw a sample from the data set $D$, copy it into $D^{test}$, and then put the sample back into $D$; this process is repeated $m$ times. As a result, some samples appear repeatedly in $D^{test}$ while others never appear at all. This method is mainly used when the data set is small and it is hard to split it into a training set and a test set.
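A quick numerical sketch of the bootstrap on assumed toy data: with $m$ draws from $m$ samples, a given sample is never drawn with probability $(1 - 1/m)^m \approx 1/e \approx 36.8\%$, so roughly a third of the data never appears in the bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000                 # data set size (assumed for illustration)
D = np.arange(m)

# Draw m samples from D with replacement (the bootstrap sample)
D_boot = rng.choice(D, size=m, replace=True)

never_drawn = np.setdiff1d(D, D_boot)  # samples that never appeared
print(len(never_drawn) / m)  # close to 1/e ≈ 0.368
```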
3. Code
Import related libraries
```python
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import resample
import pandas as pd
import h5py as h5
import numpy as np
import random
```
Read the data file and get the name of the data
```python
source_data = h5.File(r"C:\Users\***\Desktop\c1_000000-001000.h5", 'r')
names = list(source_data['inputs']['cpg'].keys())
```
Construct a container for storing data
```python
column = []
for i in range(50):
    name = 'state%s' % i
    column.append(name)
for i in range(50):
    name = 'dist%s' % i
    column.append(name)
df_empty = pd.DataFrame(np.random.randn(5000, len(column)), columns=column)
```
Read data
```python
features = ['dist', 'state']
for i, j in enumerate(names):
    features1 = np.array(source_data['inputs']['cpg'][j]['state'])
    features2 = np.array(source_data['inputs']['cpg'][j]['dist'])
    row = list(range(i*1000, i*1000+1000))
    df_empty.loc[row, column[0:50]] = np.array(features1)
    df_empty.loc[row, column[50:100]] = np.array(features2)
```
Read label
```python
df_empty_label = pd.DataFrame()
for i, j in enumerate(names):
    label = source_data['outputs']['cpg'][j]
    df_empty_label[j] = label
```
Fill in missing values with random methylation status
```python
# Replace missing labels (marked -1) with a random methylation state (0 or 1).
# Note: assigning through df.values may write to a copy, so index with .iloc.
for i in range(df_empty_label.shape[0]):
    for j in range(df_empty_label.shape[1]):
        if df_empty_label.iloc[i, j] == -1:
            df_empty_label.iloc[i, j] = random.choice([0, 1])
```
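The double loop above can be replaced with a vectorized version using `DataFrame.mask`, which is much faster on large frames. This is a sketch on an assumed toy label frame where -1 marks missing values:

```python
import numpy as np
import pandas as pd

# Toy label frame with -1 marking missing values (assumed for illustration)
labels = pd.DataFrame({'a': [0, -1, 1], 'b': [-1, 1, 0]})

rng = np.random.default_rng(0)
# Random 0/1 states with the same shape, index, and columns as the labels
random_states = pd.DataFrame(rng.integers(0, 2, size=labels.shape),
                             index=labels.index, columns=labels.columns)
# mask() replaces entries where the condition is True, leaving the rest intact
labels = labels.mask(labels == -1, random_states)

print((labels == -1).values.any())  # False: no missing labels remain
```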
Split the data set with the hold-out method
train_test_split()
Parameters
- X: the data to be split
- y: the labels corresponding to the data
- test_size: proportion of the test set
- train_size: proportion of the training set
- random_state: random seed
```python
x = list(df_empty_label)
y = []
for i in x:
    y.extend(df_empty_label[i].values.tolist())
X = df_empty.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)
print(len(X_train) / (len(X_test) + len(X_train)))
# <<< 0.7
```
KFold
Parameters
- n_splits: the value of k for k-fold cross-validation
- shuffle: whether to shuffle the data before splitting
- random_state: random seed
As above, split the data set with cross-validation
```python
kf = KFold(n_splits=10)
y = np.array(y)
count = 0
for train_index, test_index in kf.split(X):
    print('Cross-validation round {}'.format(count + 1))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    count = count + 1
# <<< Cross-validation round 1
# <<< ...
# <<< Cross-validation round 10
```
Bootstrap method
```python
# Draw 1000 bootstrap samples, one at a time, with replacement
D1_train = []
y1_train = []
for i in range(1000):
    X1, y1 = resample(X, y, n_samples=1)
    D1_train.append(X1)
    y1_train.append(y1)
np.unique(D1_train)
np.unique(y1_train)
```
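The samples that never appear in the bootstrap draw can then serve as the test set (out-of-bag evaluation). A sketch on assumed toy data, drawing all indices in a single `resample` call, which is equivalent to the one-at-a-time loop but faster:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)   # toy data (assumed for illustration)
y = np.arange(10) % 2

indices = np.arange(len(X))
# Draw len(X) indices with replacement: the bootstrap (training) sample
boot_idx = resample(indices, n_samples=len(X), replace=True, random_state=0)
# Indices never drawn form the out-of-bag test set
oob_idx = np.setdiff1d(indices, boot_idx)

X_train, y_train = X[boot_idx], y[boot_idx]
X_test, y_test = X[oob_idx], y[oob_idx]
print(len(X_train), len(X_test))
```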
Summary
If you find any problems, I hope you can help me point them out.