Study diary-1

Recently I have been studying DNA methylation and deep learning at school (thesis GitHub address). While learning deep learning, I found that I had forgotten a lot of machine learning basics, so I am writing this series of articles mainly as notes.


DNA methylation is a chemical modification of DNA that can change gene expression without changing the DNA sequence. It refers to the covalent attachment of a methyl group to the 5-carbon position of the cytosine in a CpG dinucleotide, catalyzed by DNA methyltransferase. Simply put, a methyl group is added to the C base at a CpG site. A CpG site consists of a C base, a phosphate group (p) and a G base on a single strand.

1, Methods for partitioning data sets

  • Hold-out method
  • Cross validation
  • Bootstrap method

2, Principle

1. Hold-out method

The hold-out method simply divides the data set into two mutually exclusive sets, one for training and one for testing. The split should keep the data distribution consistent between the two sets, so stratified sampling is generally used. Because a single split can be unrepresentative, the random split is usually repeated several times and the average of the results is reported.
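The idea of a stratified split repeated several times can be sketched with the standard library alone. This is a minimal illustration, not the code used later in this post: `stratified_holdout`, `toy_labels` and the 70/30 class mix are made-up names and values.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same proportion of every class for the test set.
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 70 samples of class 0 and 30 samples of class 1.
toy_labels = [0] * 70 + [1] * 30

# Repeat the random split with different seeds and average a statistic,
# here the share of class 1 in the test set.
ratios = []
for seed in range(5):
    train_idx, test_idx = stratified_holdout(toy_labels, test_ratio=0.3, seed=seed)
    ratios.append(sum(toy_labels[i] for i in test_idx) / len(test_idx))
mean_ratio = sum(ratios) / len(ratios)
print(len(train_idx), len(test_idx), mean_ratio)
```

In a real experiment the averaged quantity would be the model's score on each test set rather than a class ratio.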

2. Cross validation

The data set $D$ is divided into $k$ mutually exclusive subsets of similar size, each keeping the data distribution as consistent as possible. In each round, the union of $k-1$ subsets is used as the training set and the remaining subset as the test set, which gives $k$ training/test results. When $k$ equals the number of samples, this becomes the leave-one-out method, which is often closer to the true result but has a high time cost.
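How the $k$ folds are formed can be sketched in a few lines of plain Python. This is only an illustration of the principle: `k_fold_indices` is a hypothetical helper, and for brevity it shuffles the indices but does not stratify them.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # The training set is the union of the other k - 1 folds.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 5-fold split of 10 samples: every sample is tested exactly once.
splits = list(k_fold_indices(n_samples=10, k=5))

# With k equal to the number of samples this is leave-one-out.
loo_splits = list(k_fold_indices(n_samples=10, k=10))
print(len(splits), len(loo_splits))
```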

3. Bootstrap method

Randomly select a sample from the data set $D$, copy it into $D^{test}$, and put the sample back into $D$. Repeating this process $m$ times means that some samples appear in $D^{test}$ more than once, while others do not appear at all. This method is mainly used when the data set is small and it is difficult to divide it into a training set and a test set.
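Sampling with replacement can be sketched as follows. The helper `bootstrap_sample` and the names `drawn` and `never_drawn` are illustrative only. When $m$ equals the size of $D$, the probability that a given sample is never drawn is $(1 - 1/m)^m$, which approaches $1/e \approx 0.368$ as $m$ grows.

```python
import random

def bootstrap_sample(n_samples, m, seed=0):
    """Draw m indices with replacement and report which indices were never drawn."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_samples) for _ in range(m)]
    never_drawn = sorted(set(range(n_samples)) - set(drawn))
    return drawn, never_drawn

drawn, never_drawn = bootstrap_sample(n_samples=1000, m=1000)

# Share of the data set that never made it into the sample.
never_drawn_fraction = len(never_drawn) / 1000
print(len(drawn), len(set(drawn)), never_drawn_fraction)
```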

3, Code

Import related libraries

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import resample
import pandas as pd
import h5py as h5
import numpy as np
import random

Read the data file and get the name of the data

source_data = h5.File(r"C:\Users\***\Desktop\c1_000000-001000.h5", 'r')
names = list(source_data['inputs']['cpg'].keys())

Construct a container for storing data

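column = []
for i in range(50):
    name = 'state%s' % i
    column.append(name)  # state0 ... state49

for i in range(50):
    name = 'dist%s' % i
    column.append(name)  # dist0 ... dist49

# 5000 rows of placeholder values, to be overwritten when the data is read
df_empty = pd.DataFrame(np.random.randn(5000, len(column)), columns=column)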

Read data

for i, j in enumerate(names):
    # 'state' and 'dist' features stored under entry j
    features1 = np.array(source_data['inputs']['cpg'][j]['state'])
    features2 = np.array(source_data['inputs']['cpg'][j]['dist'])
    # entry j fills rows i*1000 ... i*1000+999
    row = list(range(i*1000, i*1000+1000))
    df_empty.loc[row, column[0:50]] = features1
    df_empty.loc[row, column[50:100]] = features2

Read labels

df_empty_label = pd.DataFrame()

for i, j in enumerate(names):
    # one label column per entry in names
    label = np.array(source_data['outputs']['cpg'][j])
    df_empty_label[j] = label

Fill in missing values (marked as -1) with a random methylation state

# Write through .iat: df.values can return a copy, so assigning into it may not change the DataFrame
for i in range(df_empty_label.shape[0]):
    for j in range(df_empty_label.shape[1]):
        if df_empty_label.iat[i, j] == -1:
            df_empty_label.iat[i, j] = random.choice([0, 1])

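Partition the data set with the hold-out method


  • X: data to be divided
  • y: labels corresponding to the data
  • test_size: proportion of the test set
  • train_size: proportion of the training set
  • random_state: random seed

x = list(df_empty_label)
y = []
for i in x:
    # assumed loop body: stack the label columns so that y lines up with the rows of X
    y.extend(df_empty_label[i])
X = df_empty.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)
print(len(X_train) / (len(X_test) + len(X_train)))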
print(len(X_train) / (len(X_test) + len(X_train)))
<<< 0.7
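The principle section recommends stratified sampling, which `train_test_split` supports through its `stratify` parameter. A minimal sketch on synthetic data (`X_demo`, `y_demo` and the 80/20 class mix are made up for illustration and are not the methylation data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 1000 samples, 100 features, 20% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 100))
y_demo = np.array([0] * 800 + [1] * 200)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```

Both printed means should stay close to the overall positive rate, which is the point of stratifying.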


Likewise, use cross validation to divide the data set

  • n_splits: the value of k for k-fold cross validation
  • shuffle: whether to shuffle the data before splitting
  • random_state: random seed

kf = KFold(n_splits=10)
y = np.array(y)
count = 0
for train_index, test_index in kf.split(X):
    print('Cross validation round {}'.format(count + 1))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    count = count + 1
<<< Cross validation round 1
<<< Cross validation round 2
<<< Cross validation round 3
<<< Cross validation round 4
<<< Cross validation round 5
<<< Cross validation round 6
<<< Cross validation round 7
<<< Cross validation round 8
<<< Cross validation round 9
<<< Cross validation round 10
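By default `KFold` does not shuffle, so each test fold is a contiguous slice of the rows. Since the rows of X were filled in blocks of 1000 per entry of names, it may be worth shuffling before splitting. A minimal sketch with `shuffle=True` and `random_state`, on hypothetical toy data (`X_demo` is not part of the original data):

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 hypothetical samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)

# shuffle=True mixes the row order before forming the folds;
# random_state makes the shuffling reproducible.
kf_shuffled = KFold(n_splits=10, shuffle=True, random_state=0)
test_folds = [test_index for _, test_index in kf_shuffled.split(X_demo)]

# Every row still lands in exactly one test fold.
all_tested = np.sort(np.concatenate(test_folds))
print(test_folds[0], len(test_folds))
```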

Bootstrap method

D1_train = []
y1_train = []
for i in range(1000):
    # draw one sample with replacement and keep it
    X1, y1 = resample(X, y, n_samples=1)
    D1_train.append(X1[0])
    y1_train.append(y1[0])
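As an alternative sketch, the whole bootstrap sample can be drawn with a single `resample` call on the row indices, which also makes it easy to find the rows that were never drawn. The names `X_demo`, `y_demo`, `boot_idx` and `oob_idx` and the toy data are illustrative. Holding out the never-drawn rows for evaluation is a common convention assumed here, not something the loop above does.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 100 samples with 2 features and alternating labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2
idx = np.arange(len(X_demo))

# Draw as many row indices as there are samples, with replacement.
boot_idx = resample(idx, replace=True, n_samples=len(idx), random_state=0)

# Rows that were never drawn.
oob_idx = np.setdiff1d(idx, boot_idx)

X_boot, y_boot = X_demo[boot_idx], y_demo[boot_idx]
X_oob, y_oob = X_demo[oob_idx], y_demo[oob_idx]
print(len(X_boot), len(X_oob))
```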


If you spot a problem, please point it out.

Tags: Python Machine Learning

Posted on Thu, 11 Nov 2021 13:52:18 -0500 by cstegner