The datasets consist of human and mouse lncRNA promoter sequences obtained from the Eukaryotic Promoter Database (EPD), constructed with the consideration that RNA polymerase typically binds in the region upstream of the TSS.

Positive samples are taken around the TSS and include more of the upstream region; negative samples are taken from downstream regions far from the TSS. Seven datasets were constructed with sequence lengths ranging from 61 to 301 bp in steps of 40 bp. For a dataset with sequences of length n bp, positive samples span from (n-20) bp upstream of the TSS to 20 bp downstream of it. Negative samples are extracted in the same way, keeping the positive-to-negative ratio at 1:1. Each dataset is split into training, validation, and test sets in a 6:2:2 ratio.
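The extraction and splitting scheme above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`extract_positive`, `extract_negative`, `split_dataset`) and the downstream `offset` for negative sampling are assumptions.

```python
import random

def extract_positive(genome, tss, n):
    # For an n-bp dataset: (n-20) bp upstream of the TSS through 20 bp downstream.
    return genome[tss - (n - 20):tss + 20]

def extract_negative(genome, tss, n, offset=2000):
    # Negative sample of the same length from a downstream region far from
    # the TSS (the 2000-bp offset here is a placeholder assumption).
    start = tss + offset
    return genome[start:start + n]

def split_dataset(samples, ratios=(0.6, 0.2, 0.2)):
    # 6:2:2 train/validation/test split.
    random.shuffle(samples)
    n_train = int(len(samples) * ratios[0])
    n_val = int(len(samples) * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

genome = ''.join(random.choice('ATGC') for _ in range(10000))
pos = extract_positive(genome, tss=5000, n=181)
neg = extract_negative(genome, tss=5000, n=181)
assert len(pos) == len(neg) == 181
```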

Three main schemes are used for data processing and feature extraction: (1) one-hot encoding, (2) NCP (nucleotide chemical properties), and (3) DPCP (dinucleotide physicochemical properties).

1. One-hot encoding effectively represents a DNA sequence by mapping each deoxynucleotide to a binary vector: 'A' is encoded as (1,0,0,0), 'T' as (0,1,0,0), 'G' as (0,0,1,0), and 'C' as (0,0,0,1). A DNA sequence of length L is thus converted into a 4×L matrix A1.
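The one-hot scheme can be sketched directly from the table above (the helper name `one_hot_encode` is an assumption, not from the original code):

```python
import numpy as np

# Encoding table exactly as described in the text (and reused in load_dataset below).
ONE_HOT = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
           'G': [0, 0, 1, 0], 'C': [0, 0, 0, 1]}

def one_hot_encode(seq):
    # Each column of the resulting 4 x L matrix is the one-hot
    # vector of one nucleotide.
    return np.array([ONE_HOT[base] for base in seq], dtype=np.float32).T

m = one_hot_encode('ATGC')
print(m.shape)  # (4, 4)
```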

2. The four deoxyribonucleotides carry different bases, which differ in ring structure, hydrogen-bond strength, and chemical functionality; NCP encodes each nucleotide as a three-dimensional binary vector of these chemical properties.
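A minimal sketch of NCP encoding, using the same property table that appears in `load_dataset` below (the helper name `ncp_encode` is an assumption):

```python
import numpy as np

# NCP table as used in the accompanying code: each nucleotide maps to three
# binary chemical properties (ring structure, hydrogen-bond strength,
# chemical functionality).
NCP = {'A': [1, 1, 1], 'T': [0, 1, 0], 'G': [1, 0, 0], 'C': [0, 0, 1]}

def ncp_encode(seq):
    # Produce a 3 x L matrix, one column per nucleotide.
    return np.array([NCP[base] for base in seq], dtype=np.float32).T

print(ncp_encode('ATGC').shape)  # (3, 4)
```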

3. Consecutive dinucleotides have distinct physicochemical properties, which are important features for identifying functional genomic elements and have been used for promoter prediction. In this study, six DPCPs, namely twist, tilt, roll, shift, slide, and rise, are used to encode the DNA sequence.
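In the `load_dataset` code below, each interior nucleotide gets the average of the DPCP vectors of the two dinucleotides that overlap it. A minimal sketch of that averaging step (the property values here are random placeholders, not the real DPCP table, and `dpcp_encode` is an assumed helper name):

```python
import numpy as np

# Placeholder table: one 6-dimensional vector (twist, tilt, roll, shift,
# slide, rise) per dinucleotide. The real values appear in load_dataset.
DPCP = {a + b: np.random.rand(6) for a in 'ATGC' for b in 'ATGC'}

def dpcp_encode(seq):
    # Interior position i averages the DPCP vectors of the two
    # dinucleotides that overlap it, mirroring load_dataset.
    cols = []
    for i in range(1, len(seq) - 1):
        cols.append(DPCP[seq[i - 1:i + 1]] / 2 + DPCP[seq[i:i + 2]] / 2)
    return np.array(cols, dtype=np.float32).T  # 6 x (L-2)

print(dpcp_encode('ATGCA').shape)  # (6, 3)
```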

To demonstrate DeepLncPro's ability to capture informative motifs, we computed PWMs (see the motif-extraction section for details) to analyze the motifs extracted from the 24 filters of the first convolutional layer. TOMTOM was then used to map the motif learned by each filter to known transcription factor (TF) binding motifs in the JASPAR database.
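The motif-extraction details are not included in this excerpt; a minimal sketch of building a PWM from the subsequences that strongly activate a filter (the function name `build_pwm` and the pseudocount value are assumptions):

```python
import numpy as np

def build_pwm(subsequences, pseudocount=0.25):
    # Count nucleotide frequencies at each position of the activating
    # subsequences, then normalize each column to obtain a PWM.
    L = len(subsequences[0])
    counts = np.full((4, L), pseudocount)
    index = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    for seq in subsequences:
        for pos, base in enumerate(seq):
            counts[index[base], pos] += 1
    return counts / counts.sum(axis=0)

# Toy example: three 4-mers that activated one filter.
pwm = build_pwm(['ATGC', 'ATGA', 'ATCC'])
print(pwm.shape)  # (4, 4)
```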

The results show that the model, built on a 181 bp sequence length and the combination of the three encoding schemes, identifies human and mouse lncRNA promoters with accuracies of 87.07% and 87.73%, respectively.

We extracted and visualized the inputs and outputs of every layer of DeepLncPro: the raw input, the output of the first convolutional layer, the output of the second convolutional layer, and the output of the fully connected layer. To make these features easier to interpret, UMAP [49] was used to display the distribution of positive and negative samples.

The original authors did not release the complete code; a partial listing follows:

import argparse
import warnings
import torch
import numpy as np
import sys
import torch.nn as nn
import torch.nn.functional as F
import datetime

warnings.filterwarnings("ignore")


def predict(inputs, model, device):
    '''
    Run DeepLncPro inference.
    :return: Model inference results.
    '''
    model.to(device)
    model.eval()  # disable dropout/batch-norm updates during inference
    with torch.no_grad():
        inputs = inputs.to(device)
        output = model(inputs)
        pred = output.detach().cpu().numpy().reshape(output.shape[0])
        return pred


def motif_check(full_sequence_name, full_sequence, matrix_input_fullseq, model, device, species):
    '''
    Predicted transcription factor binding sites in sequences.
    :param full_sequence_name:List of names of samples.
    :param full_sequence:Sequence for each sample.
    :param matrix_input_fullseq:A list of input matrices for each sample subsequence.
    :return: Lists of motif hits to be written to the output file.
    '''
    key_filter_dic = {
        'm': [0, 4, 7, 9, 12, 16, 17, 18, 20, 23],
        'h': [0, 4, 6, 7, 9, 12, 14, 16, 17, 18, 20]
    }
    bounds_dic = {
        'm': [0.7895888715982438, 0.7881207838654518, 0.8854293584823608, 1.3167156040668488, 1.0154612064361572,
              1.0301212042570116, 0.9173028439283372, 0.9543586015701295, 0.7805692762136459, 1.1233348727226258],
        'h': [0.7927006408572197, 0.7680746629834175, -0.31310261189937594, 0.8761804014444351, 1.316281932592392,
              1.0314663887023925, 1.0088603087700905, 1.0300287902355194, 0.922800037264824, 0.9950408935546875,
              0.7875141471624375]
    }
    filter_matched_motif_dic = {
        'm': ['MA0746.1', 'MA0810.1', 'MA0079.3', 'MA0140.2', 'MA1125.1', 'MA1106.1', 'MA0140.2', 'MA1125.1',
              'MA0746.1', 'MA0140.2'],
        'h': ['MA0746.1', 'MA0810.1', 'MA1149.1', 'MA0079.3', 'MA0140.2', 'MA1125.1', 'MA0108.2', 'MA1106.1',
              'MA0140.2', 'MA1125.1', 'MA0746.1']
    }
    motif_TF_dic = {
        'm': ['SP3', 'TFAP2A(var.2)', 'SP1', 'GATA1::TAL1', 'ZNF384', 'HIF1A', 'GATA1::TAL1', 'ZNF384', 'SP3',
              'GATA1::TAL1'],
        'h': ['SP3', 'TFAP2A(var.2)', 'RARA::RXRG', 'SP1', 'GATA1::TAL1', 'ZNF384', 'TBP', 'HIF1A', 'GATA1::TAL1',
              'ZNF384', 'SP3']
    }
    key_filter = key_filter_dic[species]
    filter_matched_motif = filter_matched_motif_dic[species]
    motif_TF = motif_TF_dic[species]
    bounds = bounds_dic[species]
    name_list, location_list, sequence_list, motif_list, TF_list = [], [], [], [], []
    model.to(device)
    model.eval()  # inference mode
    for i in range(len(full_sequence_name)):
        if len(matrix_input_fullseq[i]) == 0:
            continue
        input = np.asarray(matrix_input_fullseq[i], dtype=np.float32)
        input = torch.from_numpy(input)
        with torch.no_grad():
            input = input.to(device)
            conv1_out = model.conv1(input).detach().cpu().numpy()
        for j in range(len(key_filter)):
            num_filter = key_filter[j]
            filter_out = conv1_out[:, num_filter, :]
            # Scan the activations of the first sliding window across positions.
            for k in range(len(filter_out[0, :]) - 1):
                if filter_out[0, k] > bounds[j]:
                    name_list.append(full_sequence_name[i])
                    location_list.append(str(k + 1) + ' to ' + str(k + 10))
                    sequence_list.append(full_sequence[i][k:k + 10])
                    motif_list.append(filter_matched_motif[j])
                    TF_list.append(motif_TF[j])
            # Positions beyond the first window: check the final activation
            # column (index 171) of each subsequent window.
            for l in range(len(filter_out[:, 171])):
                if filter_out[l, 171] > bounds[j]:
                    name_list.append(full_sequence_name[i])
                    location_list.append(str(l + 172) + ' to ' + str(l + 181))
                    sequence_list.append(full_sequence[i][l + 171:l + 181])
                    motif_list.append(filter_matched_motif[j])
                    TF_list.append(motif_TF[j])

    return [name_list, location_list, sequence_list, motif_list, TF_list]


def load_dataset(data_path):
    '''
    Load the sample sequence from the input file. Perform the process of encoding, sampling, etc.
    :param data_path: Input file.
    :return: Information used for prediction and for writing the output,
             including sample names, sequences, location information, and matrices.
    '''
    One_hot = {'A': [1, 0, 0, 0],
               'T': [0, 1, 0, 0],
               'G': [0, 0, 1, 0],
               'C': [0, 0, 0, 1]}
    NCP = {'A': [1, 1, 1],
           'T': [0, 1, 0],
           'G': [1, 0, 0],
           'C': [0, 0, 1]}
    DPCP = {'AA': [0.5773884923447732, 0.6531915653378907, 0.6124592000985356, 0.8402684612384332, 0.5856582729115565,
                   0.5476708282666789],
            'AT': [0.7512077598863804, 0.6036675879079278, 0.6737051546096536, 0.39069870063063133, 1.0,
                   0.76847598772376],
            'AG': [0.7015450873735896, 0.6284296628760702, 0.5818362228429766, 0.6836002897416182, 0.5249586459219764,
                   0.45903777008667923],
            'AC': [0.8257018549087278, 0.6531915653378907, 0.7043281318652126, 0.5882368974116978, 0.7888705476333944,
                   0.7467063799220581],
            'TA': [0.3539063797840531, 0.15795248106354978, 0.48996729107629966, 0.1795369895818257, 0.3059118434042811,
                   0.32686549630327577],
            'TT': [0.5773884923447732, 0.6531915653378907, 0.0, 0.8402684612384332, 0.5856582729115565,
                   0.5476708282666789],
            'TG': [0.32907512978081865, 0.3312861433089369, 0.5205902683318586, 0.4179453841534657, 0.45898067049412195,
                   0.3501900760908136],
            'TC': [0.5525570698352168, 0.6531915653378907, 0.6124592000985356, 0.5882368974116978, 0.49856742124957026,
                   0.6891727614587756],
            'GA': [0.5525570698352168, 0.6531915653378907, 0.6124592000985356, 0.5882368974116978, 0.49856742124957026,
                   0.6891727614587756],
            'GT': [0.8257018549087278, 0.6531915653378907, 0.7043281318652126, 0.5882368974116978, 0.7888705476333944,
                   0.7467063799220581],
            'GG': [0.5773884923447732, 0.7522393476914946, 0.5818362228429766, 0.6631651908463315, 0.4246720956706261,
                   0.6083143907016332],
            'GC': [0.5525570698352168, 0.6036675879079278, 0.7961968911255676, 0.5064970193495165, 0.6780274730118172,
                   0.8400043540595654],
            'CA': [0.32907512978081865, 0.3312861433089369, 0.5205902683318586, 0.4179453841534657, 0.45898067049412195,
                   0.3501900760908136],
            'CT': [0.7015450873735896, 0.6284296628760702, 0.5818362228429766, 0.6836002897416182, 0.5249586459219764,
                   0.45903777008667923],
            'CG': [0.2794124572680277, 0.3560480457707574, 0.48996729107629966, 0.4247569687810134, 0.5170412957708868,
                   0.32686549630327577],
            'CC': [0.5773884923447732, 0.7522393476914946, 0.5818362228429766, 0.6631651908463315, 0.4246720956706261,
                   0.6083143907016332]}

    name_list, sequence_list, matrix_list = [], [], []
    with open(data_path, 'rt') as f:
        lines = f.readlines()
        if lines[0][0] != '>':
            print('Please check the input file format!')
            sys.exit()
        k = 0
        sequence = ''
        for line in lines:
            if line[0] == '>':
                name_list.append(line[1:].strip())
                if k == 1:
                    sequence_list.append(sequence)
                    sequence = ''
                k = 1
            else:
                sequence = sequence + line.strip()
        sequence_list.append(sequence)

        name_output, location_output, sequence_output = [], [], []
        matrix_oneseq_all = []

        #Sampling with sliding windows.
        for i in range(len(name_list)):
            matrix_oneseq = []
            name_cur = name_list[i]
            sequence_cur = sequence_list[i]
            if len(sequence_cur) < 181:
                print('The input sequence "%s" does not meet the minimum length requirement of 181 bp.' % name_cur)
            else:
                # Slide the middle part of the window.
                matrix_middle = np.zeros([13, 178])
                for pos in range(178):
                    matrix_middle[0:4, pos] += np.asarray(np.float32(One_hot[sequence_cur[pos + 1]]))
                    matrix_middle[4:7, pos] += np.asarray(np.float32(NCP[sequence_cur[pos + 1]]))
                    matrix_middle[7:13, pos] += np.asarray(np.float32(DPCP[sequence_cur[pos:pos + 2]])) / 2
                    matrix_middle[7:13, pos] += np.asarray(np.float32(DPCP[sequence_cur[pos + 1:pos + 3]])) / 2

                #The changes of sliding window.
                for left in range(len(sequence_cur) - 180):
                    right = left + 180
                    name_output.append(name_cur)
                    location_output.append(str(left + 1) + ' to ' + str(left + 181))
                    sequence_output.append(sequence_cur[left:right + 1])
                    matrix_left = np.concatenate((np.asarray(
                        np.float32(One_hot[sequence_cur[left]] + NCP[sequence_cur[left]])),
                                                  np.asarray(np.float32(DPCP[sequence_cur[left:left + 2]])) / 2),
                        axis=0)
                    matrix_right1 = np.concatenate(
                        (np.asarray(np.float32(One_hot[sequence_cur[right - 1]] + NCP[sequence_cur[right - 1]])),
                         np.asarray(np.float32(DPCP[sequence_cur[right - 2:right]])) / 2 +
                         np.asarray(np.float32(DPCP[sequence_cur[right - 1:right + 1]])) / 2),
                        axis=0)
                    matrix_right2 = np.concatenate((np.asarray(
                        np.float32(One_hot[sequence_cur[right]] + NCP[sequence_cur[right]])), np.asarray(
                        np.float32(DPCP[sequence_cur[right - 1:right + 1]])) / 2), axis=0)
                    matrix_right = np.concatenate((matrix_right1[:, np.newaxis], matrix_right2[:, np.newaxis]), axis=1)
                    matrix_cur = np.concatenate((matrix_left[:, np.newaxis], matrix_middle, matrix_right), axis=1)
                    matrix_middle = matrix_cur[:, 2:180]
                    matrix_list.append(matrix_cur)
                    matrix_oneseq.append(matrix_cur)
            matrix_oneseq_all.append(matrix_oneseq)
    matrix_input = np.asarray(matrix_list, dtype=np.float32)
    matrix_input = torch.from_numpy(matrix_input)
    return name_list, sequence_list, matrix_oneseq_all, name_output, location_output, sequence_output, matrix_input


class DeepncPro(nn.Module):
    def __init__(self, configs):
        super(DeepncPro, self).__init__()
        self.sequence_lenth = configs.sequence_lenth
        self.input_channel = configs.input_channel
        self.out_channel1 = configs.out_channel1
        self.out_channel2 = configs.out_channel2
        self.filter_size = configs.filter_size
        self.filter_size2 = configs.filter_size2
        self.stride = configs.stride
        self.fc1_size = configs.fc1_size
        self.fc2_size = configs.fc2_size
        self.pool_1 = configs.pool_1
        self.pool_2 = configs.pool_2
        self.fc2 = configs.fc2

        flatten_size = (self.sequence_lenth - self.filter_size) // self.stride + 1
        if configs.pool_1:
            flatten_size = (flatten_size - self.filter_size2) // self.stride + 1
        flatten_size = (flatten_size - self.filter_size2) // self.stride + 1
        if configs.pool_2:
            flatten_size = (flatten_size - self.filter_size2) // self.stride + 1
        self.flatten_size = flatten_size

        self.conv1 = nn.Conv1d(in_channels=self.input_channel, out_channels=self.out_channel1,
                               kernel_size=self.filter_size, stride=self.stride)
        self.max_pool1 = nn.MaxPool1d(kernel_size=self.filter_size2, stride=self.stride)
        self.conv2 = nn.Conv1d(self.out_channel1, self.out_channel2, self.filter_size2, self.stride)
        self.max_pool2 = nn.MaxPool1d(self.filter_size2, self.stride)
        self.liner1 = nn.Linear(self.out_channel2 * self.flatten_size, self.fc1_size)
        self.liner2 = nn.Linear(self.fc1_size, self.fc2_size)
        if configs.fc2:
            self.liner3 = nn.Linear(self.fc2_size, 1)
        else:
            self.liner3 = nn.Linear(self.fc1_size, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        if self.pool_1:
            x = self.max_pool1(x)
        x = F.relu(self.conv2(x))
        if self.pool_2:
            x = self.max_pool2(x)

        x = x.view(x.size(0), -1)
        x = F.relu(self.liner1(x))
        if self.fc2:
            x = F.relu(self.liner2(x))
        x = self.liner3(x)
        return torch.sigmoid(x)


def prediction_process(input_file, species):
    '''
    Prediction of input sequences using DeepLncPro model.
    :return: List of predicted results.
    '''
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = torch.load('model/DeepLncPro.pkl', map_location='cpu')
    full_sequence_name, full_sequence, matrix_input_fullseq, name_list, location_list, sequence_list, data_list = load_dataset(
        input_file)
    pre_list = predict(data_list, model, device)
    out_list1 = [name_list, location_list, sequence_list, pre_list]
    out_list2 = motif_check(full_sequence_name, full_sequence, matrix_input_fullseq, model, device, species)
    return out_list1, out_list2


def write_outputFile(output_list, outputFile, threshold):
    '''Write the prediction results to outputFile as rows of an HTML table.'''
    name_list, location_list, sequence_list, pre_list = output_list
    prediction_list = [1 if i > threshold else 0 for i in pre_list]
    with open(outputFile, 'w', encoding="utf-8") as f:
        # js/out1.txt and js/out2.txt hold the HTML written before and after the rows.
        with open("js/out1.txt", "r") as out1:
            f.write(out1.read())
        for i in range(len(name_list)):
            f.write('<tr>' + '\n')
            f.write('<td>' + str(i + 1) + '</td>' + '\n')
            f.write('<td>' + name_list[i] + '</td>' + '\n')
            f.write('<td>' + location_list[i] + '</td>' + '\n')
            f.write('<td>' + str(pre_list[i]) + '</td>' + '\n')
            f.write('<td>' + str(prediction_list[i]) + '</td>' + '\n')
            f.write('<td>' + sequence_list[i] + '</td>' + '\n')
            f.write('</tr>' + '\n')
        with open("js/out2.txt", "r") as out2:
            f.write(out2.read())


def write_motif_check_results(output_list, outputFile):
    '''Write the motif check results to outputFile as rows of an HTML table.'''
    name_list, location_list, sequence_list, motif_list, TF_list = output_list
    with open(outputFile, 'w', encoding="utf-8") as f:
        # js/out3.txt and js/out4.txt hold the HTML written before and after the rows.
        with open("js/out3.txt", "r") as out1:
            f.write(out1.read())
        for i in range(len(name_list)):
            f.write('<tr>' + '\n')
            f.write('<td>' + str(i + 1) + '</td>' + '\n')
            f.write('<td>' + name_list[i] + '</td>' + '\n')
            f.write('<td>' + location_list[i] + '</td>' + '\n')
            f.write('<td>' + sequence_list[i] + '</td>' + '\n')
            f.write('<td>' + motif_list[i] + '</td>' + '\n')
            f.write('<td>' + TF_list[i] + '</td>' + '\n')
            f.write('</tr>' + '\n')
        with open("js/out4.txt", "r") as out2:
            f.write(out2.read())


def parse_args():
    '''Parameters.'''
    description = "DeepLncPro is able to identify the promoter of non-coding RNA in Human and Mouse.\n" \
                  "Example: python DeepLncPro.py -i example.txt -o output.html -s Human -ts 0.5"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('-i', '--inputFile',
                        help='-i example.txt (The input file is a complete Fasta format sequence.)')
    parser.add_argument('-o1', '--outputFile1',
                        help='-o1 output_prediction.html (Results of predicting lncRNA promoters are saved under results folder.)')
    parser.add_argument('-o2', '--outputFile2',
                        help='-o2 output_motifcheck.html (Results of predicting motifs are saved under results folder.)')
    parser.add_argument('-s', '--species',
                        help='-s h/m (Choose Human/Mouse from two species to use.)')
    parser.add_argument('-ts', '--threshold',
                        help='-ts 0.5(Prediction result threshold)')
    args = parser.parse_args()
    return args


def preprocess(inputFile, outputFile1, outputFile2, species, threshold):
    '''Predicting lncRNA promoters and writing output files.'''
    print(datetime.datetime.now())
    out_list1, out_list2 = prediction_process(inputFile, species)
    write_outputFile(out_list1, outputFile1, threshold)
    write_motif_check_results(out_list2, outputFile2)
    print(datetime.datetime.now())


if __name__ == '__main__':
    args = parse_args()
    preprocess(args.inputFile, args.outputFile1, args.outputFile2, args.species, float(args.threshold))
