Converting RNA .fasta Data to Numeric Data

Feature conversion: .fasta -> .numerical

Convert data stored in .fasta format into numeric data.


The .fasta format

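In bioinformatics, the FASTA format (also known as the Pearson format) is a text-based format for representing nucleotide or amino acid sequences. Each nucleotide or amino acid is encoded as a single letter, and a sequence name and comments may be placed before the sequence. (Source: Baidu Baike)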
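To make the format concrete, here is a minimal sketch of a reader for standard FASTA text, in which each record starts with a header line beginning with ">" and the sequence may wrap over several lines. The function name parse_fasta and the example record are illustrative assumptions, not part of the dataset files described in this post.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Header line: use the first word after '>' as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ''
            records[name] = ''
        elif name is not None:
            # Sequence lines may be wrapped, so append them to the current record.
            records[name] += line
    return records


# Hypothetical record: one 41-nt sequence wrapped over two lines.
example = """>P_1 example record
CGCCUCCCACGCGGGAGACC
CGGGUUCAAUUCCCGGCCAAU
"""
records = parse_fasta(example)
print(records['P_1'])
```

For production work, an established parser such as the one in Biopython is usually a safer choice than a hand-written reader.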

The RNA_m5c dataset

Supporting Information S1. The benchmark dataset consists of a positive dataset and a negative dataset. The former contains 120 true m5C site containing sequences with the m5C site in the center, while the latter contains 120 false m5C site containing sequences. Each of these segments is 41-bp long.
m5c_P.fasta
I. 120 true m5C site containing sequences
P_1
CGCCUCCCACGCGGGAGACCCGGGUUCAAUUCCCGGCCAAU
P_2
CCGGGUUCAAUUCCCGGCCACUGCACGUGGUUGUUUUUCAC
P_3
GGCCGUGGGUGUGUAGAGGCCUUGGUGGUGCAGUGGUAGAA
m5c_N.fasta
II. 120 false m5C site containing sequences
N_1
GGGAGUGGGAACAGGAUUUGCAAGACUCCUAGUACCUAAAU
N_2
GAAAUGGCCUCAUUUGAUAACUAGUAGGUUUUACACAGUGU
N_3
GGGCAGCCUCCUUCUUGUCUCUGUUGUUGAGGAGUGGAAUG

Manually converting the .fasta dataset to .csv format


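Keep only the RNA sequences from the .fasta dataset and add the column header "serial" as the first line, which makes the next conversion step easier. This produces two files: m5c_N.csv and m5c_P.csv.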
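The manual step could also be scripted. The following is a sketch only: it assumes that every sequence line consists solely of the letters A, C, G and U, and that no title or name line does (true for the samples shown above, but not guaranteed in general). The helper names extract_sequences and write_serial_csv are hypothetical.

```python
import csv
import io

RNA_ALPHABET = set('ACGU')


def extract_sequences(lines):
    """Return the lines that consist only of the letters A, C, G and U."""
    sequences = []
    for line in lines:
        line = line.strip()
        if line and set(line) <= RNA_ALPHABET:
            sequences.append(line)
    return sequences


def write_serial_csv(sequences, out):
    """Write one sequence per row under the single column header 'serial'."""
    writer = csv.writer(out)
    writer.writerow(['serial'])
    writer.writerows([seq] for seq in sequences)


# The first lines of m5c_P.fasta as shown above.
sample = [
    'I. 120 true m5C site containing sequences',
    'P_1',
    'CGCCUCCCACGCGGGAGACCCGGGUUCAAUUCCCGGCCAAU',
    'P_2',
    'CCGGGUUCAAUUCCCGGCCACUGCACGUGGUUGUUUUUCAC',
]
sequences = extract_sequences(sample)
buffer = io.StringIO()
write_serial_csv(sequences, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with open('m5c_P.csv', 'w', newline='').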

Implementing the conversion in Python (Anaconda, Spyder)

import csv

import pandas as pd

# Integer code for each base; any other character (U in this dataset) maps to 3.
BASE_CODES = {'A': 0, 'C': 1, 'G': 2}


def encode(sequences, label):
    """Encode each sequence as a list of base codes followed by its class label."""
    return [[BASE_CODES.get(base, 3) for base in seq] + [label]
            for seq in sequences]


# Each CSV file holds one 41-nt sequence per row under the column header 'serial'.
m5c_N_data = pd.read_csv('m5c_N.csv')
m5c_P_data = pd.read_csv('m5c_P.csv')

# Negative samples are labelled 0, positive samples are labelled 1.
data = encode(m5c_N_data['serial'], 0) + encode(m5c_P_data['serial'], 1)

# Python 3 has no file() builtin; open in text mode with newline='' so that
# csv.writer controls the line endings. The with block closes the file.
with open('data.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(data)
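The data.csv file written by the script has no header row: each line holds 41 base codes followed by the class label. A minimal sketch of reading it back into features and labels, using only the standard library, could look like this. The function name load_rows and the two sample rows are made up for illustration.

```python
import csv
import io


def load_rows(handle):
    """Split each CSV row into its base codes and its trailing class label."""
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = [int(v) for v in row]
        features.append(values[:-1])
        labels.append(values[-1])
    return features, labels


# Two made-up rows in the data.csv layout: 41 base codes, then the label.
sample = io.StringIO(
    ','.join(['2'] * 41 + ['0']) + '\n' +
    ','.join(['3'] * 41 + ['1']) + '\n'
)
features, labels = load_rows(sample)
print(len(features), len(features[0]), labels)
```

For the real file, replace the in-memory sample with open('data.csv', newline='').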