Scripts for Downloading TCGA Data

Today I worked out how to write scripts to batch-download the .svs files in TCGA. There are two parts: a Python script built on requests and a shell script built on wget. The shell script is the better choice; the Python one can hang.

Any download tool is just something a person wrote to follow a set of rules; there is nothing magical about it. Thinking + practice + refinement = a tool.

requests

The script below is a lightly modified version of the code from the CSDN post linked in its docstring. The main additions are saving each file under its .svs filename and displaying the download speed.

# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the directory the downloaded files are saved to (it is best to create a new folder for the downloaded data).
This tool supports resuming after an interruption: restart the program and it will continue from the file after the last one downloaded. Note that instead of the one-folder-per-file layout, it writes each download as a single file named after its UUID in the original TCGA. Press Ctrl+C to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: [email protected]

@cp
* save each file under its original name, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
  instead of the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time

print(__doc__)

# requests is called with verify=False below, so silence the resulting TLS warnings
requests.packages.urllib3.disable_warnings()


def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    # the GDC API reports the file size in the Content-Length header
    total_size = int(r.headers.get('content-length', 0)) or 1  # guard against a missing header
    print(f"{total_size / 1024 / 1024:.2f} MB")
    temp_size = 0  # bytes received so far
    size = 0       # bytes received at the last speed report

    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # refresh the progress bar and average speed every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()


def get_UUID_list(manifest_path):
    # the GDC manifest is tab-separated with 'id' (UUID) and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])

def get_last_UUFN(file_path):
    # the most recently modified file in the save folder is the last
    # (possibly partial) download
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]


def get_lastUUFN_index(UUFN_list, last_UUFN):
    # fall back to 0 (start from the beginning) when nothing has been downloaded yet
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0


def quit(signum, frame):
    # handler for Ctrl+C (SIGINT) and SIGTERM
    print()
    print('You chose to stop me.')
    sys.exit(0)


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last downloaded file: {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # URLs are plain strings; os.path.join is for filesystem paths
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
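
Assuming the script above is saved as gdc_download.py (my name for it, not the original post's), a typical run looks like this; after an interruption, re-running the same command continues from the last file written to the save folder:

python gdc_download.py -m gdc_manifest.txt -s ./svs_data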

wget

The advantage of wget is that it retries automatically and indefinitely and supports resuming partial downloads, which makes it very friendly for large files: there is no need to worry about the connection dropping halfway through. The script below batch-downloads the .svs files; I will keep updating it for a while.
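
As a quick illustration of the two flags the batch script relies on (the UUID here is a placeholder, not a real file id):

# -c resumes a partially downloaded file instead of starting over
# -t 0 keeps retrying indefinitely after a dropped connection
wget -c -t 0 -O example.svs https://api.gdc.cancer.gov/data/<file-UUID>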

#!/usr/bin/env bash
# 文件名:wget_download.sh

# batch-download .svs files for one cancer site; re-running the script
# resumes where it left off
# the site name is passed as the first argument and must match the
# manifest file suffix, e.g.:
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi

echo "run script: $0 $*" > "download_${location}.log" 2>&1

manifest_file=gdc_manifest_${location}.txt
# the first line of the manifest is a header, so the number of files is NR - 1
row_num=$(awk 'END{print NR}' "${manifest_file}")
file_num=$((row_num - 1))
echo "file_num is ${file_num}" >> "download_${location}.log" 2>&1
# column 1 holds the UUID, column 2 the filename; index 0 of each array is the
# header row, so the data rows line up with 1-based indices
uuid_array=($(awk '{print $1}' "${manifest_file}"))
uufn_array=($(awk '{print $2}' "${manifest_file}"))
# resume from the count of files already present; wget -c finishes any partial one
start=$(ls "./${location}" | wc -l)
if [ ${start} -eq 0 ]; then
    start=1
fi

echo "start from ${start}" >> "download_${location}.log" 2>&1
for k in $(seq ${start} ${file_num})
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> "download_${location}.log" 2>&1
    # -c resumes partial files, -t 0 retries indefinitely
    wget -c -t 0 -O "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"
done

To use it, run:

chmod +x wget_download.sh
./wget_download.sh cervix_uteri

TODO

Think about how to speed up the downloads; a sketch of the first option follows this list.
Axel, a multi-connection download tool
mwget, a multi-threaded version of wget
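
A rough sketch of the Axel option, assuming axel is installed; this would replace the wget line inside the download loop of wget_download.sh:

# -n 10 opens 10 parallel connections; -o sets the output path
axel -n 10 -o "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"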
