Scripts for Downloading TCGA Data

Today I worked out how to write scripts to batch-download the .svs files in TCGA. There are two parts: a Python script built on requests and a shell script built on wget. The shell script is the better choice; the Python one can hang.

Any download tool is just something a person wrote to follow a set of rules; there is nothing magical about it. Thinking + practice + refinement = a tool.

requests

The script below is a lightly modified version of the code from the CSDN post linked in its docstring. The main additions are saving each file under its .svs filename and displaying the download speed.

# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the directory the downloaded files are saved to (it is best to create a new folder for the downloaded data).
This tool supports resuming after an interruption: restart the program and it will continue from the file after the last one downloaded. Note that instead of the one-folder-per-file layout, it writes each download as a single file named after its UUID in the original TCGA. Press Ctrl+C to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: [email protected]

@cp
* save each file under its original name, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
  instead of the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time

print(__doc__)

# requests is called with verify=False below, so silence the resulting TLS warnings
requests.packages.urllib3.disable_warnings()


def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    # the GDC API reports the file size in the Content-Length header
    total_size = int(r.headers.get('content-length', 0)) or 1  # guard against a missing header
    print(f"{total_size / 1024 / 1024:.2f} MB")
    temp_size = 0  # bytes received so far
    size = 0       # bytes received at the last speed report

    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # refresh the progress bar and average speed every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()


def get_UUID_list(manifest_path):
    # the GDC manifest is tab-separated with 'id' (UUID) and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])

def get_last_UUFN(file_path):
    # the most recently modified file in the save folder is the last
    # (possibly partial) download
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]


def get_lastUUFN_index(UUFN_list, last_UUFN):
    # fall back to 0 (start from the beginning) when nothing has been downloaded yet
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0


def quit(signum, frame):
    # handler for Ctrl+C (SIGINT) and SIGTERM
    print()
    print('You chose to stop me.')
    sys.exit(0)


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last downloaded file: {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # URLs are plain strings; os.path.join is for filesystem paths
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
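
Assuming the script above is saved as gdc_download.py (my name for it, not the original post's), a typical run looks like this; after an interruption, re-running the same command continues from the last file written to the save folder:

python gdc_download.py -m gdc_manifest.txt -s ./svs_data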

wget

The advantage of wget is that it retries automatically and indefinitely and supports resuming partial downloads, which makes it very friendly for large files: there is no need to worry about the connection dropping halfway through. The script below batch-downloads the .svs files; I will keep updating it for a while.
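
As a quick illustration of the two flags the batch script relies on (the UUID here is a placeholder, not a real file id):

# -c resumes a partially downloaded file instead of starting over
# -t 0 keeps retrying indefinitely after a dropped connection
wget -c -t 0 -O example.svs https://api.gdc.cancer.gov/data/<file-UUID>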

#!/usr/bin/env bash
# 文件名:wget_download.sh

# batch-download .svs files for one cancer site; re-running the script
# resumes where it left off
# the site name is passed as the first argument and must match the
# manifest file suffix, e.g.:
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi

echo "run script: $0 $*" > "download_${location}.log" 2>&1

manifest_file=gdc_manifest_${location}.txt
# the first line of the manifest is a header, so the number of files is NR - 1
row_num=$(awk 'END{print NR}' "${manifest_file}")
file_num=$((row_num - 1))
echo "file_num is ${file_num}" >> "download_${location}.log" 2>&1
# column 1 holds the UUID, column 2 the filename; index 0 of each array is the
# header row, so the data rows line up with 1-based indices
uuid_array=($(awk '{print $1}' "${manifest_file}"))
uufn_array=($(awk '{print $2}' "${manifest_file}"))
# resume from the count of files already present; wget -c finishes any partial one
start=$(ls "./${location}" | wc -l)
if [ ${start} -eq 0 ]; then
    start=1
fi

echo "start from ${start}" >> "download_${location}.log" 2>&1
for k in $(seq ${start} ${file_num})
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> "download_${location}.log" 2>&1
    # -c resumes partial files, -t 0 retries indefinitely
    wget -c -t 0 -O "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"
done

To use it, run:

chmod +x wget_download.sh
./wget_download.sh cervix_uteri

TODO

Think about how to speed up the downloads; a sketch of the first option follows this list.
Axel, a multi-connection download tool
mwget, a multi-threaded version of wget
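
A rough sketch of the Axel option, assuming axel is installed; this would replace the wget line inside the download loop of wget_download.sh:

# -n 10 opens 10 parallel connections; -o sets the output path
axel -n 10 -o "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"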
