Data Mining (5) ----- Hail: an open-source, Spark-based, scalable storage and compute platform for genomic data analysis: overview and installation

Introduction to hail

hail is an open-source, general-purpose, Python-facing analysis library and set of methods purpose-built for working with genomic data.

hail exists to support complex, multi-dimensional data structures, such as those used in genome-wide association studies (GWAS).

GWAS Tutorial

Under the hood, hail is implemented with Python, Scala, Java, and Apache Spark.

hail official website

GitHub repository

Official documentation

Official hail forum

hail is maintained by the Neale lab; questions can be sent to the team at [email protected]

hail is a fairly young project that keeps improving and gaining new features; see the release notes:

changelog

Who uses hail

hail is widely used in academia and in the genomics industry, for example by:

genome aggregation database

UK Biobank rapid GWAS

Many academic papers have also been published using hail.

Related references:

Hail-Powered Science

Installing hail

Environment prerequisites

1. JRE 8 or JDK 1.8. It must be Java 8; newer versions (9+) are not supported, mainly because the bundled Spark requires Java 8.

2. Python 3.6 or 3.7 (Miniconda Python 3.7 is recommended); 3.8 is not supported.

3. Whichever route you take, your Linux system must provide reasonably recent C and C++ standard libraries, for example:

GCC 5.0, and LLVM version 3.4 (which is Apple LLVM version 6.0) and later should install a compatible C++ standard library.

Most GNU/Linux distributions released since 2012 have a compatible C standard library

4. If you suspect your GCC is too old, update the g++ package and install lz4. I am on a fairly recent CentOS system here, so I did not need to run the commands below.

Debian

sudo apt-get install g++ liblz4-dev

CentOS

sudo yum install gcc-c++ lz4 lz4-devel

Mac OS X

xcode-select --install
brew install lz4

Installing the conda command line

For details, see this article:

Data Mining ---- Basics -- installing conda --- miniconda

Once installed, enter the conda environment.

If conda is installed at /root/miniconda3, there is an activate script inside:

/root/miniconda3/bin/activate

Run:

chmod 777 /root/miniconda3/bin/activate
source /root/miniconda3/bin/activate

Installing the hail package

Run:

conda create -n hail python'>=3.6,<3.8'
conda activate hail
pip install hail

On success, the full output looks like this:

(base) [root@spark-client-test conda]# conda activate hail
(hail) [root@spark-client-test conda]# 
(hail) [root@spark-client-test conda]# 
(hail) [root@spark-client-test conda]# pip install hail
Collecting hail
  Downloading hail-0.2.32-py3-none-any.whl (32.5 MB)
     |████████████████████████████████| 32.5 MB 3.7 MB/s 
Collecting hurry.filesize==0.9
  Downloading hurry.filesize-0.9.tar.gz (2.8 kB)
Collecting requests<2.21.1,>=2.21.0
  Downloading requests-2.21.0-py2.py3-none-any.whl (57 kB)
     |████████████████████████████████| 57 kB 3.7 MB/s 
Collecting asyncinit<0.3,>=0.2.4
  Downloading asyncinit-0.2.4-py3-none-any.whl (2.8 kB)
Collecting bokeh<1.3,>1.1
  Downloading bokeh-1.2.0.tar.gz (17.6 MB)
     |████████████████████████████████| 17.6 MB 3.4 MB/s 
Collecting nest-asyncio
  Downloading nest_asyncio-1.2.3-py3-none-any.whl (4.5 kB)
Collecting gcsfs==0.2.1
  Downloading gcsfs-0.2.1.tar.gz (51 kB)
     |████████████████████████████████| 51 kB 393 kB/s 
Collecting tabulate==0.8.3
  Downloading tabulate-0.8.3.tar.gz (46 kB)
     |████████████████████████████████| 46 kB 4.1 MB/s 
Collecting python-json-logger==0.1.11
  Downloading python-json-logger-0.1.11.tar.gz (6.0 kB)
Collecting pyspark<2.4.2,>=2.4
  Downloading pyspark-2.4.1.tar.gz (215.7 MB)
     |████████████████████████████████| 215.7 MB 61 kB/s 
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/numpy/
Collecting numpy<2
  Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
     |████████████████████████████████| 20.1 MB 10.4 MB/s 
Collecting parsimonious<0.9
  Downloading parsimonious-0.8.1.tar.gz (45 kB)
     |████████████████████████████████| 45 kB 5.1 MB/s 
Collecting aiohttp<3.7,>=3.6
  Downloading aiohttp-3.6.2-cp37-cp37m-manylinux1_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 15.0 MB/s 
Collecting scipy<1.4,>1.2
  Downloading scipy-1.3.3-cp37-cp37m-manylinux1_x86_64.whl (25.2 MB)
     |████████████████████████████████| 25.2 MB 8.2 MB/s 
Collecting decorator<5
  Downloading decorator-4.4.1-py2.py3-none-any.whl (9.2 kB)
Collecting PyJWT
  Downloading PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting pandas<0.26,>0.24
  Downloading pandas-0.25.3-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
     |████████████████████████████████| 10.4 MB 13.4 MB/s 
Collecting aiohttp-session<2.8,>=2.7
  Downloading aiohttp_session-2.7.0-py3-none-any.whl (14 kB)
Requirement already satisfied: setuptools in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from hurry.filesize==0.9->hail) (45.2.0.post20200210)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/hail/lib/python3.7/site-packages (from requests<2.21.1,>=2.21.0->hail) (2019.11.28)
Collecting chardet<3.1.0,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 21.5 MB/s 
Collecting idna<2.9,>=2.5
  Downloading idna-2.8-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 10.3 MB/s 
Collecting urllib3<1.25,>=1.21.1
  Downloading urllib3-1.24.3-py2.py3-none-any.whl (118 kB)
     |████████████████████████████████| 118 kB 16.8 MB/s 
Collecting six>=1.5.2
  Downloading six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting PyYAML>=3.10
  Downloading PyYAML-5.3.tar.gz (268 kB)
     |████████████████████████████████| 268 kB 18.5 MB/s 
Collecting python-dateutil>=2.1
  Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
     |████████████████████████████████| 227 kB 11.4 MB/s 
Collecting Jinja2>=2.7
  Downloading Jinja2-2.11.1-py2.py3-none-any.whl (126 kB)
     |████████████████████████████████| 126 kB 11.8 MB/s 
Collecting pillow>=4.0
  Downloading Pillow-7.0.0-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 10.3 MB/s 
Collecting packaging>=16.8
  Downloading packaging-20.1-py2.py3-none-any.whl (36 kB)
Collecting tornado>=4.3
  Downloading tornado-6.0.3.tar.gz (482 kB)
     |████████████████████████████████| 482 kB 13.0 MB/s 
Collecting google-auth>=1.2
  Downloading google_auth-1.11.1-py2.py3-none-any.whl (76 kB)
     |████████████████████████████████| 76 kB 6.7 MB/s 
Collecting google-auth-oauthlib
  Downloading google_auth_oauthlib-0.4.1-py2.py3-none-any.whl (18 kB)
Collecting py4j==0.10.7
  Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
     |████████████████████████████████| 197 kB 13.0 MB/s 
Collecting multidict<5.0,>=4.5
  Downloading multidict-4.7.4-cp37-cp37m-manylinux1_x86_64.whl (149 kB)
     |████████████████████████████████| 149 kB 13.0 MB/s 
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.4.2-cp37-cp37m-manylinux1_x86_64.whl (256 kB)
     |████████████████████████████████| 256 kB 9.8 MB/s 
Collecting async-timeout<4.0,>=3.0
  Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Collecting attrs>=17.3.0
  Downloading attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting pytz>=2017.2
  Downloading pytz-2019.3-py2.py3-none-any.whl (509 kB)
     |████████████████████████████████| 509 kB 11.3 MB/s 
Collecting MarkupSafe>=0.23
  Downloading MarkupSafe-1.1.1-cp37-cp37m-manylinux1_x86_64.whl (27 kB)
Collecting pyparsing>=2.0.2
  Downloading pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 8.1 MB/s 
Collecting cachetools<5.0,>=2.0.0
  Downloading cachetools-4.0.0-py3-none-any.whl (10 kB)
Collecting pyasn1-modules>=0.2.1
  Downloading pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
     |████████████████████████████████| 155 kB 13.0 MB/s 
Collecting rsa<4.1,>=3.1.4
  Downloading rsa-4.0-py2.py3-none-any.whl (38 kB)
Collecting requests-oauthlib>=0.7.0
  Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting pyasn1<0.5.0,>=0.4.6
  Downloading pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 7.3 MB/s 
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
     |████████████████████████████████| 147 kB 11.5 MB/s 
Building wheels for collected packages: hurry.filesize, bokeh, gcsfs, tabulate, python-json-logger, pyspark, parsimonious, PyYAML, tornado
  Building wheel for hurry.filesize (setup.py) ... done
  Created wheel for hurry.filesize: filename=hurry.filesize-0.9-py3-none-any.whl size=4132 sha256=a6b6a98980ee1a921fc77fe3c6d76c26b53408afef398beb000dca534e5cc3cd
  Stored in directory: /root/.cache/pip/wheels/2c/99/7f/8c88c372b4bd642a731232e63cb89467554f6cea7708574e49
  Building wheel for bokeh (setup.py) ... done
  Created wheel for bokeh: filename=bokeh-1.2.0-py3-none-any.whl size=8233357 sha256=fadb5f1812eb81dea3d26393057a6ab08f0b269bea4a021a5b280ecee8e25bda
  Stored in directory: /root/.cache/pip/wheels/e9/2a/b6/0ffbf2eac54e489186db36a12cd6a2cebee9e6aba3dc8bff89
  Building wheel for gcsfs (setup.py) ... done
  Created wheel for gcsfs: filename=gcsfs-0.2.1-py3-none-any.whl size=27182 sha256=6f9b48133558c2bd1f06e85b8bca22db450be12cd509046648d904ccde891a9d
  Stored in directory: /root/.cache/pip/wheels/87/f6/97/5a407bea9c29e37733a1db25076fb17ce4b6d863034f52f898
  Building wheel for tabulate (setup.py) ... done
  Created wheel for tabulate: filename=tabulate-0.8.3-py3-none-any.whl size=23378 sha256=01710251a1960d89f05787fdd3e4fd02afa7f15f55a9c7ad45468d4ae04c20a6
  Stored in directory: /root/.cache/pip/wheels/b8/a2/a6/812a8a9735b090913e109133c7c20aaca4cf07e8e18837714f
  Building wheel for python-json-logger (setup.py) ... done
  Created wheel for python-json-logger: filename=python_json_logger-0.1.11-py2.py3-none-any.whl size=5075 sha256=f30f8962b78fdd72c7027842d3138bd3eb5750e9e5beae03456208d45709fc54
  Stored in directory: /root/.cache/pip/wheels/fa/7f/fd/92ccdbb9d1a65486406e0363d2ba5b4ce52f400a915f602ecb
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.1-py2.py3-none-any.whl size=216054604 sha256=240829ce909e1c8e8c57a2483e638733b5ca755bd3ae296023d74882d701bb63
  Stored in directory: /root/.cache/pip/wheels/7e/9e/dd/d9be11f6d64e6d1d6c105e7e53e808f6d38788f0903e2bc72f
  Building wheel for parsimonious (setup.py) ... done
  Created wheel for parsimonious: filename=parsimonious-0.8.1-py3-none-any.whl size=42709 sha256=2b6693e8021d511f95f950bbdf4dddcf2f9e90462dacbf35e1837358e06b8836
  Stored in directory: /root/.cache/pip/wheels/88/5d/ba/f27d8af07306b65ee44f9d3f9cadea1db749a421a6db8a99bf
  Building wheel for PyYAML (setup.py) ... done
  Created wheel for PyYAML: filename=PyYAML-5.3-cp37-cp37m-linux_x86_64.whl size=44228 sha256=f9136de58f310b1005c22edd7196dbf3a5590cd11af8a78d457f7390d59bbf67
  Stored in directory: /root/.cache/pip/wheels/8a/55/a4/c0a81d27c33462cfdcb904db018f5550197e88b2b6b85beed2
  Building wheel for tornado (setup.py) ... done
  Created wheel for tornado: filename=tornado-6.0.3-cp37-cp37m-linux_x86_64.whl size=418877 sha256=2294c7dd506acb7cd16a3a81c0690de38155a36aa913fe23066602f18332d24d
  Stored in directory: /root/.cache/pip/wheels/d0/31/2c/9406ed59f0dcdce0c453a8664124d738551590e74fc087f604
Successfully built hurry.filesize bokeh gcsfs tabulate python-json-logger pyspark parsimonious PyYAML tornado
Installing collected packages: hurry.filesize, chardet, idna, urllib3, requests, asyncinit, six, PyYAML, python-dateutil, MarkupSafe, Jinja2, numpy, pillow, pyparsing, packaging, tornado, bokeh, nest-asyncio, cachetools, pyasn1, pyasn1-modules, rsa, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, decorator, gcsfs, tabulate, python-json-logger, py4j, pyspark, parsimonious, multidict, yarl, async-timeout, attrs, aiohttp, scipy, PyJWT, pytz, pandas, aiohttp-session, hail
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 PyJWT-1.7.1 PyYAML-5.3 aiohttp-3.6.2 aiohttp-session-2.7.0 async-timeout-3.0.1 asyncinit-0.2.4 attrs-19.3.0 bokeh-1.2.0 cachetools-4.0.0 chardet-3.0.4 decorator-4.4.1 gcsfs-0.2.1 google-auth-1.11.1 google-auth-oauthlib-0.4.1 hail-0.2.32 hurry.filesize-0.9 idna-2.8 multidict-4.7.4 nest-asyncio-1.2.3 numpy-1.18.1 oauthlib-3.1.0 packaging-20.1 pandas-0.25.3 parsimonious-0.8.1 pillow-7.0.0 py4j-0.10.7 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-2.4.6 pyspark-2.4.1 python-dateutil-2.8.1 python-json-logger-0.1.11 pytz-2019.3 requests-2.21.0 requests-oauthlib-1.3.0 rsa-4.0 scipy-1.3.3 six-1.14.0 tabulate-0.8.3 tornado-6.0.3 urllib3-1.24.3 yarl-1.4.2
(hail) [root@spark-client-test conda]# 

If pip reports missing packages or mismatched versions, try upgrading pip as the error message suggests:

pip install --upgrade pip

and then rerun:

pip install hail

As the install log shows, hail pulls in its own pyspark.

Starting and testing hail

Once installation succeeds, you can try running hail.

First, enter the Python environment:

conda activate hail

python

import hail as hl
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt.count()

A successful run looks like this:

(hail) [root@spark-client-test conda]# python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
Initializing Spark and Hail with default parameters...
20/02/14 10:45:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.4.4
SparkUI available at http://spark-client-test:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.32-a5876a0a2853
LOGGING: writing to /spark/spark-2.4.4-bin-hadoop2.7/conda/hail-20200214-1045-0.2.32-a5876a0a2853.log
2020-02-14 10:45:13 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 100 variants...
>>> mt.count()
(100, 50)

About Spark

By default hail uses a local Spark, i.e. Spark in local mode.

If you need a remote Spark cluster instead, point hail at it by setting the PYSPARK_SUBMIT_ARGS environment variable, for example:

export PYSPARK_SUBMIT_ARGS="--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}
  --deploy-mode client
  --jars /usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar,/spark/jars/hadoop-aws-2.7.6.jar,/spark/jars/aws-java-sdk-1.7.4.jar,/spark/jars/mongo-java-driver-3.4.2.jar,/spark/jars/mongo-spark-connector_2.11-2.2.0.jar
  --conf spark.driver.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar
  --conf spark.executor.extraClassPath=/usr/local/python37/lib/python3.7/site-packages/hail/hail-all-spark.jar
  --conf spark.executor.instances=3
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
  --conf spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py:vs245py37
  --conf spark.kubernetes.submission.waitAppCompletion=false
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
  --conf spark.hadoop.fs.s3a.fast.upload=true
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.cn-northwest-1.amazonaws.com.cn
  --conf spark.hadoop.fs.s3a.access.key=123
  --conf spark.hadoop.fs.s3a.secret.key=456
  --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4
  --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4
  --conf spark.kubernetes.namespace=spark
  --conf spark.driver.host=`hostname -i`
  pyspark-shell"

Alternatively, pass the parameters when calling spark-submit:

HAIL_HOME=$(pip3 show hail | grep Location | awk -F' ' '{print $2 "/hail"}')
spark-submit \
  --jars $HAIL_HOME/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=$HAIL_HOME/hail-all-spark.jar \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  your-hail-python-script-here.py

Increasing hail's memory

In a Linux shell, run:

export PYSPARK_SUBMIT_ARGS="--driver-memory <YOUR_MEMORY_HERE>G pyspark-shell"
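For example, to give the Spark driver 8 GB of memory (8G is just an illustrative value; size it to your machine):

export PYSPARK_SUBMIT_ARGS="--driver-memory 8G pyspark-shell"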

Adding Annotation Database information

The Annotation Database is an annotation database designed to plug into hail analysis pipelines. The following commands annotate our MatrixTable mt with annotations from the database:

Run:

conda activate hail
python
import hail as hl
db = hl.experimental.DB()
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt = db.annotate_rows_db(mt)
mt.show()

Datasets

hail ships with some built-in genomic datasets for us to use; they are loaded with load_dataset().

In Python:

>>> # Load 1000 Genomes MatrixTable with GRCh38 coordinates
>>> mt_1kg = hl.experimental.load_dataset(name='1000_genomes',
...                                       version='phase3',
...                                       reference_genome='GRCh38')

This includes the 1000 Genomes dataset, 1000_genomes.

For more datasets, see Datasets.

Data types (Types)

Normally we do not need to worry about types, because hail automatically infers them from the values.

But a few cases deserve attention (see the sketch after this list):

1. When importing a table with import_table(), if the types hail infers are not what we want, we can specify them explicitly.

2. literal() converts a Python value into a hail expression, inferring the type from the value.

3. A missing value is declared with null(), which gives it an explicit type.
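A short sketch of all three cases (the file path and column name here are hypothetical):

import hail as hl

# 1. Override type inference when importing a table
ht = hl.import_table('data/my_table.tsv', types={'HT': hl.tint32})

# 2. literal() turns a Python value into a hail expression, inferring its type
expr = hl.literal(['a', 'b'])  # ArrayExpression of type array<str>

# 3. null() declares a missing value with an explicit type
miss = hl.null('float64')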

Here are a few type-related commands:

conda activate hail
python
import hail as hl
hl.int32(3).dtype

Output:

dtype('int32')

Full session:

(base) [root@spark-client-test spark]# conda activate hail
(hail) [root@spark-client-test spark]# python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hail as hl
>>> hl.int32(3).dtype
dtype('int32')
>>> 

When we construct a value, its type is also displayed:

>>> hl.int32(3)
<Int32Expression of type int32>

>>> hl.struct(name='Hail', dob=2015)
<StructExpression of type struct{name: str, dob: int32}>

>>> hl.struct(name='Hail', dob=2015).dtype
dtype('struct{name: str, dob: int32}')

Note that, unlike Python, hail's sets, arrays, and dicts can only hold values of a single type.

Python allows mixed containers such as ['1', 2, 3.0], with both numbers and strings; hail does not.

In a dict, the key type and the value type may differ, but all keys must share one type, and likewise all values.

For example:

{'a': 1, 'b': 2}

has the type

dict<str, int32>

hail also supports a type called tstruct, which can combine several types, for example:

hl.struct(name='Hail', dob=2015)

Its type is

dtype('struct{name: str, dob: int32}')

where name is a string and dob is an integer.

Structs are very common in hail, for example:

>>> new_table = table1.annotate(table2_fields = table2[table1.key])

This adds a field named table2_fields to table1 and assigns the resulting table to new_table. The table2_fields field is a struct containing the nested fields pulled in from table2.

For more details on types, see Types.

Expressions

A hail expression is a concise representation of data.

In hail, every data type has its own expression class.
For example, Int32Expression represents a 32-bit integer,
and BooleanExpression represents a boolean value of True or False.

For example:

>>> hl.int32(5)
<Int32Expression of type int32>

>>> hl.bool(True)
<BooleanExpression of type bool>

Expressions of the same type can be added, subtracted, multiplied, and divided:

>>> hl.int32(5) + hl.int32(6)
<Int32Expression of type int32>

Adding two Int32Expressions yields a new Int32Expression object.

Expressions are lazy: they are computed only when their values are actually used.

Comparing Python with hail expressions makes this clear.

In Python, the expression 5 + 6 is evaluated immediately, so printing the result shows the computed value:

>>> x = 5
>>> y = 6
>>> z = x + y
>>> z
11

hail's 5 + 6 expression does not compute the result; it stores the expression and evaluates it only when used:

>>> x = hl.int32(5)
>>> y = hl.int32(6)
>>> z = x + y
>>> z
<Int32Expression of type int32>

>>> hl.eval(z)
11
>>> z.show()
+--------+
| <expr> |
+--------+
|  int32 |
+--------+
|     11 |
+--------+

hail expressions are used heavily in hail tables and matrix tables, for example for row and column computations and for adding fields:

>>> ht2 = ht.annotate(C4 = ht.C3 + 3 * ht.C2 ** 2)

The code above adds a new field C4, computed from the C3 and C2 fields of the table ht.

hail expressions cannot be used with numpy or scipy.

hail expressions also support arrays and similar container types:

>>> a = hl.array([1, 2, -3, 0, 5])
>>> a
<ArrayNumericExpression of type array<int32>>

>>> a.dtype
dtype('array<int32>')

>>> a[1]
<Int32Expression of type int32>


>>> a[1:-1]
<ArrayNumericExpression of type array<int32>>


>>> hl.literal([0,1,2])
<ArrayNumericExpression of type array<int32>>

Logical operators

hail's logical operators differ from Python's:

Python uses and, or, not;
hail uses &, |, ~.

For example:

>>> s1 = hl.int32(3) == 4
>>> s2 = hl.int32(3) != 4

>>> s1 & s2
<BooleanExpression of type bool>

>>> s1 | s2
<BooleanExpression of type bool>

>>> ~s1
<BooleanExpression of type bool>

>>> hl.eval(~s1)
True

Conditionals

Python's if and else cannot be used on hail expressions.

hail provides cond(), case(), and switch() instead.

For example, the Python code

if (x > 0):
    return 1
else:
    return 0

must be rewritten as

>>> hl.cond(x > 0, 1, 0)
 <Int32Expression of type int32>

This expression can also take part in arithmetic:

>>> a + hl.cond(x > 0, 1, 0)
<ArrayNumericExpression of type array<int32>>

case() usage

To implement: return 1 if x < -1, 2 if -1 <= x <= 2, and 3 if x > 2.

>>> (hl.case()
...   .when(x < -1, 1)
...   .when((x >= -1) & (x <= 2), 2)
...   .when(x > 2, 3)
...   .or_missing())
<Int32Expression of type int32>

or_missing() means a missing value is returned if no branch matches.
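For example, binding x to a concrete value and evaluating:

>>> x = hl.int32(0)
>>> hl.eval(hl.case()
...           .when(x < -1, 1)
...           .when((x >= -1) & (x <= 2), 2)
...           .when(x > 2, 3)
...           .or_missing())
2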

switch() usage

>>> csq = hl.str('nonsense')

>>> (hl.switch(csq)
...    .when("synonymous", False)
...    .when("intron", False)
...    .when("nonsense", True)
...    .when("indel", True)
...    .or_missing())
<BooleanExpression of type bool>

Missingness

Declare a missing expression:

>>> hl.null('float64')
<Float64Expression of type float64>

>>> hl.cond(x > 2.0, x, hl.null(hl.tfloat))
<Float64Expression of type float64>


>>> cnull = hl.null('call')
>>> hl.eval(cnull.is_het())
None

For more usage, see:

Expression methods

Expressions overview

Tables

A hail Table is the counterpart of a SQL table, a pandas DataFrame, an R data.frame, a dplyr tibble, or a Spark DataFrame.

Importing

hail can import tables from many data sources; the most common are TSV or CSV files.

For example:

>>> ht = hl.import_table("data/kt_example1.tsv", impute=True)

Many genetics-oriented import methods also return tables, such as import_locus_intervals(), import_fam(), and import_bed().

Display a table's contents with show():

>>> ht.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+

Global fields

Besides ordinary rows and columns, hail tables also have global fields. A global field behaves like a column whose value is identical in every row; for example, below G is a global field whose value is 5:

+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |     G |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |     5 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |     5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |     5 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+

However, the value 5 is stored only once, not once per row.

You can inspect the fields with describe(). (The example table ht has no global fields yet, so describe() shows none.)

>>> ht.describe()  
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'ID': int32
    'HT': int32
    'SEX': str
    'X': int32
    'Z': int32
    'C1': int32
    'C2': int32
    'C3': int32
----------------------------------------
Key:
    None
----------------------------------------
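To actually populate a global field like G above, use annotate_globals(); a minimal sketch:

>>> ht_g = ht.annotate_globals(G = 5)
>>> hl.eval(ht_g.G)
5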

Keys

Row keys are defined with table.key_by(); they are important when joining tables.

獲取列內容

>>> ht  
<hail.table.Table at 0x110791a20>

>>> ht.ID  
<Int32Expression of type int32>

Attribute access and index access are equivalent:

ht.G

ht['G']

Updating fields
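Reusing an existing field name in annotate() overwrites that field (the same pattern as the GT example in the MatrixTable section below); a minimal sketch:

>>> ht_up = ht.annotate(HT = ht.HT * 2)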

Dropping fields

>>> ht.drop('C1', 'C2')
>>> ht.drop(*['C1', 'C2'])

Selecting fields

>>> ht.select(ht.ID, ht.SEX)
>>> ht.select(*['ID', 'C3'])

Filtering rows and adding fields

Keep matching rows:

>>> table_result = table1.filter(table1.C1 == 5) 

Drop matching rows:

>>> table_result = table1.filter(table1.C1 == 10, keep=False)

Add a new field:

>>> ht_new = ht_new.annotate(id_times_2 = ht_new.ID * 2)

Aggregation

hail provides methods for aggregating values across rows and columns:

>>> ht.aggregate(hl.agg.fraction(ht.SEX == 'F'))
0.5

>>> ht_agg = (ht.group_by(ht.SEX)
...             .aggregate(mean = hl.agg.mean(ht.HT)))
>>> ht_agg.show()
+-----+----------+
| SEX |     mean |
+-----+----------+
| str |  float64 |
+-----+----------+
| "F" | 6.50e+01 |
| "M" | 6.85e+01 |
+-----+----------+

More aggregator usage: Aggregators

Joins

Two tables can be joined once keys have been set.

For example:

>>> ht = ht.key_by('ID')
>>> ht2 = hl.import_table("data/kt_example2.tsv", impute=True).key_by('ID')

>>> ht_join = ht.join(ht2)
>>> ht_join.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |     A | B        |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int32 | str      |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 |    65 | "cat"    |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 |    72 | "dog"    |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 |    70 | "mouse"  |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 |    60 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+----------+

You can also add columns from table 2 to table 1:

>>> ht1 = ht.annotate(B = ht2[ht.ID].B)
>>> ht1.show(width=120)
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 | B        |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | str      |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+
|     1 |    65 | "M" |     5 |     4 |     2 |    50 |     5 | "cat"    |
|     2 |    72 | "M" |     6 |     3 |     2 |    61 |     1 | "dog"    |
|     3 |    70 | "F" |     7 |     3 |    10 |    81 |    -5 | "mouse"  |
|     4 |    60 | "F" |     8 |     2 |    11 |    90 |   -10 | "rabbit" |
+-------+-------+-----+-------+-------+-------+-------+-------+----------+

Basic Table methods

>>> first3 = ht.take(3)
>>> first3
[Struct(ID=1, HT=65, SEX='M', X=5, Z=4, C1=2, C2=50, C3=5),
 Struct(ID=2, HT=72, SEX='M', X=6, Z=3, C1=2, C2=61, C3=1),
 Struct(ID=3, HT=70, SEX='F', X=7, Z=3, C1=10, C2=81, C3=-5)]
 
 
>>> first3[0].ID
1

>>> first3[0]['ID']
1

Table.head() (first n rows)

Table.describe() (schema)

Table.count() (number of rows)
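For instance, on the four-row example table above:

>>> ht.count()
4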

Exporting

Export as TSV:

Table.export()

Write to disk in hail's native format:

Table.write() 

These files can be read back with read_table().

Convert to a pandas DataFrame:

Table.to_pandas()

Convert to a Spark DataFrame (pyspark):

Table.to_spark()
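A quick sketch of all four (the output paths here are made up):

ht.export('output/example.tsv')        # TSV on disk
ht.write('output/example.ht')          # hail native format
ht_back = hl.read_table('output/example.ht')
df = ht.to_pandas()                    # pandas DataFrame
sdf = ht.to_spark()                    # Spark DataFrame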

Tables overview

MatrixTables

A MatrixTable is a distributed, two-dimensional extension of a Table.

An ordinary hail table has two kinds of fields: row fields and global fields.

A MatrixTable has four:

1. entry fields, indexed by both row and column
2. row fields
3. column fields
4. global fields

The operations differ accordingly:
a Table has Table.select() and Table.select_globals();

a MatrixTable has MatrixTable.select_rows(), MatrixTable.select_cols(), MatrixTable.select_entries(), and MatrixTable.select_globals(),

as well as MatrixTable.rows(), MatrixTable.cols(), and MatrixTable.entries().
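The last three turn the corresponding parts of a MatrixTable back into ordinary Tables:

>>> rows_ht = mt.rows()        # row fields as a Table
>>> cols_ht = mt.cols()        # column fields as a Table
>>> entries_ht = mt.entries()  # one row per (row, column) pair; can be very large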

MatrixTable keys

Read keys with MatrixTable.row_key and MatrixTable.col_key.

Re-key with MatrixTable.key_rows_by() and MatrixTable.key_cols_by().
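A sketch: for the VCF-derived mt below (row key ['locus', 'alleles']), re-key rows by locus alone:

>>> mt_rekeyed = mt.key_rows_by('locus')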

Accessing fields

>>> mt  
<hail.matrixtable.MatrixTable at 0x1107e54a8>

>>> mt.locus  
<LocusExpression of type locus<GRCh37>>

>>> mt.DP.describe()  
--------------------------------------------------------
Type:
    int32
--------------------------------------------------------
Source:
    <class 'hail.matrixtable.MatrixTable'>
Index:
    ['row', 'column']
--------------------------------------------------------

Importing a MatrixTable

The general-purpose method is import_matrix_table().

Genetics analyses commonly use import_vcf(), import_plink(), import_bgen(), and import_gen().

>>> mt = hl.import_vcf('data/sample.vcf.bgz')

>>> mt.describe()  
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        NEGATIVE_TRAIN_SITE: bool,
        AC: array<int32>,
        ...
        DS: bool
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>
----------------------------------------

Common operations

Filtering

>>> filt_mt = mt.filter_rows(hl.len(mt.alleles) == 2)

>>> filt_mt = mt.filter_cols(hl.agg.mean(mt.GQ) < 20)

>>> filt_mt = mt.filter_entries(mt.DP < 5)

Adding fields

MatrixTable.annotate_globals()
MatrixTable.annotate_rows()
MatrixTable.annotate_cols()
MatrixTable.annotate_entries()

>>> mt_new = mt.annotate_globals(foo = 5)
>>> print(mt_new.globals.dtype.pretty())
struct {
    foo: int32
}

>>> mt_new = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))

>>> mt_new = mt.annotate_entries(GT = hl.or_missing(mt.GQ >= 20, mt.GT))

Selecting fields

MatrixTable.select_globals()
MatrixTable.select_rows()
MatrixTable.select_cols()
MatrixTable.select_entries()

>>> mt_new = mt.select_entries('GT')
>>> print(mt_new.entry.dtype.pretty())
struct {
    GT: call
}

>>> mt_new = mt.select_rows(**mt['info']) 


>>> mt_new = mt.select_rows(AC = mt.info.AC,
...                         n_filters = hl.len(mt['filters']))

Dropping fields

>>> mt_new = mt.drop('GQ')

Exploding arrays and sets:
MatrixTable.explode_rows()
MatrixTable.explode_cols()

For example, duplicating each row:

>>> mt_new = mt.annotate_rows(replicate_num = [1, 2])
>>> mt_new = mt_new.explode_rows(mt_new['replicate_num'])
>>> mt.count_rows()
346
>>> mt_new.count_rows()
692

>>> mt_new.replicate_num.show() 
+---------------+------------+---------------+
| locus         | alleles    | replicate_num |
+---------------+------------+---------------+
| locus<GRCh37> | array<str> |         int32 |
+---------------+------------+---------------+
| 20:10019093   | ["A","G"]  |             1 |
| 20:10019093   | ["A","G"]  |             2 |
| 20:10026348   | ["A","G"]  |             1 |
| 20:10026348   | ["A","G"]  |             2 |
| 20:10026357   | ["T","C"]  |             1 |
| 20:10026357   | ["T","C"]  |             2 |
| 20:10030188   | ["T","A"]  |             1 |
| 20:10030188   | ["T","A"]  |             2 |
| 20:10030452   | ["G","A"]  |             1 |
| 20:10030452   | ["G","A"]  |             2 |
+---------------+------------+---------------+
showing top 10 rows

Aggregation

MatrixTable.aggregate_rows()
MatrixTable.aggregate_cols()
MatrixTable.aggregate_entries()

>>> mt.aggregate_entries(hl.agg.mean(mt.GQ))  
67.73196915777027

>>> mt.aggregate_entries((hl.agg.stats(mt.DP), hl.agg.stats(mt.GQ)))  
(Struct(mean=41.83915800445897, stdev=41.93057654787303, min=0.0, max=450.0, n=34537, sum=1444998.9999999995),
Struct(mean=67.73196915777027, stdev=29.80840934057741, min=0.0, max=99.0, n=33720, sum=2283922.0000000135))

Grouping (group-by)

MatrixTable.group_rows_by()
MatrixTable.group_cols_by()

For example:

>>> mt_ann = mt.annotate_cols(case_status = hl.cond(hl.rand_bool(0.5),
...                                                 "CASE",
...                                                 "CONTROL"))


>>> mt_grouped = (mt_ann.group_cols_by(mt_ann.case_status)
...                 .aggregate(gq_stats = hl.agg.stats(mt_ann.GQ)))
>>> print(mt_grouped.entry.dtype.pretty())
struct {
    gq_stats: struct {
        mean: float64,
        stdev: float64,
        min: float64,
        max: float64,
        n: int64,
        sum: float64
    }
}
>>> print(mt_grouped.col.dtype)
struct{case_status: str}

Joins

MatrixTable.union_cols()
MatrixTable.union_rows()
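A sketch of the union methods, assuming mt_a and mt_b are two MatrixTables with compatible schemas (hypothetical names):

>>> mt_more_samples = mt_a.union_cols(mt_b)   # same rows, combined columns
>>> mt_more_variants = mt_a.union_rows(mt_b)  # same columns, combined rows

Row-keyed joins are done by indexing one dataset with another's key, as in the examples below, where gnomad_data is assumed to be a keyed table of gnomAD annotations: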

>>> mt_new = mt.annotate_rows(gnomad_ann = gnomad_data[mt.locus, mt.alleles])

>>> mt_new = mt.annotate_rows(gnomad_af = gnomad_data[mt.locus, mt.alleles]['AF'])

>>> mt_new = mt.annotate_rows(**gnomad_data[mt.locus, mt.alleles])

Other common methods

MatrixTable.describe()
MatrixTable.head()
MatrixTable.sample()
MatrixTable.count_rows()
MatrixTable.count_cols()

Exporting

Files written with MatrixTable.write() can be read back with read_matrix_table().
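A minimal sketch (the path is made up):

mt.write('output/dataset.mt')
mt2 = hl.read_matrix_table('output/dataset.mt')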

MatrixTables overview
