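1. How the heartbreak began
A bug that cost me two days is finally fixed. It turned out to be a very basic one, and I can't help sighing: why can't two who love each other be together?
Here is what happened.
I have recently been testing the animal-model functionality of the R package breedR, using its bundled test data: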
> library(breedR)
> data("globulus")
> ped <- globulus[,1:3]
> str(ped)
'data.frame': 1021 obs. of 3 variables:
$ self: int 69 70 71 72 73 74 75 76 77 78 ...
$ dad : int 0 0 0 0 0 0 0 0 0 4 ...
$ mum : int 64 41 56 55 22 50 67 59 49 8 ...
> res <- remlf90( fixed = phe_X ~ gg,genetic = list(model = 'add_animal',pedig = ped,id = 'self'),data = globulus)
Using default initial variances given by default_initial_variance()
See ?breedR.getOption.
> summary(res)
Formula: phe_X ~ 0 + gg + pedigree
Data: globulus
AIC BIC logLik
5799 5809 -2898
Parameters of special components:
Variance components:
Estimated variances S.E.
genetic 3.397 1.595
Residual 14.453 1.529
Estimate S.E.
Heritability 0.1887 0.08705
Here Va is 3.40 and Ve is 14.45. For comparison, I then fitted the same model with asreml-r:
> library(asreml)
> head(dd)
self dad mum gen gg bl phe_X x y
1 69 0 64 1 14 13 15.756 0 0
2 70 0 41 1 4 13 11.141 3 0
3 71 0 56 1 14 13 19.258 6 0
4 72 0 55 1 14 13 4.775 9 0
5 73 0 22 1 8 13 19.099 12 0
6 74 0 50 1 14 13 19.258 15 0
> dd$self = as.factor(dd$self)
> ainv = asreml.Ainverse(ped)$ginv
> mod1.as = asreml(phe_X ~ gg , random = ~ ped(self),ginverse = list(self = ainv), data=dd)
LogLikelihood Converged
> summary(mod1.as)$varcomp
gamma component std.error z.ratio constraint
ped(self)!ped 0.2349996 3.396488 1.595445 2.128865 Positive
R!variance 1.0000000 14.453164 1.529262 9.451070 Positive
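The two sets of estimates agree.
2. Just when you think it's smooth sailing, life happens
So I ran the same test on another dataset, taken from learnasreml, an R package I wrote: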
> library(learnasreml)
> dat = animalmodel.dat
> ped = animalmodel.ped
> # asreml
> ainv = asreml.Ainverse(ped)$ginv
> mod2.as = asreml(BWT ~ SEX, random = ~ ped(ANIMAL), ginverse = list(ANIMAL = ainv), data=dat)
LogLikelihood Converged
> summary(mod2.as)$varcomp
gamma component std.error z.ratio constraint
ped(ANIMAL)!ped 0.2160062 2.494254 0.9180669 2.716855 Positive
R!variance 1.0000000 11.547140 0.9386043 12.302458 Positive
The variance components are Va = 2.49 and Ve = 11.55.
Now the same test with breedR:
> dd2 = dat
> mod2.br = remlf90(BWT ~ SEX, genetic = list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree
Data: dd2
AIC BIC logLik
5941 5951 -2968
Parameters of special components:
Variance components:
Estimated variances S.E.
genetic 0.6651 0.6728
Residual 13.3380 0.8602
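What? Va is 0.67 and Ve is 13.34. What on earth is this? It does not match asreml, and given how well I know asreml there could be only one explanation: breedR must be wrong.
3. The old master is still the old master
So I went looking through the issues on breedR's GitHub repository.
One of them described the problem like this: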
Hi there,
I recently tried to fit an animal model using the remlf90() function. My model was simple and contained 4 fixed effects, 1 random non-genetic effect and the genetic additive effect (pedigree). I compared the results (h2 + se) to those of BLUPF90 (airemlf90) and they were the same as they should be. Then, I changed the class of the ‘id’ variable in the genetic part of the model from integer to factor and I re-ran the model. The h2 was considerabley different from that I got when the ‘id’ was of class integer.
dat399animal) # h2 = 0.44, se = 0.012 (correct)
dat399animal) # h2 = 0.08, se = 0.006 (wrong)
dat399animal)) # h2 = 0.44, se = 0.012 (correct again)
So, is this normal, should the ‘id’ part of the genetic effect be always coded as integer or there is a bug that needs to be corrected?
The author replied that breedR needs the individual IDs to be numeric, and that passing a factor should have raised an error:
Hi Nabeel.
Yes. The variables encoding individuals (i.e., id and progenitors) should be integers.
However, the pedigree-building function should have raisen an error whenever the user tries with a factor.
How did you specify the pedigree in the model?
Thanks for your report and help.
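Then the author asked the question developers always ask: how did you produce this bug?
In practice, though, a factor ID sometimes really does go through without any error. I thought about filing an issue myself, then decided to just sort it out on my own.
So I converted the factor to numeric and ran breedR again: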
> dd2$ANIMAL = as.numeric(dd2$ANIMAL)
> mod2.br = remlf90(BWT ~ SEX, genetic = list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree
Data: dd2
AIC BIC logLik
5941 5951 -2968
Parameters of special components:
Variance components:
Estimated variances S.E.
genetic 0.6651 0.6728
Residual 13.3380 0.8602
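Well... this is painfully real. The result did not change; it was still wrong.
4. The Palm of Despair
So I sank into deep professional self-doubt: I love it, so why can't two who love each other be together?
I looked left and right, up and down, and still could not find the problem. Then I combed through breedR's issues and came across this passage: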
Regarding the issue of the variable type of the animal id, note that in the genetic component, the pedigree is taken from ped399, where the variable animal is presumably integer or numeric. However, if you change the corresponding variable in dat399 as you have been doing, this breaks the correspondence between the animal codes in the pedigree and the dataset.
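The gist: in breedR the pedigree and the individual IDs need to be numeric, because breedR recodes the pedigree internally. If you change how the IDs in the data are coded, the matrix built from the pedigree no longer lines up with the IDs in the data, and the results may be wrong.
This perfectly correct but useless remark sparked nothing in me. I stayed sunk in self-doubt: I must not be good enough, that's why it wants to run away...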
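5. Your call reaches me in a dream
Inspiration always wakes up in a dream. In the middle of the night a thought struck me: did the values change when I converted them to numbers?
But the column in the data was a factor whose labels were already numbers. Would converting it to a numeric vector really change them?
I had long known about this R pitfall: when converting a factor to numeric you must go through character, because as.numeric() on a factor returns the internal integer level codes, not the labels.
Could it be...
Could it really be...
that I had fallen into exactly this pit???
At work the next day I eagerly ran a test: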
> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643 1183 1238 891
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(tt$ANIMAL))
[1] 864 1076 549 989 1030 751
As you can see, the values are unrecognizable: 1029 became 864 and 1299 became 1076.
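Why this happens can be shown with a minimal sketch on a made-up four-element factor (the IDs below are borrowed for illustration and are not the full ANIMAL column). A factor stores integer codes that index into its sorted levels, and as.numeric() returns those codes rather than the labels:

```r
# Hypothetical toy factor for illustration only.
ids <- factor(c(1029, 1299, 643, 1183))

# Levels are the sorted unique labels, stored as character strings.
print(levels(ids))

# as.numeric() on a factor returns each element's position in levels(ids).
codes <- as.numeric(ids)

# Going through character recovers the labels themselves.
values <- as.numeric(as.character(ids))

print(codes)
print(values)
```

In this toy factor 1029 is the second level, so it comes back as 2. In the real data 1029 happens to be the 864th level, which is where the 864 above comes from.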
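6. A sudden, dazed enlightenment
After seeing that, my heart would not calm down, and the tune of "why can't two who love each other be together" started playing in my head. Life is hard. So this was the cause all along...
In my head I heard Xianglin's Wife and her refrain:
I knew all along that converting a factor to numeric in R can go wrong...
Then I went through character as the intermediate step and tested again: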
> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643 1183 1238 891
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(as.character(tt$ANIMAL)))
[1] 1029 1299 643 1183 1238 891
Now that is correct!
Finally, I tested breedR's animal model with the IDs converted properly:
> dd2 = dat
> dd2$ANIMAL = as.numeric(as.character(dd2$ANIMAL))
> mod2.br = remlf90(BWT ~ SEX, genetic = list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree
Data: dd2
AIC BIC logLik
5931 5941 -2964
Parameters of special components:
Variance components:
Estimated variances S.E.
genetic 2.494 0.9181
Residual 11.547 0.9386
At last the correct result: Va is 2.49 and Ve is 11.55, the same as asreml.
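In hindsight, a small guard run before fitting might have saved the two days. The sketch below is only an assumption about what such a check could look like: check_ids() is a hypothetical helper, not part of breedR, and the toy vectors stand in for the data and pedigree ID columns.

```r
# Hypothetical helper (not a breedR function): refuse factor IDs and
# make sure every ID in the data also appears in the pedigree.
check_ids <- function(data_id, ped_id) {
  if (is.factor(data_id)) {
    stop("ID column is a factor; use as.numeric(as.character(x)) first")
  }
  missing_ids <- setdiff(data_id, ped_id)
  if (length(missing_ids) > 0) {
    stop("IDs not found in pedigree: ",
         paste(head(missing_ids), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-ins for the pedigree and data ID columns.
ped_id  <- c(1, 2, 3, 5)
data_id <- factor(c(3, 5))

# The factor is rejected; the properly converted IDs pass.
rejected <- inherits(try(check_ids(data_id, ped_id), silent = TRUE), "try-error")
accepted <- check_ids(as.numeric(as.character(data_id)), ped_id)
```

Note that the membership test alone would not have been enough here: the level codes produced by a bare as.numeric() can still be valid pedigree IDs, just the wrong ones, which is why the factor check comes first.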
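7. How many loved your moments of glad grace
What a painful lesson!
When converting a factor to numeric in R, always go through character. This is not a used-car marketplace: here you really do need a middleman taking a cut!!!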
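To avoid retyping the double conversion, it can be wrapped in a small helper. This is a sketch, and factor_to_numeric() is a made-up name; it uses the as.numeric(levels(f))[f] idiom described in R's ?factor help page, which converts each level once and then indexes by the factor's codes:

```r
# Hypothetical helper: convert a factor with numeric-looking labels
# back to the numbers those labels spell out.
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Convert the levels once, then index them by the factor's integer codes.
  as.numeric(levels(f))[f]
}

# Toy example: same result as as.numeric(as.character(f)).
f <- factor(c(10, 5, 10))
converted <- factor_to_numeric(f)
print(converted)
```

Labels that are not numbers will come back as NA with a warning, just as they would with as.numeric(as.character(f)).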