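Why do ls and du report different sizes for the same file?

On several occasions I have checked a file's size with both ls and du and found that the two disagree. For example: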

bl@d3:~/test/sparse_file$ ls -l fs.img
-rw-r--r-- 1 bl bl 1073741824 2012-02-17 05:09 fs.img

bl@d3:~/test/sparse_file$ du -sh fs.img
0       fs.img

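Here ls reports fs.img as 1073741824 bytes (1 GiB), while du reports 0.

I never dug into this before, so today I am filling in the gap.

There are two main reasons for the difference:

Sparse files
ls and du mean different things by "size"

Let us look at sparse files first. A sparse file is a file that contains "holes". For example, here is a C program that creates a file with a hole: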

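#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* O_CREAT requires a mode argument; 0644 gives rw-r--r--. */
    int fd = open("sparse.file", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    /* In a newly created file, offset 1024 lies past the end of the file. */
    lseek(fd, 1024, SEEK_SET);
    /* Writing one byte here leaves bytes 0..1023 as a hole. */
    write(fd, "\0", 1);

    close(fd);
    return 0;
}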

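As this program shows, the way to create a hole is to use lseek to move the file offset past the end of the file and then write. The skipped range becomes a hole.

A sparse file can also be created from the shell: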

$ dd if=/dev/zero of=sparse_file.img bs=1M seek=1024 count=0
0+0 records in
0+0 records out

The advantage of sparse files is as follows (quoted from Wikipedia):

The advantage of sparse files is that storage is only allocated when actually needed: disk space is saved, and large files can be created even if there is insufficient free space on the file system.

In other words, the holes in a sparse file need not occupy any storage.
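A minimal sketch of that claim, assuming GNU coreutils (truncate, stat -c, du --block-size) and a filesystem that supports holes. The file name hole.img and the temporary directory are made up for the illustration.

```shell
# Work in a throwaway directory so nothing is left behind.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Extend an empty file to 64 MiB without writing any data: the file is one big hole.
truncate -s 64M hole.img

apparent=$(stat -c %s hole.img)                    # logical size in bytes
allocated=$(du --block-size=1 hole.img | cut -f1)  # bytes actually allocated on disk

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"
```

On a filesystem that supports sparse files, the allocated figure should be far smaller than the apparent one. The exact number depends on the filesystem.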

Now for what the sizes printed by ls and du mean (quoted from Wikipedia):

The du command prints the occupied space, while ls prints the apparent size.

In other words, ls shows the logical size of a file, while du shows its physical size: du's figure is computed from the number of blocks the file occupies on disk. For example:

bl@d3:~/test/sparse_file$ echo -n 1 > 1B.txt
bl@d3:~/test/sparse_file$ ls -l 1B.txt
-rw-r--r-- 1 bl bl 1 2012-02-19 05:17 1B.txt
bl@d3:~/test/sparse_file$ du -h 1B.txt
4.0K    1B.txt

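Here we first create 1B.txt, which is one byte long, so ls reports a size of 1 byte. On disk, however, the file occupies N blocks, and du computes its figure from N and the size of each block. I write N rather than a concrete number because many details are hidden behind the scenes, such as the fragment size, which we will discuss another time.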
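The two numbers can be read side by side with stat. This is a rough sketch assuming GNU stat, whose %s, %b and %B formats print the size in bytes, the number of allocated blocks, and the size in bytes of each of those blocks. The temporary directory is an assumption of the example.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Create a one-byte file, as in the example above.
printf 1 > 1B.txt

size=$(stat -c %s 1B.txt)    # apparent size in bytes (what ls -l shows)
blocks=$(stat -c %b 1B.txt)  # number of blocks allocated to the file
unit=$(stat -c %B 1B.txt)    # size in bytes of each block counted by %b

echo "apparent size: $size bytes"
echo "allocated:     $blocks blocks x $unit bytes = $((blocks * unit)) bytes"
```

The block count varies by filesystem, and some filesystems report it only after the data has been flushed to disk, so the allocated figure may not match the 4.0K shown above.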

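Of course, all of the above is only the default behavior of ls and du; both offer options that change it, for example the -s option of ls (print the allocated size of each file, in blocks) and the --apparent-size option of du (print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like).

In addition, when copying a sparse file, cp by default applies an optimization that speeds up the copy. For example: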
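As an illustration of those two options, here is a small sketch assuming GNU ls and du. The file sparse.img is hypothetical and is created only for the demonstration.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# A 10 MiB file consisting entirely of a hole.
truncate -s 10M sparse.img

ls -ls sparse.img                 # first column: allocated blocks; size column: apparent size
du -h sparse.img                  # default: disk usage
du -h --apparent-size sparse.img  # apparent size, as ls -l would show it

# The same apparent size as a plain byte count.
apparent=$(du --apparent-size --block-size=1 sparse.img | cut -f1)
echo "apparent: $apparent bytes"
```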

strace cp fs.img fs.img.copy >log 2>&1

Opening the log file, we find that cp only issues read and lseek calls, with no write.

stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("fs.img", {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("fs.img", O_RDONLY)                = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
open("fs.img.copy", O_WRONLY|O_TRUNC)   = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 532480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90df965000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 524288
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1048576
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1572864

This is related to cp's sparse option. From the cp man page:

By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.

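Looking at the cp source code, I found that after each read, cp checks whether the data it read is all zeros; if so, it only calls lseek and does not write.

Of course, all of this sparse-file handling is transparent to the user.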
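The effect of the --sparse option can be observed from the outside. This is a sketch assuming GNU cp, stat and du; the file names src.img, always.img and never.img are invented for the example.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Source file: 16 MiB, all hole.
truncate -s 16M src.img

cp --sparse=always src.img always.img  # turn runs of zero bytes into holes
cp --sparse=never  src.img never.img   # ask cp not to create holes

# Print the apparent and allocated size of each file.
for f in src.img always.img never.img; do
    printf '%-12s apparent=%s allocated=%s\n' "$f" \
        "$(stat -c %s "$f")" "$(du --block-size=1 "$f" | cut -f1)"
done

always_size=$(stat -c %s always.img)
never_size=$(stat -c %s never.img)
# The contents are identical either way; only the on-disk allocation may differ.
same=$(cmp -s src.img never.img && cmp -s src.img always.img && echo yes || echo no)
echo "identical contents: $same"
```

All three files should have the same apparent size and the same contents. How much space never.img actually occupies depends on the filesystem and the cp version, so no particular allocated figure is claimed here.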
