Histograms and Histogram Plotting

From Wikipedia.org

In statistics, a histogram is a graphical display of tabulated frequencies, shown as bars. It shows what proportion of cases fall into each of several categories. The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent. The intervals (or bands) should ideally be of the same size ^[1].

Histograms are used to plot density. When the bar heights are scaled as densities, the total area of a histogram equals 1. If the lengths of the intervals on the x-axis are all 1, then a histogram is identical to a relative frequency plot.

The word histogram is derived from the Greek histos 'anything set upright' (as the masts of a ship, the bar of a loom, or the vertical bars of a histogram); and gramma 'drawing, record, writing'. The histogram is one of the seven basic tools of quality control, which also include the Pareto chart, check sheet, control chart, cause-and-effect diagram, flowchart, and scatter diagram. Kernel smoothing techniques generalize the histogram: they construct a smooth probability density function from the supplied data.

Activities and demonstrations

The SOCR resource pages contain a number of hands-on interactive activities demonstrating the concept of a histogram, histogram construction and manipulation using Java applets and charts.

Mathematical definition

In a more general mathematical sense, a histogram is a mapping $m_i$ that counts the number of observations that fall into various disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram. Thus, if we let $n$ be the total number of observations and $k$ be the total number of bins, the histogram $m_i$ meets the following conditions:

$n = \sum_{i=1}^{k} m_i.$
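The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `histogram_counts`, the sample data, and the convention that the last bin includes its right edge are all assumptions made for the example.

```python
from bisect import bisect_right

def histogram_counts(data, edges):
    """Return m, where m[i] counts observations in [edges[i], edges[i+1]).

    Assumption for this sketch: the last bin also includes its right
    edge, and observations outside the edges are ignored.
    """
    k = len(edges) - 1
    m = [0] * k
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # falls outside every bin
        # bisect_right finds the first edge greater than x
        i = min(bisect_right(edges, x) - 1, k - 1)
        m[i] += 1
    return m

data = [1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 0.4, 3.3]  # hypothetical observations
edges = [0, 1, 2, 3, 4, 5]                        # k = 5 bins of width 1
m = histogram_counts(data, edges)
print(m)                      # [1, 1, 2, 2, 2]
print(sum(m) == len(data))    # True: the counts add up to n
```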

Cumulative histogram

A cumulative histogram is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin. That is, the cumulative histogram $M_i$ of a histogram $m_i$ is defined as:

$M_i = \sum_{j=1}^{i} m_j.$
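As a small sketch of this definition, the running sum can be taken with `itertools.accumulate`. The bin counts `m` below are made-up values used only for illustration.

```python
from itertools import accumulate

m = [1, 1, 2, 2, 2]          # hypothetical bin counts m_1..m_k
M = list(accumulate(m))      # M_i = m_1 + ... + m_i
print(M)                     # [1, 2, 4, 6, 8]
print(M[-1] == sum(m))       # True: the last cumulative count equals n
```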

Number of bins and width

There is no "best" number of bins, and different bin sizes can reveal different features of the data. Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. You should always experiment with bin widths before choosing one (or more) that illustrate the salient features in your data.

The number of bins $k$ can be calculated directly, or from a suggested bin width $h$:

$k = \left\lceil \frac{\max x - \min x}{h} \right\rceil.$

The braces indicate the ceiling function.

Sturges' formula^[2]: $k = \lceil \log_2 n + 1 \rceil$,

which implicitly bases the bin sizes on the range of the data, and can perform poorly if $n < 30$.

Scott's choice^[3]: $h = \frac{3.5 \sigma}{n^{1/3}}$,

where $\sigma$ is the sample standard deviation.

Freedman-Diaconis' choice^[4]: $h = 2 \frac{\operatorname{IQR}(x)}{n^{1/3}}$,

which is based on the interquartile range.
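The three rules above can be tried out with a short Python sketch. The function names and sample data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which is only one of several common conventions for the interquartile range.

```python
import math
import statistics

def bins_from_width(data, h):
    """k = ceil((max x - min x) / h)."""
    return math.ceil((max(data) - min(data)) / h)

def sturges_bins(n):
    """Sturges' formula: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def scott_width(data):
    """Scott's choice: h = 3.5 * sigma / n^(1/3), sigma = sample std dev."""
    return 3.5 * statistics.stdev(data) / len(data) ** (1 / 3)

def freedman_diaconis_width(data):
    """Freedman-Diaconis' choice: h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile convention assumed
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

data = list(range(1, 65))  # hypothetical sample, n = 64
print(sturges_bins(len(data)))                                   # 7
print(bins_from_width(data, scott_width(data)))
print(bins_from_width(data, freedman_diaconis_width(data)))
```

Comparing the bin counts the rules suggest for the same data is a quick way to follow the advice above about experimenting with bin widths.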

Continuous data

The idea of a histogram can be generalized to continuous data. Let $f \in L^1(R)$ (see Lebesgue space), then the cumulative histogram operator $H$ can be defined by:

$H(f)(y) = \mu\{x : f(x) \le y\},$

the measure of the set of points where $f$ does not exceed $y$. Its derivative with respect to $y$, $h(f)$, is the histogram. For a continuous $f$ with only finitely many intervals of monotony this can be rewritten as

$h(f)(y) = \sum_{\xi \in \{x : f(x)=y\}} \frac{1}{|f'(\xi)|}.$

$h(f)(y)$ is undefined if $y$ is the value of a stationary point.
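
As a worked instance of the sum formula, consider the assumed example $f(x) = x^2$ on $[0, 1]$, taking the cumulative histogram to be the measure of the set where $f \le y$ (an assumption here, consistent with $h$ being its derivative). For $0 < y \le 1$:

$H(f)(y) = \mu\{x \in [0,1] : x^2 \le y\} = \sqrt{y},$

$h(f)(y) = \frac{d}{dy} \sqrt{y} = \frac{1}{2\sqrt{y}}.$

The sum formula gives the same result: the only solution of $f(\xi) = y$ in $[0, 1]$ is $\xi = \sqrt{y}$, and $f'(\xi) = 2\sqrt{y}$, so $h(f)(y) = 1/|f'(\xi)| = \frac{1}{2\sqrt{y}}$. At $y = 0$ the point $x = 0$ is stationary, and $h(f)(0)$ is undefined.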

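What is a histogram?

A histogram, sometimes also called a column chart, is a statistical chart that uses a series of vertical bars of varying height to show how data is distributed.

Suppose we have a pile of coins, as in the figure below, and we want to know how much money there is in total.

We could of course count them one by one, but with many coins it is easy to lose track. So we first sort the coins by type and then count how many there are of each type.

Plotting the result of that count gives a histogram. In the figure below, the horizontal axis shows the coin denomination (Kind of Coins) and the vertical axis shows the number of coins (Number of Coins).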

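Histogram of an image

Take a grayscale image as an example. Suppose the image contains only the 8 gray levels 0, 1, 2, 3, 4, 5, 6, 7, where 0 is black, 7 is white, and the other numbers are shades of gray in between.

The result of the count is shown below: the horizontal axis shows the gray level (0~7) and the vertical axis shows the number of pixels at each gray level.

The same histogram as displayed in Photoshop (PS).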

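Histogram statistics

Photoshop CS provides a dynamic Histogram panel; in versions before CS the histogram is viewed through Image > Histogram.

The horizontal axis shows the brightness value (0~255) and the vertical axis shows the number of pixels at each brightness value.

Pixels - the size of the image, that is, its total number of pixels. [5*3=15]

The Level, Count, and Percentile fields show the statistics for the horizontal-axis position under the mouse pointer.

Level - the brightness value at the pointer position; brightness values range from 0 to 255. [181]

Count - the number of pixels at the pointer position. [4]

Percentile - the number of pixels from the left edge up to and including the pointer position, divided by the total number of pixels. [(1+2+1+2+3+4)/15 = 13/15 = 0.8667 = 86.67%]

When you drag the mouse to select a range of the histogram, Level, Count, and Percentile show the statistics for the selected range.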
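The Count and Percentile readings can be reproduced with a short Python sketch. The 15-pixel histogram below is hypothetical: only the counts 1, 2, 1, 2, 3, 4 and the level 181 come from the bracketed example, while the other levels and the final entry are invented so that the total is 15. The function name `level_stats` is also an assumption.

```python
# Hypothetical 15-pixel histogram: brightness level -> pixel count.
# Only the counts up to level 181 follow the bracketed example above;
# the other levels and the last entry are made up for illustration.
counts_by_level = {20: 1, 60: 2, 90: 1, 120: 2, 150: 3, 181: 4, 220: 2}

table = [0] * 256
for level, count in counts_by_level.items():
    table[level] = count

def level_stats(table, level):
    """Return (count, percentile) for the given brightness level."""
    total = sum(table)
    count = table[level]
    # Pixels from the left edge up to and including this level.
    percentile = sum(table[: level + 1]) / total
    return count, percentile

count, percentile = level_stats(table, 181)
print(count, f"{percentile:.2%}")  # 4 86.67%
```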

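Here is a simple example to explain Mean, Median, and Std Dev.

Suppose image A has only 4 pixels, with brightness values 200, 50, 100, and 200.

Mean (arithmetic mean, average) - the average brightness of the image; above 128 the image is on the bright side, below 128 on the dark side. The mean is computed as: total brightness of the image ÷ total number of pixels.

Mean formula:

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

In the formula, $\bar{X}$ (read "X-bar") is the mean of X, ∑ (read "sigma") denotes summation, and n is the number of values of X.

[Mean of image A = (200+50+100+200)/4 = 550/4 = 137.5]

Median - the value in the middle position once all pixel brightness values are sorted in ascending order. (With an even number of pixels there are two middle values; the earlier, smaller one is taken.)

[Median of image A: the sorted brightness values are 50<=100<=200<=200; 100 and 200 are in the middle, and the earlier one, 100, is taken as the median.]

Std Dev (standard deviation) - how far the brightness values of the image's pixels deviate from the mean. The smaller the standard deviation, the less the brightness varies across the image; the larger it is, the more it varies.

Standard deviation formula:

$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$

Standard deviation of image A (the mean is known to be 137.5):

Sample variance S^2 = ((200-137.5)^2+(50-137.5)^2+(100-137.5)^2+(200-137.5)^2)/(4-1) = (62.5^2+(-87.5)^2+(-37.5)^2+62.5^2)/3 = (3906.25+7656.25+1406.25+3906.25)/3 = 16875/3 = 5625

Standard deviation S = square root of 5625 = 75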
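The three figures for image A can be checked with Python's standard `statistics` module. This is only a sketch of the arithmetic above, not a statement about how Photoshop computes them; `median_low` is used because the text takes the smaller of the two middle values, and `stdev` because the worked example divides by n-1.

```python
import statistics

pixels = [200, 50, 100, 200]  # brightness values of image A

mean = statistics.mean(pixels)          # arithmetic mean
median = statistics.median_low(pixels)  # smaller of the two middle values
std_dev = statistics.stdev(pixels)      # sample standard deviation (n - 1)

print(mean, median, std_dev)  # 137.5 100 75.0
```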

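Channels

Photoshop can display the histogram for different channels: RGB, Red, Green, Blue, Luminosity, and Colors. Each one tallies different color data.

The following example shows what these channels mean. Suppose image B has only 3 pixels, with color values RGB(0,10,20), RGB(30,40,50), and RGB(60,70,80).

Red, Green, and Blue channels - the Red channel counts only the R value of each color. For image B it counts only the three numbers 0, 30, and 60 and draws the histogram from them. The Green and Blue channels are handled the same way.

RGB channel - counts all of the image's RGB data. For image B it counts the 9 numbers 0, 10, 20, 30, 40, 50, 60, 70, 80.

Luminosity channel - first computes each pixel's brightness with the formula Gray=0.3*R+0.59*G+0.11*B and then counts those brightness values. For image B the first pixel's brightness is 0.3*0+0.59*10+0.11*20=8.1; the other two pixels come out as 38.1 and 68.1 in the same way. The values are then rounded to the nearest integer, and the three numbers 8, 38, and 68 are counted.

[Note] Some software, GIMP for example, uses Gray=(R+G+B)/3 as the brightness formula.

Colors channel - plots the statistics of the Red, Green, Blue, and RGB channels in a single chart.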
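The values that each channel tallies for image B can be listed with a small Python sketch. The variable names are illustrative assumptions; the luminosity weights are the ones quoted above.

```python
# Image B: three RGB pixels from the example above.
pixels = [(0, 10, 20), (30, 40, 50), (60, 70, 80)]

# Red, Green, Blue channels: one component per pixel.
red = [r for r, g, b in pixels]
green = [g for r, g, b in pixels]
blue = [b for r, g, b in pixels]

# RGB channel: every component of every pixel.
rgb = [value for pixel in pixels for value in pixel]

# Luminosity channel: weighted sum, rounded to the nearest integer.
luminosity = [round(0.3 * r + 0.59 * g + 0.11 * b) for r, g, b in pixels]

print(red)         # [0, 30, 60]
print(rgb)         # [0, 10, 20, 30, 40, 50, 60, 70, 80]
print(luminosity)  # [8, 38, 68]
```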

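Cache level

Why use a cache?
When working on very large images, every operation can become slow because of the sheer number of pixels. To speed up display and refresh, Photoshop performs its calculations and display on reduced-size versions of the image.

When you load an image, Photoshop automatically generates several reduced versions of it and stores them in temporary files. These reduced images are the cache; they include versions scaled to 50%, 25%, 12.5%, and so on. When you zoom the view out, say to 25%, Photoshop updates the view using the results computed on the 25% cached image.

Each reduction ratio corresponds to a cache level. Cache level 1 is the original at 100%, level 2 is the 50% image, level 3 is the 25% image, and so on.

To set the cache level, choose Edit > Preferences > Memory and image cache. In Photoshop CS2 the default cache level is 6, which suits high-quality digital camera photos.

[Note] Versions before Photoshop CS2 have a "Use Cache for histogram" option that speeds up histogram drawing at the cost of accuracy. It is best left unchecked; otherwise, when you zoom out, Photoshop builds the histogram from the corresponding cached image.

How a cache level is computed
Each pixel is the average of four adjacent pixels of the previous level, so each level has 1/4 as many pixels as the level before it.

For example, for an 8*8 image, cache level 2 is 4*4, level 3 is 2*2, and level 4 is 1*1.

The figure below is a 2*2 pixel checkerboard image. The gray value at cache level 2 is the average of the original: (0+255+255+0)/4=510/4=127.5, which rounds to 128.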
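The averaging step described above can be sketched in Python as follows. This illustrates the 2*2 averaging idea only and is not Photoshop's actual code; the function name `next_cache_level`, the list-of-rows image format, and the round-half-up rule are assumptions.

```python
def next_cache_level(image):
    """Halve a grayscale image by averaging each 2x2 block of pixels.

    `image` is a list of rows; width and height are assumed to be even.
    """
    height, width = len(image), len(image[0])
    reduced = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = (image[y][x] + image[y][x + 1]
                     + image[y + 1][x] + image[y + 1][x + 1])
            row.append((total + 2) // 4)  # integer average, half rounds up
        reduced.append(row)
    return reduced

# The 2*2 checkerboard from the text collapses to one gray pixel.
print(next_cache_level([[0, 255], [255, 0]]))  # [[128]]

# An 8*8 checkerboard shrinks to 4*4, then 2*2, then 1*1.
level = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
sizes = []
while len(level) > 1:
    level = next_cache_level(level)
    sizes.append((len(level), len(level[0])))
print(sizes)  # [(4, 4), (2, 2), (1, 1)]
```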

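Cache level in the Histogram panel
When an image is large, Photoshop automatically picks a suitable cache level for building the histogram so that it refreshes faster.

Let us examine the cache level in the Histogram panel. Create a new 2*2 pixel image, paint a checkerboard pattern on it, and run Edit > Define Pattern.

Create a new 1000*1000 pixel image in Grayscale mode.

Fill this image with the pattern defined above using the Edit > Fill command.

Now look at the Histogram panel: a warning icon has appeared in its upper-right corner. The histogram shown is for the cache level 2 image. It tells us that the level 2 image is solid gray, with gray value 128 and 250000 pixels, exactly 1/4 of the original.

Click the warning icon and the panel updates to the histogram of the original image. The pixel count is 1000*1000=1000000, there are two gray levels, 0 and 255, and the mean is 255/2=127.50. These are the correct figures for the original, just as expected.

For typical images, building the histogram from the cache to gain speed is generally acceptable. Bear in mind, though, that the larger the brightness differences between adjacent pixels in the original, the larger the error introduced by the cached image.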

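Y-axis range of the histogram

As described earlier, the X axis of the Photoshop histogram shows the brightness value and the Y axis shows the number of pixels at each value. The X axis runs from 0 to 255; now let us look at the range of the Y axis.

Take the Blue channel of bracket.bmp as an example, and note the figures shown in the image below.

The image has 19200 pixels in total. With the mouse pointer at the far left of the histogram (level 0), Count shows the maximum, 684, so this should be the tallest bar. Moving right to level 2, Count is 526, yet the bar is still the same height. This means Photoshop has clipped part of the bar at level 0. How does Photoshop decide how much to clip?

We could give the Y axis a fixed range, for example 0~1120; the histogram then looks as shown below. Everything above the maximum of 684 is blank, and the lower part of the graph is so thin that no detail can be made out.

We could also shrink the Y-axis range to 0~135. The graph is then stretched vertically and much of the detail in its lower part becomes visible, but too much of the upper part is clipped.

A more common approach is to scale the graph so that it exactly fits the view, giving a range of 0~684 (the maximum). Most software does this, and so does Photoshop in ordinary cases.

The main drawback of that approach is that a few exceptionally tall bars make all the other bars very short, which makes their detail hard to see. Photoshop therefore clips such tall bars.

Photoshop's rule for the Y-axis range is: if the maximum count exceeds 1/64 of the total number of pixels, the upper limit of the Y axis is the total number of pixels ÷ 64; otherwise the upper limit is the maximum count.

bracket.bmp has 19200 pixels in total, and 19200/64=300. Because the maximum 684>300, the Y-axis range is 0~300.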
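The rule as stated above fits in a few lines of Python. The function name `y_axis_upper_limit` is hypothetical, and the sketch encodes only the rule described in this article, not verified Photoshop behavior.

```python
def y_axis_upper_limit(max_count, total_pixels):
    """Upper limit of the histogram's Y axis under the 1/64 rule."""
    cap = total_pixels / 64
    # Clip bars taller than 1/64 of all pixels; otherwise fit the tallest bar.
    return cap if max_count > cap else max_count

print(y_axis_upper_limit(684, 19200))  # 300.0 (the bracket.bmp example)
print(y_axis_upper_limit(200, 19200))  # 200 (no clipping needed)
```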

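Histogram algorithm

I have seen people on forums argue that the standard deviation is the most expensive part of computing a histogram, and some even suggest dropping it from the display to gain speed. In fact the standard deviation is not costly to compute; it depends on how you go about it. The methods described earlier could indeed give the impression that these statistics require additions, subtractions, multiplications, divisions, and squares for every pixel, so I think it is worth discussing how a histogram is actually computed.

To draw a histogram, first create a table that stores the number of pixels at each gray level (GrayTable); GrayTable is an array of size 256. Then go through every pixel of the image and record the pixel count for each gray level in GrayTable. This step is mandatory and cannot be skimped on. Its cost is proportional to the number of pixels, so the larger the image, the slower the computation. The only way to speed it up is to compute from the reduced image in the cache. That lowers accuracy, but for a dynamic histogram that updates in real time it is well worth doing.

Once GrayTable is ready, the mean, median, and standard deviation can all be computed from the data in GrayTable, so there is no need to touch each pixel again. However large the image, GrayTable holds only 256 numbers, so these calculations are very fast.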

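For example, suppose GrayTable holds the following data (gray level: count, as implied by the calculation below, with every level not listed having a count of 0): 0: 3, 1: 2, 2: 1, 3: 5, 255: 1.

Mean = (0*3 + 1*2 + 2*1 + 3*5 + 0 + 255*1) / (3+2+1+5+0+1) = 274/12 = 22.8

As you can see, multiplication is all it takes: gray value × count, summed up. The standard deviation is computed in a similar way. The gray values in GrayTable are already in ascending order, so the median is easy to compute as well.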
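The approach described above can be sketched in Python: one pass over the pixels to build the table, then all three statistics from the 256 counts. The function names are hypothetical, and the pixel list is an assumed image chosen to match the counts in the mean calculation (3, 2, 1, and 5 pixels at levels 0 to 3, plus one pixel at 255). The sketch uses the n-1 divisor and the lower median, as in the earlier image A example.

```python
import math

def build_gray_table(pixels):
    """Count how many pixels fall on each of the 256 gray levels."""
    table = [0] * 256
    for value in pixels:
        table[value] += 1
    return table

def table_stats(table):
    """Return (mean, median, std_dev) using only the 256 counts."""
    n = sum(table)
    mean = sum(level * count for level, count in enumerate(table)) / n
    variance = sum(count * (level - mean) ** 2
                   for level, count in enumerate(table)) / (n - 1)
    # Lower median: first level whose cumulative count reaches the middle.
    target = (n + 1) // 2
    running = 0
    median = None
    for level, count in enumerate(table):
        running += count
        if running >= target:
            median = level
            break
    return mean, median, math.sqrt(variance)

# Assumed 12-pixel image matching the counts used in the mean calculation.
pixels = [0] * 3 + [1] * 2 + [2] * 1 + [3] * 5 + [255] * 1
gray_table = build_gray_table(pixels)
mean, median, std_dev = table_stats(gray_table)
print(round(mean, 1), median)  # 22.8 2
```

Only `build_gray_table` depends on the image size; `table_stats` always loops over 256 entries.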

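GrayTable can also be used to optimize a dynamic histogram. Sometimes there is no need to recount every pixel of the image to update the histogram; it is enough to rearrange the entries of the previous histogram's GrayTable. This works only for operations on a single image that treat each pixel independently (point operations). After an invert operation, for example, the new histogram is simply the old one flipped left to right. In this way the histogram can be updated exactly and in real time, however large the image.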
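A minimal Python sketch of this idea follows, assuming a point operation is given as a function from one gray level (0~255) to another. The names `apply_point_op_to_table` and `invert`, and the sample pixels, are illustrative only.

```python
def build_gray_table(pixels):
    """Count how many pixels fall on each of the 256 gray levels."""
    table = [0] * 256
    for value in pixels:
        table[value] += 1
    return table

def apply_point_op_to_table(table, op):
    """Update a histogram for a point operation without touching pixels.

    Every pixel at `level` moves to `op(level)`, so the counts move too.
    """
    new_table = [0] * 256
    for level, count in enumerate(table):
        new_table[op(level)] += count
    return new_table

def invert(value):
    return 255 - value

pixels = [0, 0, 10, 200, 255]  # hypothetical image
old_table = build_gray_table(pixels)
new_table = apply_point_op_to_table(old_table, invert)

# Same result as recounting the inverted pixels, and a left-right flip.
print(new_table == build_gray_table([invert(p) for p in pixels]))  # True
print(new_table == old_table[::-1])                                # True
```

The update loops over 256 entries whatever the image size. When several levels map to the same output level, as with a brightness increase that clamps at 255, their counts are added together, which is why the sketch uses `+=`.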

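Almost every function in Photoshop's Image menu could use this optimization, but Photoshop does not do so yet, presumably because it would be too much trouble. High-quality digital photo processing software might consider adding it.

The text above is reposted from: 灰鹿色彩筆記 http://hi.baidu.com/graydeer Many thanks to 灰鹿 for the first-rate teaching!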

MATLAB includes a function, hist:

HIST Histogram.
    N = HIST(Y) bins the elements of Y into 10 equally spaced containers
    and returns the number of elements in each container. If Y is a
    matrix, HIST works down the columns.

    N = HIST(Y,M), where M is a scalar, uses M bins.

    N = HIST(Y,X), where X is a vector, returns the distribution of Y
    among bins with centers specified by X. The first bin includes
    data between -inf and the first center and the last bin
    includes data between the last bin and inf. Note: Use HISTC if
    it is more natural to specify bin edges instead.

    [N,X] = HIST(...) also returns the position of the bin centers in X.

    HIST(...) without output arguments produces a histogram bar plot of
    the results. The bar edges on the first and last bins may extend to
    cover the min and max of the data unless a matrix of data is supplied.

    HIST(AX,...) plots into AX instead of GCA.
