KFENCE Source Code Analysis [Repost]

Reposted from: https://www.cnblogs.com/pengdonglin137/p/16342898.html

Reference

Author

[email protected]

Kernel version

linux-5.14

Implementation Analysis

KFENCE (Kernel Electric Fence) is a low-overhead memory error detection mechanism introduced into the Linux kernel. Because its overhead is low it can be enabled on running production systems; for the same reason its detection capability is weaker than KASAN's.

  • KFENCE is a sampling-based, low-overhead memory safety error detection technique. It can detect three kinds of memory errors: use-after-free (UAF), invalid free and out-of-bounds (OOB). It currently supports x86 and ARM64, and it hooks into the slab and slub allocators.

  • KFENCE's design philosophy: given a long enough total running time, KFENCE can detect bugs on code paths that test programs in non-production environments cannot exercise sufficiently. Deploying KFENCE at scale is a quick way to accumulate that total running time.

  • Each object managed by KFENCE is placed at either the left or the right edge of its own dedicated page. The pages immediately adjacent to that page on both sides are called guard pages; their mappings are put into a protected state (the Present bit of the PTE is cleared), so any access to a guard page triggers a page fault, and KFENCE parses and reports the error from the page-fault handler.
    [image]

  • Allocation of an object from the KFENCE pool is driven by a sampling interval, which can be changed via the kernel boot parameter kfence.sample_interval. After one sampling interval has elapsed, the next object allocated from slab or slub comes from the KFENCE pool; another full sampling interval must then pass before slab/slub can allocate from the KFENCE pool again.

  • Because a static key is used, the check can be compiled away, so the performance of the slab/slub allocation fast path is unaffected whether or not KFENCE is enabled.

  • The KFENCE pool has a fixed size; once it is exhausted, no further allocations can be served from it. With the default kernel configuration the pool is 2MB, which provides up to 255 objects, each backed by its own page.
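
As a quick sanity check, the 2MB figure follows directly from the size formula quoted later in kfence_alloc_pool(); evaluated here for the default configuration (a sketch of the arithmetic, not new code):

  /* With the default CONFIG_KFENCE_NUM_OBJECTS = 255 and PAGE_SIZE = 4 KiB:
   *   (255 + 1) * 2 * 4096 = 2 MiB
   * i.e. one data page plus one guard page per object, plus one extra
   * guard-page pair at the very beginning of the pool.
   */
  #define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)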

Initialization

Block diagram of the KFENCE memory pool:
[image]

The data regions are the ones handed out to allocation requests; the fence regions are used to detect out-of-bounds accesses. The elements of the metadata array correspond one-to-one with the data regions and describe them.
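
Given this layout (and the init loop in kfence_init_pool() below), mapping a metadata index to its data page is plain address arithmetic; a minimal sketch consistent with the layout described above (the helper name is illustrative, the real code uses metadata_to_pageaddr()/addr_to_metadata()):

  /* Pages 0 and 1 of the pool are guard pages, then data and guard pages
   * alternate: object i lives in page 2*i + 2 of the pool, and its
   * right-hand guard page is page 2*i + 3.
   */
  static unsigned long object_page_addr(unsigned long pool_start, int i)
  {
          return pool_start + (2 + 2 * i) * PAGE_SIZE;
  }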

  start_kernel
  -> mm_init
  -> kfence_alloc_pool
  // Release the free pages held by memblock to the buddy allocator. Memory that memblock handed out and that has
  // not been freed yet will therefore never appear in the buddy system; it still has struct page structures backing it, though.
  -> mem_init
  -> kfence_init
  • kfence_alloc_pool [mm/kfence/core.c]
  void __init kfence_alloc_pool(void)
  {
  // If the sampling interval is 0, kfence is not initialized. The interval is set via the kernel config option CONFIG_KFENCE_SAMPLE_INTERVAL or the boot parameter kfence.sample_interval
  if (!kfence_sample_interval)
  return;
   
  // Allocate the kfence pool, of size ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE), aligned to PAGE_SIZE.
  // CONFIG_KFENCE_NUM_OBJECTS ranges from 1 to 65535.
  __kfence_pool = memblock_alloc(KFENCE_POOL_SIZE, PAGE_SIZE);
  }

At this point the buddy allocator is not usable yet, so the memory given to KFENCE lies outside the buddy system and is not managed by it; there is therefore no risk of the buddy allocator handing it out to someone else.

  • kfence_init
  void __init kfence_init(void)
  {
  /* If the sampling interval is 0, kfence stays disabled */
  if (!kfence_sample_interval)
  return;
   
  // Initialize the kfence pool
  kfence_init_pool();
   
  // kfence is now ready to work
  WRITE_ONCE(kfence_enabled, true);
  /*
  Queue the work item that periodically re-opens the kfence pool for allocation; the delay here is 0, i.e. open it immediately. See toggle_allocation_gate below.
  */
  queue_delayed_work(system_unbound_wq, &kfence_timer, 0);
   
  pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE,
  CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool,
  (void *)(__kfence_pool + KFENCE_POOL_SIZE));
  }
  • kfence_init_pool [kfence_init -> kfence_init_pool]
  static bool __init kfence_init_pool(void)
  {
  unsigned long addr = (unsigned long)__kfence_pool;
  struct page *pages;
  int i;
   
  /* On x86, check that __kfence_pool is mapped to physical memory */
  arch_kfence_init_pool();
   
  /* Get the struct page corresponding to the start address of the kfence pool */
  pages = virt_to_page(addr);
   
  for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
  if (!i || (i % 2)) // skip page 0 and all odd-numbered pages
  continue;
  /* 1. Set the slab flag in the struct page of every even page: kmem_cache_free checks whether the page backing
  the virtual address has the slab flag set, and refuses to free the object otherwise.
  2. When freeing with kfree, this flag guarantees that the path slab_free -> __slab_free -> kfence_free is taken.
  */
  __SetPageSlab(&pages[i]);
  }
   
  // Clear the Present bit in the PTEs of the first two pages, so that any CPU access to them triggers a page fault
  for (i = 0; i < 2; i++) {
  kfence_protect(addr);
  addr += PAGE_SIZE;
  }
   
  // kfence_metadata is an array of struct kfence_metadata with CONFIG_KFENCE_NUM_OBJECTS elements.
  // As can be seen here, each element of the array manages one object.
  for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
  struct kfence_metadata *meta = &kfence_metadata[i];
   
  /* Initialize metadata. */
  INIT_LIST_HEAD(&meta->list);
  raw_spin_lock_init(&meta->lock);
  meta->state = KFENCE_OBJECT_UNUSED; // the object's initial state is UNUSED
  meta->addr = addr; /* start address of the 4KB page holding the object */
  list_add_tail(&meta->list, &kfence_freelist); // add it to the global freelist
   
  // Invalidate the page-table mapping of the 4KB page that follows the object's page, to catch out-of-bounds accesses
  kfence_protect(addr + PAGE_SIZE);
   
  addr += 2 * PAGE_SIZE;
  }
   
  // The earlier memblock_alloc registered this region with kmemleak; delete that record here to avoid conflicts with later kfence_alloc calls
  kmemleak_free(__kfence_pool);
   
  return true;
  }
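
kfence_protect()/kfence_unprotect() used above simply toggle the Present bit of the 4K PTE that maps the given address. For reference, a sketch of the x86 arch helper they wrap (paraphrased from the upstream arch/x86 header; treat the details as approximate):

  /* Sketch of the x86 helper behind kfence_protect()/kfence_unprotect(). */
  static inline bool kfence_protect_page(unsigned long addr, bool protect)
  {
          unsigned int level;
          pte_t *pte = lookup_address(addr, &level); /* find the 4K PTE */

          if (WARN_ON(!pte || level != PG_LEVEL_4K))
                  return false;

          if (protect)
                  set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
          else
                  set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));

          /* Flush only this CPU's TLB, avoiding IPIs from the fault path. */
          preempt_disable();
          flush_tlb_one_kernel(addr);
          preempt_enable();
          return true;
  }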

Periodically re-opening the KFENCE pool

kfence_init also queues a delayed work item, kfence_timer, which periodically re-enables allocation from the KFENCE pool. It is implemented as follows:

  • toggle_allocation_gate
  /*
  * Set up delayed work, which will enable and disable the static key. We need to
  * use a work queue (rather than a simple timer), since enabling and disabling a
  * static key cannot be done from an interrupt.
  *
  * Note: Toggling a static branch currently causes IPIs, and here we'll end up
  * with a total of 2 IPIs to all CPUs. If this ends up a problem in future (with
  * more aggressive sampling intervals), we could get away with a variant that
  * avoids IPIs, at the cost of not immediately capturing allocations if the
  * instructions remain cached.
  */
  static struct delayed_work kfence_timer;
  static void toggle_allocation_gate(struct work_struct *work)
  {
  if (!READ_ONCE(kfence_enabled))
  return;
   
  // Periodically reset kfence_allocation_gate to 0. It acts as the gate of the kfence pool: 0 means open, non-zero means closed.
  // This guarantees that at most one allocation per interval is served from the kfence pool.
  atomic_set(&kfence_allocation_gate, 0);
  // A static key is used for performance: checking the value of kfence_allocation_gate on every allocation would be comparatively expensive
  #ifdef CONFIG_KFENCE_STATIC_KEYS
  /* Enable the static key and wait for an allocation from the kfence pool */
  static_branch_enable(&kfence_allocation_key);
   
  if (sysctl_hung_task_timeout_secs) { // minimum time before the kernel emits a hung-task warning, typically 120 seconds
  /*
  * If allocations are infrequent the wait could become very long. The timeout is therefore set to half the
  hung-task warning time, so the kernel does not warn about this task sitting in the D state for too long.
   
  The wait is woken up by either:
  1. someone allocating from kfence, which sets kfence_allocation_gate to 1 and wakes up the tasks blocked on allocation_wait, or
  2. the timeout expiring
  */
  wait_event_idle_timeout(allocation_wait, atomic_read(&kfence_allocation_gate),
  sysctl_hung_task_timeout_secs * HZ / 2);
  } else {
  /* If the hung-task timeout is 0 (i.e. unlimited), we can safely wait forever until someone allocates from kfence,
  sets kfence_allocation_gate to 1 and wakes up the tasks blocked on allocation_wait.
  */
  wait_event_idle(allocation_wait, atomic_read(&kfence_allocation_gate));
  }
   
  /* Disable the static key again, guaranteeing that __kfence_alloc is no longer entered */
  static_branch_disable(&kfence_allocation_key);
  #endif
  // Wait kfence_sample_interval milliseconds, then open the kfence pool again
  queue_delayed_work(system_unbound_wq, &kfence_timer,
  msecs_to_jiffies(kfence_sample_interval));
  }
  static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);

Memory allocation

Block diagram:
[image]

  • Entry point 1:
  kmalloc
  -> kmem_cache_alloc_trace
  -> slab_alloc
  -> return
  -> __kmalloc
  -> slab_alloc
  -> return
  • Entry point 2
  kmem_cache_alloc
  -> slab_alloc

Both paths above eventually reach slab_alloc:

  slab_alloc
  -> slab_alloc_node
  -> kfence_alloc
  -> if kfence_alloc returns NULL, fall back to the regular slub allocation path
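
For orientation, the hook sits at the very top of slab_alloc_node(); a simplified sketch of how it is wired into mm/slub.c (paraphrased, not a verbatim quote):

  /* Simplified sketch of the kfence hook in slab_alloc_node() (mm/slub.c). */
  static __always_inline void *slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
                                               int node, unsigned long addr, size_t orig_size)
  {
          void *object = kfence_alloc(s, orig_size, gfpflags);

          if (unlikely(object))
                  return object; /* served from the kfence pool */

          /* ... otherwise continue with the normal slub fast/slow path ... */
  }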
  • kfence_alloc
  static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
  {
  #ifdef CONFIG_KFENCE_STATIC_KEYS
  /* If CONFIG_KFENCE_STATIC_KEYS is enabled, take this optimized branch */
  if (static_branch_unlikely(&kfence_allocation_key))
  #else
  /* Plain check; its overhead is higher than the static-key branch */
  if (unlikely(!atomic_read(&kfence_allocation_gate)))
  #endif
  return __kfence_alloc(s, size, flags);
  return NULL;
  }
  • __kfence_alloc
  void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
  {
  /*
  The kfence pool currently only serves objects no larger than one page
  */
  if (size > PAGE_SIZE)
  return NULL;
   
  /*
  * Allocations that must come from DMA, DMA32 or HIGHMEM are not supported, because the memory attributes of the
  kfence pool may not satisfy such requirements; for example DMA usually requires uncached memory, which the kfence
  pool cannot guarantee.
  */
  if ((flags & GFP_ZONEMASK) ||
  (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32)))
  return NULL;
   
  /*
  The check below ensures that only one allocator gets through. Once it has entered, the kfence pool is closed
  again, and until the next time the gate opens every other allocator can only get NULL and therefore falls back
  to the regular slub allocator.
  */
  if (atomic_read(&kfence_allocation_gate) || atomic_inc_return(&kfence_allocation_gate) > 1)
  return NULL;
  #ifdef CONFIG_KFENCE_STATIC_KEYS
  /*
  * If any task is blocked on allocation_wait, queue a work item to wake it up (see the irq_work sketch at the end of this section)
  */
  if (waitqueue_active(&allocation_wait)) {
  /*
  * Calling wake_up() here may deadlock when allocations happen
  * from within timer code. Use an irq_work to defer it.
  */
  irq_work_queue(&wake_up_kfence_timer_work);
  }
  #endif
  // Check whether kfence is enabled
  if (!READ_ONCE(kfence_enabled))
  return NULL;
   
  // Allocate an object from the kfence pool
  return kfence_guarded_alloc(s, size, flags);
  }
  • kfence_guarded_alloc [kfence_alloc -> __kfence_alloc -> kfence_guarded_alloc]
  static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
  {
  struct kfence_metadata *meta = NULL;
  unsigned long flags;
  struct page *page;
  void *addr;
   
  // Check whether the kfence pool still has a free page
  if (!list_empty(&kfence_freelist)) {
  // Get the kfence_metadata structure describing the free page
  meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
  list_del_init(&meta->list);
  }
   
  // If meta is NULL, the kfence pool has been exhausted; the allocation has to be served by the regular slub allocator.
  if (!meta)
  return NULL;
   
  // Get the virtual start address of the free page described by meta
  meta->addr = metadata_to_pageaddr(meta);
  /* If the page is in the FREED state, restore the Present bit in its PTE so the CPU can access it again without faulting.
   
  Why check for FREED here? kfence_init_pool sets the initial state to KFENCE_OBJECT_UNUSED, meaning the page has
  never been used, and init never calls kfence_protect on it, so an UNUSED page needs no kfence_unprotect.
   
  Only once a page has been allocated and then freed is it set to FREED and kfence_protect called on it, in order to
  catch use-after-free. Such a page must of course be kfence_unprotect-ed the next time it is allocated.
  */
  if (meta->state == KFENCE_OBJECT_FREED)
  kfence_unprotect(meta->addr);
   
  /*
  * Note: for allocations made before RNG initialization, will always
  * return zero. We still benefit from enabling KFENCE as early as
  * possible, even when the RNG is not yet available, as this will allow
  * KFENCE to detect bugs due to earlier allocations. The only downside
  * is that the out-of-bounds accesses detected are deterministic for
  * such allocations.
  For allocations made before the RNG is initialized, prandom_u32_max(2) returns 0 and the object is placed at the
  start (left edge) of the page; once the RNG works, the object may randomly be placed at the right edge instead.
  */
  if (prandom_u32_max(2)) {
  /* Allocate on the "right" side, re-calculate address. */
  meta->addr += PAGE_SIZE - size;
  meta->addr = ALIGN_DOWN(meta->addr, cache->align);
  }
   
  // start address of the object
  addr = (void *)meta->addr;
   
  /*
  This function does several things:
  1. records the current task's call stack in meta->alloc_track, i.e. the allocation stack
  2. records the current task's pid there as well
  3. sets meta->state to KFENCE_OBJECT_ALLOCATED, marking the page described by meta as allocated
  */
  metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
  /* Record the current kmem_cache in meta */
  WRITE_ONCE(meta->cache, cache);
  /* Record the size of the object */
  meta->size = size;
  /* Fill everything in the page outside the size bytes used by the object with an address-derived canary pattern,
  so that out-of-bounds writes within the page can be detected at free time
  */
  for_each_canary(meta, set_canary_byte);
   
  /* Get the struct page backing this page */
  page = virt_to_page(meta->addr);
  /* Record the owning kmem_cache in the page; it is needed later at free time */
  page->slab_cache = cache;
  /* A kfence page holds only one object, so set objects to 1 */
  if (IS_ENABLED(CONFIG_SLUB))
  page->objects = 1;
  // For the slab allocator, s_mem records the address of the first object
  if (IS_ENABLED(CONFIG_SLAB))
  page->s_mem = addr;
   
  /* Memory initialization. */
   
  /*
  * We check slab_want_init_on_alloc() ourselves, rather than letting
  * SL*B do the initialization, as otherwise we might overwrite KFENCE's
  * redzone.
  */
  if (unlikely(slab_want_init_on_alloc(gfp, cache))) // returns true e.g. when __GFP_ZERO is set
  memzero_explicit(addr, size); // zero the region used by the object
  if (cache->ctor) // if the cache has a constructor
  cache->ctor(addr);
   
  /* KFENCE_COUNTER_ALLOCATED is the number of kfence objects currently allocated; it is decremented on free */
  atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
  /* KFENCE_COUNTER_ALLOCS is the total number of allocations ever served from the kfence pool; monotonically increasing */
  atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);
   
  return addr;
  }
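
The wake_up_kfence_timer_work queued by __kfence_alloc above is a tiny irq_work whose only job is to wake the delayed work sleeping on allocation_wait; a hedged sketch of what it looks like in mm/kfence/core.c (reproduced from memory of the upstream source):

  /* Sketch: irq_work used by __kfence_alloc() to wake toggle_allocation_gate(). */
  static void wake_up_kfence_timer(struct irq_work *work)
  {
          wake_up(&allocation_wait);
  }
  static DEFINE_IRQ_WORK(wake_up_kfence_timer_work, wake_up_kfence_timer);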

Freeing memory

  • Path 1:
  kfree
  -> slab_free
  -> slab_free_hook
  -> do_slab_free
  -> __slab_free
  -> kfence_free
  • Path 2
  kmem_cache_free
  -> slab_free

When memory is freed, kfence_free is eventually reached:

  • kfence_free
  static __always_inline __must_check bool kfence_free(void *addr)
  {
  // Check whether the virtual address being freed lies within the kfence pool's address range
  if (!is_kfence_address(addr))
  return false;
  __kfence_free(addr);
  return true;
  }
  • __kfence_free
  void __kfence_free(void *addr)
  {
  /*
  The meta corresponding to an object can be derived from its address: the offset of addr from the start of the
  kfence pool yields an index into the kfence_metadata array, whose entry is the meta.
  */
  struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
   
  /*
  * If meta's kmem_cache has SLAB_TYPESAFE_BY_RCU, the object cannot be freed immediately; it is freed
  asynchronously once an RCU grace period has passed. rcu_guarded_free then simply calls kfence_guarded_free.
  */
  if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU)))
  call_rcu(&meta->rcu_head, rcu_guarded_free);
  else
  kfence_guarded_free(addr, meta, false);
  }
  • kfence_guarded_free [kfence_free -> __kfence_free -> kfence_guarded_free]
  static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
  {
  struct kcsan_scoped_access assert_page_exclusive;
  unsigned long flags;
   
  raw_spin_lock_irqsave(&meta->lock, flags);
   
  // If meta is not in the ALLOCATED state, or the address does not match, this is a double free or a free whose
  // address differs from the one obtained at allocation time
  if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
  /* Invalid or double-free, bail out. */
  atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); // bump the count of memory bugs detected by kfence
  kfence_report_error((unsigned long)addr, false, NULL, meta,
  KFENCE_ERROR_INVALID_FREE);
  raw_spin_unlock_irqrestore(&meta->lock, flags);
  return;
  }
   
  /* If an OOB error was detected in the page-fault handler, unprotected_page holds the faulting address */
  if (meta->unprotected_page) {
  // zero the whole page containing the OOB address
  memzero_explicit((void *)ALIGN_DOWN(meta->unprotected_page, PAGE_SIZE), PAGE_SIZE);
  // re-protect the page containing the OOB address; the page-fault handler unprotected it at the end so that execution could continue
  kfence_protect(meta->unprotected_page);
  meta->unprotected_page = 0;
  }
   
  /* Check whether the canary pattern in the unused area of the object's page has changed, which indicates an OOB
  write inside the page. for_each_canary first checks the pattern to the left of the object and reports the first
  mismatching byte, then checks the pattern to the right of the object, likewise reporting only the first mismatch.
  */
  for_each_canary(meta, check_canary_byte);
   
  /*
  * Clear memory if init-on-free is set. While we protect the page, the
  * data is still there, and after a use-after-free is detected, we
  * unprotect the page, so the data is still accessible.
  */
  if (!zombie && unlikely(slab_want_init_on_free(meta->cache)))
  memzero_explicit(addr, meta->size);
   
  /* This function:
  1. saves the current task's call stack in meta->free_track, i.e. the free stack
  2. records the current task's pid there as well
  3. sets meta->state to KFENCE_OBJECT_FREED, marking the page as free again
  */
  metadata_update_state(meta, KFENCE_OBJECT_FREED);
   
  raw_spin_unlock_irqrestore(&meta->lock, flags);
   
  /* Protect the page again, so that use-after-free accesses are caught */
  kfence_protect((unsigned long)addr);
   
  if (!zombie) {
  /* Put meta back on the freelist */
  list_add_tail(&meta->list, &kfence_freelist);
   
  // decrement KFENCE_COUNTER_ALLOCATED, the number of kfence objects currently allocated
  atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
  // increment KFENCE_COUNTER_FREES, the monotonically increasing count of objects freed back to the kfence pool
  atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
  } else {
  /* When a kmem_cache is destroyed, its objects that have not been freed yet are counted in KFENCE_COUNTER_ZOMBIES.
  A zombie object is also free, but it can never be allocated again.
  */
  atomic_long_inc(&counters[KFENCE_COUNTER_ZOMBIES]);
  }
  }
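
Two small helpers used throughout the free and page-fault paths are is_kfence_address() and addr_to_metadata(); hedged sketches of both (based on the upstream include/linux/kfence.h and mm/kfence/core.c, details approximate):

  /* True if addr falls inside the kfence pool. */
  static __always_inline bool is_kfence_address(const void *addr)
  {
          /* The __kfence_pool check handles addr == NULL while the pool is NULL. */
          return unlikely((unsigned long)((char *)addr - __kfence_pool) < KFENCE_POOL_SIZE &&
                          __kfence_pool);
  }

  /* Map an address inside the pool back to its kfence_metadata entry. */
  static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
  {
          long index;

          if (!is_kfence_address((void *)addr))
                  return NULL;

          /*
           * Addresses in the guard pages at the edges of the pool can yield an
           * out-of-range index, which ends up reported as an "invalid access".
           */
          index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
          if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
                  return NULL;

          return &kfence_metadata[index];
  }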

Checking the canary (pattern) regions

  • for_each_canary [kfence_free -> __kfence_free -> kfence_guarded_free -> for_each_canary]
  /* __always_inline this to ensure we won't do an indirect call to fn. */
  static __always_inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
  {
  const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
  unsigned long addr;
   
  /* Check the canary region to the left of the object within its page */
  for (addr = pageaddr; addr < meta->addr; addr++) {
  if (!fn((u8 *)addr)) // on a mismatch, a kfence error report is printed and false is returned
  break;
  }
   
  /* Check the canary region to the right of the object within its page */
  for (addr = meta->addr + meta->size; addr < pageaddr + PAGE_SIZE; addr++) {
  if (!fn((u8 *)addr)) // on a mismatch, a kfence error report is printed and false is returned
  break;
  }
  }
  • check_canary_byte [kfence_free -> __kfence_free -> kfence_guarded_free -> for_each_canary -> check_canary_byte ]
  /* Check canary byte at @addr. */
  static inline bool check_canary_byte(u8 *addr)
  {
  if (likely(*addr == KFENCE_CANARY_PATTERN(addr)))
  return true;
   
  // If a byte in the unused area of the page no longer matches the expected pattern, an out-of-bounds write happened
  // inside the page itself; this kind of OOB does not trigger a page fault.
  // Increment KFENCE_COUNTER_BUGS, the count of memory bugs detected by kfence.
  atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
  kfence_report_error((unsigned long)addr, false, NULL, addr_to_metadata((unsigned long)addr),
  KFENCE_ERROR_CORRUPTION);
  return false;
  }
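
The pattern each canary byte is compared against is derived from the byte's own address, so a stray copy of the pattern elsewhere cannot accidentally match; a sketch of the helpers as defined in mm/kfence/kfence.h (reproduced from memory, hedged):

  /* Canary byte: 0xaa XOR'ed with the low 3 bits of the byte's address. */
  #define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))

  /* Passed to for_each_canary() at allocation time to fill the unused space. */
  static inline bool set_canary_byte(u8 *addr)
  {
          *addr = KFENCE_CANARY_PATTERN(addr);
          return true;
  }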

kmem_cache destruction

  kmem_cache_destroy
  -> shutdown_cache
  -> kfence_shutdown_cache
  • kfence_shutdown_cache
  void kfence_shutdown_cache(struct kmem_cache *s)
  {
  unsigned long flags;
  struct kfence_metadata *meta;
  int i;
   
  for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
  bool in_use;
   
  meta = &kfence_metadata[i];
   
  /* Skip metas that do not belong to the given kmem_cache, or whose state is not ALLOCATED
  */
  if (READ_ONCE(meta->cache) != s ||
  READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED)
  continue;
   
  raw_spin_lock_irqsave(&meta->lock, flags);
  in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED;
  raw_spin_unlock_irqrestore(&meta->lock, flags);
   
  if (in_use) {
  /*
  * This cache still has allocations, and we should not
  * release them back into the freelist so they can still
  * safely be used and retain the kernel's default
  * behaviour of keeping the allocations alive (leak the
  * cache); however, they effectively become "zombie
  * allocations" as the KFENCE objects are the only ones
  * still in use and the owning cache is being destroyed.
  *
  * We mark them freed, so that any subsequent use shows
  * more useful error messages that will include stack
  * traces of the user of the object, the original
  * allocation, and caller to shutdown_cache().
  */
  kfence_guarded_free((void *)meta->addr, meta, /*zombie=*/true);
  // With zombie set to true, the freed meta is not put back on kfence_freelist, so it can never be handed out again.
  // A zombie object is also free, but it can no longer be allocated.
  }
  }
   
  for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
  meta = &kfence_metadata[i];
   
  /* See above. */
  if (READ_ONCE(meta->cache) != s || READ_ONCE(meta->state) != KFENCE_OBJECT_FREED)
  continue;
   
  raw_spin_lock_irqsave(&meta->lock, flags);
  // Clear meta->cache, so that /sys/kernel/debug/kfence/objects shows which objects are zombies
  if (meta->cache == s && meta->state == KFENCE_OBJECT_FREED)
  meta->cache = NULL;
  raw_spin_unlock_irqrestore(&meta->lock, flags);
  }
  }

Page fault handling

  • When an out-of-bounds access touches a protected guard page, a page fault occurs.
    [image]

  • A use-after-free, i.e. accessing an object after it has been freed and before it is allocated again, also causes a page fault, because the page holding the free object was protected at free time.
    [image]

Path:

  handle_page_fault
  -> do_kern_addr_fault
  -> bad_area_nosemaphore
  -> __bad_area_nosemaphore
  -> kernelmode_fixup_or_oops
  -> page_fault_oops
  -> kfence_handle_page_fault
  • kfence_handle_page_fault
  /*
  addr is the faulting address
  is_write indicates whether the access was a write
  regs holds the CPU register context at the time of the fault
  */
  bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs)
  {
  /*
  Compute the page index of the faulting address within the kfence pool
  */
  const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
  struct kfence_metadata *to_report = NULL;
  enum kfence_error_type error_type;
  unsigned long flags;
   
  // Is the address within the kfence pool's range?
  if (!is_kfence_address((void *)addr))
  return false;
   
  // Check whether kfence has been disabled; writing 0 to /sys/module/kfence/parameters/sample_interval turns it off
  if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
  return kfence_unprotect(addr); /* ... unprotect and proceed. */
   
  // increment KFENCE_COUNTER_BUGS, the number of memory errors detected
  atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
   
  if (page_index % 2) {
  /*
  A fault on an odd-numbered page of the kfence pool means an out-of-bounds access: the odd pages were all protected
  at initialization time
  */
   
  /* This is a redzone, report a buffer overflow. */
  struct kfence_metadata *meta;
  int distance = 0;
   
  // Get the meta of the page to the left of the faulting address (odd pages never hold objects).
  meta = addr_to_metadata(addr - PAGE_SIZE);
  if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { // is the page to the left allocated?
  to_report = meta;
  /* Data race ok; distance calculation approximate.
  Distance between the faulting address and the end of the allocated object on the left
  */
  distance = addr - data_race(meta->addr + meta->size);
  }
   
  // Check the meta of the page to the right of the faulting address
  meta = addr_to_metadata(addr + PAGE_SIZE);
  if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { // is the page to the right allocated?
  /* Data race ok; distance calculation approximate.
  If to_report is NULL, the page on the left is not allocated, so the page on the right must hold the object that
  overflowed. If the left page is allocated as well, compare the distance from the faulting address to the start of
  the right-hand object with the distance computed for the left-hand object; the closer of the two is the object
  that overflowed.
  */
  if (!to_report || distance > data_race(meta->addr) - addr)
  to_report = meta;
  }
   
  // If neither the left nor the right page is allocated, kfence cannot tell what happened; it could be a UAF or an OOB
  if (!to_report)
  goto out;
   
  raw_spin_lock_irqsave(&to_report->lock, flags);
  // record the faulting address
  to_report->unprotected_page = addr;
  // the error type detected by kfence is an out-of-bounds access
  error_type = KFENCE_ERROR_OOB;
   
  /*
  * If the object was freed before we took the look we can still
  * report this as an OOB -- the report will simply show the
  * stacktrace of the free as well.
  */
  } else {
  // A fault on an even-numbered page means a UAF: the only way it can happen is that the object was freed and then
  // accessed again without being reallocated. As seen earlier, even pages are only protected after they are freed.
  to_report = addr_to_metadata(addr);
  if (!to_report)
  goto out;
   
  raw_spin_lock_irqsave(&to_report->lock, flags);
  // kfence detected a use-after-free access
  error_type = KFENCE_ERROR_UAF;
  /*
  * We may race with __kfence_alloc(), and it is possible that a
  * freed object may be reallocated. We simply report this as a
  * use-after-free, with the stack trace showing the place where
  * the object was re-allocated.
  */
  }
   
  out:
  if (to_report) {
  // report the detected memory access error
  kfence_report_error(addr, is_write, regs, to_report, error_type);
  raw_spin_unlock_irqrestore(&to_report->lock, flags);
  } else {
  /* Neither the page to the left nor the page to the right of the faulting address is allocated.
  This may be a UAF or OOB access, but we can't be sure. */
  kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID);
  }
   
  // Reaching this point means kfence does not want to bring the system down, so unprotect the faulting page and let the system keep running
  return kfence_unprotect(addr); /* Unprotect and let access proceed. */
  }
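
For context, on x86 the handler above is reached from the kernel-address fault path shown at the top of this section; a simplified sketch of the call site in arch/x86/mm/fault.c (hedged, only the KFENCE-related lines shown), where only not-present faults are handed to KFENCE:

  /* Simplified sketch of the x86 call site (arch/x86/mm/fault.c). */
  static void page_fault_oops(struct pt_regs *regs, unsigned long error_code,
                              unsigned long address)
  {
          /* ... */
          /* Only not-present faults should be handled by KFENCE. */
          if (!(error_code & X86_PF_PROT) &&
              kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs))
                  return;
          /* ... otherwise continue towards the oops ... */
  }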

Error reporting

When an erroneous memory access is detected, kfence_report_error is called to print the report.

The errors fall into the following categories:

  1. OOB detected in the page-fault handler, i.e. an access to a protected guard page: KFENCE_ERROR_OOB
    [image]

  2. OOB detected at free time, i.e. a write into the unused (canary) area of the object's own page: KFENCE_ERROR_CORRUPTION
    [image]

  3. UAF detected in the page-fault handler, i.e. an access to the page of an already freed object: KFENCE_ERROR_UAF
    [image]

  4. Invalid free detected at free time: a double free, or a free whose address differs from the one returned at allocation: KFENCE_ERROR_INVALID_FREE

  5. A memory access error detected in the page-fault handler that kfence cannot classify, for example a fault on a guard page whose neighbouring pages are both unallocated: KFENCE_ERROR_INVALID
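
These correspond to the kfence_error_type enum; a sketch of its definition (as found in mm/kfence/kfence.h, reproduced from memory):

  enum kfence_error_type {
          KFENCE_ERROR_OOB,          /* Detected an out-of-bounds access. */
          KFENCE_ERROR_UAF,          /* Detected a use-after-free access. */
          KFENCE_ERROR_CORRUPTION,   /* Detected a memory corruption on free. */
          KFENCE_ERROR_INVALID,      /* Invalid access of unknown type. */
          KFENCE_ERROR_INVALID_FREE, /* Invalid free. */
  };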

  • kfence_report_error
  /*
  address: the address that caused the memory problem
  is_write: whether the access was a write
  regs: the CPU context at the time of the page fault
  meta: the meta associated with the offending address; for an OOB into a protected guard page this is the meta of
        the object whose access overflowed
  type: the type of the memory problem
  */
   
  void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs,
  const struct kfence_metadata *meta, enum kfence_error_type type)
  {
  unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 };
  const ptrdiff_t object_index = meta ? meta - kfence_metadata : -1;
  int num_stack_entries;
  int skipnr = 0;
   
  /*
  If regs is non-NULL, we got here from a page fault; the stack derived from regs needs nothing skipped (skipnr is 0),
  because regs captures the stack exactly as it was when the exception happened.
   
  If regs is NULL, we got here from a free operation. The recorded call stack then inevitably contains kfence, slab
  and kmem_cache internals, which are of no help when analysing the problem; what matters is who called them, i.e.
  where the free was issued. Those internal frames are skipped so that the useful part of the stack is printed and
  the developer's time is saved, hence skipnr is non-zero.
  */
  if (regs) {
  /* Derive the call stack at the time of the exception from pt_regs and store it in stack_entries, up to a depth of 64 */
  num_stack_entries = stack_trace_save_regs(regs, stack_entries, KFENCE_STACK_DEPTH, 0);
  } else {
  /* Without pt_regs, record the current call stack, dropping its first entry (stack_trace_save itself) */
  num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 1);
  /* Parse the call stack to locate, as closely as possible, the caller's own code that caused the problem, skipping
  kfence, slab, kfree, kmem_cache and kmalloc related frames; this makes the report easier to act on
  */
  skipnr = get_stack_skipnr(stack_entries, num_stack_entries, &type);
  }
   
  /* Require non-NULL meta, except if KFENCE_ERROR_INVALID. */
  if (WARN_ON(type != KFENCE_ERROR_INVALID && !meta))
  return;
   
  if (meta)
  lockdep_assert_held(&meta->lock);
  /*
  * Because we may generate reports in printk-unfriendly parts of the
  * kernel, such as scheduler code, the use of printk() could deadlock.
  * Until such time that all printing code here is safe in all parts of
  * the kernel, accept the risk, and just get our message out (given the
  * system might already behave unpredictably due to the memory error).
  * As such, also disable lockdep to hide warnings, and avoid disabling
  * lockdep for the rest of the kernel.
  */
  lockdep_off();
   
  pr_err("==================================================================\n");
  /* Print report header. */
  switch (type) {
  case KFENCE_ERROR_OOB: { // OOB caused by accessing a protected guard page
   
  // If the faulting address is below the object's address, the guard page immediately to the left of the object's
  // page was accessed; otherwise it was the guard page immediately to the right
  const bool left_of_object = address < meta->addr;
   
  pr_err("BUG: KFENCE: out-of-bounds %s in %pS\n\n", get_access_type(is_write),
  (void *)stack_entries[skipnr]);
   
  // Print the access type, the faulting address, its byte offset from the object, whether it lies in the left or the right guard page, and the object index
  pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%td):\n",
  get_access_type(is_write), (void *)address,
  left_of_object ? meta->addr - address : address - meta->addr,
  left_of_object ? "left" : "right", object_index);
  break;
  }
  case KFENCE_ERROR_UAF: // the object was freed and then accessed without being allocated again
  pr_err("BUG: KFENCE: use-after-free %s in %pS\n\n", get_access_type(is_write),
  (void *)stack_entries[skipnr]);
  pr_err("Use-after-free %s at 0x%p (in kfence-#%td):\n",
  get_access_type(is_write), (void *)address, object_index);
  break;
  case KFENCE_ERROR_CORRUPTION: // the canary pattern in the unused area of the object's page was corrupted; also an OOB
  pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
  pr_err("Corrupted memory at 0x%p ", (void *)address); // the address where the pattern mismatch was found
  print_diff_canary(address, 16, meta); // show the match information for up to 16 bytes starting at the mismatching address
  pr_cont(" (in kfence-#%td):\n", object_index); // the object index
  break;
  case KFENCE_ERROR_INVALID: // an error detected in the page-fault handler that cannot be classified
  pr_err("BUG: KFENCE: invalid %s in %pS\n\n", get_access_type(is_write),
  (void *)stack_entries[skipnr]);
  pr_err("Invalid %s at 0x%p:\n", get_access_type(is_write),
  (void *)address);
  break;
  case KFENCE_ERROR_INVALID_FREE: // a double free, or a free address different from the allocated address, detected via kfence_free
  pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
  pr_err("Invalid free of 0x%p (in kfence-#%td):\n", (void *)address,
  object_index);
  break;
  }
   
  /* Print the call stack of the memory error; skipnr skips mm-internal frames that do not help with the analysis */
  stack_trace_print(stack_entries + skipnr, num_stack_entries - skipnr, 0);
   
  if (meta) {
  pr_err("\n");
  /*
  1. prints the meta's state, the object's address range, its kmem_cache and the task pid
  2. prints the call stack at which the object was allocated
  3. if the meta is in the freed state, also prints the call stack of the free and the pid of the freeing task
  */
  kfence_print_object(NULL, meta);
  }
   
  /* Print report footer. */
  pr_err("\n");
  if (no_hash_pointers && regs) // no_hash_pointers can be set to 1 via the boot parameter of the same name
  show_regs(regs); // dump the CPU registers and the call stack at the time of the page fault
  else
  dump_stack_print_info(KERN_ERR); // only brief debug information
  trace_error_report_end(ERROR_DETECTOR_KFENCE, address);
  pr_err("==================================================================\n");
   
  lockdep_on();
   
  if (panic_on_warn) // setting /proc/sys/kernel/panic_on_warn to 1 makes the system panic here
  panic("panic_on_warn set ...\n");
   
  /* We encountered a memory safety error, taint the kernel!
  With the boot parameter 'panic_on_taint=0x20', adding a taint of type TAINT_BAD_PAGE causes a panic.
  */
  add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK);
  }
  • get_stack_skipnr [kfence_report_error -> get_stack_skipnr ]

Skips the mm-internal functions at the top of the call stack.

  /*
  * Get the number of stack entries to skip to get out of MM internals. @type is
  * optional, and if set to NULL, assumes an allocation or free stack.
  */
  static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries,
  const enum kfence_error_type *type)
  {
  char buf[64];
  int skipnr, fallback = 0;
   
  if (type) {
  /* Depending on error type, find different stack entries. */
  switch (*type) {
  case KFENCE_ERROR_UAF:
  case KFENCE_ERROR_OOB:
  case KFENCE_ERROR_INVALID:
  /*
  * kfence_handle_page_fault() may be called with pt_regs
  * set to NULL; in that case we'll simply show the full
  * stack trace.
  */
  return 0;
  case KFENCE_ERROR_CORRUPTION:
  case KFENCE_ERROR_INVALID_FREE:
  break;
  }
  }
   
  for (skipnr = 0; skipnr < num_entries; skipnr++) {
  int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]);
   
  if (str_has_prefix(buf, ARCH_FUNC_PREFIX "kfence_") ||
  str_has_prefix(buf, ARCH_FUNC_PREFIX "__kfence_") ||
  !strncmp(buf, ARCH_FUNC_PREFIX "__slab_free", len)) {
  /*
  * In case of tail calls from any of the below
  * to any of the above.
  */
  fallback = skipnr + 1;
  }
   
  /* Also the *_bulk() variants by only checking prefixes. */
  if (str_has_prefix(buf, ARCH_FUNC_PREFIX "kfree") ||
  str_has_prefix(buf, ARCH_FUNC_PREFIX "kmem_cache_free") ||
  str_has_prefix(buf, ARCH_FUNC_PREFIX "__kmalloc") ||
  str_has_prefix(buf, ARCH_FUNC_PREFIX "kmem_cache_alloc"))
  goto found;
  }
  if (fallback < num_entries)
  return fallback;
  found:
  skipnr++;
  return skipnr < num_entries ? skipnr : 0;
  }
  • print_diff_canary [kfence_report_error -> print_diff_canary]
  /*
  * Show bytes at @addr that are different from the expected canary values, up to
  * @max_bytes.
   
  address: the address where the pattern mismatch was found; it may lie in the left or the right canary region, which
           can be told by comparing it with meta->addr (see the figure below)
  bytes_to_show: at most how many bytes of match information to print
  meta: the meta of the page containing the canary regions
  */
  static void print_diff_canary(unsigned long address, size_t bytes_to_show,
  const struct kfence_metadata *meta)
  {
  const unsigned long show_until_addr = address + bytes_to_show;
  const u8 *cur, *end;
   
  /* Compute the end address without running past the canary region: for the left canary region, print at most up to
  meta->addr - 1; for the right canary region, at most up to the start of the right guard page minus 1 */
  end = (const u8 *)(address < meta->addr ? min(show_until_addr, meta->addr)
  : min(show_until_addr, PAGE_ALIGN(address)));
   
  pr_cont("[");
  for (cur = (const u8 *)address; cur < end; cur++) {
  if (*cur == KFENCE_CANARY_PATTERN(cur))
  pr_cont(" ."); // 對於pattern一致的地址,輸出 '.'
  else if (no_hash_pointers) // 可以通過啓動參數no_hash_pointers來設置爲1
  pr_cont(" 0x%02x", *cur);
  else /* Do not leak kernel memory in non-debug builds. */
  pr_cont(" !"); // 對於pattern不一致的地址,輸出 '!'
  }
  pr_cont(" ]");
  }

[image]

Analysing the memory error reports

OOB errors

  • OOB caused by reading the left guard page: KFENCE_ERROR_OOB

Example:

  size = kmalloc_cache_alignment(size);
  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
  expect.addr = buf - 1;
  READ_ONCE(*expect.addr);
  KUNIT_EXPECT_TRUE(test, report_matches(&expect));
  test_free(buf);

log:

  ==================================================================
  BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xad/0x1f2 [kfence_test]
   
  # kernel stack when the fault was triggered
  Out-of-bounds read at 0x000000008e1b5d12 (1B left of kfence-#109):
  test_out_of_bounds_read+0xad/0x1f2 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#109 [0x00000000753194ac-0x000000000d237ced, size=32, cache=kmalloc-32] allocated by task 35779:
  test_alloc+0xe9/0x36f [kfence_test]
  test_out_of_bounds_read+0x86/0x1f2 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35779 Comm: kunit_try_catch Kdump: loaded Not tainted 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================
  • OOB caused by reading the right guard page: KFENCE_ERROR_OOB

Example:

  size = kmalloc_cache_alignment(size);
  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
  expect.addr = buf + size;
  READ_ONCE(*expect.addr);
  KUNIT_EXPECT_TRUE(test, report_matches(&expect));
  test_free(buf);

log:

  ==================================================================
  BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0x14a/0x1f2 [kfence_test]
   
  # call stack that triggered the fault
  Out-of-bounds read at 0x0000000002d76451 (32B right of kfence-#111):
  test_out_of_bounds_read+0x14a/0x1f2 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#111 [0x00000000432dce97-0x000000008d6138c3, size=32, cache=kmalloc-32] allocated by task 35779:
  test_alloc+0xe9/0x36f [kfence_test]
  test_out_of_bounds_read+0x140/0x1f2 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35779 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================
  • OOB caused by writing the left guard page: KFENCE_ERROR_OOB

Example:

  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
  expect.addr = buf - 1;
  WRITE_ONCE(*expect.addr, 42);

log:

  ==================================================================
  BUG: KFENCE: out-of-bounds write in test_out_of_bounds_write+0x7a/0x116 [kfence_test]
   
  # call stack that triggered the fault
  Out-of-bounds write at 0x000000003f50719f (1B left of kfence-#134):
  test_out_of_bounds_write+0x7a/0x116 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#134 [0x0000000080436418-0x0000000052b079df, size=32, cache=kmalloc-32] allocated by task 35781:
  test_alloc+0xe9/0x36f [kfence_test]
  test_out_of_bounds_write+0x65/0x116 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35781 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================

UAF

KFENCE_ERROR_UAF

Example:

  expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
  test_free(expect.addr);
  READ_ONCE(*expect.addr);

log:

  ==================================================================
  BUG: KFENCE: use-after-free read in test_use_after_free_read+0x89/0x10b [kfence_test]
   
  # call stack that triggered the UAF
  Use-after-free read at 0x0000000067fb284c (in kfence-#152):
  test_use_after_free_read+0x89/0x10b [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#152 [0x0000000067fb284c-0x00000000cd45daeb, size=32, cache=kmalloc-32] allocated by task 35783:
  test_alloc+0xe9/0x36f [kfence_test]
  test_use_after_free_read+0x63/0x10b [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that freed the object
  freed by task 35783:
  test_use_after_free_read+0x85/0x10b [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 7 PID: 35783 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================

Canary (pattern) corruption

  • Corruption of the right canary region: KFENCE_ERROR_CORRUPTION

Example:

  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
  expect.addr = buf + size;
  WRITE_ONCE(*expect.addr, 42);
  test_free(buf);

log:

  ==================================================================
  BUG: KFENCE: memory corruption in test_corruption+0x9c/0x1cb [kfence_test]
   
  # prints the mismatching address and up to 16 bytes to its right (not beyond the right canary region); '!' means mismatch, '.' means match
  Corrupted memory at 0x000000003b880c36 [ ! . . . . . . . . . . . . . . . ] (in kfence-#139):
  test_corruption+0x9c/0x1cb [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#139 [0x0000000084320c94-0x00000000ebf5c6c5, size=32, cache=kmalloc-32] allocated by task 35789:
  test_alloc+0xe9/0x36f [kfence_test]
  test_corruption+0x72/0x1cb [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35789 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================
  • Corruption of the left canary region: KFENCE_ERROR_CORRUPTION

Example:

  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
  expect.addr = buf - 1;
  WRITE_ONCE(*expect.addr, 42);
  test_free(buf);

log:

  ==================================================================
  BUG: KFENCE: memory corruption in test_corruption+0x14e/0x1cb [kfence_test]
   
  # prints the mismatching address and up to 16 bytes to its right (not beyond the left canary region); '!' means mismatch, '.' means match
  Corrupted memory at 0x00000000d7861e9d [ ! ] (in kfence-#155):
  test_corruption+0x14e/0x1cb [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  kfence-#155 [0x000000009acdf655-0x00000000008cbfb7, size=32, cache=kmalloc-32] allocated by task 35789:
  test_alloc+0xe9/0x36f [kfence_test]
  test_corruption+0x124/0x1cb [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35789 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================

Invalid frees

  • Double free: KFENCE_ERROR_INVALID_FREE

Example:

  expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
  test_free(expect.addr);
  test_free(expect.addr); /* Double-free. */

log:

  ==================================================================
  BUG: KFENCE: invalid free in test_double_free+0x9a/0x124 [kfence_test]
   
  # call stack of the double free
  Invalid free of 0x000000007fb6a8f8 (in kfence-#136):
  test_double_free+0x9a/0x124 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that allocated the object
  kfence-#136 [0x000000007fb6a8f8-0x00000000d967e9cd, size=32, cache=test] allocated by task 35786:
  test_alloc+0xdf/0x36f [kfence_test]
  test_double_free+0x63/0x124 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  # call stack that freed the object
  freed by task 35786:
  test_double_free+0x7b/0x124 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35786 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================
  • The freed address differs from the allocated address: KFENCE_ERROR_INVALID_FREE

Example:

  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
  expect.addr = buf + 1; /* Free on invalid address. */
  test_free(expect.addr); /* Invalid address free. */
  test_free(buf); /* No error. */

log:

  ==================================================================
  BUG: KFENCE: invalid free in test_invalid_addr_free+0x8b/0x12b [kfence_test]
   
  Invalid free of 0x0000000000b3e82d (in kfence-#124):
  test_invalid_addr_free+0x8b/0x12b [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  kfence-#124 [0x000000002aecf77f-0x0000000046ff045a, size=32, cache=kmalloc-32] allocated by task 35787:
  test_alloc+0xe9/0x36f [kfence_test]
  test_invalid_addr_free+0x65/0x12b [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35787 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================

Other unclassifiable memory errors

For example, a fault on a guard page whose left and right neighbouring pages are both unallocated: KFENCE_ERROR_INVALID

Example:

  READ_ONCE(__kfence_pool[10]);

log:

  ==================================================================
  BUG: KFENCE: invalid read in test_invalid_access+0x48/0xd0 [kfence_test]
   
  Invalid read at 0x0000000023713263:
  test_invalid_access+0x48/0xd0 [kfence_test]
  kunit_try_run_case+0x51/0x80
  kunit_generic_run_threadfn_adapter+0x16/0x30
  kthread+0x11a/0x140
  ret_from_fork+0x22/0x30
   
  CPU: 5 PID: 35936 Comm: kunit_try_catch Kdump: loaded Tainted: G B 5.14.0+ #4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  ==================================================================

debugfs nodes

/sys/kernel/debug/kfence provides two nodes for inspecting kfence state: objects and stats.

The stats node

  # cat stats
  enabled: 1
  currently allocated: 47
  total allocations: 2416
  total frees: 2369
  zombie allocations: 0
  total bugs: 21

Meaning

  Name                  Meaning
  enabled               whether kfence is currently enabled; it can be enabled via the boot parameter and turned off at runtime via the module parameter
  currently allocated   how many objects of the kfence pool are currently allocated
  total allocations     total number of object allocations served from the kfence pool, monotonically increasing
  total frees           total number of object frees in the kfence pool, monotonically increasing
  zombie allocations    number of still-unfreed kfence objects whose owning kmem_cache has been destroyed
  total bugs            number of memory errors detected by kfence

Implementation

  static int stats_show(struct seq_file *seq, void *v)
  {
  int i;
   
  seq_printf(seq, "enabled: %i\n", READ_ONCE(kfence_enabled));
  for (i = 0; i < KFENCE_COUNTER_COUNT; i++)
  seq_printf(seq, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i]));
   
  return 0;
  }
  DEFINE_SHOW_ATTRIBUTE(stats);

The counters it uses are defined as follows:

  /* Statistics counters for debugfs. */
  enum kfence_counter_id {
  KFENCE_COUNTER_ALLOCATED,
  KFENCE_COUNTER_ALLOCS,
  KFENCE_COUNTER_FREES,
  KFENCE_COUNTER_ZOMBIES,
  KFENCE_COUNTER_BUGS,
  KFENCE_COUNTER_COUNT,
  };
  static atomic_long_t counters[KFENCE_COUNTER_COUNT];
  static const char *const counter_names[] = {
  [KFENCE_COUNTER_ALLOCATED] = "currently allocated",
  [KFENCE_COUNTER_ALLOCS] = "total allocations",
  [KFENCE_COUNTER_FREES] = "total frees",
  [KFENCE_COUNTER_ZOMBIES] = "zombie allocations",
  [KFENCE_COUNTER_BUGS] = "total bugs",
  };

The objects node

Prints the information of every kfence meta: its current state and the recorded call stacks.

  # cat objects
  kfence-#0 [0xffff89c43b202000-0xffff89c43b202067, size=104, cache=kmalloc-128] allocated by task 8:
  set_kthread_struct+0x30/0x40
  kthread+0x2e/0x140
  ret_from_fork+0x22/0x30
  ---------------------------------
  kfence-#1 [0xffff89c43b204000-0xffff89c43b20400f, size=16, cache=kmalloc-16] allocated by task 1:
  __smpboot_create_thread.part.9+0x3c/0x120
  smpboot_create_threads+0x67/0x90
  cpuhp_invoke_callback+0x105/0x400
  cpuhp_invoke_callback_range+0x40/0x80
  _cpu_up+0xd8/0x1e0
  cpu_up+0x85/0x90
  bringup_nonboot_cpus+0x4f/0x60
  smp_init+0x26/0x74
  kernel_init_freeable+0x10e/0x246
  kernel_init+0x16/0x120
  ret_from_fork+0x22/0x30
  ---------------------------------
  ...
  kfence-#40 [0xffff89c43b252dc0-0xffff89c43b252fff, size=576, cache=inode_cache] allocated by task 531:
  alloc_inode+0x87/0xa0
  new_inode_pseudo+0xb/0x50
  create_pipe_files+0x32/0x200
  __do_pipe_flags+0x2c/0xd0
  do_pipe2+0x2d/0xb0
  __x64_sys_pipe+0x10/0x20
  do_syscall_64+0x3a/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
   
  freed by task 531:
  destroy_inode+0x3b/0x70
  __dentry_kill+0xc5/0x150
  __fput+0xd9/0x230
  task_work_run+0x74/0xb0
  exit_to_user_mode_prepare+0x191/0x1a0
  syscall_exit_to_user_mode+0x19/0x30
  do_syscall_64+0x46/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  ...
  ---------------------------------
  kfence-#254 unused
  ---------------------------------

Meaning

  • For objects that are allocated and not yet freed, only the allocation stack is shown.
  • For objects currently in the freed state, both the allocation stack and the free stack are shown. Zombie objects also count as free.
  • Objects that have never been allocated are shown as unused.
  • Zombie objects are free but can no longer be allocated, since their kmem_cache has been destroyed; their cache is therefore shown as <destroyed>.

Implementation

  static int show_object(struct seq_file *seq, void *v)
  {
  struct kfence_metadata *meta = &kfence_metadata[(long)v - 1];
  unsigned long flags;
   
  raw_spin_lock_irqsave(&meta->lock, flags);
  kfence_print_object(seq, meta);
  raw_spin_unlock_irqrestore(&meta->lock, flags);
  seq_puts(seq, "---------------------------------\n");
   
  return 0;
  }
  • kfence_print_object
  void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta)
  {
  const int size = abs(meta->size);
  const unsigned long start = meta->addr;
  const struct kmem_cache *const cache = meta->cache;
   
  lockdep_assert_held(&meta->lock);
   
  if (meta->state == KFENCE_OBJECT_UNUSED) { // a meta that has never been used
  seq_con_printf(seq, "kfence-#%td unused\n", meta - kfence_metadata);
  return;
  }
   
  seq_con_printf(seq,
  "kfence-#%td [0x%p-0x%p"
  ", size=%d, cache=%s] allocated by task %d:\n",
  meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
  (cache && cache->name) ? cache->name : "<destroyed>", meta->alloc_track.pid);
  kfence_print_stack(seq, meta, true); // print the call stack at which the object described by meta was allocated
   
  if (meta->state == KFENCE_OBJECT_FREED) { // if the object described by meta has been freed
  seq_con_printf(seq, "\nfreed by task %d:\n", meta->free_track.pid);
  kfence_print_stack(seq, meta, false); // print the call stack at which the object described by meta was freed
  }
  }

Test framework

KFENCE ships with test cases, in mm/kfence/kfence_test.c.

  static int __init kfence_test_init(void)
  {
  /* Walk all kernel tracepoints and attach a hook function to the tracepoint named "console" */
  for_each_kernel_tracepoint(register_tracepoints, NULL);
   
  /* Run the test cases */
  return __kunit_test_suites_init(kfence_test_suites);
  }
  • register_tracepoints
  static void register_tracepoints(struct tracepoint *tp, void *ignore)
  {
  check_trace_callback_type_console(probe_console);
  if (!strcmp(tp->name, "console"))
  WARN_ON(tracepoint_probe_register(tp, probe_console, NULL));
  }

When kfence_report_error prints an error report, the "console" tracepoint fires and probe_console is called back. probe_console filters the lines printed by kfence_report_error and records them in observed, which is later compared against the expected error type; a successful comparison means the test passed.

  • probe_console

Filters the error lines printed by kfence_report_error and records them in observed, for comparison against the expected error type:

  /* Probe for console output: obtains observed lines of interest. */
  static void probe_console(void *ignore, const char *buf, size_t len)
  {
  unsigned long flags;
  int nlines;
   
  spin_lock_irqsave(&observed.lock, flags);
  nlines = observed.nlines;
   
  if (strnstr(buf, "BUG: KFENCE: ", len) && strnstr(buf, "test_", len)) {
  /*
  * KFENCE report and related to the test.
  *
  * The provided @buf is not NUL-terminated; copy no more than
  * @len bytes and let strscpy() add the missing NUL-terminator.
  */
  strscpy(observed.lines[0], buf, min(len + 1, sizeof(observed.lines[0])));
  nlines = 1;
  } else if (nlines == 1 && (strnstr(buf, "at 0x", len) || strnstr(buf, "of 0x", len))) {
  strscpy(observed.lines[nlines++], buf, min(len + 1, sizeof(observed.lines[0])));
  }
   
  WRITE_ONCE(observed.nlines, nlines); /* Publish new nlines. */
  spin_unlock_irqrestore(&observed.lock, flags);
  }
  • kfence_test_suites

It lists the concrete test cases:

  #define KFENCE_KUNIT_CASE(test_name) \
  { .run_case = test_name, .name = #test_name }, \
  { .run_case = test_name, .name = #test_name "-memcache" }
   
  static struct kunit_case kfence_test_cases[] = {
  KFENCE_KUNIT_CASE(test_out_of_bounds_read),
  KFENCE_KUNIT_CASE(test_out_of_bounds_write),
  KFENCE_KUNIT_CASE(test_use_after_free_read),
  KFENCE_KUNIT_CASE(test_double_free),
  KFENCE_KUNIT_CASE(test_invalid_addr_free),
  KFENCE_KUNIT_CASE(test_corruption),
  KFENCE_KUNIT_CASE(test_free_bulk),
  KFENCE_KUNIT_CASE(test_init_on_free),
  KUNIT_CASE(test_kmalloc_aligned_oob_read),
  KUNIT_CASE(test_kmalloc_aligned_oob_write),
  KUNIT_CASE(test_shrink_memcache),
  KUNIT_CASE(test_memcache_ctor),
  KUNIT_CASE(test_invalid_access),
  KUNIT_CASE(test_gfpzero),
  KUNIT_CASE(test_memcache_typesafe_by_rcu),
  KUNIT_CASE(test_krealloc),
  KUNIT_CASE(test_memcache_alloc_bulk),
  {},
  };
   
  static struct kunit_suite kfence_test_suite = {
  .name = "kfence",
  .test_cases = kfence_test_cases,
  .init = test_init,
  .exit = test_exit,
  };
  static struct kunit_suite *kfence_test_suites[] = { &kfence_test_suite, NULL };

Taking test_out_of_bounds_read as an example:

  static void test_out_of_bounds_read(struct kunit *test)
  {
  size_t size = 32;
  struct expect_report expect = { // the expected outcome
  .type = KFENCE_ERROR_OOB, // the expected error type
  .fn = test_out_of_bounds_read, // the function expected to cause the error
  .is_write = false, // the expected access direction, a read in this case
  };
  char *buf;
   
  setup_test_cache(test, size, 0, NULL);
   
  /*
  * If we don't have our own cache, adjust based on alignment, so that we
  * actually access guard pages on either side.
  */
  if (!test_cache)
  size = kmalloc_cache_alignment(size);
   
  /* Test both sides. */
   
  // Allocate from kfence so that an access to the left guard page can be constructed; returns the object's start address
  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT);
  expect.addr = buf - 1; // the address where the OOB is expected; buf - 1 is the last byte of the left guard page
  READ_ONCE(*expect.addr); // trigger the OOB fault
  KUNIT_EXPECT_TRUE(test, report_matches(&expect)); // report_matches compares the actual error with the expected one
  test_free(buf);
   
  // Allocate from kfence so that an access to the right guard page can be constructed; returns the object's start address
  buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT);
  expect.addr = buf + size; // the address expected to fault; buf + size is the first byte of the right guard page
  READ_ONCE(*expect.addr); // trigger the OOB fault
  KUNIT_EXPECT_TRUE(test, report_matches(&expect)); // check the result
  test_free(buf);
  }
  • report_matches
  static bool report_matches(const struct expect_report *r)
  {
  bool ret = false;
  unsigned long flags;
  typeof(observed.lines) expect;
  const char *end;
  char *cur;
   
  /* Doubled-checked locking. */
  if (!report_available())
  return false;
   
  /* Generate expected report contents. */
   
  /* Title */
  cur = expect[0];
  end = &expect[0][sizeof(expect[0]) - 1];
  switch (r->type) {
  case KFENCE_ERROR_OOB:
  cur += scnprintf(cur, end - cur, "BUG: KFENCE: out-of-bounds %s",
  get_access_type(r));
  break;
  case KFENCE_ERROR_UAF:
  cur += scnprintf(cur, end - cur, "BUG: KFENCE: use-after-free %s",
  get_access_type(r));
  break;
  case KFENCE_ERROR_CORRUPTION:
  cur += scnprintf(cur, end - cur, "BUG: KFENCE: memory corruption");
  break;
  case KFENCE_ERROR_INVALID:
  cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid %s",
  get_access_type(r));
  break;
  case KFENCE_ERROR_INVALID_FREE:
  cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid free");
  break;
  }
   
  scnprintf(cur, end - cur, " in %pS", r->fn);
  /* The exact offset won't match, remove it; also strip module name. */
  cur = strchr(expect[0], '+');
  if (cur)
  *cur = '\0';
   
  /* Access information */
  cur = expect[1];
  end = &expect[1][sizeof(expect[1]) - 1];
   
  switch (r->type) {
  case KFENCE_ERROR_OOB:
  cur += scnprintf(cur, end - cur, "Out-of-bounds %s at", get_access_type(r));
  break;
  case KFENCE_ERROR_UAF:
  cur += scnprintf(cur, end - cur, "Use-after-free %s at", get_access_type(r));
  break;
  case KFENCE_ERROR_CORRUPTION:
  cur += scnprintf(cur, end - cur, "Corrupted memory at");
  break;
  case KFENCE_ERROR_INVALID:
  cur += scnprintf(cur, end - cur, "Invalid %s at", get_access_type(r));
  break;
  case KFENCE_ERROR_INVALID_FREE:
  cur += scnprintf(cur, end - cur, "Invalid free of");
  break;
  }
   
  cur += scnprintf(cur, end - cur, " 0x%p", (void *)r->addr);
   
  spin_lock_irqsave(&observed.lock, flags);
  if (!report_available())
  goto out; /* A new report is being captured. */
   
  /* Finally match expected output to what we actually observed. */
  ret = strstr(observed.lines[0], expect[0]) && strstr(observed.lines[1], expect[1]);
  out:
  spin_unlock_irqrestore(&observed.lock, flags);
  return ret;
  }

The end.
