Extracting, transforming and selecting features

Original post

2021-02-17 21:31

This section covers algorithms for working with features, roughly divided into these groups:

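Extraction: Extracting features from the data.
Transformation: Scaling, converting, or modifying features.
Selection: Picking the more important features out of a larger set of features.
Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.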

Table of Contents

Feature Extractors
- TF-IDF
- Word2Vec
- CountVectorizer
Feature Transformers
- Tokenizer
- StopWordsRemover
- n-gram
- Binarizer
- PCA (principal component analysis)
- PolynomialExpansion
- Discrete Cosine Transform (DCT)
- StringIndexer (string-to-index transformation)
- IndexToString (index-to-string transformation)
- OneHotEncoder
- VectorIndexer (indexing of categorical features in vectors)
- Interaction
- Normalizer (p-norm normalization)
- StandardScaler (standardizes each column of the feature matrix to zero mean and unit standard deviation)
- MinMaxScaler (min-max rescaling to [0, 1])
- MaxAbsScaler (rescaling by maximum absolute value to [-1, 1])
- Bucketizer (binning)
- ElementwiseProduct (Hadamard product)
- SQLTransformer
- VectorAssembler (merging columns into one feature vector)
- QuantileDiscretizer (quantile-based discretization)
- Imputer
Feature Selectors
- VectorSlicer (vector slicing)
- RFormula (R model formula)
- ChiSqSelector (chi-squared feature selection)
Locality Sensitive Hashing
- LSH Operations
  - Feature Transformation
  - Approximate Similarity Join
  - Approximate Nearest Neighbor Search
- LSH Algorithms
  - Bucketed Random Projection for Euclidean Distance
  - MinHash for Jaccard Distance

Feature Extractors

TF-IDF

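Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect how important a term is to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we use only term frequency to measure importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very often across the whole corpus, it does not carry special information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides: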

$$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},$$

where $|D|$ is the total number of documents in the corpus.
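
The definitions above can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not the Spark implementation: the function names and the toy corpus are made up for the example, documents are assumed to be already tokenized into lists of words, and the logarithm is assumed to be the natural log, since the formula does not state a base.

```python
import math
from collections import Counter


def term_frequency(doc):
    """TF(t, d): raw count of each term t in one tokenized document d."""
    return Counter(doc)


def document_frequency(corpus):
    """DF(t, D): number of documents in corpus D that contain term t."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term at most once per document
    return df


def inverse_document_frequency(corpus):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), natural log assumed."""
    n_docs = len(corpus)
    df = document_frequency(corpus)
    return {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}


# Hypothetical toy corpus: three tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
idf = inverse_document_frequency(corpus)
```

In this toy corpus "the" occurs in all three documents, so its IDF is log(4/4) = 0, while "cat" occurs in only one document and gets log(4/2) = log 2. The +1 in the numerator and denominator keeps the ratio finite for a term with a document frequency of zero.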
