Official introduction
LangChain is a framework for developing applications powered by LLMs. It enables applications that are:
- Context-aware: connect an LLM to sources of context (prompt instructions, few-shot examples, content to ground its responses, etc.)
- Reasoning: rely on an LLM to reason (e.g., decide how to answer or which actions to take based on the provided context)
The LangChain framework consists of the following parts:
- LangChain libraries: the Python and JavaScript libraries. They contain interfaces and components for a wide range of tasks, which can be composed into chains and agents.
- LangChain Templates: reference architectures for common tasks.
- LangServe: a library for deploying LangChain chains as a REST service.
- LangSmith: a developer platform for debugging, testing, evaluating, and monitoring chains built on any LLM framework, with seamless LangChain integration.
Installation
The simplest approach is to install directly with pip:
```shell
pip install langchain
```
To install the LangChain CLI and LangServe (installing langchain-cli pulls in LangServe automatically):
```shell
pip install langchain-cli
```
Calling an LLM
Basic calls
I don't have a ChatGPT API key at hand, so I will use the Google Gemini LLM key obtained earlier.
Install the integration first:
```shell
pip install --upgrade langchain-google-genai
```
Now for the first demo; you need to request an api_key from Google beforehand.
```python
from langchain_google_genai import GoogleGenerativeAI

api_key = ""
llm = GoogleGenerativeAI(model="models/text-bison-001", google_api_key=api_key)
print(
    llm.invoke(
        "What are some of the pros and cons of Python as a programming language?"
    )
)
```
Run the script and you get the LLM's response:
```
[root@dev T2Ranking]# python lang_chain_demo.py
**Pros of Python:**
* **Simplicity:** Python is a relatively easy-to-learn language, with a simple syntax that is easy to read and write. This makes it a good choice for beginners and experienced programmers alike.
* **Versatility:** Python can be used for a wide variety of applications, including web development, data science, machine learning, and artificial intelligence. This makes it a good choice for developers who want to work on a variety of projects.
* **Libraries:** Python has a large and active community of developers who have created a wide variety of libraries and frameworks that can be used to extend the functionality of the language. This makes it easy to add new features and functionality to Python applications.
* **Cross-platform:** Python is cross-platform, which means that it can be run on a variety of operating systems, including Windows, macOS, and Linux. This makes it a good choice for developers who want to develop applications that can be used on multiple platforms.
* **Open source:** Python is an open-source language, which means that it is free to use and modify. This makes it a good choice for developers who want to create custom applications or who want to contribute to the development of the language itself.
**Cons of Python:**
* **Speed:** Python is not as fast as some other programming languages, such as C or C++. This can be a disadvantage for applications that require high performance.
* **Memory usage:** Python can also be more memory-intensive than other programming languages. This can be a disadvantage for applications that need to run on devices with limited memory.
* **Lack of static typing:** Python is a dynamically typed language, which means that the type of a variable is not known until runtime. This can make it difficult to catch errors early on in the development process.
* **Lack of support for multithreading:** Python does not have built-in support for multithreading. This can make it difficult to develop applications that can take advantage of multiple processors.
* **Security:** Python is not as secure as some other programming languages. This can be a disadvantage for applications that need to handle sensitive data.
```
Streaming calls
Through the stream interface, the LLM can return results incrementally, similar to a generator's yield.
```python
import sys

from langchain_google_genai import GoogleGenerativeAI

api_key = ""
llm = GoogleGenerativeAI(model="gemini-pro", google_api_key=api_key)
for chunk in llm.stream("Tell me a short poem about snow"):
    sys.stdout.write(chunk)
    sys.stdout.flush()
```
Chains
The chain is the core concept of LangChain; let's build a basic understanding with one simple chain first.
The first chain
Adjust the demo code above:
```python
from langchain.prompts import PromptTemplate
from langchain_google_genai import GoogleGenerativeAI

api_key = ""
llm = GoogleGenerativeAI(model="gemini-pro", google_api_key=api_key)
# print(
#     llm.invoke(
#         "What are some of the pros and cons of Python as a programming language?"
#     )
# )
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | llm

question = "How much is 2+2?"
print(chain.invoke({"question": question}))
```
`template` is the prompt template: variables are declared with the `{variable}` syntax and filled in at call time via a dict. The chain itself is defined by starting with the prompt and piping it into the LLM with `|` — an unusual but easy-to-understand syntax.
Run it:
```
[root@dev T2Ranking]# python lang_chain_demo.py
2+2 is a basic arithmetic problem. The answer is 4.
```
Retrieval
To answer some questions well, we need to give the LLM additional context to draw on. LangChain encapsulates this nicely, and it is arguably the essence of the framework, so let's look closely at how it is designed.
![[Pasted image 20240227101802.png]]
Document loaders
Predictably, enterprises hold documents of every shape, so LangChain abstracts this as Document loaders (document parsers). LangChain provides more than 100 document loaders and integrates with commercial services in this space such as AirByte and Unstructured. It also supports loading many document types (HTML, PDF, code) from many locations (private S3 buckets, websites).
Text splitting
Most user questions can be answered from only a small part of a document, and the context length supported by current embedding models is generally limited, so RAG systems usually split long documents into chunks.
LangChain provides several transformation algorithms for this, plus logic optimized for specific document types (code, Markdown, etc.):
| Name | Splits On | Adds Metadata | Description |
|---|---|---|---|
| Recursive | A list of user defined characters | | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML). |
| Markdown | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown). |
| Code | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from Greg Kamradt. |
Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of text, so similar pieces of text can be found quickly via ANN (approximate nearest neighbour) search. LangChain integrates with 25 different embedding providers and methods, covering everything from open source to proprietary APIs, and exposes a standard unified interface so you can switch models as your needs change.
Vector stores
Embeddings are a staple of RAG, so vector databases for storage and ANN retrieval keep springing up like mushrooms after rain. LangChain integrates with more than 50 of them, from open-source local stores to cloud-hosted proprietary ones, so users can pick whichever fits their situation best. A standard unified interface makes it easy to switch between stores.
Retrievers
Once embeddings are in a store, retrieval is what makes them truly useful. LangChain supports a variety of retrieval algorithms, including:
- Parent Document Retriever: creates multiple embeddings per parent document; queries match the smaller chunks but return the larger parent context
- Self Query Retriever: parses a query into a semantic part plus metadata used to filter the data; the figure below illustrates this well
![[Pasted image 20240227124617.png]]
- Ensemble Retriever: use this when documents need to be retrieved from several different sources or with several different algorithms
- ...
Agents
The core idea of an agent is to use an LLM to decide on a sequence of actions and execute them. In a chain, the actions to execute are hard-coded; in an agent, the language model reasons for itself about which actions to take and in what order. Agents first took off with AutoGPT, which was wildly popular for a while.
In LangChain, an Agent can dynamically invoke chains based on user input, breaking a problem into several steps, each of which can perform operations through the provided Agent. LangChain also offers multiple types of Agents and Tools to support different application scenarios and needs.
As for how an Agent works concretely: it first receives input from the user, then decides based on that input which Tools to call to complete the task. These tools can be built-in or custom; the key is describing them in a way that is useful to the Agent. For example, if a user asks about "this week's weather", the Agent might call a weather lookup tool to get the answer, or call a calculator to compute an age.
The design of LangChain Agents also accounts for generalization and prompt control, leveraging the strong few-shot and zero-shot generalization of large LLMs with prompt control as the core foundation. This design lets a LangChain Agent generate meaningful answers from just a few prompts, without large amounts of training data, improving its practicality and efficiency.