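一、What is LlamaIndex

LlamaIndex is a data framework for LLM-based applications to ingest, structure, and access private or domain-specific data.

LlamaIndex consists of the following main capability modules:

Data connectors: ingest your private data from its native source and format, such as APIs, PDFs, and SQL databases.
Data indexes: structure and store your data in intermediate representations that are easy and efficient for LLMs to consume.
Engines: provide natural-language access to your data. For example:
- Query engines are powerful retrieval interfaces for knowledge-augmented output.
- Chat engines are conversational interfaces for multi-message, back-and-forth interactions with your data.
Data agents: LLM-powered knowledge workers augmented by tools, ranging from simple helper functions to API integrations.
Application integrations: tie LlamaIndex back into the rest of your ecosystem, for example LangChain, Flask, Docker, ChatGPT, or anything else.

References: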

https://github.com/run-llama/llama_index

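二、What problem does LlamaIndex solve

Large language models (LLMs) offer a natural-language interface between humans and data. Widely available models are pre-trained on large amounts of public data such as Wikipedia, mailing lists, textbooks, and source code. However, despite all that training data, they are not trained on your data, which may be private or specific to the problem you are trying to solve. That data may sit behind APIs, live in SQL databases, or be trapped in PDFs and slide decks.

LlamaIndex addresses this by connecting to these data sources and adding their data to what the LLM already has. This is commonly called Retrieval-Augmented Generation (RAG). RAG lets you use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more.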

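三、Key stages in building a RAG application

Five key stages in RAG will be part of any larger application you build:

Loading: getting your data from where it lives (text files, PDFs, another website, a database, or an API) into your pipeline. LlamaHub provides hundreds of connectors to choose from.
Indexing: creating a data structure that allows the data to be queried. For LLMs this nearly always means creating vector embeddings (numerical representations of the meaning of your data), along with numerous other metadata strategies that make it easy to accurately find contextually relevant data.
Storing: once your data is indexed, you will almost always want to store the index and other metadata to avoid re-indexing.
Querying: for any given indexing strategy there are many ways to use LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.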

0x1：Loading stage

1、Nodes and Documents

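A Document is a container around any data source, for example a PDF, an API output, or data retrieved from a database.

A Node is the atomic unit of data in LlamaIndex and represents a "chunk" of a source Document. Nodes carry metadata that relates them to the document they belong to and to other nodes.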
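
The relationship between a Document and its Nodes can be illustrated with a small standalone sketch. It does not use LlamaIndex's own classes: ToyDocument, ToyNode, split_into_nodes, and the 40-character chunk size are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: one container per data source."""
    doc_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyNode:
    """Hypothetical stand-in for a Node: one chunk plus links to its neighbours."""
    node_id: str
    text: str
    ref_doc_id: str                      # the document this chunk came from
    prev_id: Optional[str] = None        # previous chunk in the same document
    next_id: Optional[str] = None        # next chunk in the same document
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_nodes(doc: ToyDocument, chunk_chars: int = 40) -> List[ToyNode]:
    """Cut the document text into fixed-size chunks and link neighbouring chunks."""
    nodes: List[ToyNode] = []
    for i, start in enumerate(range(0, len(doc.text), chunk_chars)):
        nodes.append(
            ToyNode(
                node_id=f"{doc.doc_id}-{i}",
                text=doc.text[start:start + chunk_chars],
                ref_doc_id=doc.doc_id,
                metadata=dict(doc.metadata),  # each node inherits the document metadata
            )
        )
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


doc = ToyDocument("essay", "a" * 100, {"source": "essay.txt"})
nodes = split_into_nodes(doc, chunk_chars=40)
for node in nodes:
    print(node.node_id, len(node.text), node.prev_id, node.next_id)
```

Each node keeps a reference to its source document and to its neighbours, which is the kind of metadata that lets a retriever pull in surrounding context later.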

2、Connectors

A data connector (often called a Reader) ingests data from different sources and formats into Documents and Nodes.

0x2：Querying Stage

1、Retrievers

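A retriever defines how to efficiently retrieve relevant context from an index for a given query. Your retrieval strategy is key to both the relevance of the retrieved data and the efficiency with which it is retrieved.

2、Routers

A router decides which retriever is used to fetch relevant context from the knowledge base. More specifically, the RouterRetriever class selects one or more candidate retrievers to execute a query. It uses a selector to pick the best option based on each candidate's metadata and the query.

3、Node Postprocessors

A node postprocessor takes a set of retrieved nodes and applies transformation, filtering, or re-ranking logic to them.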
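
As a rough illustration of what a postprocessing step does, the sketch below filters retrieved nodes by a similarity cutoff and then re-ranks the survivors. It is plain Python rather than LlamaIndex's postprocessor classes; similarity_cutoff, rerank_by_keyword, the sample texts, and the scores are all made up for the example.

```python
from typing import List, Tuple

# A retrieved node is modelled here as (text, similarity score).
ScoredNode = Tuple[str, float]


def similarity_cutoff(nodes: List[ScoredNode], cutoff: float) -> List[ScoredNode]:
    """Filtering step: drop nodes whose similarity score is below the cutoff."""
    return [(text, score) for text, score in nodes if score >= cutoff]


def rerank_by_keyword(nodes: List[ScoredNode], keyword: str) -> List[ScoredNode]:
    """Re-ranking step: nodes containing the keyword first, then by score."""
    return sorted(
        nodes,
        key=lambda node: (keyword.lower() in node[0].lower(), node[1]),
        reverse=True,
    )


retrieved = [
    ("YC was founded in 2005", 0.82),
    ("Painting is hard", 0.41),
    ("Leaving YC meant less contact with founders", 0.78),
]

kept = similarity_cutoff(retrieved, cutoff=0.5)
final = rerank_by_keyword(kept, keyword="leaving")
for text, score in final:
    print(f"{score:.2f}  {text}")
```

Real postprocessors typically use a stronger signal for re-ranking (for example a cross-encoder model), but the shape is the same: a list of scored nodes goes in and a shorter or reordered list comes out.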

4、Response Synthesizers

A response synthesizer generates a response from an LLM, using the user query and a given set of retrieved text chunks.

References:

https://llamahub.ai/l/google_drive
https://docs.llamaindex.ai/en/stable/understanding/understanding.html

四、Installation and deployment

0x1：Installation from Pip

pip install llama-index

0x2：Local Model Setup

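1、Using and configuring LLMs

Choosing the right LLM is one of the first steps to consider when building any LLM application over private data.

LLMs are a core component of LlamaIndex. They can be used as standalone modules or plugged into other core LlamaIndex modules (indexes, retrievers, query engines). They are always used during the response synthesis step (that is, after retrieval). Depending on the index type, LLMs may also be used during index construction, insertion, and query traversal.

LlamaIndex provides a unified interface for defining LLM modules, whether from OpenAI, Hugging Face, or LangChain, so you do not have to write the boilerplate for the LLM interface yourself. The interface supports:

text completion and chat endpoints
streaming and non-streaming endpoints
synchronous and asynchronous endpoints

The following snippets show how to use LLMs in llama-index.

Using an OpenAI model: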

from llama_index.llms import OpenAI

# non-streaming
resp = OpenAI().complete("Paul Graham is ")
print(resp)

Using a model hosted on Hugging Face:

# -*- coding: utf-8 -*-

import torch
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

if __name__ == "__main__":
    system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
    - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
    - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
    - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
    - StableLM will refuse to participate in anything that could harm a human.
    """

    # This will wrap the default prompts that are internal to llama-index
    query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
    llm = HuggingFaceLLM(
        context_window=4096,
        max_new_tokens=256,
        generate_kwargs={"temperature": 0.7, "do_sample": False},
        system_prompt=system_prompt,
        query_wrapper_prompt=query_wrapper_prompt,
        tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
        model_name="StabilityAI/stablelm-tuned-alpha-3b",
        device_map="auto",
        stopping_ids=[50278, 50279, 50277, 1, 0],
        tokenizer_kwargs={"max_length": 4096},
        # uncomment this if using CUDA to reduce memory usage
        # model_kwargs={"torch_dtype": torch.float16}
    )
    service_context = ServiceContext.from_defaults(
        chunk_size=1024,
        llm=llm,
    )

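To use a custom local LLM, you only need to implement the LLM class (or the CustomLLM class for a simpler interface). You are responsible for passing the text to the model and returning the newly generated tokens. The implementation can be a local model or even a wrapper around your own API.

# -*- coding: utf-8 -*-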

from typing import Optional, List, Mapping, Any

from llama_index import ServiceContext, SimpleDirectoryReader, SummaryIndex
from llama_index.callbacks import CallbackManager
from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback


class OurLLM(CustomLLM):
    context_window: int = 3900
    num_output: int = 256
    model_name: str = "custom"
    dummy_response: str = "My response"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        return CompletionResponse(text=self.dummy_response)

    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, **kwargs: Any
    ) -> CompletionResponseGen:
        response = ""
        for token in self.dummy_response:
            response += token
            yield CompletionResponse(text=response, delta=token)


# define our LLM
llm = OurLLM()

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-base-en-v1.5"
)

# Load your data
documents = SimpleDirectoryReader("./data").load_data()
index = SummaryIndex.from_documents(documents, service_context=service_context)

# Query and print response
query_engine = index.as_query_engine()
response = query_engine.query("<query_text>")
print(response)

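With this approach you can use any LLM, whether it runs locally or on your own server. As long as the class is implemented and returns the generated tokens, it should work.

Note that the prompt helper is needed to size the prompts, because each model has a slightly different context length.

The decorator is optional, but it provides observability through callbacks on the LLM calls.

Note that you may need to adjust the internal prompts to get good performance. Even then, you should use an LLM large enough to handle the complex queries that LlamaIndex uses internally, so your results may vary.

2、Using and configuring embedding models

In LlamaIndex, embeddings represent your documents as high-dimensional numerical vectors.

Embedding models are pre-trained without supervision on massive corpora. They take text as input and return a long list of numbers (a vector) that captures the semantics of the text.

At a high level, for example, if a user asks a question about dogs, the embedding of that question will be highly similar to the embeddings of text that talks about dogs.

There are many ways to compute similarity between embeddings (dot product, cosine similarity, and so on). By default, LlamaIndex uses cosine similarity when comparing embeddings.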
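
To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented toy values (real embeddings have hundreds or thousands of dimensions), and dot and cosine_similarity are illustrative helper names, not LlamaIndex functions.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: grows with both the angle agreement and the vector lengths."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product normalised by both lengths, so it lies in [-1, 1]."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot(a, b) / norm


query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, twice the length
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query

print(dot(query, same_direction), cosine_similarity(query, same_direction))
print(dot(query, orthogonal), cosine_similarity(query, orthogonal))
```

Cosine similarity ignores vector length and only compares direction, which is why it is a common default when embeddings are not guaranteed to be normalised.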

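There are many embedding models to choose from. By default, LlamaIndex uses OpenAI's text-embedding-ada-002. LlamaIndex also supports any embedding model offered by LangChain, and provides an easily extensible base class for implementing your own embeddings.

Most commonly, the embedding model is specified in the ServiceContext object and then used by a vector index. The embedding model is used to embed the documents during index construction, and to embed any queries made later through the query engine.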

from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

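The most common use of an embedding model is to set it in the service context object and then use it to build an index and run queries. Input documents are split into nodes, and the embedding model generates an embedding for each node.

By default, LlamaIndex uses text-embedding-ada-002: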

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context to avoid passing it into other objects every time
from llama_index import set_global_service_context

set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

Then, at query time, the embedding model is used again to embed the query text.

query_engine = index.as_query_engine()

response = query_engine.query("query string")

References:

https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b
https://docs.llamaindex.ai/en/stable/api_reference/llms/huggingface.html
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/default_prompts.py
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/chat_prompts.py 
https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html

五、Building Retrieval-Augmented Generation (RAG) with a Hugging Face LLM (StableLM)

0x1：Download Data

mkdir -p 'data/paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

0x2：Load documents, build the VectorStoreIndex

Extract embedding vectors from the corpus to build a vector knowledge base.

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

0x3：Query Index

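The input query is passed through the embedding model to produce a vector in the embedding space, and a vector similarity search then finds the closest embedded chunk nodes in the vector knowledge base.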

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x4：Storing your index

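By default, the data you just loaded is stored in memory as a series of vector embeddings. You can save time (and requests to the model) by saving the embeddings to disk.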

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, StorageContext, load_index_from_storage
from llama_index.llms import HuggingFaceLLM

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
# check if storage already exists
if not os.path.exists("./storage"):
    # load the documents and create the index
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
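    # store it for later (persist() writes to "./storage" by default)
    index.storage_context.persist()
else:
    # load the existing index; pass the same service_context so the
    # reloaded index keeps using the local LLM and embedding model
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(
        storage_context, service_context=service_context
    )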

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x5：chat with LLM with the response

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

chat_engine = index.as_chat_engine()
response = chat_engine.chat("Oh interesting, tell me more.")
print(response)

References:

https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#modules
https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_stablelm.html 
https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemoLlama-Local.html

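六、Building a Q&A application

0x1：Basic approach and challenges

One of the most common LLM applications is answering questions about a set of documents. LlamaIndex offers rich support for many forms of question answering.

In general, a Q&A application over private knowledge is built as follows:

Split the documents containing private knowledge into chunks
Convert the chunks to vectors and store them in a vector store
Convert the user's question to a vector
Score the similarity between the question vector and each chunk vector in the store
Concatenate the 5 most relevant chunks with the question to form the prompt for the LLM
Return the LLM's answer to the user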
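
The steps above can be sketched end to end in plain Python. This is only a toy under stated assumptions: the "embedding" is a bag-of-words count rather than a real embedding model, the chunks are given pre-split, no LLM is called (the sketch stops once the prompt is assembled), and embed, cosine, retrieve, and build_prompt are hypothetical helper names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], top_k: int = 5) -> List[Tuple[float, str]]:
    """Embed the question, score every chunk against it, and keep the top_k best."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, retrieved: List[Tuple[float, str]]) -> str:
    """Concatenate the retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = [
    "LlamaIndex connects private data to large language models.",
    "Peanuts grow best in loose sandy soil.",
    "A vector store keeps one embedding per text chunk.",
]
question = "How does LlamaIndex use private data?"

top = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the LLM
```

In the LlamaIndex examples in this article, VectorStoreIndex.from_documents and index.as_query_engine() take care of these steps, with a real embedding model and LLM in place of the toy functions.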

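Note, however, that in real engineering practice, Q&A over private-domain data still faces considerable challenges, for the following reasons:

Many document types: doc, ppt, excel, and pdf, with pdf coming in both scanned and text versions. Doc files are relatively easy to handle, since most of the content is text with high information density, though a few mix text and images. Excel is also manageable: it is already structured data, and once merged cells are filled in programmatically, each row carries complete information. The hard ones are ppt and pdf. Slides contain many architecture diagrams, flowcharts, and display images, and pdf is much the same. As a result, the text extracted from most documents is fragmented and incomplete.
Chunking method: without custom chunking, text is split at a fixed length, with some overlap between consecutive chunks. Each chunk therefore carries incomplete semantic information, and key information already present in the text, such as headings, is ignored. The chunks to be vectorized end up with no clear topic and differ markedly from naturally formed paragraphs, which makes retrieval much harder.
Specificity of internal knowledge: LLMs and sentence-embedding models are trained on fairly general corpora, so they are weak at recognizing vertical-domain knowledge. They cannot understand what an enterprise's internal terms and abbreviations mean, which badly hurts both the accuracy of the generated vectors and the quality of the LLM output.
Casual user questions: most users write vague, general queries; the real intent stays in their head and is not fully expressed in the query. The retrieved passages then miss what the user wants, and the LLM cannot produce a suitable answer from them. For example, if a user simply asks "please generate a webshell for me", the model does not know which language or coding style is wanted, so the answer is unlikely to meet the user's needs.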
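
The fixed-length chunking with overlap described in the second item can be sketched as follows. chunk_with_overlap is a hypothetical helper that splits by character count; real splitters usually count tokens, and the sample sentence is invented.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1))

# The splitter knows nothing about sentences or headings, so boundaries
# fall wherever the character count says they should.
sentence = "Chapter 2: Storage. Indexes are persisted to disk."
for piece in chunk_with_overlap(sentence, chunk_size=20, overlap=5):
    print(repr(piece))
```

The second call shows the problem the text describes: chunk boundaries cut through words and separate the heading from the sentence it introduces.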

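The following measures can mitigate these problems:

Reprocess document content: apply customized handling to each document type in order to extract its content completely. This is mostly grunt work. Doc files are relatively easy: parsing them directly tells you what each piece of text is (heading, table, paragraph, and so on). Store each text span together with its attributes to guide later chunking. The difficulty with PDF is fully recovering images, tables, headings, and paragraphs into a text version of the document. Several open-source models can be combined for this. For layout analysis, for example, Baidu's PP-StructureV2 can detect 10 region types (Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation), unifying the OCR and text-attribute classification tasks.
Semantic chunking: once the content has been reprocessed, semantic chunking becomes fairly straightforward. We now have every text span, image, and table, plus the attributes of each text span and a description of each image. The elements of a document are organized as a tree: a document contains several headings, each heading contains subheadings, each subheading contains text, and so on. By traversing this document tree according to the relationships between elements, we can extract reasonably complete semantic paragraphs together with their headings. Some semantic paragraphs may be long, so each one is additionally summarized by an LLM. The document thus takes on a structured representation.
RAG Fusion: retrieval augmentation here borrows from the RAG Fusion technique. The principle is simple: on receiving a user query, have the LLM generate 5-10 similar queries, retrieve 5-10 chunks for each, merge all returned chunks with reciprocal rank fusion, optionally add a fine-grained reranking step, and finally concatenate the Top K chunks into the prompt. In practice its main benefits are higher recall of relevant chunks and automatic correction of typos and decomposition of long sentences in the user's query. It still does not fundamentally solve the problem of understanding user intent.
Add a follow-up question mechanism: this can be done purely through the prompt, by adding "If the user's question cannot be answered from the background knowledge, ask the user follow-up questions based on that knowledge, limited to at most 3 questions". There is little technical depth here; it relies mainly on the LLM's ability. It nevertheless improves the user experience considerably: over several guided turns users clarify their question and get a suitable answer.
Fine-tune the sentence-embedding model: this mainly addresses domain-specific vocabulary being over-weighted in a general sentence embedding. Suppose a general embedding model rarely saw the word "SAAS" during training; whenever a text chunk or user query mentions it, the whole sentence vector is skewed. For example, a user asks: "I am a SAAS user and I want to order a cloud storage service." Because SAAS is weighted so heavily, matching ignores the second clause, which is the actual need, and the results may be an introduction to SAAS, a SAAS user manual, and so on. The fine-tuning data is produced by having an LLM generate question-answer pairs for each semantically split segment; a dataset built from these pairs is used to train the embedding model so that it better understands the vertical-domain scenario.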
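
The rank-merging step of RAG Fusion can be sketched with reciprocal rank fusion (RRF), where each chunk scores the sum of 1 / (k + rank) over every ranking it appears in. The query generation and retrieval are left out; the chunk ids and rankings below are invented, reciprocal_rank_fusion is a hypothetical helper, and k = 60 is a conventional smoothing constant assumed here rather than a value taken from this article.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 3) -> List[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    A chunk ranked near the top of many lists accumulates the highest score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)[:top_n]


# One ranked list of chunk ids per generated query variant (made-up data).
rankings = [
    ["c1", "c2", "c3"],
    ["c2", "c1", "c4"],
    ["c2", "c5", "c1"],
]

fused = reciprocal_rank_fusion(rankings, k=60, top_n=3)
print(fused)
```

Chunk c2 wins because it is ranked first twice and second once, even though c1 also appears in all three lists. The fused top-N chunks are what would be concatenated into the prompt.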

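RAG is meant to reduce hallucination and give the model access to up-to-date content so that it produces suitable answers. In rigorous scenarios, precision matters more than recall. By analogy with traditional metrics, an LLM that outputs freely has high recall but low precision; constraining its output raises precision but lowers recall. The impression users get is that the model has become dumber and that something must be broken.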
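
The trade-off can be made concrete with the standard definitions: precision is the share of returned items that are relevant, and recall is the share of relevant items that were returned. The item ids below are invented and precision_recall is a hypothetical helper, so this is only an analogy for the behaviour described above, not a measurement of any model.

```python
from typing import Iterable, Tuple


def precision_recall(returned: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of returned items against the relevant set."""
    returned_set, relevant_set = set(returned), set(relevant)
    hits = len(returned_set & relevant_set)
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


relevant = ["a", "b", "c", "d"]

# Unconstrained output: covers most relevant items but also adds much noise.
loose = ["a", "b", "c", "x", "y", "z", "w", "v"]
# Constrained output: everything returned is relevant, but less is covered.
strict = ["a", "b"]

print("loose :", precision_recall(loose, relevant))
print("strict:", precision_recall(strict, relevant))
```

The constrained output is never wrong in this toy example but covers only half of what is relevant, which matches the impression that the model "got dumber".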

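0x2：Dataset preparation

I used roughly the last ten years of my own blog posts: exported as a backup from the cnblogs (博客園) admin console, then processed locally into a document corpus.

# -*- coding: utf-8 -*-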

import json

if __name__ == "__main__":
    with open("./posts.json", 'r', encoding='utf-8') as file:
        data = json.load(file)

    corpus_data = ""
    for item in data:
        corpus_data += "{0}\r\n".format(item['Body'])

    with open("./posts_corpus.json", 'w', encoding='utf-8') as file:
        file.write(corpus_data)

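0x3：Building the Q&A application

Following the basic Q&A process described above, we build a minimal Q&A application step by step. It uses my own blog posts as the private data: after RAG retrieval, the top-K results are summarized by the LLM, the final prompt is assembled, and that prompt is sent to the LLM to obtain the final answer.

1、Semantic Search

Run the simplest possible semantic search for knowledge similar to the user's question.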

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/cnblogs").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
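# bge-reranker-base is a cross-encoder reranker, not an embedding model;
# a Chinese embedding model such as bge-large-zh is assumed here instead
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-zh")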

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("請幫我生成一段php webshell，它從外部接受參數，並傳入eval執行。")
print(response)

2、Summarization

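A summarization query requires the LLM to traverse many documents to synthesize an answer. For example, a summarization query might look like:

"What is a summary of this collection of text?"
"Give me a summary of person X's experience at the company."

A summary index would traverse all the data for this. The example below instead uses a vector index with response_mode="tree_summarize", which summarizes the results of the similarity search (the top-K nearest neighbours).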

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

References:

https://docs.llamaindex.ai/en/stable/use_cases/q_and_a.html
https://blog.langchain.dev/langchain-vectara-better-together/
https://mp.weixin.qq.com/s/BlU3I6Ww3L8a0_Dxt0lztA

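七、Building a chatbot over private document data

Chatbots are another extremely popular use case for LLMs. Unlike a single question and answer, a chatbot handles multiple back-and-forth exchanges, asking for clarification or answering follow-up questions.

LlamaIndex acts as the bridge between your data and the LLM, providing tools for building knowledge-augmented chatbots and agents.

In this section we use a Data Agent to build a context-augmented chatbot. This LLM-powered agent can intelligently perform tasks over your data. The end result is a chatbot agent equipped with the full set of data interface tools that LlamaIndex provides, for answering queries about your data.

0x1：Data preparation

We will build a "10-K Chatbot" that uses the raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions about the 10-K filings.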

mkdir -p data
wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
unzip data/UBER.zip -d data
rm data/UBER.zip

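To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can integrate with Unstructured directly, which lets us convert any text into a document format that LlamaIndex can ingest.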

pip install llama-hub unstructured

We can then use UnstructuredReader to parse the HTML files into a list of Document objects.

References:

https://docs.llamaindex.ai/en/stable/use_cases/chatbots.html
https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/chatbots/building_a_chatbot.html 
https://medium.com/@jerryjliu98/how-unstructured-and-llamaindex-can-help-bring-the-power-of-llms-to-your-own-data-3657d063e30d

LlamaIndex：a data framework for your LLM applications
