Multi-Document RAG in Practice: Designing a System for Multi-File Retrieval and Cross-Document Question Answering
Reading multiple documents mirrors everyday research practice. When we are working on an idea, we usually collect all the related papers in one folder. Later, while writing up, if we get stuck explaining some point and want to borrow the original author's phrasing from one of those papers, we go dig through that reference folder. Such a folder typically looks like this:

```
references/
├── paper_A.pdf
├── paper_B.pdf
├── README.md
└── idea_notes.txt
```

We will not tackle PDF parsing today. Instead, we will first make the program read several .txt files, so that it can answer simple questions such as "What is the difference between paper1 and paper2?". To start, we need a small dataset: put a few .txt files under a data directory, each containing, say, a paper abstract:

```
data/
├── paper1.txt
├── paper2.txt
└── paper3.txt
```

Next we define the function that reads these "papers":

```python
import os

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            documents.append({"text": text, "source": filename})
    return documents
```

This gives us a list of dictionaries carrying both each paper's content and its source file. Next we split the content of each dictionary into chunks, so that it can later be stored as knowledge:

```python
def process_documents(documents):
    all_chunks = []
    for doc in documents:
        chunks = split_text(doc["text"], chunk_size=200, overlap=50)
        for c in chunks:
            all_chunks.append({"text": c, "source": doc["source"]})
    return all_chunks
```

After this step, each chunk we obtain looks like this:

```
{"text": "...", "source": "paper1.txt"}
```

Improving the RAGSystem class

With the steps above, our data no longer has the shape the earliest version assumed. Back then, context was simply one long string — for example this abstract:

Voice over IP (VoIP) steganography based on low-bit-rate speech codecs has attracted increasing attention due to its high imperceptibility and large embedding capacity, particularly in the fixed codebook (FCB) parameter domain. However, effective steganalysis remains challenging under extreme embedding conditions. At low embedding rates, steganographic artifacts are weak and sparsely distributed, making them difficult to distinguish from natural speech variations. In contrast, at high embedding rates, the recompression-based calibration process may introduce structural distortions that interfere with reliable feature extraction. To address these challenges, this paper proposes a calibration-aware cross-view steganalysis network for VoIP steganalysis (CACVAN). An embedding-rate-aware data augmentation (ERADA) strategy is first introduced to construct cross-intensity training samples, which improves the robustness of the model under embedding-rate mismatch scenarios.
Furthermore, a cross-view interaction backbone (CVIB) is designed to jointly analyze the original speech stream and its recompressed counterpart, enabling the network to capture subtle inconsistencies introduced by steganographic embedding while suppressing content-related variations. A hybrid attention refinement neck (HARN) is then employed to enhance discriminative feature responses and stabilize the modeling of sparse steganographic artifacts. Extensive experiments on public VoIP steganalysis datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches under various embedding rates and speech durations, especially in challenging scenarios involving low embedding rates and short speech segments. Moreover, the proposed framework achieves high computational efficiency and satisfies the real-time requirements of streaming VoIP steganalysis, indicating its practical applicability.

Now our knowledge base spans multiple documents, and the data is organized differently. Previously, split_text(context, chunk_size=200) returned a plain list of strings; after the new chunking step, the data looks like this instead:

```
[{"text": "...", "source": "paper1.txt"}, {"text": "...", "source": "paper2.txt"}]
```

So the earlier class methods need some adjustments. We modify build_index and ask:

```python
def build_index(self):
    texts = [c["text"] for c in self.chunks]
    if self.embeddings is None:  # cache to avoid recomputing and wasting API calls
        embeddings = [get_embedding(t) for t in texts]
        self.embeddings = np.vstack(embeddings)
    dim = self.embeddings.shape[1]
    self.index = faiss.IndexFlatL2(dim)
    self.index.add(self.embeddings)

def ask(self, question):
    retrieved = self.retrieve(question, k=self.top_k)
    # rerank on the text field
    texts = [c["text"] for c in retrieved]
    sorted_indices = self.rerank(question, texts)
    best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]
    # assemble the context, tagging each chunk with its source
    context = ""
    for c in best_chunks:
        context += f"[Source: {c['source']}]\n{c['text']}\n\n"
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Answer based only on context. Cite the source file when possible."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

That completes the main changes. Finally, adjust the main function slightly and everything can be wired together:

```python
if __name__ == "__main__":
    docs = load_documents("data")  # your folder
    chunks = process_documents(docs)
    rag = RAGSystem(chunks)
    rag.build_index()
    answer = rag.ask("What is the difference between paper1 and paper2?")
    print(answer)
```

Full project code:

```python
import ast
import os

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")
client2 = OpenAI(api_key="...", base_url="https://api.shubiaobiao.com/v1")


def get_embedding(text):
    response = client2.embeddings.create(
        model="text-embedding-3-small",  # change this if needed
        input=text,
    )
    return np.array(response.data[0].embedding, dtype="float32")


def split_text(text, chunk_size=200, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks


def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            documents.append({"text": text, "source": filename})
    return documents


def process_documents(documents):
    all_chunks = []
    for doc in documents:
        chunks = split_text(doc["text"], chunk_size=200, overlap=50)
        for c in chunks:
            all_chunks.append({"text": c, "source": doc["source"]})
    return all_chunks


class RAGSystem:
    def __init__(self, chunks, top_k=5, rerank_k=3):
        self.chunks = chunks
        self.top_k = top_k
        self.rerank_k = rerank_k
        self.index = None
        self.embeddings = None

    def build_index(self):
        texts = [c["text"] for c in self.chunks]
        if self.embeddings is None:  # cache to avoid recomputing and wasting API calls
            embeddings = [get_embedding(t) for t in texts]
            self.embeddings = np.vstack(embeddings)
        dim = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dim)
        self.index.add(self.embeddings)

    def retrieve(self, query, k=5):
        query_vec = get_embedding(query).reshape(1, -1)
        distances, indices = self.index.search(query_vec, k)
        return [self.chunks[i] for i in indices[0]]

    def rerank(self, query, chunks):
        prompt = f"""You are a ranking assistant.
Query: {query}
Rank the following passages from most relevant to least relevant.
Passages:"""
        for i, c in enumerate(chunks):
            prompt += f"\n[{i}] {c}\n"
        prompt += "\nReturn ONLY the indices in sorted order, like [2,0,1]."
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            return ast.literal_eval(response.choices[0].message.content)
        except (ValueError, SyntaxError):
            return list(range(len(chunks)))

    def ask(self, question):
        retrieved = self.retrieve(question, k=self.top_k)
        # the old, source-less way:
        # context = "\n".join(retrieved)
        # rerank on the text field
        texts = [c["text"] for c in retrieved]
        sorted_indices = self.rerank(question, texts)
        best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]
        # assemble the context, tagging each chunk with its source
        context = ""
        for c in best_chunks:
            context += f"[Source: {c['source']}]\n{c['text']}\n\n"
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Answer based only on context. Cite the source file when possible."},
                {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content


if __name__ == "__main__":
    docs = load_documents("data")  # your folder
    chunks = process_documents(docs)
    rag = RAGSystem(chunks)
    rag.build_index()
    answer = rag.ask("What is the core contribution of paper1?")
    print(answer)
```

If this article helped you, consider giving it a like. The complete code is available at https://github.com/1186141415/A-Paper-Rag-Agent
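Appendix: to make the overlap behavior of split_text concrete, here is a quick self-contained sanity check. It reuses the same sliding-window function with toy sizes chosen purely for illustration (the real pipeline uses chunk_size=200, overlap=50):

```python
def split_text(text, chunk_size=200, overlap=50):
    # Same sliding window as in the listing above: each step advances
    # by (chunk_size - overlap) characters, so consecutive chunks
    # share `overlap` characters of context.
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

demo = "abcdefghijklmnopqrst"  # 20 characters
print(split_text(demo, chunk_size=10, overlap=4))
# → ['abcdefghij', 'ghijklmnop', 'mnopqrst', 'st']
```

Note the short tail chunk "st": the window simply runs off the end of the string, so tiny final chunks can appear; in practice you may want to merge or drop them before indexing.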