Multi-Document RAG in Practice: Designing a System for Multi-File Retrieval and Cross-Document Question Answering
Reading multiple documents mirrors everyday research practice. When we are working on an idea, we usually collect all the related papers in one folder. Later, while writing up, if we get stuck explaining some point and want to borrow the original author's phrasing from one of those papers, we go dig through that reference folder. Such a folder typically looks like this:

```
references/
├── paper_A.pdf
├── paper_B.pdf
├── README.md
└── idea_notes.txt
```

We will not tackle PDF parsing today. Instead, we will first make the program read several .txt files, so that it can answer simple questions such as "What is the difference between paper1 and paper2?". To start, we need a small dataset: put a few .txt files under a data directory, each containing, say, a paper abstract:

```
data/
├── paper1.txt
├── paper2.txt
└── paper3.txt
```

Next we define the function that reads these "papers":

```python
import os

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            documents.append({"text": text, "source": filename})
    return documents
```

This gives us a list of dictionaries carrying both each paper's content and its source file. Next we split the content of each dictionary into chunks, so that it can later be stored as knowledge:

```python
def process_documents(documents):
    all_chunks = []
    for doc in documents:
        chunks = split_text(doc["text"], chunk_size=200, overlap=50)
        for c in chunks:
            all_chunks.append({"text": c, "source": doc["source"]})
    return all_chunks
```

After this step, each chunk we obtain looks like this:

```
{"text": "...", "source": "paper1.txt"}
```

Improving the RAGSystem class

With the steps above, our data no longer has the shape the earliest version assumed. Back then, context was simply one long string — for example this abstract:

Voice over IP (VoIP) steganography based on low-bit-rate speech codecs has attracted increasing attention due to its high imperceptibility and large embedding capacity, particularly in the fixed codebook (FCB) parameter domain. However, effective steganalysis remains challenging under extreme embedding conditions. At low embedding rates, steganographic artifacts are weak and sparsely distributed, making them difficult to distinguish from natural speech variations. In contrast, at high embedding rates, the recompression-based calibration process may introduce structural distortions that interfere with reliable feature extraction. To address these challenges, this paper proposes a calibration-aware cross-view steganalysis network for VoIP steganalysis (CACVAN). An embedding-rate-aware data augmentation (ERADA) strategy is first introduced to construct cross-intensity training samples, which improves the robustness of the model under embedding-rate mismatch scenarios.
Furthermore, a cross-view interaction backbone (CVIB) is designed to jointly analyze the original speech stream and its recompressed counterpart, enabling the network to capture subtle inconsistencies introduced by steganographic embedding while suppressing content-related variations. A hybrid attention refinement neck (HARN) is then employed to enhance discriminative feature responses and stabilize the modeling of sparse steganographic artifacts. Extensive experiments on public VoIP steganalysis datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches under various embedding rates and speech durations, especially in challenging scenarios involving low embedding rates and short speech segments. Moreover, the proposed framework achieves high computational efficiency and satisfies the real-time requirements of streaming VoIP steganalysis, indicating its practical applicability.

Now our knowledge base spans multiple documents, and the data is organized differently. Previously, split_text(context, chunk_size=200) returned a plain list of strings; after the new chunking step, the data looks like this instead:

```
[{"text": "...", "source": "paper1.txt"}, {"text": "...", "source": "paper2.txt"}]
```

So the earlier class methods need some adjustments. We modify build_index and ask:

```python
def build_index(self):
    texts = [c["text"] for c in self.chunks]
    if self.embeddings is None:  # cache to avoid recomputing and wasting API calls
        embeddings = [get_embedding(t) for t in texts]
        self.embeddings = np.vstack(embeddings)
    dim = self.embeddings.shape[1]
    self.index = faiss.IndexFlatL2(dim)
    self.index.add(self.embeddings)

def ask(self, question):
    retrieved = self.retrieve(question, k=self.top_k)
    # rerank on the text field
    texts = [c["text"] for c in retrieved]
    sorted_indices = self.rerank(question, texts)
    best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]
    # assemble the context, tagging each chunk with its source
    context = ""
    for c in best_chunks:
        context += f"[Source: {c['source']}]\n{c['text']}\n\n"
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Answer based only on context. Cite the source file when possible."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

That completes the main changes. Finally, adjust the main function slightly and everything can be wired together:

```python
if __name__ == "__main__":
    docs = load_documents("data")  # your folder
    chunks = process_documents(docs)
    rag = RAGSystem(chunks)
    rag.build_index()
    answer = rag.ask("What is the difference between paper1 and paper2?")
    print(answer)
```

Full project code:

```python
import ast
import os

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")
client2 = OpenAI(api_key="...", base_url="https://api.shubiaobiao.com/v1")


def get_embedding(text):
    response = client2.embeddings.create(
        model="text-embedding-3-small",  # change this if needed
        input=text,
    )
    return np.array(response.data[0].embedding, dtype="float32")


def split_text(text, chunk_size=200, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks


def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            documents.append({"text": text, "source": filename})
    return documents


def process_documents(documents):
    all_chunks = []
    for doc in documents:
        chunks = split_text(doc["text"], chunk_size=200, overlap=50)
        for c in chunks:
            all_chunks.append({"text": c, "source": doc["source"]})
    return all_chunks


class RAGSystem:
    def __init__(self, chunks, top_k=5, rerank_k=3):
        self.chunks = chunks
        self.top_k = top_k
        self.rerank_k = rerank_k
        self.index = None
        self.embeddings = None

    def build_index(self):
        texts = [c["text"] for c in self.chunks]
        if self.embeddings is None:  # cache to avoid recomputing and wasting API calls
            embeddings = [get_embedding(t) for t in texts]
            self.embeddings = np.vstack(embeddings)
        dim = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dim)
        self.index.add(self.embeddings)

    def retrieve(self, query, k=5):
        query_vec = get_embedding(query).reshape(1, -1)
        distances, indices = self.index.search(query_vec, k)
        return [self.chunks[i] for i in indices[0]]

    def rerank(self, query, chunks):
        prompt = f"""You are a ranking assistant.
Query: {query}
Rank the following passages from most relevant to least relevant.
Passages:"""
        for i, c in enumerate(chunks):
            prompt += f"\n[{i}] {c}\n"
        prompt += "\nReturn ONLY the indices in sorted order, like [2,0,1]."
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            return ast.literal_eval(response.choices[0].message.content)
        except (ValueError, SyntaxError):
            return list(range(len(chunks)))

    def ask(self, question):
        retrieved = self.retrieve(question, k=self.top_k)
        # the old, source-less way:
        # context = "\n".join(retrieved)
        # rerank on the text field
        texts = [c["text"] for c in retrieved]
        sorted_indices = self.rerank(question, texts)
        best_chunks = [retrieved[i] for i in sorted_indices[:self.rerank_k]]
        # assemble the context, tagging each chunk with its source
        context = ""
        for c in best_chunks:
            context += f"[Source: {c['source']}]\n{c['text']}\n\n"
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Answer based only on context. Cite the source file when possible."},
                {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content


if __name__ == "__main__":
    docs = load_documents("data")  # your folder
    chunks = process_documents(docs)
    rag = RAGSystem(chunks)
    rag.build_index()
    answer = rag.ask("What is the core contribution of paper1?")
    print(answer)
```

If this article helped you, consider giving it a like. The complete code is available at https://github.com/1186141415/A-Paper-Rag-Agent
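Appendix: to make the overlap behavior of split_text concrete, here is a quick self-contained sanity check. It reuses the same sliding-window function with toy sizes chosen purely for illustration (the real pipeline uses chunk_size=200, overlap=50):

```python
def split_text(text, chunk_size=200, overlap=50):
    # Same sliding window as in the listing above: each step advances
    # by (chunk_size - overlap) characters, so consecutive chunks
    # share `overlap` characters of context.
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

demo = "abcdefghijklmnopqrst"  # 20 characters
print(split_text(demo, chunk_size=10, overlap=4))
# → ['abcdefghij', 'ghijklmnop', 'mnopqrst', 'st']
```

Note the short tail chunk "st": the window simply runs off the end of the string, so tiny final chunks can appear; in practice you may want to merge or drop them before indexing.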