07.07.2025 • 4 min read

My First AI Project: A Journey of Building RAG Knowledge Base from Scratch

Project Background

I’m a beginner in AI application development. In the past, I’ve been focused on traditional frontend, backend, and toolchain development, with very limited knowledge about AI.

Recently, I’ve been working on a toolchain project and writing documentation for it. Then an idea occurred to me - I could use the MCP protocol to feed the project’s details to an AI assistant and let it help me write code.

Let’s get started! After discussing with GPT, I decided to adopt the following technology stack (a rough wiring sketch follows the list):

  • Backend Framework: FastAPI + Python - Chose FastAPI for its async capabilities and automatic API documentation generation
  • Vector Database: ChromaDB (with memory fallback) - Supports persistent storage while providing memory mode for development and testing
  • Embedding Model: Sentence Transformers - Lightweight and effective embedding model
  • Large Language Model: Local Qwen2.5-7B via Ollama - Completely local deployment, privacy protection
  • Architecture Pattern: RAG (Retrieval-Augmented Generation) - Combines document retrieval with LLM generation
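
To make this concrete, here is a rough sketch of how these pieces could be wired together at startup. It is an illustration only, not the project’s actual code: the module layout, the specific embedding model, and names like embedder, collection, and OLLAMA_URL are assumptions.

# Wiring sketch (names and model choices are placeholders, not the project's code)
import chromadb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI(title="RAG Knowledge Base")

# Embedding model: a small, CPU-friendly Sentence Transformers model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Vector store: persistent ChromaDB, with an in-memory fallback for development
try:
    client = chromadb.PersistentClient(path="./chroma_data")
except Exception:
    client = chromadb.Client()  # memory-only fallback
collection = client.get_or_create_collection("docs")

# LLM: local Qwen2.5-7B served by Ollama at its default address
OLLAMA_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "qwen2.5:7b"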

Learning Journey

Document Chunking Strategy

Initially, I wanted to vectorize entire documents directly, but given the model’s token limit and the structure of the documents, that approach wasn’t practical.

So GPT told me I had to split the documents into chunks before vectorization. My documents are written in Markdown and contain plenty of h2, h3, and h4 headers, which make natural chunk boundaries.

It took about half an hour to implement a chunking strategy based on Markdown headers rather than simple line-by-line splitting.

# Header-based chunking strategy
def create_chunks_by_headers(self, content: str, metadata: Dict) -> List[Chunk]:
    chunks = []
    lines = content.split('\n')
    current_chunk = []
    current_title = metadata.get('title', '')
    
    for line in lines:
        if line.startswith('#'):
            # Save current chunk
            if current_chunk:
                chunks.append(Chunk(
                    content='\n'.join(current_chunk),
                    metadata={**metadata, 'title': current_title}
                ))
            current_chunk = [line]
            current_title = line.lstrip('#').strip()
        else:
            current_chunk.append(line)
    
    # Save the last chunk
    if current_chunk:
        chunks.append(Chunk(
            content='\n'.join(current_chunk),
            metadata={**metadata, 'title': current_title}
        ))
    
    return chunks

Once the chunks were vectorized, I could embed an incoming query, search the vector database, and return the most similar chunks.
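
The indexing and query steps themselves are short. Below is a rough sketch of what they might look like with Sentence Transformers and ChromaDB; the model name and the chunk fields (content, metadata) are assumptions based on the snippets above.

# Sketch of indexing and similarity search (model name and field names are assumptions)
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().get_or_create_collection("docs")  # in-memory for illustration

def index_chunks(chunks):
    # Embed every chunk once and store text, vector, and metadata together
    texts = [chunk.content for chunk in chunks]
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=texts,
        embeddings=embedder.encode(texts).tolist(),
        metadatas=[chunk.metadata for chunk in chunks],
    )

def similarity_search(query: str, n_results: int = 5):
    # Embed the query and let ChromaDB return the nearest chunks
    result = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=n_results,
    )
    # Chroma returns distances, where lower means more similar
    return list(zip(result["documents"][0], result["distances"][0]))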

When I eagerly tested similarity search, the results were disappointing. Many of my query terms never appeared verbatim in the original text, so relevant chunks often failed to match.

For example, a search for “function definition” might miss a document that says “function declaration” or “how to create a function” - there are many cases where the semantics are similar but the vocabulary differs.
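
A quick way to see how the retriever scores such phrasings is to compare raw embedding similarities directly. The following is just a diagnostic sketch, assuming a small Sentence Transformers model such as all-MiniLM-L6-v2; it makes it easier to judge where retrieval thresholds should sit:

# Diagnostic sketch: how similar are different phrasings of the same idea?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "function definition"
candidates = ["function declaration", "how to create a function", "variable scope"]

# Cosine similarity between the query and each candidate phrasing
scores = util.cos_sim(model.encode(query), model.encode(candidates))
for text, score in zip(candidates, scores[0]):
    print(f"{float(score):.2f}  {text}")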

GPT told me I could use multi-round retrieval to solve this problem.

Multi-round Retrieval Improvement

Later, I learned about the concept of multi-round retrieval and decided to try it.

  1. First Round: Low threshold (0.3) broad search to capture more candidates
  2. Second Round: High threshold (0.7) refined search to filter high-quality results
  3. Merge and Deduplicate: Combine results and remove duplicates (a sketch of this helper follows the code below)

def search_with_context(self, query: str, max_results: int = 5) -> Tuple[List[SearchResult], str]:
    # First round: broad search
    broad_results = self.vector_store.search(query, max_results * 2, threshold=0.3)
    
    # Second round: refined search
    refined_results = self.vector_store.search(query, max_results, threshold=0.7)
    
    # Merge results and deduplicate
    all_results = self._merge_and_deduplicate(broad_results, refined_results)
    
    # Build context
    context = self._build_context(all_results[:max_results])
    
    return all_results[:max_results], context
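
The _merge_and_deduplicate helper isn’t shown above; a minimal sketch of the idea, assuming each SearchResult carries a chunk_id and a similarity score, would look something like this:

def _merge_and_deduplicate(self, broad_results, refined_results):
    # Assumes SearchResult exposes chunk_id and score (higher = more similar)
    best = {}
    for result in refined_results + broad_results:
        existing = best.get(result.chunk_id)
        if existing is None or result.score > existing.score:
            best[result.chunk_id] = result
    # Keep only the best copy of each chunk, ordered by similarity
    return sorted(best.values(), key=lambda r: r.score, reverse=True)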

The results did improve. They weren’t perfect, but the matching was noticeably better than before.

Smooth LLM Integration

Compared to the bumpy data preprocessing, LLM integration was quite smooth. I called the local Qwen2.5-7B model through Ollama, and with appropriate prompt templates, the results were acceptable.

def answer_question(self, context: str, query: str) -> str:
    prompt = f"""Based on the following document content, answer the question. If there's no relevant information in the documents, please state that no answer can be found.

Document content:
{context}

Question: {query}

Answer:"""
    
    return self.llm_provider.generate(prompt, context, query)

The advantage of a local model is that nothing leaves my machine, and the response speed is also acceptable.
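
Under the hood, the call to Ollama is just one HTTP request against its local REST API. A rough sketch of what the generate step can look like (the wrapper shape is an assumption; the endpoint and payload follow Ollama’s standard non-streaming API):

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "qwen2.5:7b", timeout: int = 120) -> str:
    # Non-streaming generation request to the local Ollama server
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()["response"]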

The MCP Nightmare

The most headache-inducing part was putting the MCP protocol into practice. GPT generated a lot of “dirty code” for me, including:

  • Tedious access chains with any types everywhere
  • Invalid function signatures
  • Incorrect parameter passing
  • Confused type definitions

Even worse, Cursor, the IDE I normally use, had poor support for MCP integration. After struggling for half an hour with nothing to show for it, I took the AI’s advice and fell back to plain HTTP calls instead of MCP.
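
The fallback itself is unglamorous: expose the retrieval-plus-answer flow as a plain FastAPI endpoint and let the tooling call it over HTTP. A minimal sketch, where the /ask route, the request shape, the rag_service wrapper, and the SearchResult fields are hypothetical placeholders for the pieces shown earlier:

# Hypothetical HTTP fallback; route, payload, rag_service, and SearchResult fields are placeholders
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str

@app.post("/ask")
def ask(request: AskRequest):
    # rag_service: assumed object exposing the functions shown earlier in this post
    results, context = rag_service.search_with_context(request.query)
    answer = rag_service.answer_question(context, request.query)
    return {"answer": answer, "sources": [r.metadata for r in results]}

# Any HTTP-capable client can now use the knowledge base, e.g.:
# requests.post("http://localhost:8000/ask", json={"query": "How do I define a plugin?"})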

Summary

This project gave me a deeper understanding of AI development. The RAG architecture is genuinely powerful, but the quality of data preprocessing directly determines the final results. Multi-round retrieval is a worthwhile improvement, and while the MCP protocol is a good idea, the tooling around it still needs to mature.

As an AI beginner, this experience made me realize that AI development isn’t just about calling APIs - data quality, retrieval strategies, and prompt engineering are all important. Although I encountered many pitfalls, the gains were substantial.