Send YouTube Transcripts to RAG Pipelines
Get clean YouTube transcripts for RAG indexing in seconds
Or just change youtube.com to 2outube.com in your browser
Swap 'youtube.com' for '2outube.com' in any video URL to get the full transcript instantly. Paste it straight into your RAG pipeline as a source document: no API keys, no scraping, no signup required.
The Trick
youtube.com/watch?v=VIDEO_ID
2outube.com/watch?v=VIDEO_ID
Just change 'y' to '2'
Works with any YouTube video that has captions
Using Transcripts with RAG Pipelines
Identify your source video
Find the YouTube video you want to index—tutorials, lectures, podcasts, interviews, or any video with captions. Copy the full URL from your browser address bar.
Extract the transcript via 2outube
Replace 'youtube.com' with '2outube.com' in the URL and open it. The full transcript appears on the page with speaker turns and timestamps preserved. Select all and copy the text.
Chunk and embed for your vector store
Paste the transcript into your RAG ingestion pipeline. Chunk by paragraph or by a fixed token window (a 512–1024-token window works well for video content), then embed using your chosen model—OpenAI, Cohere, or a local model via Ollama.
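The fixed-window chunking described above can be sketched in plain Python. This is an illustrative stand-in, not the tool's own code: it approximates tokens by whitespace-separated words (a real pipeline might use a tokenizer such as tiktoken for exact counts), and the function name and defaults are made up for the example.

```python
def chunk_transcript(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size word windows.

    Approximates a token window by counting whitespace-separated words;
    `overlap` words are repeated between consecutive chunks so sentences
    cut at a boundary still appear whole in one chunk.
    """
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(words):
            break
    return chunks

# Stand-in for a pasted 2outube transcript (1200 words)
transcript = ("word " * 1200).strip()
chunks = chunk_transcript(transcript, window=512, overlap=64)
print(len(chunks), "chunks")
```

Each chunk would then be embedded and upserted with the video's URL and title as metadata, as in the template below.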
Query your grounded knowledge base
Store the embeddings in your vector database (Pinecone, Weaviate, Chroma, pgvector, etc.) with metadata including the video URL, title, and timestamp offsets. Your LLM can now retrieve and cite specific moments from the video in responses.
Quick Start
Get the transcript
Find a YouTube video with relevant content for your knowledge base. Any video with auto-generated or manual captions will work.
Change youtube to 2outube
In the video URL, replace 'youtube.com' with '2outube.com'. For example: youtube.com/watch?v=abc123 becomes 2outube.com/watch?v=abc123. Hit enter and the transcript loads immediately.
Paste into your RAG ingestion script
Copy the transcript text and pass it to your document loader (LangChain's TextLoader, LlamaIndex's SimpleDirectoryReader, or a custom splitter). Chunk, embed, and upsert to your vector store as you would any other document.
Ready-Made Template
# Requires: pip install langchain-text-splitters langchain-openai langchain-chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
# 1. Paste your 2outube transcript here
transcript_text = """
[Paste full transcript from 2outube.com here]
"""
# 2. Wrap in a Document with metadata
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"
video_title = "Your Video Title"
doc = Document(
    page_content=transcript_text,
    metadata={"source": video_url, "title": video_title, "type": "youtube_transcript"},
)
# 3. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents([doc])
# 4. Embed and store (Chroma persists automatically when persist_directory is set)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
print(f"Indexed {len(chunks)} chunks from: {video_title}")
Questions
Does this work with any YouTube video?
Yes, any video with captions—including auto-generated ones. Most YouTube videos have auto-generated captions, so coverage is very broad.
Is it really free?
Completely free. No account, no limits, no API key required. Just swap the URL and copy the transcript.
What format is the transcript in for RAG ingestion?
The transcript is plain text, which is ideal for RAG pipelines. You can copy it directly into any document loader—LangChain, LlamaIndex, Haystack, or a custom script. No parsing or cleaning required.
How should I chunk YouTube transcripts for RAG?
Chunk by semantic paragraph or fixed token windows of 512–1024 tokens with 10–20% overlap. YouTube transcripts lack hard paragraph breaks, so RecursiveCharacterTextSplitter with sentence separators works well. Include the video URL and title in chunk metadata for source citation.
Can I automate transcript extraction for bulk RAG ingestion?
For bulk ingestion, you can script URL substitution: iterate over a list of YouTube URLs, replace 'youtube.com' with '2outube.com', and fetch the page. For large-scale pipelines, the YouTube Data API or yt-dlp with subtitle extraction are also options, but 2outube is the fastest path for individual videos.
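The URL substitution described above is a one-line string replace, so scripting it for a batch of videos is straightforward. A minimal sketch (the fetch-and-parse step is left as a comment, since the 2outube page markup is not specified here and any parsing code would be an assumption):

```python
# List of source videos to index (example IDs, not real videos)
youtube_urls = [
    "https://youtube.com/watch?v=abc123",
    "https://youtube.com/watch?v=def456",
]

# Swap the domain to get the transcript page for each video
transcript_urls = [u.replace("youtube.com", "2outube.com") for u in youtube_urls]

for url in transcript_urls:
    print(url)
    # e.g. page = requests.get(url).text, then extract the transcript text
    # with your HTML parser of choice before handing it to your splitter.
```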
Which vector databases work best with YouTube transcript chunks?
Any vector database works—Pinecone, Weaviate, Chroma, Qdrant, pgvector, Milvus, or FAISS for local use. YouTube transcripts are plain text, so there are no format constraints. Store the source URL and video title as metadata for retrieval attribution.
Do timestamps come through in the transcript?
2outube provides the full caption text. For timestamp-level retrieval (useful for long-form lectures or podcasts), you can use the YouTube Transcript API with Python to get per-segment timestamps and store them as chunk metadata.
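If you do pull per-segment timestamps, they can be folded into chunk metadata like this. The sample list below mimics the list-of-dicts shape returned by the youtube-transcript-api package (fields: text, start, duration); the segment texts are invented for the example, and a real run would fetch them with YouTubeTranscriptApi.get_transcript(video_id).

```python
# Sample segments in the shape youtube-transcript-api returns
segments = [
    {"text": "Welcome to the lecture.", "start": 0.0, "duration": 3.2},
    {"text": "Today we cover embeddings.", "start": 3.2, "duration": 4.1},
    {"text": "First, what is a vector?", "start": 7.3, "duration": 3.8},
]

def segments_to_chunks(segments, max_chars=40):
    """Group consecutive caption segments into chunks, keeping each
    chunk's starting offset (in seconds) for citation metadata."""
    chunks, buf, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]
        buf.append(seg["text"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"text": " ".join(buf), "start_seconds": start})
            buf, start = [], None
    if buf:  # flush any trailing segments
        chunks.append({"text": " ".join(buf), "start_seconds": start})
    return chunks

chunks = segments_to_chunks(segments)
```

Storing start_seconds alongside the video URL lets your LLM cite a jump link (youtube.com/watch?v=ID&t=OFFSET) for the exact moment it retrieved.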
What embedding models work well for YouTube transcript RAG?
OpenAI text-embedding-3-small and text-embedding-3-large are popular choices. For local or open-source RAG, BAAI/bge-large-en-v1.5 and nomic-embed-text perform well on conversational and instructional content typical of YouTube videos.
Start indexing YouTube transcripts into your RAG pipeline
Free, no signup required
Try It Free