If you’ve spent any time building RAG applications, you know the drill. Before your app can do anything useful, you need a pipeline: extract text from PDFs, chunk it, run it through an embedding model, store the vectors somewhere. If you’re lucky, you’ve got a Python environment that isn’t broken. If you’re not, you’re debugging dependency conflicts at 11pm wondering why you wanted to do this.
I built CorpusKit Studio because I wanted to build RAG-powered iOS apps and I didn’t want a Python pipeline in the loop.
What It Does
CorpusKit Studio is a native Mac app. You drag in a PDF. The app extracts the text, chunks it, generates embeddings using an on-device Core ML model, and exports a signed .corpus bundle that any CorpusKit iOS app can consume directly.
No Python. No API keys. No data leaving your machine.
That last part matters more than it sounds, especially if you’re working with sensitive documents — legal filings, medical records, proprietary research. The embedding step, which normally means sending your text to an external API, happens entirely on your Mac using Apple’s Neural Engine.
The Technical Decisions Worth Talking About
PDF extraction via PDFKit
The obvious choice for a Mac app, and it mostly just works. The edge case worth knowing: some PDFs are image-based scans with no extractable text layer. CorpusKit Studio checks for this and warns you if a page’s character count is suspiciously low. OCR is out of scope: I’m not going to bundle a third-party OCR engine when there are good dedicated tools for that job. Run the scan through one of those first, then bring the resulting PDF into the app.
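To make the low-character-count check concrete, here is a minimal sketch of the idea rather than the app’s actual code. The function names and the 50-character threshold are assumptions for illustration; the only PDFKit calls used are PDFDocument(url:), pageCount, page(at:), and PDFPage’s string property.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Returns the indices of pages whose extracted text is short enough to
/// suggest a scanned image. `minCharacters` is an assumed threshold.
func likelyScannedPages(pageTexts: [String], minCharacters: Int = 50) -> [Int] {
    var flagged: [Int] = []
    for (index, text) in pageTexts.enumerated() {
        // Count only visible characters so blank lines don't hide an empty page.
        let visible = text.filter { !$0.isWhitespace }
        if visible.count < minCharacters {
            flagged.append(index)
        }
    }
    return flagged
}

#if canImport(PDFKit)
/// Pulls the text layer from every page; a page with no text layer yields "".
func extractPageTexts(from url: URL) -> [String]? {
    guard let document = PDFDocument(url: url) else { return nil }
    return (0..<document.pageCount).map { document.page(at: $0)?.string ?? "" }
}
#endif
```

Keeping the heuristic separate from the PDFKit call means the threshold can be tuned against plain strings, without a PDF in hand.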
Running MiniLM on Core ML
This was the interesting part. MiniLM-L6-v2 has become a standard embedding model for RAG applications: small, fast, and good quality for retrieval tasks. Getting it running natively on macOS via Core ML meant converting the model and implementing its WordPiece tokenizer in Swift, since I couldn’t find one off the shelf.
The result: embedding a typical PDF runs in seconds on Apple Silicon. The Neural Engine handles it efficiently, and the model is identical to the one the iOS CorpusKit apps use. That matters for consistency, because vectors are only comparable when the corpus and the query are embedded by the same model.
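For a sense of what the tokenizer work involves, here is a simplified sketch of the core WordPiece step: greedy longest-match-first splitting of a single word. It is not CorpusKit Studio’s implementation. The toy vocabulary is an assumption (a real one ships with the model), and the surrounding steps (lowercasing, punctuation splitting, mapping pieces to IDs, adding special tokens) are left out.

```swift
/// Greedy longest-match-first WordPiece over one already-normalized word.
/// Pieces after the first carry the "##" continuation prefix.
func wordPieceTokenize(_ word: String,
                       vocab: Set<String>,
                       unknownToken: String = "[UNK]",
                       continuationPrefix: String = "##") -> [String] {
    let chars = Array(word)
    var pieces: [String] = []
    var start = 0
    while start < chars.count {
        var end = chars.count
        var match: String? = nil
        // Shrink the candidate from the right until it appears in the vocabulary.
        while end > start {
            var candidate = String(chars[start..<end])
            if start > 0 { candidate = continuationPrefix + candidate }
            if vocab.contains(candidate) {
                match = candidate
                break
            }
            end -= 1
        }
        // If nothing matches at this position, the whole word becomes unknown.
        guard let piece = match else { return [unknownToken] }
        pieces.append(piece)
        start = end
    }
    return pieces
}

// Toy vocabulary, for illustration only.
let toyVocab: Set<String> = ["embed", "##ding", "##s", "chunk"]
let pieces = wordPieceTokenize("embeddings", vocab: toyVocab)
// pieces is ["embed", "##ding", "##s"]
```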
NSDocument architecture
I used NSDocument as the foundation rather than building a custom persistence layer from scratch. That decision gave me File > Open Recent, autosave, and document versioning essentially for free. It’s one of those AppKit decisions that feels like overhead until you realize how much work it saves.
Cosine similarity search via Accelerate
No vector database. The corpus sizes I’m targeting (a few hundred to a few thousand chunks from a typical book or document) don’t need one. At that size, a brute-force cosine similarity pass with Accelerate’s vDSP is effectively instant, so a vector DB dependency would add complexity with no real benefit.
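A brute-force search of this kind fits in a few lines. The sketch below is illustrative, not the app’s code: the function names are made up, and it assumes embeddings are plain [Float] arrays. It uses vDSP.dot where Accelerate is available and falls back to a plain loop elsewhere.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Dot product of two equal-length vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same dimension")
    #if canImport(Accelerate)
    return vDSP.dot(a, b)
    #else
    return zip(a, b).reduce(Float(0)) { sum, pair in sum + pair.0 * pair.1 }
    #endif
}

/// Cosine similarity; returns 0 when either vector has zero length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let denominator = (dot(a, a) * dot(b, b)).squareRoot()
    return denominator > 0 ? dot(a, b) / denominator : 0
}

/// Scores every chunk against the query and returns the best `limit` matches.
func topMatches(query: [Float],
                chunks: [[Float]],
                limit: Int) -> [(index: Int, score: Float)] {
    let scored = chunks.enumerated().map {
        (index: $0.offset, score: cosineSimilarity(query, $0.element))
    }
    return Array(scored.sorted { $0.score > $1.score }.prefix(limit))
}
```

If the embeddings are normalized to unit length when the corpus is built, the similarity reduces to a single dot product per chunk.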
The Part That Isn’t Just Engineering
There’s a piece of CorpusKit Studio that goes beyond the mechanical pipeline: the curator layer.
Raw text extraction treats every chunk as equal. But if you’ve actually read the document, you know that’s not true. Some passages are central to the subject. Others are footnotes, boilerplate, or chapter openers with little retrieval value.
CorpusKit Studio lets you read the source document inside the app, highlight passages, and rate their importance. Those curator signals travel with the exported corpus. When the iOS app retrieves results, importance ratings can influence ranking. Eventually — as apps accumulate usage data — real retrieval signals (which chunks actually led to good conversations) can feed back into the corpus weighting.
The idea is that a corpus curated by someone who actually understands the content should outperform one that was just mechanically chunked. That’s the bet.
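One simple way importance ratings could influence ranking is a weighted blend of similarity and rating. The sketch below is purely hypothetical: the ScoredChunk type, the 0...1 importance scale, and the 0.2 weight are all assumptions, and the formula CorpusKit actually uses may be different.

```swift
/// A retrieval candidate. All fields are illustrative assumptions.
struct ScoredChunk {
    let id: Int
    let similarity: Float   // cosine similarity to the query
    let importance: Float   // curator rating, assumed normalized to 0...1
}

/// Re-sorts candidates by a linear blend of similarity and curator rating.
/// A `weight` of 0 ignores ratings entirely; 1 ignores similarity.
func rerank(_ chunks: [ScoredChunk], weight: Float = 0.2) -> [ScoredChunk] {
    func blended(_ chunk: ScoredChunk) -> Float {
        return (1 - weight) * chunk.similarity + weight * chunk.importance
    }
    return chunks.sorted { blended($0) > blended($1) }
}
```

A small weight keeps similarity in charge and lets the rating act mostly as a tiebreaker between near-equal chunks.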
It Teaches You How RAG Actually Works
Something I didn’t fully anticipate: CorpusKit Studio turns out to be a genuinely good way to understand retrieval-augmented generation at a mechanical level.
Most developers interact with RAG through an API or a framework that hides the steps. You send text in, you get responses back, and the middle is a black box. That’s fine for shipping product, but it means you’re guessing when something doesn’t work — when results are irrelevant, when the model seems to miss obvious answers, when response quality is inconsistent.
CorpusKit Studio makes the whole pipeline visible. You can see exactly how your document gets chunked. You can adjust chunk size and overlap and immediately see how the boundaries change. You can run a query against your corpus and see the ranked results, with cosine similarity scores, before any LLM is involved. Those scores tell you whether retrieval is finding the right passages. If it isn’t, you know the problem is in the corpus or the chunking, not the LLM.
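If chunk size and overlap are new to you, this small sketch shows how the two settings move the boundaries. It splits on whitespace and counts words, which is an assumption made for readability; the app may well chunk by characters, sentences, or tokens instead.

```swift
/// Splits text into chunks of up to `size` words, where each chunk repeats
/// the last `overlap` words of the previous one.
func chunk(_ text: String, size: Int, overlap: Int) -> [String] {
    precondition(size > 0 && overlap >= 0 && overlap < size,
                 "need size > 0 and 0 <= overlap < size")
    let words = text.split(whereSeparator: { $0.isWhitespace })
    var chunks: [String] = []
    var start = 0
    while start < words.count {
        let end = min(start + size, words.count)
        chunks.append(words[start..<end].joined(separator: " "))
        if end == words.count { break }
        // Step forward by less than a full chunk so neighbours share words.
        start += size - overlap
    }
    return chunks
}

let sample = "a b c d e f g"
let overlapped = chunk(sample, size: 3, overlap: 1)
// overlapped is ["a b c", "c d e", "e f g"]
let disjoint = chunk(sample, size: 3, overlap: 0)
// disjoint is ["a b c", "d e f", "g"]
```

More overlap means a sentence that straddles a boundary is more likely to land whole in at least one chunk, at the cost of more chunks to embed and search.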
When you highlight a passage and mark it as high importance, you’re making a decision about signal vs. noise that a pure pipeline never asks you to make. When you run a test query and see a mediocre chunk ranking above a better one, you learn something real about how embedding similarity works — and you can fix it by adjusting your chunking strategy or adding more representative content.
If you’re learning RAG or trying to build intuition about why your retrieval quality varies, building a corpus in CorpusKit Studio is more instructive than reading about it. The feedback loop is immediate and concrete.
What’s Coming
CorpusKit Studio is coming to the Mac App Store. The product page is live now at robroy.online if you want to read more about what it does.
If you’re building RAG applications on Apple platforms and you’ve been tolerating the Python pipeline, I’d be curious what your setup looks like. Drop me a line at contact@robroy.online.