Building a Private On-Device AI Companion: What I Learned

Rob Roy · robroy.online · April 2026

The starting question


A few weeks ago I asked a simple question: could I take a small language model and teach it about one specific domain?

That question led me through a full modern AI engineering stack — retrieval-augmented generation, embedding models, on-device inference, Core ML conversion, Apple Foundation Models — and eventually to a working iOS app that runs entirely on the device with no network calls. Along the way I changed direction several times, discarded entire architectures, and learned what actually matters versus what just sounds impressive in a roadmap.

This is a record of the journey, including the parts where I was wrong.

The project

The app is called Big Book via LLM. It provides a private, on-device conversational interface to the 1939 first edition of the Alcoholics Anonymous Big Book — a text that is firmly in the public domain in the United States. The app retrieves relevant passages from the text in response to user queries and generates warm, concise responses grounded in those passages. Every answer cites its source. Nothing leaves the device.

The project had three honest goals from the start:

Learn the current AI engineering stack deeply enough to be professionally fluent in it.

Build something that might serve a community I care about.

Produce work that demonstrates real capability, not just familiarity with buzzwords.

Commercial scale was never the primary goal. That turned out to matter, because it freed me to make good technical decisions instead of defensive market positioning.

Starting with the wrong architecture

My first instinct was to bundle everything into the app. Download a small LLM, index the Big Book corpus, run inference locally, ship it. Privacy would be the pitch. On-device would be the differentiator.

That architecture started falling apart almost immediately.

The first problem was model size. Phi-4 Mini quantized to Q4 is about 2.3 GB. Apple’s On-Demand Resources system caps individual asset packs at 512 MB. The workaround was to split the GGUF file into chunks and download each one, but a bigger issue surfaced before I got that far: a 2.3 GB download before the user sees any value is a terrible first experience. Many users abandon an app within the first minute. A progress bar is not an experience.
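
The arithmetic behind the splitting workaround is easy to check. The following is a rough sketch that treats both figures above as decimal units, which is an assumption; the exact byte counts depend on the actual file. The function names are illustrative only.

```python
import math

MODEL_BYTES = 2_300_000_000      # ~2.3 GB quantized model (figure from the text)
PACK_LIMIT_BYTES = 512_000_000   # 512 MB per asset pack (figure from the text)

def packs_needed(total_bytes: int, limit_bytes: int) -> int:
    """Smallest number of packs that can hold the whole file."""
    return math.ceil(total_bytes / limit_bytes)

def pack_sizes(total_bytes: int, limit_bytes: int) -> list[int]:
    """Byte size of each pack: full packs first, then the remainder."""
    full, rest = divmod(total_bytes, limit_bytes)
    return [limit_bytes] * full + ([rest] if rest else [])

print(packs_needed(MODEL_BYTES, PACK_LIMIT_BYTES))  # 5
```

Five separate downloads that must all succeed before the first answer appears is a lot of surface area for something to go wrong.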

The second problem was memory. A 2.3 GB model needs roughly 3 GB of RAM to load with overhead. iPhones with 4 GB of RAM physically cannot run it. That cut off users of the base iPhone 12 and 13, a significant slice of exactly the demographic most likely to benefit from this app.

The third problem was stability. llama.cpp integration via Swift bindings involved C++ interop, manual KV cache management, and a class of bugs that don’t exist in pure Swift. I spent days chasing hard crashes before realizing the architecture was fighting the platform.

The lesson: elegance on the developer’s MacBook is not the same as elegance on the user’s iPhone.

Changing direction

I made three architectural changes in quick succession, each one forced by a problem I should have anticipated:

Switched from Phi-4 Mini to Gemma 3 1B. Smaller model, more devices supported, smaller download. Response quality for RAG is largely driven by the retrieval layer, not the model’s raw capabilities — the LLM is primarily framing retrieved passages rather than generating from scratch. A 1 billion parameter model does this task surprisingly well.

Replaced llama.cpp with Apple Foundation Models. When iOS 26 shipped with native on-device LLM support, the entire custom inference stack became unnecessary. No download, no GGUF files, no C++ bridge, no ODR splitting. A single Swift API that uses the Neural Engine directly. The tradeoff was setting iOS 26 as the minimum deployment target, which sounds restrictive but, according to Apple’s data, covers 66% of all active iPhones and 74% of devices from the last four years. The users I’d exclude were mostly on hardware that couldn’t run the app well regardless.

Built a Mac companion app to handle corpus preparation. My original workflow used Python scripts that only I could run. That’s a workflow for exactly one person. I built CorpusKit Studio as a native Mac app that accepts PDFs, chunks them, generates embeddings via Core ML, lets curators highlight important passages, tests retrieval live, and exports a deployable corpus bundle. No terminal required. No Python. Same code patterns as the iOS app, same embedding model, consistent results across both.

The RAG architecture that worked

With the right pieces in place, the final architecture is conceptually simple:

A corpus is prepared by CorpusKit Studio on a Mac. The Big Book PDF goes in. PDFKit extracts the text. A chunker splits it into overlapping 200-word segments with metadata preserved. A Core ML converted version of all-MiniLM-L6-v2 generates a 384-dimensional semantic embedding for each chunk. Important passages get curator importance weights set via a highlighting interface. The whole thing exports as a bundle containing chunks.json and embeddings.npy.
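
The chunking step can be sketched in a few lines. This is a minimal illustration, not CorpusKit Studio's code: the 50-word overlap, the function name, and the metadata field names are all assumptions, since only the 200-word window is stated above.

```python
def chunk_words(text: str, page: int, size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping word windows, keeping page metadata."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append({"page": page, "start_word": start, "text": " ".join(window)})
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.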

The iOS app bundles this corpus along with the same MiniLM Core ML model. At query time, the user’s question is expanded using a domain-specific vocabulary dictionary (“step 4” becomes “searching and fearless moral inventory personal housecleaning list resentments fears”), embedded using the same model that created the corpus, and compared against all chunk embeddings via cosine similarity. The scores are weighted by chapter and importance, and the top three chunks are retrieved. Those chunks are assembled into a prompt with a tight three-part response structure. Apple Foundation Models streams a warm, brief response. The UI renders the response text alongside passage cards showing the specific cited sentences.
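
The scoring step at the heart of that pipeline might look like the sketch below. It assumes the chapter and importance weights are combined by multiplication, which the description above does not specify, and the field names (`chapter`, `importance`, `embedding`) are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; zero vectors score 0 instead of dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, chunks, chapter_weights, k=3):
    """Rank chunks by cosine similarity scaled by chapter and importance weights."""
    scored = []
    for chunk in chunks:
        # Missing weights default to 1.0, so unweighted chunks keep their raw score.
        weight = chapter_weights.get(chunk["chapter"], 1.0) * chunk.get("importance", 1.0)
        scored.append((cosine(query_vec, chunk["embedding"]) * weight, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is reasonable for a single book: the corpus is small enough that comparing the query against every chunk needs no vector index.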

End to end, a query takes about 800 milliseconds on an iPhone 15 Pro. No network. No server. No user data leaving the device.

Three design decisions I’d make again

The three-part response structure. Rather than letting the LLM generate whatever it wanted, I constrained responses to: one sentence acknowledging what the person is bringing, one sentence pointing to where the book addresses it, and one question that opens reflection. Under 50 words total. This makes the app feel like a thoughtful companion rather than a chatbot. The text is just connective tissue — the UI renders actual book passages as citation cards.
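
As an illustration of how such a constraint can be expressed, here is a hypothetical prompt builder plus a length check. The instruction wording, function names, and passage fields are assumptions made for this sketch and are not the app's actual prompt.

```python
def build_prompt(question: str, passages: list[dict], max_words: int = 50) -> str:
    """Assemble retrieved passages and the three-part instruction into one prompt."""
    sources = "\n".join(f"[p. {p['page']}] {p['text']}" for p in passages)
    return (
        "Answer using only the passages below.\n"
        "Write exactly three sentences:\n"
        "1. Acknowledge what the person is bringing.\n"
        "2. Point to where the book addresses it.\n"
        "3. Ask one question that opens reflection.\n"
        f"Stay under {max_words} words in total.\n\n"
        f"Passages:\n{sources}\n\n"
        f"Question: {question}"
    )

def within_limit(response: str, max_words: int = 50) -> bool:
    """Check the length constraint after generation, since a model may ignore it."""
    return len(response.split()) < max_words
```

Checking the word count after generation matters because an instruction in the prompt is a request, not a guarantee.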

Sentence-level citation extraction. Chunks are 200 words because that’s what works for retrieval. But showing 200 words as a citation is overwhelming. At display time, I embed each sentence within the retrieved chunk and show only the single most relevant sentence with its page number. “Resentment is the number one offender. — Big Book, p. 64” is more useful than a paragraph of context the user didn’t ask for.
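
A minimal sketch of that display-time step follows. The sentence splitter is deliberately naive, and `toy_embed` is a stand-in that counts a few hand-picked words; in the real app the embedding would come from the MiniLM model. All names here are hypothetical.

```python
import math
import re

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; zero vectors score 0 instead of dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def best_sentence(query: str, chunk_text: str, embed) -> str:
    """Return the sentence in the chunk whose embedding is closest to the query."""
    query_vec = embed(query)
    sentences = split_sentences(chunk_text)  # assumes the chunk is non-empty
    return max(sentences, key=lambda s: cosine(query_vec, embed(s)))

def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts of a few hand-picked words."""
    vocab = ["resentment", "fear", "inventory", "offender"]
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(v)) for v in vocab]
```

Passing the embedding function in as a parameter keeps the selection logic independent of whichever model produces the vectors.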

Treating authority as a first-class concept. Not every document in a library is equally authoritative. In law, the Constitution outweighs a law review article. In recovery, the Big Book first edition outweighs commentary. CorpusKit Studio lets curators set authority weights per corpus, stamped into every chunk at export time. The iOS app multiplies similarity by authority, so a passage from a primary source wins over a loosely relevant passage from a secondary one. This encodes domain expertise directly into retrieval rather than hoping the model figures it out.
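
The effect of multiplying similarity by authority is easiest to see with numbers. The values below are invented for illustration, and the field names are hypothetical.

```python
def rank_by_authority(hits: list[dict]) -> list[dict]:
    """Order hits by similarity multiplied by the chunk's stamped authority weight."""
    return sorted(hits, key=lambda h: h["similarity"] * h.get("authority", 1.0), reverse=True)

hits = [
    {"source": "commentary", "similarity": 0.80, "authority": 0.6},    # 0.80 * 0.6 = 0.48
    {"source": "primary text", "similarity": 0.70, "authority": 1.0},  # 0.70 * 1.0 = 0.70
]
print([h["source"] for h in rank_by_authority(hits)])  # ['primary text', 'commentary']
```

The commentary is the closer semantic match, but the primary text still ranks first once authority is applied.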

Three decisions I got wrong first

I initially wanted the app to rely on a local Flask server for embeddings during development. This worked in the simulator but created a dependency that couldn’t ship to actual users. I should have prioritized Core ML embedding conversion from day one rather than leaning on the Flask server as a placeholder.

I tried to bundle the model inline with the app. Apple’s ODR system and memory constraints made this painful. The real answer was either a smaller model that fits within reasonable constraints, or Apple’s own Foundation Models when available. Fighting the platform on model delivery was wasted effort.

I considered competing with NotebookLM on features. That would have been a losing strategy. Google’s free, works-everywhere tool wins on raw capability. What my app can offer that theirs cannot is specialization — domain-tuned query expansion, citation-first UI, crisis awareness appropriate for recovery literature, a warm tone deliberately shaped for the community it serves. General tools cannot credibly be all things to all domains. Specificity is defensible.

What I learned about RAG

The retrieval layer matters more than the model. A well-tuned retrieval pipeline with a mediocre LLM outperforms a great LLM with naive retrieval. I spent a disproportionate amount of time on chunk size, overlap, chapter weights, query expansion, and importance scoring — and that’s where most of the quality came from.

Domain knowledge encoded as data beats domain knowledge expected from the model. Rather than hoping the LLM knows that “step 4” means a moral inventory, I wrote a query expansion dictionary. Rather than hoping it weights the Constitution higher than a law review article, I stamp authority weights directly onto chunks. The model does not need to be an expert in your domain if your retrieval layer already is.
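
A query expansion dictionary can be as simple as the sketch below. The single entry is the example quoted earlier; appending the expansion to the original query, rather than replacing the phrase, is an assumption, as are the names.

```python
import re

# Hypothetical dictionary; the one entry is the example from the text.
EXPANSIONS = {
    "step 4": "searching and fearless moral inventory personal housecleaning list resentments fears",
}

def expand_query(query: str, expansions: dict[str, str] = EXPANSIONS) -> str:
    """Append domain vocabulary for every known phrase found in the query."""
    lowered = query.lower()
    extras = [
        terms
        for phrase, terms in expansions.items()
        # Word boundaries keep "step 4" from matching inside "step 40".
        if re.search(rf"\b{re.escape(phrase)}\b", lowered)
    ]
    return " ".join([query, *extras])
```

Because the dictionary is plain data, a curator can extend it without touching the retrieval code or the model.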

Evaluation is the hardest part. It’s easy to build RAG that returns something. It’s hard to know whether it returned the right thing. I built a verification script that runs a set of representative queries and prints the retrieved chunks with scores. Watching those scores move as I tuned chunk size and chapter weights was how I actually knew the system was improving. Without that feedback loop, you’re flying blind.
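
A verification script of this kind needs very little code. The sketch below assumes a hypothetical `retrieve` callable that returns `(score, chunk)` pairs sorted best-first; it is an illustration of the idea, not the script described above.

```python
def run_checks(queries, retrieve, k=3):
    """Print the top-k chunks and scores per query; return them for comparing runs."""
    report = {}
    for query in queries:
        hits = retrieve(query)[:k]
        report[query] = [(round(score, 3), chunk["page"]) for score, chunk in hits]
        print(query)
        for score, chunk in hits:
            # Show only the start of each chunk so the output stays scannable.
            print(f"  {score:.3f}  p.{chunk['page']}  {chunk['text'][:60]}")
    return report
```

Saving the returned report before and after a tuning change makes it possible to diff the two runs instead of relying on memory.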

What I learned about shipping on-device AI

iOS 26 and Apple Foundation Models changed the calculus significantly. Before iOS 26, shipping an on-device LLM meant downloading gigabytes and managing memory pressure yourself. After iOS 26, it’s a Swift API. For anyone building in this space today, that transition is the single most important development.

Core ML conversion has sharp edges. The coremltools package has known issues with Python 3.13, only certain releases support a given model architecture, and the documentation assumes you already know what you’re doing. The path that worked for me was Python 3.11, direct PyTorch-to-Core ML conversion skipping ONNX entirely, FP16 precision, and verifying cosine similarity above 0.999 between the PyTorch reference and the converted Core ML output before trusting anything.
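
The parity check itself is independent of any conversion tooling. Here is a minimal sketch that compares two embedding vectors for the same input, one from the reference model and one from the converted model; producing those vectors is left out, and the function names are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; zero vectors score 0 instead of dividing by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def conversion_matches(reference: list[float], converted: list[float],
                       threshold: float = 0.999) -> bool:
    """True when the converted model's embedding points the same way as the reference."""
    return cosine(reference, converted) > threshold
```

Because a zero vector scores 0 here, an all-zero output from a broken conversion fails the check rather than slipping through.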

Privacy is not automatically a competitive advantage but it is table stakes for certain domains. Generic users don’t care much about where their queries go. Users dealing with recovery, therapy, legal issues, medical information, or any topic they consider sensitive care a great deal. Choosing domains where privacy matters makes on-device architecture valuable. Choosing domains where it doesn’t makes on-device architecture expensive and complicated with no upside.

Where this goes next

CorpusKit Studio is the more interesting long-term product than the iOS app. It’s infrastructure — a tool for producing deployable RAG corpora from any PDF, with curator controls for importance, authority, and query expansion. The same engine that produces the Big Book corpus can produce a corpus from any closed body of text. Recovery literature, legal documents, medical education, sacred texts, personal libraries.

The iOS app is one consumer of the output. The Mac app is the workshop. The corpus bundle is the portable artifact that connects them. A publisher, a clinician, a sponsor, a curator — none of them needs to touch a terminal to produce a high-quality RAG corpus. That accessibility is what turns this from a personal project into something potentially useful to other people.

I plan to release the Big Book corpus package publicly under a permissive license. The corpus derivation is my work; the source text is public domain; the resulting index is something any CorpusKit-compatible application can consume. The recovery community should have access to this kind of tool, and one app cannot reach everyone — but a well-made corpus can work inside any tool that knows how to read it.

The skills this built

I was explicit with myself at the start that one of the goals was professional skill development. Working through this project end-to-end, I can now credibly speak to:

Retrieval-augmented generation architecture: chunk sizing, overlap, embedding model selection, vector similarity search, query expansion, chapter and importance weighting, evaluation methodology.

On-device LLM inference: model quantization tradeoffs, memory constraints on mobile devices, llama.cpp integration, Apple Foundation Models, Core ML model conversion from PyTorch, FP16 precision considerations.

Production iOS development with AI: SwiftUI architecture, async token streaming, Core ML integration, proper error handling around AI pipelines, the difference between simulator and device behavior, iOS 26 Foundation Models API specifics.

Product decisions in the AI space: when to use RAG versus fine-tuning, how to handle crisis scenarios in sensitive domains, the tradeoffs of closed-source versus open models, cloud versus on-device architectures, the actual versus perceived value of privacy-first design.

The deeper lesson is that doing this work is different from reading about it. I could have spent the same weeks watching tutorials and finished with less real understanding than I got from hitting actual problems and solving them. The errors — the crashes, the zero-vector embeddings, the Python version mismatches, the Core ML conversion failures, the response length that wouldn’t shorten — were where the learning lived.

A note on direction changes

Looking back, I changed direction in significant ways roughly five times across this project. Each time it felt like a setback. Each time it was actually the right call.

Changing direction is often treated as a failure mode. In software it usually is not. The information you have at the start of a project is always incomplete. The information you have three weeks in is substantially better. Refusing to incorporate new information out of commitment to a plan is a worse outcome than revising the plan honestly.

The original architecture had Phi-4 Mini running via llama.cpp with ODR-split GGUF files, a Flask embedding server during development, and a Python corpus pipeline that only I could run. The final architecture has Apple Foundation Models, Core ML embeddings, a Mac app that any curator can use, and a corpus format designed for portability. The final version is dramatically better than the original plan — and I could not have arrived at it without starting somewhere and paying attention.


Rob Roy is an independent iOS developer based in Maine. His work and contact are at robroy.online.