How to Use Ollama as a Free Local AI on Mac

February 6, 2026 · 8 min read

What is Ollama?

If you have ever wanted to run a powerful AI language model on your Mac without sending your data to the cloud, paying for API access, or even needing an internet connection, Ollama is the tool you have been looking for. Ollama is a free, open-source application that makes it remarkably simple to download and run large language models (LLMs) entirely on your local machine.

Traditional AI services like ChatGPT and Claude require you to send your text to remote servers for processing. That means your data leaves your computer, you need an active internet connection, and in most cases you are paying per request. Ollama eliminates all three of those concerns. Once you download a model, everything runs locally on your Mac's hardware. Your data never leaves your device, there are no recurring costs, and you can use it offline at 30,000 feet just as easily as you can at your desk.

The project has grown rapidly since its release, with a thriving community and a library of hundreds of models to choose from. Whether you need a general-purpose assistant, a coding helper, or a model fine-tuned for a specific language, there is likely an Ollama-compatible model that fits your needs.

Installing Ollama on macOS

Getting Ollama running on your Mac takes just a few minutes. The installation process is straightforward and follows the same pattern as most macOS applications.

  1. Download the installer from ollama.com/download. The site will automatically detect that you are on macOS and offer the correct version.
  2. Open the downloaded zip file and drag the Ollama application into your Applications folder.
  3. Launch Ollama from your Applications folder. On first launch, macOS may ask you to confirm that you want to open an app downloaded from the internet. Click "Open" to proceed.
  4. Allow the CLI installation. When Ollama launches for the first time, it will ask for permission to install its command-line tool. This is what lets you interact with Ollama from Terminal, so go ahead and approve it.

Once installed, Ollama runs quietly as a menu bar application. You will see a small llama icon in your menu bar whenever it is active. You can verify the installation by opening Terminal and running:

ollama --version

If you see a version number printed back, you are all set.

Downloading Your First Model

Ollama does not come with any models pre-installed. You need to download at least one before you can start using it. Models are pulled from the Ollama model library using a simple terminal command.

To download Meta's Llama 3.2, which is a great general-purpose model, open Terminal and run:

ollama pull llama3.2

You can also grab Mistral, a fast and capable model from Mistral AI:

ollama pull mistral

Model downloads can take a few minutes depending on your internet speed. Most models range from about 2 GB to 8 GB for the standard variants, though larger versions exist for those with more RAM and disk space. The llama3.2 default variant is around 2 GB, making it an excellent starting point. The mistral default variant is roughly 4 GB and delivers noticeably stronger output for most tasks.

You can download as many models as your disk space allows. They are stored in a local cache, so switching between models later requires no re-downloading, only a brief load into memory.
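
If you are curious how much space those cached models are using, Ollama on macOS stores them under ~/.ollama/models by default (this path is the standard default rather than anything configured in this guide):

du -sh ~/.ollama/models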

Running Ollama from Terminal

Once you have a model downloaded, you can start chatting with it immediately. The most common way to interact with Ollama is through an interactive terminal session.

Interactive Chat

To start a conversation with Llama 3.2, run:

ollama run llama3.2

This opens an interactive prompt where you can type messages and receive responses in real time. The model streams its output token by token, so you see the response as it is generated. Type /bye to exit the session when you are done.
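
If you just want a one-off answer rather than an ongoing conversation, you can also pass the prompt directly on the command line and Ollama will print the response and exit (the prompt text here is only an example):

ollama run llama3.2 "Explain the difference between RAM and disk space in two sentences."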

Managing Installed Models

To see a list of all models currently installed on your machine, run:

ollama list

This displays each model's name, size, and when it was last modified. If you want to free up disk space, you can remove a model with:

ollama rm <model-name>
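
For example, if you no longer need the mistral model pulled earlier, removing it frees its roughly 4 GB of disk space (you can always pull it again later):

ollama rm mistral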

API Mode

Ollama also exposes a local REST API that other applications can use to send prompts and receive responses programmatically. If Ollama is running from the menu bar, the API is already available. If you need to start it manually, run:

ollama serve

This starts the Ollama server on localhost:11434. Any application on your Mac can now send HTTP requests to this endpoint to interact with your locally installed models. This is exactly how tools like VoxyAI integrate with Ollama behind the scenes.
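
To confirm the API is responding, you can send it a request with curl. The example below uses Ollama's documented /api/generate endpoint and assumes you have already pulled llama3.2; setting "stream" to false returns the whole response as a single JSON object instead of streaming it token by token:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'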

Which Model Should You Choose?

With dozens of models available, picking the right one can feel overwhelming. The table below compares the most popular models that work well on consumer Mac hardware. The right choice depends on your priorities: do you need the fastest responses, the highest-quality output, or something optimized for a specific task like coding?

Model | Size | Speed | Quality | Best For
llama3.2 (recommended) | ~2 GB | Very Fast | Good | Quick tasks, lightweight usage, Macs with 8 GB RAM
mistral (recommended) | ~4 GB | Fast | Very Good | General-purpose writing, summarization, Q&A
gemma2 | ~5 GB | Fast | Very Good | Instruction following, creative writing, analysis
phi3 | ~2 GB | Very Fast | Good | Reasoning tasks, lightweight deployments, constrained hardware
codellama | ~4 GB | Fast | Very Good | Code generation, debugging, programming assistance

If you are just getting started and want a single model that handles most tasks well, mistral is an excellent all-rounder. If your Mac has only 8 GB of RAM, start with llama3.2 or phi3 since their smaller footprint leaves more memory available for other applications. For developers who want AI-assisted coding, codellama is purpose-built for programming tasks and understands code context far better than general-purpose models.

You are not locked into a single choice. Install multiple models and switch between them depending on the task at hand. Ollama makes it easy to experiment.
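
Many library entries also publish size-specific tags. For example, llama3.2 is available in a smaller 1b variant alongside the default 3b; check each model's page in the Ollama library for the tags that actually exist before pulling one:

ollama pull llama3.2:1b
ollama run llama3.2:1b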

Using Ollama with VoxyAI

One of the best reasons to set up Ollama on your Mac is that it unlocks completely free, completely private AI-powered voice dictation through VoxyAI. The integration is seamless and requires no configuration on your part.

VoxyAI automatically detects when Ollama is running on your Mac by checking for the local API at localhost:11434. Once detected, VoxyAI uses your locally installed model to intelligently format your voice dictation in real time. This means your spoken words are transcribed by Apple's built-in speech recognition and then formatted, punctuated, and structured by your local Ollama model. At no point does your text leave your computer.
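
You can run the same kind of availability check yourself from Terminal. When the server is up, a plain request to the root of the local API should return a short "Ollama is running" message (the exact wording may vary between versions):

curl http://localhost:11434/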

Here is all you need to do:

  1. Install Ollama following the steps above.
  2. Pull a model such as mistral or llama3.2.
  3. Make sure Ollama is running (look for the llama icon in your menu bar).
  4. Open VoxyAI and select Ollama as your AI provider in the settings.

That is it. No API keys to generate, no accounts to create, no subscription fees. VoxyAI will immediately begin using your local model to format your dictation. You get intelligent punctuation, paragraph breaks, capitalization, and natural formatting, all powered by the model running right on your Mac.

This combination is particularly valuable for anyone who works with sensitive information. Lawyers drafting notes, doctors recording observations, journalists protecting sources, or anyone who simply values their privacy can dictate freely knowing that every word stays on their device. The entire pipeline, from your voice to the final formatted text, is completely local.

Tips for Performance

Running a large language model locally is more hardware-intensive than most everyday tasks. Here are practical tips to get the best experience with Ollama on your Mac.

Use an Apple Silicon Mac

If you have a Mac with an M1, M2, M3, or M4 chip, you are in excellent shape. Apple Silicon processors have a unified memory architecture that allows the GPU to access system RAM directly, which dramatically accelerates model inference. An M1 MacBook Air can comfortably run 7-billion-parameter models, and higher-end Apple Silicon machines with 32 GB or more of RAM can handle much larger models with ease. Intel Macs can run Ollama, but expect significantly slower response times.
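
If you are not sure which chip your Mac has, a quick check in Terminal will tell you (this is a standard macOS command, not part of Ollama):

# Prints "arm64" on Apple Silicon and "x86_64" on Intel Macs
uname -m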

Match Model Size to Your RAM

The single biggest factor in performance is whether the model fits comfortably in your available memory. Here are general guidelines, with a quick way to check your Mac's installed RAM after the list:

  • 8 GB RAM: Stick to smaller models like llama3.2 (3B) or phi3 (3.8B). These will run smoothly and leave enough memory for your other applications.
  • 16 GB RAM: You can comfortably run 7B parameter models and can stretch to some 14B models. This is the sweet spot for most users running models like mistral or gemma2.
  • 32 GB+ RAM: Larger models in the 14B to 32B range become practical. You will see noticeably higher quality output from these larger models, especially for complex reasoning and nuanced writing tasks.
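
If you are not sure how much RAM your Mac has, you can check About This Mac or run the following in Terminal (a standard macOS command, not part of Ollama):

# Installed RAM in GB (hw.memsize reports bytes)
echo "$(($(sysctl -n hw.memsize) / 1073741824)) GB"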

Close Heavy Applications

When you are actively using Ollama, especially with larger models, it helps to close memory-hungry applications. Web browsers with many open tabs, video editors, and virtual machines all compete for the same RAM that your model needs. If you notice slow or stuttering responses, check Activity Monitor to see what else is consuming memory and consider closing applications you are not actively using.
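
If you suspect the model itself is the memory hog, recent Ollama releases include a ps command that lists which models are currently loaded and how much memory each one is using (availability depends on your Ollama version):

ollama ps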

Choose Speed Over Size for Dictation

If you are using Ollama primarily for voice dictation formatting with VoxyAI, response speed matters more than raw intelligence. A smaller, faster model like llama3.2 will format your dictation nearly instantly, while a larger model might introduce a noticeable delay. For dictation, the formatting task is relatively straightforward, so you do not need the most powerful model available. Save the bigger models for tasks where output quality is the top priority, such as drafting emails, generating code, or analyzing documents.

Try VoxyAI Free

Voice dictation with AI-powered formatting for macOS. Works with free local models or bring your own API keys.

Download VoxyAI