As I am coming to the end of writing the second edition of Learn Ansible (more news on that coming soon), I thought that now I have a little more free time, it would be a great time to look at the exciting developments that have happened over the last six months.
One of the things I have been keeping an eye on is the state of Large Language Models (LLMs for short), especially since the introduction of open-source models such as Llama from Meta and Mistral 7B, which you can run locally.
Luckily for me, being busy writing has meant enough time has passed for deployment methods to become much more straightforward and streamlined than they first were. The first tool I will look at in this post is Ollama; while it has been available since July last year, it has come on in leaps and bounds since November 2023.
Info
As you may know from reading my other blog posts, I am a macOS user, so the commands in this post will cover only macOS. I also have an M3 MacBook Pro with 36GB of RAM, so your mileage may vary depending on your machine's specifications.
Ollama
So, what is Ollama? The Ollama website describes the tool as:
Get up and running with large language models, locally. Run Llama 2, Code Llama, and other models. Customize and create your own.
The description is simple and to the point, much like the tool itself. Once you start using the tool, it will feel simple and basic - but don't let that fool you; a lot is happening in the background.
You are getting a tool that allows you to pull, update, and maintain copies of dozens of models; it also runs as a server in the background on your local machine. It gives you a standard API endpoint to connect to, allowing you to consume the models in a standardised way.
Rather than discussing the tool's features further, let's install it and run some tests.
Installing on macOS
Installing Ollama on macOS using Homebrew couldn’t be simpler; all you need to do is run:
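Something like the following should do the trick (assuming you already have Homebrew set up):

```
brew install --cask ollama
```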
The keen-eyed amongst you may have noticed that I am passing the `--cask` flag; this installs the desktop version of Ollama rather than just the terminal version, which you can install by running:
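```
# Installs just the Ollama CLI and server, no desktop app
brew install ollama
```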
While the desktop version of Ollama doesn't have many features, running it allows you to quickly start and stop the web services that run in the background simply by opening and closing the application. Another reason to prefer the desktop application over just running it on the command line is that it quietly handles updating itself in the background, prompting you to restart whenever a fresh update is available for download.
So, now that we have it installed, let's do something.
Pulling and running a model
Anyone familiar with the Docker way of pulling and using images will instantly feel at home here; to download and install the llama2 7B model, we need to run:
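The pull command mirrors Docker's:

```
# Downloads the llama2 (7B) model to your local machine
ollama pull llama2
```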
This should give you something like the following output:
Please note the file size: 3.8GB, so ensure you have the bandwidth available. Luckily, once downloaded, Ollama doesn't have to connect to the internet again (unless you want to download another model or update it).
With the model downloaded, we can now interact with it by running the command below:
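As with the pull, the run command takes the model name:

```
# Starts the model (pulling it first if needed) and opens a chat prompt
ollama run llama2
```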
Once launched, you will be dropped into a chat prompt and from here, you can ask your questions:
Entering any text at the `>>>` prompt will send it directly to the model. There is a help menu that can be accessed by typing `/?`; this will show the available commands, one of which is `/bye`, which exits the chat.
Using the API
As mentioned, Ollama runs a web-based API on your local machine, which listens on port 11434 by default. You can check this by going to http://localhost:11434 in your browser, where you should receive the message "Ollama is running". You can send requests directly to the API using cURL, for example:
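A request to the generate endpoint looks something like the following (the prompt is just an example, and `"stream": false` is set so the answer comes back as a single JSON object rather than a stream):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```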
After a second or two, a response is returned:
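The response has roughly the following shape; the values below are placeholders rather than the actual output I received:

```
{
  "model": "llama2",
  "created_at": "2024-03-31T10:00:00.000000Z",
  "response": "The sky appears blue because of Rayleigh scattering...",
  "done": true,
  "context": [518, 25580, 29962, ...],
  "total_duration": 5589157167,
  "eval_count": 282
}
```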
In the output above, the context values are truncated, as there are a lot of them.
Running another model
Do you want to run another model, like the newly launched Mistral 7B v0.2 release (which, at the time of writing, was released just last week)? No problem, just run:
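The plain `mistral` tag pulls the latest release (the exact tag I used for the v0.2 build may have been more specific):

```
# Pulls Mistral 7B and drops straight into a chat prompt
ollama run mistral
```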
This will pull the model and drop us straight at a chat prompt:
As you can see, this time it was a 4.1GB download, which means we now have around 8GB of LLMs downloaded and sitting on our local machine. To interact with Mistral using the API, switch the model name and send your request:
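Swapping the model name in the earlier cURL request gives us something like:

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```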
This returns the same JSON response structure (apart from the content, of course, as it's a different model). Before moving on to the next part of the post, let's pull down one more model:
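Going by the models used later in the post, that third model is Code Llama:

```
# Pulls the Code Llama model for use later on
ollama pull codellama
```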
Open WebUI
The authors describe the project, which was formerly called Ollama WebUI - so you can guess what it is used for - as follows:
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.
It is distributed as a container, so we can run it using Docker or Podman, with little in the way of prerequisites needing to be installed.
Running on macOS
The only step we need to do is create somewhere to store our data; I have a folder called `~/Containers/` on my machine, so let's stick an `open-webui` folder in there:
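Something along these lines does the job; while we are at it, we can also pull the Open WebUI image from the GitHub Container Registry:

```
# Create a folder for Open WebUI's data and pull the container image
mkdir -p ~/Containers/open-webui
docker image pull ghcr.io/open-webui/open-webui:main
```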
With the folder in there and the image pulled, the following command will launch Open WebUI and bind it to port 3000 on our local machine:
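The command looks roughly like the one below; the `--add-host` flag lets the container reach the Ollama API running on the host, and the volume mount points at the folder we just created:

```
docker container run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v ~/Containers/open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```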
With the container running, go to http://localhost:3000/.
Our first chat
A login page should greet you; click on the Sign-Up link and create a user. Once you have an account, you will be presented with a ChatGPT-like interface; select a model from the drop-down menu at the top of the chat box and ask your question:
As you can see from the screenshots above, the interface is familiar and quite feature-rich, so let's try something a little more interesting. As mentioned at the start of the post, I have completed the first full draft of the second edition of Learn Ansible, which means I have PDF copies of all the chapters on my machine. Let's upload Chapter 01 and chat about it.
Talking with a document
Click on Documents in the left-hand menu and then the `+` button, upload your file and, if you like, tag it:
Now, let’s ask Mistral something about the document I uploaded. In the chapter, I talk about a declarative approach to IaC and give an example of how a declarative IaC would run.
As you can see from the screens above, I typed in `#chatper01pdf` to select and attach the uploaded document to the chat prompt and then asked the question below:
Me
What happens in the declarative example given in the document? Also, can you give me a summary of each of the runs documented?
After a few seconds, I was given the following response:
Mistral:Latest
In the declarative example given in the document, our configuration management tool executes tasks declared in a certain order to achieve a desired state. The tasks include creating a group, creating users, and granting privileges. However, due to the eventual consistency principle, the final state may not be achieved immediately after each run. Instead, the system may reach the desired state after several runs.
Run 1:
- Task 2 (create group Team) executed but could not create user Alice since the group did not exist yet.
- Task 3 (create user Bob) was successful as the group Team existed at that time.
- Task 4 (grant escalated privileges to user Alice) failed because there was no user called Alice on the system for the privileges to be applied to.
Run 2:
- The tasks were executed in the same order as Run 1.
- This time, since the group Team existed before user creation, user Alice was created without errors.
- As Alice was present during this run, she received escalated privileges.
- No changes were required during Run 3 as the system was already in the expected state.
In summary, during each run, the tool attempts to reach the desired configuration but may not succeed immediately due to eventual consistency. The tasks are executed based on their order in the definition, and the final state is reached eventually through multiple runs.
I must admit that it is pretty spot on with what I wrote. Let's get a little meta, pardon the pun, and get Llama2 to write our response first. Select the file again by using `#chatper01pdf` and then type `@llama2`; you will get a prompt that Llama2 is thinking, and then it will fill in a response:
Llama2:Latest
Ah, I see! Thank you for explaining that to me. That makes sense. So, in each run, the tool tries to reach the desired configuration, but it may not succeed immediately due to eventual consistency. And it seems like each run is trying to achieve the same goal, but the order of the tasks matters. Is that right?
And can you tell me more about the final state? What does it mean for the system to be in the expected state? Is it just a matter of having the desired configuration, or are there other factors at play as well?
Upon clicking send, after a second, I got the following response:
Mistral:Latest
Yes, that is correct. In each run, the tool is trying to reach the desired configuration, but it may not succeed immediately due to eventual consistency. The order of tasks matters because some tasks may depend on others being completed first.
As for the final state or expected configuration, it generally refers to having all the desired components and settings in place for a system to function optimally according to its intended design or purpose. This can include specific configurations, installed software, user accounts, security settings, and other factors that contribute to the overall health and stability of the system. In some cases, there may be additional requirements such as compliance with industry standards or regulatory guidelines, or performance benchmarks that must also be met for a system to be considered in the expected state.
There are a few important things to remember here: first, this is all running on my local machine; at no point did I send the Chapter 01 PDF file to a third-party API, and I could have disabled the network if I wanted to. It is incredible how quickly this technology is moving forward!
I could discuss many more features, but this post is getting quite long, and there is one more thing I would like to cover.
One more thing
The Open WebUI interface also works well in a mobile browser, so let's use ngrok to open a tunnel from our local machine to the internet so we can access it. To do this, sign up for an ngrok account at their site; the service is described as:
ngrok combines your reverse proxy, firewall, API gateway, and global load balancing to deliver apps and APIs.
Once you have signed up (it's free) and downloaded the agent, run the following command:
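Point ngrok at the port Open WebUI is listening on:

```
# Opens a public HTTPS tunnel to the local Open WebUI instance
ngrok http 3000
```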
This will create the tunnel, give you an HTTPS URL to access the content on and display some stats:
As you can see from the screens below, if you go to the URL given by ngrok and click through the warning, you will be taken to the login page, and from there you can use the mobile UI:
Once finished, press `Ctrl+C` to close the tunnel and quit the ngrok session.
Cleaning up
There is just one thing to do: clean up the resources we have launched, starting with the containers:
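Assuming the container was named `open-webui`, as in the run command earlier, it can be stopped and removed with:

```
docker container stop open-webui
docker container rm open-webui
```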
Don't worry about the container being removed; the data and configuration are safe on the local machine, and to restore them, you need only run the `docker container run` command from earlier in the post.
The final thing you should do is quit the Ollama desktop application. This will shut down the API server and stop it from running in the background; when you want to start everything back up again, simply reopen the application.
Conclusion
In this post, we explored how to run powerful large language models like llama2, Mistral, and codellama entirely locally on macOS using the Ollama tool. We saw how easy it is to pull models, interact with them via a chat interface and API, run a feature-rich web UI, and even access the models remotely using ngrok.
The fact that all of this cutting-edge AI capability can run offline on a laptop with a tool that’s simple to install is truly incredible. I encourage you to try out Ollama for yourself and experience the power of local LLMs. The possibilities are endless, from exploring and learning about these models to building your own applications on top of them.
I hope this post has been informative and inspirational. Feel free to reach out with any questions, and happy local LLM’ing!
Here are some links to the tools we have looked at in the post: