As I am coming to the end of writing the second edition of Learn Ansible (more news on that coming soon), I thought now would be a great time to have a look at what exciting developments have been happening over the last six months, now that I have a little more free time.

One of the things I have been keeping an eye on is the state of Large Language Models (LLMs for short), especially since the introduction of open-source models such as Llama from Meta↗ and Mistral 7B↗, which you can run locally.

Luckily for me, the fact that I have been busy writing has meant enough time has passed for deployment methods to become much more straightforward and streamlined than they first were. The first tool I will look at in this post is Ollama; while it has been available since July last year, it has come on in leaps and bounds since November 2023.

Info

As you may know from reading my other blog posts, I am a macOS user, so the commands in this post will cover only macOS. I also have an M3 MacBook Pro with 36GB of RAM, so your mileage may vary depending on your machine’s specifications.

Ollama

So, what is Ollama? The Ollama website↗ describes the tool as:

Get up and running with large language models, locally. Run Llama 2, Code Llama, and other models. Customize and create your own.

The description is simple and to the point, much like the tool itself. Once you start using the tool, it will feel simple and basic - but don’t let that fool you; a lot is happening in the background.

You are getting a tool that allows you to pull, update, and maintain copies of dozens of models - it also runs as a server in the background on your local machine. It gives you a standard API endpoint to connect to, allowing you to consume the models in a standardised way.

Rather than discussing the tool’s features further, let’s install it and run some tests.

Installing on macOS

Installing Ollama on macOS using Homebrew↗ couldn’t be simpler; all you need to do is run:

Installing the desktop version of ollama
brew install --cask ollama

The keen-eyed amongst you may have noticed that I am passing the --cask flag; this installs the desktop version of Ollama rather than just the terminal version, which you can install by running:

Installing the terminal version of ollama
brew install ollama

While the desktop version of Ollama doesn’t have many features, running it allows you to quickly start and stop the web services that run in the background by opening and closing the application. Another reason to prefer the desktop application over just running it on the command line is that it quietly handles updating itself in the background, prompting you to restart whenever a fresh update is available for download.
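
If you go for the terminal-only version instead, there is no menu bar application to start and stop the background service for you. A couple of ways of managing it yourself are sketched below; this assumes you installed Ollama via Homebrew and relies on the built-in ollama serve command.

Starting and stopping the Ollama server by hand
# Run the API server in the foreground (press Ctrl+C to stop it)
ollama serve
# Or let Homebrew manage it as a background service
brew services start ollama
brew services stop ollama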

So, now that we have it installed, let’s do something.

Pulling and running a model

Anyone familiar with the Docker way of pulling and using images will instantly feel at home here; to download and install the llama2 7B model, we need to run:

Pulling llama2
ollama pull llama2:latest

This should give you something like the following output:

Output
pulling manifest
pulling layers... 100% ▕████████████████▏ 3.8 GB
verifying sha256 digest
writing manifest
removing any unused layers
success

Please note the file size: 3.8GB, so ensure you have the bandwidth available. Luckily, once downloaded, Ollama doesn’t have to connect to the internet again (unless you want to download another model or update it).
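
If you ever want to check which models you have pulled down and how much disk space they are taking up, Ollama ships with a list command that prints each model along with its ID, size, and when it was last modified:

Listing the locally downloaded models
ollama list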

With the model downloaded, we can now interact with it by running the command below:

Opening a chat with llama2
ollama run llama2

Once launched, you will be dropped into a chat prompt, and from here you can ask your questions:

Chatting with llama2
>>> Send a message (/? for help)
[chat session output truncated]

Entering any text at the >>> prompt will send it directly to the model; there is also a help menu that can be accessed by typing /?, which shows the available commands, one of which is /bye, which exits the chat.
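
You don’t have to use the interactive chat at all; if you just want a quick one-off answer, you can pass your question to ollama run as an argument, and it will print the response and drop you straight back to your shell (the question below is just an example):

Asking a one-off question from the terminal
ollama run llama2 "Why is the sky blue?"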

Using the API

As mentioned, Ollama runs a web-based API on your local machine, which listens on port 11434 by default. You can view this by going to http://localhost:11434↗ in your browser, and you should receive the message “Ollama is running” - you can send requests directly to the API using cURL, for example:

Asking a question using cURL
cu}r"""'lmpsort-dorXemelpaP"tmO:""S::T""lWfhlhatayltmspaie:2s/"/,tlhoecaslkhyosbtl:u1e1?4"3,4/api/generate-d'{

After a second or two, a response is returned:

Output
{"model":"llama2","created_at":"2024-03-29T11:16:00.231151Z","response":"\nThe sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.\n\nAs a result of this scattering, the blue light is dispersed throughout the atmosphere, giving the sky its blue appearance. The blue light is scattered in all directions, but it is most visible in the direction of the sun, which is why the sky appears blue during the daytime.\n\nIt's worth noting that the color of the sky can appear different under different conditions. For example, during sunrise and sunset, the sky can take on hues of red, orange, and pink due to the angle of the sunlight and the scattering of light by atmospheric particles. In urban areas, the light pollution from city streets can make the sky appear more yellow or orange than blue.\n\nSo, to summarize, the sky appears blue because of Rayleigh scattering, which scatters shorter (blue) wavelengths of light more than longer (red) wavelengths, giving the appearance of a blue sky.","done":true,"context": [518,25580,29962,29889],"total_duration":9717238042,"load_duration":663584,"prompt_eval_duration":167513000,"eval_count":291,"eval_duration":9548392000}


In the output above, I have truncated the context values, as there are a lot of them.
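
Because the API returns JSON, it plays nicely with command-line tools such as jq. The example below is just a sketch - it assumes you have jq installed (brew install jq) - and uses it to pull only the response text out of the reply:

Extracting just the response text with jq
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'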

Running another model

Do you want to run another model, like the newly launched Mistral 7B v0.2 release (which, at the time of writing this post, was released just last week)? No problem; just run:

Download and chat with Mistral v0.2
ollama run mistral:latest

This will pull the model and drop us straight at a chat prompt:

Download and chat with Mistral v0.2
pulling manifest
pulling layers... 100% ▕████████████████▏ 4.1 GB
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
[chat session output truncated]

As you can see, this time it was a 4.1GB download - which means we now have around 8GB of LLMs downloaded and sitting on our local machine. To interact with Mistral using the API, switch the model name and send your request:

Asking a question using cURL
cu}r"""'lmpsort-dorXemelpaP"tmO:""S::T""mWfhihatsylttsprie:as/l/"tl,hoecaslkhyosbtl:u1e1?4"3,4/api/generate-d'{

This returns the same JSON response (apart from content, of course, as it’s a different model). Before moving on to the next part of the post, let’s pull down one more model:

Pulling codellama
ollama pull codellama:latest
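
If you are curious about how a model has been put together, the ollama show command can print its details. The flags below should be available in recent Ollama releases and display the Modelfile and parameters for the codellama model we have just pulled:

Inspecting the codellama model
ollama show codellama:latest --modelfile
ollama show codellama:latest --parameters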

Open WebUI


The authors describe the project, which was formerly called Ollama WebUI - so you can guess what it is used for - as:

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.

It is distributed as a container image, so we can run it using Docker or Podman, with little in the way of prerequisites needing to be installed.

Running on macOS

The only step we need to take is to create somewhere to store our data; to do this, I have a folder called ~/Containers/ on my machine, so let’s stick an open-webui folder in there and pull the image:

Sorting out a directory to store our data in
mkdir -p ~/Containers/open-webui
docker image pull ghcr.io/open-webui/open-webui:main

With the folder in there and the image pulled, the following command will launch Open WebUI and bind it to port 3000 on our local machine:

Launching Open WebUI
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v ~/Containers/open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
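
Before heading to the browser, it is worth checking that the container has come up cleanly; the standard Docker commands below show its status and follow its logs:

Checking on the Open WebUI container
# Confirm the container is up and the ports are mapped
docker container ls --filter name=open-webui
# Tail the logs (press Ctrl+C to stop following them)
docker logs --follow open-webui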

With the container running, go to http://localhost:3000/↗.

Our first chat

A login page should greet you; click on the Sign Up link and create a user. Once you have an account, you will be presented with a ChatGPT-like interface; select a model from the drop-down menu at the top of the chat box and ask your question: