Notes on SmolLM-3
July 10, 2025
My mental model for these new small local models is that they're like the MP3. They're compressed, efficient, and for a lot of tasks, they're indistinguishable from the lossless WAV file of a much larger model. They're perfect for things like quick summarizations, reformatting code, or generating boilerplate. But how good are they?
I wanted to get SmolLM-3, a new 3B parameter model, running locally to play this exact role.
After a bit of digging, I found that the easiest path to getting it running on my Mac is Apple's own MLX framework, via the `mlx-lm` package. It's built for Apple Silicon and avoids the need to compile anything from source.
First, install the latest version:
```bash
pip install -U mlx-lm  # upgrade if already installed; you need the bleeding edge
```
Then you can run generation directly, pointing it at the model on the Hugging Face Hub:
```bash
mlx_lm.generate \
  --model "mlx-community/SmolLM3-3B-bf16" \
  --max-tokens 2048 \
  --max-kv-size 16384 \
  --system-prompt "/no_think"
```
The prompt (using `--prompt`):
```
$(glow github.com/supermacro/neverthrow)
---
INSTRUCTIONS: When to use `orTee` and when to use `andTee`, be brief, show code samples
```
And the response:
Here is a concise explanation of when to use `orTee` and when to use `andTee`:
**Use `andTee`** when you want to log or handle side effects that should only occur if the main logic succeeds. For example, if you're inserting a user into a database and then logging the user's details, you'd use `andTee` to ensure the logging happens only if the insertion is successful.
**Use `orTee`** when you want to log or handle side effects that should only occur if the main logic fails. For example, if you're parsing user input and then logging an error if the parsing fails, you'd use `orTee` to ensure the error is logged only if the parsing fails.
Here are some code examples:
**Using `andTee`**:
```javascript
const resAsync = parseUserInput(userInput)
.andTee(logUser)
.asyncAndThen(insertUser)
```
**Using `orTee`**:
```javascript
const resAsync = parseUserInput(userInput)
.orTee(logParseError)
.asyncAndThen(insertUser)
```
In the first example, `logUser` is called only if the parsing is successful. In the second example, `logParseError` is called only if the parsing fails.
Usage:
```
Prompt: 15299 tokens, 444.994 tokens-per-sec
Generation: 281 tokens, 19.824 tokens-per-sec
Peak memory: 7.964 GB
```
This is on my M1 MacBook Pro with 32 GB of RAM. The whole run took 56 seconds and the answer is correct. Pretty impressive.
In the prompt I'm using Glow to grab the Markdown README for neverthrow, then asking a specific question about it. Now I'm wondering: could you practically move some LLM workloads out of coding agents like Claude Code by telling the agent to call a CLI that hands tasks off to a local LLM? It might come down to preference: it takes longer, but it saves tokens and Claude's context.
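As a sketch of what that handoff could look like, here's a minimal shell wrapper around the same two pieces used above, `glow` and `mlx_lm.generate`. The function name `ask_local` is hypothetical; everything else is just the invocation from earlier, parameterized.

```bash
# Hypothetical helper: ask the local model a question about a repo's README.
# Usage: ask_local github.com/supermacro/neverthrow "When should I use orTee?"
ask_local() {
  local repo="$1"
  local question="$2"
  mlx_lm.generate \
    --model "mlx-community/SmolLM3-3B-bf16" \
    --max-tokens 2048 \
    --max-kv-size 16384 \
    --system-prompt "/no_think" \
    --prompt "$(glow "$repo")
---
INSTRUCTIONS: ${question}"
}
```

A coding agent could then shell out to `ask_local` instead of spending its own context window on the README.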
I tried another prompt variation: "What's the best way to log an error without affecting the result type?" Here I'm not mentioning `orTee` by name but asking the model to point me to it. With `--system-prompt '/no_think'` it gave the wrong answer. With reasoning enabled it went in circles for a bit, but after 150 seconds it came up with a great answer. That gives some indication of what to expect from small local models.
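For reference, the reasoning run is just the same command with the `/no_think` system prompt dropped, since SmolLM3 enables extended thinking by default (the elided prompt here is the question above):

```bash
# Same invocation as before, minus --system-prompt "/no_think",
# which restores the model's default extended-thinking mode.
mlx_lm.generate \
  --model "mlx-community/SmolLM3-3B-bf16" \
  --max-tokens 2048 \
  --max-kv-size 16384 \
  --prompt "..."
```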
Interesting times!