← /research/ · research ·

Local Vision Models for Alt Text, Tested: Six Models, Three Variables, One Recipe (2026)

We sent 46 images through six on-device vision models to see which writes web-ready alt text, and how the prompt and the input resolution change the result. The winner, the exact prompt, and every number, run locally with no cloud API. Joel Shetler·aiaccessibility

Most writing about AI alt text stops at "use a vision model." That skips the questions that decide whether the output is usable: which model, prompted how, fed an image at what size. So we ran the test. Six vision models, all of them on a laptop with no cloud API, across 26 images, with the model held constant while we varied the prompt, and the prompt held constant while we varied the resolution. Then we re-ran the winning combination on 20 images it had never seen, to check that the recipe holds.

Every number below comes from a local run you can reproduce with the commands in section 6. Nothing went to a hosted API, so there is no per-call bill and no rate limit hiding in the results. The token counts are still here, because if you would prefer to call a frontier model through an API, the input-token count is the number that sets your cost.

A note on house style before we start: this report was researched and written by the team at Scratch, the desktop app we built for editing a whole catalog of content at once. Section 9 is where alt text at that scale connects to what we sell, labeled as such. Everything before it is field notes.

What is inside

  1. Key findings
  2. Setup and method
  3. Variable one: the model
  4. Variable two: the prompt
  5. Variable three: image resolution
  6. The recipe we would ship
  7. Does it reproduce? A held-out run
  8. Limitations
  9. If you are doing this across a whole catalog
  10. Image sources and licenses
  11. Appendix: the full pipeline source

TL;DR, just give me the prompt. Send each image to a vision model with the prompt below, then enforce a 125-character cap. If the model goes over, re-ask it to "shorten to under 125 characters." Downscale images to about 768 px on the long edge first, because bigger is slower with no quality gain. The best local model in our tests was Qwen3-VL 30B through LM Studio. Any current frontier vision model works through an API.

Write production-ready alt text for this image for a SaaS company's website.
Rules:
- ONE line, MAXIMUM 125 characters.
- Describe only what is actually visible. Do NOT guess locations, settings, people's names, or brands that are not clearly shown.
- For a photo of a person, describe them plainly by apparent gender and clearly visible features, e.g. 'Woman with long dark hair in a white turtleneck, smiling'. Do not use vague labels like 'business professional' or 'person' when gender is clearly visible.
- For a product illustration, diagram, or UI mockup, describe what the graphic DEPICTS (the product, tool, or workflow). Treat any sample or placeholder text shown inside a mockup as example content, NOT the subject. Do not make a placeholder word the main topic.
- If readable product names or UI labels are clearly visible (e.g. Supabase, Notion, Webflow), you may include them.
- Plain, specific language. Avoid subjective adjectives like sleek, futuristic, cheerful, professional, or modern.
- Do NOT start with 'image of', 'photo of', 'screenshot of', or 'illustration of'.
- No quotation marks. Output only the alt text.

This exact prompt produced 0 errors and 0 over-limit results on a 20-image held-out set (section 7). Swap "a SaaS company's website" for your own context. The rest of this report is how we arrived at it.

1. Key findings

2. Setup and method

Every model runs on-device through LM Studio, which exposes an OpenAI-compatible endpoint at http://localhost:1234/v1. Each request sends one image, as a base64 data URL, plus a text prompt to /v1/chat/completions:

def ask(model, prompt, img_path, max_tokens=300, reasoning_effort=None):
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": data_url(img_path)}},
            ],
        }],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }
    if reasoning_effort is not None:        # "none" disables thinking on Gemma 4 / Qwen3.6
        payload["reasoning_effort"] = reasoning_effort
    r = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
    ...

Every image is normalized before sending. We decode it (WebP and AVIF included), convert to RGB, downscale the long edge, and re-encode as JPEG, so format and size are controlled, not left to whatever the page served:

def data_url(path, max_side=768):
    """Normalize any format (webp/avif/png) to a downscaled JPEG data URL."""
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    if max(im.size) > max_side:
        im.thumbnail((max_side, max_side))
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

The image set

The main study uses 26 images. Twelve are ours, scraped from our own marketing site (whalesync.com): headshots, clean UI screenshots, stylized product illustrations, a cartoon mascot, and busy marketing composites. Fourteen are openly licensed (CC0, public domain, or CC-BY), pulled from Openverse across deliberately different scenarios: portrait photos, landscapes, food, charts, flow diagrams, scanned documents, and artwork. Mixing our own product imagery with a generic open set keeps the findings from overfitting to one site's house style. Full attribution for every open image, including the 20 used later for validation, is in section 10.

A few words on how to read the tables:

3. Variable one: the model

Same production prompt, same 26 images, across six local vision models spanning four families (Mistral, Google Gemma, Qwen, and Zhipu GLM), with two entries from the Qwen line, and ranging from 3B to 30B. Reasoning models ran with thinking disabled through their respective switch. Alongside quality and speed we report token usage, because for anyone weighing a hosted API, the input-token count, which includes the image, is the real cost driver.

Model Avg chars Median latency Max latency* Input tok Output tok Est $/1k imgs† Over 125c Errors
Ministral 3B 85 4.8 s 7.3 s 1562 27 $0.33 0 0
Gemma 4 12B 108 5.0 s 9.6 s 489 24 $0.11 3 0
Gemma 4 26B 102 6.1 s 22.0 s 489 21 $0.11 0 0
Qwen3-VL 30B (MoE) 99 6.5 s 15.4 s 270 27 $0.07 2 0
Qwen3.6 27B (dense) 120 27.1 s 36.0 s 282 24 $0.07 12 0
GLM-4.6V flash 105 14.0 s 16.7 s 268 27 $0.07 5 0

*Max latency is roughly the cold model-load on the first image. †Illustrative only: input and output tokens priced at $0.20 and $0.60 per million, a generic open-vision-model API rate. These models ran locally at no marginal cost. Use the column to compare models, not as a real invoice.

What we saw, model by model:

Three images that drove those notes:

Product illustration: a website mockup carrying sample 'Automotive' marketing copy beside a CMS dashboard grid

The CMS illustration. The mockup's sample copy mentions the automotive industry. Ministral 3B made that the subject ("a content management system interface for automotive SEO optimization"). Qwen3-VL and both Gemmas read it as on-screen sample text and described the actual product, a CMS dashboard. This single image is the case for the placeholder rule in section 4.

Oil portrait of a man in a grey military uniform and cap, seated with his hands folded

The oil portrait. Both Gemmas and Qwen described the uniform, the cap, and the pose. Ministral 3B returned "Artistic painting of abstract shapes and colors," a clean miss on a figurative portrait.

Network chart linking Stein Bagger and IT Factory to other companies and people with red lines

The network chart. Every model above the 3B read the proper nouns ("Stein Bagger," "IT Factory") straight off a busy news graphic. Reading printed names in a chart is a solved problem for these models. Inventing a subject from a placeholder is not.

4. Variable two: the prompt

Same model (Ministral 3B, the weakest reader, which makes the prompt's effect easy to see) and the same 12 owned images, across three prompt styles. Prompt choice swings output length far more than model choice does.

Prompt style What it asks for Avg chars
detailed Free-form: one descriptive sentence, no length cap. 151
web_125 Web cap: a single line under 125 chars, name prominent UI or text. 112
short_12w Terse: 12 words max, one line. 51

The free-form prompt averages well past the 125-character target. An explicit cap pulls it into range. Three representative images, all three styles:

Woman with long dark hair in a white turtleneck, smiling, against a soft geometric background
Prompt style Output Chars
detailed A professional portrait of an engaged individual smiling warmly against a modern, abstract background featuring soft geometric shapes and light tones. 150
web_125 Professional woman smiling against modern office background with abstract geometric lighting. 93
short_12w Woman smiling indoors wearing light top 39
Illustration of a Supabase table editor syncing to a Notion database, with arrows showing data flow
Prompt style Output Chars
detailed A sleek database interface displaying real-time data synchronization between Supabase and Notion, showcasing seamless integration capabilities. 143
web_125 Real-time database integration displays Supabase table editor syncing with Notion's contact list. 97
short_12w Database table syncing in Notion via Supabase real-time integration 67
Product UI listing several syncs with status labels such as Syncing, Active, and Disabled
Prompt style Output Chars
detailed A dashboard showing active and syncing integrations for database management, marketing site optimization, lead enrichment, reporting tools, and internal intranet within a software system. 187
web_125 SaaS sync dashboard showing active database integrations with status indicators for marketing site SEO, lead enrichment, conference reporting, and intranet systems 163
short_12w Database sync status dashboard showing active and paused integrations 69

That last row is the catch: even the capped web_125 prompt overshoots on a dense dashboard (163 characters). Length instructions move the average but do not guarantee the ceiling. That is why the production recipe adds a deterministic second pass.

The production prompt does two more things beyond length, and both showed up as quality wins:

The production prompt is the one in the TL;DR. A small cleanup pass strips the boilerplate models like to add ("Alt text:", wrapping quotes, stray special tokens):

def clean_alt(text):
    """Strip the junk models add around the actual alt text."""
    t = text.strip()
    t = re.sub(r'<\|[^|]*\|>', '', t)                          # GLM's <|begin_of_box|> wrappers
    t = re.sub(r'^\s*alt[\s_-]*text\s*[:\-–]\s*', '', t, flags=re.I)  # leading "Alt text:"
    t = t.strip().strip('"“”‘’\'').strip()                      # wrapping quotes
    return re.sub(r'\s+', ' ', t)                               # collapse whitespace

And because no model reliably obeys the character cap, a one-line text-only retry rewrites anything that spills over. It is generous with tokens so a reasoning model can think and still answer, and it keeps the original if the rewrite comes back empty:

def shrink(text, model):
    prompt = (f"Rewrite this website alt text to be UNDER 125 characters while keeping "
              "the most important information. One line, plain language, no quotes. "
              f"Output only the rewritten alt text:\n\n{text}")
    payload = {"model": model, "temperature": 0.2, "max_tokens": 512,
               "messages": [{"role": "user", "content": prompt}]}
    cand = clean_alt(post(payload))
    return cand if cand and not cand.startswith("ERROR:") else text

5. Variable three: image resolution

Fixed model (Qwen3-VL 30B) and prompt. The same image goes to the model at six max-side resolutions. We chose text-heavy UI images on purpose, because reading the labels is the hard part. The encoder is the same encode_at thumbnail step from section 2, just swept across sizes:

Max side (px) Mean payload Median latency Mean chars
256 8 KB 3.5 s 137
384 16 KB 3.7 s 205
512 26 KB 3.5 s 139
768 48 KB 4.6 s 111
1024 75 KB 5.9 s 154
1536 138 KB 11.9 s 148

Median latency is shown because the first request of the run (256 px, first image) included a one-time model load of 31.5 s and would otherwise distort the smallest bucket. Mean character count is not monotonic, which is the point: more pixels did not buy a better description.

The result inverts the usual instinct. Three side-by-side demos, each sent at 256 px and at 640 px:

Supabase-to-Notion sync illustration sent to the model at 256 pixels
sent at 256 px
The same Supabase-to-Notion sync illustration sent to the model at 640 pixels
sent at 640 px

At 256 px Qwen3-VL still named "Supabase," "Notion," and "Webflow." At higher resolution it traded one of those for slightly different phrasing, not for more accuracy.

App Database and User Management panel illustration sent to the model at 256 pixels
sent at 256 px
The same App Database and User Management panel illustration sent to the model at 640 pixels
sent at 640 px

The "App Database" and "User Management" panel labels read fine at 256 px. The fine print inside the panels (column names, a highlighted row) only appears at higher resolution.

Sessions and Speakers to Conference Planner illustration sent to the model at 256 pixels
sent at 256 px
The same Sessions and Speakers to Conference Planner illustration sent to the model at 640 pixels
sent at 640 px

"Sessions & Speakers" and "Conference Planner" read at 256 px. At 1536 px the model added that the source was a Notion database and the target a Google Sheet, finer detail, but the core description was already right at the smallest size.

What this means in practice:

6. The recipe we would ship

Putting the three variables together, the configuration we would run for this kind of image mix:

To reproduce every number in this report:

python3 download_images.py https://whalesync.com --max 12   # our 12 owned images
python3 fetch_openverse.py                                  # 14 openly-licensed images
python3 compare_models.py --models mistralai/ministral-3-3b google/gemma-4-12b-qat \
        google/gemma-4-26b-a4b qwen/qwen3-vl-30b --reasoning-effort none --max-tokens 300
#   ...add another model later as a column:  compare_models.py --merge --models <id>
python3 alt_prompts.py            # prompt sweep (Ministral)
python3 size_experiment.py        # resolution sweep (Qwen)
python3 validate.py               # held-out repeatability run

7. Does it reproduce? A held-out run

A recipe that only works on the images you tuned it against is not a recipe. So we ran the winning configuration once, unchanged, Qwen3-VL 30B plus the tightened 125-character prompt plus 768 px plus the shorten pass, over 20 fresh openly-licensed images that were never used while developing it. The new set deliberately covers new categories: city, nature, sport, workspace, pie charts, schematics, maps, letters, still life, portrait, and vehicle.

Run Images Avg chars Input tok Output tok Est $/1k Over 125c Errors
Study set, Qwen3-VL 30B 26 99 270 27 $0.07 2 0
Held-out validation 20 96 270 23 $0.07 0 0

The recipe reproduces. Twenty images, 0 errors, 0 over the 125-character limit, a flat 270 input tokens per image (identical to the study run), and about 4.8 seconds per image at steady state. Output length and cost match the study set within noise.

Every line of output from that run:

Image (category) Alt text (Qwen3-VL 30B at 768px) Chars
city White Art Deco building with decorative facade and fire escapes against a blue sky. 83
city Night scene of a city street with illuminated buildings, a red bus, and people walking. 87
letter Handwritten text in brown ink on a white background, with a small green and blue butterfly emblem in the corner. 112
letter Handwritten letter on aged paper with a date of Sept. 8, 1944, in the upper right corner. 89
map Vintage map collage with double-hemisphere projection, aged paper, and decorative elements. 91
map Circular diagram showing planets and zodiac signs in orbits around the sun, labeled in Latin. 93
nature A forest with tall trees and a branch holding a few orange leaves, set against a misty blue background. 103
nature A black and white photo of a winding path through a forest with a large, gnarled tree branch arching overhead. 110
pie chart Infographic showing statistics on food waste, including global and U.S. data, causes, costs, and environmental impact. 118
pie chart Infographic showing drug production and usage statistics in the Americas with circular charts and a map. 104
portrait Woman with dark hair lying on her back, arms raised above her head, hands open, in a black and white photograph. 112
schematic Diagram of a control panel with labeled sections, knobs, switches, and handwritten notes. 89
schematic Hand-drawn diagram of a bicycle dynamo AC to 5V USB adapter with bridge rectifier, voltage regulator, and USB pins. 115
sport Empty green field with red seats and two tall light towers under a bright sky. 78
sport Woman with dark hair in a ponytail wearing a purple jacket and visor, holding a black jacket and a tennis racket. 113
still life Still life painting of fruit, a glass of red liquid, and a knife on a table. 76
still life Still life painting with grapes, a lobster, fruit, and a glass on a dark table. 79
vehicle Two diagrams of a vintage car with French labels identifying parts like the chassis, fender, and running board. 111
workspace Two people at a desk with laptops and papers, one writing with a red pen. 73
workspace Top-down view of a wooden desk with a laptop, tablet, coffee cup, glasses, and a small plant. 93

The interesting failures were in the data, not the model. The keyword a query returned did not always match the picture, and Qwen described what was in the picture instead of forcing the expected label:

Empty green sports field with red seats and two floodlight towers under a bright sky
query: "soccer match"
Circular Latin-labeled diagram of planets and zodiac signs orbiting the sun
query: "antique map"

A "soccer match" query returned an empty stadium, and Qwen wrote "Empty green field with red seats and two tall light towers," not "a soccer match." An "antique map" query returned a zodiac chart, and Qwen wrote "Circular diagram showing planets and zodiac signs," not "a map." It described the schematic and the vintage-car diagram below by their actual labels too. The model did not hallucinate to match the keyword, which is exactly the behavior you want.

Hand-drawn schematic of a bicycle dynamo AC-to-USB adapter with a bridge rectifier and voltage regulator
read the components, not just "a diagram"
Two labeled diagrams of a vintage car with French part names
read the French labels

The only real friction was sourcing. One Openverse query ("man headshot studio") returned no openly-licensed results and the API rate-limited intermittently, so the first fetch yielded 18 of 20. Two supplemental queries topped it back up to 20. None of that touched the model output.

8. Limitations

9. If you are doing this across a whole catalog

This is the part where the research connects to what we build, labeled so you can skip it.

Running the recipe once is a script. Running it across every product image in a Shopify or Webflow catalog, every illustration in a CMS, every photo in a media library, and then reviewing each line before it ships, is the part that takes weeks by hand or gets handed to a plugin you cannot inspect. That review step is not optional: the tables above show even the best model overshoots the cap sometimes and occasionally calls a website a mobile app, so something has to catch those before they go live.

That is the job Scratch does. It pulls every record out of your platform into local files, your model writes the alt text there (not just for the first 200 rows), and you approve every change as a diff before anything publishes. The same shape applies whether the field is alt text, a meta description, or a product title. If you want the step-by-step version for one platform, we wrote up how to add alt text to every WordPress image with AI, and the Shopify SEO teardown covers where image alts sit in a larger on-page picture. This report is the kind of work we run on our own content.

10. Image sources and licenses

The 12 product images are ours (whalesync.com) and used with permission. The images below are openly licensed (CC0, public domain, or CC-BY) and were retrieved through Openverse: the 14 used in the study plus the 20 used for held-out validation, credited here so this report can be shared freely.

Category Title / creator License Source
photo-person Portrait photograph PDM 1.0 flickr
photo-person Pearls, Pearls, Pearls (cindy richardson) BY 2.0 flickr
photo-scene Mountain lake with gate and figures CC0 1.0 rawpixel
photo-scene Mountain lake scene (Ferdinand Katona) CC0 1.0 rawpixel
photo-food Tofu croquettes on salad (Lablascovegmenu) BY 2.0 flickr
photo-food Lunch at DCPS (DC Central Kitchen) BY 2.0 flickr
chart Net closing in on missing Stein Bagger millions (Podknox) BY 2.0 flickr
chart Financial statistics and accounting concept (paola.bazurto4) BY 2.0 flickr
diagram The Self-Repair Manifesto, iFixit (dullhunk) BY 2.0 flickr
diagram Using Social Media for Professional Learning (cambodia4kidsorg) BY 2.0 flickr
document Diary (Barnaby) BY 2.0 flickr
document Art Journal Template (Calsidyrose) BY 2.0 flickr
artwork Portrait of a uniformed man (Jacques-Émile Blanche) CC0 1.0 rawpixel
artwork Postinjohtaja Chr. Völschow, 1684 CC0 1.0 rawpixel
city Eddy Street (dalecruse) BY 2.0 flickr
city Big City Life, Berlin (matthias.ripp) BY 2.0 flickr
nature Leafs in the winter fog (Gael Varoquaux) BY 2.0 flickr
nature Over the Trail (Corey Leopold) BY 2.0 flickr
sport Sydney Showground (edwin.11) BY 2.0 flickr
sport Sania Mirza (Andrew Campbell Photography) BY 2.0 flickr
workspace Writing Papers (Helloquence) CC0 1.0 stocksnap
workspace Macbook Laptop (Lia Leslie) CC0 1.0 stocksnap
pie-chart The Big Food Wasters (GDS Infographics) BY 2.0 flickr
pie-chart Illegal Drugs in the Americas (GDS Infographics) BY 2.0 flickr
schematic Control Display from Apollo 13 (jurvetson) BY 2.0 flickr
schematic Bicycle dynamo USB adapter (aegidian) BY 2.0 flickr
map Assorted Maps brush set (webtreats) BY 2.0 flickr
map Planisphaerium Ptolemaicum (Leventhal Map Center, BPL) BY 2.0 flickr
letter Write more Letters (Markus Reinhardt) BY 2.0 flickr
letter letter (Lizbeth King) BY 2.0 flickr
still-life Still Life with Fruit and Sweetmeats, 1635 (Georg Flegel) CC0 1.0 rawpixel
still-life Still Life with Fruit and Lobster (Pieter de Ring) CC0 1.0 rawpixel
portrait Art of gymnastics (lauragrb) BY 2.0 flickr
vehicle Nomenclature automobile en 1927 (claude-22) PDM 1.0 flickr

11. Appendix: the full pipeline source

Every script that produced the numbers above, complete and runnable. They share a small request layer (gen_alt_text.py) and talk only to a local LM Studio endpoint. Click to expand.

download_images.py · scrape and download images from a web page
#!/usr/bin/env python3
"""Download images from a web page (or pages) into ./images.

Usage:
    python3 download_images.py                       # defaults to whalesync.com
    python3 download_images.py https://example.com   # one or more URLs
    python3 download_images.py URL --max 15          # cap number of images
"""
import argparse
import hashlib
import os
import re
import sys
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)
OUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "images")
# Skip vector/icon formats that raster vision models can't ingest well.
SKIP_EXT = {".svg", ".ico", ".gif"}


def largest_from_srcset(srcset: str) -> str | None:
    """Pick the highest-resolution candidate from a srcset attribute."""
    best_url, best_w = None, -1
    for part in srcset.split(","):
        bits = part.strip().split()
        if not bits:
            continue
        url = bits[0]
        w = 0
        if len(bits) > 1 and bits[1].endswith("w"):
            try:
                w = int(bits[1][:-1])
            except ValueError:
                w = 0
        if w >= best_w:
            best_url, best_w = url, w
    return best_url


def collect_image_urls(page_url: str) -> list[str]:
    r = requests.get(page_url, headers={"User-Agent": UA}, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    urls: list[str] = []

    for img in soup.find_all("img"):
        cand = None
        if img.get("srcset"):
            cand = largest_from_srcset(img["srcset"])
        if not cand:
            cand = img.get("src") or img.get("data-src")
        if cand:
            urls.append(urljoin(page_url, cand))

    # <picture><source srcset=...></picture>
    for source in soup.find_all("source"):
        if source.get("srcset"):
            cand = largest_from_srcset(source["srcset"])
            if cand:
                urls.append(urljoin(page_url, cand))

    # og:image
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        urls.append(urljoin(page_url, og["content"]))

    return urls


def filename_for(url: str, content_type: str) -> str:
    path = urlparse(url).path
    base = os.path.basename(path) or "image"
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    root, ext = os.path.splitext(base)
    if not ext:
        ext = {
            "image/jpeg": ".jpg",
            "image/png": ".png",
            "image/webp": ".webp",
        }.get(content_type.split(";")[0].strip(), ".jpg")
        base = root + ext
    # Prefix with a short URL hash to avoid collisions across pages.
    h = hashlib.sha1(url.encode()).hexdigest()[:8]
    return f"{h}_{base}"


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("urls", nargs="*", default=["https://whalesync.com"],
                    help="page URL(s) to scrape (default: https://whalesync.com)")
    ap.add_argument("--max", type=int, default=20, help="max images to download")
    ap.add_argument("--min-bytes", type=int, default=5000,
                    help="skip images smaller than this (filters icons/spacers)")
    args = ap.parse_args()
    urls = args.urls or ["https://whalesync.com"]

    os.makedirs(OUT_DIR, exist_ok=True)

    # Gather + dedupe candidate image URLs across all pages.
    seen_urls, candidates = set(), []
    for page in urls:
        try:
            found = collect_image_urls(page)
        except Exception as e:
            print(f"!! failed to scrape {page}: {e}", file=sys.stderr)
            continue
        print(f"   {page}: found {len(found)} <img>/source candidates")
        for u in found:
            if u in seen_urls:
                continue
            ext = os.path.splitext(urlparse(u).path)[1].lower()
            if ext in SKIP_EXT:
                continue
            seen_urls.add(u)
            candidates.append(u)

    print(f"\n{len(candidates)} unique candidate images; downloading up to {args.max}\n")

    saved, seen_hashes = 0, set()
    for u in candidates:
        if saved >= args.max:
            break
        try:
            resp = requests.get(u, headers={"User-Agent": UA}, timeout=30)
            resp.raise_for_status()
            data = resp.content
        except Exception as e:
            print(f"  skip (error) {u}: {e}")
            continue
        if len(data) < args.min_bytes:
            print(f"  skip (too small {len(data)}b) {u}")
            continue
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen_hashes:
            print(f"  skip (duplicate bytes) {u}")
            continue
        seen_hashes.add(digest)
        ct = resp.headers.get("Content-Type", "")
        if "image" not in ct and not os.path.splitext(u)[1]:
            print(f"  skip (not image, {ct}) {u}")
            continue
        fname = filename_for(u, ct)
        with open(os.path.join(OUT_DIR, fname), "wb") as f:
            f.write(data)
        saved += 1
        print(f"  [{saved}] {fname}  ({len(data)//1024} KB)")

    print(f"\nDone. {saved} images in {OUT_DIR}")


if __name__ == "__main__":
    main()
fetch_openverse.py · pull openly-licensed images from Openverse
#!/usr/bin/env python3
"""Fetch openly-licensed images (Openverse API) with license + attribution.
CC0 / public-domain / CC-BY only, recorded so the shareable report can credit them.

    python3 fetch_openverse.py                 # main set -> ./images
    python3 fetch_openverse.py --set val       # held-out validation set -> ./images_val
"""
import argparse
import hashlib
import io
import json
import os

import requests
from PIL import Image

HERE = os.path.dirname(os.path.abspath(__file__))
API = "https://api.openverse.org/v1/images/"
UA = "alt-text-research/1.0 (local eval)"

# (slug, query, category-or-None)
CATEGORIES_MAIN = [
    ("photo-person", "portrait person face", "photograph"),
    ("photo-scene", "mountain landscape lake", "photograph"),
    ("photo-animal", "dog puppy outdoors", "photograph"),
    ("photo-food", "plate of food meal", "photograph"),
    ("chart", "bar chart graph statistics", None),
    ("diagram", "flowchart process diagram", None),
    ("document", "scanned document page text", None),
    ("artwork", "oil painting portrait", "digitized_artwork"),
]
# Deliberately different queries so the validation set is genuinely held-out.
CATEGORIES_VAL = [
    ("portrait", "man headshot studio", "photograph"),
    ("city", "city street architecture", "photograph"),
    ("nature", "forest trees path", "photograph"),
    ("sport", "soccer match action", "photograph"),
    ("workspace", "laptop desk office", "photograph"),
    ("pie-chart", "pie chart infographic", None),
    ("schematic", "circuit diagram schematic", None),
    ("map", "antique map illustration", None),
    ("letter", "handwritten letter vintage", None),
    ("still-life", "still life painting fruit", "digitized_artwork"),
]
MIN_BYTES, MAX_BYTES = 15_000, 8_000_000


def search(query, category):
    params = {"q": query, "license": "cc0,pdm,by", "page_size": 12, "mature": "false"}
    if category:
        params["category"] = category
    r = requests.get(API, params=params, headers={"User-Agent": UA}, timeout=30)
    r.raise_for_status()
    return r.json().get("results", [])


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--set", choices=["main", "val"], default="main")
    ap.add_argument("--per", type=int, default=2, help="images per category")
    args = ap.parse_args()

    cats = CATEGORIES_MAIN if args.set == "main" else CATEGORIES_VAL
    out_dir = os.path.join(HERE, "images" if args.set == "main" else "images_val")
    meta_path = os.path.join(HERE, "images_meta.json" if args.set == "main"
                             else "images_val_meta.json")
    os.makedirs(out_dir, exist_ok=True)
    meta = json.load(open(meta_path)) if os.path.exists(meta_path) else {}

    # Dedupe against anything already downloaded in either set.
    seen_hashes = set()
    for d in ("images", "images_val"):
        dp = os.path.join(HERE, d)
        if os.path.isdir(dp):
            for fn in os.listdir(dp):
                try:
                    seen_hashes.add(hashlib.sha1(open(os.path.join(dp, fn), "rb").read()).hexdigest())
                except Exception:
                    pass

    for slug, query, category in cats:
        print(f"=== {slug}: '{query}' ===")
        got = 0
        try:
            results = search(query, category)
        except Exception as e:
            print(f"  !! search failed: {e}")
            continue
        for item in results:
            if got >= args.per:
                break
            url = item.get("url")
            if not url:
                continue
            try:
                resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
                resp.raise_for_status()
                data = resp.content
            except Exception as e:
                print(f"  skip (dl error) {url[:55]}: {e}")
                continue
            if not (MIN_BYTES <= len(data) <= MAX_BYTES):
                continue
            digest = hashlib.sha1(data).hexdigest()
            if digest in seen_hashes:
                continue
            try:
                im = Image.open(io.BytesIO(data)); im.load()
                if min(im.size) < 200:
                    continue
                ext = (im.format or "JPEG").lower().replace("jpeg", "jpg")
            except Exception:
                continue
            seen_hashes.add(digest)
            fname = f"ov_{slug}_{got+1}_{digest[:6]}.{ext}"
            with open(os.path.join(out_dir, fname), "wb") as f:
                f.write(data)
            meta[fname] = {
                "category": slug, "title": item.get("title"), "creator": item.get("creator"),
                "license": (item.get("license") or "").upper(),
                "license_version": item.get("license_version"),
                "source": item.get("source"), "landing": item.get("foreign_landing_url"),
            }
            got += 1
            print(f"  [{got}] {fname}  ({len(data)//1024} KB)  {meta[fname]['license']}  {im.size}")
        if got == 0:
            print("  (no usable results)")

    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
    n = len([k for k in meta])
    print(f"\nDone. {n} images recorded in {os.path.basename(meta_path)} (set={args.set})")


if __name__ == "__main__":
    main()
gen_alt_text.py · the request layer: send an image to a local model and parse the reply
#!/usr/bin/env python3
"""Generate alt text for every image in ./images using local LM Studio models.

LM Studio exposes an OpenAI-compatible API at http://localhost:1234/v1.

Usage:
    python3 gen_alt_text.py --probe          # test which models accept images
    python3 gen_alt_text.py                   # full run over all images/models
    python3 gen_alt_text.py --models qwen/qwen3-vl-30b google/gemma-4-12b-qat
    python3 gen_alt_text.py --prompt "Write alt text in <= 12 words."
"""
import argparse
import base64
import glob
import json
import os
import subprocess
import time

import io

import requests
from PIL import Image

HERE = os.path.dirname(os.path.abspath(__file__))
IMG_DIR = os.path.join(HERE, "images")
API = "http://localhost:1234/v1/chat/completions"

# Candidate vision models already downloaded in LM Studio, ordered fastest
# -> slowest so the report fills in quickly (the reasoning model is last).
DEFAULT_MODELS = [
    "mistralai/ministral-3-3b",
    "google/gemma-4-12b-qat",
    "qwen/qwen3-vl-30b",
    "google/gemma-4-26b-a4b",
    "mistralai/ministral-3-14b-reasoning",
]

DEFAULT_PROMPT = (
    "Write concise, descriptive alt text for this image to be used on a website. "
    "Describe the key visual content in one sentence. Be specific about what is "
    "shown. Do not begin with 'image of', 'picture of', or 'a screenshot of'. "
    "Output only the alt text, nothing else."
)

MAX_SIDE = 1024  # downscale longest edge before sending (speed + sanity)


def data_url(path: str) -> str:
    """Normalize any format (webp/avif/png/...) to a downscaled JPEG data URL."""
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    if max(im.size) > MAX_SIDE:
        im.thumbnail((MAX_SIDE, MAX_SIDE))
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"


def ask_full(model: str, prompt: str, img_path: str, timeout: int,
             max_tokens: int = 1024, reasoning_effort: str | None = None,
             base_url: str = API, headers: dict | None = None) -> tuple[str, float, dict]:
    """Return (alt_text_or_error, seconds, usage). Errors are prefixed 'ERROR:'.

    usage = {prompt, completion, total, reasoning, cost} (values may be None).
    `base_url`/`headers` make this provider-agnostic (LM Studio default; an
    OpenAI-compatible cloud endpoint like OpenRouter can be passed in).

    Reasoning models (Gemma-4, ministral-*-reasoning) emit chain-of-thought into
    a separate reasoning_content field; the real answer is in content. A generous
    max_tokens lets them finish thinking AND answer. Alternatively pass
    reasoning_effort='none' to disable thinking entirely (much faster, no truncation).
    """
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url(img_path)}},
            ],
        }],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }
    if reasoning_effort is not None:
        payload["reasoning_effort"] = reasoning_effort
    t0 = time.time()
    try:
        r = requests.post(base_url, json=payload, timeout=timeout, headers=headers)
        dt = time.time() - t0
        if r.status_code != 200:
            return f"ERROR: HTTP {r.status_code}: {r.text[:200]}", dt, {}
        j = r.json()
        ch = j["choices"][0]
        u = j.get("usage", {}) or {}
        det = u.get("completion_tokens_details", {}) or {}
        usage = {"prompt": u.get("prompt_tokens"), "completion": u.get("completion_tokens"),
                 "total": u.get("total_tokens"), "reasoning": det.get("reasoning_tokens"),
                 "cost": u.get("cost")}  # cost present only on some providers (OpenRouter)
        text = (ch["message"].get("content") or "").strip()
        if not text:
            if ch.get("finish_reason") == "length":
                return "ERROR: truncated (ran out of tokens while reasoning)", dt, usage
            return "ERROR: empty response", dt, usage
        return text, dt, usage
    except Exception as e:
        return f"ERROR: {e}", time.time() - t0, {}


def ask(model: str, prompt: str, img_path: str, timeout: int,
        max_tokens: int = 1024, reasoning_effort: str | None = None) -> tuple[str, float]:
    """Backward-compatible wrapper that drops the usage dict."""
    text, dt, _ = ask_full(model, prompt, img_path, timeout, max_tokens, reasoning_effort)
    return text, dt


def unload_all():
    """Free VRAM/RAM before loading the next model (best effort)."""
    subprocess.run(["lms", "unload", "--all"], capture_output=True)


def probe(models, prompt, timeout):
    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    test_img = imgs[0]
    print(f"Probing {len(models)} models with: {os.path.basename(test_img)}\n")
    for m in models:
        unload_all()
        out, dt = ask(m, prompt, test_img, timeout)
        ok = not out.startswith("ERROR:")
        tag = "VISION OK " if ok else "NO/ERROR  "
        preview = out.replace("\n", " ")[:90]
        print(f"  [{tag}] {m:42s} {dt:5.1f}s  {preview}")


def run(models, prompt, timeout):
    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{len(imgs)} images x {len(models)} models\n")

    results = {os.path.basename(p): {} for p in imgs}

    def flush():
        with open(os.path.join(HERE, "results.json"), "w") as f:
            json.dump({"prompt": prompt, "models": models, "results": results},
                      f, indent=2)
        build_report(models, prompt, results)

    for m in models:
        print(f"=== {m} ===")
        unload_all()
        for p in imgs:
            name = os.path.basename(p)
            out, dt = ask(m, prompt, p, timeout)
            results[name][m] = {"text": out, "seconds": round(dt, 1)}
            preview = out.replace("\n", " ")[:70]
            print(f"  {name:34s} {dt:5.1f}s  {preview}")
            flush()  # incremental: report.html updates after every image
        print()

    print("Wrote results.json and report.html")


def build_report(models, prompt, results):
    rows = []
    for name, by_model in results.items():
        cells = []
        for m in models:
            r = by_model.get(m, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            secs = r.get("seconds", "")
            err = txt.startswith("ERROR:")
            cells.append(
                f'<td class="{ "err" if err else "" }">{txt}'
                f'<div class="meta">{secs}s</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}"><div class="meta">{name}</div></td>'
            + "".join(cells) + "</tr>"
        )
    head = "".join(f"<th>{m}</th>" for m in models)
    html = f"""<!doctype html><meta charset="utf-8">
<title>Alt text comparison</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 .prompt{{background:#f4f4f5;padding:10px 14px;border-radius:8px;margin-bottom:16px}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;position:sticky;top:0}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:320px}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Alt text comparison</h1>
<div class="prompt"><b>Prompt:</b> {prompt}</div>
<table><tr><th>Image</th>{head}</tr>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report.html"), "w") as f:
        f.write(html)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--models", nargs="+", default=DEFAULT_MODELS)
    ap.add_argument("--prompt", default=DEFAULT_PROMPT)
    ap.add_argument("--timeout", type=int, default=300)
    ap.add_argument("--probe", action="store_true",
                    help="just test which models accept images (1 image)")
    args = ap.parse_args()
    if args.probe:
        probe(args.models, args.prompt, args.timeout)
    else:
        run(args.models, args.prompt, args.timeout)


if __name__ == "__main__":
    main()
alt_prompts.py · sweep prompt styles on a single model
#!/usr/bin/env python3
"""Compare alt-text PROMPT styles on a single model (default: ministral-3-3b).

Runs each candidate prompt over every image in ./images and builds
report_prompts.html with one column per prompt, plus character counts
(good web alt text is generally <= 125 chars).

    python3 alt_prompts.py
    python3 alt_prompts.py --model qwen/qwen3-vl-30b
"""
import argparse
import glob
import json
import os
import re

from gen_alt_text import ask, unload_all, HERE, IMG_DIR

MODEL = "mistralai/ministral-3-3b"

# Candidate prompt styles to compare. Tweak freely.
PROMPTS = {
    "detailed": (
        "Write descriptive alt text for this image for a software company's "
        "website. One sentence describing the key visual content. Do not start "
        "with 'image of' or 'screenshot of'. Output only the alt text."
    ),
    "web_125": (
        "Write production-ready alt text for this image as it would appear on a "
        "SaaS marketing website. Requirements: a single line UNDER 125 characters; "
        "describe the most important visual content and any prominent UI or text; "
        "plain, specific language; do NOT start with 'image of', 'photo of', "
        "'screenshot of', or 'illustration of'; no quotation marks. "
        "Output only the alt text."
    ),
    "short_12w": (
        "Write very short alt text for this image: at most 12 words, one line, "
        "no ending punctuation, no quotes, do not start with 'image of' or "
        "'screenshot of'. Output only the alt text."
    ),
}


def clean_alt(text: str) -> str:
    """Strip the junk models add around the actual alt text."""
    if text.startswith("ERROR:"):
        return text
    t = text.strip()
    # Strip model-specific special tokens, e.g. GLM's <|begin_of_box|> wrappers.
    t = re.sub(r'<\|[^|]*\|>', '', t)
    # Drop a leading label like "Alt text:" / "Alt-text -" etc.
    t = re.sub(r'^\s*alt[\s_-]*text\s*[:\-–]\s*', '', t, flags=re.I)
    # Remove wrapping single/double/smart quotes.
    t = t.strip().strip('"“”‘’\'').strip()
    # Collapse internal whitespace/newlines to single spaces.
    t = re.sub(r'\s+', ' ', t)
    return t


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--model", default=MODEL)
    ap.add_argument("--timeout", type=int, default=300)
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"Model: {args.model}\n{len(imgs)} images x {len(PROMPTS)} prompts\n")

    unload_all()  # load the one model once
    results = {os.path.basename(p): {} for p in imgs}

    def flush():
        with open(os.path.join(HERE, "results_prompts.json"), "w") as f:
            json.dump({"model": args.model, "prompts": PROMPTS,
                       "results": results}, f, indent=2)
        build_report(args.model, results)

    for key, prompt in PROMPTS.items():
        print(f"=== prompt: {key} ===")
        for p in imgs:
            name = os.path.basename(p)
            raw, dt = ask(args.model, prompt, p, args.timeout)
            txt = clean_alt(raw)
            results[name][key] = {"text": txt, "chars": len(txt),
                                  "seconds": round(dt, 1)}
            print(f"  {name[:28]:28s} {len(txt):3d}c {dt:4.1f}s  {txt[:60]}")
            flush()
        print()
    print("Wrote results_prompts.json and report_prompts.html")


def build_report(model, results):
    keys = list(PROMPTS.keys())
    rows = []
    for name, by in results.items():
        cells = []
        for k in keys:
            r = by.get(k, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            chars = r.get("chars", "")
            over = isinstance(chars, int) and chars > 125
            err = txt.startswith("ERROR:")
            cls = "err" if err else ("over" if over else "")
            cells.append(
                f'<td class="{cls}">{txt}'
                f'<div class="meta">{chars} chars · {r.get("seconds","")}s</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name}</div></td>' + "".join(cells) + "</tr>"
        )
    head = "".join(
        f'<th>{k}<div class="meta">{PROMPTS[k][:60]}…</div></th>' for k in keys
    )
    html = f"""<!doctype html><meta charset="utf-8">
<title>Alt-text prompt comparison — {model}</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 h1 .meta{{font-weight:400}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;font-size:13px}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:300px}}
 .meta{{color:#888;font-size:12px;margin-top:6px;font-weight:400}}
 .over{{background:#fff8e6}} .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Alt-text prompt comparison <span class="meta">— {model}</span></h1>
<p class="meta">Amber = over 125 characters.</p>
<table><tr><th>Image</th>{head}</tr>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_prompts.html"), "w") as f:
        f.write(html)


if __name__ == "__main__":
    main()
make_alt_text.py · production run with boilerplate cleanup and the shorten retry
#!/usr/bin/env python3
"""Produce production-ready alt text for every image in ./images.

Model: ministral-3-3b (LM Studio). Balanced style, hard <=125 char cap with a
text-only 'shorten' retry for any that spill over.

Outputs: alt_text.json, alt_text.csv, report_final.html

    python3 make_alt_text.py
"""
import argparse
import csv
import glob
import json
import os

import requests

from gen_alt_text import ask, data_url, unload_all, HERE, IMG_DIR, API
from alt_prompts import clean_alt

MODEL = "mistralai/ministral-3-3b"
LIMIT = 125

PROMPT = (
    "Write production-ready alt text for this image for a SaaS company's website.\n"
    "Rules:\n"
    "- ONE line, MAXIMUM 125 characters.\n"
    "- Describe only what is actually visible. Do NOT guess locations, settings, "
    "people's names, or brands that are not clearly shown.\n"
    "- For a photo of a person, describe them plainly by apparent gender and clearly "
    "visible features, e.g. 'Woman with long dark hair in a white turtleneck, smiling'. "
    "Do not use vague labels like 'business professional' or 'person' when gender is "
    "clearly visible.\n"
    "- For a product illustration, diagram, or UI mockup, describe what the graphic "
    "DEPICTS (the product, tool, or workflow). Treat any sample or placeholder text "
    "shown inside a mockup as example content, NOT the subject — do not make a "
    "placeholder word the main topic.\n"
    "- If readable product names or UI labels are clearly visible (e.g. Supabase, "
    "Notion, Webflow), you may include them.\n"
    "- Plain, specific language. Avoid subjective adjectives like sleek, futuristic, "
    "cheerful, professional, or modern.\n"
    "- Do NOT start with 'image of', 'photo of', 'screenshot of', or 'illustration of'.\n"
    "- No quotation marks. Output only the alt text."
)


def shrink(text: str, timeout: int, model: str = MODEL) -> str:
    """Text-only pass: rewrite an over-long alt text to fit under the limit.

    max_tokens is generous so reasoning models can think and still answer; if the
    rewrite comes back empty/errored, we keep the original rather than lose it.
    """
    prompt = (
        f"Rewrite this website alt text to be UNDER {LIMIT} characters while keeping "
        "the most important information. One line, plain language, no quotes. "
        f"Output only the rewritten alt text:\n\n{text}"
    )
    payload = {"model": model, "temperature": 0.2, "max_tokens": 512,
               "messages": [{"role": "user", "content": prompt}]}
    try:
        r = requests.post(API, json=payload, timeout=timeout)
        if r.status_code == 200:
            cand = clean_alt(r.json()["choices"][0]["message"].get("content") or "")
            if cand and not cand.startswith("ERROR:"):
                return cand
    except Exception:
        pass
    return text


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--timeout", type=int, default=300)
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{MODEL}: writing alt text for {len(imgs)} images (<= {LIMIT} chars)\n")
    unload_all()

    out = {}
    for p in imgs:
        name = os.path.basename(p)
        raw, dt = ask(MODEL, PROMPT, p, args.timeout)
        alt = clean_alt(raw)
        tries = 0
        while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
            alt = shrink(alt, args.timeout)
            tries += 1
        flag = "" if len(alt) <= LIMIT else " (still over)"
        out[name] = {"alt": alt, "chars": len(alt)}
        print(f"  {name[:30]:30s} {len(alt):3d}c {dt:4.1f}s{flag}  {alt}")

    # JSON
    with open(os.path.join(HERE, "alt_text.json"), "w") as f:
        json.dump(out, f, indent=2)
    # CSV
    with open(os.path.join(HERE, "alt_text.csv"), "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["filename", "alt_text", "chars"])
        for name, r in out.items():
            w.writerow([name, r["alt"], r["chars"]])
    # HTML
    rows = []
    for name, r in out.items():
        alt = r["alt"].replace("<", "&lt;").replace(">", "&gt;")
        over = r["chars"] > LIMIT
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name}</div></td>'
            f'<td class="{ "over" if over else "" }">{alt}'
            f'<div class="meta">{r["chars"]} chars</div></td></tr>'
        )
    html = f"""<!doctype html><meta charset="utf-8">
<title>Final alt text — {MODEL}</title>
<style>
 body{{font:15px/1.55 -apple-system,sans-serif;margin:24px;color:#111}}
 table{{border-collapse:collapse;width:100%;max-width:1000px}}
 td{{border:1px solid #ddd;padding:12px;vertical-align:top}}
 td.img{{width:220px}} td.img img{{max-width:200px;max-height:150px;display:block}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .over{{background:#fff8e6}}
</style>
<h1>Final alt text <span class="meta">— {MODEL}, &le;{LIMIT} chars</span></h1>
<table>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_final.html"), "w") as f:
        f.write(html)
    over = sum(1 for r in out.values() if r["chars"] > LIMIT)
    print(f"\nWrote alt_text.json, alt_text.csv, report_final.html "
          f"({len(out)} images, {over} over {LIMIT} chars)")


if __name__ == "__main__":
    main()
compare_models.py · the model-comparison harness (the main study)
#!/usr/bin/env python3
"""Compare models head-to-head on the SAME production alt-text prompt.

Same PROMPT, clean-up, and <=125 char cap (with shorten retry) as make_alt_text,
so differences are purely the model. Builds a side-by-side report.

    python3 compare_models.py                      # ministral vs gemma-4-12b
    python3 compare_models.py --models mistralai/ministral-3-3b google/gemma-4-26b-a4b qwen/qwen3-vl-30b
"""
import argparse
import glob
import json
import os

from gen_alt_text import ask_full, unload_all, HERE, IMG_DIR
from alt_prompts import clean_alt
from make_alt_text import PROMPT, LIMIT, shrink

DEFAULT_MODELS = ["mistralai/ministral-3-3b", "google/gemma-4-12b-qat"]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--models", nargs="+", default=DEFAULT_MODELS)
    ap.add_argument("--timeout", type=int, default=400)
    ap.add_argument("--max-tokens", type=int, default=1024,
                    help="reasoning models (Gemma-4) need ~4000 on complex images")
    ap.add_argument("--reasoning-effort", default=None,
                    help="set to 'none' to disable thinking on reasoning models (fast)")
    ap.add_argument("--prompt-suffix", default="",
                    help="appended to the prompt, e.g. ' /nothink' to disable GLM thinking")
    ap.add_argument("--only", nargs="+", default=None,
                    help="only images whose filename contains one of these substrings")
    ap.add_argument("--merge", action="store_true",
                    help="add these models as columns to existing results_compare.json "
                         "instead of overwriting (lets you add one model later)")
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if args.only:
        imgs = [p for p in imgs
                if any(s in os.path.basename(p) for s in args.only)]
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{len(imgs)} images x {len(args.models)} models (<= {LIMIT} chars)\n")

    path = os.path.join(HERE, "results_compare.json")
    existing = json.load(open(path)) if (args.merge and os.path.exists(path)) else {}
    ex_res, ex_models = existing.get("results", {}), existing.get("models", [])
    # Report shows existing columns first, then any newly-added models.
    report_models = ex_models + [m for m in args.models if m not in ex_models]
    results = {os.path.basename(p): dict(ex_res.get(os.path.basename(p), {}))
               for p in imgs}

    def flush():
        with open(path, "w") as f:
            json.dump({"prompt": PROMPT, "models": report_models,
                       "results": results}, f, indent=2)
        build_report(report_models, results)

    for m in args.models:
        print(f"=== {m} ===")
        unload_all()
        for p in imgs:
            name = os.path.basename(p)
            raw, dt, usage = ask_full(m, PROMPT + args.prompt_suffix, p, args.timeout,
                                      max_tokens=args.max_tokens,
                                      reasoning_effort=args.reasoning_effort)
            alt = clean_alt(raw)
            tries = 0
            while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
                alt = shrink(alt, args.timeout, model=m)
                tries += 1
            results[name][m] = {"text": alt, "chars": len(alt), "seconds": round(dt, 1),
                                "ptok": usage.get("prompt"), "ctok": usage.get("completion"),
                                "rtok": usage.get("reasoning")}
            pt, ctk = usage.get("prompt"), usage.get("completion")
            print(f"  {name[:24]:24s} {len(alt):3d}c {dt:5.1f}s {str(pt):>4}+{str(ctk):<4}tok  {alt[:42]}")
            flush()
        print()
    print("Wrote results_compare.json and report_compare.html")


def build_report(models, results):
    # Per-model summary (averages over completed cells).
    summary_cells = []
    for m in models:
        cells = [by[m] for by in results.values() if m in by]
        ok = [c for c in cells if not (c.get("text") or "").startswith("ERROR:")]
        errs = len(cells) - len(ok)
        over = sum(1 for c in ok if c.get("chars", 0) > LIMIT)
        avg_c = round(sum(c.get("chars", 0) for c in ok) / len(ok)) if ok else 0
        avg_s = round(sum(c.get("seconds", 0) for c in ok) / len(ok), 1) if ok else 0
        pt = [c["ptok"] for c in ok if c.get("ptok")]
        ct = [c["ctok"] for c in ok if c.get("ctok")]
        avg_pt = round(sum(pt) / len(pt)) if pt else 0
        avg_ct = round(sum(ct) / len(ct)) if ct else 0
        summary_cells.append(
            f"<td><b>{avg_c}</b> avg chars · <b>{avg_s}s</b> avg<br>"
            f"<span class='meta'>{avg_pt} in + {avg_ct} out tok/img · "
            f"{over} over {LIMIT}c · {errs} errors</span></td>"
        )
    summary = (f"<tr><th>summary</th>{''.join(summary_cells)}</tr>")

    rows = []
    for name, by in results.items():
        cells = []
        for m in models:
            r = by.get(m, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            chars = r.get("chars", "")
            over = isinstance(chars, int) and chars > LIMIT
            err = txt.startswith("ERROR:")
            cls = "err" if err else ("over" if over else "")
            tok = ""
            if r.get("ptok") or r.get("ctok"):
                tok = f' · {r.get("ptok","?")}+{r.get("ctok","?")} tok'
            cells.append(
                f'<td class="{cls}">{txt}'
                f'<div class="meta">{chars} chars · {r.get("seconds","")}s{tok}</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name[:24]}</div></td>' + "".join(cells) + "</tr>"
        )
    head = "".join(f"<th>{m}</th>" for m in models)
    html = f"""<!doctype html><meta charset="utf-8">
<title>Model comparison — alt text</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;position:sticky;top:0}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:330px}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .over{{background:#fff8e6}} .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Model comparison — production alt-text prompt</h1>
<p class="meta">Amber = over {LIMIT} characters. Gemma-4 run with thinking disabled (reasoning_effort=none).</p>
<table><tr><th>Image</th>{head}</tr>{summary}{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_compare.html"), "w") as f:
        f.write(html)


if __name__ == "__main__":
    main()
size_experiment.py · sweep input resolution for a fixed model
#!/usr/bin/env python3
"""Variable: IMAGE SIZE. Sweep input resolution for a fixed model + prompt.

Sends the same text-heavy UI images at several max-side resolutions and records
the alt text, latency, and payload size — to find the resolution floor below
which the model can no longer read on-screen labels.

    python3 size_experiment.py
"""
import base64
import io
import json
import os
import time

import requests
from PIL import Image

from gen_alt_text import unload_all, HERE, IMG_DIR, API
from make_alt_text import PROMPT

MODEL = "qwen/qwen3-vl-30b"
SIZES = [256, 384, 512, 768, 1024, 1536]
# Text-heavy UI images (large originals) where legibility should matter most.
IMAGES = ["6f9ab5e5", "bddc34c3", "a86c45b5"]


def encode_at(path: str, size: int):
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    im.thumbnail((size, size))  # only downscales, never upscales
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    raw = buf.getvalue()
    return ("data:image/jpeg;base64," + base64.b64encode(raw).decode(),
            f"{im.size[0]}x{im.size[1]}", len(raw))


def ask_data_url(data_url: str, timeout: int = 120):
    payload = {
        "model": MODEL, "temperature": 0.2, "max_tokens": 300,
        "reasoning_effort": "none",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": data_url}}]}],
    }
    t0 = time.time()
    r = requests.post(API, json=payload, timeout=timeout)
    dt = time.time() - t0
    if r.status_code != 200:
        return f"ERROR: HTTP {r.status_code}", dt
    return (r.json()["choices"][0]["message"].get("content") or "").strip(), dt


def main():
    import glob
    paths = []
    for sub in IMAGES:
        hits = glob.glob(os.path.join(IMG_DIR, f"*{sub}*")) or \
               [p for p in glob.glob(os.path.join(IMG_DIR, "*")) if sub in os.path.basename(p)]
        if hits:
            paths.append(hits[0])
    print(f"{MODEL}: {len(paths)} images x {len(SIZES)} sizes\n")
    unload_all()

    results = {}
    for p in paths:
        name = os.path.basename(p)
        results[name] = {}
        print(f"=== {name} ===")
        for s in SIZES:
            data_url, px, kb = encode_at(p, s)
            text, dt = ask_data_url(data_url)
            results[name][s] = {"text": text, "chars": len(text),
                                "seconds": round(dt, 1), "px": px,
                                "kb": round(kb / 1024, 1)}
            print(f"  size {s:4d} -> {px:>9} {kb/1024:5.1f}KB {dt:4.1f}s {len(text):3d}c  {text[:55]}")
            with open(os.path.join(HERE, "results_size.json"), "w") as f:
                json.dump({"model": MODEL, "prompt": PROMPT, "sizes": SIZES,
                           "results": results}, f, indent=2)
        print()
    print("Wrote results_size.json")


if __name__ == "__main__":
    main()
validate.py · the held-out repeatability run
#!/usr/bin/env python3
"""Repeatability / validation pass: run the study's BEST combo on a held-out set.

  Best combo  = Qwen3-VL-30B  +  tightened <=125 prompt  +  768px  +  shorten-pass
  Held-out set = ./images_val (20 images never seen during the study)

Writes results_validation.json. Reports any issues (errors, over-limit, slow).
"""
import glob
import json
import os

import gen_alt_text
gen_alt_text.MAX_SIDE = 768  # the validated resolution sweet spot

from gen_alt_text import ask_full, unload_all, HERE  # noqa: E402
from alt_prompts import clean_alt  # noqa: E402
from make_alt_text import PROMPT, LIMIT, shrink  # noqa: E402

MODEL = "qwen/qwen3-vl-30b"
VAL_DIR = os.path.join(HERE, "images_val")


def main():
    imgs = sorted(glob.glob(os.path.join(VAL_DIR, "*")))
    if not imgs:
        print("No images in ./images_val — run fetch_openverse.py --set val")
        return
    print(f"Validation: {MODEL} @768px on {len(imgs)} held-out images\n")
    unload_all()

    out = {}
    for p in imgs:
        name = os.path.basename(p)
        raw, dt, usage = ask_full(MODEL, PROMPT, p, timeout=300, max_tokens=300)
        alt = clean_alt(raw)
        tries = 0
        while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
            alt = shrink(alt, 300, model=MODEL)
            tries += 1
        out[name] = {"alt": alt, "chars": len(alt), "seconds": round(dt, 1),
                     "ptok": usage.get("prompt"), "ctok": usage.get("completion")}
        flag = ""
        if alt.startswith("ERROR:"):
            flag = " <<ERROR"
        elif len(alt) > LIMIT:
            flag = " <<OVER"
        print(f"  {name[:30]:30s} {len(alt):3d}c {dt:4.1f}s {usage.get('prompt')}tok  {alt[:46]}{flag}")

    with open(os.path.join(HERE, "results_validation.json"), "w") as f:
        json.dump({"model": MODEL, "prompt": PROMPT, "max_side": gen_alt_text.MAX_SIDE,
                   "results": out}, f, indent=2)

    errs = [n for n, r in out.items() if r["alt"].startswith("ERROR:")]
    over = [n for n, r in out.items() if not r["alt"].startswith("ERROR:") and r["chars"] > LIMIT]
    print(f"\n{len(out)} images | errors: {len(errs)} | over {LIMIT}c: {len(over)}")
    if errs:
        print("  errors:", errs)
    if over:
        print("  over-limit:", over)
    print("Wrote results_validation.json")


if __name__ == "__main__":
    main()

All inference for this report ran on-device through LM Studio. No image or prompt left the machine, and no cloud API was called. Our product images are © whalesync.com, used here for evaluation.

See it run on your own content.

Curtis runs these calls himself. Thirty minutes, no pitch, no slides. He connects your platforms live and shows you your content as an editable, reviewable diff. Bring anything sticky: a refresh, a migration, or a rebrand.

See it run on your content → or download it free

cookies

strictly necessary
required for the site to work. always on.

analytics
google analytics & posthog — anonymous usage, so we can improve the site.