Scratch • Local Vision Models for Alt Text, Tested: Six Models, Three Variables, One Recipe (2026)

Most writing about AI alt text stops at "use a vision model." That skips the questions that decide whether the output is usable: which model, prompted how, fed an image at what size. So we ran the test. Six vision models, all of them on a laptop with no cloud API, across 26 images, with the model held constant while we varied the prompt, and the prompt held constant while we varied the resolution. Then we re-ran the winning combination on 20 images it had never seen, to check that the recipe holds.

Every number below comes from a local run you can reproduce with the commands in section 6. Nothing went to a hosted API, so there is no per-call bill and no rate limit hiding in the results. The token counts are still here, because if you would prefer to call a frontier model through an API, the input-token count is the number that sets your cost.

A note on house style before we start: this report was researched and written by the team at Scratch, the desktop app we built for editing a whole catalog of content at once. Section 9 is where alt text at that scale connects to what we sell, labeled as such. Everything before it is field notes.

What is inside

Key findings
Setup and method
Variable one: the model
Variable two: the prompt
Variable three: image resolution
The recipe we would ship
Does it reproduce? A held-out run
Limitations
If you are doing this across a whole catalog
Image sources and licenses
Appendix: the full pipeline source

TL;DR, just give me the prompt. Send each image to a vision model with the prompt below, then enforce a 125-character cap. If the model goes over, re-ask it to "shorten to under 125 characters." Downscale images to about 768 px on the long edge first, because bigger is slower with no quality gain. The best local model in our tests was Qwen3-VL 30B through LM Studio. Any current frontier vision model works through an API.
Write production-ready alt text for this image for a SaaS company's website.
Rules:
- ONE line, MAXIMUM 125 characters.
- Describe only what is actually visible. Do NOT guess locations, settings, people's names, or brands that are not clearly shown.
- For a photo of a person, describe them plainly by apparent gender and clearly visible features, e.g. 'Woman with long dark hair in a white turtleneck, smiling'. Do not use vague labels like 'business professional' or 'person' when gender is clearly visible.
- For a product illustration, diagram, or UI mockup, describe what the graphic DEPICTS (the product, tool, or workflow). Treat any sample or placeholder text shown inside a mockup as example content, NOT the subject. Do not make a placeholder word the main topic.
- If readable product names or UI labels are clearly visible (e.g. Supabase, Notion, Webflow), you may include them.
- Plain, specific language. Avoid subjective adjectives like sleek, futuristic, cheerful, professional, or modern.
- Do NOT start with 'image of', 'photo of', 'screenshot of', or 'illustration of'.
- No quotation marks. Output only the alt text.
This exact prompt produced 0 errors and 0 over-limit results on a 20-image held-out set (section 7). Swap "a SaaS company's website" for your own context. The rest of this report is how we arrived at it.

1. Key findings

Best overall: Qwen3-VL 30B. It reads on-screen labels, captures the subject of busy composite graphics, stays concise (2 of 26 outputs over the 125-character target), and runs at about 6.5 seconds per image once it is loaded.
Turn off "thinking" on reasoning models, and know that the switch differs by vendor. Four of the six models reason by default, spending their whole token budget thinking and returning nothing usable. Disabling it makes them roughly 30 times faster with no quality loss. But the mechanism is not uniform: Gemma 4 and Qwen3.6 honor reasoning_effort: "none", while GLM-4.6V ignores that field (and chat_template_kwargs) and only stops when you append a /nothink token to the prompt. Guess wrong and you get truncated output and a large hidden output-token bill.
Small models anchor on placeholder text. The 3B model described a product illustration as a tool for "automotive SEO optimization" because the demo mockup contained the word "Automotive." The larger models read that same word as sample content, not the subject.
The prompt controls length and framing more than the model does. A free-form "describe this" prompt runs well past 125 characters. An explicit cap plus a one-line "shorten if over 125" pass reliably produces standards-compliant alt text.
Cost is set by the image, and smaller models are not cheaper. Each image is a few hundred input tokens against only 20 to 40 output tokens, so cost is dominated by how a model tokenizes the picture. The 3B model used the most input tokens (about 1,560 per image) and the 30B Qwen the fewest (a flat 270), roughly six times fewer, despite being about ten times larger. Parameter count tells you nothing about API cost. The vision encoder does.
For speed, architecture beats size. Qwen3-VL 30B (a mixture-of-experts model) ran at about 6.5 seconds per image, while the similarly sized but dense Qwen3.6 27B took about 27 seconds, four times slower, because every parameter fires on every token. For local throughput, prefer mixture-of-experts vision models.
Resolution buys latency, not much quality. Qwen3-VL 30B read prominent UI labels even at 256 px. Going from 256 to 1536 px was about three to four times slower and roughly 17 times the payload for no reliable quality gain. Downscaling dense graphics to about 512 to 768 px is close to free.
The recipe reproduces. Re-running the winning combination on 20 held-out images it had never seen produced 0 errors, 0 over-limit results, and the same flat 270 tokens per image.

2. Setup and method

Every model runs on-device through LM Studio, which exposes an OpenAI-compatible endpoint at http://localhost:1234/v1. Each request sends one image, as a base64 data URL, plus a text prompt to /v1/chat/completions:

def ask(model, prompt, img_path, max_tokens=300, reasoning_effort=None):
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": data_url(img_path)}},
            ],
        }],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }
    if reasoning_effort is not None:        # "none" disables thinking on Gemma 4 / Qwen3.6
        payload["reasoning_effort"] = reasoning_effort
    r = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
    ...

Every image is normalized before sending. We decode it (WebP and AVIF included), convert to RGB, downscale the long edge, and re-encode as JPEG, so format and size are controlled, not left to whatever the page served:

def data_url(path, max_side=768):
    """Normalize any format (webp/avif/png) to a downscaled JPEG data URL."""
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    if max(im.size) > max_side:
        im.thumbnail((max_side, max_side))
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

The image set

The main study uses 26 images. Twelve are ours, scraped from our own marketing site (whalesync.com): headshots, clean UI screenshots, stylized product illustrations, a cartoon mascot, and busy marketing composites. Fourteen are openly licensed (CC0, public domain, or CC-BY), pulled from Openverse across deliberately different scenarios: portrait photos, landscapes, food, charts, flow diagrams, scanned documents, and artwork. Mixing our own product imagery with a generic open set keeps the findings from overfitting to one site's house style. Full attribution for every open image, including the 20 used later for validation, is in section 10.

A few words on how to read the tables:

Chars is the character count of the alt text. The target is 125 or fewer, a common screen-reader and WCAG rule of thumb.
Latency is wall-clock seconds on the test Mac (an Apple M1 Pro, 10 cores, 32 GB RAM). The first request to each model includes a one-time model load, so we report both median and max.
Latency is hardware-specific. Treat it as relative across models, not as an absolute you will see on your own machine.

3. Variable one: the model

Same production prompt, same 26 images, across six local vision models spanning four families (Mistral, Google Gemma, Qwen, and Zhipu GLM), with two entries from the Qwen line, and ranging from 3B to 30B. Reasoning models ran with thinking disabled through their respective switch. Alongside quality and speed we report token usage, because for anyone weighing a hosted API, the input-token count, which includes the image, is the real cost driver.

Model	Avg chars	Median latency	Max latency*	Input tok	Output tok	Est $/1k imgs†	Over 125c
Ministral 3B	85	4.8 s	7.3 s	1562	27	$0.33	0
Gemma 4 12B	108	5.0 s	9.6 s	489	24	$0.11	3
Gemma 4 26B	102	6.1 s	22.0 s	489	21	$0.11	0
Qwen3-VL 30B (MoE)	99	6.5 s	15.4 s	270	27	$0.07	2
Qwen3.6 27B (dense)	120	27.1 s	36.0 s	282	24	$0.07	12
GLM-4.6V flash	105	14.0 s	16.7 s	268	27	$0.07	5

*Max latency is roughly the cold model-load on the first image. †Illustrative only: input and output tokens priced at $0.20 and $0.60 per million, a generic open-vision-model API rate. These models ran locally at no marginal cost. Use the column to compare models, not as a real invoice.

What we saw, model by model:

Ministral 3B. Fast and concise, but the weakest reader and, surprisingly, the most token-expensive at about 1,560 input tokens per image. It paraphrases generically, anchors on demo text, and sometimes misses entirely. An oil portrait of a uniformed man came back as "abstract shapes and colors."
Gemma 4 12B. A strong reader of UI labels ("Add Table Mapping") and scenes, but it runs long: 3 of 26 outputs over 125 characters, one at 167.
Gemma 4 26B. Same vision tokenizer as the 12B (489 input tokens per image) but more disciplined: 0 over the limit against the 12B's 3, and a tighter average (102 against 108 characters). It also catches the occasional label the 12B skips. On the CMS illustration below it read the "ACME" brand header in the mockup, which the 12B reduced to "a website landing page." Within this family, the bigger model was the better-behaved one.
Qwen3-VL 30B. The best balance: accurate across photos, charts, documents, and art, concise (2 of 26 slightly over), 0 errors, and the cheapest by far at a flat 270 input tokens per image regardless of content. Its one notable slip was calling a website mockup a "mobile app."
Qwen3.6 27B. A capable reader but the wrong tool here. As a dense 27B it ran about 27 seconds per image, four times slower than the same-sized Qwen3-VL mixture-of-experts, and it was the most verbose model in the set at 12 of 26 over the limit. Cheap tokens, expensive wall-clock.
GLM-4.6V flash. The other-family pick (Zhipu, 9B). Once thinking is off through /nothink it is accurate and cheap (268 input and 27 output tokens, 0 errors), reading labels like "CMS pSEO" and "Currently...". The trade-offs are about 14 seconds per image (slow for a 9B) and a little extra length. Before the /nothink fix it reasoned by default, truncated on 10 of 26 images as it ran out of budget mid-thought, and spent roughly 20 times the output tokens. It is the clearest cautionary tale in the set.

Three images that drove those notes:

Product illustration: a website mockup carrying sample 'Automotive' marketing copy beside a CMS dashboard grid

The CMS illustration. The mockup's sample copy mentions the automotive industry. Ministral 3B made that the subject ("a content management system interface for automotive SEO optimization"). Qwen3-VL and both Gemmas read it as on-screen sample text and described the actual product, a CMS dashboard. This single image is the case for the placeholder rule in section 4.

Oil portrait of a man in a grey military uniform and cap, seated with his hands folded

The oil portrait. Both Gemmas and Qwen described the uniform, the cap, and the pose. Ministral 3B returned "Artistic painting of abstract shapes and colors," a clean miss on a figurative portrait.

Network chart linking Stein Bagger and IT Factory to other companies and people with red lines

The network chart. Every model above the 3B read the proper nouns ("Stein Bagger," "IT Factory") straight off a busy news graphic. Reading printed names in a chart is a solved problem for these models. Inventing a subject from a placeholder is not.

4. Variable two: the prompt

Same model (Ministral 3B, the weakest reader, which makes the prompt's effect easy to see) and the same 12 owned images, across three prompt styles. Prompt choice swings output length far more than model choice does.

Prompt style	What it asks for	Avg chars
`detailed`	Free-form: one descriptive sentence, no length cap.	151
`web_125`	Web cap: a single line under 125 chars, name prominent UI or text.	112
`short_12w`	Terse: 12 words max, one line.	51

The free-form prompt averages well past the 125-character target. An explicit cap pulls it into range. Three representative images, all three styles:

Prompt style	Output	Chars
`detailed`	A professional portrait of an engaged individual smiling warmly against a modern, abstract background featuring soft geometric shapes and light tones.	150
`web_125`	Professional woman smiling against modern office background with abstract geometric lighting.	93
`short_12w`	Woman smiling indoors wearing light top	39

Illustration of a Supabase table editor syncing to a Notion database, with arrows showing data flow

Prompt style	Output	Chars
`detailed`	A sleek database interface displaying real-time data synchronization between Supabase and Notion, showcasing seamless integration capabilities.	143
`web_125`	Real-time database integration displays Supabase table editor syncing with Notion's contact list.	97
`short_12w`	Database table syncing in Notion via Supabase real-time integration	67

Product UI listing several syncs with status labels such as Syncing, Active, and Disabled

Prompt style	Output	Chars
`detailed`	A dashboard showing active and syncing integrations for database management, marketing site optimization, lead enrichment, reporting tools, and internal intranet within a software system.	187
`web_125`	SaaS sync dashboard showing active database integrations with status indicators for marketing site SEO, lead enrichment, conference reporting, and intranet systems	163
`short_12w`	Database sync status dashboard showing active and paused integrations	69

That last row is the catch: even the capped web_125 prompt overshoots on a dense dashboard (163 characters). Length instructions move the average but do not guarantee the ceiling. That is why the production recipe adds a deterministic second pass.

The production prompt does two more things beyond length, and both showed up as quality wins:

Describe people by apparent gender and visible features. The free-form prompt called the headshot "a professional portrait of an engaged individual." The rule turns that into "Woman with long dark hair in a white turtleneck, smiling." Specific beats polite.
Treat placeholder text in a mockup as example content, not the subject. This is the rule that kills the "Automotive" failure from section 3.

The production prompt is the one in the TL;DR. A small cleanup pass strips the boilerplate models like to add ("Alt text:", wrapping quotes, stray special tokens):

def clean_alt(text):
    """Strip the junk models add around the actual alt text."""
    t = text.strip()
    t = re.sub(r'<\|[^|]*\|>', '', t)                          # GLM's <|begin_of_box|> wrappers
    t = re.sub(r'^\s*alt[\s_-]*text\s*[:\-–]\s*', '', t, flags=re.I)  # leading "Alt text:"
    t = t.strip().strip('"“”‘’\'').strip()                      # wrapping quotes
    return re.sub(r'\s+', ' ', t)                               # collapse whitespace

And because no model reliably obeys the character cap, a one-line text-only retry rewrites anything that spills over. It is generous with tokens so a reasoning model can think and still answer, and it keeps the original if the rewrite comes back empty:

def shrink(text, model):
    prompt = (f"Rewrite this website alt text to be UNDER 125 characters while keeping "
              "the most important information. One line, plain language, no quotes. "
              f"Output only the rewritten alt text:\n\n{text}")
    payload = {"model": model, "temperature": 0.2, "max_tokens": 512,
               "messages": [{"role": "user", "content": prompt}]}
    cand = clean_alt(post(payload))
    return cand if cand and not cand.startswith("ERROR:") else text

5. Variable three: image resolution

Fixed model (Qwen3-VL 30B) and prompt. The same image goes to the model at six max-side resolutions. We chose text-heavy UI images on purpose, because reading the labels is the hard part. The encoder is the same encode_at thumbnail step from section 2, just swept across sizes:

Max side (px)	Mean payload	Median latency	Mean chars
256	8 KB	3.5 s	137
384	16 KB	3.7 s	205
512	26 KB	3.5 s	139
768	48 KB	4.6 s	111
1024	75 KB	5.9 s	154
1536	138 KB	11.9 s	148

Median latency is shown because the first request of the run (256 px, first image) included a one-time model load of 31.5 s and would otherwise distort the smallest bucket. Mean character count is not monotonic, which is the point: more pixels did not buy a better description.

The result inverts the usual instinct. Three side-by-side demos, each sent at 256 px and at 640 px:

Supabase-to-Notion sync illustration sent to the model at 256 pixels — sent at 256 px

The same Supabase-to-Notion sync illustration sent to the model at 640 pixels — sent at 640 px

At 256 px Qwen3-VL still named "Supabase," "Notion," and "Webflow." At higher resolution it traded one of those for slightly different phrasing, not for more accuracy.

App Database and User Management panel illustration sent to the model at 256 pixels — sent at 256 px

The same App Database and User Management panel illustration sent to the model at 640 pixels — sent at 640 px

The "App Database" and "User Management" panel labels read fine at 256 px. The fine print inside the panels (column names, a highlighted row) only appears at higher resolution.

Sessions and Speakers to Conference Planner illustration sent to the model at 256 pixels — sent at 256 px

The same Sessions and Speakers to Conference Planner illustration sent to the model at 640 pixels — sent at 640 px

"Sessions & Speakers" and "Conference Planner" read at 256 px. At 1536 px the model added that the source was a Notion database and the target a Google Sheet, finer detail, but the core description was already right at the smallest size.

What this means in practice:

This model is robust to downscaling. The prominent labels survive heavy compression because the illustrations use large, bold type. Only the fine print blurs out.
Latency scales hard with resolution (about 3 seconds at 256 to 512 px, up to 12 seconds at 1536 px) and payload grows about 17 times, while quality does not improve along with it. LM Studio's encoder tiles the image to a fixed budget, so most of the extra pixels are wasted compute.
Downscale dense graphics aggressively, to about 512 to 768 px, for a two-to-three-times latency win at negligible quality cost. Reserve higher resolution for genuinely small print.

6. The recipe we would ship

Putting the three variables together, the configuration we would run for this kind of image mix:

Model. Qwen3-VL 30B: best quality and cheapest tokens. If you prefer the Gemma family, use Gemma 4 26B with reasoning_effort: "none" (more disciplined than the 12B). Avoid the 3B for anything but plain photos. Through an API, any current frontier vision model works.
Prompt. The tightened production prompt from the TL;DR: apparent gender plus visible features for people, placeholder-as-example for mockups, 125 characters or fewer.
Image. Normalize to JPEG and downscale to about 768 px on the long edge.
Post-process. Strip the boilerplate, then run the one-line "shorten" retry to guarantee 125 characters or fewer.

To reproduce every number in this report:

python3 download_images.py https://whalesync.com --max 12   # our 12 owned images
python3 fetch_openverse.py                                  # 14 openly-licensed images
python3 compare_models.py --models mistralai/ministral-3-3b google/gemma-4-12b-qat \
        google/gemma-4-26b-a4b qwen/qwen3-vl-30b --reasoning-effort none --max-tokens 300
#   ...add another model later as a column:  compare_models.py --merge --models <id>
python3 alt_prompts.py            # prompt sweep (Ministral)
python3 size_experiment.py        # resolution sweep (Qwen)
python3 validate.py               # held-out repeatability run

7. Does it reproduce? A held-out run

A recipe that only works on the images you tuned it against is not a recipe. So we ran the winning configuration once, unchanged, Qwen3-VL 30B plus the tightened 125-character prompt plus 768 px plus the shorten pass, over 20 fresh openly-licensed images that were never used while developing it. The new set deliberately covers new categories: city, nature, sport, workspace, pie charts, schematics, maps, letters, still life, portrait, and vehicle.

Run	Images	Avg chars	Input tok	Output tok	Est $/1k	Over 125c	Errors
Study set, Qwen3-VL 30B	26	99	270	27	$0.07	2	0
Held-out validation	20	96	270	23	$0.07	0	0

The recipe reproduces. Twenty images, 0 errors, 0 over the 125-character limit, a flat 270 input tokens per image (identical to the study run), and about 4.8 seconds per image at steady state. Output length and cost match the study set within noise.

Every line of output from that run:

Image (category)	Alt text (Qwen3-VL 30B at 768px)	Chars
city	White Art Deco building with decorative facade and fire escapes against a blue sky.	83
city	Night scene of a city street with illuminated buildings, a red bus, and people walking.	87
letter	Handwritten text in brown ink on a white background, with a small green and blue butterfly emblem in the corner.	112
letter	Handwritten letter on aged paper with a date of Sept. 8, 1944, in the upper right corner.	89
map	Vintage map collage with double-hemisphere projection, aged paper, and decorative elements.	91
map	Circular diagram showing planets and zodiac signs in orbits around the sun, labeled in Latin.	93
nature	A forest with tall trees and a branch holding a few orange leaves, set against a misty blue background.	103
nature	A black and white photo of a winding path through a forest with a large, gnarled tree branch arching overhead.	110
pie chart	Infographic showing statistics on food waste, including global and U.S. data, causes, costs, and environmental impact.	118
pie chart	Infographic showing drug production and usage statistics in the Americas with circular charts and a map.	104
portrait	Woman with dark hair lying on her back, arms raised above her head, hands open, in a black and white photograph.	112
schematic	Diagram of a control panel with labeled sections, knobs, switches, and handwritten notes.	89
schematic	Hand-drawn diagram of a bicycle dynamo AC to 5V USB adapter with bridge rectifier, voltage regulator, and USB pins.	115
sport	Empty green field with red seats and two tall light towers under a bright sky.	78
sport	Woman with dark hair in a ponytail wearing a purple jacket and visor, holding a black jacket and a tennis racket.	113
still life	Still life painting of fruit, a glass of red liquid, and a knife on a table.	76
still life	Still life painting with grapes, a lobster, fruit, and a glass on a dark table.	79
vehicle	Two diagrams of a vintage car with French labels identifying parts like the chassis, fender, and running board.	111
workspace	Two people at a desk with laptops and papers, one writing with a red pen.	73
workspace	Top-down view of a wooden desk with a laptop, tablet, coffee cup, glasses, and a small plant.	93

The interesting failures were in the data, not the model. The keyword a query returned did not always match the picture, and Qwen described what was in the picture instead of forcing the expected label:

Empty green sports field with red seats and two floodlight towers under a bright sky — query: "soccer match"

Circular Latin-labeled diagram of planets and zodiac signs orbiting the sun — query: "antique map"

A "soccer match" query returned an empty stadium, and Qwen wrote "Empty green field with red seats and two tall light towers," not "a soccer match." An "antique map" query returned a zodiac chart, and Qwen wrote "Circular diagram showing planets and zodiac signs," not "a map." It described the schematic and the vintage-car diagram below by their actual labels too. The model did not hallucinate to match the keyword, which is exactly the behavior you want.

Hand-drawn schematic of a bicycle dynamo AC-to-USB adapter with a bridge rectifier and voltage regulator — read the components, not just "a diagram"

Two labeled diagrams of a vintage car with French part names — read the French labels

The only real friction was sourcing. One Openverse query ("man headshot studio") returned no openly-licensed results and the API rate-limited intermittently, so the first fetch yielded 18 of 20. Two supplemental queries topped it back up to 20. None of that touched the model output.

8. Limitations

The sample is 26 images in the study and 20 in validation, across two sources. This is directional, not statistically powered. The quality verdicts are our judgment against the actual images, not a blind rubric.
Latency is specific to one Apple M1 Pro and to LM Studio's current runtime. Absolute seconds will differ on other hardware.
Quantization levels differ across these prebuilt LM Studio models, which muddies a pure "model capability" comparison.
The dollar figures are illustrative. One assumed blended API price applied to measured tokens. Real providers price models differently, so treat the cost column as a comparison aid, not a quote.
The open images came from keyword searches, so their exact content is incidental, useful for variety but not a curated benchmark.
No accessibility expert reviewed the alt text. "Good" here means accurate, specific, and within length conventions.

9. If you are doing this across a whole catalog

This is the part where the research connects to what we build, labeled so you can skip it.

Running the recipe once is a script. Running it across every product image in a Shopify or Webflow catalog, every illustration in a CMS, every photo in a media library, and then reviewing each line before it ships, is the part that takes weeks by hand or gets handed to a plugin you cannot inspect. That review step is not optional: the tables above show even the best model overshoots the cap sometimes and occasionally calls a website a mobile app, so something has to catch those before they go live.

That is the job Scratch does. It pulls every record out of your platform into local files, your model writes the alt text there (not just for the first 200 rows), and you approve every change as a diff before anything publishes. The same shape applies whether the field is alt text, a meta description, or a product title. If you want the step-by-step version for one platform, we wrote up how to add alt text to every WordPress image with AI, and the Shopify SEO teardown covers where image alts sit in a larger on-page picture. This report is the kind of work we run on our own content.

10. Image sources and licenses

The 12 product images are ours (whalesync.com) and used with permission. The images below are openly licensed (CC0, public domain, or CC-BY) and were retrieved through Openverse: the 14 used in the study plus the 20 used for held-out validation, credited here so this report can be shared freely.

Category	Title / creator	License	Source
photo-person	Portrait photograph	PDM 1.0	flickr
photo-person	Pearls, Pearls, Pearls (cindy richardson)	BY 2.0	flickr
photo-scene	Mountain lake with gate and figures	CC0 1.0	rawpixel
photo-scene	Mountain lake scene (Ferdinand Katona)	CC0 1.0	rawpixel
photo-food	Tofu croquettes on salad (Lablascovegmenu)	BY 2.0	flickr
photo-food	Lunch at DCPS (DC Central Kitchen)	BY 2.0	flickr
chart	Net closing in on missing Stein Bagger millions (Podknox)	BY 2.0	flickr
chart	Financial statistics and accounting concept (paola.bazurto4)	BY 2.0	flickr
diagram	The Self-Repair Manifesto, iFixit (dullhunk)	BY 2.0	flickr
diagram	Using Social Media for Professional Learning (cambodia4kidsorg)	BY 2.0	flickr
document	Diary (Barnaby)	BY 2.0	flickr
document	Art Journal Template (Calsidyrose)	BY 2.0	flickr
artwork	Portrait of a uniformed man (Jacques-Émile Blanche)	CC0 1.0	rawpixel
artwork	Postinjohtaja Chr. Völschow, 1684	CC0 1.0	rawpixel
city	Eddy Street (dalecruse)	BY 2.0	flickr
city	Big City Life, Berlin (matthias.ripp)	BY 2.0	flickr
nature	Leafs in the winter fog (Gael Varoquaux)	BY 2.0	flickr
nature	Over the Trail (Corey Leopold)	BY 2.0	flickr
sport	Sydney Showground (edwin.11)	BY 2.0	flickr
sport	Sania Mirza (Andrew Campbell Photography)	BY 2.0	flickr
workspace	Writing Papers (Helloquence)	CC0 1.0	stocksnap
workspace	Macbook Laptop (Lia Leslie)	CC0 1.0	stocksnap
pie-chart	The Big Food Wasters (GDS Infographics)	BY 2.0	flickr
pie-chart	Illegal Drugs in the Americas (GDS Infographics)	BY 2.0	flickr
schematic	Control Display from Apollo 13 (jurvetson)	BY 2.0	flickr
schematic	Bicycle dynamo USB adapter (aegidian)	BY 2.0	flickr
map	Assorted Maps brush set (webtreats)	BY 2.0	flickr
map	Planisphaerium Ptolemaicum (Leventhal Map Center, BPL)	BY 2.0	flickr
letter	Write more Letters (Markus Reinhardt)	BY 2.0	flickr
letter	letter (Lizbeth King)	BY 2.0	flickr
still-life	Still Life with Fruit and Sweetmeats, 1635 (Georg Flegel)	CC0 1.0	rawpixel
still-life	Still Life with Fruit and Lobster (Pieter de Ring)	CC0 1.0	rawpixel
portrait	Art of gymnastics (lauragrb)	BY 2.0	flickr
vehicle	Nomenclature automobile en 1927 (claude-22)	PDM 1.0	flickr

11. Appendix: the full pipeline source

Every script that produced the numbers above, complete and runnable. They share a small request layer (gen_alt_text.py) and talk only to a local LM Studio endpoint. Click to expand.

download_images.py · scrape and download images from a web page

#!/usr/bin/env python3
"""Download images from a web page (or pages) into ./images.

Usage:
    python3 download_images.py                       # defaults to whalesync.com
    python3 download_images.py https://example.com   # one or more URLs
    python3 download_images.py URL --max 15          # cap number of images
"""
import argparse
import hashlib
import os
import re
import sys
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)
OUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "images")
# Skip vector/icon formats that raster vision models can't ingest well.
SKIP_EXT = {".svg", ".ico", ".gif"}


def largest_from_srcset(srcset: str) -> str | None:
    """Pick the highest-resolution candidate from a srcset attribute."""
    best_url, best_w = None, -1
    for part in srcset.split(","):
        bits = part.strip().split()
        if not bits:
            continue
        url = bits[0]
        w = 0
        if len(bits) > 1 and bits[1].endswith("w"):
            try:
                w = int(bits[1][:-1])
            except ValueError:
                w = 0
        if w >= best_w:
            best_url, best_w = url, w
    return best_url


def collect_image_urls(page_url: str) -> list[str]:
    r = requests.get(page_url, headers={"User-Agent": UA}, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    urls: list[str] = []

    for img in soup.find_all("img"):
        cand = None
        if img.get("srcset"):
            cand = largest_from_srcset(img["srcset"])
        if not cand:
            cand = img.get("src") or img.get("data-src")
        if cand:
            urls.append(urljoin(page_url, cand))

    # <picture><source srcset=...></picture>
    for source in soup.find_all("source"):
        if source.get("srcset"):
            cand = largest_from_srcset(source["srcset"])
            if cand:
                urls.append(urljoin(page_url, cand))

    # og:image
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        urls.append(urljoin(page_url, og["content"]))

    return urls


def filename_for(url: str, content_type: str) -> str:
    path = urlparse(url).path
    base = os.path.basename(path) or "image"
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    root, ext = os.path.splitext(base)
    if not ext:
        ext = {
            "image/jpeg": ".jpg",
            "image/png": ".png",
            "image/webp": ".webp",
        }.get(content_type.split(";")[0].strip(), ".jpg")
        base = root + ext
    # Prefix with a short URL hash to avoid collisions across pages.
    h = hashlib.sha1(url.encode()).hexdigest()[:8]
    return f"{h}_{base}"


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("urls", nargs="*", default=["https://whalesync.com"],
                    help="page URL(s) to scrape (default: https://whalesync.com)")
    ap.add_argument("--max", type=int, default=20, help="max images to download")
    ap.add_argument("--min-bytes", type=int, default=5000,
                    help="skip images smaller than this (filters icons/spacers)")
    args = ap.parse_args()
    urls = args.urls or ["https://whalesync.com"]

    os.makedirs(OUT_DIR, exist_ok=True)

    # Gather + dedupe candidate image URLs across all pages.
    seen_urls, candidates = set(), []
    for page in urls:
        try:
            found = collect_image_urls(page)
        except Exception as e:
            print(f"!! failed to scrape {page}: {e}", file=sys.stderr)
            continue
        print(f"   {page}: found {len(found)} <img>/source candidates")
        for u in found:
            if u in seen_urls:
                continue
            ext = os.path.splitext(urlparse(u).path)[1].lower()
            if ext in SKIP_EXT:
                continue
            seen_urls.add(u)
            candidates.append(u)

    print(f"\n{len(candidates)} unique candidate images; downloading up to {args.max}\n")

    saved, seen_hashes = 0, set()
    for u in candidates:
        if saved >= args.max:
            break
        try:
            resp = requests.get(u, headers={"User-Agent": UA}, timeout=30)
            resp.raise_for_status()
            data = resp.content
        except Exception as e:
            print(f"  skip (error) {u}: {e}")
            continue
        if len(data) < args.min_bytes:
            print(f"  skip (too small {len(data)}b) {u}")
            continue
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen_hashes:
            print(f"  skip (duplicate bytes) {u}")
            continue
        seen_hashes.add(digest)
        ct = resp.headers.get("Content-Type", "")
        if "image" not in ct and not os.path.splitext(u)[1]:
            print(f"  skip (not image, {ct}) {u}")
            continue
        fname = filename_for(u, ct)
        with open(os.path.join(OUT_DIR, fname), "wb") as f:
            f.write(data)
        saved += 1
        print(f"  [{saved}] {fname}  ({len(data)//1024} KB)")

    print(f"\nDone. {saved} images in {OUT_DIR}")


if __name__ == "__main__":
    main()

fetch_openverse.py · pull openly-licensed images from Openverse

#!/usr/bin/env python3
"""Fetch openly-licensed images (Openverse API) with license + attribution.
CC0 / public-domain / CC-BY only, recorded so the shareable report can credit them.

    python3 fetch_openverse.py                 # main set -> ./images
    python3 fetch_openverse.py --set val       # held-out validation set -> ./images_val
"""
import argparse
import hashlib
import io
import json
import os

import requests
from PIL import Image

HERE = os.path.dirname(os.path.abspath(__file__))
API = "https://api.openverse.org/v1/images/"
UA = "alt-text-research/1.0 (local eval)"

# (slug, query, category-or-None)
CATEGORIES_MAIN = [
    ("photo-person", "portrait person face", "photograph"),
    ("photo-scene", "mountain landscape lake", "photograph"),
    ("photo-animal", "dog puppy outdoors", "photograph"),
    ("photo-food", "plate of food meal", "photograph"),
    ("chart", "bar chart graph statistics", None),
    ("diagram", "flowchart process diagram", None),
    ("document", "scanned document page text", None),
    ("artwork", "oil painting portrait", "digitized_artwork"),
]
# Deliberately different queries so the validation set is genuinely held-out.
CATEGORIES_VAL = [
    ("portrait", "man headshot studio", "photograph"),
    ("city", "city street architecture", "photograph"),
    ("nature", "forest trees path", "photograph"),
    ("sport", "soccer match action", "photograph"),
    ("workspace", "laptop desk office", "photograph"),
    ("pie-chart", "pie chart infographic", None),
    ("schematic", "circuit diagram schematic", None),
    ("map", "antique map illustration", None),
    ("letter", "handwritten letter vintage", None),
    ("still-life", "still life painting fruit", "digitized_artwork"),
]
MIN_BYTES, MAX_BYTES = 15_000, 8_000_000


def search(query, category):
    params = {"q": query, "license": "cc0,pdm,by", "page_size": 12, "mature": "false"}
    if category:
        params["category"] = category
    r = requests.get(API, params=params, headers={"User-Agent": UA}, timeout=30)
    r.raise_for_status()
    return r.json().get("results", [])


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--set", choices=["main", "val"], default="main")
    ap.add_argument("--per", type=int, default=2, help="images per category")
    args = ap.parse_args()

    cats = CATEGORIES_MAIN if args.set == "main" else CATEGORIES_VAL
    out_dir = os.path.join(HERE, "images" if args.set == "main" else "images_val")
    meta_path = os.path.join(HERE, "images_meta.json" if args.set == "main"
                             else "images_val_meta.json")
    os.makedirs(out_dir, exist_ok=True)
    meta = json.load(open(meta_path)) if os.path.exists(meta_path) else {}

    # Dedupe against anything already downloaded in either set.
    seen_hashes = set()
    for d in ("images", "images_val"):
        dp = os.path.join(HERE, d)
        if os.path.isdir(dp):
            for fn in os.listdir(dp):
                try:
                    seen_hashes.add(hashlib.sha1(open(os.path.join(dp, fn), "rb").read()).hexdigest())
                except Exception:
                    pass

    for slug, query, category in cats:
        print(f"=== {slug}: '{query}' ===")
        got = 0
        try:
            results = search(query, category)
        except Exception as e:
            print(f"  !! search failed: {e}")
            continue
        for item in results:
            if got >= args.per:
                break
            url = item.get("url")
            if not url:
                continue
            try:
                resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
                resp.raise_for_status()
                data = resp.content
            except Exception as e:
                print(f"  skip (dl error) {url[:55]}: {e}")
                continue
            if not (MIN_BYTES <= len(data) <= MAX_BYTES):
                continue
            digest = hashlib.sha1(data).hexdigest()
            if digest in seen_hashes:
                continue
            try:
                im = Image.open(io.BytesIO(data)); im.load()
                if min(im.size) < 200:
                    continue
                ext = (im.format or "JPEG").lower().replace("jpeg", "jpg")
            except Exception:
                continue
            seen_hashes.add(digest)
            fname = f"ov_{slug}_{got+1}_{digest[:6]}.{ext}"
            with open(os.path.join(out_dir, fname), "wb") as f:
                f.write(data)
            meta[fname] = {
                "category": slug, "title": item.get("title"), "creator": item.get("creator"),
                "license": (item.get("license") or "").upper(),
                "license_version": item.get("license_version"),
                "source": item.get("source"), "landing": item.get("foreign_landing_url"),
            }
            got += 1
            print(f"  [{got}] {fname}  ({len(data)//1024} KB)  {meta[fname]['license']}  {im.size}")
        if got == 0:
            print("  (no usable results)")

    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
    n = len([k for k in meta])
    print(f"\nDone. {n} images recorded in {os.path.basename(meta_path)} (set={args.set})")


if __name__ == "__main__":
    main()

gen_alt_text.py · the request layer: send an image to a local model and parse the reply

#!/usr/bin/env python3
"""Generate alt text for every image in ./images using local LM Studio models.

LM Studio exposes an OpenAI-compatible API at http://localhost:1234/v1.

Usage:
    python3 gen_alt_text.py --probe          # test which models accept images
    python3 gen_alt_text.py                   # full run over all images/models
    python3 gen_alt_text.py --models qwen/qwen3-vl-30b google/gemma-4-12b-qat
    python3 gen_alt_text.py --prompt "Write alt text in <= 12 words."
"""
import argparse
import base64
import glob
import json
import os
import subprocess
import time

import io

import requests
from PIL import Image

HERE = os.path.dirname(os.path.abspath(__file__))
IMG_DIR = os.path.join(HERE, "images")
API = "http://localhost:1234/v1/chat/completions"

# Candidate vision models already downloaded in LM Studio, ordered fastest
# -> slowest so the report fills in quickly (the reasoning model is last).
DEFAULT_MODELS = [
    "mistralai/ministral-3-3b",
    "google/gemma-4-12b-qat",
    "qwen/qwen3-vl-30b",
    "google/gemma-4-26b-a4b",
    "mistralai/ministral-3-14b-reasoning",
]

DEFAULT_PROMPT = (
    "Write concise, descriptive alt text for this image to be used on a website. "
    "Describe the key visual content in one sentence. Be specific about what is "
    "shown. Do not begin with 'image of', 'picture of', or 'a screenshot of'. "
    "Output only the alt text, nothing else."
)

MAX_SIDE = 1024  # downscale longest edge before sending (speed + sanity)


def data_url(path: str) -> str:
    """Normalize any format (webp/avif/png/...) to a downscaled JPEG data URL."""
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    if max(im.size) > MAX_SIDE:
        im.thumbnail((MAX_SIDE, MAX_SIDE))
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"


def ask_full(model: str, prompt: str, img_path: str, timeout: int,
             max_tokens: int = 1024, reasoning_effort: str | None = None,
             base_url: str = API, headers: dict | None = None) -> tuple[str, float, dict]:
    """Return (alt_text_or_error, seconds, usage). Errors are prefixed 'ERROR:'.

    usage = {prompt, completion, total, reasoning, cost} (values may be None).
    `base_url`/`headers` make this provider-agnostic (LM Studio default; an
    OpenAI-compatible cloud endpoint like OpenRouter can be passed in).

    Reasoning models (Gemma-4, ministral-*-reasoning) emit chain-of-thought into
    a separate reasoning_content field; the real answer is in content. A generous
    max_tokens lets them finish thinking AND answer. Alternatively pass
    reasoning_effort='none' to disable thinking entirely (much faster, no truncation).
    """
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url(img_path)}},
            ],
        }],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }
    if reasoning_effort is not None:
        payload["reasoning_effort"] = reasoning_effort
    t0 = time.time()
    try:
        r = requests.post(base_url, json=payload, timeout=timeout, headers=headers)
        dt = time.time() - t0
        if r.status_code != 200:
            return f"ERROR: HTTP {r.status_code}: {r.text[:200]}", dt, {}
        j = r.json()
        ch = j["choices"][0]
        u = j.get("usage", {}) or {}
        det = u.get("completion_tokens_details", {}) or {}
        usage = {"prompt": u.get("prompt_tokens"), "completion": u.get("completion_tokens"),
                 "total": u.get("total_tokens"), "reasoning": det.get("reasoning_tokens"),
                 "cost": u.get("cost")}  # cost present only on some providers (OpenRouter)
        text = (ch["message"].get("content") or "").strip()
        if not text:
            if ch.get("finish_reason") == "length":
                return "ERROR: truncated (ran out of tokens while reasoning)", dt, usage
            return "ERROR: empty response", dt, usage
        return text, dt, usage
    except Exception as e:
        return f"ERROR: {e}", time.time() - t0, {}


def ask(model: str, prompt: str, img_path: str, timeout: int,
        max_tokens: int = 1024, reasoning_effort: str | None = None) -> tuple[str, float]:
    """Backward-compatible wrapper that drops the usage dict."""
    text, dt, _ = ask_full(model, prompt, img_path, timeout, max_tokens, reasoning_effort)
    return text, dt


def unload_all():
    """Free VRAM/RAM before loading the next model (best effort)."""
    subprocess.run(["lms", "unload", "--all"], capture_output=True)


def probe(models, prompt, timeout):
    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    test_img = imgs[0]
    print(f"Probing {len(models)} models with: {os.path.basename(test_img)}\n")
    for m in models:
        unload_all()
        out, dt = ask(m, prompt, test_img, timeout)
        ok = not out.startswith("ERROR:")
        tag = "VISION OK " if ok else "NO/ERROR  "
        preview = out.replace("\n", " ")[:90]
        print(f"  [{tag}] {m:42s} {dt:5.1f}s  {preview}")


def run(models, prompt, timeout):
    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{len(imgs)} images x {len(models)} models\n")

    results = {os.path.basename(p): {} for p in imgs}

    def flush():
        with open(os.path.join(HERE, "results.json"), "w") as f:
            json.dump({"prompt": prompt, "models": models, "results": results},
                      f, indent=2)
        build_report(models, prompt, results)

    for m in models:
        print(f"=== {m} ===")
        unload_all()
        for p in imgs:
            name = os.path.basename(p)
            out, dt = ask(m, prompt, p, timeout)
            results[name][m] = {"text": out, "seconds": round(dt, 1)}
            preview = out.replace("\n", " ")[:70]
            print(f"  {name:34s} {dt:5.1f}s  {preview}")
            flush()  # incremental: report.html updates after every image
        print()

    print("Wrote results.json and report.html")


def build_report(models, prompt, results):
    rows = []
    for name, by_model in results.items():
        cells = []
        for m in models:
            r = by_model.get(m, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            secs = r.get("seconds", "")
            err = txt.startswith("ERROR:")
            cells.append(
                f'<td class="{ "err" if err else "" }">{txt}'
                f'<div class="meta">{secs}s</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}"><div class="meta">{name}</div></td>'
            + "".join(cells) + "</tr>"
        )
    head = "".join(f"<th>{m}</th>" for m in models)
    html = f"""<!doctype html><meta charset="utf-8">
<title>Alt text comparison</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 .prompt{{background:#f4f4f5;padding:10px 14px;border-radius:8px;margin-bottom:16px}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;position:sticky;top:0}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:320px}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Alt text comparison</h1>
<div class="prompt"><b>Prompt:</b> {prompt}</div>
<table><tr><th>Image</th>{head}</tr>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report.html"), "w") as f:
        f.write(html)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--models", nargs="+", default=DEFAULT_MODELS)
    ap.add_argument("--prompt", default=DEFAULT_PROMPT)
    ap.add_argument("--timeout", type=int, default=300)
    ap.add_argument("--probe", action="store_true",
                    help="just test which models accept images (1 image)")
    args = ap.parse_args()
    if args.probe:
        probe(args.models, args.prompt, args.timeout)
    else:
        run(args.models, args.prompt, args.timeout)


if __name__ == "__main__":
    main()

alt_prompts.py · sweep prompt styles on a single model

#!/usr/bin/env python3
"""Compare alt-text PROMPT styles on a single model (default: ministral-3-3b).

Runs each candidate prompt over every image in ./images and builds
report_prompts.html with one column per prompt, plus character counts
(good web alt text is generally <= 125 chars).

    python3 alt_prompts.py
    python3 alt_prompts.py --model qwen/qwen3-vl-30b
"""
import argparse
import glob
import json
import os
import re

from gen_alt_text import ask, unload_all, HERE, IMG_DIR

MODEL = "mistralai/ministral-3-3b"

# Candidate prompt styles to compare. Tweak freely.
PROMPTS = {
    "detailed": (
        "Write descriptive alt text for this image for a software company's "
        "website. One sentence describing the key visual content. Do not start "
        "with 'image of' or 'screenshot of'. Output only the alt text."
    ),
    "web_125": (
        "Write production-ready alt text for this image as it would appear on a "
        "SaaS marketing website. Requirements: a single line UNDER 125 characters; "
        "describe the most important visual content and any prominent UI or text; "
        "plain, specific language; do NOT start with 'image of', 'photo of', "
        "'screenshot of', or 'illustration of'; no quotation marks. "
        "Output only the alt text."
    ),
    "short_12w": (
        "Write very short alt text for this image: at most 12 words, one line, "
        "no ending punctuation, no quotes, do not start with 'image of' or "
        "'screenshot of'. Output only the alt text."
    ),
}


def clean_alt(text: str) -> str:
    """Strip the junk models add around the actual alt text."""
    if text.startswith("ERROR:"):
        return text
    t = text.strip()
    # Strip model-specific special tokens, e.g. GLM's <|begin_of_box|> wrappers.
    t = re.sub(r'<\|[^|]*\|>', '', t)
    # Drop a leading label like "Alt text:" / "Alt-text -" etc.
    t = re.sub(r'^\s*alt[\s_-]*text\s*[:\-–]\s*', '', t, flags=re.I)
    # Remove wrapping single/double/smart quotes.
    t = t.strip().strip('"“”‘’\'').strip()
    # Collapse internal whitespace/newlines to single spaces.
    t = re.sub(r'\s+', ' ', t)
    return t


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--model", default=MODEL)
    ap.add_argument("--timeout", type=int, default=300)
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"Model: {args.model}\n{len(imgs)} images x {len(PROMPTS)} prompts\n")

    unload_all()  # load the one model once
    results = {os.path.basename(p): {} for p in imgs}

    def flush():
        with open(os.path.join(HERE, "results_prompts.json"), "w") as f:
            json.dump({"model": args.model, "prompts": PROMPTS,
                       "results": results}, f, indent=2)
        build_report(args.model, results)

    for key, prompt in PROMPTS.items():
        print(f"=== prompt: {key} ===")
        for p in imgs:
            name = os.path.basename(p)
            raw, dt = ask(args.model, prompt, p, args.timeout)
            txt = clean_alt(raw)
            results[name][key] = {"text": txt, "chars": len(txt),
                                  "seconds": round(dt, 1)}
            print(f"  {name[:28]:28s} {len(txt):3d}c {dt:4.1f}s  {txt[:60]}")
            flush()
        print()
    print("Wrote results_prompts.json and report_prompts.html")


def build_report(model, results):
    keys = list(PROMPTS.keys())
    rows = []
    for name, by in results.items():
        cells = []
        for k in keys:
            r = by.get(k, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            chars = r.get("chars", "")
            over = isinstance(chars, int) and chars > 125
            err = txt.startswith("ERROR:")
            cls = "err" if err else ("over" if over else "")
            cells.append(
                f'<td class="{cls}">{txt}'
                f'<div class="meta">{chars} chars · {r.get("seconds","")}s</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name}</div></td>' + "".join(cells) + "</tr>"
        )
    head = "".join(
        f'<th>{k}<div class="meta">{PROMPTS[k][:60]}…</div></th>' for k in keys
    )
    html = f"""<!doctype html><meta charset="utf-8">
<title>Alt-text prompt comparison — {model}</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 h1 .meta{{font-weight:400}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;font-size:13px}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:300px}}
 .meta{{color:#888;font-size:12px;margin-top:6px;font-weight:400}}
 .over{{background:#fff8e6}} .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Alt-text prompt comparison <span class="meta">— {model}</span></h1>
<p class="meta">Amber = over 125 characters.</p>
<table><tr><th>Image</th>{head}</tr>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_prompts.html"), "w") as f:
        f.write(html)


if __name__ == "__main__":
    main()

make_alt_text.py · production run with boilerplate cleanup and the shorten retry

#!/usr/bin/env python3
"""Produce production-ready alt text for every image in ./images.

Model: ministral-3-3b (LM Studio). Balanced style, hard <=125 char cap with a
text-only 'shorten' retry for any that spill over.

Outputs: alt_text.json, alt_text.csv, report_final.html

    python3 make_alt_text.py
"""
import argparse
import csv
import glob
import json
import os

import requests

from gen_alt_text import ask, data_url, unload_all, HERE, IMG_DIR, API
from alt_prompts import clean_alt

MODEL = "mistralai/ministral-3-3b"
LIMIT = 125

PROMPT = (
    "Write production-ready alt text for this image for a SaaS company's website.\n"
    "Rules:\n"
    "- ONE line, MAXIMUM 125 characters.\n"
    "- Describe only what is actually visible. Do NOT guess locations, settings, "
    "people's names, or brands that are not clearly shown.\n"
    "- For a photo of a person, describe them plainly by apparent gender and clearly "
    "visible features, e.g. 'Woman with long dark hair in a white turtleneck, smiling'. "
    "Do not use vague labels like 'business professional' or 'person' when gender is "
    "clearly visible.\n"
    "- For a product illustration, diagram, or UI mockup, describe what the graphic "
    "DEPICTS (the product, tool, or workflow). Treat any sample or placeholder text "
    "shown inside a mockup as example content, NOT the subject — do not make a "
    "placeholder word the main topic.\n"
    "- If readable product names or UI labels are clearly visible (e.g. Supabase, "
    "Notion, Webflow), you may include them.\n"
    "- Plain, specific language. Avoid subjective adjectives like sleek, futuristic, "
    "cheerful, professional, or modern.\n"
    "- Do NOT start with 'image of', 'photo of', 'screenshot of', or 'illustration of'.\n"
    "- No quotation marks. Output only the alt text."
)


def shrink(text: str, timeout: int, model: str = MODEL) -> str:
    """Text-only pass: rewrite an over-long alt text to fit under the limit.

    max_tokens is generous so reasoning models can think and still answer; if the
    rewrite comes back empty/errored, we keep the original rather than lose it.
    """
    prompt = (
        f"Rewrite this website alt text to be UNDER {LIMIT} characters while keeping "
        "the most important information. One line, plain language, no quotes. "
        f"Output only the rewritten alt text:\n\n{text}"
    )
    payload = {"model": model, "temperature": 0.2, "max_tokens": 512,
               "messages": [{"role": "user", "content": prompt}]}
    try:
        r = requests.post(API, json=payload, timeout=timeout)
        if r.status_code == 200:
            cand = clean_alt(r.json()["choices"][0]["message"].get("content") or "")
            if cand and not cand.startswith("ERROR:"):
                return cand
    except Exception:
        pass
    return text


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--timeout", type=int, default=300)
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{MODEL}: writing alt text for {len(imgs)} images (<= {LIMIT} chars)\n")
    unload_all()

    out = {}
    for p in imgs:
        name = os.path.basename(p)
        raw, dt = ask(MODEL, PROMPT, p, args.timeout)
        alt = clean_alt(raw)
        tries = 0
        while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
            alt = shrink(alt, args.timeout)
            tries += 1
        flag = "" if len(alt) <= LIMIT else " (still over)"
        out[name] = {"alt": alt, "chars": len(alt)}
        print(f"  {name[:30]:30s} {len(alt):3d}c {dt:4.1f}s{flag}  {alt}")

    # JSON
    with open(os.path.join(HERE, "alt_text.json"), "w") as f:
        json.dump(out, f, indent=2)
    # CSV
    with open(os.path.join(HERE, "alt_text.csv"), "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["filename", "alt_text", "chars"])
        for name, r in out.items():
            w.writerow([name, r["alt"], r["chars"]])
    # HTML
    rows = []
    for name, r in out.items():
        alt = r["alt"].replace("<", "&lt;").replace(">", "&gt;")
        over = r["chars"] > LIMIT
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name}</div></td>'
            f'<td class="{ "over" if over else "" }">{alt}'
            f'<div class="meta">{r["chars"]} chars</div></td></tr>'
        )
    html = f"""<!doctype html><meta charset="utf-8">
<title>Final alt text — {MODEL}</title>
<style>
 body{{font:15px/1.55 -apple-system,sans-serif;margin:24px;color:#111}}
 table{{border-collapse:collapse;width:100%;max-width:1000px}}
 td{{border:1px solid #ddd;padding:12px;vertical-align:top}}
 td.img{{width:220px}} td.img img{{max-width:200px;max-height:150px;display:block}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .over{{background:#fff8e6}}
</style>
<h1>Final alt text <span class="meta">— {MODEL}, &le;{LIMIT} chars</span></h1>
<table>{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_final.html"), "w") as f:
        f.write(html)
    over = sum(1 for r in out.values() if r["chars"] > LIMIT)
    print(f"\nWrote alt_text.json, alt_text.csv, report_final.html "
          f"({len(out)} images, {over} over {LIMIT} chars)")


if __name__ == "__main__":
    main()

compare_models.py · the model-comparison harness (the main study)

#!/usr/bin/env python3
"""Compare models head-to-head on the SAME production alt-text prompt.

Same PROMPT, clean-up, and <=125 char cap (with shorten retry) as make_alt_text,
so differences are purely the model. Builds a side-by-side report.

    python3 compare_models.py                      # ministral vs gemma-4-12b
    python3 compare_models.py --models mistralai/ministral-3-3b google/gemma-4-26b-a4b qwen/qwen3-vl-30b
"""
import argparse
import glob
import json
import os

from gen_alt_text import ask_full, unload_all, HERE, IMG_DIR
from alt_prompts import clean_alt
from make_alt_text import PROMPT, LIMIT, shrink

DEFAULT_MODELS = ["mistralai/ministral-3-3b", "google/gemma-4-12b-qat"]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--models", nargs="+", default=DEFAULT_MODELS)
    ap.add_argument("--timeout", type=int, default=400)
    ap.add_argument("--max-tokens", type=int, default=1024,
                    help="reasoning models (Gemma-4) need ~4000 on complex images")
    ap.add_argument("--reasoning-effort", default=None,
                    help="set to 'none' to disable thinking on reasoning models (fast)")
    ap.add_argument("--prompt-suffix", default="",
                    help="appended to the prompt, e.g. ' /nothink' to disable GLM thinking")
    ap.add_argument("--only", nargs="+", default=None,
                    help="only images whose filename contains one of these substrings")
    ap.add_argument("--merge", action="store_true",
                    help="add these models as columns to existing results_compare.json "
                         "instead of overwriting (lets you add one model later)")
    args = ap.parse_args()

    imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*")))
    if args.only:
        imgs = [p for p in imgs
                if any(s in os.path.basename(p) for s in args.only)]
    if not imgs:
        print("No images in ./images — run download_images.py first.")
        return
    print(f"{len(imgs)} images x {len(args.models)} models (<= {LIMIT} chars)\n")

    path = os.path.join(HERE, "results_compare.json")
    existing = json.load(open(path)) if (args.merge and os.path.exists(path)) else {}
    ex_res, ex_models = existing.get("results", {}), existing.get("models", [])
    # Report shows existing columns first, then any newly-added models.
    report_models = ex_models + [m for m in args.models if m not in ex_models]
    results = {os.path.basename(p): dict(ex_res.get(os.path.basename(p), {}))
               for p in imgs}

    def flush():
        with open(path, "w") as f:
            json.dump({"prompt": PROMPT, "models": report_models,
                       "results": results}, f, indent=2)
        build_report(report_models, results)

    for m in args.models:
        print(f"=== {m} ===")
        unload_all()
        for p in imgs:
            name = os.path.basename(p)
            raw, dt, usage = ask_full(m, PROMPT + args.prompt_suffix, p, args.timeout,
                                      max_tokens=args.max_tokens,
                                      reasoning_effort=args.reasoning_effort)
            alt = clean_alt(raw)
            tries = 0
            while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
                alt = shrink(alt, args.timeout, model=m)
                tries += 1
            results[name][m] = {"text": alt, "chars": len(alt), "seconds": round(dt, 1),
                                "ptok": usage.get("prompt"), "ctok": usage.get("completion"),
                                "rtok": usage.get("reasoning")}
            pt, ctk = usage.get("prompt"), usage.get("completion")
            print(f"  {name[:24]:24s} {len(alt):3d}c {dt:5.1f}s {str(pt):>4}+{str(ctk):<4}tok  {alt[:42]}")
            flush()
        print()
    print("Wrote results_compare.json and report_compare.html")


def build_report(models, results):
    # Per-model summary (averages over completed cells).
    summary_cells = []
    for m in models:
        cells = [by[m] for by in results.values() if m in by]
        ok = [c for c in cells if not (c.get("text") or "").startswith("ERROR:")]
        errs = len(cells) - len(ok)
        over = sum(1 for c in ok if c.get("chars", 0) > LIMIT)
        avg_c = round(sum(c.get("chars", 0) for c in ok) / len(ok)) if ok else 0
        avg_s = round(sum(c.get("seconds", 0) for c in ok) / len(ok), 1) if ok else 0
        pt = [c["ptok"] for c in ok if c.get("ptok")]
        ct = [c["ctok"] for c in ok if c.get("ctok")]
        avg_pt = round(sum(pt) / len(pt)) if pt else 0
        avg_ct = round(sum(ct) / len(ct)) if ct else 0
        summary_cells.append(
            f"<td><b>{avg_c}</b> avg chars · <b>{avg_s}s</b> avg<br>"
            f"<span class='meta'>{avg_pt} in + {avg_ct} out tok/img · "
            f"{over} over {LIMIT}c · {errs} errors</span></td>"
        )
    summary = (f"<tr><th>summary</th>{''.join(summary_cells)}</tr>")

    rows = []
    for name, by in results.items():
        cells = []
        for m in models:
            r = by.get(m, {})
            txt = (r.get("text") or "").replace("<", "&lt;").replace(">", "&gt;")
            chars = r.get("chars", "")
            over = isinstance(chars, int) and chars > LIMIT
            err = txt.startswith("ERROR:")
            cls = "err" if err else ("over" if over else "")
            tok = ""
            if r.get("ptok") or r.get("ctok"):
                tok = f' · {r.get("ptok","?")}+{r.get("ctok","?")} tok'
            cells.append(
                f'<td class="{cls}">{txt}'
                f'<div class="meta">{chars} chars · {r.get("seconds","")}s{tok}</div></td>'
            )
        rows.append(
            f'<tr><td class="img"><img src="images/{name}">'
            f'<div class="meta">{name[:24]}</div></td>' + "".join(cells) + "</tr>"
        )
    head = "".join(f"<th>{m}</th>" for m in models)
    html = f"""<!doctype html><meta charset="utf-8">
<title>Model comparison — alt text</title>
<style>
 body{{font:14px/1.5 -apple-system,sans-serif;margin:24px;color:#111}}
 table{{border-collapse:collapse;width:100%}}
 th,td{{border:1px solid #ddd;padding:10px;vertical-align:top;text-align:left}}
 th{{background:#fafafa;position:sticky;top:0}}
 td.img{{width:200px}} td.img img{{max-width:180px;max-height:140px;display:block}}
 td{{width:330px}}
 .meta{{color:#888;font-size:12px;margin-top:6px}}
 .over{{background:#fff8e6}} .err{{color:#b00;background:#fff5f5}}
</style>
<h1>Model comparison — production alt-text prompt</h1>
<p class="meta">Amber = over {LIMIT} characters. Gemma-4 run with thinking disabled (reasoning_effort=none).</p>
<table><tr><th>Image</th>{head}</tr>{summary}{''.join(rows)}</table>
"""
    with open(os.path.join(HERE, "report_compare.html"), "w") as f:
        f.write(html)


if __name__ == "__main__":
    main()

size_experiment.py · sweep input resolution for a fixed model

#!/usr/bin/env python3
"""Variable: IMAGE SIZE. Sweep input resolution for a fixed model + prompt.

Sends the same text-heavy UI images at several max-side resolutions and records
the alt text, latency, and payload size — to find the resolution floor below
which the model can no longer read on-screen labels.

    python3 size_experiment.py
"""
import base64
import io
import json
import os
import time

import requests
from PIL import Image

from gen_alt_text import unload_all, HERE, IMG_DIR, API
from make_alt_text import PROMPT

MODEL = "qwen/qwen3-vl-30b"
SIZES = [256, 384, 512, 768, 1024, 1536]
# Text-heavy UI images (large originals) where legibility should matter most.
IMAGES = ["6f9ab5e5", "bddc34c3", "a86c45b5"]


def encode_at(path: str, size: int):
    im = Image.open(path)
    im.load()
    if im.mode != "RGB":
        im = im.convert("RGB")
    im.thumbnail((size, size))  # only downscales, never upscales
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=85)
    raw = buf.getvalue()
    return ("data:image/jpeg;base64," + base64.b64encode(raw).decode(),
            f"{im.size[0]}x{im.size[1]}", len(raw))


def ask_data_url(data_url: str, timeout: int = 120):
    payload = {
        "model": MODEL, "temperature": 0.2, "max_tokens": 300,
        "reasoning_effort": "none",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": data_url}}]}],
    }
    t0 = time.time()
    r = requests.post(API, json=payload, timeout=timeout)
    dt = time.time() - t0
    if r.status_code != 200:
        return f"ERROR: HTTP {r.status_code}", dt
    return (r.json()["choices"][0]["message"].get("content") or "").strip(), dt


def main():
    import glob
    paths = []
    for sub in IMAGES:
        hits = glob.glob(os.path.join(IMG_DIR, f"*{sub}*")) or \
               [p for p in glob.glob(os.path.join(IMG_DIR, "*")) if sub in os.path.basename(p)]
        if hits:
            paths.append(hits[0])
    print(f"{MODEL}: {len(paths)} images x {len(SIZES)} sizes\n")
    unload_all()

    results = {}
    for p in paths:
        name = os.path.basename(p)
        results[name] = {}
        print(f"=== {name} ===")
        for s in SIZES:
            data_url, px, kb = encode_at(p, s)
            text, dt = ask_data_url(data_url)
            results[name][s] = {"text": text, "chars": len(text),
                                "seconds": round(dt, 1), "px": px,
                                "kb": round(kb / 1024, 1)}
            print(f"  size {s:4d} -> {px:>9} {kb/1024:5.1f}KB {dt:4.1f}s {len(text):3d}c  {text[:55]}")
            with open(os.path.join(HERE, "results_size.json"), "w") as f:
                json.dump({"model": MODEL, "prompt": PROMPT, "sizes": SIZES,
                           "results": results}, f, indent=2)
        print()
    print("Wrote results_size.json")


if __name__ == "__main__":
    main()

validate.py · the held-out repeatability run

#!/usr/bin/env python3
"""Repeatability / validation pass: run the study's BEST combo on a held-out set.

  Best combo  = Qwen3-VL-30B  +  tightened <=125 prompt  +  768px  +  shorten-pass
  Held-out set = ./images_val (20 images never seen during the study)

Writes results_validation.json. Reports any issues (errors, over-limit, slow).
"""
import glob
import json
import os

import gen_alt_text
gen_alt_text.MAX_SIDE = 768  # the validated resolution sweet spot

from gen_alt_text import ask_full, unload_all, HERE  # noqa: E402
from alt_prompts import clean_alt  # noqa: E402
from make_alt_text import PROMPT, LIMIT, shrink  # noqa: E402

MODEL = "qwen/qwen3-vl-30b"
VAL_DIR = os.path.join(HERE, "images_val")


def main():
    imgs = sorted(glob.glob(os.path.join(VAL_DIR, "*")))
    if not imgs:
        print("No images in ./images_val — run fetch_openverse.py --set val")
        return
    print(f"Validation: {MODEL} @768px on {len(imgs)} held-out images\n")
    unload_all()

    out = {}
    for p in imgs:
        name = os.path.basename(p)
        raw, dt, usage = ask_full(MODEL, PROMPT, p, timeout=300, max_tokens=300)
        alt = clean_alt(raw)
        tries = 0
        while len(alt) > LIMIT and not alt.startswith("ERROR:") and tries < 2:
            alt = shrink(alt, 300, model=MODEL)
            tries += 1
        out[name] = {"alt": alt, "chars": len(alt), "seconds": round(dt, 1),
                     "ptok": usage.get("prompt"), "ctok": usage.get("completion")}
        flag = ""
        if alt.startswith("ERROR:"):
            flag = " <<ERROR"
        elif len(alt) > LIMIT:
            flag = " <<OVER"
        print(f"  {name[:30]:30s} {len(alt):3d}c {dt:4.1f}s {usage.get('prompt')}tok  {alt[:46]}{flag}")

    with open(os.path.join(HERE, "results_validation.json"), "w") as f:
        json.dump({"model": MODEL, "prompt": PROMPT, "max_side": gen_alt_text.MAX_SIDE,
                   "results": out}, f, indent=2)

    errs = [n for n, r in out.items() if r["alt"].startswith("ERROR:")]
    over = [n for n, r in out.items() if not r["alt"].startswith("ERROR:") and r["chars"] > LIMIT]
    print(f"\n{len(out)} images | errors: {len(errs)} | over {LIMIT}c: {len(over)}")
    if errs:
        print("  errors:", errs)
    if over:
        print("  over-limit:", over)
    print("Wrote results_validation.json")


if __name__ == "__main__":
    main()

All inference for this report ran on-device through LM Studio. No image or prompt left the machine, and no cloud API was called. Our product images are © whalesync.com, used here for evaluation.

Local Vision Models for Alt Text, Tested: Six Models, Three Variables, One Recipe (2026)