Scratch

Your CRM has the same country written three ways. "US" on one contact, "USA" on the next, "United States" on the one after that. Names came off a trade-show scanner IN ALL CAPS, the same account carries Inc, Inc., and Incorporated across three company records, and the phone numbers follow no format anyone agreed on. Your CMS is no tidier: blog titles in five capitalizations because five people wrote them, product names that drifted off the convention three rebrands ago, and bodies still carrying class="MsoNormal" from a decade of pasting out of Word. None of it broke anything. Content accretes mess the way a junk drawer does, one reasonable addition at a time.

Then something forces the issue. A report that has to group by country. A migration that inherits every inconsistency. A help center about to feed an AI assistant. A rename that has to reach every record, not the half someone got to. Find and replace is too blunt for this: it swaps a string but cannot decide that "USA" and "United States" should both become "US" while "USA Today" stays put. A database-wide search and replace can do it in one command, which is exactly why one bad pattern rewrites every row with no way back. The job has a reliable shape: pull every record out as files, let AI clean them against your rules, review every change next to the original, and publish only what you approve. The options below each deliver some slice of that. They differ on whether anything makes a judgment call, and whether anyone sees the change before your customers do.

What counts as content that needs sanitizing

Sanitizing is two jobs wearing one word: making content consistent, and getting the junk out of it. Here is what actually accretes, most common first. Most cleanups need several of these at once, and naming which ones you have is half the brief.

Inconsistent contact and company records. The everyday CRM job. "US" next to "USA" next to "United States". Phone numbers in four formats. NAMES IN ALL CAPS off a badge scanner. Company-suffix drift, Inc, Inc., and Incorporated on three records for one account. Job titles that never agreed on VP versus V.P. versus Vice President. Trailing whitespace, double spaces, and the near-duplicate records all of it breeds.
Titles and naming that drifted. Blog post titles in five capitalizations because five people wrote them. Product names that stopped following the convention three rebrands ago. SKUs masquerading as names. The fix is not a rewrite, it is enforcing one agreed shape across every record: sentence case here, the brand name in this position, the size at the end.
Pasted-in markup junk. Content that came through Word, Google Docs, or Outlook drags its origin with it: class="MsoNormal", empty <o:p></o:p>, inline style="mso-...", deprecated <font> tags, runs of &nbsp;, and <p>&nbsp;</p> spacers that fight your design system. It renders fine until it does not.
Encoding gremlins and mojibake. The fingerprint of a bad import: â€™ where an apostrophe belongs, â€" for a dash, Ã© for é, Â£ for £, and the replacement glyph �. A find and replace fixes a known set; the long tail takes a reader.
Tracking junk and bad links in the copy. Body links carrying utm_, fbclid, gclid, or ?ref=. Hardcoded http:// that should be https://. Preview and staging hosts that leaked into production, staging., localhost, an internal IP, a *.webflow.io URL. Links that now 404.
Internal leakage and placeholders. lorem ipsum that outlived the launch, a stray [DRAFT] or TODO, a reviewer's note in brackets, an internal ticket number, "DO NOT PUBLISH", an employee's name left in customer-facing copy.
Leaked PII in free-text fields. The one that turns urgent the moment content meets an AI assistant or an export. A support transcript pasted into a deal, a home address in a contact's description, an email and phone left in a note headed for a RAG index. The grep is mechanical: emails, \d{3}[-.\s]\d{3}[-.\s]\d{4} phone runs, card- and SSN-shaped digit strings. The judgment is not: a phone number in a private note is leakage, the same number in a published "call us" block is the point.
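As an illustration of the encoding item above, here is a minimal sketch of the most common mojibake repair. It assumes the damage came from UTF-8 bytes being decoded as Windows-1252, which is what produces Ã© for é and Â£ for £. The function name repair_mojibake is hypothetical, and anything that does not fit this one cause is returned untouched for a reader to look at.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: this is the only cause being repaired. If the round trip
    fails, the text is returned unchanged rather than guessed at.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    for sample in ["CafÃ©", "Price: Â£20", "Itâ€™s here", "café"]:
        print(repr(sample), "->", repr(repair_mojibake(sample)))
```

Already-clean text such as "café" fails the round trip and comes back as it was, so the pass is safe to run over every record. Strings that mix damaged and clean characters also come back unchanged, which is the long tail that still takes a reader.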

Most of this is pattern work a script does in seconds. The expensive part is the judgment call hiding inside each one: which "USA" is really United States and which belongs to "USA Today", whether a bracketed line is an internal note or real copy. That is exactly the part a regex cannot make and a person should not have to make 4,000 times by hand.
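The mechanical half of the PII pass can be sketched in a few lines. The phone pattern is the one quoted in the list above; the email and SSN-shaped patterns, and the name find_pii_candidates, are illustrative assumptions rather than a complete detector. The output is a list of candidates, because deciding which ones are leakage is the judgment step.

```python
import re

# Phone pattern as quoted in this guide; the other two are rough assumptions.
PHONE = re.compile(r"\d{3}[-.\s]\d{3}[-.\s]\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN_SHAPED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_pii_candidates(text):
    """Return (label, match) pairs for a human or agent to judge."""
    hits = []
    for label, pattern in (("email", EMAIL), ("phone", PHONE), ("ssn-shaped", SSN_SHAPED)):
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


if __name__ == "__main__":
    note = "Customer asked us to call 555-123-4567 or write to jo@example.com."
    print(find_pii_candidates(note))
```

The script cannot tell a private note from a published "call us" block; it only makes sure nothing shaped like PII goes unread.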

Your options

Find and replace, in the platform

Where it exists at all, it swaps a literal string across records. WordPress leans on search-and-replace plugins for this; HubSpot, Webflow, and Notion give you no bulk find-and-replace for free text at all. The shared limit is that it is pattern-blind. It will turn every "USA" into "US", including the one inside "USA Today", and it cannot tell a leaked email in a private note from a real one in a published testimonial. Either it touches both or you narrow the pattern until it misses cases. Most of these paths have no batch preview and write straight to the live record.
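The trade-off is easy to reproduce. The sketch below is not any platform's implementation, just plain Python string handling with hypothetical function names: the literal swap damages "USA Today", and the narrowed pattern that protects it stops matching legitimate cases.

```python
import re


def naive_replace(text):
    """Literal swap, the way a platform find-and-replace behaves."""
    return text.replace("USA", "US")


def whole_value_replace(text):
    """Narrowed pattern: only rewrite when the entire value is 'USA'."""
    return re.sub(r"^USA$", "US", text)


if __name__ == "__main__":
    for value in ["USA", "As seen in USA Today", "Ships from USA"]:
        print(repr(value), "|", repr(naive_replace(value)), "|", repr(whole_value_replace(value)))
```

The literal swap turns "As seen in USA Today" into "As seen in US Today"; the narrowed version leaves it alone but also leaves "Ships from USA" untouched. Neither pattern is reading the sentence.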

A database-wide search and replace

The migration-tooling route: a serialization-safe replace across the whole database in one command. It is genuinely total and genuinely fast, and that is the danger. It is a sed over your production data with no per-record judgment, and its dry run reports how many rows would match, not what the change would read like in each one. One over-broad pattern rewrites every row, and the only undo is the database backup you remembered to take.

A spreadsheet export with formulas

Export the collection, normalize with SUBSTITUTE, PROPER, and a regex column, re-import. It is free and it is genuinely batch. Then rich text arrives as a wall of raw HTML and the reading stops, the re-import is all or nothing, and a blank cell in the wrong column overwrites the live value with blank, the trap every CSV guide warns about. A formula also cannot make the judgment call; it does the same thing to every row whether it should or not.

Point cleanup tools

Paste-your-HTML cleaners, the "clear formatting" button, a standalone PII scanner. Each does one slice well (strip the Word markup, or flag the emails), but none does the whole job, none is wired to your CMS or CRM, and the PII scanners ask you to upload the very data you were trying to keep contained. You end up shuttling content out to one tool and back, by hand, per slice.

An AI agent on your CMS or CRM API

Now something finally reads context. Wire an agent to the API, through an MCP server or a script, and it can tell "USA Today" from the country and the leaked note from the real testimonial. The cost is in the plumbing. The agent works one record per tool call against a rate-limited API, so cleaning a whole database is thousands of sequential reads narrated one call at a time, and every fix it decides on lands on the live record the moment it runs. No staging, no preview, and on most platforms no version history to fall back on.

Scratch keeps the agent's judgment and adds the two things the live system needs: the work happens on files, and every change is reviewed before it ships. Your agent reads the records as files on your laptop, so the repeatable passes run as scripts it writes once and your content is not round-tripped to a model row by row. The fields that should never move are locked at the connector. Every change comes back as a word-level diff, so an over-eager rule that flattened a real sentence or mis-mapped a country is on your screen, not on your site. The honest trade is that the review is yours, and a few thousand diffs is real reading. Scratch makes it fast. It does not make it zero.

Option	Across every record	Reads context, not just patterns	Review before live	Undo after publish
Find and replace	Where it exists	No	Rarely	No
DB-wide search and replace	Yes	No	Match counts, not the change	Your backup, if you took one
Spreadsheet export	Yes	No	Two columns of raw HTML	A manual export, if you saved one
Point cleanup tools	One slice at a time	Some, per tool	Per paste	No
Agent on the API	Yes, at API pace	Yes	Only if you build it	No
Scratch	Yes, as local files	Yes, your agent	Word-level diff, per field	Per item, even after publish

How the loop works on your records

Scratch pulls your CRM and CMS into files. Every record lands as its own file in a folder on your laptop, every field visible. The editable fields sit next to the ones that are locked at the connector: in HubSpot, contacts, companies, deals, and tickets edit while emails, workflows, and lists stay read-only; in Pipedrive, names, notes, and custom fields edit while owners, values, and timestamps do not; in Notion, the database columns edit while page bodies and computed properties do not; in Webflow, CMS collections and metadata edit while the ecommerce tables and Designer content stay out; in WordPress, post bodies, titles, and slugs edit while templates and plugin-owned SEO meta are held back. Nothing in the live system has changed, and nothing in the next step can change it.
Your AI runs the cleanup on files. Point Claude, Codex, Cursor, or Copilot at the folder and give it the brief. The mechanical offenders fall first: the agent greps every record in about a second and writes scripts for the repeatable passes, mapping every country and company-suffix variant onto one canonical form, fixing name casing, enforcing the title shape, stripping Word markup, repairing the mojibake table, cleaning tracking parameters. Those scripts run locally, so the bulk of the cleanup never sends your records to a model at all. The model's read is spent only on the judgment calls you asked for: which "USA" is United States and which is a brand, whether a bracketed line is an internal note. The agent holds no key to either system, so being wrong cannot reach your data.
You review the diff and publish. Every changed field sits next to its original, word by word, unchanged fields grayed out, so a record where one company name got normalized reads as one highlighted line. Validators run before you read, if you set them: a row whose country is still not canonical, or one where the cleanup dropped a required field, arrives already failed. Approve per item, and Scratch writes back only the fields that changed, so a cleanup pass cannot turn into a stage change or a price edit. Any published item reverts individually, and the original value comes back.
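To make the repeatable passes from the second step concrete, here is a sketch of the kind of script an agent might write for country, company-suffix, and name-casing normalization. The mapping table, function names, and the choice of "United States" as the canonical form are all assumptions for illustration; a real brief sets its own targets.

```python
import re

# Hypothetical canonical table; a real one would cover every variant found.
COUNTRY_CANON = {
    "us": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}
SUFFIX = re.compile(r",?\s+(?:Inc\.?|Incorporated)$", re.IGNORECASE)


def normalize_country(value):
    """Map known variants to the canonical name; leave anything else alone."""
    return COUNTRY_CANON.get(value.strip().lower(), value)


def normalize_suffix(company):
    """Collapse Inc, Inc., and Incorporated at the end of a name to 'Inc'."""
    return SUFFIX.sub(" Inc", company.strip())


def normalize_name(name):
    """Proper-case names that are entirely upper or lower case."""
    return name.title() if name.isupper() or name.islower() else name


if __name__ == "__main__":
    print(normalize_country("USA"), "|", normalize_country("USA Today"))
    print(normalize_suffix("Acme, Inc."), "|", normalize_suffix("Acme Incorporated"))
    print(normalize_name("JANE DOE"), "|", normalize_name("Jane McDonald"))
```

Values outside the table, such as "USA Today", pass through unchanged, so they are left for the judgment pass instead of being guessed at. The casing rule is deliberately crude: str.title() would turn "MCDONALD" into "Mcdonald", which is exactly the sort of change the diff review exists to catch.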

Read-only connectors invert the last step. Stripe and QuickBooks pull but do not publish, so the loop ends in a findings report you clear by hand. That is covered in how to audit your Stripe products and customers with AI.

The cleanup, in one brief

The whole job can ride on a single plain-English brief. A representative one for a CRM about to drive a quarterly report:

Across every contact and company, normalize the country field to its full English name, standardize every phone number to +1 (xxx) xxx-xxxx, fix any name that is in all caps or all lowercase to proper case, and collapse company suffixes so Inc, Inc., and Incorporated all read as Inc. Flag, do not merge, any records that now look like duplicates so I can decide per pair. Leave deal values, owners, and timestamps untouched.

The agent turns that into a handful of scripts and one judgment pass. The same loop runs a CMS cleanup unchanged: swap in "put every blog title in sentence case, strip Word markup, fix mojibake, and remove utm and fbclid parameters from every body link" and nothing else about the process moves. Calibrate on a slice first: run it on 50 records, read those diffs end to end, and tighten the brief where it over-reached or missed. When the first 50 read right, send it through the rest. The validators you write during calibration (every country is canonical, every title matches the shape, no Mso class survives) become the gate that fails the stragglers before they reach your eyes.
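One of those scripts, plus a matching validator, might look like the sketch below. It assumes US numbers only and hypothetical field names (country, phone); how Scratch itself defines or runs validators is not shown here, so treat validate_record as a stand-in for the idea.

```python
import re

PHONE_SHAPE = re.compile(r"^\+1 \(\d{3}\) \d{3}-\d{4}$")


def format_us_phone(raw):
    """Return '+1 (xxx) xxx-xxxx', or None when the digits do not fit."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # leave for review instead of guessing
    return f"+1 ({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def validate_record(record, canonical_countries):
    """Return a list of failure messages; an empty list means the record passes."""
    failures = []
    if record.get("country") not in canonical_countries:
        failures.append("country not canonical")
    phone = record.get("phone")
    if phone and not PHONE_SHAPE.match(phone):
        failures.append("phone not in +1 (xxx) xxx-xxxx shape")
    return failures


if __name__ == "__main__":
    print(format_us_phone("555.123.4567"))
    print(validate_record({"country": "USA", "phone": "555-123-4567"}, {"United States"}))
```

Returning None for numbers that do not have ten digits keeps the script from inventing a format for an international or truncated number; those records fail the validator and arrive flagged.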

What people use it for

Normalize a CRM before a quarterly report, a QBR, or a system handoff: country and state spellings, phone formats, name casing, company-suffix drift, and the near-duplicate records that hid behind all of it.
Enforce one title convention across a blog and one naming convention across a catalog, instead of the five styles five people left behind.
Push a product or brand rename through every record so the old name does not survive in the half of the catalog nobody got to.
Strip a decade of pasted Word and Google Docs markup out of every post so the design system stops fighting inline styles.
Sweep tracking parameters, staging URLs, and dead links out of body copy across the whole site in one reviewed pass.
Redact leaked PII out of notes and bodies before the content meets a support bot, a RAG index, or an agency export.
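For the tracking-parameter sweep in the list above, a minimal sketch using Python's standard urllib.parse. The set of parameters to drop (utm_*, fbclid, gclid, ref) follows the examples in this guide; the function name strip_tracking is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_EXACT = {"fbclid", "gclid", "ref"}


def strip_tracking(url):
    """Drop utm_* and known tracking parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_EXACT
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_tracking("https://example.com/post?utm_source=x&id=7&fbclid=abc"))
    print(strip_tracking("https://example.com/a?gclid=1#top"))
```

Note that urlencode re-encodes the surviving parameters, so a value with unusual escaping may come back spelled differently; that shows up in the diff like any other change. Whether a given ?ref= is tracking or a real referral code is a call for the brief, not the script.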

Questions people ask

Can the AI pick the right canonical form, or only swap a literal string?

It picks. That is the line between this and a search and replace. You name the target (every country as its full English name, every phone as +1 (xxx) xxx-xxxx) and the agent maps the variants onto it, turning "US", "U.S.", and "United States" into one value while leaving "USA Today" alone because it read the field, not just the string. The calls it is unsure about surface in the diff for you to settle.

Can I standardize CRM contact and company fields in bulk?

Yes, that is the core job. Country and state spellings, phone and date formats, name casing, job titles, company-suffix drift: all of it normalizes to the canonical form you set, across every contact and company at once. Owners, monetary values, and timestamps stay locked at the connector, so a hygiene pass cannot move a deal or rewrite an amount. The platform-specific version for HubSpot lives at bulk update HubSpot contacts with AI.

Will a database-wide change overwrite fields I did not mean to touch?

Not here. Scratch writes back only the fields that changed, so a field the agent never edited is not part of the write and cannot be replaced by a blank or a stale value. That is the failure a DB-wide search and replace and a CSV re-import are both famous for, and it is structurally absent from this loop.
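The principle can be sketched in a few lines. This is an illustration of writing back only changed fields, not Scratch's actual code, and the record shape and the name changed_fields are assumptions.

```python
def changed_fields(original, edited):
    """Return only the fields whose value differs from the original."""
    return {key: value for key, value in edited.items() if original.get(key) != value}


if __name__ == "__main__":
    original = {"name": "ACME INC.", "country": "USA", "amount": "1200"}
    edited = {"name": "Acme Inc", "country": "United States", "amount": "1200"}
    print(changed_fields(original, edited))
```

Because the untouched amount never appears in the payload, there is nothing in the write that could blank it or replace it with a stale value.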

Can I strip pasted Word markup without flattening the real formatting?

Yes, and the diff is the safeguard. The agent removes the Mso classes, o:p tags, inline mso styles, and empty spacers while leaving your headings, lists, and links intact, and every changed body comes back as a diff so a flattened list or a dropped heading is caught before it ships. Validators can require that specific structural tags survive the strip.
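A rough sketch of such a strip and its structural check follows. It uses regular expressions for brevity, which is an assumption of convenience: a production pass would more likely use a real HTML parser, and the function names and the default tag list here are hypothetical.

```python
import re


def _clean_style(match):
    """Drop mso-* declarations from a style attribute, keep the rest."""
    declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
    kept = [d for d in declarations if not d.lower().startswith("mso-")]
    if not kept:
        return ""
    return ' style="' + "; ".join(kept) + '"'


def strip_word_markup(html):
    html = re.sub(r"</?o:p\b[^>]*>", "", html)                # <o:p> and </o:p>
    html = re.sub(r'\sclass="Mso[^"]*"', "", html)            # class="MsoNormal" and friends
    html = re.sub(r'\sstyle="([^"]*)"', _clean_style, html)   # inline mso styles
    html = re.sub(r"<p>(?:\s|&nbsp;|\u00a0)*</p>", "", html)  # empty spacer paragraphs
    return html


def structural_tags_survive(before, after, tags=("h1", "h2", "h3", "ul", "ol", "li", "a")):
    """Validator: each structural tag appears as often after the strip as before."""
    def count(html, tag):
        return len(re.findall(rf"<{tag}\b", html, re.IGNORECASE))
    return all(count(before, tag) == count(after, tag) for tag in tags)


if __name__ == "__main__":
    dirty = ('<h2 class="MsoNormal">Title</h2>'
             '<p class="MsoNormal" style="mso-margin-top-alt:auto; color:red">Hi<o:p></o:p></p>'
             '<p>&nbsp;</p><ul><li>One</li></ul>')
    clean = strip_word_markup(dirty)
    print(clean)
    print(structural_tags_survive(dirty, clean))
```

The validator counts opening tags only, so it catches a dropped heading or a flattened list but not a heading whose text changed; the word-level diff covers that.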

Can the AI over-clean, and how would I catch it?

It can, the same way any cleanup can take too much: a casing rule that mangles an acronym, a strip that eats a real sentence. The catch is the review queue: an over-eager change is a highlighted diff you reject in a glance, and a validator can flag any record where the cleanup removed more than a set amount of text. Nothing it over-changes reaches the live system, and if one slips through a batch, that item reverts individually.
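A validator of that kind could be as small as the sketch below. The 10 percent threshold, the crude tag stripping, and the function names are assumptions chosen for illustration, not a description of how Scratch implements validators.

```python
import re


def visible_text(html):
    """Crude text extraction: drop tags, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def removed_too_much(before, after, max_loss=0.10):
    """True when the cleanup removed more than max_loss of the visible text."""
    before_len = len(visible_text(before))
    after_len = len(visible_text(after))
    if before_len == 0:
        return False
    return (before_len - after_len) / before_len > max_loss


if __name__ == "__main__":
    before = '<p class="MsoNormal">Keep this sentence. And this one too.</p>'
    print(removed_too_much(before, "<p>Keep this sentence. And this one too.</p>"))
    print(removed_too_much(before, "<p>Keep this sentence.</p>"))
```

Measuring visible text rather than raw length means a strip that removes only markup passes, while one that loses a sentence along with the markup fails before anyone reads the diff.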

Can I see exactly what changed before it goes live?

Yes. That is the design. Every changed field shows as a word-level diff against the original, with unchanged fields grayed out, and nothing reaches the live system until you approve it per item. A normalization that mapped the wrong value, or a strip that ate a sentence, is highlighted on your screen, not discovered later on your site.
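For a sense of what a word-level diff is, here is a small sketch built on Python's standard difflib. The bracket notation and the name word_diff are illustrative choices, and this is not Scratch's renderer.

```python
import difflib


def word_diff(before, after):
    """Mark removed words as [-...-] and added words as {+...+}."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i2 > i1:
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("Acme Incorporated ships from USA",
                    "Acme Inc ships from United States"))
```

Splitting on whitespace means the diff ignores spacing changes, which suits prose fields; a field where whitespace matters would need a different tokenizer.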

Does my content get uploaded to a model or a SaaS to be cleaned?

Only the part you choose, and less than you would think. Scratch has no built-in AI and sends your records to no model itself; the agent you bring reads the files under your own account and plan. Because the records are files, the repeatable passes run as scripts the agent writes and runs locally, so the normalization and markup-stripping never round-trips your content to a model row by row. The model's read is spent on the judgment calls you asked for. Which model that is stays your call.

Does this work with my CRM and CMS, or only some of them?

Many of them. HubSpot, Pipedrive, Webflow, WordPress, Notion, and more publish edits back; Stripe and QuickBooks are read-only and end in a report. See /for/ for exactly what each connector reads and writes before you plan a cleanup, since the locked fields differ by platform.

Can I undo a cleanup after it publishes?

Yes. The original stays next to every published change, per record, and reverting it puts the old value back. You do not need the platform's version history, which most CMSs and CRMs do not keep for free-text fields anyway, and you are not restoring a whole-database backup that drags every other record back with it.

Do I need to write regex?

No. You describe the cleanup in plain English and the agent writes whatever regex and scripts the job needs, then hands you diffs, not code. The patterns in this guide are there so you can recognize what you have, not so you have to type them.

See it on your own content

The fastest way to trust a cleanup is to watch it run on your own records and read every change it made. See it run on your content, or download Scratch free, connect one system, and pull a single collection. Your first diff is about twenty minutes away. Scratch is free to try, and the AI is whichever agent you already pay for.

How to sanitize CRM and CMS content with AI