How to sanitize CRM and CMS content with AI

Standardize a CRM, enforce one title and product-naming convention, and strip pasted Word markup. AI cleans every record as local files, you review each change, and only approved content ships.

Your CRM has the same country written three ways. "US" on one contact, "USA" on the next, "United States" on the one after that. Names came off a trade-show scanner IN ALL CAPS, the same account carries Inc, Inc., and Incorporated across three company records, and the phone numbers follow no format anyone agreed on. Your CMS is no tidier: blog titles in five capitalizations because five people wrote them, product names that drifted off the convention three rebrands ago, and bodies still carrying class="MsoNormal" from a decade of pasting out of Word. None of it broke anything. Content accretes mess the way a junk drawer does, one reasonable addition at a time.

Then something forces the issue. A report that has to group by country. A migration that inherits every inconsistency. A help center about to feed an AI assistant. A rename that has to reach every record, not the half someone got to. Find and replace is too blunt for this: it swaps a string but cannot decide that "USA" and "United States" should both become "US" while "USA Today" stays put. A database-wide search and replace can do it in one command, which is exactly why one bad pattern rewrites every row with no way back. The job has a reliable shape: pull every record out as files, let AI clean them against your rules, review every change next to the original, and publish only what you approve. The options below each deliver some slice of that. They differ on whether anything makes a judgment call, and whether anyone sees the change before your customers do.

What counts as content that needs sanitizing

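Sanitizing is two jobs wearing one word: making content consistent, and getting the junk out of it. On the consistency side that means country and state spellings, name casing, company suffixes, phone and date formats, title capitalization, and product names that drifted off the convention. On the junk side it means pasted Word markup, mojibake, tracking parameters on links, and internal notes or stray emails that leaked into published fields. Most cleanups need several of these at once, and naming which ones you have is half the brief.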

Most of this is pattern work a script does in seconds. The expensive part is the judgment call hiding inside each one: which "USA" is really United States and which belongs to "USA Today", or whether a bracketed line is an internal note or real copy. That is exactly the part a regex cannot make and a person should not have to make 4,000 times by hand.

Your options

Find and replace, in the platform

Where it exists at all, it swaps a literal string across records. WordPress leans on search-and-replace plugins for this; HubSpot, Webflow, and Notion give you no bulk find-and-replace for free text at all. The shared limit is that it is pattern-blind. It will turn every "USA" into "US", including the one inside "USA Today", and it cannot tell a leaked email in a private note from a real one in a published testimonial. Either it touches both or you narrow the pattern until it misses cases. Most of these paths have no batch preview and write straight to the live record.
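To make the "USA Today" problem concrete, here is a minimal Python sketch. The function names and the variant table are illustrative assumptions, not part of any platform. It contrasts a literal string swap with a lookup that only fires when the whole field is a known variant.

```python
# Hypothetical variant table; a real one would be built from your own data.
COUNTRY_VARIANTS = {
    "us": "US",
    "usa": "US",
    "u.s.": "US",
    "united states": "US",
}


def naive_replace(text: str) -> str:
    """Literal find and replace: swaps the substring wherever it appears."""
    return text.replace("USA", "US")


def normalize_country_field(value: str) -> str:
    """Map the value only when the entire field is a known country variant."""
    key = value.strip().lower()
    return COUNTRY_VARIANTS.get(key, value)


print(naive_replace("Quoted in USA Today"))       # Quoted in US Today
print(normalize_country_field("USA"))             # US
print(normalize_country_field("USA Today"))       # USA Today (left alone)
```

Whole-field matching covers the easy case of a dedicated country field. It does not help inside free text, where "USA" can sit in the middle of a sentence, and that is where the judgment call comes back.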

A database-wide search and replace

The migration-tooling route: a serialization-safe replace across the whole database in one command. It is genuinely total and genuinely fast, and that is the danger. It is a sed over your production data with no per-record judgment, and its dry run reports how many rows would match, not what the change would read like in each one. One over-broad pattern rewrites every row, and the only undo is the database backup you remembered to take.
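The gap between a match count and a readable preview is easy to see in a few lines of Python. This is a sketch over an in-memory list of strings, not a model of any specific migration tool; `dry_run_count` and `dry_run_preview` are hypothetical names.

```python
import re


def dry_run_count(rows: list[str], pattern: str) -> int:
    """What a count-only dry run tells you: how many rows would match."""
    return sum(1 for row in rows if re.search(pattern, row))


def dry_run_preview(rows: list[str], pattern: str, replacement: str) -> list[tuple[int, str, str]]:
    """What you would want to read: each changed row, before and after."""
    preview = []
    for index, row in enumerate(rows):
        changed = re.sub(pattern, replacement, row)
        if changed != row:
            preview.append((index, row, changed))
    return preview


rows = ["Ships to USA", "As seen in USA Today", "Canada only"]
print(dry_run_count(rows, r"\bUSA\b"))  # 2
for index, before, after in dry_run_preview(rows, r"\bUSA\b", "US"):
    print(index, repr(before), "->", repr(after))
```

The count says two rows match. Only the preview shows that one of those two is a newspaper.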

A spreadsheet export with formulas

Export the collection, normalize with SUBSTITUTE, PROPER, and a regex column, re-import. It is free and it is genuinely batch. Then rich text arrives as a wall of raw HTML and the reading stops, the re-import is all or nothing, and a blank cell in the wrong column overwrites the live value with blank, the trap every CSV guide warns about. A formula also cannot make the judgment call; it does the same thing to every row whether it should or not.

Point cleanup tools

Paste-your-HTML cleaners, the "clear formatting" button, a standalone PII scanner. Each does one slice well (strip the Word markup, or flag the emails), but none does the whole job, none is wired to your CMS or CRM, and the PII scanners ask you to upload the very data you were trying to keep contained. You end up shuttling content out to one tool and back, by hand, per slice.

An AI agent on your CMS or CRM API

Now something finally reads context. Wire an agent to the API, through an MCP server or a script, and it can tell "USA Today" from the country and the leaked note from the real testimonial. The cost is in the plumbing. The agent works one record per tool call against a rate-limited API, so cleaning a whole database is thousands of sequential reads narrated one call at a time, and every fix it decides on lands on the live record the moment it runs. No staging, no preview, and on most platforms no version history to fall back on.

Scratch

Scratch keeps the agent's judgment and adds the two things the live system needs: the work happens on files, and every change is reviewed before it ships. Your agent reads the records as files on your laptop, so the repeatable passes run as scripts it writes once and your content is not round-tripped to a model row by row. The fields that should never move are locked at the connector. Every change comes back as a word-level diff, so an over-eager rule that flattened a real sentence or mis-mapped a country is on your screen, not on your site. The honest trade is that the review is yours, and a few thousand diffs is real reading. Scratch makes it fast. It does not make it zero.

| Option | Across every record | Reads context, not just patterns | Review before live | Undo after publish |
| Find and replace | Where it exists | No | Rarely | No |
| DB-wide search and replace | Yes | No | Match counts, not the change | Your backup, if you took one |
| Spreadsheet export | Yes | No | Two columns of raw HTML | A manual export, if you saved one |
| Point cleanup tools | One slice at a time | Some, per tool | Per paste | No |
| Agent on the API | Yes, at API pace | Yes | Only if you build it | No |
| Scratch | Yes, as local files | Yes, your agent | Word-level diff, per field | Per item, even after publish |

How the loop works on your records

  1. Scratch pulls your CRM and CMS into files. Every record lands as its own file in a folder on your laptop, every field visible. The editable fields sit next to the ones that are locked at the connector: in HubSpot, contacts, companies, deals, and tickets edit while emails, workflows, and lists stay read-only; in Pipedrive, names, notes, and custom fields edit while owners, values, and timestamps do not; in Notion, the database columns edit while page bodies and computed properties do not; in Webflow, CMS collections and metadata edit while the ecommerce tables and Designer content stay out; in WordPress, post bodies, titles, and slugs edit while templates and plugin-owned SEO meta are held back. Nothing in the live system has changed, and nothing in the next step can change it.
  2. Your AI runs the cleanup on files. Point Claude, Codex, Cursor, or Copilot at the folder and give it the brief. The mechanical offenders fall first: the agent greps every record in about a second and writes scripts for the repeatable passes, mapping every country and company-suffix variant onto one canonical form, fixing name casing, enforcing the title shape, stripping Word markup, repairing the mojibake table, cleaning tracking parameters. Those scripts run locally, so the bulk of the cleanup never sends your records to a model at all. The model's read is spent only on the judgment calls you asked for: which "USA" is United States and which is a brand, or whether a bracketed line is an internal note. The agent holds no key to either system, so being wrong cannot reach your data.
  3. You review the diff and publish. Every changed field sits next to its original, word by word, unchanged fields grayed out, so a record where one company name got normalized reads as one highlighted line. Validators run before you read, if you set them: a row whose country is still not canonical, or one where the cleanup dropped a required field, arrives already failed. Approve per item, and Scratch writes back only the fields that changed, so a cleanup pass cannot turn into a stage change or a price edit. Any published item reverts individually, and the original value comes back.
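For a feel of what a word-level diff on one field involves, here is a minimal sketch using Python's standard `difflib`. It illustrates the idea only and is not how Scratch renders its diffs; `word_diff` is a hypothetical helper.

```python
import difflib


def word_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Return (marker, text) pairs: '=' unchanged, '-' removed, '+' added."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("=", " ".join(old_words[i1:i2])))
            continue
        if i1 < i2:  # words present only in the original
            result.append(("-", " ".join(old_words[i1:i2])))
        if j1 < j2:  # words present only in the cleaned version
            result.append(("+", " ".join(new_words[j1:j2])))
    return result


for marker, text in word_diff("Acme Incorporated", "Acme Inc"):
    print(marker, text)
```

A record where one company name got normalized comes out as one removed word and one added word, with everything else marked unchanged.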

Read-only connectors invert the last step. Stripe and QuickBooks pull but do not publish, so the loop ends in a findings report you clear by hand. That is covered in how to audit your Stripe products and customers with AI.

The cleanup, in one brief

The whole job can ride on a single plain-English brief. A representative one for a CRM about to drive a quarterly report:

Across every contact and company, normalize the country field to its full English name, standardize every phone number to +1 (xxx) xxx-xxxx, fix any name that is in all caps or all lowercase to proper case, and collapse company suffixes so Inc, Inc., and Incorporated all read as Inc. Flag, do not merge, any records that now look like duplicates so I can decide per pair. Leave deal values, owners, and timestamps untouched.
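The mechanical parts of a brief like this reduce to a few small functions. The sketch below is one hypothetical shape for three of them (phone, name casing, company suffix), assuming US numbers only; the function names are illustrative and an agent might write them quite differently.

```python
import re
from typing import Optional

# Matches a trailing Inc / Inc. / Incorporated, with its leading comma or space.
SUFFIX = re.compile(r"[,\s]+(?:inc\.?|incorporated)$", re.IGNORECASE)


def normalize_us_phone(raw: str) -> Optional[str]:
    """Format a US number as +1 (xxx) xxx-xxxx, or return None to flag it."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None  # not a 10-digit US number: leave it for review
    return f"+1 ({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def fix_name_case(name: str) -> str:
    """Only touch names that are entirely upper or lower case."""
    if name.isupper() or name.islower():
        # title() turns "MCDONALD" into "Mcdonald": a judgment call, not a script's.
        return name.title()
    return name


def collapse_suffix(company: str) -> str:
    """Map Inc, Inc., and Incorporated onto a single 'Inc'."""
    return SUFFIX.sub(" Inc", company.strip())


print(normalize_us_phone("555.867.5309"))     # +1 (555) 867-5309
print(fix_name_case("JANE DOE"))              # Jane Doe
print(collapse_suffix("Acme, Incorporated"))  # Acme Inc
```

Returning `None` for a phone number that does not fit the pattern, rather than guessing, keeps the odd cases visible instead of silently reformatting them.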

The agent turns that into a handful of scripts and one judgment pass. The same loop runs a CMS cleanup unchanged: swap in "put every blog title in sentence case, strip Word markup, fix mojibake, and remove utm and fbclid parameters from every body link", and nothing else about the process moves. Calibrate on a slice first: run it on 50 records, read those diffs end to end, and tighten the brief where it over-reached or missed. When the first 50 read right, send it through the rest. The validators you write during calibration (every country is canonical, every title matches the shape, no Mso class survives) become the gate that fails the stragglers before they reach your eyes.
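The tracking-parameter pass from the CMS brief is a good example of a script that needs no model at all. Here is a minimal sketch with Python's standard `urllib.parse`; the function names are hypothetical, and the `href` rewrite assumes double-quoted attributes with plain `&` separators (entity-encoded `&amp;` is not handled).

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url: str) -> str:
    """Drop utm_* and fbclid query parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key.lower().startswith("utm_") or key.lower() == "fbclid")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def strip_tracking_in_body(html: str) -> str:
    """Apply strip_tracking to every double-quoted href in an HTML body."""
    return re.sub(
        r'href="([^"]+)"',
        lambda match: 'href="' + strip_tracking(match.group(1)) + '"',
        html,
    )


print(strip_tracking("https://example.com/post?id=7&utm_source=news&fbclid=abc#top"))
# https://example.com/post?id=7#top
```

Parsing the query string instead of pattern-matching on it keeps real parameters such as `id=7` intact, which is the "and nothing else" part of the brief.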

Questions people ask

Can the AI pick the right canonical form, or only swap a literal string?

It picks. That is the line between this and a search and replace. You name the target, such as every country as its full English name or every phone as +1 (xxx) xxx-xxxx, and the agent maps the variants onto it, turning "US", "U.S.", and "United States" into one value while leaving "USA Today" alone because it read the field, not just the string. The calls it is unsure about surface in the diff for you to settle.

Can I standardize CRM contact and company fields in bulk?

Yes, that is the core job. Country and state spellings, phone and date formats, name casing, job titles, and company-suffix drift all normalize to the canonical form you set, across every contact and company at once. Owners, monetary values, and timestamps stay locked at the connector, so a hygiene pass cannot move a deal or rewrite an amount. The platform-specific version for HubSpot lives at bulk update HubSpot contacts with AI.

Will a database-wide change overwrite fields I did not mean to touch?

Not here. Scratch writes back only the fields that changed, so a field the agent never edited is not part of the write and cannot be replaced by a blank or a stale value. That is the failure a DB-wide search and replace and a CSV re-import are both famous for, and it is structurally absent from this loop.
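The principle of writing back only what changed can be sketched in a few lines of Python. This illustrates the idea and is not Scratch's implementation; `changed_fields` and the sample record are hypothetical.

```python
def changed_fields(original: dict, cleaned: dict) -> dict:
    """Return only the fields whose value differs from the original.

    Fields missing from `cleaned` are never part of the write, so they
    cannot be blanked by omission the way a CSV re-import can blank them.
    """
    return {
        field: value
        for field, value in cleaned.items()
        if field in original and original[field] != value
    }


original = {"name": "ACME INC.", "country": "USA", "deal_value": 12000}
cleaned = {"name": "Acme Inc", "country": "United States", "deal_value": 12000}

print(changed_fields(original, cleaned))
# {'name': 'Acme Inc', 'country': 'United States'}
```

Because `deal_value` is identical on both sides, it never appears in the payload, so the write has nothing to say about it.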

Can I strip pasted Word markup without flattening the real formatting?

Yes, and the diff is the safeguard. The agent removes the Mso classes, o:p tags, inline mso styles, and empty spacers while leaving your headings, lists, and links intact, and every changed body comes back as a diff so a flattened list or a dropped heading is caught before it ships. Validators can require that specific structural tags survive the strip.
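As a rough picture of what such a strip does, here is a regex-based Python sketch. It assumes double-quoted attributes and reasonably well-formed markup, and `strip_word_markup` is a hypothetical name; a production pass would more likely use a real HTML parser.

```python
import re


def _clean_style(match: re.Match) -> str:
    """Drop mso-* declarations; remove the attribute if nothing is left."""
    declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
    kept = [d for d in declarations if not d.lower().startswith("mso-")]
    if not kept:
        return ""
    return ' style="' + "; ".join(kept) + '"'


def strip_word_markup(html: str) -> str:
    html = re.sub(r"<o:p>(?:\s|&nbsp;)*</o:p>", "", html)  # empty o:p spacers
    html = re.sub(r"</?o:p[^>]*>", "", html)                # stray o:p tags
    html = re.sub(r'\s+class="Mso[^"]*"', "", html)         # MsoNormal and friends
    html = re.sub(r'\s+style="([^"]*)"', _clean_style, html)
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>", "", html)       # now-empty paragraphs
    return html


dirty = (
    '<h2>Pricing</h2>'
    '<p class="MsoNormal" style="mso-margin-top-alt:auto; color:red">Hello<o:p></o:p></p>'
    '<p class="MsoNormal"><o:p>&nbsp;</o:p></p>'
)
print(strip_word_markup(dirty))
# <h2>Pricing</h2><p style="color:red">Hello</p>
```

The heading and the non-Word style declaration survive, which is the kind of property a validator can assert after the strip.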

Can the AI over-clean, and how would I catch it?

It can, the same way any cleanup can take too much: a casing rule that mangles an acronym, or a strip that eats a real sentence. The catch is the review queue: an over-eager change is a highlighted diff you reject in a glance, and a validator can flag any record where the cleanup removed more than a set amount of text. Nothing it over-changes reaches the live system, and if one slips through a batch, that item reverts on its own.
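A validator of that kind can be very small. The Python sketch below flags a record when the cleaned text lost more than a chosen share of its words; the name `removed_too_much` and the 20% default are illustrative assumptions, not a documented Scratch setting.

```python
def removed_too_much(original: str, cleaned: str, max_loss: float = 0.2) -> bool:
    """True when the cleanup dropped more than `max_loss` of the original words."""
    original_words = len(original.split())
    if original_words == 0:
        return False  # nothing to lose
    cleaned_words = len(cleaned.split())
    loss = (original_words - cleaned_words) / original_words
    return loss > max_loss


before = "one two three four five six seven eight nine ten"
print(removed_too_much(before, "one two three"))                                # True
print(removed_too_much(before, "one two three four five six seven eight nine")) # False
```

Counting words instead of characters keeps a markup strip from tripping the check by itself, as long as the comparison runs on the visible text and not the raw HTML.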

Can I see exactly what changed before it goes live?

Yes. That is the design. Every changed field shows as a word-level diff against the original, with unchanged fields grayed out, and nothing reaches the live system until you approve it per item. A normalization that mapped the wrong value, or a strip that ate a sentence, is highlighted on your screen, not discovered later on your site.

Does my content get uploaded to a model or a SaaS to be cleaned?

Only the part you choose, and less than you would think. Scratch has no built-in AI and sends your records to no model itself; the agent you bring reads the files under your own account and plan. Because the records are files, the repeatable passes run as scripts the agent writes and runs locally, so the normalization and markup-stripping never round-trips your content to a model row by row. The model's read is spent on the judgment calls you asked for. Which model that is stays your call.

Does this work with my CRM and CMS, or only some of them?

Many of them. HubSpot, Pipedrive, Webflow, WordPress, Notion, and more publish edits back; Stripe and QuickBooks are read-only and end in a report. See /for/ for exactly what each connector reads and writes before you plan a cleanup, since the locked fields differ by platform.

Can I undo a cleanup after it publishes?

Yes. The original stays next to every published change, per record, and rejecting it puts the old value back. You do not need the platform's version history, which most CMSs and CRMs do not keep for free-text fields anyway, and you are not restoring a whole-database backup that drags every other record back with it.

Do I need to write regex?

No. You describe the cleanup in plain English and the agent writes whatever regex and scripts the job needs, then hands you diffs, not code. The patterns in this guide are there so you can recognize what you have, not so you have to type them.

See it on your own content

The fastest way to trust a cleanup is to watch it run on your own records and read every change it made. See it run on your content, or download Scratch free, connect one system, and pull a single collection. Your first diff is about twenty minutes away. Scratch is free to try, and the AI is whichever agent you already pay for.

See it run on your own content.

Curtis runs these calls himself. Thirty minutes, no pitch, no slides. He connects your platforms live and shows you your content as an editable, reviewable diff. Bring anything sticky: a refresh, a migration, or a rebrand.
