MCP Rate Limits: A Practical Guide for Agent Builders

Piet Michiel
June 16, 2026

MCP rate limits are the single most common production failure mode for AI agents in 2026. Every Model Context Protocol (MCP) server you build will hit a throttle by the third or fourth Claude turn, and the question is not whether you will hit 429s – it is whether your architecture recovers gracefully or crashes mid-conversation in front of a customer.

This guide is the developer reference we wish existed when we shipped our first MCP server. It pulls every per-API rate-limit number scattered across our connector posts into one definitive table, layers the three throttling tiers most teams instrument incorrectly, shows the architectural patterns that quietly beat the problem, and walks through the live code that turns a 1,000 requests-per-hour SaaS API into unlimited Claude queries.

If you build agents for production, bookmark this. If you operate them, treat it as the runbook. The numbers move every quarter as SaaS vendors react to AI agent traffic, so we will keep this page current.

What MCP server throttling actually is

Rate limits at a glance for MCP developers

What they are: Ceilings on how often an MCP agent can call a model, a server, or an underlying SaaS API in a given time window.

Where they fire: Three independent layers – Claude LLM tokens, MCP server transport, underlying SaaS API. Most teams only instrument one.

Why they bite: A single user prompt can fire 5-15 tool calls. Multiply by 50 concurrent users and you exhaust any reasonable per-API budget in seconds.

The fix: Warehouse-first caching, per-tenant token buckets, header-aware retry, and on-demand override paths for the rare live-fetch question.

Most MCP outages we have triaged this year share the same root cause. The team built one MCP server, wired one OAuth credential, and asked Claude to “go answer customer questions.” Claude then fired off thousands of tool calls across 50 concurrent conversations, and the underlying SaaS API throttled the first session inside 20 seconds. The other 49 sessions inherited the 429 cascade, and the customer demo died on stage.
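
To make the arithmetic behind that cascade concrete, here is a minimal back-of-the-envelope sketch. The function names are ours and purely illustrative; the inputs (50 concurrent users, 10 tool calls per prompt as a mid-range value of the 5-15 quoted above, a 1,000 requests/hour budget, a 3 requests/sec ceiling) are taken from figures in this guide.

```python
def calls_per_round(concurrent_users: int, tool_calls_per_prompt: int) -> int:
    """Tool calls generated when every concurrent user sends one prompt."""
    return concurrent_users * tool_calls_per_prompt


def rounds_until_exhausted(budget: int, concurrent_users: int,
                           tool_calls_per_prompt: int) -> int:
    """How many full rounds of prompts fit inside a fixed request budget."""
    return budget // calls_per_round(concurrent_users, tool_calls_per_prompt)


# 50 users x 10 tool calls = 500 upstream calls for one round of prompts.
demand = calls_per_round(50, 10)

# A 1,000 requests/hour budget covers only two such rounds per hour.
rounds = rounds_until_exhausted(1000, 50, 10)

# At a sustained 3 requests/sec, one round needs close to three minutes to drain.
drain_seconds = demand / 3

print(demand, rounds, round(drain_seconds))
```

The point of the sketch is the ratio, not the exact numbers: once demand per round is a large fraction of the hourly budget, no retry policy creates headroom.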

Why MCP rate limits matter more in 2026

Six forces making throttling worse this year

Agentic loops: A single Claude conversation now averages 8-15 tool calls, up from 2-3 in 2024 when MCP first shipped
SaaS vendors react: Notion, Salesforce, HubSpot have tightened limits in 2025-2026 specifically to slow down agent traffic
Multi-tenant by default: One shared OAuth credential across 50 customers means one noisy customer starves the rest
Header literacy gap: Most teams parse the 429 status but ignore the rate-limit-remaining headers on the successful 200 responses
Compliance audit pressure: EU AI Act Article 26 demands reproducible logs, which falls apart when retries cascade
FinOps signals: CFOs trace token cost explosions back to retry loops chewing through Claude tokens on doomed calls

The pattern is clear. Rate limits used to be a latency annoyance. In 2026 they are a security, FinOps, and reliability problem at the same time, so the architecture you ship today has to assume the limits will get tighter, not looser. We unpack the CTO-grade architecture in our MCP for the EU CTO and CIO reference post.

The 3 layers of MCP rate limits

When an MCP-powered agent runs a single user prompt, three independent rate-limit ceilings can fire. Most teams instrument only one, and the other two cause the production outages. Each layer therefore needs its own dashboard, its own alert, and its own retry strategy.

Layer 1: LLM token rate limits (Anthropic / Claude)

Buckets: Tokens-Per-Minute (TPM), Input-Tokens-Per-Minute (ITPM), Output-Tokens-Per-Minute (OTPM), Requests-Per-Minute (RPM). Tier 4 organisations get the highest published ceilings.

Signal: Read the anthropic-ratelimit-tokens-remaining and anthropic-ratelimit-requests-remaining response headers on every call.

Quirk: Sonnet and Opus carry separate buckets. A tool that auto-escalates from Sonnet to Opus can hit the Opus ceiling much sooner than expected.
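
As a minimal sketch of reading those two headers, the helper below takes a plain mapping of response headers and returns the remaining budget. The function name `remaining_budget` is our own; how you obtain the headers depends on the HTTP client or SDK you use, which is why the sketch accepts any mapping rather than a specific response type.

```python
from typing import Mapping, Optional


def remaining_budget(headers: Mapping[str, str]) -> dict[str, Optional[int]]:
    """Extract the remaining request and token budget from response headers.

    Header names are matched case-insensitively; a missing or malformed
    header yields None instead of raising.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name)
        return int(value) if value is not None and value.isdigit() else None

    return {
        "requests": as_int("anthropic-ratelimit-requests-remaining"),
        "tokens": as_int("anthropic-ratelimit-tokens-remaining"),
    }


example = remaining_budget({
    "Anthropic-RateLimit-Requests-Remaining": "42",
    "anthropic-ratelimit-tokens-remaining": "9000",
})
print(example)
```

Because Sonnet and Opus carry separate buckets, it is worth tagging each reading with the model that produced it before sending it to your metrics backend.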

Layer 2: MCP server transport limits

Constraints: Compute, memory, concurrency caps per session. Cloud-hosted MCP gateways cap concurrent tool calls and timeout long-running tools at 30-60 seconds.

Risk: A single noisy customer can starve the others on the same node. Per-tenant resource isolation is the only durable fix.

Pattern: Bulk paths should never share a queue with interactive paths. Separate workers, separate budgets.
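
One way to keep bulk and interactive work apart inside a single process is to give each its own concurrency budget. The sketch below uses two `asyncio.Semaphore` objects as a stand-in for separate worker pools; the class name `WorkloadPools`, the slot counts, and the fake tool are all illustrative assumptions, and a real deployment would more likely use separate processes or queues.

```python
import asyncio
from typing import Any, Awaitable, Callable


class WorkloadPools:
    """Separate concurrency budgets so bulk jobs cannot crowd out interactive calls."""

    def __init__(self, interactive_slots: int = 8, bulk_slots: int = 2) -> None:
        self._pools = {
            "interactive": asyncio.Semaphore(interactive_slots),
            "bulk": asyncio.Semaphore(bulk_slots),
        }

    async def run(self, kind: str, tool: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        # Each kind waits only on its own semaphore, never on the other one.
        async with self._pools[kind]:
            return await tool(*args)


async def demo() -> dict[str, int]:
    pools = WorkloadPools(interactive_slots=4, bulk_slots=1)
    active = {"bulk": 0, "interactive": 0}
    peak = {"bulk": 0, "interactive": 0}

    async def fake_tool(kind: str) -> None:
        active[kind] += 1
        peak[kind] = max(peak[kind], active[kind])
        await asyncio.sleep(0.01)  # stands in for an upstream API call
        active[kind] -= 1

    jobs = [pools.run("bulk", fake_tool, "bulk") for _ in range(5)]
    jobs += [pools.run("interactive", fake_tool, "interactive") for _ in range(4)]
    await asyncio.gather(*jobs)
    return peak


peak = asyncio.run(demo())
print(peak)
```

Five queued bulk jobs run one at a time here, while the interactive calls proceed without waiting behind them.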

Layer 3: Underlying SaaS API limits

Reality: Notion at 3 requests/sec, MEWS at 1,000 requests/hour, Stripe at 100 reads + 100 writes/sec. Claude has no idea these limits exist.

Failure shape: Claude requests a thousand-row scan and watches your tool retry-loop into a 429 cascade. Token cost explodes during the doomed retries.

Fix surface: This is where the warehouse-first MCP pattern earns its seat. See our cross-source MCP SQL post.

A robust MCP agent therefore has to track all three layers independently. The layers also interact: if your Claude tier rate-limits a Sonnet call, you may stall a long-running tool that already holds 18 of your 25 NetSuite SuiteTalk concurrent threads. The next user then gets a hang, not a clean 429.

The SaaS API rate-limit cheat sheet

Below is the table we keep open in every architecture review. Bookmark it. The numbers move every quarter as SaaS vendors react to AI agent traffic, and this table is the single source of truth across our connector posts.

Per-API hard ceilings (14 SaaS APIs)

SaaS API	Hard ceiling	Notes
Notion API	3 requests/sec average	Strictest of the major SaaS APIs. Bursts allowed, sustained traffic gets 429s. Pagination locks at 100 items.
Klaviyo	Tiered XS-XL burst + steady	XS = 3 burst / 60 steady. XL = 700 burst / 4000 steady. Per-API tier, not per-account. Native Anthropic MCP across ~200k brands.
Airtable	5 req/sec per base, 50 req/sec PAT ceiling	30-second cool-down after a 429. Personal Access Token aggregates across all bases.
Slack	Web API: 4 tiers + Events 30k/60min	T1 = ~1 req/min (admin). T4 = ~100+ req/min (chat.postMessage). Bot tokens share per-workspace bucket.
Pipedrive	30k base × plan multiplier × seat count	December 2024 moved to token-based budget. Pipedrive AI Sales Assistant handles rate-limit-aware paths natively.
NetSuite	25 concurrent SuiteTalk threads, 100k-row SuiteQL ceiling	Governance units cap REST and SOAP. Saved Searches and SuiteQL have separate per-query timeouts.
Stripe	100 reads + 100 writes per second	Sandbox = 25 reads + 25 writes/sec. Search API has its own lower limit (~20 req/sec).
Zendesk	100 req/min Suite Pro; 2,500 with High Volume add-on	Plus 30 updates per 10 min per ticket. Hits any agent doing fast multi-update workflows.
Brevo	1,000-6,000 req/sec per endpoint	Honours X-Sib-Ratelimit-* response headers. Transactional endpoints higher than marketing ones.
MEWS Connector API	1,000 requests/hour	Strictest enterprise SaaS limit we encounter. Hotel groups with 20+ properties run out of budget in minutes without caching.
Business Central OData	6,000 req/5min per user	Per-environment, per-user bucket. Multi-tenant ISVs hit ceilings fastest.
Xero	60 req/min per tenant, 5,000/day	JAX is Xero’s official AI assistant – use it for in-platform questions, warehouse-first for cross-source.
HubSpot	100 req/10sec (Enterprise: 200)	Daily caps too. Search API stricter than read endpoints.
Exact Online	60 req/min, 5,000 req/day per division	Multi-division accounting firms hit the daily cap quickly. Use bulk endpoints where available.

Patterns to spot in the cheat sheet

Note the budget pattern by category. Marketing and e-commerce APIs (Klaviyo, Brevo, Stripe) ship generous ceilings because volume is the business model. ERP and finance APIs (NetSuite, Business Central, Exact Online, Xero) ship tight ones because they treat their database as the constraint. Hospitality and field-service APIs (MEWS) ship the tightest of all. The answer to “should I live-fetch or cache” therefore depends mostly on which corner of this table your tool lives in.

The per-API research that fed this table lives in the connector blogs. Read the deeper breakdowns for Notion MCP, Klaviyo MCP, Airtable MCP, Slack MCP, Zendesk MCP, and Brevo MCP. Each has the per-endpoint rate-limit code snippets we used to build Peliqan’s connectors.

3 architectural patterns that quietly beat rate limits

Retry loops and exponential backoff are table stakes. They keep you alive, but they do not give you headroom. These three patterns do, and they compose: most production MCP architectures run all three in parallel for different tool surfaces.

Pattern 1: warehouse-first caching

How it works: Land the SaaS data in a Postgres warehouse every 10-15 minutes. Claude queries the warehouse, not the live API.

Why it wins: One MEWS account with a 1,000 requests/hour ceiling now serves unlimited Claude conversations because every conversation hits Postgres, not MEWS.

Peliqan setup: Use the materialize tables flow plus scheduled syncs from any connector.

Pattern 2: federated query with Trino

How it works: Expose each source as a Trino catalog. Claude issues SQL through one MCP endpoint that fans out to native APIs in parallel.

Why it wins: Trino respects per-source connection pools, so an agent can join a 50M-row NetSuite table to a 200k-row Stripe table without your team writing pagination glue.

Peliqan setup: SQL on anything via the federated query documentation.

Pattern 3: scheduled CDC + on-demand override

How it works: Run Change Data Capture every 5-15 minutes for hot tables – orders, tickets, deals, invoices. 99% of reads hit the cache.

Override: For the 1% where the user explicitly says “check live in Salesforce right now,” the MCP tool has a fresh=true argument that bypasses the cache.

Why it wins: This is the pattern production agents converge on – the cheap path is the default and you pay the API tax only when the user demands freshness.
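
The override path can be sketched in a few lines. In this illustration an in-memory dict stands in for the CDC-fed warehouse table, and `DealReader`, `apply_cdc_batch`, and the `live_fetch` callable are hypothetical names of ours, not part of any connector API.

```python
from typing import Callable, Optional


class DealReader:
    """Serve reads from the synced copy; call the live API only when fresh=True."""

    def __init__(self, live_fetch: Callable[[str], dict]) -> None:
        self._live_fetch = live_fetch             # hypothetical live API call
        self._warehouse: dict[str, dict] = {}     # stand-in for the warehouse table
        self.live_calls = 0                       # how much API budget we spent

    def apply_cdc_batch(self, rows: list[dict]) -> None:
        """What the scheduled CDC job would do: upsert changed rows."""
        for row in rows:
            self._warehouse[row["id"]] = row

    def get_deal(self, deal_id: str, fresh: bool = False) -> Optional[dict]:
        if fresh:
            # Explicit override: pay one live call and refresh the cached copy.
            self.live_calls += 1
            row = self._live_fetch(deal_id)
            self._warehouse[deal_id] = row
            return row
        # Default path: zero upstream calls.
        return self._warehouse.get(deal_id)


reader = DealReader(live_fetch=lambda deal_id: {"id": deal_id, "stage": "won"})
reader.apply_cdc_batch([{"id": "d1", "stage": "open"}])

cached = reader.get_deal("d1")             # served from the synced copy
live = reader.get_deal("d1", fresh=True)   # one live call, on request only
print(cached["stage"], live["stage"], reader.live_calls)
```

Counting `live_calls` separately is deliberate: it gives the override path its own budget that can be capped per tenant.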

4 failure modes that look healthy in your dashboard

Every MCP outage we have helped triage falls into one of these four buckets. None of them surface as a 5xx in the dashboard, so they survive long enough to break a customer demo. Each one has a clean fix, but you have to know to look for it.

Failure mode 1: silent throttling

Symptom: SaaS API returns a 200 with an empty results array or partial data instead of a 429. Notion, Salesforce, and HubSpot all do this for specific endpoints.

Impact: Claude sees zero rows, concludes the customer has no contacts, and hallucinates a confident answer.

Mitigation: Compare row counts to a sentinel value and alert on suspicious drops. Use Peliqan’s data quality monitoring for automatic checks.
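
A sentinel check of this kind can be very small. The sketch below flags a result whose row count falls sharply against the last successful call for the same query; the function name and the 80% default threshold are illustrative choices, and the right threshold depends on how volatile the underlying data is.

```python
def is_suspicious_drop(previous_rows: int, current_rows: int,
                       max_drop: float = 0.8) -> bool:
    """Return True when the row count fell by more than max_drop (a fraction)
    compared with the last successful call for the same query."""
    if previous_rows <= 0:
        # No baseline yet, so there is nothing to compare against.
        return False
    drop = (previous_rows - current_rows) / previous_rows
    return drop > max_drop


# A 200 response with an empty array after a 1,000-row baseline is flagged.
print(is_suspicious_drop(1000, 0))
# A moderate change is not.
print(is_suspicious_drop(1000, 500))
```

When the check fires, the tool should return an explicit error or fall back to the cached copy rather than hand Claude an empty result it will treat as fact.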

Failure mode 2: cascading retries

Symptom: Claude hits a 429, the MCP server retries with exponential backoff, but inside the same agent turn Claude also calls a second tool that hits the same API.

Impact: Two retry loops on the same bucket, bucket recovery is the longer of the two, and token cost balloons during the doomed retries.

Mitigation: Per-tenant token-bucket middleware on the MCP server, not per-tool. One bucket per (tenant, SaaS API) pair. For production API systems, centralized rate limiting helps enforce tenant-level traffic policies before retries cascade across shared services or upstream SaaS APIs.
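
Here is a minimal sketch of one bucket per (tenant, SaaS API) pair, using a standard token bucket. `TokenBucket` and `BucketRegistry` are our own illustrative names, the injected clock exists only to make the example deterministic, and the code is not thread-safe; a production gateway would add locking or keep the buckets in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Classic token bucket: refills at rate_per_sec up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._updated) * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class BucketRegistry:
    """One bucket per (tenant, api) pair, shared by every tool that calls that API."""

    def __init__(self, limits: dict[str, tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limits = limits  # api name -> (rate_per_sec, capacity)
        self._clock = clock
        self._buckets: dict[tuple[str, str], TokenBucket] = {}

    def try_acquire(self, tenant: str, api: str) -> bool:
        key = (tenant, api)
        if key not in self._buckets:
            rate, capacity = self._limits[api]
            self._buckets[key] = TokenBucket(rate, capacity, self._clock)
        return self._buckets[key].try_acquire()


fake_time = [0.0]
registry = BucketRegistry({"notion": (3.0, 3.0)}, clock=lambda: fake_time[0])
first_four = [registry.try_acquire("tenant-a", "notion") for _ in range(4)]
other_tenant = registry.try_acquire("tenant-b", "notion")
print(first_four, other_tenant)
```

Because both tools in a turn draw from the same bucket, the second call is refused immediately instead of starting a second retry loop against an API that is already throttling.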

Failure mode 3: ignored rate-limit headers

Symptom: Most teams parse the 429 status but ignore the X-Sib-Ratelimit-Remaining, x-ratelimit-remaining, and anthropic-ratelimit-tokens-remaining headers on 200s.

Impact: You discover the budget is gone when the next call returns 429, instead of pacing as the budget shrinks.

Mitigation: Surface remaining budget into observability as a gauge, not a counter. Alert when budget drops below 20% of the bucket.
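
The gauge-and-threshold idea can be sketched as follows. The function names are illustrative, and the helper deliberately takes plain numbers: the header that carries the bucket size differs per vendor, so parsing is left to the connector.

```python
def budget_fraction(remaining: int, limit: int) -> float:
    """Remaining budget as a fraction of the bucket, clamped to [0, 1]."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, min(1.0, remaining / limit))


def should_alert(remaining: int, limit: int, threshold: float = 0.2) -> bool:
    """True when the remaining budget has dropped below the alert threshold."""
    return budget_fraction(remaining, limit) < threshold


# Emit this value as a gauge after every successful call, not only after a 429.
gauge_value = budget_fraction(250, 1000)
print(gauge_value, should_alert(150, 1000), should_alert(500, 1000))
```

A gauge fits better than a counter here because the value goes back up whenever the vendor resets the window.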

Failure mode 4: multi-tenant noisy neighbour

Symptom: Single OAuth app or API key shared across customers. One customer’s bulk export starves every other customer’s Claude conversation.

Impact: Random “throttled” errors during peak hours that disappear at 3am. Untraceable until you split the tenants apart.

Mitigation: Per-tenant credentials and per-tenant rate-limit buckets. Bulk path on a separate queue from interactive.

Code: turning a 1,000 req/hour MEWS API into unlimited Claude queries

Here is the simplified Peliqan pattern. First, pull MEWS reservations every 10 minutes into Postgres. Then expose a single MCP tool that queries the warehouse. Claude can then run a thousand “what is tonight’s occupancy” turns while we pay MEWS about 6 API calls per hour: one per sync, as long as each sync window fits in a single page.

# 1. CDC job - runs every 10 minutes via Peliqan scheduler
def sync_mews_reservations():
    cursor = get_last_cursor("mews_reservations")
    # Capture the window end once so the saved cursor matches the fetched
    # window; calling now_utc() twice would skip rows updated in between.
    window_end = now_utc()
    page = mews.reservations.get_all(
        TimeFilter="Updated",
        UpdatedUtc={"StartUtc": cursor, "EndUtc": window_end},
        Limit=1000,
    )
    upsert("warehouse.mews_reservations", page.Reservations)
    set_cursor("mews_reservations", window_end)
    # ~6 API calls/hour for a 20-property group (one page per sync)

# 2. MCP tool - Claude calls this, NOT the live MEWS API
@mcp.tool()
def occupancy_by_property(date: str) -> list[dict]:
    """Returns rooms sold per property for a given date.
    Data freshness: ~10 minutes. Call fresh_occupancy() for live."""
    return query("""
        select property_name, count(*) as rooms_sold
        from warehouse.mews_reservations
        where check_in_date <= %s and check_out_date > %s
          and state = 'Confirmed'
        group by property_name
        order by rooms_sold desc
    """, [date, date])

# 3. Override path - costs 1 MEWS call when explicitly asked
@mcp.tool()
def fresh_occupancy(property_id: str, date: str) -> dict:
    """Live MEWS read. Use only when user says 'right now'."""
    return mews.reservations.get_live(property_id, date)

The same pattern works for Notion’s 3 req/sec, Xero’s 60 req/min, Exact Online’s 5,000 req/day, and every other tight-budget API in the cheat sheet. The only APIs we live-fetch in Peliqan are the ones with effectively infinite budgets (Stripe writes, Slack T4 chat.postMessage) or where staleness genuinely matters (live availability checks). The full MCP server setup is documented in our MCP overview.

When to live-fetch vs cache

Question	If yes	If no
Does the user need data fresher than 10 minutes?	Live-fetch via override tool	Go to next question
Will the agent join across 2+ sources?	Warehouse-first (Postgres or Trino)	Go to next question
Is the API’s budget below 1,000 req/hr?	Mandatory cache	Go to next question
Will 50+ concurrent users hit it?	Cache + per-tenant buckets	Live-fetch is fine
Is the data write-heavy (Stripe charges, Slack messages)?	Live – never cache writes	Default to cache
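
The table can be encoded as a small function, which is handy for architecture reviews. This is one possible reading of it: we treat the write-heavy row as an up-front check because writes are never cached, then walk the remaining questions in order. `ToolProfile` and `choose_strategy` are illustrative names of ours.

```python
from dataclasses import dataclass


@dataclass
class ToolProfile:
    needs_fresher_than_10_min: bool
    joins_multiple_sources: bool
    api_budget_per_hour: int
    concurrent_users: int
    write_heavy: bool = False


def choose_strategy(profile: ToolProfile) -> str:
    """Walk the decision table top to bottom and return the first match."""
    if profile.write_heavy:
        return "live (never cache writes)"
    if profile.needs_fresher_than_10_min:
        return "live-fetch via override tool"
    if profile.joins_multiple_sources:
        return "warehouse-first"
    if profile.api_budget_per_hour < 1000:
        return "mandatory cache"
    if profile.concurrent_users >= 50:
        return "cache + per-tenant buckets"
    return "live-fetch is fine"


# A read tool on a 500 req/hour API with a handful of users.
print(choose_strategy(ToolProfile(False, False, 500, 5)))
```

Running each planned tool through the function before writing it makes the cache-versus-live choice an explicit, reviewable decision.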

Retry strategy: the 3 rules

If you only remember three things about retries

Exponential backoff with jitter: Start at 500ms, double up to 30 seconds, add ±20% jitter to avoid a synchronized thundering herd. Never tighter than 500ms.
Respect Retry-After: When the SaaS sends a Retry-After header, treat it as mandatory, not advisory. Airtable’s 30-second cool-down is a hard floor.
Never retry inside Claude’s turn: Bubble the 429 back to Claude with a clear error message so Claude can wait, switch tools, or tell the user. Cascading retries inside one turn are the #1 cause of production MCP outages.
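
The first two rules reduce to one delay calculation, sketched below for the places where retrying is allowed, such as background sync jobs. The function name `next_delay` and the choice to clamp the jittered value to the 500ms-30s range are our assumptions; at the cap, the clamp means jitter can only shorten the wait. The random source is injectable so the behaviour can be tested.

```python
import random
from typing import Callable, Optional

BASE_DELAY = 0.5   # seconds, never retry tighter than this
MAX_DELAY = 30.0   # seconds, ceiling for the doubled backoff
JITTER = 0.2       # +/- 20%


def next_delay(attempt: int, retry_after: Optional[float] = None,
               rng: Callable[[], float] = random.random) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    backoff = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    # rng() is in [0, 1), so the factor lands in [0.8, 1.2).
    jittered = backoff * (1 + JITTER * (2 * rng() - 1))
    delay = min(MAX_DELAY, max(BASE_DELAY, jittered))
    if retry_after is not None:
        # A server-provided Retry-After is a floor, even above MAX_DELAY.
        delay = max(delay, retry_after)
    return delay


no_jitter = lambda: 0.5  # makes the jitter factor exactly 1.0
schedule = [next_delay(n, rng=no_jitter) for n in range(8)]
print(schedule)
print(next_delay(0, retry_after=30.0, rng=no_jitter))
```

With jitter disabled the schedule doubles from 0.5 seconds until it reaches the 30-second cap, and a 30-second Retry-After overrides the short first delay.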

Cooperative architecture: when the SaaS ships its own AI

A growing pattern in 2025-2026 is that the SaaS vendor ships an official AI assistant that already knows the rate-limit landscape of its own API. Klaviyo and Anthropic shipped a native MCP partnership across roughly 200,000 brands. Pipedrive shipped the AI Sales Assistant, Xero shipped JAX, and Notion shipped Notion AI. When the question lives entirely inside that one SaaS – “draft this email,” “update this deal,” “summarise this page” – the vendor’s own AI handles the rate-limit-aware path better than you can.

The warehouse-first pattern earns its keep on the cross-source question: “reconcile this customer’s Klaviyo unsubscribes against their Stripe refunds and their Zendesk tickets.” That question has no single SaaS to live in, so it has to live in a warehouse with a federated SQL surface. The connector-specific blog posts walk through this cooperative pattern in depth: Business Central MCP is a particularly clear example of layering native AI with cross-source warehouse reads.

How Peliqan handles MCP rate limits

What ships out of the box for rate-limit safety

Warehouse-first by default: Postgres + Trino federated query across 250+ connectors. Reads hit the warehouse, not the live API.

Per-tenant rate-limit buckets: One bucket per (tenant, SaaS API) pair. No noisy-neighbour starvation.

Header-aware retry: All connector clients parse rate-limit-remaining headers on 200s and pace requests proactively, not reactively.

Override path: Every connector ships a fresh=true variant for the rare “live read” question, separately budgeted.

Reverse ETL writeback: Writes go through the Peliqan reverse ETL surface with audit trail and dry-run mode.

EU-hosted, SOC 2 Type II: Belgium data centre, ISO 27001, GDPR-compliant. Structured logs for EU AI Act Article 26.

Fixed pricing: From €150/month annual. No per-call or per-token surprise invoicing.

The MCP server is published as `pip install mcp-server-peliqan` with open-source connector-specific repos on GitHub for Exact Online, Teamleader, AFAS, Odoo, and PowerOffice. The platform combines text-to-SQL, RAG, automatic data lineage, and metadata management – the rate-limit work happens transparently underneath. The detailed setup for connecting Claude or any MCP client to your data sits in the connect to data documentation.

Real-world example: Skindr

Skindr runs advertising and RevOps analytics on Peliqan across multiple ad-platform APIs that all enforce strict rate limits during campaign peaks. The warehouse-first pattern keeps their dashboards and Claude conversations running flat-cost even when individual ad APIs throttle hard. Read the full case study.

Common challenges and quick answers

Challenge: my MCP server worked in dev, breaks at 10 concurrent users

Almost always, this is a per-tenant bucketing problem. Dev tested one OAuth credential with one user, but prod shares that same credential across every customer, so the noisy customer starved the rest. Split the credentials per tenant, add a per-(tenant, API) bucket, and add a separate worker pool for bulk vs interactive.

Challenge: Claude keeps retrying when the budget is clearly gone

You are not bubbling the 429 back to Claude. The MCP server is retrying internally, so Claude never sees the error – it just sees a slow tool. Instead, bubble the 429 back as a clean tool error with the Retry-After hint, and let Claude decide whether to wait, switch tools, or surface the error to the user.
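
A minimal sketch of that hand-off is shown below. The result shape (an `isError` flag plus a list of text content items) mirrors the tool-result convention in the MCP specification; the exception class `UpstreamRateLimited` and the wrapper `run_tool_once` are hypothetical names of ours, and how the result is returned in practice depends on the MCP SDK you use.

```python
from typing import Callable, Optional


class UpstreamRateLimited(Exception):
    """Hypothetical exception a connector raises when the SaaS API returns 429."""

    def __init__(self, api: str, retry_after_seconds: Optional[float] = None) -> None:
        super().__init__(f"{api} rate limit reached")
        self.api = api
        self.retry_after_seconds = retry_after_seconds


def run_tool_once(tool: Callable[[], str]) -> dict:
    """Call the tool exactly once; turn a 429 into a tool error instead of retrying."""
    try:
        text = tool()
        return {"isError": False, "content": [{"type": "text", "text": text}]}
    except UpstreamRateLimited as exc:
        if exc.retry_after_seconds is not None:
            hint = f"Retry after {exc.retry_after_seconds:g} seconds."
        else:
            hint = "No Retry-After hint was provided."
        message = f"{exc.api} rate limit reached (HTTP 429). {hint}"
        return {"isError": True, "content": [{"type": "text", "text": message}]}


def throttled_tool() -> str:
    # Stand-in for a connector call that hit the upstream limit.
    raise UpstreamRateLimited("Airtable", retry_after_seconds=30)


result = run_tool_once(throttled_tool)
print(result["content"][0]["text"])
```

The error text is written for the model as much as for a human: naming the API and the wait time gives Claude what it needs to choose between waiting, switching tools, or telling the user.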

Challenge: token cost exploded after we launched to 50 customers

Two likely culprits explain the explosion. First, there is no warehouse-first cache for the read-heavy tools, so every Claude turn pays for the underlying API plus the tokens to retry. Second, cascading retries inside a single turn are doubling or tripling the Claude token spend. Add warehouse caching for the top 5 read tools and add per-tenant rate-limit buckets at the MCP gateway.

Challenge: silent throttling on Notion returns empty arrays

Notion returns 200 with an empty page when an integration hits its rate limit on certain endpoints. Add a sentinel check: if the row count drops more than 80% versus the last successful call for the same query, alert and fall back to the warehouse copy. The Peliqan connector handles this case automatically, but custom MCP servers need the explicit guard.

Conclusion: cache by default, override on demand

If you have read this far, you are likely building or operating an MCP server in production. The shortest version of the advice: cache by default, override on demand, instrument per tenant, and stop pretending Layer 1 and Layer 2 will save you from Layer 3. Treat the cheat sheet as a living document – the numbers move every quarter as SaaS vendors react to AI agent traffic.

The architecture that ships is boring: warehouse-first for reads, an override tool for live-fetch, per-tenant buckets at the gateway, header-aware retry, and bulk and interactive on separate queues. Most teams over-engineer the retry loop and under-engineer the architecture above it. Once you get the architecture right, the retry loop becomes trivial.

If you want the warehouse-first half of that architecture without building it from scratch, Peliqan ships per-tenant rate-limit buckets, 10-minute CDC for 250+ connectors, and a Trino-backed federated SQL surface that lets Claude join NetSuite to Stripe to Zendesk in one tool call. The platform is EU-hosted and SOC 2 Type II, with ISO 27001 in progress and fixed pricing from €150/month annual. Book a demo and we will walk through the cheat sheet against your actual connector mix.

FAQs

What is Claude's rate limit?

Claude’s rate limits are set per Anthropic organization and scale by usage tier. Tier 1 starts at 50 requests per minute and 40,000 input tokens per minute on Sonnet. Tier 4 reaches 1,000 requests per minute and 400,000 input tokens per minute. The MCP transport itself does not add new token limits.

How do you handle MCP rate limits?

Four-step pattern. Cache by default by landing SaaS data in a Postgres warehouse on a 10-15 minute CDC schedule and having Claude query the warehouse, not the live API. Override on demand by exposing a separate fresh=true tool for the rare “right now” question. Instrument per tenant using per-customer credentials and per-customer token buckets so one noisy tenant cannot starve the others. Parse rate-limit headers on 200s, not just 429s – Brevo’s X-Sib-Ratelimit-Remaining, Stripe’s x-ratelimit-remaining, and Anthropic’s headers tell you the budget before you exhaust it.

What is the best MCP server for high-volume workloads?

For high-volume cross-source workloads, the warehouse-first MCP pattern wins because it decouples Claude’s query rate from the underlying API ceiling. Peliqan ships per-tenant rate-limit buckets, 10-minute CDC for 250+ connectors, and a Trino-backed federated SQL surface that lets Claude join NetSuite to Stripe to Zendesk in one tool call. EU-hosted, SOC 2 Type II, ISO 27001, fixed pricing from €150/month annual. For single-source workloads where the SaaS vendor ships its own AI (Klaviyo plus Anthropic, Pipedrive AI Sales Assistant, Xero JAX), let the vendor’s AI handle in-platform questions and use the warehouse MCP for cross-source.

What's the right MCP retry strategy?

Three rules. First, exponential backoff with jitter on 429s – start at 500ms, double up to 30 seconds, add ±20% jitter, never tighter than 500ms. Second, respect the Retry-After header when present (Airtable’s 30-second cool-down is a hard floor). Third, never retry inside Claude’s turn – bubble the 429 back to Claude with a clear error message so Claude can wait, switch tools, or surface the error to the user. Cascading retries inside one turn are the #1 cause of production MCP outages.

Piet Michiel

Co-founder of Peliqan.io. Passionate about building innovative products that open up the digital world to as broad an audience as possible. Previously with Blendr.io and now with Peliqan.io, we are building a low-code SaaS platform that enables both business and technical users to access and activate their data.
