MCP rate limits are the single most common production failure mode for AI agents in 2026. Every Model Context Protocol (MCP) server you build will hit a throttle by the third or fourth Claude turn, and the question is not whether you will hit 429s – it is whether your architecture recovers gracefully or crashes mid-conversation in front of a customer.
This guide is the developer reference we wish existed when we shipped our first MCP server. It pulls every per-API rate-limit number scattered across our connector posts into one definitive table, lays out the three throttling tiers most teams instrument incorrectly, shows the architectural patterns that quietly beat the problem, and walks through the live code that turns a 1,000 requests-per-hour SaaS API into unlimited Claude queries.
If you build agents for production, bookmark this. If you operate them, treat it as the runbook. The numbers move every quarter as SaaS vendors react to AI agent traffic, so we will keep this page current.
What MCP server throttling actually is
Rate limits at a glance for MCP developers
Most MCP outages we have triaged this year share the same root cause. The team built one MCP server, wired one OAuth credential, and asked Claude to “go answer customer questions.” Claude then fired off thousands of tool calls across 50 concurrent conversations, and the underlying SaaS API throttled the first one inside 20 seconds. The other 49 sessions inherited the 429 cascade, and the customer demo died on stage.
Why MCP rate limits matter more in 2026
Six forces making throttling worse this year
- Agentic loops: A single Claude conversation now averages 8-15 tool calls, up from 2-3 in 2024 when MCP first shipped
- SaaS vendors react: Notion, Salesforce, and HubSpot have tightened limits in 2025-2026 specifically to slow down agent traffic
- Multi-tenant by default: One shared OAuth credential across 50 customers means one noisy customer starves the rest
- Header literacy gap: Most teams parse the 429 status but ignore the rate-limit-remaining headers on the successful 200 responses
- Compliance audit pressure: EU AI Act Article 26 demands reproducible logs, and reproducibility falls apart when retries cascade
- FinOps signals: CFOs trace token cost explosions back to retry loops chewing through Claude tokens on doomed calls
The pattern is clear. Rate limits used to be a latency annoyance. In 2026 they are a security, FinOps, and reliability problem at the same time, so the architecture you ship today has to assume the limits will get tighter, not looser. We unpack the CTO-grade architecture in our MCP for the EU CTO and CIO reference post.
The 3 layers of MCP rate limits
When an MCP-powered agent runs a single user prompt, three independent rate-limit ceilings can fire. Most teams instrument only one, and the other two cause the production outages. Each layer therefore needs its own dashboard, its own alert, and its own retry strategy.
Layer 1: LLM token rate limits (Anthropic / Claude)
Layer 2: MCP server transport limits
Layer 3: Underlying SaaS API limits
A robust MCP agent has to track all three layers independently. The layers also interact: if your Claude tier rate-limits a Sonnet call, you may stall a long-running tool that already holds 18 of your 25 NetSuite SuiteTalk concurrent threads. The next user then gets a hang, not a clean 429.
The SaaS API rate-limit cheat sheet
Below is the table we keep open in every architecture review. Bookmark it. The numbers move every quarter as SaaS vendors react to AI agent traffic, and this is the single source of truth across our connector posts.
Per-API hard ceilings (14 SaaS APIs)
Patterns to spot in the cheat sheet
Note the budget pattern by category. Marketing and e-commerce APIs (Klaviyo, Brevo, Stripe) ship generous ceilings because volume is the business model. ERP and finance APIs (NetSuite, Business Central, Exact Online, Xero) ship tight ones because they treat their database as the constraint. Hospitality and field-service APIs (MEWS) ship the tightest of all. The answer to “should I live-fetch or cache” therefore depends mostly on which corner of this table your tool lives in.
The per-API research that fed this table lives in the connector blogs. Read the deeper breakdowns for Notion MCP, Klaviyo MCP, Airtable MCP, Slack MCP, Zendesk MCP, and Brevo MCP. Each has the per-endpoint rate-limit code snippets we used to build Peliqan’s connectors.
3 architectural patterns that quietly beat rate limits
Retry loops and exponential backoff are table stakes. They keep you alive, but they do not give you headroom. The three patterns below do, and they compose: most production MCP architectures run all three in parallel for different tool surfaces.
Pattern 1: warehouse-first caching
Pattern 2: federated query with Trino
Pattern 3: scheduled CDC + on-demand override
4 failure modes that look healthy in your dashboard
Every MCP outage we have helped triage falls into one of these four buckets. None of them surfaces as a 5xx in the dashboard, so they survive long enough to break a customer demo. Each one has a clean fix, but you have to know to look for it.
Failure mode 1: silent throttling
Failure mode 2: cascading retries
Failure mode 3: ignored rate-limit headers
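A minimal sketch of reading the remaining budget from successful responses, not only from 429s. The header names are assumptions based on common conventions (for example `X-RateLimit-Remaining`); vendors name and format these headers differently, so treat the tuples as placeholders to adjust per API. The function names and the 20% threshold are hypothetical.

```python
from typing import Mapping, Optional, Tuple

# Assumed header names; check each vendor's documentation before relying on them.
REMAINING_HEADERS: Tuple[str, ...] = ("x-ratelimit-remaining", "ratelimit-remaining")
LIMIT_HEADERS: Tuple[str, ...] = ("x-ratelimit-limit", "ratelimit-limit")


def _first_int(headers: Mapping[str, str], names: Tuple[str, ...]) -> Optional[int]:
    """Return the first header in `names` that parses as a plain integer."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name in names:
        value = lowered.get(name)
        if value is None:
            continue
        try:
            return int(value.strip())
        except ValueError:
            continue  # structured or malformed value: try the next candidate
    return None


def budget_fraction_left(headers: Mapping[str, str]) -> Optional[float]:
    """Fraction of the current window still available, or None if unknown."""
    remaining = _first_int(headers, REMAINING_HEADERS)
    limit = _first_int(headers, LIMIT_HEADERS)
    if remaining is None or limit is None or limit <= 0:
        return None
    return remaining / limit


def should_slow_down(headers: Mapping[str, str], threshold: float = 0.2) -> bool:
    """True when a response reports less than `threshold` of the budget left."""
    fraction = budget_fraction_left(headers)
    return fraction is not None and fraction < threshold
```

Calling `should_slow_down(response.headers)` on every 200 response lets the server throttle itself before the vendor does. When the headers are missing, the sketch returns `False` instead of guessing.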
Failure mode 4: multi-tenant noisy neighbour
Code: turning a 1,000 req/hour MEWS API into unlimited Claude queries
Here is the simplified Peliqan pattern. First, pull MEWS reservations every 10 minutes into Postgres. Then expose a single MCP tool that queries the warehouse. Claude can then run a thousand “what is tonight’s occupancy” turns while we pay MEWS roughly 6 API calls per hour.

# 1. CDC job - runs every 10 minutes via Peliqan scheduler
def sync_mews_reservations():
    cursor = get_last_cursor("mews_reservations")
    # Capture the window end once, so the stored cursor matches the
    # window that was actually fetched and no updates fall into a gap.
    sync_end = now_utc()
    page = mews.reservations.get_all(
        TimeFilter="Updated",
        UpdatedUtc={"StartUtc": cursor, "EndUtc": sync_end},
        Limit=1000,
    )
    upsert("warehouse.mews_reservations", page.Reservations)
    set_cursor("mews_reservations", sync_end)
    # ~6 API calls/hour for a 20-property group


# 2. MCP tool - Claude calls this, NOT the live MEWS API
@mcp.tool()
def occupancy_by_property(date: str) -> list[dict]:
    """Returns rooms sold per property for a given date.
    Data freshness: ~10 minutes. Call fresh_occupancy() for live."""
    return query("""
        select property_name, count(*) as rooms_sold
        from warehouse.mews_reservations
        where check_in_date <= %s and check_out_date > %s
          and state = 'Confirmed'
        group by property_name
        order by rooms_sold desc
    """, [date, date])


# 3. Override path - costs 1 MEWS call when explicitly asked
@mcp.tool()
def fresh_occupancy(property_id: str, date: str) -> dict:
    """Live MEWS read. Use only when user says 'right now'."""
    return mews.reservations.get_live(property_id, date)
The same pattern works for Notion’s 3 req/sec, Xero’s 60 req/min, Exact Online’s 5,000 req/day, and every other tight-budget API in the cheat sheet. The only APIs we live-fetch in Peliqan are the ones with effectively infinite budgets (Stripe writes, Slack T4 chat.postMessage) or where staleness genuinely matters (live availability checks). The full MCP server setup is documented in our MCP overview.
When to live-fetch vs cache
Retry strategy: the 3 rules
If you only remember three things about retries
- Exponential backoff with jitter: Start at 500ms, double up to 30 seconds, add ±20% jitter to avoid a synchronized thundering herd. Never tighter than 500ms.
- Respect Retry-After: When the SaaS sends a Retry-After header, treat it as mandatory, not advisory. Airtable’s 30-second cool-down is a hard floor.
- Never retry inside Claude’s turn: Bubble the 429 back to Claude with a clear error message so Claude can wait, switch tools, or tell the user. Cascading retries inside one turn are the #1 cause of production MCP outages.
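The first two rules can be sketched as a single delay calculation. This is an illustration under stated assumptions, not a definitive implementation: `next_delay` is a hypothetical helper, jitter is applied after the 30-second cap (so the longest backoff is about 36 seconds), and the Retry-After value is assumed to be already parsed into seconds.

```python
import random
from typing import Optional

BASE_DELAY_S = 0.5   # rule 1: never tighter than 500 ms
MAX_DELAY_S = 30.0   # rule 1: doubling stops at 30 seconds
JITTER = 0.2         # rule 1: plus or minus 20%


def next_delay(attempt: int,
               retry_after_s: Optional[float] = None,
               rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    # Exponential growth from the base delay, capped at the maximum.
    backoff = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    # Spread callers out so they do not all retry at the same instant.
    jittered = backoff * (1 + source.uniform(-JITTER, JITTER))
    # Jitter must not push the wait below the 500 ms floor.
    delay = max(jittered, BASE_DELAY_S)
    # Rule 2: a Retry-After hint is a hard floor, not a suggestion.
    if retry_after_s is not None:
        delay = max(delay, retry_after_s)
    return delay
```

The function only computes a wait time. Per the third rule, a tool handler running inside a Claude turn would report the throttle instead of sleeping, so a helper like this belongs in background sync jobs and bulk workers.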
Cooperative architecture: when the SaaS ships its own AI
A growing pattern in 2025-2026: the SaaS vendor ships an official AI assistant that already knows the rate-limit landscape of its own API. Klaviyo plus Anthropic shipped a native MCP partnership across roughly 200,000 brands. Pipedrive shipped the AI Sales Assistant, Xero shipped JAX, and Notion shipped Notion AI. When the question lives entirely inside that one SaaS – “draft this email,” “update this deal,” “summarise this page” – the vendor’s own AI handles the rate-limit-aware path better than you can.
Where the warehouse-first pattern earns its keep is the cross-source question: “reconcile this customer’s Klaviyo unsubscribes against their Stripe refunds and their Zendesk tickets.” That question has no single SaaS to live in, so it has to live in a warehouse with a federated SQL surface. The connector-specific blog posts walk through this cooperative pattern in depth: Business Central MCP is a particularly clear example of layering native AI with cross-source warehouse reads.
How Peliqan handles MCP rate limits
What ships out of the box for rate-limit safety
The MCP server is published as `pip install mcp-server-peliqan` with open-source connector-specific repos on GitHub for Exact Online, Teamleader, AFAS, Odoo, and PowerOffice. The platform combines text-to-SQL, RAG, automatic data lineage, and metadata management – the rate-limit work happens transparently underneath. The detailed setup for connecting Claude or any MCP client to your data sits in the connect to data documentation.
Real-world example: Skindr
Skindr runs advertising and RevOps analytics on Peliqan across multiple ad-platform APIs that all enforce strict rate limits during campaign peaks. The warehouse-first pattern keeps their dashboards and Claude conversations running flat-cost even when individual ad APIs throttle hard. Read the full case study.
Common challenges and quick answers
Challenge: my MCP server worked in dev, breaks at 10 concurrent users
Almost always, this is a per-tenant bucketing problem. Dev tested one OAuth credential with one user, but prod shares that single OAuth credential across customers, so the noisy customer starved the rest. Split the credentials per tenant, add a per-(tenant, API) bucket, and add a separate worker pool for bulk vs interactive.
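A per-(tenant, API) bucket can be sketched as a small in-memory token bucket keyed by both values. The class names, rates, and injected clock are hypothetical; the sketch is single-process and not thread-safe, so a production gateway would likely need a lock or a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Classic token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per (tenant, API) pair, so tenants cannot starve each other."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, tenant_id: str, api: str) -> bool:
        key = (tenant_id, api)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[key] = bucket
        return bucket.try_acquire()
```

When `allow()` returns `False`, the gateway can reject or queue that tenant's call without touching anyone else's budget. The per-tenant rate would normally be set below the vendor's ceiling for the credential in question.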
Challenge: Claude keeps retrying when the budget is clearly gone
You are not bubbling the 429 back to Claude. The MCP server is retrying internally, so Claude never sees the error – it just sees a slow tool. Bubble the 429 back as a clean tool error with the Retry-After hint, and let Claude decide whether to wait, switch tools, or surface the error to the user.
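One way to shape that error is sketched below, assuming the tool handler has the upstream status code and headers in hand. The function names are hypothetical, and the returned dict follows the general shape of an MCP tool result with `isError` set; how you hand it back depends on the SDK you use. Only the numeric form of Retry-After is parsed here.

```python
from typing import Any, Dict, Mapping, Optional


def parse_retry_after(headers: Mapping[str, str]) -> Optional[int]:
    """Return Retry-After in whole seconds, or None if absent or not numeric."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            value = value.strip()
            return int(value) if value.isdigit() else None
    return None


def throttled_tool_result(api: str, status: int,
                          headers: Mapping[str, str]) -> Optional[Dict[str, Any]]:
    """Build an error result for a throttled upstream call; None if not throttled."""
    if status != 429:
        return None
    wait = parse_retry_after(headers)
    hint = f"retry after {wait} seconds" if wait is not None else "retry later"
    text = (
        f"{api} rate limit reached (HTTP 429); {hint}. "
        "No retry was attempted inside this tool call."
    )
    # The message is written for the model: it states what happened and
    # when a retry could succeed, so the model can choose what to do next.
    return {"isError": True, "content": [{"type": "text", "text": text}]}
```

With a message like this in the tool result, the model has what it needs to wait, pick a cached tool, or tell the user, without the server hiding the throttle behind a slow call.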
Challenge: token cost exploded after we launched to 50 customers
Two culprits are likely. First, there is no warehouse-first cache for the read-heavy tools, so every Claude turn pays for the underlying API call plus the tokens to retry. Second, cascading retries inside a single turn are doubling or tripling the Claude token spend. Add warehouse caching for the top 5 read tools and add per-tenant rate-limit buckets at the MCP gateway.
Challenge: silent throttling on Notion returns empty arrays
Notion can return 200 with an empty page when an integration hits its rate limit on certain endpoints. Add a sentinel check: if the row count drops more than 80% versus the last successful call for the same query, alert and fall back to the warehouse copy. The Peliqan connector handles this case automatically, but custom MCP servers need the explicit guard.
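The sentinel check can be sketched as a pure function plus a small selector. The names are hypothetical, and the sketch assumes you already store the row count of the last successful call per query and have a warehouse copy to fall back to.

```python
from typing import List, Optional, Sequence, Tuple


def looks_silently_throttled(row_count: int,
                             last_good_count: Optional[int],
                             max_drop: float = 0.8) -> bool:
    """True when rows fell by more than `max_drop` versus the last good call."""
    if last_good_count is None or last_good_count <= 0:
        return False  # no baseline yet, so nothing to compare against
    drop = (last_good_count - row_count) / last_good_count
    return drop > max_drop


def choose_rows(live_rows: Sequence[dict],
                warehouse_rows: Sequence[dict],
                last_good_count: Optional[int]) -> Tuple[List[dict], bool]:
    """Return (rows to serve, whether the warehouse fallback was used)."""
    if looks_silently_throttled(len(live_rows), last_good_count):
        return list(warehouse_rows), True
    return list(live_rows), False
```

The boolean in the return value is there so the caller can raise the alert and tell the model the data came from the cached copy. A legitimate large drop, such as a bulk delete, would also trip the check, which is one reason to alert and not only fall back.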
Conclusion: cache by default, override on demand
If you have read this far, you are likely building or operating an MCP server in production. The shortest version of the advice: cache by default, override on demand, instrument per tenant, and stop pretending Layer 1 and Layer 2 will save you from Layer 3. Treat the cheat sheet as a living document – the numbers move every quarter as SaaS vendors react to AI agent traffic.
The architecture that ships is boring: warehouse-first for reads, an override tool for live-fetch, per-tenant buckets at the gateway, header-aware retry, and bulk and interactive work on separate queues. Most teams over-engineer the retry loop and under-engineer the architecture above it. Once you get the architecture right, the retry loop becomes trivial.
If you want the warehouse-first half of that architecture without building it from scratch, Peliqan ships per-tenant rate-limit buckets, 10-minute CDC for 250+ connectors, and a Trino-backed federated SQL surface that lets Claude join NetSuite to Stripe to Zendesk in one tool call. It is EU-hosted, with SOC 2 Type II, ISO 27001 in progress, and fixed pricing from €150/month annual. Book a demo and we will walk through the cheat sheet against your actual connector mix.



