How a Poor Database Sharding Design Can Destroy Your Entire System
Introduction: The Midnight Crisis
It's 2 AM, and the office is lit up like a living room on New Year's Eve. Product manager Xiao Li's coffee intake has reached "medical observation" levels, tech lead Old Zhang's hairline has receded another half centimeter under repeated scratching, and the ops engineer stares at a screen of constantly flashing red alarms, expression as grave as someone reading their own medical report.
This is already the third time this month the system has "performed a free fall" in the middle of the night. When the team traced the root cause, everyone was dumbfounded: the culprit was precisely the database sharding solution the whole team had praised last year as "stable as an old dog," now a "data bomb" detonating on schedule every night.
Why Database Sharding Exists
Before complaining about those blood-pressure-raising failed designs, let's first understand one thing: database sharding is essentially a large-scale "data relocation," not some mysterious "black technology."
Imagine your database is the wardrobe in a single-person apartment. At first there are few items; underwear and socks can be stuffed anywhere, and you can find them in three seconds even when rushing out in the morning. But after the business expands, data floods in like Double Eleven packages: the user table grows into a "big city" of tens of millions of rows, the order table expands into a hundred-million-row "super metropolis," and the log table occupies several "provinces" outright. At this point, problems emerge:
- Queries slow as a sloth meeting: Want to find a user's orders from last month? The database must comb through hundreds of millions of rows like finding a needle in a haystack, comparable to digging out last year's charger from a garage piled with clutter.
- Writes congested like downtown rush hour: Several thousand orders per second flood in simultaneously, and the database connection pool instantly turns into an "internet-famous milk tea shop" queue.
- Backup and recovery comparable to Yugong moving mountains: A single backup runs to dozens of GB, and a restore gives you time to finish all three "Lord of the Rings" extended editions.
- Hardware limitations like a small apartment forcing three generations to live together: CPU, memory, and disk IO all have ceilings, like a 20-square-meter apartment trying to fit a family of five plus two dogs.
Database sharding is a "property upgrade" for data: table sharding turns a large flat into a loft with layered management; database sharding buys an entire building outright, with each floor housing different tenants. But remember: if you're living comfortably in 50 square meters and insist on taking out a loan for a villa, the monthly "mortgage payment" (operations cost) may make you question life.
Five Classic "Crash" Postures
Throughout my technical career, I've witnessed all kinds of mind-boggling sharding designs. The most "amazing" was a team that used nickname length as the sharding key: users with long nicknames in one table, users with short nicknames in another. To query, you first had to guess the user's nickname length like a fortune-teller, turning a technical problem into a metaphysics problem.
The following real-world "masterpiece operations" will make you want to laugh and mail the designer a razor blade at the same time.
Crash One: Choosing the Wrong Sharding Key
Choosing a sharding key is like choosing a class monitor—choose wrong and the whole class suffers. The most common death-seeking behavior is using auto-increment IDs as sharding keys.
A social APP team sharded the user table into 8 "classes" by auto-increment ID ranges. The result: because IDs only ever grow, every newly registered user crowded into the last class! During peak hours the last class's "homeroom teacher" (database instance) worked until its CPU hit 100 degrees, while the other seven classes' "homeroom teachers" drank tea and watched dramas. Users complained that registration was slow as a snail, and the team watched helplessly as new users defected to the neighboring "school" (a competitor).
There was also a food delivery platform that sharded its order table by rider ID, but the platform had an "order king" rider who single-handedly took 30% of all orders. His shard held several orders of magnitude more data than the others, and queries against it stuttered like a slideshow. This is "data monopoly": one student writes the whole class's homework while the other classmates are responsible for applauding.
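A toy simulation (not the APP's real code; shard counts and ID ranges are made up) shows why range-sharding on an auto-increment ID piles all new writes onto the last shard, while hash-sharding spreads them evenly:

```python
from collections import Counter

NUM_SHARDS = 8
RANGE_SIZE = 1_000_000  # assume each range shard was sized for 1M users

def range_shard(user_id: int) -> int:
    # Shards 0..6 fill up first; every new ID then lands on shard 7.
    return min(user_id // RANGE_SIZE, NUM_SHARDS - 1)

def hash_shard(user_id: int) -> int:
    # Hashing spreads consecutive IDs across all shards.
    return user_id % NUM_SHARDS

# Simulate a burst of 10,000 brand-new registrations (IDs keep growing).
new_ids = range(8_000_000, 8_010_000)
range_counts = Counter(range_shard(i) for i in new_ids)
hash_counts = Counter(hash_shard(i) for i in new_ids)
print(range_counts)  # all 10,000 land on shard 7
print(hash_counts)   # 1,250 on each of the 8 shards
```

The range scheme concentrates 100% of new-user writes on one instance; the hash scheme gives each instance an equal 1/8 share.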
Crash Two: Overly Rigid Sharding Strategies
Some teams' sharding strategies are as rigid as old state-owned enterprise systems, such as fixed sharding quantities—regardless of how much data you have, I'm just sharding this much.
I've seen a logistics system hard-cut its waybill table into 10 pieces. Two years later each piece exceeded 100 million rows and queries slowed back to pre-liberation speeds. Want to expand? You'd have to re-split 10 pieces into 20, hang a "system maintenance" sign during the data migration, field complaint calls, and have the customer service girl tearfully assure callers that "the tech brothers are working hard."
Even more incredible is sharding by second: yes, there really was a design with one table per second. Querying three months of data meant merging 7,776,000 tables; the database simply "lay down and went on strike," and the programmers collectively took to "praying to Buddha online."
Crash Three: Cross-Database JOIN
Performing cross-database JOIN after database sharding is like having Beijing and Guangzhou branch companies hold offline meetings every day—the airfare and time costs can bankrupt you.
An e-commerce platform placed its user, order, and product tables in different "cities." During major promotions, answering "what did this user buy" required a video conference among three "cities." At peak times this conference ate all the "network bandwidth," normal business calls couldn't get through, and the system went "out of contact."
An even flashier operation was creating "data clones": copying the product table to every shard. When product prices changed, the update had to be synchronized to all the "clones," and a few of them always lagged half a beat behind, leaving users to see prices as a game of "spot the difference."
Crash Four: No Expansion Plan
Many teams only think about "how to live today" during design, not "what to do when there are more people tomorrow." When truly there are too many people and no room to live, they start frantically "building illegal structures."
A short video platform's user-behavior table started with 4 "rooms." Half a year later the data had grown fivefold, and they decided to expand to 8 "rooms" via "secret construction at 2 AM": moving user "luggage" from one room to another in the dead of night. The moving company (the migration tool) proved unreliable and put Zhang San's underwear in Li Si's wardrobe. In the morning users woke up asking "why did my viewing history become food videos," and the complaint hotline was flooded.
Crash Five: Monitoring as Decoration
After sharding, monitoring that stays at the single-machine level is like managing a multinational enterprise with an abacus: the books may eventually balance, but by then it's too late.
One team monitored only "overall health," not each "organ." One shard's disk filled up and the database developed "constipation," yet the aggregate metrics still read "healthy." Only when that shard came down with full-blown "gastroenteritis" and large numbers of requests started failing did the team notice. Recovery took four hours, and the lost orders were worth three years of year-end bonuses for the whole staff.
Even more outrageous: alerts went only to email, and to the company email at that. When the system crashed at night everyone was asleep, and the alerts lay quietly in the inbox like love letters that would never be read.
Five-Step "Sure Win" Guide
After seeing enough crash scenes, here is a "pit-avoiding house-building guide." Follow it, and your data can upgrade from a "crowded group rental" to a "smart community."
Step One: Choose Sharding Key
Remember one sentence when choosing a sharding key: your most common query patterns are your selection criteria.
For an e-commerce order table, the most frequent queries are "my orders" and "order details," so use the user ID as the "ID card": hash it and you know which "community" the data lives in. Occasionally need to look up by order number? Design the order number as "user ID + timestamp + random number," and one glance tells you which building to search.
Logistics waybills are usually looked up by waybill number, so shard by waybill number and attach a "door number" (an index) to the recipient's phone number. Log tables are almost always queried by time, so shard by time, and for months with heavy volume you can build "buildings within buildings" (a second level of splitting).
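A minimal sketch of the order-table idea above, assuming 16 shards and a 10-digit user ID prefix (the function names and number widths are illustrative, not a real library): the user ID routes the shard, and because the order number embeds the user ID, order-number lookups can be routed without a global index.

```python
import random
import time

NUM_SHARDS = 16

def shard_for(user_id: int) -> int:
    """Which 'community' (shard) a user's orders live in."""
    return user_id % NUM_SHARDS

def make_order_no(user_id: int) -> str:
    """Order number = user ID (zero-padded) + millisecond timestamp + random suffix."""
    return f"{user_id:010d}{int(time.time() * 1000)}{random.randint(100, 999)}"

def shard_for_order(order_no: str) -> int:
    """Recover the user ID prefix, then route exactly like a user query."""
    return shard_for(int(order_no[:10]))

order_no = make_order_no(42)
assert shard_for_order(order_no) == shard_for(42)  # same building, guaranteed
```

The key property is that both query paths ("my orders" by user ID, "order details" by order number) resolve to the same single shard.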
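For the time-sharded log table, routing can be as simple as deriving a table suffix from the event timestamp. A hedged sketch (the table-name pattern `app_log_YYYYMM` is an assumption for illustration):

```python
from datetime import datetime

def log_table_for(ts: datetime) -> str:
    """Route a log row to its monthly table by timestamp."""
    return f"app_log_{ts.strftime('%Y%m')}"

print(log_table_for(datetime(2024, 3, 15)))  # app_log_202403
```

A time-range query then only touches the handful of monthly tables the range spans, instead of the whole history.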
Blood and tears loyalty advice: Don't use fields with only two or three values like gender or status as sharding keys, unless you want to witness the spectacle of "girls' dormitory" overflowing while "boys' dormitory" sits empty.
Step Two: Sharding Strategy
A sharding strategy needs "development vision"; you can't paint yourself into a corner. Two smart approaches:
- Plan "communities" in advance: Build 128 "units" (logical shards) from day one, even if each unit currently houses only three households. Small units cost the database little, and when the population grows you build new "buildings" (physical instances) next door and move some units over; the old residents never have to repack.
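The "plan communities in advance" idea can be sketched as follows, assuming 128 logical shards (all names are illustrative): the user-to-logical-shard mapping is fixed forever, and only the logical-to-physical assignment changes when instances are added.

```python
LOGICAL_SHARDS = 128  # fixed for the system's lifetime

def logical_shard(user_id: int) -> int:
    """Stable mapping: a user's logical 'unit' never changes."""
    return user_id % LOGICAL_SHARDS

def physical_instance(logical: int, num_instances: int) -> int:
    """Contiguous assignment of logical shards to physical 'buildings'.
    Real systems often keep an explicit routing table instead, so
    rebalancing can move one logical shard at a time."""
    return logical * num_instances // LOGICAL_SHARDS

# With 2 instances: logical shards 0-63 live on instance 0, 64-127 on 1.
# Growing to 4 instances re-homes whole logical shards, never splits one,
# so no row's logical address (and thus no routing logic) ever changes.
print(physical_instance(63, 2), physical_instance(64, 2))  # 0 1
```

Expansion becomes "move some units to the new building" rather than "re-hash every resident," which is what makes it cheap.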
- Intelligent dynamic partitioning: Use "smart property management" like ShardingSphere, which automatically creates "new units" when data volume grows. For example, sharding by day, and if one day has an event with data surge, automatically split into morning table and afternoon table, saving worry and effort.
Step Three: Avoid Cross-Database JOIN
Cross-database JOIN is a performance "money shredder"; avoid it whenever possible. Two practical tricks:
- Appropriate redundancy: If the order table frequently needs to display product names, copy the product name into the order table, so order queries don't have to "visit relatives" in the product table. Worried about inconsistency? Build a "community broadcast station" (a message queue): when a product is renamed, broadcast the change and every order shard updates its copy.
- Split the query: To answer "what did the user buy," first query the order table for the purchases, collect the product IDs, then query the product table for details, and finally assemble the results at the "community center" (the application layer). It's an extra trip, but far cheaper than a cross-community meeting.
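The "broadcast station" pattern might look like the following sketch, where an in-memory queue stands in for a real broker (Kafka, RocketMQ, etc.) and all names are made up for illustration:

```python
from queue import Queue

broadcast = Queue()  # stand-in for a real message broker
order_shards = [dict() for _ in range(4)]  # shard -> {order_id: order row}

def publish_name_change(product_id: int, new_name: str) -> None:
    """The product service announces the rename on the 'broadcast station'."""
    broadcast.put({"product_id": product_id, "name": new_name})

def consume_and_sync() -> None:
    """Each order shard's consumer updates its redundant product-name copy."""
    while not broadcast.empty():
        event = broadcast.get()
        for shard in order_shards:
            for order in shard.values():
                if order["product_id"] == event["product_id"]:
                    order["product_name"] = event["name"]

# An order row carrying a redundant product name:
order_shards[0][1] = {"product_id": 101, "product_name": "Old Keyboard"}
publish_name_change(101, "Mechanical Keyboard")
consume_and_sync()
```

The trade-off from the story above still applies: consumers lag "half a beat behind," so this buys eventual, not instant, consistency.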
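The split-query pattern, sketched with stubbed data-access functions (the `fetch_*` helpers stand in for real single-shard DAO calls):

```python
def fetch_orders(user_id: int) -> list[dict]:
    """Trip 1: hit only the user's order shard."""
    return [{"order_id": 1, "product_id": 101},
            {"order_id": 2, "product_id": 102}]  # stubbed shard response

def fetch_products(product_ids: list[int]) -> dict[int, dict]:
    """Trip 2: batch-fetch details from the product shard."""
    catalog = {101: {"name": "Keyboard"}, 102: {"name": "Mouse"}}
    return {pid: catalog[pid] for pid in product_ids}  # stubbed response

def user_purchases(user_id: int) -> list[dict]:
    """Assemble at the 'community center' (application layer)."""
    orders = fetch_orders(user_id)
    products = fetch_products([o["product_id"] for o in orders])
    return [{**o, **products[o["product_id"]]} for o in orders]

print(user_purchases(1))
```

Two cheap indexed lookups plus an in-memory merge replace one expensive cross-database JOIN; batching the second trip keeps it at exactly two round trips regardless of order count.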
Step Four: Expansion Plan
Expansion is an inevitable part of growth. Two elegant "renovation plans":
- "Dual-write" transition: During the renovation, write new data to both the old and new "stores" while customers still shop at the old one. Once the new store is fully stocked (backfilled and verified), quietly steer customers over, then close the old store. The upside is zero downtime; the downside is having to watch closely that both stores' "goods" (data) stay consistent.
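A minimal sketch of the dual-write idea, with toy in-memory stores standing in for the old and new database clusters (all class and flag names are illustrative):

```python
class Store:
    """Stand-in for a database cluster."""
    def __init__(self):
        self.rows = {}
    def write(self, key, value):
        self.rows[key] = value
    def read(self, key):
        return self.rows.get(key)

old_store, new_store = Store(), Store()
reads_on_new = False  # flip only after backfill + verification complete

def write(key, value):
    old_store.write(key, value)  # old store stays authoritative
    new_store.write(key, value)  # new store catches up in parallel

def read(key):
    # Customers keep shopping at the old store until the switch.
    return (new_store if reads_on_new else old_store).read(key)

write("order:1001", {"amount": 99})
print(read("order:1001"))  # served from the old store for now
```

In practice the tricky parts are the historical backfill and the consistency check between the two stores; the flag flip itself is the easy step.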
- "Smart property" management: Use "smart property companies" like ShardingSphere; for expansion, just make a phone call "add a building," and the property automatically handles resident allocation, not even needing to worry about moving.
Step Five: Monitoring and Alerts
After sharding, monitoring must drill down to each "resident's" health. A recommended configuration:
| Monitoring Object | Recommended Tools | Alert Threshold (alarm when exceeded) |
|---|---|---|
| Server Resources | Prometheus monitoring suite | CPU usage (>80%), Memory (>85%), Disk (>80%) |
| Database Health | Database-specific health checker | Connections (>80% of max), Slow queries (>5/minute), Query speed (>500ms) |
| Shard Status | Middleware's built-in "butler" | Shard down (report immediately), Data delay (>100ms), Data volume (watch growth trend) |
| Business Metrics | Custom instrumentation | Order success rate, Payment speed, Error rate and other core metrics |
Alert channels should be diverse: WeChat messages, SMS, and phone calls, escalating to a direct call for critical issues, so that even someone in the shower knows "the community is on fire."
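The per-shard (not just aggregate) checking described above can be sketched like this, using the 80% disk threshold from the table; the probe input is a stand-in for whatever your exporter actually reports:

```python
DISK_ALERT_PCT = 80  # threshold from the monitoring table above

def check_shards(disk_usage_by_shard: dict[str, float]) -> list[str]:
    """Probe every shard individually; return one alert per offender."""
    return [f"ALERT: {shard} disk at {pct:.0f}%"
            for shard, pct in disk_usage_by_shard.items()
            if pct > DISK_ALERT_PCT]

# The aggregate average here (~55%) looks perfectly healthy,
# but shard_3 is about to develop "constipation":
alerts = check_shards({"shard_1": 40, "shard_2": 35, "shard_3": 91})
print(alerts)  # ['ALERT: shard_3 disk at 91%']
```

This is exactly the failure mode from Crash Five: an average hides the one full disk, while a per-shard check surfaces it immediately.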
Summary
Database sharding is not a medal for technologists but a tool for solving specific problems. For many teams fighting fires late at night, the root cause isn't a lack of technical skill but overcomplicating a simple problem: using cool technology for the sake of using cool technology.
Before deciding to "split the family," ask three soul-searching questions:
- Is the data really big enough to require splitting? (Don't touch it before you reach the tens-of-millions level)
- Have you figured out your query habits? (Think through how 80% of queries will go)
- Is there a plan for future growth? (Don't wait until squeezed to think about adding floors)
Technology always serves business, not for decorating resumes. A simple, stable, easy-to-maintain solution is far more valuable than a "cool but fragile" design.
After all, everyone hopes to receive a girlfriend's goodnight message at 2 AM, not a server alert SMS. Design rationally, implement cautiously, and let your system truly "sleep more soundly than you."