mengu on web programming.

A Depressive Journey With MongoDB

Disclaimer
----------

You are about to read a long story about how I got burnt by MongoDB and depressed by it. I am not blaming MongoDB, or anyone using, advocating or developing it. I am blaming myself. MongoDB is a good tool. You can use it, but make sure it is what you need and that it handles your requirements well. This is not specific to MongoDB; it applies to every tool we use.

A Brief Intro
----------

I work for the top entertainment TV production company in Turkey. This season we launched a new show called "The Voice of Turkey". Americans know this show as just "The Voice", the Dutch as "The Voice of Holland". Long story short, the boss called us and explained that he wanted a page for each contestant with their detailed information, and a wall on it where members and the contestants would post. We thought OK, we can use our current infrastructure and go with caching for the reads. After he told us he was going to make live announcements, the only thing on my mind was MongoDB. I knew it could handle heavy reads and writes, so why not use it? I discussed it with the rest of the team. We used it.

First Fail: Too Many Open Files
----------

We developed the new application in 4 days with two MongoDB instances, 1 master and 1 slave. We could not test the application because we were told about it on Tuesday and it had to be ready on Sunday, the day the show would be on live air. The time came and the boss made the announcement: "go to this web site and post to your favourite contestant's wall!" Oh my god! 30K people jumped in, like Battlestar Galactica jumping from one coordinate to another, within 1-5 seconds of him mentioning the web site. We failed, hard. The MongoDB instances were throwing "could not start thread, too many open files" errors. In a very short time I figured out it was an "ulimit" issue. I raised the "open files" limit and they could open the new connections.
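For reference, the permanent version of this kind of limit change on Linux looks something like the following. The values are illustrative, not a recommendation, and I am assuming mongod runs as a "mongod" user; tune both for your own machines:

```shell
# Raise the limits for the current shell session only:
ulimit -n 64000   # "open files"
ulimit -u 20000   # "max user processes"

# To make them permanent, add lines like these to
# /etc/security/limits.conf (user name is an assumption):
#   mongod  soft  nofile  64000
#   mongod  hard  nofile  64000
#   mongod  soft  nproc   20000
#   mongod  hard  nproc   20000

# Verify after logging back in:
ulimit -n -u
```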
They could not respond to the queries, though; they were just accepting the connections.

Second Fail: Maximum Connection Limit
----------

After solving the first problem, I saw a new error in the logs saying the server had reached the maximum number of connections and could not open new ones. I did some research on how to overcome this and found that I also had to increase the "max user processes" limit. I increased it on both machines; the new limit was 20K. No matter how much higher I set that figure, MongoDB stayed at 20K. I guess this is the maximum number of connections a MongoDB instance can accept. If there is any other setting to increase this limit, please let me know.

The Most Exhausting Day
----------

The most exhausting day of my life was over. Again, we had failed so hard. You cannot imagine the shock we were in. We had handled more than 30K requests in a window that short before with no problem; the app was always responsive. But this was the first time we had used MongoDB with this app. My trust in MongoDB was shaken; I realized that it was overrated. We could have implemented this with Redis pub/sub and MySQL, using Netscaler for caching the reads. Our manager blamed me and MongoDB for the failure, even though the whole team had agreed to use MongoDB. None of my teammates stood up. As a side note, there are 4 screens in the backstage of the show used for statistics and for displaying information about the current contestant. When the host said "now we are back to the backstage", due to the errors in the app, they were all displaying exceptions. Scandalous. This was the third fail.

Preparation For The Next Week
----------

On Monday I started further investigation of what we could do.

* I switched from the Master/Slave structure to a Replica Set so we get automatic failovers.
* Added 2 more MongoDB instances.
* With the replica set I had a stupid problem: we could not write to the primary.
Unfortunately we could not get it working, so we switched back to the Master/Slave structure.

* This time we had 1 master and 3 slaves.
* I tried persistent connections. They fire up as many connections as you have Apache processes, and when Apache kills the child processes over the limit of the MaxSpareServers directive, those connections to MongoDB are killed as well. However, as I needed more connections due to the load, I am closing the connections explicitly.
* In our 30K test, the master instance opened up connections, wrote to the db and closed the connections. It passed the test. Zero errors.
* We were not happy with the read test results, so we started distributing the read queries with Netscaler to all 3 slaves. This was good, but one thing was wrong: why on earth were these connections not closing even though we were closing them in the code? I wanted to check what was going on with Netscaler. Apparently someone had set the connection timeout to 9000 secs, which means a connection to a MongoDB instance through Netscaler stays open for 2.5 hours. I set it to 20 secs. I am not looking for someone to blame, but it was probably our sys admin.
* Read queries are now cached as well. Only 1 connection is opened for reads.
* Added 1 more slave, but it is only used by those backstage screens, so they don't get any exceptions, which means no scandals.
* Voila! The results are great. I'm happy, and last Sunday was a breeze.

Some Words On The Community
----------

In the past 11 days I have asked various questions on both the mongodb-users mailing list and the #mongodb channel on freenode. Some of my questions were answered and some went unanswered. Among my experiences with communities such as the Apache, nginx, MySQL, PostgreSQL, Python, Ruby, PHP, Grails, Django and Rails communities, I can tell you this was the worst. There are helpful people of course, and I deeply appreciate their existence and help.
I don't remember names, but when you ask a question on either platform, you'll face the same helpful people or the ignorant smart-ass ones. I guess they think we are making up problems, since they don't have them. By the way, we didn't buy commercial support. When not all of my questions are answered and I cannot prove the cost is worth it, how can I convince my managers to buy commercial MongoDB support that costs $4K? I cannot. Heck, even I am not convinced to use it for anything else again.

The Moral Of This Story
----------

* Do not accept responsibility for anything you had to do in a very limited time.
* Do not accept the job if the timeline is short, the work is big and the load is heavy.
* Load test your application no matter what the cost. If they want to get it all, they need to pay for it all.
* Increase your "open files" limit. For every server.
* Increase your "max user processes" limit. For every server.
* Cache everything you can.
* Do not trust any of your tools until they prove themselves.
* Add another instance of whatever you are using and use it only for displaying data elsewhere, safely.
* Do not treat anything else like Redis. Redis is heaven-sent. I did. I got burnt.

Questions to MongoDB core/users
----------

If I am going to cache every read, why would I continue using MongoDB? For heavy writes, I can fire up Redis instances for pub/sub, do the write operations through them to MySQL, and read from MySQL as well, with caching. Had we gotten all those benchmarks wrong? Is MongoDB overrated at all? What are we doing wrong? How should we have built our infrastructure? Is it PHP and its driver? Is anyone else getting such loads and handling them well with another language and driver?

Thanks for reading. Go ahead and share your experience with MongoDB. Better yet, guide our way here and share your knowledge as well.


Jaromir Dvoracek said on 12/01/2012 16:44
Hello, we are hosting a MongoDB replica set for a heavily loaded server written in PHP and served by Apache. All I can say is that the PHP mongo driver is buggy when set up with a replica set; with mongoid for Ruby everything is all right. We worked around it with monitoring that reconfigures the application config on the fly.

ahofmann said on 12/01/2012 16:56
I don't think it's a good idea to crank up a new DB system on a high-traffic website without having enough time to test for problems like this. You could have written this article about CouchDB or even MySQL if you had never used them before and tried to implement such a feature. So IMHO MongoDB is not the problem here. And yes: you can achieve the same thing with MySQL and Redis/Memcached. But still, a few points on your MongoDB experience: Master/Slave in MongoDB is deprecated; we have used replica sets since 1.6. MongoDB moves at a fast pace and you should use the most recent version. If you don't want to update every 10 weeks, don't use MongoDB. Next point: you're opening 30k connections to MongoDB. Why? I'm managing 1000 concurrent users on my servers with 400 PHP processes = 400 connections to MongoDB. And yes, I've hit the open files problem too. This was fixed once on the servers and that's it. Nothing to complain about. MySQL has about 400 config vars that you can tune, so again: you have to know your db systems if you want to use them on a high-traffic application. To the question of whether MongoDB is overrated: it depends. It's not a magic cure for every problem that you get in MySQL. It's not a replacement for MySQL. But at its sweet spots (e.g. having all data in RAM and being as fast as memcached) it's wonderful.

@chrisco said on 12/01/2012 17:25
Interesting report, thanks. Sounds less like a Mongo problem and more like a "trying to do too much, too fast, too big" problem. The Greeks call this "hubris". No shame. We all learn about it one (or more) times :)

Brent Hoover said on 12/01/2012 17:57
Thanks for sharing your story with us. I do think this is not so much about Mongo as a cautionary tale of "when your butt is on the line, go with what you know and what has been proven". That kind of load with that kind of visibility is no joke, and if you haven't been there you don't know what it's like. Everybody likes to use "the new hotness", but you have to say something for things that are tried and true, even when you know those things have severe flaws. I have learned this the hard way as well. Hopefully someone else will rethink their strategy and save themselves some grief thanks to you.

Manas Rawat said on 12/01/2012 18:19
Sorry you had to go through so much trouble. But with 5 days to deliver and such high traffic, a new database is clearly a bad idea. Rolling it out without load testing, and without even knowing the config settings, was clearly risky. We are going to build a new product very soon and I am researching MongoDB. Your story is going to help me keep it real :).

Adrian O'Connor said on 12/01/2012 18:25
Brent and @chrisco said the same thing, but I'll say it again anyway because it is your problem here: don't try out new stuff on important projects. Go with what you know, even if it isn't perfect. If you want to try out something new, do it on a project where it doesn't matter if it goes wrong.

Scott said on 12/01/2012 18:41
I wonder how many req/sec you had and what the specs of the servers are (cpu/ram), because we handle pretty heavy sites quite well using just two servers for mongo (2GHz/4GB RAM).

Nate said on 12/01/2012 20:40
I hope another lesson learned is that given such a short window (4 days, right?), it's probably not the time to try out technologies that are new to you. You mentioned you could have done this with MySQL and Netscaler etc. If you can't test your app at all, much less load test it, because of the time constraints, go with what you know. Go with a system that you could build in your sleep. Doing this minimizes the risk, because you'll know how MySQL behaves under pressure. When things start failing with new systems, it could be anything.

GuinessDraft said on 12/01/2012 21:00
Let me get this straight (15+ years of software development here): you had a very quick-burn project, you decided to base it on top of a product that you had not used in a production capacity before, and you're blaming the product for the problems that happened? This may sound harsh, but you need to take ALL of the responsibility for this failure. Any software developer who knows anything would understand that you just don't make this type of mistake. I've been using MongoDB for 1.5 years now. I will be the first to admit it has problems, but overall the system does work *if you take the time to learn it*. That doesn't mean just working through the tutorial and thinking you're ready to go to production. I understand it's nearly impossible to test a site in that amount of time. You still needed to put in the due diligence. Yes, you tried the mailing list and so on. Did you Google anything? You should have done a LOT more research on deployment issues with MongoDB. YouTube has lots of videos of people discussing it, some even from 10gen explaining things that don't work and other things to watch out for. There are also hundreds of articles written about it as well. This is certainly not a MongoDB failure, but a developer failure. Write this episode up as a good learning experience. If you continue on your journey as a software engineer, I can guarantee you will run into a similar situation again. Heed the lessons learned from this experience and use them to guide you to future success.

michael said on 12/01/2012 22:03
It sounds like you would not have had time for load testing even if you had gone with tools you already knew. I think the lesson here is to set the expectations of the customer. The customer needed to be told that the system would be in an alpha state and subject to failure. I guess I'm fortunate enough to work for companies that don't expect to go from zero to polished production-quality in 4 days.

Will said on 12/01/2012 22:47
One lesson that seems to be missing: use connection pooling. There is no reason you need tens of thousands of open connections to your db. Your app should be re-using a finite number of connections. This is true of all DBs, and using a connection pool (which most MongoDB drivers have) is how you do it.

Kent said on 12/01/2012 23:02
Mengu should slow down and learn his tools before he uses them in production. No time for testing? Really?

Mengu Kagan said on 13/01/2012 03:43 AM
I think some of the readers got me wrong; I should have expressed myself better. We were already playing with MongoDB, and we had done a fair share of research before deciding, as our web site is a high-traffic one. We have no right and no excuse to try new things with it. We checked out who was using MongoDB in our circle and saw at least two high-traffic apps like ours, ones that friends of ours work on, using MongoDB without any problems. The reason we chose MongoDB was its capability for fast writes under high load, and MongoDB has not failed me on writes. What we did wrong was treating MongoDB like it was Redis: we had not cached read queries, and we had opened thousands of connections for reading. We also tried persistent connections and had problems with them. I don't know how we skipped connection pooling. Even if we had used Redis pub/sub and MySQL with caching, we would have failed again; we had no time to load test this app. As I said at the beginning, this was not something specific to MongoDB, which is why I have not bashed or blamed MongoDB; I have only mentioned some problems with it. This failure has taught me well. I am not deploying a single application to production without load tests again, and I am going to learn every possible failure point of the tools I use in production.
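For anyone wondering what the pooling mentioned above buys you, here is a minimal generic sketch. This is illustrative Python with a dummy connection factory, not our PHP stack, and real MongoDB drivers ship a tuned pool of their own:

```python
import queue

class ConnectionPool:
    """A minimal fixed-size connection pool.

    `factory` is any zero-argument callable returning a new
    connection object; here we use a fake one for the demo.
    """

    def __init__(self, factory, size=10):
        self._conns = queue.Queue(maxsize=size)
        for _ in range(size):
            self._conns.put(factory())

    def acquire(self, timeout=5):
        # Blocks for a free connection instead of opening a new one.
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)

class FakeConn:
    """Stand-in for a real driver connection; counts instances."""
    count = 0
    def __init__(self):
        FakeConn.count += 1

pool = ConnectionPool(FakeConn, size=5)
conns = [pool.acquire() for _ in range(5)]
for c in conns:
    pool.release(c)
# However many requests cycle through, only 5 connections ever open.
print(FakeConn.count)  # 5
```

The point is that request count and connection count are decoupled: under load, requests wait briefly for a free connection instead of opening connection number 20,001.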

Andy Marks said on 13/01/2012 07:19 AM
This is a great post, but I also agree with the other posters who have emphasised that the issues mainly came about through an unrealistic deadline for the nature of the project. It's a shame things didn't work out for you, but I think your conclusions are great lessons to learn going forward.

lucas renan said on 13/01/2012 17:23
did you try to optimize your queries?

Didier Rano said on 13/01/2012 19:26
Often Redis is compared with Memcached. In fact, Redis is also a MongoDB equivalent. If your data fits in RAM (and it often does, except for ego problems!), Redis is really a perfect solution. I tried MongoDB, but I had to migrate to another solution: Redis. Why? My Debian VM crashed several times a day...

Brent Hoover said on 14/01/2012 20:42
Somebody smart should build something pre-made for this sort of high-traffic, short-lived, marketing-related micro-site. This is a classic last-minute marketing request, and even with all the hindsight in the world there are so many mistakes that can be made; typically these projects are high-visibility with no room for failure. Not only are they risky, but they are a huge drain on engineering staff. Maybe there is something out there already, I don't know. But I literally had someone walk in yesterday and say "Hey, we just need to put up a mobile website real quick for this promotion." Probably all of us have a bit of macho in us that says "f* yeah, I can do that", but we also probably have other work that still needs to get done and can't afford to have us disappear for four days.
