A Depressive Journey With MongoDB
2012-01-11 05:36:55

Disclaimer
----------

You are about to read a long story about how I got burnt by MongoDB and grew depressed with it. I am not blaming MongoDB or anyone using, advocating or developing it. I am blaming myself for this. MongoDB is a good tool. You can use it, but make sure it is what you need and that it handles your requirements well. This is not specific to MongoDB; it applies to every tool we use.

A Brief Intro
----------

I work for the top entertainment TV production company in Turkey. This season we launched a new show called "The Voice of Turkey". Americans would know this show as just "The Voice", and the Dutch as "The Voice of Holland".

Long story short, the boss called us in and explained that he wanted a page for each contestant with their detailed information, and a wall on it where members and the contestants could post. We thought: OK, we can use our current infrastructure and rely on caching for reads. After he told us he was going to make live announcements, the only thing I had in mind was MongoDB. I knew that it could handle heavy reads and writes, so why not use it? I discussed it with the rest of the team. We used it.

First Fail: Too Many Open Files
----------

We developed the new application in 4 days with two MongoDB instances, 1 master and 1 slave. We could not test the application because we were told about it on Tuesday and it had to be ready on Sunday, the day the show would air live.

The time came and the boss made the announcement: "Go to this web site and post to your favourite contestant's wall!" Oh my god! 30K people jumped in, like Battlestar Galactica jumping from one coordinate to another, within 1-5 secs of him mentioning the web site.

We failed so hard. The MongoDB instances were throwing "could not start thread, too many open files" errors. In a very short time I figured out it was a "ulimit" issue. I raised the "open files" limit and the instances could open new connections.
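
For readers who have not met the "open files" limit before, here is a minimal Python sketch of inspecting and raising it from inside a process on a Unix-like system. It is only an illustration: the function name `raise_open_files_limit` and the target of 20000 are assumptions, and it is not necessarily how the limit was changed on the servers in this story. It also only affects the current process and its children, not an already running database.

```python
import resource


def raise_open_files_limit(target):
    """Raise the soft 'open files' limit toward `target`, never lowering it.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        # Nothing to raise.
        return soft, hard
    # An unprivileged process can only raise its soft limit up to the hard limit.
    wanted = target if hard == unlimited else min(target, hard)
    if wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # Some platforms reject values above a kernel maximum; keep the old limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_open_files_limit(20000))
```

The same `resource` module exposes `RLIMIT_NPROC` on Linux, which corresponds to the "max user processes" limit that shows up later in this post.
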
The instances still could not respond to queries, though; they were only accepting connections.

Second Fail: Maximum Connection Limit
----------

After solving the first problem, I saw a new error in the logs saying that the instance had reached the maximum number of connections and could not open new ones. I researched how to overcome this and found that I also had to increase the "max user processes" limit. I increased it on both machines and the new limit was 20K. No matter how big that figure was, MongoDB stayed at 20K. I guess this is the maximum number of connections a MongoDB instance can accept. If there is any other setting to increase this limit, please let me know.

The Most Exhausting Day
----------

The most exhausting day of my life was over. Again, we failed so hard. You cannot imagine the shock we were in. We had seen more than 30K requests in such a short window before and handled them with no problem. The app had always been responsive. But this was the first time we had used MongoDB with this app.

My trust in MongoDB was shaken; I realized that it was overrated. We could have implemented this with Redis pub/sub and MySQL, using Netscaler to cache the reads. Our manager blamed me and MongoDB for the failure, even though we had all agreed as a team to use MongoDB. None of my teammates stood up for me.

As a side note, there are 4 screens backstage at the show that display statistics and some information about the current contestant. When the host said "now we are back to the backstage", they were all displaying exceptions because of the errors in the app. Scandalous. This was the third fail.

Preparation For The Next Week
----------

On Monday I started investigating further what we could do.

* I switched from the Master/Slave structure to a Replica Set so we would get automatic failover.
* Added 2 more MongoDB instances.
* With the replica set I had a stupid problem. We could not write to the primary with replica sets.
Unfortunately we could not get it to work, so we switched back to the Master/Slave structure.
* This time we had 1 master and 3 slaves.
* I tried persistent connections. They open as many connections as you have Apache processes. When Apache kills child processes above the limit of the MaxSpareServers directive, their connections to MongoDB are killed as well. However, as I needed more connections due to the load, I am closing the connections explicitly.
* In our 30K test, the master instance opened connections, wrote to the db and closed the connections. It passed the test. Zero errors.
* We were not happy with the read test results. We started distributing the read queries across all 3 slaves with Netscaler. This was good, but one thing was wrong. Why on earth were these connections not closing even though we were closing them in the code? I wanted to check what was going on with Netscaler. Apparently someone had set the connection timeout to 9000 secs, which means a connection to a MongoDB instance through Netscaler stays open for 2.5 hours. I set it to 20 secs. I am not looking for someone to blame, but it was probably our sys admin.
* Read queries are now cached as well. Only 1 connection is opened for reads.
* Added 1 more slave, but it is only used by those screens so they don't get any exceptions, which means no scandals.
* Voila! The results are great. I'm happy and last Sunday was a breeze.

Some Words On The Community
----------

In the past 11 days I have asked various questions on both the mongodb-users mailing list and the #mongodb channel on freenode. Some of my questions were answered and some went unanswered. Among my experiences with communities such as Apache, nginx, MySQL, PostgreSQL, Python, Ruby, PHP, Grails, Django and Rails, I can tell you this one was the worst. There are helpful people of course; I deeply appreciate their existence and help.
I don't remember names, but when you ask a question on either platform, you'll face either the same helpful people or the ignorant smart-ass ones. I guess they think we are making up problems since they don't have them.

By the way, we didn't buy commercial support. When not all of my questions are answered and I cannot prove that it is worth the cost, how can I convince my managers to buy commercial MongoDB support that costs $4K? I cannot. Heck, even I am not convinced to use it for anything else again.

The Moral Of This Story
----------

* Do not accept responsibility for anything that you had to do in a very limited time.
* Do not accept the job if the timeline is short, the work is big and the load is heavy.
* Load test your application no matter what the cost. If they want to get them all, they need to pay for them all.
* Increase your "open files" limit. For every server.
* Increase your "max user processes" limit. For every server.
* Cache everything you can.
* Do not trust any of your tools until they prove themselves.
* Add another instance of whatever you are using and only use it for displaying data elsewhere, safely.
* Do not treat anything like Redis. Redis is heaven sent. I did. I got burnt.

Questions to MongoDB core/users
----------

If I am going to cache every read, why would I continue using MongoDB? For heavy writes, I can fire up Redis instances for pub/sub, do the write operations through them to MySQL, and read from MySQL as well with caching.

Did we get all those benchmarks wrong? Is MongoDB overrated after all? What are we doing wrong? How should we have built our infrastructure? Is it PHP and its driver? Is anyone else getting such loads and handling them well with another language and driver?

Thanks for reading. Go ahead and share your experience with MongoDB. Better yet, guide our way here and share your knowledge as well.
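
As a footnote to "cache everything you can" and to the cached read queries mentioned earlier, here is a rough Python sketch of a read-through cache with a time-to-live. It is not the setup from this story (the reads there went through Netscaler and PHP). The class name `TTLCache`, the key `"wall:42"` and the loader `load_wall_posts` are hypothetical names made up for the illustration.

```python
import threading
import time


class TTLCache:
    """Read-through cache: values are reloaded at most once per TTL window."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can fake time
        self._lock = threading.Lock()
        self._entries = {}             # key -> (expires_at, value)

    def get(self, key, loader):
        # The lock is held while loading, so concurrent readers wait for one
        # backend query instead of each opening their own. Simple, but it
        # serializes loads for all keys.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]
            value = loader()
            self._entries[key] = (now + self._ttl, value)
            return value


def load_wall_posts():
    # Hypothetical stand-in for the real database read.
    return ["first post", "second post"]


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=20)
    print(cache.get("wall:42", load_wall_posts))  # loads from the "database"
    print(cache.get("wall:42", load_wall_posts))  # served from the cache
```

With a cache like this in front of the reads, the database only sees one read per key per TTL window, no matter how many visitors arrive at once.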