What I’ve been up to? Nine lessons learned.
It’s been some time. These last eight months or so I haven’t been doing much software development. Most of my work has been in support. I’ve done consulting work.
This autumn I was hired to help a team with an application that was constantly crashing. The application was closed source, so we could not easily change it or trace how it worked. The first thing we did was to start monitoring stats. I wrote a simple monitor that tracks the health of a Tomcat process. This gives a feeling for what’s happening inside the application, both in normal times and in times when it’s stressed. We kept a screen with BareTail open on these log files and watched what was happening.
Every two or three days, the count of busy threads in Tomcat started rising, from the normal level of 2-3 up to the maximum of 150. At that point there was nothing to do but restart the server.
Lesson 0: set up watchdogs. It’s easy to write a custom status page sniffer. Nagios keeps an eye on many variables (memory, CPU load, failing services and the like). New Relic is a very good all-round solution when you don’t know where to start.
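To make the “status page sniffer” idea concrete, here is a minimal sketch. It is not the monitor from this story. It assumes that Tomcat’s manager application serves its status page as XML with a `threadInfo` element per connector carrying `currentThreadsBusy` and `maxThreads`; the sample document, the connector name and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the XML the Tomcat manager status page
# is assumed to return. All numbers and the connector name are made up.
SAMPLE_STATUS = """<status>
  <jvm><memory free="52428800" total="157286400" max="524288000"/></jvm>
  <connector name="http-8080">
    <threadInfo maxThreads="150" currentThreadCount="25" currentThreadsBusy="3"/>
  </connector>
</status>"""


def busy_threads(status_xml):
    """Return {connector name: (busy threads, max threads)}."""
    root = ET.fromstring(status_xml)
    result = {}
    for connector in root.iter("connector"):
        info = connector.find("threadInfo")
        if info is None:
            # Connector without thread information: nothing to report.
            continue
        result[connector.get("name")] = (
            int(info.get("currentThreadsBusy")),
            int(info.get("maxThreads")),
        )
    return result


if __name__ == "__main__":
    for name, (busy, maximum) in busy_threads(SAMPLE_STATUS).items():
        print(f"{name}: {busy}/{maximum} busy threads")
```

In real use the XML would come from an HTTP request to the status page rather than from a constant, and each reading would be appended to a log file with a timestamp so that it can be watched with a tail tool.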
The usual suspect in such cases is the database, and watching the process monitor on SQL Server revealed that many connections were blocking each other. We called in one consultant, then another, and finally a third one who knew how to deal with this problem.
Lesson 1: a consultant may know how to deal with your problem, or he may be out of his depth. You had better make sure that any consultant you call in knows what they’re doing. Warning signs are when they try stuff at random (“I read in a blog that changing parameter XYZ could be helpful!”) or just give you canned responses (“uh, better install the latest service pack…”).
The last SQL Server consultant really knew his stuff, and was able to solve the SQL Server hang-ups by cleaning up and optimizing the SQL Server installation. It turned out that this database had been running with no maintenance for years.
Lesson 2: if you run a business that depends on database-backed applications, you had better have a full-time DBA looking after maintenance.
After the SQL Server tune-up the application ran considerably better. But there were more problems on the horizon. We still had a few occasions when the application did block. I didn’t know how to get insight into what happened in those cases. I wanted a thread dump of the Java Virtual Machine, but it turns out that on Windows this is not an easy thing to do.
Lesson 3: Windows works. You can’t really say it doesn’t. And this is the most charitable thing I can say about it. All the same, Unix is so much better than Windows in every which way. It has always been.
In the end I found a neat utility called psexec that helped us solve that problem. We set up a script that checked the number of busy threads on Tomcat every 30 seconds, and created a thread dump for us. This showed that our Tomcat threads were busy trying to deliver a mail message to an SMTP server. A quick investigation found that the application was somehow configured to use a temporary, slow SMTP server that nobody remembered. We changed the configuration to point to a fast server and this problem was fixed.
The last failure mode we encountered happened rarely; at times the memory consumption in Tomcat started growing from the usual 100-150MB up to its configured maximum of 500MB, and stayed there. At that point, the application was slowing to a crawl, and there was nothing to do but, again, reboot it. Try to imagine our feelings of helplessness while this was happening!
Our next step here was to run jmap to get a memory dump, and then analyze it with the Eclipse Memory Analyzer. The culprit seemed to be some data objects that we knew the application was allocating to contain the result of queries. Whenever a user requested a report of, say, 50K records, the app allocated memory to hold all of the result, and saved it… guess where… in the user session! Then it showed the first 100 rows to the user. Since that query was slow, the user would stare at a blank screen for a minute or two, and then reload (F5!), which caused Tomcat to perform another query. Each of these 50K-row queries was allocating 50MB of RAM in the JVM. Boom, boom, boom, once this started it brought Tomcat to its knees.
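A watchdog of this kind can be sketched in a few lines. This is an illustration, not the script we used: it assumes the dump is taken only when the busy count reaches a threshold, it calls the JDK’s `jstack` directly instead of going through psexec, and the names `watch`, `jstack_dump` and `threshold` are hypothetical.

```python
import subprocess
import time


def watch(read_busy, dump, threshold, interval_seconds=30,
          max_checks=None, sleep=time.sleep):
    """Poll read_busy() and call dump() whenever the count reaches threshold.

    Runs forever when max_checks is None. Returns the number of dumps taken.
    """
    dumps = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if read_busy() >= threshold:
            dump()
            dumps += 1
        checks += 1
        # Do not sleep after the final check of a bounded run.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
    return dumps


def jstack_dump(pid, out_path):
    """Write a thread dump of the JVM with the given pid to out_path.

    Assumes the JDK's jstack tool is on the PATH and is allowed to attach
    to the process.
    """
    with open(out_path, "w") as out:
        subprocess.run(["jstack", str(pid)], stdout=out, check=True)
```

Passing `read_busy` and `dump` in as functions keeps the loop independent of how the count is read and how the dump is produced, so it can be tried out with fake readings before being pointed at a real server.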
Lesson 4: Don’t store more than a few bytes in the user session. Being stateless is the key to scaling. Which brings us to the next…
Lesson 5: Pagination is best done in the database. Caching is best done in the database. Filtering is best done in the database. Heavy computations with the data are best done in the database. Brush up on your SQL skills! If you’re doing business software, your app’s performance and capabilities will be dominated by how well you use your database.
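As a small illustration of pagination done in the database, here is a sketch that uses SQLite from the Python standard library, only because it runs anywhere. The table `report_rows`, its columns and the function names are made up, and SQL Server spells the paging clause differently.

```python
import sqlite3


def demo_connection(rows=50000):
    """Build an in-memory database with a hypothetical report table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_rows (id INTEGER PRIMARY KEY, customer TEXT)"
    )
    conn.executemany(
        "INSERT INTO report_rows (id, customer) VALUES (?, ?)",
        ((i, f"customer-{i}") for i in range(1, rows + 1)),
    )
    return conn


def fetch_page(conn, page, page_size=100):
    """Return one page of rows; pages are numbered from 0.

    The database does the slicing, so the application never holds more
    than page_size rows in memory.
    """
    cursor = conn.execute(
        "SELECT id, customer FROM report_rows ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = demo_connection()
    first_page = fetch_page(conn, 0)
    print(len(first_page), "rows, starting with", first_page[0])
```

Nothing needs to be kept in the user session except, at most, the current page number.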
We could not change the application code, but luckily there was a configuration parameter that limited the maximum number of returned rows. Setting it to 1000 solved the problem.
Lesson 6: whenever you perform a query, you must be certain that the number of rows it will return is bounded.
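One way to enforce such a bound from the application side is sketched below, again with SQLite for illustration only. The helper names and the `numbers` table are hypothetical, and the limit of 1000 simply mirrors the setting mentioned above. The helper reads one row more than the limit so that the caller can tell when the result was cut off.

```python
import sqlite3


def bounded_query(conn, sql, params=(), max_rows=1000):
    """Run sql and return (rows, truncated) with at most max_rows rows."""
    cursor = conn.execute(sql, params)
    # Ask for one extra row: if it arrives, the full result was larger.
    rows = cursor.fetchmany(max_rows + 1)
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


def demo_numbers(count=2500):
    """Build an in-memory table holding the integers 1..count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE numbers (n INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO numbers (n) VALUES (?)",
        ((i,) for i in range(1, count + 1)),
    )
    return conn


if __name__ == "__main__":
    conn = demo_numbers()
    rows, truncated = bounded_query(conn, "SELECT n FROM numbers ORDER BY n")
    print(len(rows), "rows returned, truncated:", truncated)
```

This only caps what the application holds in memory. Putting the limit in the SQL itself also saves the database from producing rows nobody will read.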
And so our job with this application was finished. We went on tackling the next… One amazing thing that I learned in the process is that placing Apache in front of Tomcat is not always guaranteed to work well. We had a really rough time when we discovered we had this devastating Apache bug, and later we found a different bug in a different app with the same setup.
Lesson 7: try to use as few pieces (processes, computers, things) as possible. Every piece you have may be a cause of failure. Whoever says “… and by using XYZ, we get ZYX for free” is a fool! Nothing comes for free.
One of the colleagues who valiantly supported me in this adventure was so overwhelmed by the stressful situation that he resigned. I don’t know what to think of this; certainly this was not the sort of job that we were trained to deal with. This was Operations, not Software Development. And this was the life of the consultant. Diving into a problem not knowing the solution, but with the confidence that we will find the means to find the means of solving it. This is the opposite of “you tell me what I must do and I do it”, which is an unfortunately common misinterpretation of “customer collaboration”.
Lesson 8: Act as a consultant, not as a contractor (as @p_pugliese would say). Find your own ways of gathering information, making plans, trying solutions. Find your own ways of checking if you’re making progress.
In the end, my colleague will have to decide for himself if this is a crisis that leads to coming back to this job stronger than before, or a signal that his deep desire is to do something else.
February 8th, 2011 at 10:38
Thanks a lot for sharing these lessons. Another insightful post on your blog!
Regarding lesson #1: which social technique would you use to deal with a consultant who starts making random attempts? What are the right words to make him step back if the mess he has made is not enough for him to give up?
Awful personal experience…
February 8th, 2011 at 11:21
Ciao Davide! Long time no hear!
Thanks for your comment. Well, one thing I would do is to ask them for their diagnosis and their plan before allowing a consultant to change anything. Then another thing is to make sure you back up configuration files before you let them change them. But this is not an answer to your question, right? :-)
It depends, of course (see how I’m always practicing my consultant-talk? :-) One idea is to call a “status meeting” and then ask the consultant for his diagnosis of the problem. Ask them if they are sure, and let them explain how they know that they are sure. Have they ever seen a similar problem? Dig, dig, dig. Then ask them what the plan is. Ask them if they are sure it will work and what makes them sure it will work. Ask for an estimate of when the problem will be fixed. Ask them to try the solution in a test environment.
In summary, the trick is to bring the conversation away from the people (you’re good, you’re no good) to the concreteness of the problem: here’s what we know, here’s what we don’t know, here are our options.
Hope this helps!
February 8th, 2011 at 23:23
[…] This post was mentioned on Twitter by Luca Mearelli, realbot. realbot said: Great post! RT @xpmatteo Blogged about lessons learned in Application Support: http://matteo.vaccari.name/blog/archives/502 […]
February 9th, 2011 at 15:58
What an amazing read, thanks for sharing those valuable gems.
February 10th, 2011 at 03:29
Great post. Also your comment about the technique for handling the “random changes” consultant is good. Re: lesson 2, is it always necessary to have a full-time DBA? What about a consultant to perform occasional tune ups?
February 11th, 2011 at 19:27
@Craig: ah well, I suppose it depends :-) The important part is to have someone create and follow a maintenance schedule.
February 16th, 2011 at 01:35
Random suggestion for the future (could have helped with the SMTP issue and perhaps with lesson 4): use Wireshark / tcpdump and look at what goes over the wire.
Big fun! (Big surprises more often than not).
Clearly those tools need some skill to be really useful, but are really worth the time spent learning.