Learning Node.js and elasticsearch

Latest Updates

At my DAY JOB I'm logging an increasingly large amount of stuff, and I want to be able to quickly search through it for specific things, and also to have the tools doing the logging watch for specific things themselves. I've been doing this with a combination of a wide variety of tools (some of which I've talked about here, like oak), but most recently we had a small start-up write a prototype for us that used elasticsearch and logstash. The bit they wrote was a web interface for searching the DB and for making jobs that run periodically, search for whatever, and report what they find.

Well, that business folded. So I'm back in the market for a log DB and don't have any budget at the moment. I might just use elasticsearch and kibana and RTFM enough on logstash to figure out how to make it index stuff, but...

I've been meaning to teach myself Node.js anyway, so I spent the recent holidays learning a bit about it and a bit about javascript. Any time I embark on learning a new language I always start by finding a project I want that the language is suited for and start writing it as I learn. So, I started making my own syslog server, just for grins.

Node should be a good fit for it, I think, since its whole claim to fame is that it's event driven, so less time is wasted waiting on disk/network I/O and more is spent handling the next event. Interesting. So, my (possibly dumb) idea was to have it accept syslog messages, pick out the facility/severity and then pass it on to various handlers, which would then write it to a file or add it to a database or pass it to a log message analyzer or whatever. I also found out that these could be run as separate processes. Neat - a multi-threaded syslog daemon? One where, if one handler (say, one inserting messages into a DB) starts to bog down, messages don't get dropped, maybe? (I still need to do some benchmarking and play "what if" games with it).
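That split - decode the syslog priority, then fan the message out to independent handlers - might look something like this rough sketch. The handler bodies and message shape are placeholders, not my daemon's actual code; only the facility/severity math is standard (RFC 3164: PRI = facility * 8 + severity):

```javascript
// Sketch: decode the syslog <PRI> prefix and fan the message out to handlers.
function parsePri(raw) {
  const m = /^<(\d{1,3})>([\s\S]*)$/.exec(raw);
  if (!m) return null;                  // not a valid syslog frame
  const pri = Number(m[1]);
  return {
    facility: pri >> 3,                 // integer divide by 8
    severity: pri & 7,                  // remainder mod 8
    msg: m[2],
  };
}

// Handlers are just functions; a real daemon might run them as separate
// processes so one slow handler can't block the rest.
const handlers = [
  (event) => { /* append to a file */ },
  (event) => { /* queue for elasticsearch */ },
];

function dispatch(raw) {
  const event = parsePri(raw);
  if (event) handlers.forEach((h) => h(event));
  return event;
}
```

In a real daemon the raw messages would arrive on a `dgram` UDP socket bound to port 514; the parsing and fan-out are the same either way.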

Anyway, I've got a first cut at one now that logs to a file and logs to elasticsearch. And the handler that writes stuff to elasticsearch also picks interesting things out of the log messages to index by like "user" or "command" or "tty" (all stuff seen in auth.* messages). So far, so good.
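Picking fields like "user", "tty", or "command" out of auth messages can be done with a handful of regexes. These patterns are only a guess at the kind of thing involved - real sshd/sudo formats vary - not the actual handler:

```javascript
// Sketch: pull interesting fields out of auth.* style messages with regexes.
// The patterns are illustrative; tune them against the messages you actually see.
const extractors = {
  user: /\buser\s+(\S+)/i,        // e.g. "session opened for user bob"
  tty: /\btty=(\S+)/i,            // e.g. sudo's "TTY=pts/0"
  command: /\bCOMMAND=(.+)$/,     // e.g. sudo's "COMMAND=/bin/ls -l"
};

function extractFields(msg) {
  const fields = {};
  for (const [name, re] of Object.entries(extractors)) {
    const m = re.exec(msg);
    if (m) fields[name] = m[1];
  }
  return fields;
}
```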

I still need to spend a lot more quality time with kibana figuring it out. I think it'll be a powerful tool (probably more so than the one made by the start-up) though it's not exactly intuitive. But it's a start and gave me a good excuse to learn a bit about Node.js. :-)

If I wind up with a program that I think is useful, I'll probably stick it in github for the world to laugh at... er, I mean to use. Hopefully it'll scale well. At SGI it's not unheard of for someone to mistakenly cause over 8000 DNS queries per second which all get logged (ugh) and I don't even wanna think about the number of log messages per second we get from all the firewalls in all the different offices. It's obscene.

But my intention is to make it so I can have all these devices/servers logging to a LOCAL server (one in each geography - less WAN traffic) which logs it to one or more LOCAL elasticsearch servers which are all working as a cluster so kibana can search (in parallel?) any or all of them for whatever I need to find. A distributed logging system that's centrally searchable.

Yeah, there's probably lots of off-the-shelf tools like splunk or logrhythm that do this but as I said I have no budget at the moment. And in the meantime I'll add Node.js to my toolbox. :-)

Ok, color me officially impressed. Once I got my syslog daemon logging things into elasticsearch, I started playing with the latest version of kibana. And it's a very nice tool. I've still got lots to learn about it, but search youtube for videos demonstrating it. It's very nice.

I also have been spending evenings and weekends working on my little syslog daemon, and it's coming along nicely too. I tweaked it to start tracking how many log events per second it's handling, how long (on average) each destination handler is taking to parse each one, and how long the inserts into elasticsearch are taking. Doing this, I found out that hundreds of individual inserts per second didn't scale - after 8 or 9 hours of running I'd see my per-message handling times climb from .1 or .2 ms (roughly) to 20-30 ms.
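Tracking per-handler latency like that can be done by wrapping each handler with a timer built on `process.hrtime`. This is just a sketch of the idea - the stats shape is mine, not the daemon's actual code:

```javascript
// Sketch: wrap a handler so each call updates a running average of its latency.
function instrument(name, handler, stats) {
  stats[name] = { calls: 0, totalMs: 0, avgMs: 0 };
  return (event) => {
    const start = process.hrtime.bigint();
    handler(event);
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const s = stats[name];
    s.calls += 1;
    s.totalMs += elapsedMs;
    s.avgMs = s.totalMs / s.calls;
  };
}
```

Wrap each destination handler once at startup (`handler = instrument('file', handler, stats)`) and the stats object accumulates as events flow through.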

However, last night I tweaked it so the elasticsearch-based handlers now just queue up each object to insert into the database and then every X seconds (once a second right now) I do a bulk insert. MUCH better performance. After running it about 12 hours, my average handling times are still .01 to .02 ms per event, and the insert times are still around 0.17 ms per event.
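The queue-and-flush pattern is simple enough to sketch. Here `bulkInsert` stands in for whatever the elasticsearch client's bulk call looks like (the real bulk API takes interleaved action/source lines); the class is an illustration of the idea, not my actual handler:

```javascript
// Sketch: buffer documents and flush them in one bulk request every interval.
class BulkQueue {
  constructor(bulkInsert, intervalMs = 1000) {
    this.docs = [];
    this.bulkInsert = bulkInsert;
    this.timer = setInterval(() => this.flush(), intervalMs);
    this.timer.unref();            // don't keep the process alive just for this
  }
  add(doc) {
    this.docs.push(doc);
  }
  flush() {
    if (this.docs.length === 0) return 0;
    const batch = this.docs;
    this.docs = [];                // swap first so new docs queue up meanwhile
    this.bulkInsert(batch);
    return batch.length;
  }
}
```

The win comes from amortizing the per-request overhead: one HTTP round trip and one index operation per second instead of hundreds.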

And I haven't even parallelized anything yet. I'll either bang together some new parsers for other types of data or I'll dive into the pm2 module and figure out how to parallelize it next. Or maybe I'll tidy up the code and modularize it a bit more so specifying what "handlers" to use and what files to import them from is in a config file instead of just hardwired in the code.
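Moving the handler list into a config file could be as simple as a JSON array of module paths that get require()d at startup. The config shape here is hypothetical - just one way it might look:

```javascript
// Sketch: load handler modules named in a config object instead of
// hardwiring them in the code. In practice the config would come from
// something like JSON.parse(fs.readFileSync('handlers.json')).
function loadHandlers(config) {
  return config.handlers.map((modulePath) => require(modulePath));
}
```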

And at some point, I'll have the handlers take their performance stats and stick them into elasticsearch too. That way kibana can graph them for me. I'll be able to make pretty dashboards with pie charts showing what types of log data we're ingesting and line charts showing average parsing and inserting times, how many events per second each handler is doing, etc. Since kibana is so good at that, I figure, 'might as well'.

Anyway... It's coming along nicely. I'm currently around 20k to 25k events per minute and it's only consuming (for now) around 20 percent of one cpu core (for both elasticsearch and my syslog)...

Another breakthrough... Last night I figured out how to bend elasticsearch and kibana to my will so that I could keep my data local to a specific geography but still have it be centrally searchable. This turned out to be a challenge at first because elasticsearch clusters really want to spread the data out across all the nodes in the cluster. Whereas, I wanted to keep log data in CA in the elasticsearch node/s in CA, log data in WI in the nodes in WI, log data in London in the nodes in London, etc.

As it turns out, the way to do this is to have each geography have its own elasticsearch cluster (with its own cluster name), and its own indexes (this is important). Then, on the system you're gonna run kibana, you set up elasticsearch again but in THIS config, you set it up to use "tribes". Each tribe you configure causes elasticsearch to connect to a separate cluster as a client. Then you point kibana at this local elasticsearch node. It does all the queries across all the different tribes - across the different clusters - aggregates the results, and feeds them back to kibana.
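The tribe setup ends up looking roughly like this in the elasticsearch.yml on the node kibana talks to (the tribe names and cluster names here are made up for illustration):

```yaml
# elasticsearch.yml on the kibana-facing node: it holds no log data of its
# own, just a tribe client connection into each geography's cluster.
tribe:
  ca_logs:
    cluster.name: logs-california
  wi_logs:
    cluster.name: logs-wisconsin
  uk_logs:
    cluster.name: logs-london
```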

So, I had to tweak my syslog.js script so the index name used still started with logstash- (which is what kibana looks for by default), but was unique for each geography. I also had to connect kibana to one of the clusters directly, once, so it could create its .kibana index, then switch it back to its local tribe-based elasticsearch server.
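The index naming tweak amounts to one small function. The geography tag and date format below are just an assumed convention - the point is keeping the logstash- prefix kibana expects while making the name unique per geography:

```javascript
// Sketch: keep the logstash- prefix kibana looks for by default, but tag the
// index with a geography so each cluster's indexes stay distinct.
function indexName(geo, date = new Date()) {
  const d = date.toISOString().slice(0, 10).replace(/-/g, '.'); // YYYY.MM.DD
  return `logstash-${geo}-${d}`;
}
```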

This will work out great because it means if things ever start to not scale well enough, I can just build up that elasticsearch cluster - add a new node. Or if I decide I want to retire/replace a node in one of the clusters, I can set it to be master: false, data: false, and then spin up a new node in the cluster and all the new data will go only to the new node and then eventually I can tear down the old one when all of its data has expired. Slick.
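The retirement trick is just two lines in the outgoing node's elasticsearch.yml, something like:

```yaml
# elasticsearch.yml on the node being retired: stop it from holding new data
# or becoming master, let its existing indexes age out, then shut it down.
node.master: false
node.data: false
```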

At the moment, I'm handling well over 1.5 million log events per hour and the elasticsearch nodes (my "clusters" are only one node each at the moment) are ticking along fine at 20% of one CPU core. :-) I'm sure the CPU usage and memory needs will increase as the amount of data I'm keeping increases, but for now that's just dandy. These are just prototypes after all. At some point I'll spin up production nodes with significant amounts of memory and a BIG chunk of fast storage. Then I'll really have somethin'!