Searching logfiles with multiple threads


At my day job it's not uncommon that I'll get a list of IP addresses, networks, and/or hostnames suspected to be used by various bad actors, and then I'll need to go search through previous firewall, web-access, and DNS query logs to see if they show up anywhere. Sometimes there's a LOT of log data to search through, or a lot of suspect data to look for. I've found that grep -f (or fgrep -f, when the patterns are plain strings) sometimes freaks out if I give it a really large number of patterns, or takes a very long time to complete, or never completes. More than once I've wound up writing a script to break up the list of things I'm grep'ing for into multiple files, both so I can search for them in parallel and because fgrep sometimes just doesn't work well with large pattern files.

I see a hand out there... Yeah, I know, I should have a better logging solution than just syslog-ng splitting stuff up into multiple log files that get rotated and gzip'd nightly. Yeah, yeah... Sue me. I've been working on other things first, such as getting some functional network intrusion detection in place and monitoring more than just internet/dmz/edge traffic. And, frankly, what I've got with syslog-ng and oak is far superior to what I found when I first started. :-) What can I say, it's a work in progress.

Anyway, in the meantime, I needed a better/faster way to search things. For instance, when the FBI announced the Blackshades operation, I suddenly had over 13,000 hostnames/domains to search for in recent DNS query logs. Ugh. Try that with fgrep. So I blew the dust off my K&R book and wrote a multi-threaded C program to do it. I mentioned it in passing to my brother and he made a similar tool in a language he uses at work called Scala. :-) But he also pointed out (duh!) that what I was looking for in the Blackshades case didn't require full regex matching, and that mebbe I should just use substrings, at least as an option.

Boy did I feel dumb.

So now not only does my tool use multiple threads for doing regex searches, it will also (by default now) do plain substring searches, which is far faster. You can download it here if you want it. Note that this isn't at all polished, merely functional. For instance, it doesn't support reading from stdin, only from a file, like this:
tfgrep2 --patternfile=bsdomains.txt --inputfile=query.log --threads=8
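
To give a rough idea of why substring mode is so much faster, here's a sketch of what the per-line matching loop might look like in the two modes. This is just an illustration - the function and variable names are mine, not necessarily what's in tfgrep2.c. In substring mode each pattern is checked with a plain strstr(); in regex mode each pattern has to be compiled once up front with regcomp() and then run against every line with regexec():

#include <regex.h>
#include <string.h>

/* Return 1 if this line matches any pattern, 0 otherwise.  In regex
 * mode, "compiled" holds one regex_t per pattern, built up front with
 * regcomp(); in substring mode only the raw pattern strings are used. */
static int line_matches(const char *line, char **patterns,
                        regex_t *compiled, int npatterns, int use_regex)
{
    for (int i = 0; i < npatterns; i++) {
        if (use_regex) {
            if (regexec(&compiled[i], line, 0, NULL, 0) == 0)
                return 1;
        } else {
            if (strstr(line, patterns[i]) != NULL)
                return 1;
        }
    }
    return 0;
}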

There are some hardwired limitations that I haven't bothered to code around yet. For instance, there's a hard limit on the maximum number of patterns it will support. At some point I'll rewrite the code that reads the pattern file to use malloc and realloc, so you can have as many patterns as you have memory for. There's also a maximum line length at the moment (8k, I think), though I'm careful to make sure that an input line longer than that just gets broken into multiple lines rather than causing a buffer overflow. But for my purposes this is no big deal.
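
For what it's worth, lifting the pattern limit would probably look something like this - grow the array with realloc() as the pattern file is read, so the only ceiling is available memory. Just a sketch, with most error handling omitted, and load_patterns() is a made-up name rather than code from tfgrep2.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one pattern per line, doubling the array whenever it fills up. */
static char **load_patterns(const char *path, size_t *count)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return NULL;

    char **patterns = NULL;
    size_t used = 0, cap = 0;
    char buf[8192];

    while (fgets(buf, sizeof buf, fp)) {
        buf[strcspn(buf, "\r\n")] = '\0';     /* strip the newline */
        if (used == cap) {
            cap = cap ? cap * 2 : 1024;       /* start at 1024, then double */
            patterns = realloc(patterns, cap * sizeof *patterns);
        }
        patterns[used++] = strdup(buf);
    }
    fclose(fp);
    *count = used;
    return patterns;
}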

The reason it doesn't do stdin, if you're curious, is that I wanted to optimize how I parallelized it, and I wanted to avoid using mutexes if I could. One thing I found later is that with a really large number of patterns, like in the Blackshades example, using mutexes wouldn't hurt performance much, because far more time is spent checking for matches than is spent reading lines. So at some point I'll add a stdin mode where the main program builds a work queue, using a mutex for safety, and each worker thread pulls a line out of that queue to work on.
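
If/when I add that, the queue itself would be pretty standard pthreads stuff. Here's a sketch of the idea (my names, not code that exists in tfgrep2.c today): the main thread pushes lines it reads from stdin, the worker threads pop them, and a mutex plus a pair of condition variables keep everything consistent. The mutex and condition variables would get set up with pthread_mutex_init() and pthread_cond_init() at startup.

#include <pthread.h>

#define QSIZE 4096

/* Hypothetical stdin work queue: main thread pushes lines, workers pop
 * them.  "done" is set once stdin is exhausted so workers know to quit. */
struct work_queue {
    char            *lines[QSIZE];
    int              head, tail, count;
    int              done;
    pthread_mutex_t  lock;
    pthread_cond_t   not_empty;
    pthread_cond_t   not_full;
};

void queue_push(struct work_queue *q, char *line)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)                    /* queue full: wait */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->lines[q->tail] = line;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

char *queue_pop(struct work_queue *q)
{
    char *line = NULL;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done)            /* queue empty: wait */
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        line = q->lines[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return line;                                 /* NULL means no more input */
}

void queue_finish(struct work_queue *q)          /* called after EOF on stdin */
{
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_broadcast(&q->not_empty);       /* wake all waiting workers */
    pthread_mutex_unlock(&q->lock);
}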

However, where that won't work nicely is when there's a really big file to search (another good reason to parallelize) but not a lot of things to search for. In that case the work-queue mutex could become a big bottleneck. The way I parallelized tfgrep2.c right now is very simple and handles the big-file/few-patterns case well. It gets the size of the file in bytes and divides it by the number of worker threads to get the split points. Then it seeks to the first split point, reads forward to the start of the next line, and notes that file position; the first thread gets byte offset 0 as its starting offset and that position as its ending offset. That same position becomes the starting offset for the second thread, whose ending offset is found the same way at the next split point, and so on. Each worker thread ends up with roughly the same amount of the file to read, and each one opens the file with its own file handle and reads as fast as it wants - no queues, no mutexes, tidy, simple, fast.
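
For the curious, here's roughly what that splitting looks like, sketched out with my own variable and function names rather than lifted straight from tfgrep2.c: compute the byte ranges, nudge each split point forward to the start of the next line so no line gets cut in half, then hand each worker a [start, end) range. The caller would pthread_join() the threads afterwards.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

struct slice { const char *path; long start; long end; };

/* Skip forward from "pos" to the start of the next line. */
static long next_line_start(FILE *fp, long pos)
{
    int c;
    fseek(fp, pos, SEEK_SET);
    while ((c = fgetc(fp)) != EOF && c != '\n')
        ;                                    /* discard the partial line */
    return ftell(fp);
}

/* Each worker opens its own handle and searches only its own range. */
static void *search_slice(void *arg)
{
    struct slice *s = arg;
    FILE *fp = fopen(s->path, "r");
    char line[8192];

    fseek(fp, s->start, SEEK_SET);
    while (ftell(fp) < s->end && fgets(line, sizeof line, fp)) {
        /* ...check the line against the patterns here... */
    }
    fclose(fp);
    return NULL;
}

static void spawn_workers(const char *path, int nthreads,
                          pthread_t *tids, struct slice *slices)
{
    struct stat st;
    stat(path, &st);

    FILE *fp = fopen(path, "r");             /* only used to find split points */
    long prev = 0;

    for (int i = 0; i < nthreads; i++) {
        long end;
        if (i == nthreads - 1)
            end = st.st_size;                /* last slice runs to EOF */
        else
            end = next_line_start(fp, (st.st_size / nthreads) * (i + 1));
        slices[i] = (struct slice){ path, prev, end };
        pthread_create(&tids[i], NULL, search_slice, &slices[i]);
        prev = end;
    }
    fclose(fp);
}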

And no stdin... :-( But mebbe when I add support for stdin, I'll try 'n keep the mutexes from being a bottleneck by feeding the input to each worker thread in chunks rather than line by line. Maybe I'll just add a CLI option to set the number of lines to feed to each worker thread at a time? Or maybe each worker thread will look at the clock at the start and end of each chunk of work and, if it took less than, say, a second or two, double (using another mutex) the "chunk size" - the number of lines the main program reads into the work queue before unlocking the mutex, and the number of lines each worker thread pulls out of the queue before going to work on them. I'm not sure what the best approach here will be, but it'll be interesting to find out. :-)
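
Just to make the chunk idea concrete, a batched version of the pop, reusing the hypothetical work_queue sketch from above, might look like this - grab up to "chunk" lines per lock/unlock instead of one (again, this is speculation about a future feature, not existing code):

/* Pop up to "chunk" lines with one lock acquisition.  Returns how many
 * lines were copied into "out"; 0 means the input is exhausted. */
int queue_pop_chunk(struct work_queue *q, char **out, int chunk)
{
    int n = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done)
        pthread_cond_wait(&q->not_empty, &q->lock);
    while (q->count > 0 && n < chunk) {
        out[n++] = q->lines[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
    }
    if (n > 0)
        pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return n;
}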

Of course, with smaller numbers of patterns the chunks will grow bigger, and reading the input in chunks to feed to the worker threads means the work queue will need a lot more memory. But that's ok. :-)