Searching logfiles with multiple threads
At my day job it's not uncommon that I'll get
a list of IP addresses, networks, and/or hostnames that are suspected to
be used by various bad actors, and then I'll need to go search through
previous firewall, web-access, and DNS query logs to see if they show up
anywhere. And sometimes there's a LOT of log data to search through, or
a lot of suspect data to look for. I've found that grep -f (or fgrep -f)
sometimes freaks out if I hand it a really large number of regex patterns to
search for, or takes a very long time to complete, or never completes at all.
More than once I've wound up writing a script to break up a list of things
I'm grep'ing for into multiple files so I can search for them in parallel,
and because fgrep sometimes just doesn't work well with large files.
I see a hand out there... Yeah, I know, I should have a better logging
solution than just syslog-ng splitting stuff up into multiple log files that
get rotated and gzip'd nightly. Yeah, yeah... Sue me. I've been working on
other things first, such as getting some functional network intrusion
detection working and monitoring more than just internet/DMZ/edge traffic.
And, frankly, what I've got with syslog-ng and oak is far superior
to what I found when I first started. :-) What can I say, it's a work in
progress.
Anyway, in the meantime, I needed a better/faster way to search things. For
instance, when the FBI announced the Blackshades operation, I suddenly had
over 13,000 hostnames/domains to search for in recent DNS query logs. Ugh.
Try that with fgrep. So I blew the dust off my K&R book and wrote
a multi-threaded C program to do that. I mentioned it in passing to my
brother and he made a similar tool in a language he uses at work called
Scala. :-) But he also pointed out (duh!) that what I was looking
for in the case of the Blackshades data didn't require full regex parsing
and mebbe I should just use substrings, at least as an option.
Boy did I feel dumb.
So now not only does my tool use multiple threads for doing regex searches,
it will also (by default now) just do substring searches, which is far
faster. You can download it here if you want it. Note
that this isn't at all polished, merely functional. For instance, it doesn't
support reading from stdin, only from a file, like this:
tfgrep2 --patternfile=bsdomains.txt --inputfile=query.log --threads=8
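To give a rough idea of why substring mode is so much faster, here's a minimal sketch (not the actual tfgrep2 source; the function names are just for illustration) of the two inner loops: plain strstr() calls versus POSIX regexec() against patterns precompiled with regcomp():

#include <string.h>
#include <regex.h>

/* Substring mode: one strstr() call per pattern per line. */
static int match_substring(const char *line, char **patterns, int npatterns)
{
    for (int i = 0; i < npatterns; i++)
        if (strstr(line, patterns[i]) != NULL)
            return 1;
    return 0;
}

/* Regex mode: each pattern was compiled once up front with regcomp(). */
static int match_regex(const char *line, const regex_t *compiled, int npatterns)
{
    for (int i = 0; i < npatterns; i++)
        if (regexec(&compiled[i], line, 0, NULL, 0) == 0)
            return 1;
    return 0;
}

Skipping the regex machinery entirely is what makes substring mode win when the patterns are just hostnames and IP addresses.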
There are some hardwired limitations that I haven't bothered to code around
yet. For instance, there's a hard limit on the maximum number of patterns
it will support. Sometime I'll rewrite the code that reads the pattern file
so it uses malloc and realloc, so you can have as many patterns as you have
memory for. There's also a maximum line length at the moment (though I'm
careful to make sure that if you have an input line longer than that (8k, I
think) it will just break it into multiple lines rather than suffer a buffer
overflow). But for my purposes this is no big deal.
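If and when I do that malloc/realloc rewrite, the pattern-reading code might look something like this sketch (the names and the 1024/8192 numbers are hypothetical, not what's in tfgrep2.c today):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Grow the pattern list with realloc() so the only limit is memory. */
static char **read_patterns(const char *path, size_t *count)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;

    char **patterns = NULL;
    size_t n = 0, cap = 0;
    char line[8192];                          /* per-pattern length cap */

    while (fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip trailing newline */
        if (n == cap) {                       /* double the array as needed */
            cap = cap ? cap * 2 : 1024;
            char **grown = realloc(patterns, cap * sizeof(*patterns));
            if (grown == NULL)
                break;                        /* out of memory: stop here */
            patterns = grown;
        }
        patterns[n++] = strdup(line);
    }
    fclose(fp);
    *count = n;
    return patterns;
}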
The reason it doesn't do stdin, if you're curious, is that I wanted to optimize
how I parallelized it, and I wanted to avoid using mutexes if I could.
One thing I later found is that when there's a really large number of
patterns, as in
this Blackshades example, using mutexes wouldn't hurt performance, because far
more time is spent checking for matches than is spent reading lines. So,
at some point I'll add a feature so you can read from stdin: the main
program will just build a work queue, using a mutex for safety, and each
worker thread can pull a line out of this queue to work on.
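Roughly, the stdin mode I have in mind would look something like this (a pthreads sketch with hypothetical names; none of this is in the current code):

#include <pthread.h>
#include <stdlib.h>

#define QUEUE_SIZE 1024

struct work_queue {
    char *lines[QUEUE_SIZE];    /* simple ring buffer of input lines */
    int head, tail, count;
    int done;                   /* set once stdin hits EOF */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct work_queue q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

/* Main thread: push one line of input (blocks if the queue is full). */
static void queue_push(char *line)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == QUEUE_SIZE)
        pthread_cond_wait(&q.not_full, &q.lock);
    q.lines[q.tail] = line;
    q.tail = (q.tail + 1) % QUEUE_SIZE;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

/* Worker thread: pop a line, or NULL when the input is exhausted. */
static char *queue_pop(void)
{
    pthread_mutex_lock(&q.lock);
    while (q.count == 0 && !q.done)
        pthread_cond_wait(&q.not_empty, &q.lock);
    char *line = NULL;
    if (q.count > 0) {
        line = q.lines[q.head];
        q.head = (q.head + 1) % QUEUE_SIZE;
        q.count--;
        pthread_cond_signal(&q.not_full);
    }
    pthread_mutex_unlock(&q.lock);
    return line;
}

/* Main thread: tell the workers there's no more input coming. */
static void queue_finish(void)
{
    pthread_mutex_lock(&q.lock);
    q.done = 1;
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

The main thread would call queue_push() for each line read from stdin and queue_finish() at EOF, while each worker loops on queue_pop() until it gets NULL.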
However, where this won't work nicely is when there's a really big file you
want to search (another good reason to parallelize) but not a lot of things
to search for. In that case, the work-queue mutex could be a big
bottleneck. In fact, the way I parallelized tfgrep2.c right now is very
simple and works well with a big file and a small number of patterns. It
gets the size of the file in bytes and divides it by the number of worker
threads. Then it seeks to the start of the first split, reads to the start of
the next line, notes that file position, and spawns a thread that starts at
byte offset 0 and uses this position as its ending offset. That position also
becomes the starting offset for the next thread; it finds the start of the
line after the next split and spawns that thread, and so on. So each worker
thread has a roughly equal number of lines it needs to read, and they all
open the file with their own file handle and read as fast as they want - no
queues, no mutexes, tidy, simple, fast.
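In sketch form, that splitting logic looks roughly like this (simplified, not the exact code from tfgrep2.c):

#include <stdio.h>

/* Find the first byte offset at or after 'approx' that starts a line. */
static long next_line_start(FILE *fp, long approx)
{
    if (approx == 0)
        return 0;
    fseek(fp, approx - 1, SEEK_SET);
    int c;
    while ((c = fgetc(fp)) != EOF && c != '\n')
        ;                                    /* skip the rest of this line */
    return ftell(fp);
}

/* Fill start[i]/end[i] for each of nthreads workers; each worker then
 * opens the file with its own FILE handle and scans [start, end). */
static void split_file(const char *path, int nthreads, long *start, long *end)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return;
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    long chunk = size / nthreads;

    for (int i = 0; i < nthreads; i++) {
        start[i] = (i == 0) ? 0 : end[i - 1];
        end[i]   = (i == nthreads - 1) ? size
                                       : next_line_start(fp, (i + 1) * chunk);
        if (end[i] < start[i])               /* one line spanned whole chunks */
            end[i] = start[i];
    }
    fclose(fp);
}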
And no stdin... :-( But mebbe when I add support for stdin, I'll try 'n keep the
mutexes from being a bottleneck by feeding the input to each
worker thread in chunks rather than just line by line. Maybe I'll just add a
CLI option to set the number of lines to feed to each worker thread at a time?
Or maybe each worker thread will look at the clock at the start and end of a
task, and if the task took less than, say, a second or two, it (using another
mutex) doubles the "chunk size" - the number of lines the main program reads
into the work queue before unlocking the mutex, and the number of lines each
worker thread reads out of the queue before going to work on them. I'm not
sure what the best approach here will be, but it'll be interesting to
find out. :-)
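If I go the clock-watching route, the worker-side logic might be as simple as something like this (purely speculative, nothing like it exists in the code yet):

#include <pthread.h>
#include <time.h>

static int chunk_size = 64;                  /* lines handed out per queue op */
static pthread_mutex_t chunk_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by a worker with the clock readings from its last batch. */
static void maybe_grow_chunk(const struct timespec *t0, const struct timespec *t1)
{
    double elapsed = (t1->tv_sec - t0->tv_sec)
                   + (t1->tv_nsec - t0->tv_nsec) / 1e9;
    if (elapsed < 1.0) {                     /* batch finished too quickly */
        pthread_mutex_lock(&chunk_lock);
        if (chunk_size < 65536)
            chunk_size *= 2;                 /* hand out bigger batches */
        pthread_mutex_unlock(&chunk_lock);
    }
}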
Of course, with smaller numbers of patterns, reading things in bigger chunks
to feed to the worker threads will mean this work queue needs a lot more
memory. But that's ok. :-)