If you can't solve a problem, then there is an easier problem you can solve: find it. - George Polya
A few months ago I got itchy coding fingers and revisited an old project where I had "implemented" a socket based logging server that rotates log files based on the number of lines. I'm using quotes because the server component was really horrendous and would fall over if you looked at it the wrong way. When I went back to it I tried to make it more robust by finishing the parts I had started but hadn't finished. I succeeded in some respects and failed in others. So to wrap up the exercise here are the things I learned along the way and things that I could have done better. As usual, the code is not meant for production environments and is posted on github for learning purposes.
The main success is that I mostly finished the project and gained some more respect for production logging servers like rsyslog. It is surprisingly tricky to write a robust logging server. My toy implementation has some mitigations for the obvious failure modes but making things truly robust is left as an exercise for the reader.
I'm proud of myself for coming up with an asynchronous line counting trick by using an in-process pipe. It's not an exact counter so the log file will sometimes overflow before the monitoring thread rotates the log. There is a performance trade-off here and I opted to allow the log file to overflow instead of blocking clients or dropping log lines. I didn't do any benchmarking but making things exact would increase lock contention so it would definitely slow things down.
I also commented copiously because I know from previous experiences that any code that involves threads and locks can be notoriously tricky to make sense of. The comments definitely helped me uncover a deadlock or two while I was testing things and the experiences are documented in the comments as a warning to myself and others.
The main failure is that it is still unfinished. I think right now it is 80% to 90% done and taking it to the finish line will probably require reworking some of the cleverness.
Speaking of cleverness, I think I was a bit too clever with some of the tricks. Although they work there are still plenty of race conditions and I might not have made the right performance vs correctness trade-offs.
The code is also not very readable and it kinda has to be read bottom-up instead of top-down to make sense of. The parts that start the processing loops for handling clients and rotating logs are all the way at the bottom and even though it is about a hundred or so lines of code it could definitely be re-organized to be a bit more readable.
There are no tests other than a simple shell script to start the server and a few concurrent clients that write a few thousand log lines. The script forcefully kills the server after a few seconds and I have to manually verify that the correct number of lines where written to the log files and that the log rotations happened as expected. I should have written more tests and let the tests guide the implementation. I suspect the code would have ended up cleaner if I had written the tests first. I might revisit this at some point and try to make the code more testable as another learning exercise.
I gained some more respect for real logging servers like rsyslog and got re-acquainted with all the failure modes of concurrent code that uses locks. I have a tentative plan to revisit it and refactor it to make the logic and failure modes more obvious to me and anyone else that might want to learn from the code.