This blog post discusses several aspects of the performance of your cyber security logging system. The logs need to be stored somewhere, and the backend log storage impacts both the cost and the performance of the logging system.
Logs come in different formats and need to be queryable. The example diagram below highlights the differences by showing three common log types: Windows event logs, Linux server logs, and network logs.
The first step is parsing the logs into a schema; in database speak this is called normalization. For example, which field in the string is the IP address, and which field is the date/time? In the backend storage the date/time data should go into a date/time field, the IP address into an IP address field, and so on. SIEM systems also add metadata fields to the original log to give the SIEM some additional context to search on. The blog post https://www.croninity.com/post/the-log-and-pony-show-understanding-the-security-log-pipeline focuses heavily on log ingest and parsing.
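To make that parsing step concrete, here is a minimal sketch in Python. The raw line, field names, and regular expression are illustrative assumptions, not any particular SIEM's parser; it simply normalizes a Linux-style auth log line into schema fields.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw line, loosely modeled on a Linux auth log entry.
raw = "Feb 11 09:14:02 web01 sshd[2712]: Failed password for root from 203.0.113.7 port 51334 ssh2"

# A minimal parser: pull the pieces we care about into named schema fields.
pattern = re.compile(
    r"^(?P<month>\w{3}) (?P<day>\s?\d+) (?P<time>[\d:]+) "
    r"(?P<host>\S+) (?P<process>\w+)\[(?P<pid>\d+)\]: (?P<message>.*)$"
)

fields = pattern.match(raw).groupdict()

# Pull the source IP out of the free-text message so it can land in an IP-typed column.
ip_match = re.search(r"from (\d{1,3}(?:\.\d{1,3}){3})", fields["message"])
fields["src_ip"] = ip_match.group(1) if ip_match else None

# Normalize the timestamp into a real date/time value (syslog lines omit the year).
fields["timestamp"] = datetime.strptime(
    f"{datetime.now().year} {fields['month']} {fields['day'].strip()} {fields['time']}",
    "%Y %b %d %H:%M:%S",
).replace(tzinfo=timezone.utc).isoformat()

print(fields)  # includes 'src_ip': '203.0.113.7' and a normalized 'timestamp'
```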
Next there is an index function, which makes the data easier to search in the database. At a basic level, database indexing makes searches more efficient through B-tree style lookups. There are other indexing and search structures, such as Bloom filters, but B-trees are used here to keep the explanation simple. The simplest explanation of a B-tree search is a card-search analogy with a deck of cards. If you have to find a random card by flipping each card, it will take 26 flips on average. If you can match on the suit first, it will take roughly 7-9 flips on average. The card-to-indexing analogy is explained here: https://www.essentialsql.com/what-is-a-database-index/
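As a quick illustration of what an index buys you, the Python sketch below uses SQLite (whose indexes are B-trees) to show the query plan switching from a full table scan to an index search once an index exists on the IP address column. The table layout and data are made up for the example.

```python
import sqlite3

# In-memory database standing in for the logging backend.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (ts TEXT, src_ip TEXT, message TEXT)")

# Load a batch of synthetic rows.
rows = [(f"2024-02-11T09:{i % 60:02d}:00", f"10.0.0.{i % 250}", "Failed password")
        for i in range(10_000)]
con.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Without an index, this query scans every row.
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM logs WHERE src_ip = '10.0.0.7'").fetchall())

# With a B-tree index, the engine jumps straight to the matching rows.
con.execute("CREATE INDEX idx_src_ip ON logs (src_ip)")
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM logs WHERE src_ip = '10.0.0.7'").fetchall())
```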
Backend Log Storage
The simplified question related to your log storage is: how many logs are you collecting, and how long do you have to store them? Indexing will only take you so far. If you are running a lot of queries against a lot of indexed data, they can still take a long time; there are numerous horror stories of queries that take days to complete. For example, an organization will say something basic like, "We have to store logs for six months or a year." This oversimplification leads people down paths that are expensive and wasteful. The first thing the organization needs to do is define operational logs versus security logs.
Defining Operational Logs versus Security Logs
Logs are telemetry data triggered by instrumentation, and that telemetry has to be configured (i.e., audit settings have to be set). Much of this telemetry isn't really security data; it is operational data about the system. I will agree this operational data can be useful for security investigations, but much of it is often not that useful. I find it ironic when an organization turns on some new audit setting that suddenly produces millions of new logs a day, and now the organization has to store those logs for some mandated lengthy time frame (i.e., one year, etc.). When this happens, I wonder, "That's strange that you now have to store all of these random logs for a year or more. You were running your system for the last five years not even collecting these types of logs. Where was your log retention requirement back then?"
So, the first thing to define is: what are your operational logs versus your security logs? Put your security logs under your mandated log retention requirements and put your operational logs under your own retention requirements (i.e., two weeks, etc.). If you are in a situation where, when it comes to logs, "if it's created it must be maintained per full retention requirements no matter what the log is," be careful about all the logs your audit settings create. A simple classification rule like the sketch below can make the split explicit.
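The event IDs and retention values here are illustrative assumptions, not a recommendation; the point is simply that the classification, not the retention mandate, should come first.

```python
# Hypothetical routing rule: tag each event as "security" or "operational" and
# assign the retention period that matches that class.
SECURITY_EVENT_IDS = {4624, 4625, 4672, 4720, 4740}  # example Windows logon/account events

def classify(event: dict) -> dict:
    if event.get("event_id") in SECURITY_EVENT_IDS:
        event["log_class"] = "security"
        event["retention_days"] = 365   # mandated retention (assumed value)
    else:
        event["log_class"] = "operational"
        event["retention_days"] = 14    # your own, shorter retention (assumed value)
    return event

print(classify({"event_id": 4625, "message": "An account failed to log on"}))
print(classify({"event_id": 7036, "message": "Service entered the running state"}))
```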
It is also a commonly accepted practice to change the logging levels depending upon the criticality of the device.
Log Repos
One of the ways you control log storage retention is through a concept called repos. Splitting logs into repos can allow for more efficient searches, and it has the added benefits of letting you set different retention periods per repo and control who has access to each repo. Logs in the logging system can still be searched across repos by adjusting or removing the repo condition in the query.
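A rough sketch of the repo idea is below. The repo names, retention values, and query syntax are illustrative assumptions rather than any specific product's configuration.

```python
# Each repo gets its own retention period and access list.
repos = {
    "windows_security": {"retention_days": 365, "access": ["soc_analysts"]},
    "linux_auth":       {"retention_days": 365, "access": ["soc_analysts"]},
    "netflow":          {"retention_days": 90,  "access": ["soc_analysts", "network_ops"]},
    "app_operational":  {"retention_days": 14,  "access": ["app_team"]},
}

# Scoped search: only one repo is touched, so less data is scanned.
scoped_query = 'repo = "windows_security" AND event_id = 4625'

# Cross-repo search: drop the repo condition (or list several repos) to search everything.
cross_repo_query = 'event_id = 4625'
```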
The Costs of Storing Data
All of that log data needs to be stored somewhere, and that storage has a cost. There are several tradeoffs that factor into log storage analysis. Storage is disk, and disk costs money.
How much data do you want to store?
This is where retention requirements, the logs you apply those requirements to, and how much you audit all factor in.
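A back-of-the-envelope sizing calculation helps here. All of the numbers below are assumptions you would replace with your own daily volume and retention period.

```python
# Rough sizing estimate with made-up numbers; adjust for your environment.
daily_events    = 50_000_000   # events per day across all sources (assumed)
avg_event_bytes = 800          # average raw event size (assumed)
retention_days  = 365          # mandated security-log retention (assumed)

raw_gb_per_day = daily_events * avg_event_bytes / 1e9
total_raw_gb   = raw_gb_per_day * retention_days

print(f"~{raw_gb_per_day:.0f} GB/day raw, ~{total_raw_gb / 1000:.1f} TB over retention")
```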
How well does the logging system compress data?
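Compression ratios vary by product and by data, but you can get a feel for how well repetitive log text compresses with a quick test like the one below. The sample data and gzip are stand-ins for whatever your logging system actually does on disk.

```python
import gzip

# Synthetic, semi-repetitive auth-log style lines.
sample = "".join(
    f"Feb 11 09:{i % 60:02d}:{i % 60:02d} web01 sshd[{2000 + i}]: Failed password for root "
    f"from 203.0.113.{i % 250} port {50000 + i} ssh2\n"
    for i in range(10_000)
)

compressed = gzip.compress(sample.encode())
print(f"raw: {len(sample) / 1e6:.1f} MB, gzip: {len(compressed) / 1e6:.2f} MB, "
      f"ratio: {len(sample) / len(compressed):.0f}x")
```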
How much backup (replication) of the data do you want?
How much data do you want to be searchable?
Logging systems like Splunk and Elastic make data searchable by indexing it, and indexing increases the size of the data. A trick in some logging deployments to reduce storage size is to remove the indexes from the data after a certain period of time (i.e., two weeks, etc.). The most recent two weeks of data is immediately searchable and is called "hot." The data without an index is not immediately searchable and is called "cold." To search the cold data, it needs to be re-indexed.
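A simple sketch of that age-based hot/cold decision is below. The 14-day boundary mirrors the two-week example above and is an assumption, not a product default.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=14)  # assumed hot window

def storage_tier(event_time: datetime, now: datetime | None = None) -> str:
    """Classify data as hot (indexed) or cold (index dropped) based on age."""
    now = now or datetime.now(timezone.utc)
    if now - event_time <= HOT_WINDOW:
        return "hot (indexed, searchable now)"
    return "cold (index dropped, re-index before search)"

print(storage_tier(datetime.now(timezone.utc) - timedelta(days=3)))
print(storage_tier(datetime.now(timezone.utc) - timedelta(days=45)))
```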
How fast do you want to search? (solid state, etc.)
Solid state drives are faster and therefore more expensive. Everyone wants fast, and fast query results get your users their data more quickly, which is good. There are design decisions around how far back in time your users typically query. If 95% of your users' queries are within the last week, you might put seven days' worth of data on faster, more expensive storage and older data on cheaper storage like Amazon S3. Queries for older time periods will take longer, but that is an acceptable tradeoff if they don't happen as often.
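To put rough numbers on that tradeoff, here is a sketch comparing a tiered layout (one week hot on fast storage, the rest on cheap object storage) against keeping everything on fast storage. All prices and volumes are assumptions.

```python
# Rough monthly cost comparison; every number here is an assumption.
raw_gb_per_day = 40        # from the sizing estimate above
retention_days = 365
hot_days       = 7         # most queries hit the last week
ssd_usd_per_gb = 0.10      # fast block storage, per GB-month (assumed)
s3_usd_per_gb  = 0.023     # S3-style object storage, per GB-month (assumed)

hot_gb  = raw_gb_per_day * hot_days
cold_gb = raw_gb_per_day * (retention_days - hot_days)

tiered_cost  = hot_gb * ssd_usd_per_gb + cold_gb * s3_usd_per_gb
all_ssd_cost = raw_gb_per_day * retention_days * ssd_usd_per_gb

print(f"tiered: ~${tiered_cost:,.0f}/month vs all-SSD: ~${all_ssd_cost:,.0f}/month")
```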