Fix the log parsing issue and analyse the logs in Splunk


I am writing these posts a little less frequently now, having resumed my MSc after the winter break. I have an Artificial Intelligence and Machine Learning exam in March, which also marks the end of the taught, technical part of my degree. From then on I’ll move on to my own research: first a taught module to create a project proposal, and then the research project itself. Exciting times!

Introduction

We are still John.

This is the challenge room to complete the Advanced Splunk module. We have to fix a bunch of problems with the Splunk configuration.

As there is only one ‘task’ in this room, I will separate each question out into its own heading.

We are told there are three levels to this room:

  • Level 1 - Fix the event boundaries.
  • Level 2 - Extract new fields.
  • Level 3 - Analyse the data.

Once the VM has started and Splunk has had a couple of minutes to start up, you can access it from the split-screen view or from your own machine via VPN at:

http://MACHINE_IP:8000

As in the last rooms, if you want to access the machine via SSH, you can reset the password for the ubuntu user in the split-screen terminal and then SSH in using those credentials.

To reset the credentials, open a terminal:

sudo su
passwd ubuntu
  • Set a new password when prompted.

In your own terminal connected to the THM VPN:

ssh ubuntu@MACHINE_IP

Type the new password you set when prompted and elevate to root:

sudo su

Let’s go!


Q1: What is the full path of the FIXIT app directory?

Hint: We learned where apps are installed by default in the previous room. This will also be the same location in which we created our DataApp in the previous room.
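To confirm it on the box, list the default apps directory (the same location where we created our DataApp):

ls /opt/splunk/etc/apps/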

Spoiler warning: Answer /opt/splunk/etc/apps/fixit

Q2: What Stanza will we use to define Event Boundary in this multi-line Event case?

Hint: If we take a look at how the events are displayed, we can see individual events being split in two, so we are after a setting that only breaks events in specific places.

Spoiler warning: Answer BREAK_ONLY_BEFORE

Q3: In the inputs.conf, what is the full path of the network-logs script?

Let’s look at the inputs.conf file. It’s in the default directory of the fixit app.

cat /opt/splunk/etc/apps/fixit/default/inputs.conf
Spoiler warning: Answer /opt/splunk/etc/apps/fixit/bin/network-logs
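For reference, a scripted input stanza typically looks something like the sketch below. The script path matches the answer above, but the index, sourcetype, and interval values here are illustrative guesses rather than the room’s actual settings.

[script:///opt/splunk/etc/apps/fixit/bin/network-logs]
index = main
sourcetype = network_logs
interval = 5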

Q4: What regex pattern will help us define the Event’s start?

Hint: I used regex101 to come up with the answer. We need to break before each ‘[Network-log]’. Note: square brackets [] define a character class in regex, so we need to escape the literal brackets with backslashes.

Spoiler warning: Answer \[Network-log\]

Q5: What is the captured domain?

Hint: We don’t strictly need to implement the props.conf file yet, but one doesn’t currently exist and the logs are much easier to read with it. Let’s implement it.

cd /opt/splunk/etc/apps/fixit/default/  
echo -e "[network_logs]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = \[Network-log\]" > props.conf

Restart Splunk and our events should be correctly separated.

/opt/splunk/bin/splunk restart

We can see a domain appearing again and again.
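If you’d rather not eyeball it, a quick search-time extraction can tally the domains. The rex pattern below assumes the log format shown in Q6, and the field name resource is just an ad-hoc label:

index=main sourcetype=network_logs | rex "resource\s(?<resource>[^\s/]+)" | stats count by resource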

Spoiler warning: Answer Cybertees.THM

Q6: How many countries are captured in the logs?

To answer this question, we’re going to need to extract a ‘country’ field from the log. Let’s get to work making a regex that separates each field logically.

I’m going to set up Splunk to answer the next four questions, which all follow the pattern ‘How many x are captured in the logs?’.

Currently each event is one long string that includes a lot of information. One example from the log file is:

User named Patricia Allen from Custom department accessed the resource Cybertees.THM/signup.html from the source IP 192.168.1.3 and country Australia at: Wed Nov 22 05:01:31 2023

From this we can reason that some useful fields to extract would be:

  • Username
  • Department
  • Resource Accessed
  • Source IP
  • Country
  • Date
  • Time
  • Year

Note: In the task materials only 5 fields are suggested, and I couldn’t get my regex to work for all of the fields I just mentioned. I will leave it here because I’d love feedback if someone knows why it didn’t work. The first 5 fields worked fine, but I was getting literal $5, $6 output in Splunk for the latter fields.

I put the whole thing into regex101 and got to work designing a regex to separate out each value. I started with the Username, mostly borrowed from the exercise in the previous room.

Here’s the messy snake that I came up with:

User\snamed\s([\w\s]+)\sfrom\s([\w\s]+)\sdepartment\saccessed\sthe\sresource\s([^\s]+)\sfrom\sthe\ssource\sIP\s([\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}]+)\sand\scountry\s\n([\w\s]+)\sat:\s(.{3}\s.{3}\s\d{2})\s(\d{2}:\d{2}:\d{2})\s(\d{4})

I’m sure there’s a much neater way to do that, so please let me know your solution! I realise the IP address could be captured with just ([^\s]+), but this way seemed more explicit. What do you think? Is it better to be concise or explicit when dealing with regular expressions? I always prefer the extra clarity of longer expressions when writing code, but regular expressions often look obtuse anyway (to me). I’m torn on which is better and I’d love to know the best way.
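One small, certain improvement: inside square brackets, braces and dots are literal class members rather than quantifiers, so the IP portion above matches more characters than intended. A stricter alternative drops the outer class and escapes the dots:

(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})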

Now that we have the regex, let’s extract those fields. I counted the colour-coded groups in regex101 as a sanity check: 8 fields in total. Should Date and Year be separate? I’m not sure exactly how to combine them. I think they should probably be joined, because someone searching for events on, say, May 22nd 2022 could then search a single field; another user might mistakenly believe that values in the Date field are already distinguished by year and end up with values from multiple years in their search. I don’t think we need to do it for these questions, but I’ve noted it for further investigation. I imagine the solution would involve using sed to rewrite the string with date and year combined, then adjusting the regular expression to capture them as one field.
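Alternatively, Splunk can assemble a proper timestamp at search time without touching the raw data. A minimal sketch, assuming the log format above; the field names event_time and epoch are my own labels:

index=main sourcetype=network_logs | rex "at:\s(?<event_time>.+)$" | eval epoch=strptime(event_time, "%a %b %d %H:%M:%S %Y")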

While the regex I created seemed to work, all values from $5 (Country) onward were not processed correctly. I’m not sure what was going on; probably an error in my regex, or a difference in how Splunk interprets the regex. Please let me know if you can clear that up for me. I decided to focus on the fields suggested in the task, i.e.:

  • Username
  • Country
  • Source_IP
  • Department
  • Domain (Resource)

Note: I think the error may have been due to not adding the /g global flag at the end?
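For what it’s worth, transforms.conf doesn’t take inline regex flags like /g; the closest equivalent I’m aware of is the stanza option below, which makes Splunk apply the REGEX repeatedly to the event rather than only once:

REPEAT_MATCH = true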

So here’s my new regex:

User named ([\w\s]+)\sfrom\s([\w\s]+)\sdepartment[\w\s]+resource\s([^\s]+)\sfrom[\w\s]+IP\s(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})\sand country\s([\w\s]+)\sat.

A little shorter than the last one. We can write our transforms.conf file like so:

echo -e "[networks_custom_fields]
REGEX = User named ([\w\s]+)\sfrom\s([\w\s]+)\sdepartment[\w\s]+resource\s([^\s]+)\sfrom[\w\s]+IP\s(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})\sand country\s([\w\s]+)\sat.*
FORMAT = Username::\$1 Department::\$2 Domain::\$3 Source_IP::\$4 Country::\$5
WRITE_META = true" > transforms.conf

Next, link the transform in props.conf. The props.conf key for index-time transforms is TRANSFORMS- followed by a class name of our choosing:

echo "TRANSFORMS-networks = networks_custom_fields" >> props.conf
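At this point, cat props.conf should show something like:

[network_logs]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = \[Network-log\]
TRANSFORMS-networks = networks_custom_fields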

Now let’s make our fields.conf:

echo -e "[Username]
INDEXED=true
[Department]
INDEXED=true
[Domain]
INDEXED=true
[Source_IP]
INDEXED=true
[Country]
INDEXED=true" > fields.conf

And we’re ready to go!

/opt/splunk/bin/splunk restart
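A single query should now sanity-check all four new extractions at once (and, conveniently, answers Q6 to Q9 in one go):

index=main sourcetype="network_logs" | stats dc(Country) dc(Department) dc(Username) dc(Source_IP)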

Hint: index=main sourcetype="network_logs" | stats dc(Country). Or just look left: the value should appear next to the field name in grey on the left-hand side.

Spoiler warning: Answer 12

Q7: How many departments are captured in the logs?

Hint: Again, look left.

Spoiler warning: Answer 6

Q8: How many usernames are captured in the logs?

Hint: Left-hand side again, or use a distinct count, dc(Username), as before.

Spoiler warning: Answer 28

Q9: How many source IPs are captured in the logs?

Hint: Look at the left side.

Spoiler warning: Answer 52

Q10: Which configuration files were used to fix our problem? [Alphabetic order: File1, file2, file3]

Hint: We wrote to three files to answer the previous four questions. If you know your ABCs then you will know that f comes before p, which comes before t.

Spoiler warning: Answer fields.conf, props.conf, transforms.conf

Q11: What are the TOP two countries the user Robert tried to access the domain from?

[Answer comma-separated and in alphabetic order] [Format: Country1, Country2]

Hint: index=main sourcetype="network_logs" Username="Robert Wilson" | top limit=0 Country | fields Country count. The top two values are the answer.

Spoiler warning: Answer Canada, United States

Q12: Which user accessed the secret-document.pdf on the website?

Let’s use wildcards to search for a Domain value containing the secret-document.pdf string.

index=main sourcetype="network_logs" Domain="*secret-document.pdf*"

Spoiler warning: Answer Sarah Hall

Conclusion

It was really fun and educational to come up with the regular expression for this task. It’s elevated my regex skills and I feel much more fluent in the regex lingo. Of course, it has also taught me a lot about Splunk and how it can be configured to extract meaningful data.

That’s the end of the Advanced Splunk module; we are now experts… almost. Next up: ELK!