Learn How to Parse and Manipulate Data in Splunk



Introduction

Here we’re going to walk through the TryHackMe room Splunk: Data Manipulation, part of the Advanced Splunk module on the SOC Level 2 learning path.

We’re expected to already have knowledge of regex and Splunk basics.


Task 2 - Scenario and Lab Instructions

We are John (not Bob).

Here we can start the machine. Unfortunately we’re not given login credentials. Looking at the /etc/shadow file, it appears the ubuntu user cannot authenticate with a password: the ! preceding the password’s SHA-512 hash denotes that password authentication is disabled for this user.


Q1: Connect to the Lab.

If you do want to connect via SSH, you can follow the method I used here (let me know if you have another way in!): I reset the ubuntu user’s password and SSHed in with the new one.

From the machine’s terminal in split screen browser:


sudo passwd ubuntu # Enter a new password when prompted

We can now ssh in from our machine over the VPN using:


ssh ubuntu@VM_IP_ADDRESS # enter our new password when prompted
sudo su # Elevate to root when ssh is connected

No answer needed


Q2: How many Python scripts are present in the ~/Downloads/scripts directory?

cd Downloads/scripts
ls

Hint: count ‘em!

Spoiler warning: Answer 3

Just a quick note here: we’ll need these scripts later on, so once the DataApp exists (we create it in Task 6) we can copy them into its bin directory.

cp -a /home/ubuntu/Downloads/scripts/. /opt/splunk/etc/apps/DataApp/bin/ # -a preserves file attributes; the trailing /. copies the directory's contents

Task 3 - Splunk Data Processing: Overview

Some reading to do. Don’t skip it!


Q1: Understand the Data Processing steps and move to the next task.

No answer needed


Task 4 - Exploring Splunk Configuration Files

More reading about config files and stanzas. Stanzas are the bracketed [section] headings in Splunk’s .conf files; the settings beneath each stanza control how the matching data is parsed, transformed and indexed.

Q1: Which stanza is used in the configuration files to break the events after the provided pattern?

Hint: ‘Break’ and ‘after’ give us a good hint here.

Spoiler warning: Answer BREAK_ONLY_AFTER

Q2: Which stanza is used to specify the pattern for the line break within events?

Hint: It’s another breaker and the task info spells it out.

Spoiler warning: Answer LINE_BREAKER
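
To see how these two settings sit together, here’s a hypothetical props.conf stanza (the sourcetype name and patterns are my own illustration, not from the task):

[example_sourcetype]
# Break the raw stream into lines on the newline characters captured in the first group
LINE_BREAKER = ([\r\n]+)
# Start a new event only after a line matching this pattern
BREAK_ONLY_AFTER = END_OF_RECORD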

Q3: Which configuration file is used to define transformations and enrichments on indexed fields?

Hint: the keyword here is transformations.

Spoiler warning: Answer transforms.conf

Q4: Which configuration file is used to define inputs and ways to collect data from different sources?

Hint: keyword: inputs

Spoiler warning: Answer inputs.conf

Task 5 - Creating a Simple Splunk App

We are told that Splunk apps live here:

/opt/splunk/etc/apps

Q1: If you create an App on Splunk named THM, what will be its full path on this machine?

Hint: It’ll be in the apps directory.

Spoiler warning: Answer /opt/splunk/etc/apps/THM
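
If you’d rather build the skeleton by hand than through the web UI, here’s a minimal sketch from a root shell (Splunk normally generates more than this when you create an app in the browser):

mkdir -p /opt/splunk/etc/apps/THM/default /opt/splunk/etc/apps/THM/bin # default holds the .conf files, bin holds scripts
ls /opt/splunk/etc/apps # THM now sits alongside the other apps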

Task 6 - Event Boundaries - Understanding The Problem

Do work through the examples, creating the example DataApp - you’ll need it later on.

Here’s a summary of the steps from creation to event boundaries.


/opt/splunk/bin/splunk start # Start Splunk from a root shell

  • In the Splunk home interface, click the cog wheel next to Apps as directed
  • Name the app DataApp (to follow the example) and fill the remaining fields with placeholder values
  • After saving your new app, click Launch in the Actions column
  • From your root shell prompt, navigate to the app’s default directory and open the inputs.conf file:
cd /opt/splunk/etc/apps/DataApp/default
vim inputs.conf # (or whatever editor you prefer)
  • Copy in the 4 lines of code that will run the vpnlogs script (a sketch follows after this list)
  • :wq (save and quit)
mv /home/ubuntu/Downloads/scripts/vpnlogs ../bin/ # move the script to the app's bin directory

/opt/splunk/bin/splunk restart # restart Splunk; give it a couple of minutes before trying to access it in the browser
  • Search: index=main sourcetype=vpnlogs and set the time range to All time (real-time)
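
For reference, the inputs.conf entry follows the same shape as the one we write for purchase-details in Task 9. A sketch of what those 4 lines most likely look like (the exact values come from the task materials; source = vpn is my assumption):

[script:///opt/splunk/etc/apps/DataApp/bin/vpnlogs]
index = main
source = vpn
sourcetype = vpnlogs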

Okay, we have logs coming in, but there seem to be multiple data points in each log. We need to establish event boundaries, which we do from a props.conf file.

vim props.conf # (or your preferred editor)
  • Copy in the 3 lines of code given in the task instructions (see the sketch after this list)
/opt/splunk/bin/splunk restart # restart Splunk again
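
Based on the questions below, the three lines we add to props.conf are most likely:

[vpnlogs]
SHOULD_LINEMERGE = true
MUST_BREAK_AFTER = (CONNECTED|DISCONNECTED)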

You should see the logs coming in nicely formatted now.


Q1: Which configuration file is used to specify parsing rules?

Hint: Back in Task 3 we learned about the configuration file that defines data parsing settings for specific sourcetypes or data sources.

Spoiler warning: Answer props.conf

Q2: What regex is used in the above case to break the Events?

Hint: In the task materials a site is used to generate a regex that finds the event boundary - either CONNECTED or DISCONNECTED

Spoiler warning: Answer (CONNECTED|DISCONNECTED)

Q3: Which stanza is used in the configuration to force Splunk to break the event after the specified pattern?

Hint: We can see from the screenshots that multiple events are being treated as a single event. We need to get in there and break each event after its ‘Action’ value.

Spoiler warning: Answer MUST_BREAK_AFTER

Q4: If we want to disable line merging, what will be the value of the stanza SHOULD_LINEMERGE?

Hint: We can see in the props.conf file that the current value of SHOULD_LINEMERGE is true. The opposite of true is…

Spoiler warning: Answer false

Task 7 - Parsing Multi-line Events

Let’s take a look at some longer events and how the event boundaries might be established and defined.


Q1: Which stanza is used to break the event boundary before a pattern is specified in the above case?

Hint: Look at the section that explains how to define the event boundary and shows the entry for the props.conf file. The last line is our man.

Spoiler warning: Answer BREAK_ONLY_BEFORE

Q2: Which regex pattern is used to identify the event boundaries in the above case?

Hint: Look at the props.conf file - what value is assigned to the stanza? Note that we need to use a backslash ‘\’ to escape each square bracket. Simply put, we need to tell the regex engine that these square brackets do not define a character class (their usual meaning in regex) but are literal square brackets.

Spoiler warning: Answer \[Authentication\]
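
Putting the two answers together, the props.conf entry for these multi-line events looks something like this (the sourcetype name auth_logs is my assumption; the two settings come from the task):

[auth_logs]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = \[Authentication\]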

Task 8 - Masking Sensitive Data

Everyone has some sensitive data, whether it’s financial information, an important private number like a US Social Security number, or even just an email address or telephone number you don’t want published for everyone to see.

Handling sensitive data is also often a matter of standards compliance and sometimes legal compliance.


Q1: Which stanza is used to break the event after the specified regex pattern?

Hint: Take a look at the props.conf file

Spoiler warning: Answer MUST_BREAK_AFTER

Q2: What is the pattern of using SEDCMD in the props.conf to mask or replace the sensitive fields?

Hint: Again look in the props.conf file

Spoiler warning: Answer s/oldValue/newValue/g
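
Applied to our purchase logs, a masking rule following that pattern might look like this (the class name maskCard and the exact regex are my own illustration, keeping the last four digits visible):

[purchase_logs]
SEDCMD-maskCard = s/\d{4}-\d{4}-\d{4}-/XXXX-XXXX-XXXX-/g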

Task 9 - Extracting Custom Fields

There seems to be a typo where we are instructed to create a fields.conf file: at one point it is referred to as ‘Fields.conf’. I think the lowercase version should be used, if only for consistency with the other .conf file names used in these examples.

To answer the questions in this task we’re going to need to apply what we learned to extract fields from the purchase logs.

This task took me quite a while because I kept getting pulled away just as I got the machine set up. I’ll assume that anyone following along is also starting this task from a freshly spawned VM, so here are the steps from a bare machine.

In the terminal:

sudo su
/opt/splunk/bin/splunk start

In the browser:  

  • Navigate to http://MACHINE_IP:8000  
  • Create the DataApp  
  • Launch the DataApp  

Back in the terminal as root:

cp -a /home/ubuntu/Downloads/scripts/. /opt/splunk/etc/apps/DataApp/bin/
cd /opt/splunk/etc/apps/DataApp/default

Write the inputs.conf file:

echo -e "[script:///opt/splunk/etc/apps/DataApp/bin/purchase-details]
interval = 5
index = main
source = purchase_logs
sourcetype = purchase_logs
host = order_server" > inputs.conf

Write the props.conf file:

echo -e "[purchase_logs]
SHOWLD_LINEMERGE=true
MUST_BREAK_AFTER = \d{4}\.
TRANSFORM-purchase = purchase_custom_fields" > props.conf

Write the transforms.conf file:

echo -e "[purchase_custom_fields]
REGEX = User\s([\w\s]+).(made a purchase with credit card).(\d{4}-\d{4}-\d{4}-\d{4}).
FORMAT = Username::$1 Message::$2 CardNumber::$3
WRITE_META = true" > transforms.conf

Write the fields.conf file:

echo -e "[Username]
INDEXED = true
[Message]
INDEXED = true
[CardNumber]
INDEXED = true" > fields.conf

Restart splunk:

/opt/splunk/bin/splunk restart

As you can see, I only added the purchase_logs sourcetype to the inputs.conf file for this task. In the props.conf file I pointed the sourcetype at the relevant stanza in transforms.conf. In the transforms.conf file we define the regex that separates the fields; I realise that matching the message string literally is probably not reliable in a real situation where there are multiple payment methods, but it works for our task here. transforms.conf also defines the format of the data, specifying how the captured strings map to field names. Then, in the fields.conf file, we set every field defined in transforms.conf to be indexed. Finally we restart Splunk.


Q1: Create a regex pattern to extract all three fields from the VPN logs.

No answer needed - but in any case, the regex is given earlier in the task materials.


Q2: Extract the Username field from the sourcetype purchase_logs we worked on earlier. How many Users were returned in the Username field after extraction?

Hint: If you have followed the steps above you should be good to go.

  • In Splunk, search: index=main sourcetype="purchase_logs"
  • The number of Usernames appears in the Interesting Fields section on the left side.

Spoiler warning: Answer 14

Q3: Extract Credit-Card values from the sourcetype purchase_logs. How many unique credit card numbers are returned against the Credit-Card field?

Hint: Again, the number of unique values appears in the interesting fields section on the left side.

Spoiler warning: Answer 16
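
If you prefer counting from the search bar rather than the fields sidebar, here are a couple of quick SPL sketches using the field names we defined in transforms.conf (note the question says Credit-Card, but our config named the field CardNumber):

index=main sourcetype="purchase_logs" | stats dc(Username) # distinct count of usernames
index=main sourcetype="purchase_logs" | stats dc(CardNumber) # distinct count of card numbers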

Recap and Conclusion

I kept getting pulled away in this room, so it took a while to get through. But it was a great learning tool, and I feel much more comfortable configuring Splunk to ingest new and different data sources, creating events with meaningful boundaries, and using regular expressions to extract interesting fields. Splunk also has a built-in ‘Extract New Fields’ function which could be worth exploring for those who dislike writing their own regular expressions. I quite enjoy the challenge of learning regex; with the help of regex101, I find it fun and satisfying.

Next up in the series - Fixit