Splunk: Data Manipulation

Saniye Nur
9 min read · Oct 22, 2023


Data manipulation in products like Splunk refers to various operations performed to make data more meaningful and useful. These products are data analytics and security information management platforms that can collect, index, and analyze large volumes of data from different sources. Here are some data manipulation operations performed in products like Splunk:

  1. Data Filtering: Filtering out unnecessary or unwanted data from data sources can make the analysis process more effective. For example, you can select specific user transaction logs or data within a certain date range.
  2. Data Transformation: Data can be converted or reshaped into different formats. For instance, you can convert text data into numerical data or change the format of date and time information.
  3. Data Integration: You can combine data from different sources by performing data linking. For example, you can merge a user’s web interaction data with their mobile app interaction data.
  4. Data Enrichment: Data can be augmented to add more meaning to what already exists. For instance, you can map IP addresses to geographic locations or add descriptions to product names.
  5. Aggregation: Data can be grouped and summarized to obtain insights from large datasets. For example, you can calculate the number of transactions or total revenue for a specific date range.
  6. Data Cleansing: Data cleaning operations can be performed to correct data errors or fill in missing information. For example, you can fill in missing data or rectify data input errors.
  7. Custom Queries and Expressions: You can process data using custom queries or expressions in databases or log files. This is a powerful method for extracting or processing specific data segments.

Products like Splunk provide user-friendly tools and query languages for performing these data manipulation operations. These operations are crucial for data analysis, reporting, security monitoring, and many other applications. By making data more meaningful and usable, organizations can be better equipped to support decision-making and quickly identify issues.

Splunk uses a customizable query language known as SPL (Splunk Processing Language) to perform these and similar data manipulation operations. SPL is a powerful tool within Splunk for data processing, analysis, and querying. Data manipulation is an essential part of the data analysis process in Splunk, allowing users to better understand their data, identify issues, and make informed decisions.
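
As a rough illustration, here is a hedged SPL sketch that combines filtering, transformation, and aggregation in a single search (the index, sourcetype, and field names such as web_logs, status, and bytes are assumptions for illustration, not from a specific dataset):

index=web_logs sourcetype=access_combined status>=500
| eval response_mb = round(bytes / 1024 / 1024, 2)
| stats count AS error_count sum(response_mb) AS total_response_mb BY host

Each pipe stage narrows or reshapes the data, which is the essence of data manipulation in SPL.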

Now we are going to practice Splunk data manipulation.

Topics:

  1. Understanding the Parsing of Events in Splunk
  2. Significance of Configuration Files: inputs.conf, transforms.conf, and props.conf
  3. Extracting Custom Fields and Their Application for Filtering
  4. Uncovering Timestamps in Event Logs
  5. What is a Stanza?
  6. Responding to a security incident by fixing Event Boundaries

You are a SOC Analyst.
Splunk needs to be properly configured to parse and transform the logs appropriately. Some of the issues being highlighted are:

1-Event Breaking:

As a SOC Analyst, I’d like to clarify that “event breaking” refers to the process of breaking down a stream of raw log data into individual events. In the context of configuring Splunk, it’s crucial to define how log entries are separated and organized into discrete events. This separation allows Splunk to interpret and analyze each event separately, making it easier to extract meaningful information and apply transformations as needed. Proper event breaking is essential for accurate parsing and analysis of log data within Splunk.
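
A minimal, hedged sketch of what event breaking can look like in props.conf (the sourcetype name and the date pattern are assumptions for illustration):

[my_custom_logs]
# Do not merge lines; break the stream wherever a new line starts with a date
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}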

2-Multi-line Events:

In the context of log and event data, “multi-line events” refer to log entries that span multiple lines rather than being contained within a single line. In many log files, especially those generated by applications or systems, a single log entry may be broken into multiple lines for readability or due to the formatting used.

Handling multi-line events can be a challenge in log analysis and management. In Splunk, for example, you need to properly configure log ingestion to ensure that these multi-line events are correctly recognized, treated as a single event, and parsed for analysis. Incorrect handling of multi-line events can lead to fragmented and inaccurate log data analysis.
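
A hedged props.conf sketch for multi-line events, assuming each event begins with a bracketed timestamp (the sourcetype name and the regex are illustrative):

[my_multiline_app]
# Merge continuation lines, but start a new event when a line begins with [YYYY-MM-DD
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = ^\[\d{4}-\d{2}-\d{2}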

3-Masking:

“Masking” refers to the process of concealing, hiding, or obfuscating sensitive or confidential information within data, typically to protect privacy or security. This is often done by replacing sensitive data with a placeholder or a character, such as asterisks (*) or Xs, so that the original information is not exposed. The purpose of masking is to prevent unauthorized individuals or systems from accessing or viewing sensitive data.

Common use cases for data masking include protecting personal identifiable information (PII), financial data, passwords, and other confidential information. Data masking helps organizations comply with data privacy regulations and secure sensitive data during testing, development, or data sharing processes.
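
In Splunk this is commonly done with a SEDCMD setting in props.conf. A minimal sketch, assuming a hypothetical sourcetype whose events contain 16-digit card numbers:

[my_payment_logs]
# Replace the first 12 digits of a 16-digit number with x, keeping the last 4
SEDCMD-mask-card = s/\d{12}(\d{4})/xxxxxxxxxxxx\1/g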

4-Extracting custom fields:

“Extracting custom fields” refers to the process of identifying, capturing, and isolating specific pieces of information or attributes from raw data, typically in the context of log or event data. This is done to make it easier to search, analyze, and visualize the data based on these custom attributes.

In the context of tools like Splunk or log management systems, extracting custom fields involves defining rules or patterns to recognize and extract particular data elements from log entries. These data elements could be anything of interest, such as IP addresses, usernames, error codes, transaction IDs, or any other relevant information.

Custom field extraction is essential for making sense of large volumes of log data by breaking it down into structured and manageable components. It allows for more efficient querying, reporting, and analysis, making it easier to monitor and troubleshoot systems and applications, as well as gain insights from the data.
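
At search time, the same idea can be sketched with the SPL rex command (the index, sourcetype, and the layout of the raw event here are assumptions):

index=main sourcetype=my_app_logs
| rex field=_raw "user=(?<username>\w+)\s+src=(?<src_ip>\d{1,3}(?:\.\d{1,3}){3})"
| stats count BY username, src_ip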

Splunk stands as a robust data analytics tool used to search, monitor, and analyze extensive volumes of machine-generated data. Data parsing in Splunk involves extracting the relevant data fields and converting the data into an organized structure for streamlined analysis. Below, with support from TryHackMe, you'll find a walkthrough of how data is parsed in Splunk, with insights into the role of props.conf.

Let’s start by understanding data formats.

Begin by gaining a clear comprehension of the data format you intend to parse. Splunk offers compatibility with a multitude of data formats, including CSV, JSON, XML, syslog, and numerous others. Ascertain the specific data source format and pinpoint the essential fields you aim to extract.

Second, define the sourcetype. Sourcetype, within the realm of Splunk, serves as the defining template for the data being indexed, ensuring the application of pertinent parsing rules. In instances where your data source lacks a pre-existing sourcetype, Splunk empowers you to craft a custom one to suit your needs.

Why the props.conf?

The props.conf file is a structure that defines the properties (props) and settings that Splunk will use to analyze data. This file is used to apply certain rules or functions during the processing of data. For example, this file can be used to sort data by timestamp, field parsing, creating custom fields, and more.

The props.conf file allows you to give Splunk direction to better understand the data and analyze it more effectively. You can make adjustments through this file based on data types, sources, or specific requirements.

The props.conf file increases Splunk’s flexibility and customization capabilities and helps you optimize your data analysis processes. When editing this file, you can define specific rules for how Splunk should process data.

It resides in the $SPLUNK_HOME/etc/system/local directory.

[source::/path/to/your/data] 
sourcetype = your_sourcetype

In this example, /path/to/your/data is the path to your data source and your_sourcetype is the name of the sourcetype you want to assign to the data.

For Field Extractions:

[your_sourcetype] 
EXTRACT-fieldname1 = regular_expression1
EXTRACT-fieldname2 = regular_expression2

Save and Restart Splunk.

Splunk Configuration Files:

Splunk has many configuration files: alert_actions.conf, app.conf, audit.conf, authentication.conf, authorize.conf, bookmarks.conf, checklist.conf, and so on.

List of configuration files => https://docs.splunk.com/Documentation/Splunk/9.1.1/Admin/Listofconfigurationfiles

  • Example: Suppose you want to forward your indexed data to a remote Splunk indexer. You can configure outputs.conf:
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = remote_indexer:9997
  • Example: Suppose you want to enable LDAP authentication for Splunk users. You can configure authentication.conf:
[authentication]
authType = LDAP
authSettings = my_ldap_strategy

[my_ldap_strategy]
SSLEnabled = true

What is a Stanza?

“Stanza” is a term commonly used, particularly in software like Splunk, to refer to sections in configuration files that define specific settings or properties. A stanza is a structure where customized settings and filters are grouped for a particular component, data source, or process.

In Splunk, stanzas appear throughout the configuration files; two common examples are “inputs.conf” and “props.conf.” Here is an example of a stanza in each:

Example 1 — A stanza within “inputs.conf”:

[monitor:///var/log/application.log]
disabled = false
index = my_index
sourcetype = my_app_logs

This stanza is configured to monitor the /var/log/application.log file, and settings like "index" and "sourcetype" are specified within this stanza.

Example 2 — A stanza within “props.conf”:

[my_sourcetype]
TRANSFORMS-sethost = set_host

This stanza applies a transformation process for a specific “sourcetype” called “my_sourcetype.” The transformation process is defined by another stanza, “set_host,” which is used to apply transformations to data.
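
The referenced “set_host” stanza would live in transforms.conf. A hedged sketch of what it might look like, assuming the raw event carries a “host=” key (the regex and format here are illustrative):

[set_host]
# Pull the value after "host=" from the raw event and use it as the event's host
REGEX = host=(\S+)
FORMAT = host::$1
DEST_KEY = MetaData:Host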

Stanzas make Splunk configuration modular and readable. Each stanza groups together settings for a specific data source or process, allowing for customization and configuration of Splunk to meet specific requirements.

Right now,

I have a raw network log like this. I will try to ingest it into Splunk correctly and fix the corrupted network log.

Splunk cannot determine the event boundaries, as the events are coming from an unknown device. Therefore, the event boundaries need to be fixed.

Now I have logged into my local Splunk instance and, as you can see in the example, the events are broken.

As a result of my research, I found that we can use BREAK_ONLY_BEFORE in props.conf to fix the event boundaries.
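
A minimal props.conf sketch for this fix, assuming the sourcetype is called network_logs and that every event starts with a marker such as [Network-log] (both the sourcetype name and the marker are assumptions; use whatever your events actually begin with):

[network_logs]
# Merge the incoming lines and start a new event only at the event marker
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = ^\[Network-log\]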

Once the event boundaries are defined, it is time to extract the custom fields to make the events searchable (a sketch of the extraction follows the list below):

  • Source_IP
  • Destination_IP
  • Country
  • Username
  • Domain
  • Timestamp
  • Port Information
  • Transaction Time
  • Error or Warning Messages
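
A hedged props.conf sketch for a few of these fields (the sourcetype name and the exact labels in the raw events, such as "Source_IP:" and "Username:", are assumptions for illustration):

[network_logs]
EXTRACT-src = Source_IP:\s(?<Source_IP>\d{1,3}(?:\.\d{1,3}){3})
EXTRACT-dest = Destination_IP:\s(?<Destination_IP>\d{1,3}(?:\.\d{1,3}){3})
EXTRACT-user = Username:\s(?<Username>\S+)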

I found the directory containing the inputs.conf file for the network log (look inside the app's /bin and /default directories):

Let’s look inside:

This one seems fine; it fits the expected format.
Now let's look at props.conf. There is no props.conf yet, so we will create one.

Save the file, and do the same for transforms.conf.

Yes, saved.

And now let's restart Splunk to get it up and running again with the new configuration.
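
On a typical installation, the restart can be done from the command line:

$SPLUNK_HOME/bin/splunk restart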

Now we can go back to the Splunk dashboard and confirm that the fields that were previously missing or corrupted are parsed correctly.

To summarize,

I need to think through all kinds of scenarios during an incident, so in this case I need to correct the data and find what I am looking for.

Examining the Data:
As a first step, I need to examine the bad data. In Splunk, I need to identify sample events to understand what the bad data looks like, because knowing what corrupted data looks like will make the remediation process easier.

Going to the Data Source:
My next step is to try to identify the source of the corrupted data. By going to the data source, I can make corrections there or try to get more accurate data.

Then I can correct and process my corrupted data with SPL.
SPL is a useful tool for data manipulation and transformation.
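
A hedged SPL sketch of this kind of cleanup (field names like Country, Source_IP, and Username come from the list above; the index and sourcetype are assumptions):

index=main sourcetype=network_logs
| rex field=_raw "Country:\s(?<Country>[A-Za-z ]+)"
| eval Country = upper(trim(Country))
| search Country=*
| table _time, Source_IP, Destination_IP, Username, Country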

Before correcting the data, it is important to make a backup of the original corrupted data. This way, you can compare it with the corrected data and undo mistakes more easily.

After applying corrections, I should run tests to make sure the data has been fixed properly, verifying its accuracy by checking specific conditions or examining sample data.
