Zivaro Blog

Fake It to Make It: Tips and Tricks for Generating Sample Splunk Data Sets


As you continue to work with Splunk and the number of underlying use cases within your organization grows, you will ultimately encounter a situation where you need to generate some “fake” data. Perhaps you need to create a visualization to use for a proof of concept; perhaps you are trying to master a specific search or visualization; or perhaps you quickly need a few pieces of data for demonstrating a feature to a colleague.

As a Splunk Solution Architect and Consulting Engineer at GTRI, I often make use of synthesized data for all of these reasons and many more. While there are many methods for obtaining sample data for your Splunk needs, in this article I will focus on two methods for creating sample Splunk data sets that do not require any indexing.

Generating Time-series Data for Sample Visualizations

If you’ve worked with Splunk for very long, you quickly realize that users can be VERY particular about the format and appearance of visualizations. The search below enabled me to quickly generate a few days of hourly data points that I could use to iteratively tweak the colors and chart format for a customer to review.

This search uses a combination of the gentimes, eval, and chart commands to produce a visual output that can be added to a dashboard prototype.

| gentimes start=07/23/2016 increment=1h | eval myValue=random()%500 | eval myOtherValue=random()%300 | eval starttime=strftime(starttime, "%m-%d-%Y %H:%M:%S") | chart max(myValue) AS myValue max(myOtherValue) AS myOtherValue over starttime

Let’s break down this search:

The gentimes command on its own creates a series of timestamps beginning with the date specified in the start argument. In this example, I’ve added the increment argument to further specify the interval for each timestamp (“1h” or hourly in this case). The net effect is to create 1-hour timestamps up until the current date/time.
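To make the behavior concrete, here is a rough Python sketch of what gentimes with increment=1h produces (the function name and structure are my own, not Splunk's): a list of hourly timestamps from the start date up to the current time.

```python
from datetime import datetime, timedelta

def gen_hourly_timestamps(start, end=None):
    """Rough analogue of `| gentimes start=... increment=1h`:
    hourly datetimes from `start` up to `end` (defaults to now)."""
    end = end or datetime.now()
    stamps = []
    current = start
    while current <= end:
        stamps.append(current)
        current += timedelta(hours=1)
    return stamps

stamps = gen_hourly_timestamps(datetime(2016, 7, 23))
# Format matches the strftime mask used in the search above.
print(stamps[0].strftime("%m-%d-%Y %H:%M:%S"))  # 07-23-2016 00:00:00
```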

The search pipes the output of the gentimes command (hourly timestamps) into a pair of eval commands that simply create two fictitious fields and values to associate with each generated timestamp. For these first two eval commands, I used the random function with the modulo operator (%<integer>) to return a random number between 0 and one less than the <integer> I specified.
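The modulo trick works the same way in any language. A small Python sketch (using Python's own random module rather than Splunk's eval function) shows why random()%500 always lands in the range 0 to 499:

```python
import random

# Analogue of `eval myValue=random()%500`: a large pseudo-random
# integer taken modulo N always yields a value in 0..N-1.
values = [random.getrandbits(31) % 500 for _ in range(1000)]
print(min(values) >= 0 and max(values) < 500)  # True
```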

The chart command simply outputs my fictitious data into a tabular format that can be used to render visualizations via Splunk’s easy-to-use visualization tools.

Executing the search above lets you quickly generate charts like the one in the screenshot below that can be used for tasks such as modifying simple XML to specify color settings.

[Screenshot: chart rendered from the generated sample data]

Variations of this search can be used to create visualizations that mimic a data source a customer uses (or plans to use) but cannot provide. The search can easily be modified to create any number of fields by adding eval statements, and a large number of discrete events can be generated quickly by adjusting the start and increment arguments to the gentimes command. If you have a longer-term need for the data, you can even write it to an index or summary index.

Creating Tabular Data

In some instances, generating a small set of tabular data may prove useful. I often work with customers who want to render Splunk search results in a table with no drilldown. With this quick and simple search, I can generate a small number of results in a tabular format. The search is particularly useful because it creates results with a wide variety of data types: timestamps, counts, string data, numerical data, and both single- and multi-value fields.

| noop | makeresults | eval field1 = "abc def ghi jkl mno pqr stu vwx yz" | makemv field1 | mvexpand field1

| eval multiValueField = "cat dog bird" | makemv multiValueField

| streamstats count | eval field2 = random()%100

| eval _time = now() + random()%100 | table _time count field1 field2 multiValueField

At first pass, there appears to be a lot going on here. In reality, it isn’t too complicated.

The noop command is documented as a Splunk debugging command; in practice, I have only ever used it for generating sample data in scenarios such as this one. In distributed environments, it prevents the search from being sent out to the indexers. It is used here for speed, as it essentially tells Splunk to perform no operation (hence “noop”) before handing off the result.

The makeresults command is required here because the subsequent eval command expects (and requires) a result set on which to operate; otherwise it will raise an error. It creates the specified number of results (in this case the default of one) and passes them to the next pipe in the search.

The eval field1 command creates a text field with the value “abc def ghi … … …”.

The makemv command converts field1 from a single-value field to a multivalue field by breaking up the value using the default whitespace delimiter.

The magic happens with the mvexpand command. It takes the values of a multivalue field (created with the preceding makemv command) and creates an individual event for each value. Here, this results in the creation of nine separate events.
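The makemv/mvexpand pair can be sketched in Python (this is an illustrative analogue, not Splunk code): split the string on whitespace, then emit one row per value.

```python
# Analogue of `| makemv field1 | mvexpand field1`:
# split on whitespace, then fan out into one event per token.
field1 = "abc def ghi jkl mno pqr stu vwx yz"
tokens = field1.split()                      # makemv step
rows = [{"field1": t} for t in tokens]       # mvexpand step
print(len(rows))  # 9
```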

The eval multiValueField = “cat dog bird” | makemv multiValueField commands simply create an additional field and populate it with multiple values.

The streamstats count command calculates a running statistic (here, a cumulative count of events) for each event returned by the search. As shown above, the mvexpand command expands the text string into nine total events; for each of them, streamstats count adds a field representing the total number of events returned thus far.
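A minimal Python sketch of that running count (again an analogue of the SPL behavior, not Splunk itself): each event is annotated with how many events have been seen so far.

```python
# Analogue of `| streamstats count`: annotate each event with a
# running count of events processed up to and including it.
events = [{"field1": t} for t in "abc def ghi jkl mno pqr stu vwx yz".split()]
for i, event in enumerate(events, start=1):
    event["count"] = i
print([e["count"] for e in events])  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```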

Using eval field2 creates a fictitious numerical field whose value will be a number between 0 and 99. This is the same technique used in the previous search above.

The eval _time = now() + random()%100 command creates pseudo-random timestamps (zero to 99 seconds after the current time) for each of the nine events.

The final table command simply specifies the fields and their order for display.

The net result is the table below. You could also use the chart command to render it as a pie chart or other visualization:

[Screenshot: table of the generated sample results]

But What If I Need to…

The two techniques discussed in this article are quick, versatile methods for generating usable sample data for a variety of purposes. They won’t cover every conceivable scenario, however; there are certainly times when sample data sets of a specific source and format are the only way to fulfill a request. If you have questions about Splunk data sets, feel free to connect with me on LinkedIn.

Scott DeMoss is a Solution Architect for Data Center and Big Data in Professional Services at GTRI.
