In the post on searching open sources for information about malware samples, I briefly mentioned YARA rules and described the basics of using them in the context of Hybrid Analysis. However, this tool is so important and universal in the work of a CTI analyst, incident responder or threat hunter that it definitely deserves a separate piece.
Starting from the basics, YARA is a tool that lets you search for and classify files of various types based on the strings of data they contain and, with its extended functionality, on the features of the file itself. You can think of it as an extended version of the grep tool known from Linux systems, which searches files for specific content. Let's take a look at how a YARA rule is structured, section by section.
In our example we start with the import statement, which lets us use modules that extend rules with additional functions. In this case, we use the PE module, which adds functionality for analyzing executable files (Portable Executable), but we will see it in use only in the section specifying the rule conditions.
We start the actual rule with its name, which we give after the "rule" keyword, and then open the curly brace that begins the rule body. The first part is usually the "meta" section. Although it is optional, adding it is definitely good practice. Let's put there the date the rule was created, our contact details (let's finally boast about what capable detection engineers we are 🙂), a short description of what the rule is for, and the hashes of the files it was based on. This greatly helps those who will use our work to hunt for threats.
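As a rough sketch of the parts described so far (the import, the rule name, and the meta section), the beginning of a rule file could look like the following. The rule name and every meta value are placeholders invented for illustration, and the condition is only a stub so that the rule compiles; the strings and the real condition are discussed further on.

```yara
import "pe"

rule Example_Malware_Family
{
    meta:
        // All values below are placeholders, not data from a real sample
        date = "YYYY-MM-DD"
        author = "Your Name <you@example.com>"
        description = "Detects samples of an example malware family"
        hash = "<SHA-256 of the sample the rule was built from>"

    condition:
        // Stub that never matches; to be replaced by real conditions
        false
}
```

The meta keys are free-form, so names such as "date" or "hash" are a convention rather than something YARA enforces.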
The next two sections define the actual detection scope of the rule. The first is strings, where we specify the data we want to search for. YARA supports three kinds of strings: text, hexadecimal, and regular expressions.
Text strings are the most basic format, and also the one we will use most often. Searching for unusual words, sentences embedded in the code, or characteristic typos very often makes it possible to find related malware samples and track developers' activity. Adding a text value is simple: we use the $ character to mark a string, follow it with the name we will use to refer to it in the conditions, and after the = sign give the value in quotation marks. Additionally, we can use modifiers that tell YARA how to interpret the value. Here is what they do:
- nocase - case-insensitive matching
- ascii - the text is matched as ASCII-encoded data (the default)
- wide - the text is matched in its two-bytes-per-character form, as used by UTF-16 strings in Windows binaries
- xor - the text is also matched after being XORed with a key that is one byte long
- base64 - the text is matched in its base64-encoded form; base64wide is also available
- fullword - the text matches only when it is delimited by non-alphanumeric characters, such as "."
- private - the string is not displayed in the results
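To make the modifiers listed above more concrete, here is a minimal sketch of a strings section that uses each of them. All string values and the rule name are hypothetical and chosen only to show the syntax, and the sketch assumes a reasonably recent YARA version, since the newer modifiers are not available in old releases.

```yara
rule Example_Text_Modifiers
{
    strings:
        // Hypothetical values, not taken from any real sample
        $s1 = "example typo strng" nocase     // matches any mix of upper and lower case
        $s2 = "cmd.exe" ascii wide            // matches both the ASCII and the wide form
        $s3 = "http://" xor                   // matches the text XORed with any one-byte key
        $s4 = "config" base64                 // matches base64-encoded forms of the text
        $s5 = "admin" fullword                // must not be part of a longer word
        $s6 = "hidden marker" private         // counts in the condition, hidden in the output

    condition:
        any of them
}
```

Modifiers can be combined, as with ascii wide above, although not every combination is accepted by YARA.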
Another type of data is hexadecimal strings, popularly known as hexes. The syntax is very similar here, but we replace the quotes with curly braces. Since the hex patterns found in files or memory dumps will not always be neatly arranged, we can use a number of helpful constructs. Let's look at an example value:
{F4 23 (62 B4 | 56) 45 C4 [2-4] B3 [8-] ?? D1 [-] 7E}
In addition to the "normal" hex values, I have included a few special entries, so let's analyze them one by one:
- (62 B4 | 56) - alternatives, so the pattern matches if either "62 B4" or "56" appears here.
- [2-4] - a jump: any sequence of 2 to 4 bytes can appear here.
- [8-] - similar to the above, but with no upper bound: 8 or more arbitrary bytes.
- ?? - any byte; we can also wildcard a single nibble, as in D? or ?4.
- [-] - again any bytes, but of any length, from zero to infinity. Note that YARA does not accept a jump as the first or last element of a hex string, so at least one byte has to follow it.
For hexadecimal values, we can also use the private modifier so that the value is not displayed in the results.
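The hex constructs described above could be placed in a rule roughly as follows. The byte values are made up, the rule and string names are hypothetical, and the trailing 7E in the first pattern is added only because a hex string cannot end on a jump.

```yara
rule Example_Hex_Pattern
{
    strings:
        // Alternatives, bounded and unbounded jumps, and a full-byte wildcard
        $h1 = { F4 23 ( 62 B4 | 56 ) 45 C4 [2-4] B3 [8-] ?? D1 [-] 7E }
        // Nibble wildcards (D? and ?4); private hides the match from the output
        $h2 = { 6A 40 68 00 D? ?4 } private

    condition:
        $h1 or $h2
}
```

Unbounded jumps such as [8-] and [-] are convenient but make a pattern much less specific, so in practice it is usually worth giving them an upper bound where the sample allows it.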
Finally, we can use regular expressions to specify a more complex string of data. This is by far the most powerful tool, but we must also remember that searching for data based on regular expressions will be the most resource-intensive. In order to present the syntax of popular "regexes" one would need a whole separate article, but let's look at the presented example to understand the principle in the context of YARA:
$s3 = /[0-9]?[a-z]{3}/i
We put the regular expression between the / characters. Here we also added the "i" modifier, which makes the expression case-insensitive. As for the content of the expression itself:
- [0-9]? - a digit between 0 and 9 that may or may not occur, which is what the "?" specifies.
- [a-z]{3} - 3 letters between "a" and "z".
For example, this expression matches the string "4fSa". Note the mixed case, which is allowed thanks to the "i" modifier added in $s3.
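For context, the regular expression above could sit in a rule like the sketch below. The rule name and the second pattern (a made-up file name) are assumptions added for illustration. A pattern as short and generic as $s3 would match almost any file, and YARA may warn that it slows down scanning, so real rules usually anchor a regex on some fixed text, as $re2 does.

```yara
rule Example_Regex
{
    strings:
        // Optional digit followed by three letters, case-insensitive
        $s3 = /[0-9]?[a-z]{3}/i
        // Hypothetical, more specific pattern: cfg_ plus 2 to 4 digits and .dat
        $re2 = /cfg_[0-9]{2,4}\.dat/ nocase

    condition:
        any of them
}
```

The nocase keyword after the closing slash has the same effect as the "i" modifier.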
Once we have defined the data of interest, let's deal with the conditions that cause a file to be considered a match for the rule. Let's move on to the third section, "condition".
Starting with the first condition, we use the PE module imported in the first line of the file. From this module we take the number_of_sections field, which gives the number of sections in the executable, and require it to be equal to three.
Then we use YARA's ability to read data at specific offsets: with uint16 at offset zero we look for the bytes 4D (M) and 5A (Z) at the beginning of the file. For many analysts these letters will certainly be familiar, but as a reminder, MZ at the beginning of a file marks an executable. Pay attention to the order of the bytes in the condition: the value is written as 0x5A4D, with Z first and then M. This is because uint16 reads the value as little endian, i.e. least significant byte first. We could read it as big endian instead by using uint16be, in which case the value would be written as 0x4D5A.
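The two ways of reading the header can be compared in a short sketch. The rule name is hypothetical, and both checks are shown together only for comparison; in a real rule one of them is enough, since they test the same two bytes.

```yara
rule Example_MZ_Header
{
    condition:
        // Little-endian read: bytes 4D 5A in the file are read as 0x5A4D
        uint16(0) == 0x5A4D and
        // Big-endian read of the same two bytes
        uint16be(0) == 0x4D5A
}
```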
Another condition is the size of the file. Using the filesize variable, we specify that we are looking for files larger than 500 KB.
Finally, we decide how to treat the strings defined in the strings section. There are many options here: we could refer to individual strings by the names we gave them or, as in this case, state that any of them will suffice. The opposite would be the phrase "all of them". If our naming convention allows it, we can also use "*" to refer to a group of strings. In our case, we could write "any of ($s*)", which would match any of the strings whose names begin with $s.
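Putting the conditions discussed above together, a complete rule could look roughly like this. It is a sketch rather than the exact rule from the original illustration: the rule name and the three strings are placeholders, and the comments list some alternative ways of referring to the strings.

```yara
import "pe"

rule Example_Full_Condition
{
    strings:
        // Placeholder strings standing in for real indicators
        $s1 = "example marker string"
        $s2 = { F4 23 ( 62 B4 | 56 ) 45 C4 }
        $s3 = /cfg_[0-9]{2,4}\.dat/ nocase

    condition:
        pe.number_of_sections == 3 and    // exactly three PE sections
        uint16(0) == 0x5A4D and           // file starts with "MZ"
        filesize > 500KB and              // larger than 500 KB
        any of them                       // at least one string matches
        // Alternatives for the last line:
        //   all of them            every string must match
        //   any of ($s*)           any string whose name starts with $s
        //   2 of ($s1, $s2, $s3)   at least two of the listed strings
        //   $s1 and ($s2 or $s3)   strings referenced by name
}
```

Placing the cheap checks such as the header and file size next to the string conditions also documents at a glance what kind of file the rule is meant for.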
Now let's see how we can use YARA to scan files in practice. There are many sources of ready-made rules, from companies in the security sector, through independent researchers, to government organizations. The latter group includes the German Federal Office for the Protection of the Constitution (BfV). In January this year it warned about the activity of the China-linked APT27 group against German companies. Along with the warning, it published YARA rules that make it possible to check whether the implant files are present in our environment. So we copy the rules and save them, for example, as APT27.yar.
In real conditions we would scan the file systems in our environment, but for this example we will simply download one of the samples, available on MalwareBazaar. Then we run the YARA tool, passing the rule file first and then the file or directory to be scanned.
The output shows that the DLL file was caught by the "vftrace_loader" rule contained in our .yar file.
So these are the basics of writing and using YARA rules to classify and find malware samples. Of course, the tool offers far more possibilities and functions, so I encourage you to read the documentation as well as the rules created by others and available on the Internet. A separate topic is analyzing files and finding fragments that can be included in rules to detect malware from a given family despite the differences between samples. We will certainly come back to the topic in future posts on counterintelligence.pl 🙂