Semgrep: A Practical Introduction

Static Application Security Testing (SAST) is a testing methodology that analyses application source code to identify security vulnerabilities such as injection flaws, use of insecure functions and cryptographic weaknesses. Typically, SAST combines manual and automated testing techniques, which complement each other.

In this blogpost, Rohit Salecha discusses an open source, multi-language tool called Semgrep. Semgrep is a fork of the Sgrep tool, which was originally created at Facebook for performing SAST scans.

Semgrep offers several advantages over other open source SAST tools:

Semgrep allows us to define custom rules for identifying vulnerabilities, which lets us run a contextual scan on our code. Additionally, Semgrep offers a public registry of such rules that can be reused.
Semgrep is fast, which makes it well suited to a DevOps pipeline.
It produces well-formatted and stable JSON output.
It is lightweight and has an easy-to-install binary. It can also be run using Docker.
Most importantly, Semgrep supports Python, JavaScript, Java, Go, C and JSON syntaxes!

To further understand Semgrep, we will write Semgrep-rules for Java and identify vulnerabilities in the WebGoat application. So let the hunt begin!

Running Semgrep

Semgrep can be executed through its binary in CLI/Docker or by using its live interface.

CLI

Running ‘semgrep -f /path/to/semgrep/rules.yml’ in the directory where the source code resides executes all the rules described in the ‘rules.yml’ file.

A sample YAML file which identifies object deserialization vulnerabilities in Java is given here: https://github.com/returntocorp/semgrep-rules/blob/develop/java/lang/security/audit/object-deserialization.yaml

YAML Rule File:

rules:
- id: object-deserialization
  metadata:
       cwe: 'CWE-502: Deserialization of Untrusted Data'
       owasp: 'A8: Insecure Deserialization'
       source-rule-url: https://find-sec-bugs.github.io/bugs.htm#OBJECT_DESERIALIZATION
       references:
       - https://www.owasp.org/index.php/Deserialization_of_untrusted_data
  message: |
       Found object deserialization using ObjectInputStream. Deserializing entire
       Java objects is dangerous because malicious actors can create Java object
       streams with unintended consequences. Ensure that the objects being deserialized
       are not user-controlled. If this must be done, consider using HMACs to sign
       the data stream to make sure it is not tampered with, or consider only
       transmitting object fields and populating a new object.
  patterns:
  - pattern: new ObjectInputStream(...);
  severity: WARNING
  languages:
  - java

The ‘id’ in the YAML file acts as a primary key for a rule, which allows multiple rules to be added to the same file. The most important parts of the YAML file are:

patterns – Defining various patterns of code that are to be scanned
languages – The language whose syntax we need to match.

When the above given ruleset file is run against WebGoat, the resultant output will be as shown below. Observe that Semgrep has identified a pattern for the vulnerable class invocation ‘ObjectInputStream’.

Semgrep Live

Semgrep has an online editor where rules on code snippets can be executed, something like regex101.com. You can view the deserialization rule in execution by visiting the link below:

https://semgrep.live/rohitnss:deserialization

Let us now look at how we can create different types of patterns and utilise different constructs of Semgrep to identify some common vulnerabilities.

SQL Injection

Consider the code snippet below, taken from the WebGoat application.

https://github.com/WebGoat/WebGoat/blob/ef6993c636799a274e08a0fc2fec3e68ad9d6967/webgoat-lessons/sql-injection/src/main/java/org/owasp/webgoat/sql_injection/introduction/SqlInjectionLesson5a.java#L60

Can you identify a ‘possible’ SQL Injection in here?

On line 60, a dynamic SQL query is built using the ‘accountName’ parameter, which indicates a ‘possible’ SQL Injection. It is only ‘possible’ because we don’t yet know whether ‘accountName’ is a user-controlled parameter.

Can you identify a pattern here?

Something like:

query = statement + statement/variable + statement;

Let me rewrite this:

$SQL = $X + $Y + $Z;

Now let’s run this pattern to identify concatenated strings using the YAML file below:

https://gist.github.com/rohitnss/e485f483e7f2ceef8c0ac9572f381a62

rules:
- id: sql.injection
  message: |
    SQL Injection.
  metadata:
    owasp: "A1: Injection"
  severity: ERROR
  patterns:
    - pattern: $SQL = $X + $Y + $Z;
  languages:
  - java

Let’s execute the above file on WebGoat using the command below:

semgrep -f ~/semgrep/sql_injection.yml webgoat-lessons/sql-injection/src/main/java/org/owasp/webgoat/sql_injection/introduction

As can be seen from the above output, Semgrep has identified all the statements that satisfy our target pattern, on lines 58, 59, 60, 61 and 133.

$SQL, $X, $Y, $Z are meta-variables which can take arbitrary names but need to be strictly in capital letters.

In Semgrep we can define Methods, Classes, Objects and Variables that we wish to match as meta-variables.
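
To make the binding of meta-variables concrete, the sketch below shows a small Java class with one assignment that the pattern ‘$SQL = $X + $Y + $Z;’ is expected to match and one that it is not. The class and method names are hypothetical and are not taken from WebGoat.

```java
// Hypothetical example class, not part of WebGoat.
class ConcatExample {

    // Builds a query by concatenating three operands.
    static String buildQuery(String accountName) {
        String query;
        // Expected to match $SQL = $X + $Y + $Z; with
        //   $SQL = query, $X = the first literal, $Y = accountName, $Z = "'"
        query = "SELECT * FROM user_data WHERE last_name = '" + accountName + "'";
        return query;
    }

    // A single literal involves no concatenation, so no match is expected.
    static String fixedQuery() {
        String query;
        query = "SELECT * FROM user_data";
        return query;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("Smith"));
        System.out.println(fixedQuery());
    }
}
```

Each meta-variable binds to whatever sits in its position, which is why the same pattern finds every three-part concatenation regardless of the variable names used.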

However, just finding a concatenated string does not really help in identifying an issue. If we were to run this rule over the whole of WebGoat, we would get plenty of such strings.

The next question that interested us was:

‘Is there any such concatenated string that is being passed to a SQL function like executeQuery()?’

The resultant YAML pattern for this would be something like below:

https://gist.github.com/rohitnss/a366f071c4e41475672b94df34a1ac2e

rules:
- id: sql.injection
  message: |
    SQL Injection.
  metadata:
    owasp: "A1: Injection"
  severity: ERROR
  patterns:
  - pattern: | #executeQuery
      $RETURN $METHOD(...,String $VAR, ...) {
        ...
        $SQL = $X + $VAR + $Y;
        ...
        $W.executeQuery($SQL, ...);
        ...
      }
  languages:
  - java

Let’s execute the above file on WebGoat using the command below:

semgrep -f ~/semgrep/sql_injection.yml webgoat-lessons/sql-injection/src/main/java/org/owasp/webgoat/sql_injection/introduction

$RETURN and $METHOD are the meta-variables for the function signature and $VAR is the meta-variable for the parameter.
The ellipsis ‘...’ in the parameter list stands for zero or more parameters. When it appears in the pattern body, it stands for any number of statements between the start of the function and the next pattern line.
Next, we search for a statement in which $VAR is concatenated with other strings into $SQL.
Lastly, we check whether the concatenated string $SQL is passed directly to the sink function, i.e. executeQuery.
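
As a rough illustration of the kind of method this pattern is aimed at, here is a minimal Java sketch. The class, method and table names are hypothetical, and the comments show how each line is expected to line up with the pattern. This is an assumption about the matching, not captured Semgrep output.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical lesson-style class; all names are illustrative only.
class QuerySinkExample {

    // $RETURN = ResultSet, $METHOD = findByName, $VAR = accountName
    static ResultSet findByName(Statement statement, String accountName) throws SQLException {
        String query;
        // $SQL = $X + $VAR + $Y;   (source: the String parameter)
        query = "SELECT * FROM user_data WHERE last_name = '" + accountName + "'";
        // $W.executeQuery($SQL, ...);   (sink: $W = statement, $SQL = query)
        ResultSet results = statement.executeQuery(query);
        return results;
    }
}
```

Because $VAR and $SQL must each bind to the same name everywhere they appear, the pattern only fires when the parameter actually flows into the string that reaches the sink. A value such as Smith' OR '1'='1 passed as accountName would change the meaning of the query.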

In the above example, Semgrep searches only for executeQuery() calls. However, if you browse the WebGoat code you’ll realise that there are multiple ways in which SQL statements can be executed, such as execute(), prepareStatement(), executeUpdate() and so on.

So how can we search for multiple such functions?

We can combine multiple searches using the ‘pattern-either’ directive, which is basically a logical OR for pattern matching, as shown in the YAML file below.

https://gist.github.com/rohitnss/6d3965059f75095e59f7719322a0fd7d

rules:
- id: sql.injection
  message: |
    SQL Injection.
  metadata:
    owasp: "A1: Injection"
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: | #executeQuery
        $RETURN $METHOD(...,String $VAR, ...) {
          ...
          $SQL = $X + $VAR + $Y;
          ...
          $W.executeQuery($SQL, ...);
          ...
        }
    - pattern: | #execute
        $RETURN $METHOD(...,String $VAR, ...) {
          ...
          $SQL = $X + $VAR + $Y;
          ...
          $W.execute($SQL, ...);
          ...
        }
    - pattern: | #prepareStatement
        $RETURN $METHOD(...,String $VAR, ...) {
          ...
          $SQL = $X + $VAR + $Y;
          ...
          $W.prepareStatement($SQL, ...);
          ...
        }
  languages:
  - java

Let’s execute the above file on WebGoat using the command below:

semgrep -f ~/semgrep/sql_injection.yml webgoat-lessons/sql-injection/src/main/java/org/owasp/webgoat/sql_injection/introduction

Insecure Cryptography

Let’s say we wish to identify usage of the SHA1 algorithm in our target code. Note that there are various ways in which SHA1 can be requested. For example:

Signature instance = Signature.getInstance("SHA1");
Signature instance = Signature.getInstance("SHA-1");

So how can we identify all of them? One way is to use the OCaml regex support provided by Semgrep, as shown below:

pattern: $METHOD.getInstance(...,"=~/SHA[-]?\([0-9]+\)/",...);
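
To see which algorithm names such a regex accepts, the sketch below re-expresses it in java.util.regex syntax. This assumes that the escaped parentheses in the rule are grouping operators, as in OCaml's Str syntax, so they become plain parentheses in Java. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the rule's regex in java.util.regex syntax.
class ShaRegexDemo {

    // "SHA", an optional hyphen, then one or more digits.
    // Assumption: \( and \) in the rule are a group, written ( and ) here.
    static final Pattern SHA = Pattern.compile("SHA[-]?([0-9]+)");

    static boolean looksLikeSha(String algorithm) {
        return SHA.matcher(algorithm).find();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"SHA1", "SHA-1", "SHA-256", "MD5"}) {
            System.out.println(name + " -> " + looksLikeSha(name));
        }
    }
}
```

Under this reading the regex accepts any digits after ‘SHA’, so stronger variants such as SHA-256 would be reported as well. If only SHA1 is of interest, the digit part would need to be tightened.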

Another interesting way to use regex is shown in the pattern below. What we are telling Semgrep here is to find a regex match for the string ‘MD5’ (and its variations) INSIDE the pattern ‘MessageDigest $MD = $W.getInstance(...);’

https://gist.github.com/rohitnss/1ccd3e8267a4eb80ae8a70fc02f60f59

rules:
- id: md5.crypto
  message: |
    Search for MD5
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern-regex: 'MD5'
    - pattern-regex: 'Md5'
    - pattern-regex: 'md5'
  - pattern-inside: MessageDigest $MD = $W.getInstance(...);
  languages:
  - java

Output of running the above YAML file on WebGoat:

semgrep -f ~/semgrep/crypto1.yml
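
For reference, the kind of Java code the ‘md5.crypto’ rule is aimed at could look like the sketch below. The class and method names are hypothetical, and the comments describe the matching that is expected rather than recorded Semgrep output.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical target code for the md5.crypto rule.
class Md5Example {

    static byte[] weakHash(byte[] data) throws NoSuchAlgorithmException {
        // The declaration has the shape required by pattern-inside, and the
        // text "MD5" within it satisfies one of the pattern-regex alternatives.
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(data);
    }

    static byte[] strongerHash(byte[] data) throws NoSuchAlgorithmException {
        // Same statement shape, but none of the MD5 regexes match this text.
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(data);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(weakHash("hello".getBytes("UTF-8")).length);
        System.out.println(strongerHash("hello".getBytes("UTF-8")).length);
    }
}
```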

Hard-Coded Credentials/Tokens

How can we identify hard-coded credentials/keys/tokens?

https://gist.github.com/rohitnss/0d086b4aadd8d1e2175d4e06e1d5174e

rules:
- id: crypto
  message: |
    Insecure Cryptographic Functions being Used. Please investigate further
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: | #Identify any parameter value being equated to Strings using
        $RETURN $METHOD(...,String $VAR, ...) {
          ...
          if(<... $VAR.equals("...") ...>)
          ...
        }
    - pattern: String secret = "..." ;
    - pattern: String key = "..." ;
    - pattern: String token = "..." ;
  languages:
  - java

Here we are making use of the ‘deep expression operator’ (<... $VAR.equals("...") ...>), which matches an expression even when it is nested inside a larger expression.

This is extremely useful when the condition contains many sub-expressions but we need to focus on only one of them.
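
A minimal Java sketch of the situation the deep expression operator is meant for is shown below. The class, method and the hard-coded value are hypothetical.

```java
// Hypothetical login check, illustrative only.
class HardCodedCheck {

    static boolean isAdmin(String user, String password) {
        // The equals() call against a string literal is buried inside a larger
        // boolean expression. With $VAR bound to password, the deep expression
        // operator is intended to find it anyway.
        if (user != null && (user.length() > 3 && password.equals("s3cr3t"))) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAdmin("alice", "s3cr3t"));
        System.out.println(isAdmin("alice", "guess"));
    }
}
```

Without the operator, the condition in the pattern would have to mirror the whole expression of the ‘if’ statement, which is rarely practical.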

Executing the above file we get a few vulnerable code snippets as shown below:

semgrep -f ~/semgrep/crypto.yml

SSRF

To identify SSRF vulnerabilities in Java, we need to search for instantiations of the URL class with a parameter. The pattern below finds those for us. However, there is another directive that I’ve added: ‘pattern-not’. The rule reads as follows: find places where a parameter is passed into the URL class, but ignore cases where a hard-coded string is passed, since that value is not user-controlled. This helps in ironing out false positives. Note the position of ‘pattern-not’: it has to be at the same indentation level as ‘pattern-either’. So if there are any patterns you would like to exclude, add them using ‘pattern-not’.

https://gist.github.com/rohitnss/c1293b78c50325f0cfa1cfe151bbcaa1

rules:
- id: SSRF
  message: |
    Generic SSRF Java
  metadata:
    cwe: "CWE-X"
    owasp: "A1: Injection"
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: | #execute Directly
        $RETURN $METHOD(...,String $VAR, ...) {
          ...
          URL $URL = new URL($VAR);
          ...
        }
    - pattern: $URL = new URL($VAR);
  - pattern-not: $URL = new URL("...");
  languages:
  - java

Executing the above YAML file on WebGoat gives the below result:

semgrep -f ~/semgrep/ssrf.yml
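
The two cases the rule distinguishes can be sketched in Java as follows. The class, method names and URLs are hypothetical, and the comments state the intended outcome of the rule rather than recorded output.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical example, not taken from WebGoat.
class UrlExample {

    // Expected to be reported: the URL is built from a String parameter,
    // which is the shape described by the first pattern-either alternative.
    static URL fromParameter(String target) throws MalformedURLException {
        URL url = new URL(target);
        return url;
    }

    // Intended to be ignored: the argument is a string literal, which is the
    // case that pattern-not is there to exclude.
    static URL fixed() throws MalformedURLException {
        URL url;
        url = new URL("https://example.com/status");
        return url;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fromParameter("http://127.0.0.1:8080/admin").getHost());
        System.out.println(fixed().getHost());
    }
}
```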

XXE

One BIG advantage of Semgrep over other scanners is that we can write rules that check for enforcement of security best practices.

For example, while instantiating ‘DocumentBuilderFactory’ it is necessary to turn off certain features by explicitly calling the ‘setFeature’ function. So how can we create a pattern where ‘DocumentBuilderFactory’ is being instantiated AND the ‘setFeature’ function is NOT being called?

By combining Semgrep patterns in a logical AND format.

The rule and the sample code can be executed here: https://semgrep.live/rohitnss:xxe

rules:
- id: XXE
  message: |
    Generic XXE
  metadata:
    owasp: "A4: XXE"
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: DocumentBuilderFactory $DBF = $W.newInstance();
  - pattern-not-inside: |
      $RETURNTYPE $METHOD(...) {
        ...
        $DBF.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        ...
      }
  languages:
  - java

So what’s really happening here is that we first ask Semgrep to find every instance of the pattern ‘DocumentBuilderFactory $DBF = $W.newInstance();’.

Next, using the ‘pattern-not-inside’ directive, we ask Semgrep to discard any match that sits inside a function where setFeature() is called on ‘$DBF’ (the meta-variable bound to the DocumentBuilderFactory instance). What remains are the functions where the factory is declared AND setFeature() is not called.

As shown in the highlighted image, Semgrep skips the first function where the ‘setFeature’ function is being called and picks up the second function where the ‘setFeature’ function is not being called.
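
For readers without access to the image, the two functions can be sketched roughly as below. This is a hypothetical reconstruction and not the exact sample from the semgrep.live link. The method names are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical example mirroring the two cases described above.
class XxeExample {

    // Expected to be skipped: setFeature is called on the same factory
    // inside the same method, so pattern-not-inside filters the match out.
    static DocumentBuilderFactory hardened() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return dbf;
    }

    // Expected to be reported: the factory is created but DOCTYPE
    // declarations are left enabled.
    static DocumentBuilderFactory defaults() {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        return dbf;
    }

    public static void main(String[] args) throws Exception {
        String feature = "http://apache.org/xml/features/disallow-doctype-decl";
        System.out.println(hardened().getFeature(feature));
        System.out.println(defaults().getFeature(feature));
    }
}
```

A parser built from hardened() rejects any document that contains a DOCTYPE declaration, which is what blocks external entity definitions in the first place.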

Active Community

Semgrep has a very active community and you can ask your questions/doubts on their official Slack channel http://r2c.dev/slack

There is a dedicated repo for community rules here https://github.com/returntocorp/semgrep-rules where you can contribute your own set of rules to the community. We added semgrep rules to detect SSRF vulnerabilities in Java https://github.com/returntocorp/semgrep-rules/blob/develop/contrib/owasp/java/ssrf.yaml

Conclusion

Semgrep leverages human intelligence to identify vulnerable code and does not rely on mere regular expressions. The power of Semgrep is limited only by your creativity.