CodeQL & Golang regexp

To write CodeQL queries for Go projects, a solid grasp of regular expressions is very important.

CodeQL CWE-020 Regex Example

IncompleteHostnameRegexp.go:

package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirect(req *http.Request, via []*http.Request) error {
	// BAD: the host of `req.URL` may be controlled by an attacker
	re := "^((www|beta).)?example.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}

IncompleteHostnameRegexpGood.go:

package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirectGood(req *http.Request, via []*http.Request) error {
	// GOOD: the host of `req.URL` must be `example.com`, `www.example.com` or `beta.example.com`
	re := "^((www|beta)\\.)?example\\.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}

Compare:

^((www|beta).)?example.com/
^((www|beta)\.)?example\.com/

CodeQL Snippet:

/**
 * Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
 * and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
 */
bindingset[pattern]
predicate isIncompleteHostNameRegexpPattern(string pattern, string hostPart) {
  hostPart =
    pattern
        .regexpCapture("(?i).*?" +
            // an unescaped single `.`
            "(?<!\\\\)[.]" +
            // immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in,
            // followed by a known TLD
            "(([():|?a-z0-9-]+(\\\\)?[.])?" + commonTLD() + ")" + ".*", 1)
}

🐰: ??????

Regular Expression Website

  • regex101.com is an amazing website for testing and learning regular expressions in the PHP, JavaScript, Python and Golang flavors.

^((www|beta).)?example.com/

BAD: the host of req.URL may be controlled by an attacker

^((www|beta).)?example.com/

  • ^ asserts position at the start of the text (or of a line in multi-line mode)
  • example.com/ looks like an exact match, but its unescaped . also matches any character.

((www|beta).)?

  • 1st Capturing Group ((www|beta).)?

    • ? Quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)

  • 2nd Capturing Group (www|beta)

    • 1st Alternative www: matches the characters www literally (case sensitive)
    • 2nd Alternative beta: matches the characters beta literally (case sensitive)

  • . matches any character (except for line terminators)
  • example matches the characters example literally (case sensitive)

A careless developer may assume that . in a regex matches only a literal dot, but it actually matches any character. Worse, the mistake survives testing: the intended hosts still pass, because . also matches a literal dot.

If an attacker registers a domain such as wwwaexample.com, it will bypass the regex check.

^((www|beta)\.)?example\.com/

GOOD: the host of req.URL must be example.com, www.example.com or beta.example.com

2020_12_17_3

As shown in the image, the invalid domains do not match the regex any more.
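
To make the difference concrete, here is a minimal sketch that runs both patterns against a few hosts. The helper name `allowed` and the sample hosts are made up for illustration, and each host carries a trailing `/` only because both patterns end in `/`.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// BAD: both dots are unescaped, so each one matches any character.
	badRe = regexp.MustCompile(`^((www|beta).)?example.com/`)
	// GOOD: both dots are escaped, so each one matches only a literal dot.
	goodRe = regexp.MustCompile(`^((www|beta)\.)?example\.com/`)
)

// allowed is a hypothetical helper that reports whether re accepts host.
func allowed(re *regexp.Regexp, host string) bool {
	return re.MatchString(host)
}

func main() {
	hosts := []string{
		"example.com/",      // intended host
		"www.example.com/",  // intended host
		"wwwaexample.com/",  // look-alike: first dot replaced by "a"
		"beta.exampleXcom/", // look-alike: second dot replaced by "X"
	}
	for _, h := range hosts {
		fmt.Printf("%-20s bad=%-5v good=%v\n", h, allowed(badRe, h), allowed(goodRe, h))
	}
}
```

Both look-alike hosts are accepted by the bad pattern and rejected by the good one, while the intended hosts pass in both cases.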

Basic Symbols

  1. \ is the escape character
  • \n new line
  • \\ is a literal \
  2. ^ is the Start of String

  3. $ is the End of String

  4. *
  • {0,}: zero or more times
  • ha* will match h, ha, and haa <- h(a*)
  5. +
  • {1,}: one or more times
  • ha+ will match ha, haa and haaa.
  6. ?
  • {0,1}: zero or one time
  • fabbit(s)? will be fabbit or fabbits.
  • if ? follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), that quantifier becomes Ungreedy: o+? will match a single o in the string oooo

Greedy: 2020_12_17_4

Ungreedy: 2020_12_17_5

Thanks lg for providing this example.
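
The greedy and ungreedy behaviour can also be checked directly in Go. This is a small sketch; `firstMatch` is a hypothetical helper, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch is a hypothetical helper that returns the leftmost match of
// pattern in s, or "" if there is none.
func firstMatch(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	// Greedy: take as many o's as possible.
	fmt.Println(firstMatch(`o+`, "oooo")) // oooo
	// Ungreedy: a trailing ? makes + take as few o's as possible.
	fmt.Println(firstMatch(`o+?`, "oooo")) // o

	// The same applies to *: greedy ha* versus ungreedy ha*?.
	fmt.Println(firstMatch(`ha*`, "haaa"))  // haaa
	fmt.Println(firstMatch(`ha*?`, "haaa")) // h
}
```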

  7. {}
  • number of matching times
  • c{2} will be cc
  • c{2,} will be cc, ccc, cccc, and so on.
  • c{2,3} will only match cc and ccc.
  • c{n,m}: n <= m.
  8. .
  • matches any single character except \n. To include \n as well, use (.|\n) or the s flag
  • \s any whitespace char, \S non-whitespace char
  • \d any digit, \D any non-digit
  • \w any word char, \W any non-word char
  9. |
  • a|b matches a or b
  • (f|r)abbit will be fabbit or rabbit
  10. []
  • [abc] matches either an a, b or c character
  • [abc]+ matches, for example, a, bb, or ccc
  • [^abc]+ matches one or more characters other than a, b or c
  • [a-zA-Z] matches any character between a-z or A-Z
  11. Group Constructs
  • () captures everything enclosed.

    • (he)+ matches he, and it matches hehe inside heheh and he inside heh
  • (?P<name>re): named & numbered capturing group (submatch)

  • (?:re): non-capturing group

  • (?flags): set flags within current group; non-capturing

  • (?flags:re): set flags during re; non-capturing

Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

  • i: case-insensitive (default false)
  • m: multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
  • s: let . match \n (default false)
  • U: ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)
  12. Flag/Modifiers
  • g: Global (a regex101 option, not a Go flag; in Go, use the FindAll methods instead)

  • m: Multiline

  • i: Case Insensitive

  • s: Single Line

  • U: Ungreedy

  • More on Go: package regexp/syntax

GoLang Regular Expression Examples


// multiple character match
\w\w\w  // matches 3 word characters (letters, digits or _)
\w+     // matches one or more consecutive word characters
\w{3}   // matches exactly 3 consecutive word characters
\w{2,3} // 2 to 3
\w*     // 0 or more

[abc]   // matches one of a, b or c
r[ua]n  // "run" or "ran"

// digits
\d      // single digit
\d\d    // 2 digits
\d+     // one or more digits, as many as possible
\d{4}   // exactly 4 digits
\d{3,4} // 3 to 4 digits
\d*     // 0 or more

// general tokens
\t      // matches tab
\r      // carriage return
\0      // null character

// groups
(\d\w\d) // matches 1a1, 2e3 etc

Golang Examples

Check if match

package main
 
import (
    "fmt"
    "regexp"
)
 
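func main() {
    // regexp.Match reports whether the byte slice contains any match.
    fmt.Println(regexp.Match(`\d`, []byte("1 12 23"))) // true <nil>

    // FindString returns the leftmost match, or "" if there is none.
    re := regexp.MustCompile("et$")
    fmt.Println(re.FindString("cricket")) // et
    fmt.Println(re.FindString("hacked"))  // (empty line)

    // FindStringIndex returns the [start end] byte offsets of the match.
    // re is already declared, so assign with = instead of :=.
    re = regexp.MustCompile("tel")
    fmt.Println(re.FindStringIndex("telephone")) // [0 3]

    // FindStringSubmatch returns the match followed by its capture groups.
    re = regexp.MustCompile("p([a-z]+)ch")
    fmt.Println(re.FindStringSubmatch("peach punch")) // [peach ea]
    fmt.Println(re.FindStringSubmatch("cricket"))     // []
}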

Scraping HTML

package main

import (
    "fmt"
    "html"
    "io"
    "net/http"
    "regexp"
)

func main() {
    resp, err := http.Get("https://f4bb1t.com")
    if err != nil {
        fmt.Println("http get error.")
        return // resp is nil on error, so resp.Body must not be touched
    }
    defer resp.Body.Close()
    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("http read error.")
        return
    }

    src := string(body)

    // Match each <li>...</li> element; . does not cross newlines by default.
    r := regexp.MustCompile(`<li>.*</li>`)
    // Match any HTML tag so it can be stripped from the title.
    rHTML := regexp.MustCompile(`<[^>]*>`)
    titles := r.FindAllString(src, -1)

    for _, title := range titles {
        cleanTitle := rHTML.ReplaceAllString(title, "")
        fmt.Println(html.UnescapeString(cleanTitle))
    }
}
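
The same extraction can be tried without a network request. The sketch below runs on a made-up HTML string and uses the ungreedy `.*?`, assuming several `<li>` elements may share one line; with the greedy `.*` they would be returned as a single match. The names `listItems`, `liRe` and `tagRe` are illustrative only.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

var (
	// Ungreedy: stop at the first </li> instead of the last one on the line.
	liRe = regexp.MustCompile(`<li>.*?</li>`)
	// Any HTML tag, so it can be stripped from the item text.
	tagRe = regexp.MustCompile(`<[^>]*>`)
)

// listItems is a hypothetical helper that returns the text of each
// <li>...</li> element in src, with tags removed and entities unescaped.
func listItems(src string) []string {
	var items []string
	for _, li := range liRe.FindAllString(src, -1) {
		text := tagRe.ReplaceAllString(li, "")
		items = append(items, html.UnescapeString(text))
	}
	return items
}

func main() {
	// Made-up sample input with two list items on the same line.
	src := `<ul><li><a href="/a">CodeQL &amp; Go</a></li><li>Regexp</li></ul>`
	fmt.Printf("%q\n", listItems(src)) // ["CodeQL & Go" "Regexp"]
}
```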

Review on CWE-020 Regex Example

CodeQL Snippet:

/**
 * Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
 * and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
 */
bindingset[pattern]
predicate isIncompleteHostNameRegexpPattern(string pattern, string hostPart) {
  hostPart =
    pattern
        .regexpCapture("(?i).*?" +
            // an unescaped single `.`
            "(?<!\\\\)[.]" +
            // immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in,
            // followed by a known TLD
            "(([():|?a-z0-9-]+(\\\\)?[.])?" + commonTLD() + ")" + ".*", 1)
}
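
To see what this predicate flags, the same idea can be approximated in Go. This is only a sketch under a few assumptions: Go's regexp has no lookbehind, so `(?<!\\)` is replaced by "start of text or a non-backslash character", and the TLD list is a short stand-in for `commonTLD()`. The names `incompleteHostRe` and `incompleteHostPart` are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// incompleteHostRe approximates the pattern used by the CodeQL predicate:
// an unescaped `.` followed by an optional subdomain-like part and a TLD.
// Assumption: com|org|net stands in for the real commonTLD() list.
var incompleteHostRe = regexp.MustCompile(
	`(?i)(?:^|[^\\])[.](([():|?a-z0-9-]+(\\)?[.])?(?:com|org|net))`)

// incompleteHostPart is a hypothetical helper that returns the host part
// following an unescaped dot in pattern, and whether one was found.
func incompleteHostPart(pattern string) (string, bool) {
	m := incompleteHostRe.FindStringSubmatch(pattern)
	if m == nil {
		return "", false
	}
	return m[1], true // group 1 plays the role of hostPart
}

func main() {
	patterns := []string{
		`^((www|beta).)?example.com/`,   // BAD: unescaped dots
		`^((www|beta)\.)?example\.com/`, // GOOD: escaped dots
	}
	for _, p := range patterns {
		part, flagged := incompleteHostPart(p)
		fmt.Printf("%-32s flagged=%-5v hostPart=%q\n", p, flagged, part)
	}
}
```

For the bad pattern the match starts at the first unescaped `.`, and the captured host part is `)?example.com`. The good pattern is not flagged, because every dot in it is preceded by a backslash.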

Other resources

Reference