CodeQL & Golang regexp

2020-12-17

security

CodeQL Golang

To write CodeQL queries for Go projects, a solid grasp of regular expressions is very important.

CodeQL CWE-020 Regex Example

IncompleteHostnameRegexp.go:

package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirect(req *http.Request, via []*http.Request) error {
	// BAD: the host of `req.URL` may be controlled by an attacker
	re := "^((www|beta).)?example.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}

IncompleteHostnameRegexpGood.go:

package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirectGood(req *http.Request, via []*http.Request) error {
	// GOOD: the host of `req.URL` must be `example.com`, `www.example.com` or `beta.example.com`
	re := "^((www|beta)\\.)?example\\.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}

Compare:

^((www|beta).)?example.com/
^((www|beta)\.)?example\.com/

CodeQL Snippet:

/**
 * Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
 * and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
 */
bindingset[pattern]
predicate isIncompleteHostNameRegexpPattern(string pattern, string hostPart) {
  hostPart =
    pattern
        .regexpCapture("(?i).*?" +
            // an unescaped single `.`
            "(?<!\\\\)[.]" +
            // immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in,
            // followed by a known TLD
            "(([():|?a-z0-9-]+(\\\\)?[.])?" + commonTLD() + ")" + ".*", 1)
}

🐰: ??????
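The predicate above looks for an unescaped `.` that sits right in front of something shaped like a hostname ending in a well-known TLD. A rough Go approximation may make that easier to see. It is only a sketch: Go's RE2 engine has no lookbehind, so `(?<!\\)[.]` is rewritten as "start of string or any non-backslash character, then a dot", and the TLD list is a hypothetical stand-in for `commonTLD()`, whose real definition is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// incompleteHost approximates the CodeQL pattern. The TLD alternation is an
// assumption made for this sketch, not the real commonTLD() definition.
var incompleteHost = regexp.MustCompile(
	`(?i)(?:^|[^\\])[.]((?:[():|?a-z0-9-]+\\?[.])?(?:com|org|net|io))`)

// findIncompleteHostPart returns the suspicious host fragment found in
// pattern, or "" when no unescaped dot precedes a host-like fragment.
func findIncompleteHostPart(pattern string) string {
	m := incompleteHost.FindStringSubmatch(pattern)
	if m == nil {
		return ""
	}
	return m[1] // capture group 1 plays the role of hostPart
}

func main() {
	fmt.Printf("%q\n", findIncompleteHostPart(`^((www|beta).)?example.com/`))
	fmt.Printf("%q\n", findIncompleteHostPart(`^((www|beta)\.)?example\.com/`))
}
```

For the BAD pattern the sketch reports the fragment `)?example.com`, because the `.` after `(www|beta)` is unescaped. For the GOOD pattern every dot is preceded by a backslash, so nothing is reported.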

Regular Expression Website

regex101.com is an amazing website for testing and learning regular expressions in the PHP, JavaScript, Python and Golang flavors.

2020_12_17_1

^((www|beta).)?example.com/

BAD: the host of req.URL may be controlled by an attacker

^((www|beta).)?example.com/

^ asserts position at start of a line
Trivially, example.com/ itself matches.

((www|beta).)?

1st Capturing Group ((www|beta).)?
- ? Quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (www|beta)
- 1st Alternative www: www matches the characters www literally (case sensitive)
- 2nd Alternative beta: beta matches the characters beta literally (case sensitive)

. matches any character (except for line terminators)
example matches the characters example literally (case sensitive)

A careless developer will assume that . in a regex matches only a literal dot; in fact it matches any character, haha. Worse, the mistake survives testing: because . also matches a literal dot, every legitimate host still passes the check.

2020_12_17_2

If an attacker registers a domain such as wwwaexample.com, it bypasses the regex check.

^((www|beta)\.)?example\.com/

GOOD: the host of req.URL must be example.com, www.example.com or beta.example.com

2020_12_17_3

As shown in the image, the invalid domains do not match the regex any more.
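The same comparison can be reproduced in Go. The snippet below is a minimal sketch: the input strings are hypothetical test hosts (with a trailing `/` because both patterns end in `/`), and `safeHost` shows one possible way to avoid hand-escaping by letting `regexp.QuoteMeta` escape the literal domain.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	badHost  = regexp.MustCompile(`^((www|beta).)?example.com/`)
	goodHost = regexp.MustCompile(`^((www|beta)\.)?example\.com/`)
	// QuoteMeta escapes every regex metacharacter in the literal domain.
	safeHost = regexp.MustCompile(`^((www|beta)\.)?` + regexp.QuoteMeta("example.com") + `/`)
)

func main() {
	// Hypothetical inputs chosen for illustration.
	inputs := []string{
		"example.com/",
		"www.example.com/",
		"beta.example.com/",
		"wwwaexample.com/", // look-alike domain an attacker could register
		"examplexcom/",
	}
	for _, in := range inputs {
		fmt.Printf("%-20s bad=%-5v good=%-5v safe=%v\n",
			in, badHost.MatchString(in), goodHost.MatchString(in), safeHost.MatchString(in))
	}
}
```

The BAD pattern accepts all five inputs, while the GOOD pattern and the `QuoteMeta` variant accept only the first three.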

Basic Symbols

\ is the escape character

\n new line
\\ is a literal \

^ is the Start of String
$ is the End of String
*

{0,}
ha* matches h, ha, haa and so on <- h(a*)

+

{1,}
ha+ matches ha, haa, haaa and so on.

?

{0,1}
fabbit(s)? matches fabbit or fabbits.
If ? follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), that quantifier becomes ungreedy: o+? matches a single o in the string oooo.

Greedy: 2020_12_17_4

Ungreedy: 2020_12_17_5

Thanks lg for providing this example.
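The greedy versus ungreedy difference is easy to check in Go. This is a small sketch using the `o+` and `o+?` patterns from the text. The variable names are only illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedy = regexp.MustCompile(`o+`)  // takes as many o's as possible
	lazy   = regexp.MustCompile(`o+?`) // takes as few o's as possible
)

func main() {
	fmt.Printf("%q\n", greedy.FindString("oooo"))        // "oooo"
	fmt.Printf("%q\n", lazy.FindString("oooo"))          // "o"
	fmt.Printf("%q\n", lazy.FindAllString("oooo", -1))   // ["o" "o" "o" "o"]
	fmt.Printf("%q\n", greedy.FindAllString("oooo", -1)) // ["oooo"]
}
```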

{}

number of matching times
c{2} will be cc
c{2,} will be cc, ccc, cccc, and so on.
c{2,3} will only match cc and ccc.
c{n,m}: n <= m.
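The quantifiers above can be tried out with a small helper. This is a sketch, and `fullMatch` is a hypothetical helper (not part of the regexp package) that wraps the pattern in `^(?:...)$` so the whole input has to match rather than just a substring.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullMatch reports whether pattern matches all of s, not just a part of it.
func fullMatch(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)$`).MatchString(s)
}

func main() {
	cases := []struct{ pattern, input string }{
		{`ha*`, "h"}, {`ha*`, "haa"},
		{`ha+`, "h"}, {`ha+`, "haaa"},
		{`fabbit(s)?`, "fabbits"},
		{`c{2}`, "ccc"},
		{`c{2,}`, "cccc"},
		{`c{2,3}`, "cccc"},
	}
	for _, c := range cases {
		fmt.Printf("%-12s %-10q %v\n", c.pattern, c.input, fullMatch(c.pattern, c.input))
	}
}
```

Without the anchors, `c{2,3}` would still report a match on cccc, because MatchString only needs some substring to match.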

.

matches any single character except \n. To match \n as well, use (.|\n)
\s any whitespace char, \S non-whitespace char
\d any digit, \D any non-digit
\w any word char, \W any non-word char
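A short Go sketch of these classes follows. The sample strings are made up for illustration. Note that in Go `\w` is ASCII only: letters, digits and underscore.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anyChar   = regexp.MustCompile(`^a.b$`)      // . does not match \n
	anyOrNL   = regexp.MustCompile(`^a(.|\n)b$`) // the (.|\n) workaround
	digits    = regexp.MustCompile(`\d+`)
	nonSpace  = regexp.MustCompile(`\S+`)
	wordChars = regexp.MustCompile(`\w+`)
)

func main() {
	fmt.Println(anyChar.MatchString("a-b"))  // true
	fmt.Println(anyChar.MatchString("a\nb")) // false
	fmt.Println(anyOrNL.MatchString("a\nb")) // true

	fmt.Printf("%q\n", digits.FindAllString("tel: 555 0199", -1))      // ["555" "0199"]
	fmt.Printf("%q\n", nonSpace.FindAllString("go  is\tfun", -1))      // ["go" "is" "fun"]
	fmt.Printf("%q\n", wordChars.FindAllString("snake_case, x9!", -1)) // ["snake_case" "x9"]
}
```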

|

a or b
(f|r)abbit matches fabbit or rabbit

[]

[abc] matches either an a, b or c character
[abc]+ matches one or more of those characters, e.g. a, bb, or cab
[^abc]+ matches one or more characters other than a, b or c
[a-zA-Z] matches any single character in the ranges a-z or A-Z
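Alternation and bracket expressions can be checked the same way. This is a sketch with made-up inputs. The variable names are only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	alt     = regexp.MustCompile(`^(f|r)abbit$`)
	class   = regexp.MustCompile(`^[abc]+$`)
	negated = regexp.MustCompile(`[^abc]+`)
	letters = regexp.MustCompile(`[a-zA-Z]+`)
)

func main() {
	fmt.Println(alt.MatchString("fabbit"), alt.MatchString("rabbit"), alt.MatchString("babbit"))
	// true true false

	fmt.Println(class.MatchString("abcabc"), class.MatchString("abd"))
	// true false

	fmt.Printf("%q\n", negated.FindAllString("abcxyzabc12", -1)) // ["xyz" "12"]
	fmt.Printf("%q\n", letters.FindAllString("Go 1.16 rocks", -1)) // ["Go" "rocks"]
}
```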

Group Constructs

() captures everything enclosed.
- (he)+ matches he and hehe; inside heheh it matches hehe, and inside heh it matches he
(?P<name>re): named & numbered capturing group (submatch)
(?:re): non-capturing group
(?flags): set flags within current group; non-capturing
(?flags:re): set flags during re; non-capturing

Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

i: case-insensitive (default false)
m: multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s: let . match \n (default false)
U: ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)

Flag/Modifiers

g: Global
m: Multiline
i: Case Insensitive
s: Single Line
U: Ungreedy

More on Go - Package syntax
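One difference worth noting: the Go flag list earlier (i, m, s, U) has no g. In Go, "global" matching is a property of the method you call rather than of the pattern. A minimal sketch, reusing the `\d` idea with a made-up input:

```go
package main

import (
	"fmt"
	"regexp"
)

var digits = regexp.MustCompile(`\d+`)

func main() {
	s := "1 12 23"
	fmt.Printf("%q\n", digits.FindString(s))        // "1": first match only
	fmt.Printf("%q\n", digits.FindAllString(s, -1)) // ["1" "12" "23"]: every match, like g
	fmt.Printf("%q\n", digits.FindAllString(s, 2))  // ["1" "12"]: at most two matches
	fmt.Println(digits.ReplaceAllString(s, "#"))    // # # #
}
```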

GoLang Regular Expression Examples


// multiple word characters (\w is a letter, digit or underscore)
\w\w\w  // matches 3 word characters
\w+     // matches 1 or more consecutive word characters
\w{3}   // matches exactly 3 consecutive word characters
\w{2,3} // 2 to 3
\w*     // 0 or more

[abc]   // matches one of a, b or c
r[ua]n  // "run" or "ran"

// digits
\d      // single digit
\d\d    // 2 digits
\d+     // 1 or more digits, as many as possible
\d{4}   // exactly 4 digits
\d{3,4} // 3 to 4 digits
\d*     // 0 or more
 
// general tokens
\t      // matches tab
\r      // Carriage return
\0      // null character
 
// groups
(\d\w\d) // matches 1a1, 2e3 etc

Golang Examples

Check if match

package main
 
import (
    "fmt"
    "regexp"
)
 
func main() {
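    // regexp.Match reports whether the byte slice contains a match.
    fmt.Println(regexp.Match(`\d`, []byte("1 12 23"))) // true <nil>

    // FindString method: leftmost match, or "" if there is none
    re := regexp.MustCompile("et$")
    fmt.Println(re.FindString("cricket")) // et
    fmt.Println(re.FindString("hacked"))  // prints an empty line

    // FindStringIndex method: start and end offsets of the match
    re = regexp.MustCompile("tel") // plain = here: re is already declared
    fmt.Println(re.FindStringIndex("telephone")) // [0 3]

    // FindStringSubmatch method: whole match followed by the capture groups
    re = regexp.MustCompile("p([a-z]+)ch")
    fmt.Println(re.FindStringSubmatch("peach punch")) // [peach ea]
    fmt.Println(re.FindStringSubmatch("cricket"))     // []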
}

2020_12_17_6

Scraping HTML

package main

import (
    "fmt"
    "html"
    "io"
    "net/http"
    "regexp"
)

func main() {
    resp, err := http.Get("https://f4bb1t.com")
    if err != nil {
        fmt.Println("http get error:", err)
        return // resp is nil on error, so stop before touching resp.Body
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body) // io.ReadAll replaces ioutil.ReadAll since Go 1.16
    if err != nil {
        fmt.Println("http read error:", err)
        return
    }

    src := string(body)

    // Each <li>...</li> element; .*? is ungreedy so two items on one line stay separate.
    r := regexp.MustCompile(`<li>.*?</li>`)
    // Any HTML tag, used to strip the markup from each matched item.
    rHTML := regexp.MustCompile(`<[^>]*>`)
    titles := r.FindAllString(src, -1)

    for _, title := range titles {
        cleanTitle := rHTML.ReplaceAllString(title, "")
        fmt.Println(html.UnescapeString(cleanTitle))
    }
}

2020_12_17_7

Review on CWE-020 Regex Example