CodeQL & Golang regexp
Writing CodeQL queries for Go projects requires a solid grasp of regular expressions.
CodeQL CWE-020 Regex Example
package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirect(req *http.Request, via []*http.Request) error {
	// BAD: the host of `req.URL` may be controlled by an attacker
	re := "^((www|beta).)?example.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}
IncompleteHostnameRegexpGood.go:
package main

import (
	"errors"
	"net/http"
	"regexp"
)

func checkRedirectGood(req *http.Request, via []*http.Request) error {
	// GOOD: the host of `req.URL` must be `example.com`, `www.example.com` or `beta.example.com`
	re := "^((www|beta)\\.)?example\\.com/"
	if matched, _ := regexp.MatchString(re, req.URL.Host); matched {
		return nil
	}
	return errors.New("Invalid redirect")
}
Compare:
^((www|beta).)?example.com/
^((www|beta)\.)?example\.com/
CodeQL Snippet:
/**
 * Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
 * and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
 */
bindingset[pattern]
predicate isIncompleteHostNameRegexpPattern(string pattern, string hostPart) {
  hostPart =
    pattern
        .regexpCapture("(?i).*?" +
            // an unescaped single `.`
            "(?<!\\\\)[.]" +
            // immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in,
            // followed by a known TLD
            "(([():|?a-z0-9-]+(\\\\)?[.])?" + commonTLD() + ")" + ".*", 1)
}
🐰: ??????
Regular Expression Website
- regex101.com is a great website for testing and learning regular expressions, with flavors including PHP, JavaScript, Python and Golang.
^((www|beta).)?example.com/
BAD: the host of req.URL may be controlled by an attacker.
Breaking the pattern down:
- ^ asserts the position at the start of a line.
- ((www|beta).)? is the 1st capturing group. The ? quantifier matches it zero or one time, as many times as possible, giving back as needed (greedy).
  - (www|beta) is the 2nd capturing group. The 1st alternative, www, matches the characters www literally (case sensitive). The 2nd alternative, beta, matches the characters beta literally (case sensitive).
  - . matches any character (except for line terminators).
- example matches the characters example literally (case sensitive), so it is trivial that example.com/ itself is matched.
A careless developer may think that . in a regex matches only a literal ., but it matches any character. The mistake also survives testing, because . matches a literal . too, so every legitimate host still passes. The . in example.com is unescaped as well.
If an attacker registers a domain called wwwaexample.com, it bypasses the regex check.
^((www|beta)\.)?example\.com/
GOOD: the host of req.URL must be example.com, www.example.com or beta.example.com.
As shown in the image, the invalid domains do not match the regex any more.
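To make the comparison concrete, here is a minimal sketch that runs both patterns against a few hosts. The host list is made up for illustration, and each entry keeps the trailing "/" because both patterns require it.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the BAD and GOOD examples above.
var (
	badRe  = regexp.MustCompile(`^((www|beta).)?example.com/`)
	goodRe = regexp.MustCompile(`^((www|beta)\.)?example\.com/`)
)

func main() {
	// Hypothetical inputs, not taken from the original post's screenshot.
	hosts := []string{
		"example.com/",
		"www.example.com/",
		"beta.example.com/",
		"wwwaexample.com/", // look-alike domain an attacker could register
		"www.exampleXcom/", // abuses the second unescaped dot
	}
	for _, h := range hosts {
		fmt.Printf("%-20s bad=%-5t good=%t\n", h, badRe.MatchString(h), goodRe.MatchString(h))
	}
}
```

The first three hosts pass both checks, while the last two are accepted only by the BAD pattern.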
Basic Symbols
- \ is the escape character
  - \n is a new line
  - \\ is a literal \
- ^ is the start of string
- $ is the end of string
- * - {0,}: zero or more times
  - ha* matches h, ha, and haa, i.e. h(a*)
- + - {1,}: one or more times
  - ha+ matches ha, haa and haaa.
- ? - {0,1}: zero or one time
  - fabbit(s)? matches fabbit or fabbits.
  - If ? follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), that quantifier becomes ungreedy. o+? matches a single o in the string oooo, whereas the greedy o+ matches all four.
  - Thanks lg for providing this example.
- {} - number of repetitions
  - c{2} matches cc
  - c{2,} matches cc, ccc, cccc, and so on.
  - c{2,3} matches only cc and ccc.
  - c{n,m}: n <= m.
- . - matches any single character except \n. To include \n, use (.|\n).
- \s any whitespace char, \S any non-whitespace char
- \d any digit, \D any non-digit
- \w any word char, \W any non-word char
- | - a or b
  - (f|r)abbit matches fabbit or rabbit
- []
  - [abc] matches either an a, b or c character
  - [abc]+ matches a, bb, or ccc
  - [^abc]+ matches one or more characters other than a, b or c
  - [a-zA-Z] matches any character in the range a-z or A-Z
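A short sketch of how some of these symbols behave in Go's regexp package. The helper name firstMatch and the sample strings are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the leftmost match of pattern in s, or "" if there is none.
func firstMatch(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	fmt.Printf("%q\n", firstMatch(`o+`, "oooo"))           // greedy: "oooo"
	fmt.Printf("%q\n", firstMatch(`o+?`, "oooo"))          // ungreedy: "o"
	fmt.Printf("%q\n", firstMatch(`c{2,3}`, "cccc"))       // at most three: "ccc"
	fmt.Printf("%q\n", firstMatch(`ha*`, "haaa!"))         // "haaa"
	fmt.Printf("%q\n", firstMatch(`[^abc]+`, "abcxyzabc")) // "xyz"
}
```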
- Group Constructs
  - () captures everything enclosed. (he)+ matches he and hehe; in heheh it matches hehe, and in heh it matches he.
  - (?P<name>re): named & numbered capturing group (submatch)
  - (?:re): non-capturing group
  - (?flags): set flags within current group; non-capturing
  - (?flags:re): set flags during re; non-capturing
Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:
- i: case-insensitive (default false)
- m: multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
- s: let . match \n (default false)
- U: ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)
- Flag/Modifiers
  - g: Global (a regex101 option; Go has no g flag, use the FindAll* methods instead)
  - m: Multiline
  - i: Case Insensitive
  - s: Single Line
  - U: Ungreedy
- More on Go: Package syntax
GoLang Regular Expression Examples
// multiple character match
\w\w\w // matches 3 word characters
\w+ // matches as many consecutive word characters as possible (at least 1)
\w{3} // matches exactly 3 consecutive word characters
\w{2,3} // 2 to 3
\w* // 0 or more
[abc] // matches a, b or c
r[ua]n // "run" or "ran"
// digits
\d // single digit
\d\d // 2 digits
\d+ // as many as possible (at least 1)
\d{4} // exactly 4 digits
\d{3,4} // 3 to 4 digits
\d* // 0 or more
// general tokens
\t // matches tab
\r // carriage return
\0 // null character
// groups
(\d\w\d) // matches 1a1, 2e3 etc
Golang Examples
Check if match
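package main

import (
	"fmt"
	"regexp"
)

func main() {
	// regexp.Match reports whether the byte slice contains any match
	fmt.Println(regexp.Match(`\d`, []byte("1 12 23")))

	// FindString method: returns the leftmost match, or "" if there is none
	re := regexp.MustCompile("et$")
	fmt.Println(re.FindString("cricket"))
	fmt.Println(re.FindString("hacked"))

	// FindStringIndex method: returns the start and end offsets of the match
	// (re is reassigned with =, since := would redeclare it and fail to compile)
	re = regexp.MustCompile("tel")
	fmt.Println(re.FindStringIndex("telephone"))

	// FindStringSubmatch method: returns the match followed by its capturing groups
	re = regexp.MustCompile("p([a-z]+)ch")
	fmt.Println(re.FindStringSubmatch("peach punch"))
	fmt.Println(re.FindStringSubmatch("cricket"))
}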
Scraping HTML
package main

import (
	"fmt"
	"html"
	"io"
	"net/http"
	"regexp"
)

func main() {
	resp, err := http.Get("https://f4bb1t.com")
	if err != nil {
		fmt.Println("http get error.")
		return // resp is nil on error, so stop before touching resp.Body
	}
	defer resp.Body.Close()

	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("http read error.")
		return
	}
	src := string(body)

	// Match <li>...</li> on a single line (. does not match newlines)
	r := regexp.MustCompile(`<li>.*</li>`)
	// Match any HTML tag so it can be stripped from the title
	rHTML := regexp.MustCompile(`<[^>]*>`)

	titles := r.FindAllString(src, -1)
	for _, title := range titles {
		cleanTitle := rHTML.ReplaceAllString(title, "")
		fmt.Println(html.UnescapeString(cleanTitle))
	}
}
Review on CWE-020 Regex Example
CodeQL Snippet:
/**
 * Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
 * and `pattern` contains a subtle mistake that allows it to match unexpected hosts.
 */
bindingset[pattern]
predicate isIncompleteHostNameRegexpPattern(string pattern, string hostPart) {
  hostPart =
    pattern
        .regexpCapture("(?i).*?" +
            // an unescaped single `.`
            "(?<!\\\\)[.]" +
            // immediately followed by a sequence of subdomains, perhaps with some regex characters mixed in,
            // followed by a known TLD
            "(([():|?a-z0-9-]+(\\\\)?[.])?" + commonTLD() + ")" + ".*", 1)
}
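To see what the predicate looks for, here is a rough Go approximation, not the CodeQL implementation. Go's regexp package has no lookbehind, so the sketch swaps (?<!\\) for (?:^|[^\\]), which consumes the preceding character instead of only inspecting it. The TLD list is a hypothetical stand-in for commonTLD(), and the names incompleteHostRe and incompleteHostPart are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// incompleteHostRe approximates the CodeQL pattern.
// Assumption: (?:^|[^\\]) stands in for the unsupported lookbehind (?<!\\),
// and (?:com|org|net|io) is a hypothetical substitute for commonTLD().
var incompleteHostRe = regexp.MustCompile(
	`(?i)(?:^|[^\\])[.](([():|?a-z0-9-]+(\\)?[.])?(?:com|org|net|io))`)

// incompleteHostPart returns the suspicious host part and true when pattern
// contains an unescaped `.` directly in front of something host-like.
func incompleteHostPart(pattern string) (string, bool) {
	m := incompleteHostRe.FindStringSubmatch(pattern)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, p := range []string{
		`^((www|beta).)?example.com/`,   // BAD pattern from the example
		`^((www|beta)\.)?example\.com/`, // GOOD pattern from the example
	} {
		host, bad := incompleteHostPart(p)
		fmt.Printf("%-32s incomplete=%-5t hostPart=%q\n", p, bad, host)
	}
}
```

For the BAD pattern the unescaped . after (www|beta) triggers the match, and the captured host part is the text that follows it up to the TLD, regex characters included. The GOOD pattern has no unescaped ., so nothing is reported.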