CIS120 Linux Fundamentals
Regular Expressions
Understanding grep
The grep command, short for "global regular expression print," is a powerful utility used to search text files for lines that match a specified pattern. It reads the file line by line and prints any lines that contain a match. grep supports regular expressions, which allow for complex and flexible pattern matching. This makes grep an invaluable tool for searching and analyzing text data in Unix-like systems.
Basic usage of grep:
grep [options] pattern [file...]
Example: To search for lines containing the word "error" in a file named logfile.txt:
grep 'error' logfile.txt
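A few standard options change what grep reports. The sketch below is only an illustration: it assumes a throwaway file with made-up log lines (created with mktemp, not the logfile.txt named above) and uses the -c, -i, -n, and -v options.

```shell
# Build a small throwaway log file (hypothetical contents, for illustration only).
logfile=$(mktemp)
printf '%s\n' 'boot ok' 'disk error on sda' 'Network ERROR' 'shutdown ok' > "$logfile"

# -c counts matching lines instead of printing them.
exact_count=$(grep -c 'error' "$logfile")

# -i ignores case, so "ERROR" is counted as well.
nocase_count=$(grep -ci 'error' "$logfile")

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'error' "$logfile")

# -v inverts the match: count the lines that do NOT mention an error.
clean_count=$(grep -cv -i 'error' "$logfile")

echo "case-sensitive: $exact_count, case-insensitive: $nocase_count, clean: $clean_count"
echo "$numbered"
rm -f "$logfile"
```

Because grep is case-sensitive by default, the plain search finds one line while the -i search finds two.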
What are Regular Expressions?
Regular expressions (regex) are symbolic notations used to identify patterns in text. They enable powerful and flexible text searches, matches, and manipulations. Regular expressions are supported by many command-line tools and programming languages, making them essential for effective text processing.
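As a small illustration of the same pattern working in more than one tool, the sketch below uses one regular expression for a search with grep and for a replacement with sed. The sample sentence and the placeholder letter N are made up for this example.

```shell
# Hypothetical sample text used only for this illustration.
line='build 42 finished in 7 seconds'

# grep: count the lines that contain at least one run of digits.
found=$(printf '%s\n' "$line" | grep -c '[0-9][0-9]*')

# sed: reuse the same pattern to replace every run of digits with "N".
masked=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/N/g')

echo "$found"
echo "$masked"
```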
Basic Regular Expressions (BRE)
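Basic Regular Expressions (BRE) are the simpler form of regular expressions and the default syntax of utilities like grep. BRE uses a limited set of metacharacters and requires some operators to be escaped with a backslash (\), for example \{ and \}. The \? and \+ forms listed below are GNU extensions to POSIX BRE and may not work in every grep implementation.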
Common Metacharacters in BRE:
Metacharacter | Description | Example | Explanation |
---|---|---|---|
^ | Matches the start of a line | ^abc | Matches "abc" at the beginning of a line |
$ | Matches the end of a line | abc$ | Matches "abc" at the end of a line |
. | Matches any single character | a.c | Matches "abc", "a c", "a-c", etc. |
[] | Matches any single character within the brackets | [abc] | Matches "a", "b", or "c" |
* | Matches zero or more occurrences of the previous character | a* | Matches "", "a", "aa", "aaa", etc. |
\{n\} | Matches exactly n occurrences of the previous character | a\{3\} | Matches "aaa" |
\{n,m\} | Matches between n and m occurrences of the previous character | a\{2,4\} | Matches "aa", "aaa", or "aaaa" |
\? | Matches zero or one occurrence of the previous character | a\? | Matches "a" or "" |
\+ | Matches one or more occurrences of the previous character | a\+ | Matches "a", "aa", "aaa", etc. |
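The role of the backslash in BRE can be checked directly. In this minimal sketch (the two input lines are invented for the test), an unescaped brace is treated as an ordinary character, while the escaped form acts as the repetition operator.

```shell
# Two hypothetical input lines: a literal "a{3}" and three letters "aaa".
sample() { printf '%s\n' 'a{3}' 'aaa'; }

# Unescaped braces are ordinary characters in BRE, so this finds the literal text.
literal=$(sample | grep 'a{3}')

# Escaped braces form an interval, so this finds three consecutive "a" characters.
interval=$(sample | grep 'a\{3\}')

echo "literal match:  $literal"
echo "interval match: $interval"
```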
Examples of BRE:
^abc matches "abc" at the beginning of a line:
echo "abcdef" | grep '^abc'
Output:
abcdef
abc$ matches "abc" at the end of a line:
echo "123abc" | grep 'abc$'
Output:
123abc
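The two anchors can also be combined to require that the whole line equals the pattern. The sketch below uses three made-up input lines and compares the anchored patterns with the standard -x option, which applies the same whole-line requirement.

```shell
# Hypothetical input: one exact line, one with a suffix, one with a prefix.
sample() { printf '%s\n' 'abc' 'abcdef' 'xabc'; }

starts=$(sample | grep -c '^abc')    # lines beginning with "abc"
ends=$(sample | grep -c 'abc$')      # lines ending with "abc"
whole=$(sample | grep -c '^abc$')    # lines that are exactly "abc"
whole_x=$(sample | grep -cx 'abc')   # -x requires the whole line to match

echo "starts=$starts ends=$ends whole=$whole whole_x=$whole_x"
```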
a.c matches any single character between "a" and "c":
echo "abc" | grep 'a.c'
Output:
abc
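Because the dot matches any character, a literal dot has to be escaped. A minimal sketch with invented input lines:

```shell
# Hypothetical input lines: a letter, a literal dot, and nothing between "a" and "c".
sample() { printf '%s\n' 'abc' 'a.c' 'ac'; }

# The unescaped dot accepts any single character, so "abc" and "a.c" both match.
any_char=$(sample | grep -c 'a.c')

# The escaped dot accepts only a literal ".", so only "a.c" matches.
literal_dot=$(sample | grep 'a\.c')

echo "any_char=$any_char literal_dot=$literal_dot"
```

Neither pattern matches "ac", since the dot always stands for exactly one character.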
[abc] matches any single character "a", "b", or "c":
echo "a" | grep '[abc]'
echo "b" | grep '[abc]'
echo "c" | grep '[abc]'
Output:
a
b
c
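Bracket expressions also accept ranges such as 0-9, and a leading ^ inside the brackets negates the set. The following sketch, with made-up three-letter inputs, shows both forms.

```shell
# Hypothetical input lines that differ only in the middle character.
sample() { printf '%s\n' 'cat' 'cot' 'cut' 'c9t'; }

# A listed set: the middle character must be one of a, o, u.
vowel_middle=$(sample | grep -c 'c[aou]t')

# A range: the middle character must be a digit.
digit_middle=$(sample | grep 'c[0-9]t')

# A negated set: the middle character must NOT be a digit.
non_digit_middle=$(sample | grep -c 'c[^0-9]t')

echo "$vowel_middle $digit_middle $non_digit_middle"
```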
a* matches zero or more occurrences of "a" (because zero occurrences count, this pattern also matches lines that contain no "a" at all):
echo "aaab" | grep 'a*'
Output:
aaab
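Since a* is satisfied by an empty match, it selects every line. The sketch below, using two invented lines, contrasts it with aa*, which requires at least one "a", and uses the -o option to print only the matched text.

```shell
# Hypothetical input: one line with "a" characters and one without.
sample() { printf '%s\n' 'aaab' 'xyz'; }

# 'a*' can match the empty string, so both lines are counted.
star_only=$(sample | grep -c 'a*')

# 'aa*' needs one "a" followed by zero or more, so only "aaab" is counted.
at_least_one=$(sample | grep -c 'aa*')

# -o prints just the part of the line that matched.
matched_text=$(sample | grep -o 'aa*')

echo "star_only=$star_only at_least_one=$at_least_one matched_text=$matched_text"
```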
a\{3\} matches exactly three occurrences of "a":
echo "aaa" | grep 'a\{3\}'
Output:
aaa
a\{2,4\} matches between two and four occurrences of "a":
echo "aaa" | grep 'a\{2,4\}'
Output:
aaa
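An interval counts repetitions inside the match, not the length of the line, so an unanchored a\{3\} also selects lines with more than three "a" characters. A short sketch with made-up input shows how anchors turn it into an exact-length test.

```shell
# Hypothetical input lines with two, three, and four "a" characters.
sample() { printf '%s\n' 'aa' 'aaa' 'aaaa'; }

# Unanchored: any line containing three consecutive "a" characters matches.
loose=$(sample | grep -c 'a\{3\}')

# Anchored: the line must consist of exactly three "a" characters.
strict=$(sample | grep '^a\{3\}$')

echo "loose=$loose strict=$strict"
```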
a\? matches zero or one occurrence of "a" (like a*, it can match the empty string, so every line matches):
echo "a" | grep 'a\?'
Output:
a
a\+ matches one or more occurrences of "a":
echo "aaa" | grep 'a\+'
Output:
aaa
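The \+ and \? operators are GNU extensions in BRE, so scripts that must run on other grep implementations can use interval forms instead. The sketch below assumes GNU grep for the first command in each pair and uses invented input lines.

```shell
# Hypothetical input lines for the "one or more" comparison.
letters() { printf '%s\n' 'b' 'ab' 'aab'; }

gnu_plus=$(letters | grep -c 'a\+')         # GNU extension
posix_plus=$(letters | grep -c 'a\{1,\}')   # POSIX interval: one or more

# Hypothetical input lines for the "zero or one" comparison.
spellings() { printf '%s\n' 'color' 'colour'; }

gnu_optional=$(spellings | grep -c 'colou\?r')         # GNU extension
posix_optional=$(spellings | grep -c 'colou\{0,1\}r')  # POSIX interval: zero or one

echo "$gnu_plus $posix_plus $gnu_optional $posix_optional"
```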
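Examples of BRE with the find command
NOTE: -regex matches against the whole path, not just the file name, which is why most patterns below start with .*. GNU find uses Emacs-style syntax by default; placing -regextype posix-basic before -regex requests BRE explicitly.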
To find all files ending in .txt:
find /path/to/search -regex '.*\.txt'
To find files with names starting with "log" (the [^/]* keeps the match inside the final path component):
find /path/to/search -regex '.*/log[^/]*'
To find files with exactly three-letter extensions:
find /path/to/search -regex '.*\.[a-zA-Z][a-zA-Z][a-zA-Z]'
To find files ending with .sh or .py:
find /path/to/search -regex '.*\.sh\|.*\.py'
To find files located under a directory named "backup":
find /path/to/search -regex '.*/backup/.*'
To find files that have digits in their filenames (not just somewhere in the path):
find /path/to/search -regex '.*/[^/]*[0-9][^/]*'
To find hidden files (names starting with a dot):
find /path/to/search -regex '.*/\.[^/]*'
To find files named "data" followed only by digits, such as data1 or data2024 (a bare "data" also matches because * allows zero digits):
find /path/to/search -regex '.*/data[0-9]*'
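To find files where the name is exactly 8 characters plus an extension (GNU find's default Emacs-style syntax does not treat \{8\} as an interval, so BRE is selected with -regextype posix-basic):
find /path/to/search -regextype posix-basic -regex '.*/[^/.]\{8\}\.[^/]*'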
To find a specific file by matching its full path literally (the dot is escaped so that it matches only a literal "."):
find /path/to/search -regex '/path/to/search/file\.txt'
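The whole-path behavior of -regex is easy to verify in a scratch directory. This sketch assumes GNU find (for -regextype) and creates a temporary directory with made-up file names; none of these names come from the examples above.

```shell
# Scratch directory with a few hypothetical files (removed at the end).
workdir=$(mktemp -d)
mkdir "$workdir/logs"
touch "$workdir/notes.txt" "$workdir/run.sh" "$workdir/logs/log1"

# The leading .* lets the pattern cover the directory part of the path.
txt_files=$(find "$workdir" -regextype posix-basic -regex '.*\.txt')

# Without it nothing is found, because -regex must match the entire path.
name_only=$(find "$workdir" -regextype posix-basic -regex 'notes\.txt')

# \| separates two alternatives (a GNU extension in BRE).
either=$(find "$workdir" -regextype posix-basic -regex '.*\.sh\|.*\.txt' | wc -l | tr -d ' ')

# [^/]* keeps the match inside the file name; -type f skips the "logs" directory.
log_named=$(find "$workdir" -regextype posix-basic -type f -regex '.*/log[^/]*')

echo "$txt_files"
echo "$either"
echo "$log_named"
rm -rf "$workdir"
```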
Extended Regular Expressions (ERE)
Extended Regular Expressions (ERE) are a more powerful and flexible version of regular expressions. They were developed to address the limitations of BRE by introducing additional metacharacters and operators. ERE does not require escaping for operators such as +, ?, {}, |, and (), making the expressions more readable and easier to write. ERE is selected with grep -E; the older egrep command behaves the same way but is deprecated in recent GNU grep releases.
Common Metacharacters in ERE:
Metacharacter | Description | Example | Explanation |
---|---|---|---|
^ | Matches the start of a line | ^abc | Matches "abc" only at the beginning of a line |
$ | Matches the end of a line | abc$ | Matches "abc" only at the end of a line |
. | Matches any single character | a.c | Matches "abc", "a-c", "a c", etc. |
[] | Matches any single character inside the brackets | [aeiou] | Matches any lowercase vowel |
* | Matches zero or more of the previous character or group | bo* | Matches "b", "bo", "boo", "booo", etc. |
+ | Matches one or more of the previous character or group | bo+ | Matches "bo", "boo", "booo", etc., but not "b" |
? | Matches zero or one of the previous character or group | colou?r | Matches "color" or "colour" |
{n} | Matches exactly n of the previous character or group | a{3} | Matches "aaa" |
{n,m} | Matches between n and m of the previous character or group | a{2,4} | Matches "aa", "aaa", or "aaaa" |
| | Acts as OR between expressions | cat|dog | Matches either "cat" or "dog" |
() | Groups expressions | (ab)+ | Matches "ab", "abab", "ababab", etc. |
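The same search can usually be written in either syntax; only the escaping differs. The sketch below assumes GNU grep (for \? in BRE) and uses invented word lists to compare equivalent BRE and ERE patterns.

```shell
# Hypothetical word list for the optional-character comparison.
words() { printf '%s\n' 'color' 'colour' 'colouur'; }

bre=$(words | grep 'colou\?r')      # BRE: the ? operator needs a backslash
ere=$(words | grep -E 'colou?r')    # ERE: the ? operator is written bare

# Hypothetical list for the grouping comparison.
pairs() { printf '%s\n' 'ab' 'abab' 'aba'; }

bre_groups=$(pairs | grep -c '^\(ab\)\{1,\}$')   # BRE: escaped group and interval
ere_groups=$(pairs | grep -cE '^(ab)+$')         # ERE: bare group and +

echo "$bre"
echo "$ere"
echo "$bre_groups $ere_groups"
```

Both syntaxes select "color" and "colour" but not "colouur", and both count two lines made only of repeated "ab".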
Examples of ERE:
^abc matches "abc" at the beginning of a line:
echo "abcdef" | grep -E '^abc'
Output:
abcdef
abc$ matches "abc" at the end of a line:
echo "123abc" | grep -E 'abc$'
Output:
123abc
a.c matches any single character between "a" and "c":
echo "abc" | grep -E 'a.c'
Output:
abc
[abc] matches any single character "a", "b", or "c":
echo "a" | grep -E '[abc]'
echo "b" | grep -E '[abc]'
echo "c" | grep -E '[abc]'
Output:
a
b
c
a* matches zero or more occurrences of "a" (because zero occurrences count, this pattern also matches lines that contain no "a" at all):
echo "aaab" | grep -E 'a*'
Output:
aaab
a{3} matches exactly three occurrences of "a":
echo "aaa" | grep -E 'a{3}'
Output:
aaa
a{2,4} matches between two and four occurrences of "a":
echo "aaa" | grep -E 'a{2,4}'
Output:
aaa
a? matches zero or one occurrence of "a" (like a*, it can match the empty string, so every line matches):
echo "a" | grep -E 'a?'
Output:
a
a+ matches one or more occurrences of "a":
echo "aaa" | grep -E 'a+'
Output:
aaa
a|b matches either "a" or "b":
echo "a" | grep -E 'a|b'
echo "b" | grep -E 'a|b'
Output:
a
b
(abc|def) matches "abc" or "def":
echo "abc" | grep -E '(abc|def)'
echo "def" | grep -E '(abc|def)'
Output:
abc
def
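Grouping matters when alternation is combined with anchors, because | has the lowest precedence. In this sketch, with made-up input lines, the ungrouped pattern is read as "starts with cat" or "ends with dog", while the grouped pattern requires the whole line to be one of the two words.

```shell
# Hypothetical input lines.
animals() { printf '%s\n' 'cat' 'dog' 'catalog' 'hotdog' 'bird'; }

# Read as (^cat) OR (dog$): cat, dog, catalog, and hotdog all match.
ungrouped=$(animals | grep -cE '^cat|dog$')

# The group applies both anchors to each alternative: only cat and dog match.
grouped=$(animals | grep -cE '^(cat|dog)$')

echo "ungrouped=$ungrouped grouped=$grouped"
```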
Examples of ERE with the find command
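NOTE: The examples below pipe the output of find to grep -E, which works with any find implementation. GNU find can also apply ERE directly when -regextype posix-extended is placed before -regex, and BSD/macOS find offers the -E option for the same purpose.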
To find files ending in .txt or .md:
find /path/to/search -type f | grep -E '\.(txt|md)$'
To find files named "log" followed by one or more digits (for example log1 or log2024):
find /path/to/search -type f | grep -E '/log[0-9]+$'
To find files with lowercase extensions of 2 to 4 letters:
find /path/to/search -type f | grep -E '\.[a-z]{2,4}$'
To find hidden files (starting with a dot):
find /path/to/search -type f | grep -E '/\.[^/]+$'
To find files with names that contain digits only:
find /path/to/search -type f | grep -E '/[0-9]+$'
To find files with uppercase extensions like .JPG or .PNG:
find /path/to/search -type f | grep -E '\.(JPG|PNG)$'
To find files named "data" followed by zero or more underscores:
find /path/to/search -type f | grep -E '/data_*$'
To find files with at least one underscore in the filename:
find /path/to/search -type f | grep -E '/[^/]*_+[^/]*$'
To find files with names ending in a dash followed by digits (e.g. report-2024):
find /path/to/search -type f | grep -E '/[^/]*-[0-9]+$'
To find files with names that are exactly 8 characters long:
find /path/to/search -type f | grep -E '/[^/]{8}$'
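The pipeline approach can be tried safely in a scratch directory. The sketch below creates a temporary directory with made-up file names and runs a few of the patterns above against it; the last command assumes GNU find and shows the -regextype posix-extended form for comparison.

```shell
# Scratch directory with a few hypothetical files (removed at the end).
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/b.md" "$workdir/c.JPG" "$workdir/12345" "$workdir/.profile"

# Files ending in .txt or .md.
doc_count=$(find "$workdir" -type f | grep -cE '\.(txt|md)$')

# Files whose names consist of digits only.
digits_only=$(find "$workdir" -type f | grep -E '/[0-9]+$')

# Hidden files: the final path component starts with a dot.
hidden=$(find "$workdir" -type f | grep -E '/\.[^/]+$')

# GNU find only: the same .txt/.md search without a pipe.
native_count=$(find "$workdir" -regextype posix-extended -type f -regex '.*\.(txt|md)' | wc -l | tr -d ' ')

echo "$doc_count $native_count"
echo "$digits_only"
echo "$hidden"
rm -rf "$workdir"
```

Note that grep searches for the pattern anywhere in each path, so the $ anchor is what ties these patterns to the end of the file name, whereas find -regex must match the whole path and therefore needs the leading .*.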
Character Classes
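Character classes in regular expressions allow you to match specific sets of characters. These sets are predefined and make it easier to work with groups of characters. A class name such as [:digit:] is only valid inside a bracket expression, which is why the examples below use double brackets, as in [[:digit:]].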
Common Character Classes:
Character Class | Description | Example | Explanation |
---|---|---|---|
[:blank:] | Matches spaces and tabs | grep '[[:blank:]]' | Matches spaces and tabs in the text |
[:upper:] | Matches uppercase letters | grep '[[:upper:]]' | Matches any uppercase letter |
[:lower:] | Matches lowercase letters | grep '[[:lower:]]' | Matches any lowercase letter |
[:digit:] | Matches digits | grep '[[:digit:]]' | Matches any digit |
[:alpha:] | Matches alphabetic characters | grep '[[:alpha:]]' | Matches any alphabetic character |
[:alnum:] | Matches alphanumeric characters | grep '[[:alnum:]]' | Matches any alphanumeric character |
[:punct:] | Matches punctuation characters | grep '[[:punct:]]' | Matches any punctuation character |
[:space:] | Matches all whitespace characters | grep '[[:space:]]' | Matches spaces, tabs, and newlines |
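Because a class name sits inside a bracket expression, several classes can share one set of brackets, and the set can be negated with a leading ^. The sketch below uses short made-up strings and the -o option to print only the matched characters.

```shell
# Extract the run of digits from a line (ERE + applies to the bracket expression).
user_id=$(printf '%s\n' 'User ID: 4521' | grep -oE '[[:digit:]]+')

# Print each uppercase letter on its own line, then join them.
capitals=$(printf '%s\n' 'Hello World' | grep -oE '[[:upper:]]' | tr -d '\n')

# Two classes in one negated set: anything that is neither alphanumeric nor whitespace.
symbols=$(printf '%s\n' 'Welcome to Earth!' | grep -oE '[^[:alnum:][:space:]]')

echo "$user_id $capitals $symbols"
```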
Examples of Character Classes:
To match spaces and tabs using [:blank:]:
echo "hello world" | grep '[[:blank:]]'
Output:
hello world
To match uppercase letters using [:upper:]:
echo "Hello World" | grep '[[:upper:]]'
Output:
Hello World
To match digits using [:digit:]:
echo "User ID: 4521" | grep '[[:digit:]]'
Output:
User ID: 4521
To match punctuation using [:punct:]:
echo "Welcome to Earth!" | grep '[[:punct:]]'
Output:
Welcome to Earth!
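To match any whitespace character using [:space:] (grep reads its input line by line, so the match here comes from the space inside each line, not from the newline between them):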
echo "Line one
Line two" | grep '[[:space:]]'
Output:
Line one
Line two
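The line-by-line behavior can be confirmed with input whose only whitespace is the newline itself. This is a minimal sketch with made-up input; the `|| true` only keeps the script running when grep finds nothing and exits with status 1.

```shell
# Two lines with a space inside each: both match.
with_space=$(printf '%s\n' 'Line one' 'Line two' | grep -c '[[:space:]]')

# Two lines separated only by a newline: grep strips the newline, so nothing matches.
newline_only=$(printf '%s\n' 'one' 'two' | grep -c '[[:space:]]' || true)

# A tab counts as both [:blank:] and [:space:].
with_tab=$(printf 'a\tb\n' | grep -c '[[:blank:]]')

echo "with_space=$with_space newline_only=$newline_only with_tab=$with_tab"
```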
Summary
Regular expressions are a powerful tool for text manipulation in Linux. They allow you to identify patterns in text and perform complex searches and replacements. Basic Regular Expressions (BRE) provide a fundamental set of pattern matching capabilities, while Extended Regular Expressions (ERE) offer more advanced features and flexibility. Character classes, like [:blank:] and [:upper:], further enhance your ability to match specific sets of characters. By understanding and mastering both BRE and ERE, along with character classes, you can significantly enhance your ability to manipulate and analyze text in Unix-like systems.