CIS120 Linux Fundamentals

Regular Expressions

Understanding grep

The grep command, short for "global regular expression print," is a powerful utility used to search text files for lines that match a specified pattern. It reads the file line by line and prints any lines that contain a match. grep supports regular expressions, which allow for complex and flexible pattern matching. This makes grep an invaluable tool for searching and analyzing text data in Unix-like systems.

Basic usage of grep:

grep [options] pattern [file...]

Example: To search for lines containing the word "error" in a file named logfile.txt:

grep 'error' logfile.txt
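
As a minimal sketch of the basic usage with a few common options, the snippet below builds a small sample file and searches it. The file contents are made up for illustration; -i, -n, and -c are standard grep options.

```shell
# Create a small sample log in a temporary file (contents are hypothetical).
logfile=$(mktemp)
printf '%s\n' 'boot ok' 'disk error on sda' 'Error: timeout' 'shutdown' > "$logfile"

# Case-sensitive search: only the line with lowercase "error" matches.
plain=$(grep 'error' "$logfile")

# -i ignores case, -n prefixes each matching line with its line number.
numbered=$(grep -in 'error' "$logfile")

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -ic 'error' "$logfile")

echo "$plain"
echo "$numbered"
echo "$count"
rm -f "$logfile"
```

The first search prints one line, the second prints two numbered lines, and the third prints a count of 2.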

What are Regular Expressions?

Regular expressions (regex) are symbolic notations used to identify patterns in text. They enable powerful and flexible text searches, matches, and manipulations. Regular expressions are supported by many command-line tools and programming languages, making them essential for effective text processing.

Basic Regular Expressions (BRE)

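Basic Regular Expressions (BRE) are the simpler form of regular expressions and the default syntax for utilities like grep. BRE uses a limited set of metacharacters, and some operators, such as \{ and \}, take their special meaning only when escaped with a backslash (\). The \? and \+ operators shown below are GNU extensions rather than part of POSIX BRE.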

Common Metacharacters in BRE:

Metacharacter  Description                                                    Example   Explanation
^              Matches the start of a line                                    ^abc      Matches "abc" at the beginning of a line
$              Matches the end of a line                                      abc$      Matches "abc" at the end of a line
.              Matches any single character                                   a.c       Matches "abc", "a c", "a-c", etc.
[]             Matches any single character within the brackets               [abc]     Matches "a", "b", or "c"
*              Matches zero or more occurrences of the previous character     a*        Matches "", "a", "aa", "aaa", etc.
\{n\}          Matches exactly n occurrences of the previous character        a\{3\}    Matches "aaa"
\{n,m\}        Matches between n and m occurrences of the previous character  a\{2,4\}  Matches "aa", "aaa", or "aaaa"
\?             Matches zero or one occurrence of the previous character       a\?       Matches "a" or ""
\+             Matches one or more occurrences of the previous character      a\+       Matches "a", "aa", "aaa", etc.
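
The effect of the backslash can be checked directly. The sketch below, using made-up input lines, searches the same two lines with and without escaped braces:

```shell
# In BRE, unescaped braces are ordinary characters, so this looks for the text "a{3}".
literal=$(printf '%s\n' 'aaa' 'a{3}' | grep 'a{3}')

# Escaped braces form an interval: three consecutive "a" characters.
interval=$(printf '%s\n' 'aaa' 'a{3}' | grep 'a\{3\}')

echo "unescaped braces matched: $literal"
echo "escaped braces matched:   $interval"
```

The unescaped pattern selects only the line that literally contains a{3}, while the escaped pattern selects only aaa.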

Examples of BRE:

^abc matches "abc" at the beginning of a line:

echo "abcdef" | grep '^abc'

Output:

abcdef

abc$ matches "abc" at the end of a line:

echo "123abc" | grep 'abc$'

Output:

123abc

a.c matches "a", followed by any single character, followed by "c":

echo "abc" | grep 'a.c'

Output:

abc

[abc] matches any single character "a", "b", or "c":

echo "a" | grep '[abc]'
echo "b" | grep '[abc]'
echo "c" | grep '[abc]'

Output:

a
b
c

a* matches zero or more occurrences of "a". Because zero occurrences also count, this pattern selects every line, even one with no "a" in it:

echo "aaab" | grep 'a*'

Output:

aaab

a\{3\} matches exactly three occurrences of "a":

echo "aaa" | grep 'a\{3\}'

Output:

aaa

a\{2,4\} matches between two and four occurrences of "a":

echo "aaa" | grep 'a\{2,4\}'

Output:

aaa

a\? matches zero or one occurrence of "a" (like a*, it can match the empty string, so every line is selected):

echo "a" | grep 'a\?'

Output:

a

a\+ matches one or more occurrences of "a":

echo "aaa" | grep 'a\+'

Output:

aaa
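
Because grep prints any line that contains a match, a pattern such as a\{3\} also selects lines with longer runs of "a". A short sketch, with made-up input, of how the ^ and $ anchors pin the pattern to the whole line:

```shell
# Without anchors, any line containing three consecutive a's is selected.
unanchored=$(printf '%s\n' 'aa' 'aaa' 'aaaaa' | grep 'a\{3\}')

# With ^ and $, the whole line must be exactly "aaa".
anchored=$(printf '%s\n' 'aa' 'aaa' 'aaaaa' | grep '^a\{3\}$')

# a* can match the empty string, so even a line with no "a" is selected.
star=$(printf '%s\n' 'xyz' | grep 'a*')

echo "$unanchored"
echo "$anchored"
echo "$star"
```

The unanchored search prints aaa and aaaaa, the anchored search prints only aaa, and the last search prints xyz.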

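Examples of BRE with the find command

NOTE: find -regex matches against the whole path, not just the filename, which is why the patterns below begin with ".*". GNU find uses Emacs-style regular expressions by default; add -regextype posix-basic before -regex to select BRE.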

To find all files ending in .txt:

find /path/to/search -regex '.*\.txt'

To find files with names starting with "log" (the [^/]* keeps the match inside the filename, so a directory named "logs" does not count):

find /path/to/search -regex '.*/log[^/]*'

To find files with three-letter extensions:

find /path/to/search -regex '.*\.[a-zA-Z][a-zA-Z][a-zA-Z]'

To find files ending with .sh or .py:

find /path/to/search -regex '.*\.sh\|.*\.py'

To find files located under a directory named "backup":

find /path/to/search -regex '.*/backup/.*'

To find files that have digits in their filenames (not merely somewhere in the path):

find /path/to/search -regex '.*/[^/]*[0-9][^/]*'

To find hidden files (names starting with a dot):

find /path/to/search -regex '.*/\.[^/]*'

To find files with names starting with "data" and ending in a digit:

find /path/to/search -regex '.*/data[^/]*[0-9]'

To find files where the name is exactly 8 characters plus an extension (GNU find needs -regextype posix-basic for the \{8\} interval to work):

find /path/to/search -regextype posix-basic -regex '.*/[^/.]\{8\}\.[^/]*'

To match one specific file by its full path (the dot is escaped so that it is treated literally):

find /path/to/search -regex '/path/to/search/file\.txt'
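
To try these patterns safely, the sketch below builds a throwaway directory tree and runs a few of them against it. It assumes GNU find (for -regextype), and all file and directory names are hypothetical.

```shell
# Build a throwaway directory tree (all names are made up for illustration).
dir=$(mktemp -d)
mkdir "$dir/logs"
touch "$dir/notes.txt" "$dir/logs/log1" "$dir/logs/app.txt" "$dir/.hidden"

# -regex must match the entire path, so the pattern starts with .*
txt=$(find "$dir" -type f -regex '.*\.txt' | sort)

# [^/]* keeps the match inside the last path component.
logfiles=$(find "$dir" -type f -regex '.*/log[^/]*')

# Hidden files: the last path component starts with a dot.
hidden=$(find "$dir" -type f -regex '.*/\.[^/]*')

# -regextype posix-basic enables BRE intervals such as \{5\}.
five=$(find "$dir" -regextype posix-basic -type f -regex '.*/[^/.]\{5\}\.txt')

echo "$txt"
echo "$logfiles"
echo "$hidden"
echo "$five"
rm -rf "$dir"
```

Only logs/log1 is reported for the "log" pattern, because -type f excludes the logs directory itself and [^/]* stops app.txt from matching through its parent directory name.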

Extended Regular Expressions (ERE)

Extended Regular Expressions (ERE) are a more powerful and flexible version of regular expressions. They address the limitations of BRE by adding alternation and by making the repetition and grouping operators easier to write: in ERE, the operators +, ?, {}, |, and () work without a backslash, which makes expressions more readable. ERE is supported by grep -E (the older egrep command is equivalent but deprecated in recent GNU grep releases).

Common Metacharacters in ERE:

Metacharacter  Description                                                 Example  Explanation
^              Matches the start of a line                                 ^abc     Matches "abc" only at the beginning of a line
$              Matches the end of a line                                   abc$     Matches "abc" only at the end of a line
.              Matches any single character                                a.c      Matches "abc", "a-c", "a c", etc.
[]             Matches any single character inside the brackets            [aeiou]  Matches any lowercase vowel
*              Matches zero or more of the previous character or group     bo*      Matches "b", "bo", "boo", "booo", etc.
+              Matches one or more of the previous character or group      bo+      Matches "bo", "boo", "booo", etc., but not "b"
?              Matches zero or one of the previous character or group      colou?r  Matches "color" or "colour"
{n}            Matches exactly n of the previous character or group        a{3}     Matches "aaa"
{n,m}          Matches between n and m of the previous character or group  a{2,4}   Matches "aa", "aaa", or "aaaa"
|              Acts as OR between expressions                              cat|dog  Matches either "cat" or "dog"
()             Groups expressions                                          (ab)+    Matches "ab", "abab", "ababab", etc.
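
To see the difference in notation, the sketch below runs equivalent BRE and ERE patterns over the same made-up word list. The \(ab\) grouping form is standard BRE syntax that the BRE table above does not list, and \? relies on GNU grep.

```shell
# Sample input (hypothetical words, one per line).
sample=$(printf '%s\n' 'color' 'colour' 'colouur' 'abab' 'cat' 'dog')

# Optional character: \? in BRE (GNU extension) versus ? in ERE.
bre_optional=$(echo "$sample" | grep 'colou\?r')
ere_optional=$(echo "$sample" | grep -E 'colou?r')

# Grouping and repetition: \(ab\)\{2\} in BRE versus (ab){2} in ERE.
bre_group=$(echo "$sample" | grep '\(ab\)\{2\}')
ere_group=$(echo "$sample" | grep -E '(ab){2}')

# Alternation is written with a plain | in ERE.
ere_alt=$(echo "$sample" | grep -E 'cat|dog')

echo "$bre_optional"
echo "$ere_group"
echo "$ere_alt"
```

Each BRE and ERE pair selects the same lines; only the amount of escaping differs.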

Examples of ERE:

^abc matches "abc" at the beginning of a line:

echo "abcdef" | grep -E '^abc'

Output:

abcdef

abc$ matches "abc" at the end of a line:

echo "123abc" | grep -E 'abc$'

Output:

123abc

a.c matches "a", followed by any single character, followed by "c":

echo "abc" | grep -E 'a.c'

Output:

abc

[abc] matches any single character "a", "b", or "c":

echo "a" | grep -E '[abc]'
echo "b" | grep -E '[abc]'
echo "c" | grep -E '[abc]'

Output:

a
b
c

a* matches zero or more occurrences of "a". Because zero occurrences also count, this pattern selects every line, even one with no "a" in it:

echo "aaab" | grep -E 'a*'

Output:

aaab

a{3} matches exactly three occurrences of "a":

echo "aaa" | grep -E 'a{3}'

Output:

aaa

a{2,4} matches between two and four occurrences of "a":

echo "aaa" | grep -E 'a{2,4}'

Output:

aaa

a? matches zero or one occurrence of "a" (like a*, it can match the empty string, so every line is selected):

echo "a" | grep -E 'a?'

Output:

a

a+ matches one or more occurrences of "a":

echo "aaa" | grep -E 'a+'

Output:

aaa

a|b matches either "a" or "b":

echo "a" | grep -E 'a|b'
echo "b" | grep -E 'a|b'

Output:

a
b

(abc|def) matches "abc" or "def":

echo "abc" | grep -E '(abc|def)'
echo "def" | grep -E '(abc|def)'

Output:

abc
def
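
Grouping matters most when alternation is combined with anchors, because | has lower precedence than ^ and $. A small sketch with made-up input lines:

```shell
# Sample input (hypothetical lines).
words=$(printf '%s\n' 'abc' 'abcdef' 'xyzdef' 'def')

# Without a group, ^ applies only to "abc" and $ only to "def".
loose=$(echo "$words" | grep -E '^abc|def$')

# With a group, both anchors apply to each alternative.
strict=$(echo "$words" | grep -E '^(abc|def)$')

echo "$loose"
echo "---"
echo "$strict"
```

The ungrouped pattern selects all four lines (each one either starts with "abc" or ends with "def"), while the grouped pattern selects only the lines that are exactly abc or def.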

Examples of ERE with the find command

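NOTE: GNU find uses Emacs-style regular expressions by default, so ERE patterns do not work with a plain -regex. You can either select ERE with -regextype posix-extended or, as in the examples below, pipe find's output to grep -E.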

To find files ending in .txt or .md:

find /path/to/search -type f | grep -E '\.(txt|md)$'

To find files with names that start with "log" and end in a digit:

find /path/to/search -type f | grep -E '/log[^/]*[0-9]$'

To find files with lowercase extensions of 2 to 4 letters:

find /path/to/search -type f | grep -E '\.[a-z]{2,4}$'

To find hidden files (starting with a dot):

find /path/to/search -type f | grep -E '/\.[^/]+$'

To find files with names consisting only of digits:

find /path/to/search -type f | grep -E '/[0-9]+$'

To find files with uppercase extensions like .JPG or .PNG:

find /path/to/search -type f | grep -E '\.(JPG|PNG)$'

To find files named "data" followed by zero or more underscores (data, data_, data__, and so on):

find /path/to/search -type f | grep -E '/data_*$'

To find files with at least one underscore in the filename:

find /path/to/search -type f | grep -E '/[^/]*_+[^/]*$'

To find files with names ending in a dash followed by digits (e.g. report-2024):

find /path/to/search -type f | grep -E '/[^/]*-[0-9]+$'

To find files with names that are exactly 8 characters long:

find /path/to/search -type f | grep -E '/[^/]{8}$'
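
The two approaches, piping to grep -E and using find's own ERE support, can be compared in a scratch directory. This sketch assumes GNU find for -regextype posix-extended, and all file names are hypothetical.

```shell
# Build a throwaway directory (all names are made up for illustration).
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.md" "$dir/c.sh" "$dir/12345" "$dir/.env"

# Option 1: filter find's output with grep -E.
piped=$(find "$dir" -type f | grep -E '\.(txt|md)$' | sort)

# Option 2 (GNU find): use ERE in find itself; the pattern must match the whole path.
native=$(find "$dir" -regextype posix-extended -type f -regex '.*\.(txt|md)' | sort)

# Names consisting only of digits.
digits=$(find "$dir" -type f | grep -E '/[0-9]+$')

echo "$piped"
echo "$native"
echo "$digits"
rm -rf "$dir"
```

Both options list a.txt and b.md. The grep version anchors with $ because grep matches anywhere in the line, whereas find -regex always has to match the entire path.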

Character Classes

Character classes in regular expressions allow you to match specific sets of characters. These sets are predefined by POSIX and make it easier to work with groups of characters. A class name is written inside a bracket expression, so [:digit:] is used as [[:digit:]].

Common Character Classes:

Character Class  Description                       Example             Explanation
[:blank:]        Matches spaces and tabs           grep '[[:blank:]]'  Matches spaces and tabs in the text
[:upper:]        Matches uppercase letters         grep '[[:upper:]]'  Matches any uppercase letter
[:lower:]        Matches lowercase letters         grep '[[:lower:]]'  Matches any lowercase letter
[:digit:]        Matches digits                    grep '[[:digit:]]'  Matches any digit
[:alpha:]        Matches alphabetic characters     grep '[[:alpha:]]'  Matches any alphabetic character
[:alnum:]        Matches alphanumeric characters   grep '[[:alnum:]]'  Matches any alphanumeric character
[:punct:]        Matches punctuation characters    grep '[[:punct:]]'  Matches any punctuation character
[:space:]        Matches all whitespace characters grep '[[:space:]]'  Matches spaces, tabs, and newlines

Examples of Character Classes:

To match spaces and tabs using [:blank:]:

echo "hello	world" | grep '[[:blank:]]'

Output:

hello	world

To match uppercase letters using [:upper:]:

echo "Hello World" | grep '[[:upper:]]'

Output:

Hello World

To match digits using [:digit:]:

echo "User ID: 4521" | grep '[[:digit:]]'

Output:

User ID: 4521

To match punctuation using [:punct:]:

echo "Welcome to Earth!" | grep '[[:punct:]]'

Output:

Welcome to Earth!

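To match any whitespace character using [:space:] (grep splits its input on newlines, so each line below is selected because of the space it contains, not because of the line break):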

echo "Line one
Line two" | grep '[[:space:]]'

Output:

Line one
Line two
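
Character classes can be combined with anchors, repetition, negation, and other characters inside the same brackets. The sketch below uses a made-up list of names to show a few such combinations:

```shell
# Sample input (hypothetical names, one per line).
lines=$(printf '%s\n' 'Alice' 'bob' 'Carol42' 'DAVE' 'Eve Adams')

# One uppercase letter followed only by lowercase letters, for the whole line.
capitalized=$(echo "$lines" | grep -E '^[[:upper:]][[:lower:]]+$')

# A ^ right after the opening bracket negates the set: lines containing any non-letter.
non_alpha=$(echo "$lines" | grep '[^[:alpha:]]')

# A class can share the brackets with other characters, here an underscore.
id_like=$(echo "$lines" | grep -E '^[[:alnum:]_]+$')

echo "$capitalized"
echo "---"
echo "$non_alpha"
echo "---"
echo "$id_like"
```

Only Alice fits the capitalized pattern; Carol42 and Eve Adams contain a non-letter; and every line except Eve Adams consists solely of letters, digits, and underscores.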

Summary

Regular expressions are a powerful tool for text manipulation in Linux. They allow you to identify patterns in text and perform complex searches and replacements. Basic Regular Expressions (BRE) provide a fundamental set of pattern matching capabilities, while Extended Regular Expressions (ERE) offer more advanced features and flexibility. Character classes, like [:blank:] and [:upper:], further enhance your ability to match specific sets of characters. By understanding and mastering both BRE and ERE, along with character classes, you can significantly enhance your ability to manipulate and analyze text in Unix-like systems.