CIS120 Book

CIS120 Linux Fundamentals by Scott Shaper

Regular Expressions

Think of regular expressions like a powerful search language that lets you describe patterns instead of exact matches. It's similar to how you might describe a person to someone: "Look for someone tall wearing a red hat and blue shoes" rather than giving their exact name. With regular expressions (regex), you can tell the computer to find all text that matches a pattern like "any email address" or "phone numbers in this format." This pattern-matching superpower makes regex an essential tool for searching, validating, and manipulating text in Linux.

Quick Reference

Command	What It Does	Common Use
`grep 'pattern' file`	Searches for text matching a pattern	Finding specific lines in log files or code
`grep -E 'pattern' file`	Uses extended regular expressions	More complex pattern matching with fewer escape characters
`grep -i 'pattern' file`	Case-insensitive search	Finding text regardless of capitalization
`find | grep -E 'pattern'`	Filters find results using regex	Finding files that match specific naming patterns

When to Use Regular Expressions

When you need to search for patterns rather than exact text
When validating input formats (like emails, phone numbers, dates)
When extracting specific information from large text files
When filtering command output for specific patterns
When searching for files with complex naming patterns
When you need to perform search and replace operations with patterns

Understanding grep

The grep command (short for "global regular expression print") is like your pattern-matching detective. It searches through text looking for lines that match your specified pattern and shows you the results. It's one of the most commonly used tools for applying regular expressions in Linux.

Option	What It Does	When to Use
`-i`	Makes the search case-insensitive	When you don't care about exact capitalization
`-v`	Inverts the match (shows non-matching lines)	When you want to exclude certain patterns
`-c`	Shows only the count of matching lines	When you just need to know how many matches exist
`-n`	Shows line numbers with matches	When you need to know where matches occur
`-E`	Uses extended regular expressions	When you need more powerful pattern matching
`-o`	Shows only the matching part of the line	When you only want to see the pattern that matched
`-r`	Searches recursively through directories	When searching through multiple files and folders
`-h`	Suppresses file names in output	When you only want to see matching lines without file names

Basic grep Usage

# Find all lines containing "error" in log file
grep 'error' application.log
# Shows every line that contains the text "error", including inside longer words such as "errors"

# Search several files without showing filenames
grep -h 'error' *.log
# Shows matching lines without the "filename:" prefix that grep adds when it searches more than one file

# Case-insensitive search for warnings
grep -i 'warning' application.log
# Finds "Warning", "WARNING", "warning", etc.

# Count how many errors occurred
grep -c 'error' application.log
# Displays just the number of matching lines

# Find lines that don't contain "success"
grep -v 'success' application.log
# Shows all lines except those containing "success"
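
The options above can be tried end to end on a throwaway file. The following is a minimal sketch, assuming GNU grep and an invented five-line log; the file contents and variable names are illustrative only.

```shell
# Build a small sample log in a temporary directory (contents are made up).
workdir=$(mktemp -d)
printf '%s\n' \
  'INFO  service started' \
  'error: disk full' \
  'Warning: low memory' \
  'ERROR: connection lost' \
  'INFO  backup success' > "$workdir/application.log"

# -c counts matching lines; the search is case-sensitive by default.
lower_count=$(grep -c 'error' "$workdir/application.log")

# Adding -i also counts the line containing "ERROR".
any_case_count=$(grep -ci 'error' "$workdir/application.log")

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'Warning' "$workdir/application.log")

# -v inverts the match; combined with -c it counts the non-matching lines.
not_success=$(grep -vc 'success' "$workdir/application.log")

echo "case-sensitive: $lower_count, case-insensitive: $any_case_count"
echo "numbered match: $numbered"
echo "lines without success: $not_success"

rm -rf "$workdir"
```

Single-letter options can be combined, so `-ci` is the same as `-c -i`.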

Basic Regular Expressions (BRE)

Think of Basic Regular Expressions as the foundation vocabulary of the pattern-matching language. These are the simpler patterns that most tools support by default. In BRE, some special characters need to be escaped with a backslash (\) to use their special meaning.

Pattern	What It Matches	When to Use	Example
`^`	Beginning of a line	When you need to find patterns at the start of lines	`^ERROR` matches lines starting with "ERROR"
`$`	End of a line	When you need to find patterns at the end of lines	`failed$` matches lines ending with "failed"
`.`	Any single character	When you need to match any character in a specific position	`b.t` matches "bat", "bit", "bot", etc.
`*`	Zero or more of previous character	When something might appear multiple times or not at all	`lo*l` matches "ll", "lol", "lool", etc.
`[...]`	Any character in the brackets	When you need to match one character from a specific set	`[aeiou]` matches any vowel
`[^...]`	Any character NOT in the brackets	When you need to exclude specific characters	`[^0-9]` matches any non-digit
`\{n\}`	Exactly n occurrences	When you need an exact number of repetitions	`a\{3\}` matches exactly "aaa"
`\{n,m\}`	Between n and m occurrences	When you need a range of repetitions	`a\{2,4\}` matches "aa", "aaa", or "aaaa"
`\+`	One or more of previous character (GNU extension in BRE)	When you need at least one occurrence	`a\+` matches "a", "aa", "aaa", etc., but not ""
`\?`	Zero or one of previous character (GNU extension in BRE)	When something is optional	`colou\?r` matches "color" or "colour"

BRE Examples

# Find lines starting with "From:"
grep '^From:' email.txt
# Only matches lines that begin with "From:"

# Find lines ending with a period
grep '\.$' document.txt
# Only matches lines that end with a period

# Find all 3-letter words
grep '\<[a-zA-Z]\{3\}\>' document.txt
# Matches "cat", "dog", "The", etc.

# Find phone numbers in format 555-123-4567
grep '[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}' contacts.txt
# Matches phone numbers with that specific pattern

# Find words starting with 'a' and ending with 'e'
grep '\<a[a-z]*e\>' document.txt
# Matches "apple", "awesome", "altitude", etc.
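
The escaping rule for BRE is easiest to see side by side. This is a small sketch, assuming GNU grep (which accepts `\+` in basic mode) and made-up sample text.

```shell
# Sample text is invented for illustration.
sample=$(printf '%s\n' 'a+b' 'aaa' 'bbb')

# In BRE a bare + is an ordinary character, so this matches only the line "a+b".
literal_plus=$(printf '%s\n' "$sample" | grep -c 'a+')

# With a backslash it becomes "one or more a", matching "a+b" and "aaa".
one_or_more=$(printf '%s\n' "$sample" | grep -c 'a\+')

# -o prints each match on its own line; \< and \> mark the edges of a word.
# tr joins the matches onto one line for display.
three_letter=$(printf 'The cat sat on a mat\n' | grep -o '\<[a-zA-Z]\{3\}\>' | tr '\n' ' ')

echo "literal +: $literal_plus line(s), escaped \\+: $one_or_more line(s)"
echo "three-letter words: $three_letter"
```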

Extended Regular Expressions (ERE)

Think of Extended Regular Expressions as the advanced vocabulary that gives you more expressive power with less typing. ERE is like BRE's more modern cousin that doesn't require you to escape certain special characters. You access ERE using grep -E (the older egrep command does the same thing but is deprecated).

Pattern	What It Matches	When to Use	Example
`+`	One or more of previous character	When you need at least one occurrence	`a+` matches "a", "aa", "aaa", etc.
`?`	Zero or one of previous character	When something is optional	`colou?r` matches "color" or "colour"
`{n}`	Exactly n occurrences	When you need an exact number of repetitions	`a{3}` matches exactly "aaa"
`{n,m}`	Between n and m occurrences	When you need a range of repetitions	`a{2,4}` matches "aa", "aaa", or "aaaa"
`|`	Alternation (OR)	When matching any of several patterns	`cat|dog` matches "cat" or "dog"
`(...)`	Groups patterns together	When applying operators to multiple characters	`(ab)+` matches "ab", "abab", "ababab", etc.
`(?:...)`	Non-capturing group (Perl-compatible syntax, not part of ERE; requires `grep -P`)	When you need grouping without capturing	`(?:ab)+c` matches "abc", "ababc", etc.

ERE Examples

# Find either "error" or "warning"
grep -E 'error|warning' application.log
# Matches lines containing either word

# Find words that start with 'p' and end with 'ing'
grep -E '\bp\w+ing\b' document.txt
# Matches "playing", "programming", "presenting", etc.

# Find IPv4-style addresses
grep -E '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' network.log
# Matches patterns like 192.168.1.1 (it does not check that each number is 0-255, so 999.999.999.999 also matches)

# Find HTML tags
grep -E '<[^>]+>' webpage.html
# Matches tags such as <p>, <div>, and </div>

# Find email addresses
grep -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' contacts.txt
# Matches most standard email formats
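
The IP address pattern above accepts any group of one to three digits. When the range matters, alternation and grouping can spell out the values 0-255. The following is one possible sketch, not the only way to do it; the variable names `octet` and `strict` and the candidate addresses are invented for illustration.

```shell
# One octet, written as alternatives: 250-255, 200-249, 100-199, 0-99.
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'

# Four octets separated by literal dots, anchored to the whole line.
strict="^${octet}(\\.${octet}){3}\$"

candidates=$(printf '%s\n' '192.168.1.1' '999.999.999.999' '10.0.0.256' '8.8.8.8')

# The loose pattern accepts every candidate.
loose_count=$(printf '%s\n' "$candidates" | grep -Ec '^([0-9]{1,3}\.){3}[0-9]{1,3}$')

# The strict pattern keeps only addresses whose octets are all in range.
strict_matches=$(printf '%s\n' "$candidates" | grep -E "$strict")

echo "loose pattern matched $loose_count candidates"
echo "strict pattern matched:"
echo "$strict_matches"
```

Building the pattern from a named piece keeps the long alternation readable and avoids repeating it four times.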

Character Classes

Character classes are like shortcuts for common groups of characters. They make your patterns more readable and save you from typing long lists of characters. These POSIX classes are written as [:name:] and only work inside a bracket expression, which is why they appear with double brackets, as in [[:digit:]].

Character Class	What It Matches	When to Use	Equivalent To
`[[:alpha:]]`	Any letter	When matching alphabetic characters	`[A-Za-z]`
`[[:digit:]]`	Any digit	When matching numbers	`[0-9]`
`[[:alnum:]]`	Any letter or digit	When matching alphanumeric characters	`[A-Za-z0-9]`
`[[:space:]]`	Any whitespace	When matching spaces, tabs, newlines	`[ \t\r\n\v\f]`
`[[:blank:]]`	Spaces and tabs only	When matching horizontal whitespace	`[ \t]`
`[[:upper:]]`	Uppercase letters	When matching capital letters	`[A-Z]`
`[[:lower:]]`	Lowercase letters	When matching small letters	`[a-z]`
`[[:punct:]]`	Punctuation characters	When matching symbols and punctuation	[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
`[[:print:]]`	Printable characters	When matching visible characters	Letters, digits, spaces, punctuation
`[[:cntrl:]]`	Control characters	When matching non-printable control characters	ASCII 0-31 and 127

Character Class Examples

# Find lines that start with a digit
grep '^[[:digit:]]' data.txt
# Matches lines starting with 0-9

# Find lines containing a word made up only of letters
grep -E '\b[[:alpha:]]+\b' document.txt
# Matches any line with at least one all-letter word; add -o to print just those words

# Find lines with punctuation
grep '[[:punct:]]' document.txt
# Matches lines containing any punctuation mark

# Find words starting with uppercase
grep -E '\b[[:upper:]][[:alpha:]]*\b' document.txt
# Matches words starting with capital letters

# Find lines with whitespace at the end
grep '[[:space:]]$' code.txt
# Helps find trailing whitespace in code
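
To see several character classes working on the same input, here is a short sketch, assuming GNU grep and four invented lines of text.

```shell
# Sample lines are invented; the third one ends with a space on purpose.
sample=$(printf '%s\n' '42 apples' 'Total: 7' 'trailing space ' 'clean line')

# Count lines that begin with a digit.
starts_with_digit=$(printf '%s\n' "$sample" | grep -c '^[[:digit:]]')

# Show, with line numbers, the lines that end in whitespace.
trailing_ws=$(printf '%s\n' "$sample" | grep -n '[[:space:]]$')

# Extract every run of digits; tr joins the matches onto one line.
numbers=$(printf '%s\n' "$sample" | grep -Eo '[[:digit:]]+' | tr '\n' ' ')

# Count lines that contain any punctuation character.
with_punct=$(printf '%s\n' "$sample" | grep -c '[[:punct:]]')

echo "lines starting with a digit: $starts_with_digit"
echo "trailing whitespace at: $trailing_ws"
echo "numbers found: $numbers"
echo "lines with punctuation: $with_punct"
```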

Using Regular Expressions with find

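The find command can use regular expressions to search for files whose paths match specific patterns. This is especially useful when looking for files with complex naming conventions. Two details matter: the -regex test is compared against the entire path rather than just the file name (which is why the patterns below start with .*), and GNU find uses Emacs-style regular expressions by default, so add -regextype posix-basic or -regextype posix-extended before -regex when you want BRE or ERE syntax.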

Using find with BRE

# Find all .txt files
find /path/to/search -regex '.*\.txt$'
# Matches file.txt, notes.txt, etc.

# Find files with names containing numbers
find /path/to/search -regex '.*[0-9].*'
# Matches file1.txt, report2.pdf, etc.

# Find files with exactly 3-character extensions (-regextype posix-basic enables the \{3\} interval in GNU find)
find /path/to/search -regextype posix-basic -regex '.*\.[a-zA-Z]\{3\}$'
# Matches file.txt, image.jpg, script.php, etc.
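
Because -regex is tested against the whole path, a pattern that names only the file finds nothing. The sketch below illustrates this on a scratch directory. It assumes GNU find (for -regextype), and the directory layout is invented.

```shell
# Create a scratch directory tree (names are made up for the example).
workdir=$(mktemp -d)
mkdir "$workdir/docs"
touch "$workdir/notes.txt" "$workdir/docs/report2.pdf" "$workdir/docs/readme.md"

# A bare file name never equals the full path, so nothing is found.
name_only=$(find "$workdir" -type f -regex 'notes\.txt' | wc -l)

# Allowing any leading directories with .*/ makes the same name match.
with_prefix=$(find "$workdir" -type f -regex '.*/notes\.txt' | wc -l)

# -regextype posix-basic selects BRE so that \{3\} is read as an interval.
# sed strips the directory part so only file names are listed.
three_char_ext=$(find "$workdir" -type f -regextype posix-basic \
  -regex '.*\.[a-zA-Z]\{3\}' | sed 's|.*/||' | sort)

echo "bare name matched $name_only file(s); with .*/ prefix matched $with_prefix"
echo "files with 3-letter extensions:"
echo "$three_char_ext"

rm -rf "$workdir"
```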

Combining find with grep

# Find .txt or .log files using ERE
find /path/to/search -type f | grep -E '\.(txt|log)$'
# Lists files ending in .txt or .log

# Find files containing "backup" followed by a date (YYYYMMDD)
find /path/to/search -type f | grep -E 'backup_[0-9]{8}'
# Matches backup_20220315, backup_20231127, etc.

# Find files not in common image formats
find /path/to/search -type f | grep -vE '\.(jpg|png|gif|bmp)$'
# Lists files that don't end with common image extensions

Real-World Use Cases

Log Analysis

# Find all error messages with timestamps
grep -E '^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}).*ERROR' application.log
# Matches log lines with timestamps followed by ERROR

# Count errors by type
grep -Eo 'ERROR: [A-Za-z]+' application.log | sort | uniq -c
# Extracts "ERROR: " plus the word that follows it, then groups and counts the results

# Extract all IP addresses from a log file
grep -Eo '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' access.log | sort | uniq
# Finds all unique IP addresses
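
The three log commands above can be checked against a known input. This is a minimal sketch, assuming GNU grep and a four-line log in an invented format with a timestamp, a level, and a message.

```shell
# Write a tiny sample log (the format and entries are invented).
workdir=$(mktemp -d)
cat > "$workdir/application.log" <<'EOF'
2024-01-15 10:00:01 INFO Service started
2024-01-15 10:02:17 ERROR: Timeout while calling 10.0.0.5
2024-01-15 10:03:44 ERROR: Timeout while calling 10.0.0.7
2024-01-15 10:05:09 ERROR: Permission denied for 10.0.0.5
EOF
log="$workdir/application.log"

# Count lines that start with a timestamp and contain ERROR.
timestamped=$(grep -Ec '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*ERROR' "$log")

# Group errors by the first word after "ERROR: ".
# awk only normalises the leading spaces that uniq -c prints.
error_counts=$(grep -Eo 'ERROR: [A-Za-z]+' "$log" | sort | uniq -c | awk '{print $1, $2, $3}')

# List each IP address once.
unique_ips=$(grep -Eo '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' "$log" | sort -u)

echo "timestamped errors: $timestamped"
echo "$error_counts"
echo "$unique_ips"

rm -rf "$workdir"
```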

Code Search

# Find all function definitions in Python files
grep -r -E '^[[:space:]]*def [a-zA-Z_][a-zA-Z0-9_]*\(' --include="*.py" ./src
# Locates Python function definitions, including indented methods inside classes

# Find TODO comments in code
grep -r -E '//\s*TODO:' --include="*.js" ./src
# Finds JavaScript TODO comments
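
The effect of anchoring with ^ in a code search shows up as soon as a file contains indented definitions. The following sketch assumes GNU grep (for -r and --include) and uses two tiny invented source files.

```shell
# Create a scratch source tree (file contents are invented).
workdir=$(mktemp -d)
mkdir "$workdir/src"
cat > "$workdir/src/app.py" <<'EOF'
def main():
    pass

class Greeter:
    def greet(self):
        pass
EOF
printf '// TODO: remove debug output\n' > "$workdir/src/app.js"

# '^def' only sees definitions that start in column one.
top_level=$(grep -rhEc '^def [a-zA-Z_][a-zA-Z0-9_]*\(' --include='*.py' "$workdir/src")

# Allowing leading whitespace also finds the method inside the class.
any_indent=$(grep -rhEc '^[[:space:]]*def [a-zA-Z_][a-zA-Z0-9_]*\(' --include='*.py' "$workdir/src")

# -l lists only the names of files with a match; sed strips the directory part.
todo_files=$(grep -rlE '//\s*TODO:' --include='*.js' "$workdir/src" | sed 's|.*/||')

echo "top-level defs: $top_level, all defs: $any_indent"
echo "files with TODO comments: $todo_files"

rm -rf "$workdir"
```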

Tips for Success

Start simple and build up complex patterns incrementally
Test your patterns on a small sample of text before using them on large files
Use grep -E when possible to avoid having to escape special characters
Remember that * matches zero or more, while + matches one or more
Use character classes like [[:digit:]] for better readability
Anchor patterns with ^ and $ when you want to match entire lines
Use \b (a GNU grep extension) to match word boundaries; it works in both basic and extended regex
Use tools like grep -o to see just the matching text, not the whole line
Combine regex with other tools like sort, uniq, and awk for powerful text processing

Common Mistakes to Avoid

Forgetting that . matches any character (use \. to match a literal period)
Using * like a shell wildcard (in regex it means "zero or more of the previous character", so use .* to match any run of characters)
Not escaping special characters in basic regex (+, ?, {}, ())
Forgetting to use -E with grep when using extended regex features
Creating overly complex patterns that are hard to debug
Not accounting for possible variations in input (spaces, capitalization, etc.)
Using regex when a simpler tool would work (like plain string matching)
Not considering the context around matches (like word boundaries)
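
Two of these mistakes, the unescaped period and the missing -E, are easy to reproduce. This is a small sketch with invented sample text.

```shell
sample=$(printf '%s\n' 'file.txt' 'file_txt' 'filetxt')

# An unescaped . matches any character, so "file_txt" is counted as well.
unescaped=$(printf '%s\n' "$sample" | grep -c 'file.txt')

# \. matches only a literal period.
escaped=$(printf '%s\n' "$sample" | grep -c 'file\.txt')

pets=$(printf '%s\n' 'cat' 'dog' 'cat|dog')

# Without -E the | is an ordinary character, so only the literal text "cat|dog" matches.
without_e=$(printf '%s\n' "$pets" | grep -c 'cat|dog')

# With -E the | means "or", so every line matches.
with_e=$(printf '%s\n' "$pets" | grep -Ec 'cat|dog')

echo "unescaped dot: $unescaped line(s), escaped dot: $escaped line(s)"
echo "without -E: $without_e line(s), with -E: $with_e line(s)"
```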

Best Practices

Comment complex regex patterns to explain what they do
Break complex patterns into smaller, more manageable pieces
Use character classes and quantifiers to make patterns more readable
Test regex patterns on both matching and non-matching input
Consider case sensitivity requirements (-i vs. explicit character ranges)
Use non-capturing groups (?:...) when you don't need to reference the match (this is Perl-compatible syntax, so it needs grep -P rather than grep -E)
Save useful regex patterns in your notes or as shell aliases
When using regex in scripts, validate input to avoid regex injection attacks
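
One simple way to act on the last point is to avoid treating untrusted text as a pattern at all. grep -F searches for a fixed string, so characters such as . and * lose their special meaning. A minimal sketch follows; the variable `user_input` stands in for a hypothetical value from outside the script.

```shell
# Hypothetical value supplied by a user; the dot is meant literally.
user_input='1.5'

sample=$(printf '%s\n' 'price 1.5' 'price 125' 'price 1x5')

# As a regex, the . matches any character, so all three lines match.
as_regex=$(printf '%s\n' "$sample" | grep -c -e "$user_input")

# With -F the input is a plain string, so only "price 1.5" matches.
as_literal=$(printf '%s\n' "$sample" | grep -Fc -e "$user_input")

echo "as regex: $as_regex line(s), as fixed string: $as_literal line(s)"
```

Passing the pattern with -e also protects against input that begins with a dash being read as an option.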