Shotts ch. 19


In the following example, we will use find with a regular expression to find every pathname that contains any character that is not a member of the following set: [-_./0-9a-zA-Z]. Such a scan would reveal pathnames that contain embedded spaces and other potentially offensive characters. Because find requires an exact match of the entire pathname, we use .* at both ends of the expression to match zero or more instances of any character.

"[me@linuxbox ~]$ find . -regex '.*[^-_./0-9a-zA-Z].*'".

The + metacharacter works much like the *, except it requires at least one instance of the preceding element to cause a match. Here is a regular expression that will match only the lines consisting of groups of one or more alphabetic characters separated by single spaces: (enclosed in ""s)

"^([[:alpha:]]+ ?)+$". Let's try it.
[me@linuxbox ~]$ echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
This that
[me@linuxbox ~]$ echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
a b c
[me@linuxbox ~]$ echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$ echo "abc  d" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$
We see that this expression does not match the line a b 9 because it contains a nonalphabetic character; nor does it match abc  d because more than one space character separates the characters c and d.

Grep command to find lines containing either string "bzip" or string "gzip"

"grep -h '[bg]zip' dirlist*.txt"

If we wanted to construct a regular expression that would find every file in our lists beginning with an uppercase letter, we could do this: (2 ways)

"grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt"
"grep -h '^[A-Z]' dirlist*.txt"

?: This quantifier means, in effect,

"make the preceding element optional."

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: ^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: (7)

( ) { } ? + | However (and this is the fun part), the (, ), {, and } characters are treated as metacharacters in BRE if they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash causes it to be treated as a literal.
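
The escaping rule above can be checked with a small sketch. The two sample lines (aaa and a{3}) and the variable names are made up for illustration; they are not from the book.

```shell
# BRE (plain grep): braces act as an interval only when escaped.
bre_result=$(printf '%s\n' 'aaa' 'a{3}' | grep 'a\{3\}')

# ERE (grep -E): braces act as an interval without escaping.
ere_result=$(printf '%s\n' 'aaa' 'a{3}' | grep -E 'a{3}')

# ERE with backslashes: the braces are now literal characters.
ere_literal=$(printf '%s\n' 'aaa' 'a{3}' | grep -E 'a\{3\}')

printf 'BRE escaped braces match: %s\n' "$bre_result"
printf 'ERE plain braces match:   %s\n' "$ere_result"
printf 'ERE escaped braces match: %s\n' "$ere_literal"
```

The first two commands select the line aaa; the third selects the line that literally contains a{3}.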

What this expression means: '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'

A literal ( followed by exactly three digits, a literal ), a space, three digits, a hyphen, and four digits. The anchor metacharacters at each end of the expression ensure that the number has no extra characters at either end.
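
A minimal sketch of using this expression as a validator. The function name is_phone and the sample numbers are hypothetical; the pattern is the one from the card and needs grep -E because it uses unescaped braces.

```shell
# Hypothetical helper: succeeds only for the form (nnn) nnn-nnnn.
is_phone() {
    printf '%s\n' "$1" | grep -Eq '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'
}

for number in "(555) 123-4567" "555 123-4567" "(555) 123-45678"; do
    if is_phone "$number"; then
        printf 'valid:   %s\n' "$number"
    else
        printf 'invalid: %s\n' "$number"
    fi
done
```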

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: ^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: ( ) { } ? + | However (and this is the fun part), the (4)___ characters are treated as metacharacters in BRE if they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash causes it to be treated as a literal. Any weirdness that comes along will be covered in the discussions that follow.

(, ), {, and }

grep Option: Long option of option: -h

--no-filename

grep Option: Long option of option: -c

--count

grep Option: Long option of option: -l

--files-with-matches

grep Option: Long option of option: -L

--files-without-match

grep Option: Long option of option: -i

--ignore-case

grep Option: Long option of option: -v

--invert-match

grep Option: Long option of option: -n

--line-number

The locate program supports both basic (the __ option) and extended (the __ option) regular expressions.

--regexp; --regex

Because the features we are going to discuss next are part of ERE, we are going to need to use a different grep. Traditionally, this has been performed by the egrep program, but the GNU version of grep also supports extended regular expressions when the __ option is used.

-E

grep Options: Option with description: (Like the -l option, but print only the names of files that do not contain matches.)

-L

grep Options: Option with description: (Print the number of matches (or non-matches if the -v option is also specified) instead of the lines themselves.)

-c

grep Options: Option with description: (Ignore case. Do not distinguish between uppercase and lowercase characters.)

-i

grep Options: Option with description: (Print the name of each file that contains a match instead of the lines themselves.)

-l

grep Options: Option with description: (Prefix each matching line with the number of the line within the file.)

-n

grep Options: Option with description: (Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match.)

-v

less and vim both share the same method of searching for text. Pressing the / key followed by a regular expression will perform a search. If we use less to view our phonelist.txt file, like so: [me@linuxbox ~]$ less phonelist.txt and then search for our validation expression, like this: /^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$ then less will highlight the strings that match. vim, on the other hand, supports basic regular expressions, so our search expression would look like this:

/([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\} We can see that the expression is mostly the same; however, many of the characters that are considered metacharacters in extended expressions are considered literals in basic expressions. They are treated only as metacharacters when escaped with a backslash. Depending on the particular configuration of vim on our system, the matching will be highlighted.

Command to enter within less to find and highlight entries in this format: (###) ###-####:

/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$

In character ranges, we see that the dash character is treated specially, so how do we actually include a dash character in a bracket expression?

By making it the first character in the expression.
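
A small sketch of the difference, using three made-up one-character lines. LC_ALL=C is set here as an assumption so that the range A-Z follows ASCII order regardless of the system locale.

```shell
# Dash between two characters: a range from A to Z.
as_range=$(printf '%s\n' 'B' '-' 'Z' | LC_ALL=C grep '[A-Z]' | paste -s -d ' ' -)

# Dash first: a set of three literal characters, '-', 'A', and 'Z'.
as_literal=$(printf '%s\n' 'B' '-' 'Z' | LC_ALL=C grep '[-AZ]' | paste -s -d ' ' -)

printf 'range   [A-Z] matches: %s\n' "$as_range"
printf 'literal [-AZ] matches: %s\n' "$as_literal"
```

The range form selects B and Z; the dash-first form selects the dash and Z, but not B.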

Description of option: -h

For multifile searches, suppress the output of filenames.

"[me@linuxbox ~]$ grep bzip dirlist*.txt" Result: dirlist-bin.txt:bzip2 dirlist-bin.txt:bzip2recover. What we see is:

In this example, grep searches all the listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt.

?: This quantifier means, in effect, "make the preceding element optional." Let's say we wanted to check a phone number for validity and we considered a phone number to be valid if it matched either of these two forms, where n is a numeral:
- (nnn) nnn-nnnn
- nnn nnn-nnnn
We could construct a regular expression like this: "^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$". Explain:

In this expression, we follow the parentheses characters with question marks to indicate that they are to be matched zero or one time. Again, because the parentheses are normally metacharacters (in ERE), we precede them with backslashes to cause them to be treated as literals instead.
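
A sketch of this expression in use, assuming grep -E. The function name accepts_either_form and the sample numbers are hypothetical.

```shell
# Hypothetical helper built on the expression from the card.
accepts_either_form() {
    printf '%s\n' "$1" |
        grep -Eq '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
}

for number in "(555) 123-4567" "555 123-4567" "555-123-4567" "(555 123-4567"; do
    if accepts_either_form "$number"; then
        printf 'match:    %s\n' "$number"
    else
        printf 'no match: %s\n' "$number"
    fi
done
```

Because each parenthesis is optional on its own, the last sample with an unbalanced parenthesis also matches; the expression does not check that the two appear together.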

You can opt to have your system use the traditional (ASCII) collation order by changing the value of the __ environment variable.

LANG

grep Options: Long option and Description of option: -c

Long option: --count. Description: Print the number of matches (or non-matches if the -v option is also specified) instead of the lines themselves.

grep Options: Long option and Description of option: -l

Long option: --files-with-matches. Description: Print the name of each file that contains a match instead of the lines themselves.

grep Options: Long option and Description of option: -L

Long option: --files-without-match. Description: Like the -l option, but print only the names of files that do not contain matches.

grep Options: Long option and Description of option: -i

Long option: --ignore-case. Description: Ignore case. Do not distinguish between uppercase and lowercase characters.

grep Options: Long option and Description of option: -v

Long option: --invert-match. Description: Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match.

grep Options: Long option and Description of option: -n

Long option: --line-number. Description: Prefix each matching line with the number of the line within the file.
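
The options above can be tried on two small throwaway files. The file names a.txt and b.txt and their contents are invented for this sketch, and mktemp -d is assumed to be available.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' bzip2 gzip Zip > "$tmpdir/a.txt"
printf '%s\n' ls cat > "$tmpdir/b.txt"

count=$(grep -c zip "$tmpdir/a.txt")              # lines containing zip
count_nocase=$(grep -ci zip "$tmpdir/a.txt")      # same, ignoring case
numbered=$(grep -n gzip "$tmpdir/a.txt")          # prefix with line number
inverted=$(grep -v zip "$tmpdir/a.txt")           # lines without zip
with_match=$(cd "$tmpdir" && grep -l zip a.txt b.txt)      # files with a match
without_match=$(cd "$tmpdir" && grep -L zip a.txt b.txt)   # files without one

rm -r "$tmpdir"

printf '%s\n' "-c: $count" "-ci: $count_nocase" "-n: $numbered" \
    "-v: $inverted" "-l: $with_match" "-L: $without_match"
```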

Meaning of specifier: {n,m}

Match the preceding element if it occurs at least n times but no more than m times.

Meaning of specifier: {n}

Match the preceding element if it occurs exactly n times.

Meaning of specifier: {n,}

Match the preceding element if it occurs n or more times.

Meaning of specifier: {,m}

Match the preceding element if it occurs no more than m times.
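
A sketch of the four specifiers applied to lines of one to four a characters, assuming grep -E. The sample lines are invented, and the {,m} form is written as {0,m} here on the assumption that not every grep accepts an omitted lower bound.

```shell
samples=$(printf '%s\n' a aa aaa aaaa)

exactly_two=$(printf '%s\n' "$samples" | grep -Ec '^a{2}$')      # {n}
two_to_three=$(printf '%s\n' "$samples" | grep -Ec '^a{2,3}$')   # {n,m}
two_or_more=$(printf '%s\n' "$samples" | grep -Ec '^a{2,}$')     # {n,}
at_most_two=$(printf '%s\n' "$samples" | grep -Ec '^a{0,2}$')    # {,m} as {0,m}

printf '%s\n' "{2}: $exactly_two" "{2,3}: $two_to_three" \
    "{2,}: $two_or_more" "{0,2}: $at_most_two"
```

The anchors matter: without ^ and $, a line of four a characters would still contain a run of two and be counted by every pattern.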

What Are Regular Expressions? Simply put, regular expressions are symbolic notations used to identify patterns in text. In some ways, they resemble the shell's wildcard method of matching file and pathnames but on a much grander scale. Regular expressions are supported by many command line tools and by most programming languages to facilitate the solution of text manipulation problems. However, to further confuse things, not all regular expressions are the same; they vary slightly from tool to tool and from programming language to language. For our discussion, we will limit ourselves to regular expressions as described in the __ standard (which will cover most of the command line tools), as opposed to many programming languages (most notably Perl), which use slightly larger and richer sets of notations.

POSIX

less and vim both share the same method of searching for text:

Pressing the / key followed by a regular expression will perform a search.

As the popularity of Unix spread beyond the United States, there grew a need to support characters not found in US English. The ASCII table was expanded to use a full eight bits, adding characters 128-255, which accommodated many more languages. To support this capability, the POSIX standards introduced a concept called a locale, which could be adjusted to select the character set needed for a particular location. We can see the language setting of our system using the following command:

"[me@linuxbox ~]$ echo $LANG" Result: en_US.UTF-8

A set may contain any number of characters, and metacharacters lose their special meaning when placed within brackets. However, there are two cases in which metacharacters are used within bracket expressions and have different meanings: (and identify what they do)

The first is the caret (^), which is used to indicate negation; the second is the dash (-), which is used to indicate a character range.

The find command supports a test based on a regular expression. There is an important consideration to keep in mind when using regular expressions in find versus grep.:

Whereas grep will print a line when the line contains a string that matches an expression, find requires that the pathname exactly match the regular expression.
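
A sketch of the whole-pathname requirement, run in a throwaway directory with two invented file names. It assumes a find that supports the -regex test (GNU and BSD find do; it is not part of POSIX) and that mktemp -d is available.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/good_name.txt" "$tmpdir/bad name.txt"

# Pattern alone: it would have to equal the entire pathname, so nothing matches.
exact_only=$(cd "$tmpdir" && find . -regex '[^-_./0-9a-zA-Z]')

# Padded with .* on both sides: the offending character may sit anywhere.
padded=$(cd "$tmpdir" && find . -regex '.*[^-_./0-9a-zA-Z].*')

rm -r "$tmpdir"

printf 'without .*: [%s]\n' "$exact_only"
printf 'with .*:    [%s]\n' "$padded"
```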

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: (6)

^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: ( ) { } ? + |

?: This quantifier means, in effect, "make the preceding element optional." Let's say we wanted to check a phone number for validity and we considered a phone number to be valid if it matched either of these two forms, where n is a numeral:
- (nnn) nnn-nnnn
- nnn nnn-nnnn
We could construct a regular expression like this:

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

What this matches: grep -h '[^bg]zip': we get a list of files that contain the string zip preceded by any character except b or g. Notice that the file zip was not found. A negated character set still requires

a character at the given position, but the character must not be a member of the negated set.

The caret (^) and dollar sign ($) are treated as __ in regular expressions. This means they cause the match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($).

anchors

BRE stands for

basic regular expressions (BRE)

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds:

basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: ^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: ( ) { } ? + |

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: ^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: ( ) { } ? + | However (and this is the fun part), the (, ), {, and } characters are treated as metacharacters in BRE if they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash

causes it to be treated as a literal.

Because the features we are going to discuss next are part of ERE, we are going to need to use a different grep. Traditionally, this has been performed by the __ program, but the GNU version of grep also supports extended regular expressions when the -E option is used.

egrep

The { and } metacharacters are used to

express minimum and maximum numbers of required matches.

ERE stands for

extended regular expressions (ERE)

grep command to find any word in list of words whose third letter is j, last letter is r, and is 5 letters long:

grep '^..j.r$' <file>

regular expression grep command to match either string "AAA" or "BBB"

grep -E 'AAA|BBB'

regular expression grep command to match the filenames in dirlist*.txt that start with either bz, gz, or zip.

grep -E '^(bz|gz|zip)' dirlist*.txt
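
A sketch of why the parentheses matter, using an invented list of program names in place of the dirlist files. Without the group, the caret applies only to the first alternative.

```shell
names=$(printf '%s\n' bzip2 gzip zipinfo unzip bunzip2)

# Grouped: the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)' | paste -s -d ' ' -)

# Ungrouped: "starts with bz" OR "contains gz" OR "contains zip".
ungrouped_count=$(printf '%s\n' "$names" | grep -Ec '^bz|gz|zip')

printf 'grouped matches:   %s\n' "$grouped"
printf 'ungrouped matches: %s lines\n' "$ungrouped_count"
```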

grep command to search for files with text "bzip" in the dirlist*.txt files and return only filenames that did not contain a match:

grep -L bzip dirlist*.txt

Regular expression that matches all filenames starting with letters and numbers (simple way):

grep -h '^[A-Za-z0-9]' dirlist*.txt

grep command to search for files containing "zip" where zip occurs at both the beginning and the end of the line and suppress output of filenames

grep -h '^zip$' dirlist*.txt

grep command to search for files containing "zip" at beginning of line and suppress output of filenames

grep -h '^zip' dirlist*.txt

grep command to search for files containing "zip" at end of line and suppress output of filenames

grep -h 'zip$' dirlist*.txt

The grep program accepts options and arguments this way, where regex is a regular expression: (show syntax)

grep [<options>] <regex> [<file>...]

characters in brackets: The caret character invokes negation only if

it is the first character within a bracket expression; otherwise, it loses its special meaning and becomes an ordinary character in the set.
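
A sketch of the two positions of the caret, using three invented one-character lines.

```shell
# Caret first: negation, so match any character other than a or b.
negated=$(printf '%s\n' 'a' '^' 'b' | grep '[^ab]')

# Caret later: an ordinary member of the set a, b, ^.
literal_count=$(printf '%s\n' 'a' '^' 'b' | grep -c '[ab^]')

printf 'caret first selects: %s\n' "$negated"
printf 'caret later selects: %s lines\n' "$literal_count"
```

With the caret first, only the line holding the caret character itself is selected; with the caret later, all three lines are.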

The + metacharacter works much like the *, except

it requires at least one instance of the preceding element to cause a match.
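
A minimal sketch for trying the + metacharacter on several inputs at once, using the alphabetic-groups expression from this chapter. The helper name matches_words is hypothetical.

```shell
# Hypothetical helper: succeeds when the line consists of groups of
# alphabetic characters separated by single spaces.
matches_words() {
    printf '%s\n' "$1" | grep -Eq '^([[:alpha:]]+ ?)+$'
}

for line in "This that" "a b c" "a b 9" "abc  d"; do
    if matches_words "$line"; then
        printf 'match:    [%s]\n' "$line"
    else
        printf 'no match: [%s]\n' "$line"
    fi
done
```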

To see the locale settings, use the __ command.

locale

locate command to search for pathnames that contain either bin/bz, bin/gz, or bin/zip:

locate --regex 'bin/(bz|gz|zip)'

The first metacharacter we will look at is the dot (.) or period character, which is used to

match any character.

While it might not seem apparent, our grep searches have been using regular expressions all along, albeit very simple ones. The regular expression bzip is taken to mean that a match will occur only if the line in the file contains at least four characters and that somewhere in the line the characters b, z, i, and p are found in that order, with no other characters in between. The characters in the string bzip are all literal characters, in that they match themselves. In addition to literals, regular expressions may also include __ that are used to specify more complex matches. Regular expression __ consist of the following: ^ $ . [ ] { } - ? * + ( ) | \

metacharacters

The main program we will use to work with regular expressions is our old pal grep. The name grep is actually derived from the phrase "global regular expression print," so we can see that grep has something to do with regular expressions. In essence, grep searches text files for text matching a specified

regular expression and outputs any line containing a match to standard output.

Like the ? metacharacter, the * is used to denote an optional item; however, unlike the ?,

the item may occur any number of times, not just once.
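
A sketch contrasting the two quantifiers, assuming grep -E. The sample words are invented for illustration and are not from the book.

```shell
words=$(printf '%s\n' color colour colouur)

# u? allows zero or one u.
optional_once=$(printf '%s\n' "$words" | grep -E 'colou?r' | paste -s -d ' ' -)

# u* allows zero or more u.
any_number=$(printf '%s\n' "$words" | grep -E 'colou*r' | paste -s -d ' ' -)

printf 'u? matches: %s\n' "$optional_once"
printf 'u* matches: %s\n' "$any_number"
```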

If the first character in a bracket expression is a caret (^),

the remaining characters are taken to be a set of characters that must not be present at the given character position.

Just when we thought this couldn't get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program. What's the difference between BRE and ERE? It's a matter of metacharacters. With BRE, the following metacharacters are recognized: ^ $ . [ ] * All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added: ( ) { } ? + | However (and this is the fun part), the (, ), {, and } characters are treated as metacharacters in BRE if

they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash causes it to be treated as a literal.

NOTE: As we can see, many of the regular expression metacharacters are also characters that have meaning to the shell when expansion is performed. When we pass regular expressions containing metacharacters on the command line, it is vital that

they be enclosed in quotes to prevent the shell from attempting to expand them.
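
A sketch of what can go wrong without quotes. It runs in a throwaway directory that contains a single file named bzip (an invented setup), so the shell has something to expand the bracket expression into; mktemp -d is assumed to be available.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/bzip"

# Unquoted: the shell treats [bg]zip as a wildcard and replaces it
# with the matching filename before the command ever sees it.
unquoted=$(cd "$tmpdir" && echo [bg]zip)

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$tmpdir" && echo '[bg]zip')

rm -r "$tmpdir"

printf 'unquoted argument: %s\n' "$unquoted"
printf 'quoted argument:   %s\n' "$quoted"
```

echo stands in for grep here only to make the received argument visible.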

The caret (^) and dollar sign ($) are treated as anchors in regular expressions. This means that (what they do)

they cause the match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($).
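
A sketch of the three anchored forms, using an invented list of names in place of the dirlist files.

```shell
names=$(printf '%s\n' zip zipinfo gunzip bzip2)

at_start=$(printf '%s\n' "$names" | grep '^zip' | paste -s -d ' ' -)
at_end=$(printf '%s\n' "$names" | grep 'zip$' | paste -s -d ' ' -)
whole_line=$(printf '%s\n' "$names" | grep '^zip$' | paste -s -d ' ' -)

printf '^zip  matches: %s\n' "$at_start"
printf 'zip$  matches: %s\n' "$at_end"
printf '^zip$ matches: %s\n' "$whole_line"
```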

In this example, using a two-character set, we match any line that contains the string bzip or gzip: [me@linuxbox ~]$ grep -h '[bg]zip' dirlist*.txt bzip2 bzip2recover gzip. What happens when metacharacters are placed within brackets?

they lose their special meaning

What this matches: grep -h '[^bg]zip'

we get a list of files that contain the string zip preceded by any character except b or g. Notice that the file zip was not found. A negated character set still requires a character at the given position, but the character must not be a member of the negated set.
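
A sketch of the negated set on an invented list of names. Only a name in which zip is preceded by some character other than b or g is selected.

```shell
names=$(printf '%s\n' bzip2 gzip gunzip zip zipinfo)

# zip and zipinfo have no character before "zip", so they cannot match.
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')

printf 'selected: %s\n' "$negated"
```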

What Are Regular Expressions?

Simply put, regular expressions are symbolic notations used to identify patterns in text. In some ways, they resemble the shell's wildcard method of matching file and pathnames but on a much grander scale. Regular expressions are supported by many command line tools and by most programming languages to facilitate the solution of text manipulation problems. However, to further confuse things, not all regular expressions are the same; they vary slightly from tool to tool and from programming language to language. For our discussion, we will limit ourselves to regular expressions as described in the POSIX standard (which will cover most of the command line tools), as opposed to many programming languages (most notably Perl), which use slightly larger and richer sets of notations.

