Powerful Linux Text Processing Commands
cat
The power of "cat" is not just printing a file to the screen: it concatenates the contents of a list of files and streams the result through a pipe as input to another program.
cat * | sort
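Here is a small sketch of that idea. The scratch directory and the file names part1.txt and part2.txt are made up for illustration; the point is only that sort receives both files as one stream.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
printf 'pear\napple\n' > "$workdir/part1.txt"
printf 'cherry\nbanana\n' > "$workdir/part2.txt"

# cat joins both files into a single stream; sort sees one combined input.
merged_sorted=$(cat "$workdir"/part*.txt | sort)
printf '%s\n' "$merged_sorted"

rm -rf "$workdir"
```

The four fruit names come out in alphabetical order even though they started in two separate files.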
find
The power of find is listing the filenames that match criteria based on file metadata such as type, size, or modification time.
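A minimal sketch of a few metadata tests, assuming a throwaway directory with made-up file names (notes.txt, logs/big.log). The tests used here (-type, -name, -size, -mtime) are common ones; your own criteria will differ.

```shell
# Hypothetical scratch tree, for illustration only.
workdir=$(mktemp -d)
mkdir "$workdir/logs"
printf 'small\n' > "$workdir/notes.txt"
head -c 4096 /dev/zero > "$workdir/logs/big.log"

# Regular files whose name ends in .log
log_files=$(find "$workdir" -type f -name '*.log')

# Regular files larger than 1 KiB (sizes are rounded up to whole KiB units)
large_files=$(find "$workdir" -type f -size +1k)

# Regular files modified within the last 24 hours
recent_files=$(find "$workdir" -type f -mtime -1 | sort)

printf '%s\n' "$log_files" "$large_files" "$recent_files"
rm -rf "$workdir"
```

Because find only prints names, its output is a natural input for the other commands in this post.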
grep
"grep" lists the lines (or, with -l, just the files) whose content matches a regular expression pattern. You can use it as a content search across the files in your file system.
grep -R --color -n -P abc *
option:
- --color (highlight the matching part of the content with color)
- -n (show line numbers)
- -P (interpret the pattern as a Perl-compatible regular expression; GNU grep only)
- -R (recursively)
- -l (only list out the filenames that match the pattern)
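To see how -R, -n and -l change the output, here is a rough sketch on a made-up directory. The file names and contents are assumptions for illustration, and the results are piped through sort only to make the order predictable.

```shell
# Hypothetical scratch tree, for illustration only.
workdir=$(mktemp -d)
mkdir "$workdir/src"
printf 'alpha\nabc here\n' > "$workdir/one.txt"
printf 'nothing\n' > "$workdir/two.txt"
printf 'x\ny\nabc again\n' > "$workdir/src/three.txt"

# -R searches recursively, -n adds line numbers: output is file:line:content
matches=$(cd "$workdir" && grep -R -n 'abc' . | sort)

# -l prints only the names of the files that contain a match
names=$(cd "$workdir" && grep -R -l 'abc' . | sort)

printf '%s\n' "$matches" "$names"
rm -rf "$workdir"
```

Note that two.txt never shows up, because none of its lines match.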
cut
"cut" extracts sections from each line of input. The command below extracts the 5th field onward from each line of fileA, using a colon as the delimiter. The - sign after the 5 means "output the 5th field and the rest of the line". If you put the - sign before the number instead (-f -5), you get everything from the beginning of the line through that field.
cut -d : -f 5- fileA
option:
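- -c (select by character position)
- -b (select by byte position)
- -f 5 (select by field, if the line can be broken down by a delimiter)
- -d '|' (delimiter is the pipe character; quote it so the shell does not treat it as a pipe)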
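The difference between 5, 5- and -3 is easier to see on a concrete line. This sketch uses a made-up colon-delimited record in the style of /etc/passwd; the record itself is an assumption for illustration.

```shell
# Hypothetical passwd-style record, for illustration only.
record='root:x:0:0:System Administrator:/root:/bin/bash'

# Field 5 only
fifth=$(printf '%s\n' "$record" | cut -d : -f 5)

# Field 5 through the end of the line
from_fifth=$(printf '%s\n' "$record" | cut -d : -f 5-)

# Start of the line through field 3
up_to_third=$(printf '%s\n' "$record" | cut -d : -f -3)

# Characters 1 to 4, no delimiter involved
first_chars=$(printf '%s\n' "$record" | cut -c 1-4)

printf '%s\n' "$fifth" "$from_fifth" "$up_to_third" "$first_chars"
# Prints:
# System Administrator
# System Administrator:/root:/bin/bash
# root:x:0
# root
```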
sort
The sort command sorts a file according to fields, the individual pieces of data on each line. By default, sort assumes that the fields are just words separated by blanks, but you can specify an alternative field delimiter if you want (such as commas or colons). Output from sort is printed to the screen, unless you redirect it to a file.
donors.data
Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia
Let's take this sample donors file and sort it by last name. The following shows the command to sort the file on the second field (last name) and the output from the command. (Older documentation writes this key as +1 -2; that form is obsolete, so use -k instead.)
sort -k 2,2 donors.data
Jack Arta 250000 Indonesia
Bay Ching 500000 China
Cruella Lumper 725000 Malaysia
If the file is delimited by something other than blanks, use -t to tell sort the delimiter (for example, -t, for commas or -t: for colons). You can add -u to keep only one line for each distinct sort key. To sort the colon-delimited file company.data on the second field, use this command:
sort -t: -k 2,2 company.data
Nasium, Jim:031762:Marketing
Jucacion, Ed:396082:Sales
Itorre, Jan:406378:Sales
Ancholie, Mel:636496:Research
To sort the file on the third field (department name) and suppress the duplicates, use this command:
sort -t: -u -k 3,3 company.data
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Itorre, Jan:406378:Sales
Note that the line for Ed Jucacion did not print, because he's in Sales, and we asked the command (with the -u flag) to suppress lines that were the same in the sort field.
option:
- -f Ignore case when comparing, as if every line were uppercased first (so "Bill" and "bill" are treated the same).
- -r Sort in reverse order (so “Z” starts the list instead of “A”).
- -n Sort a column in numerical order
- -tx Use x as the field delimiter (replace x with a comma or other character).
- -u Suppress all but one line in each set of lines with equal sort fields (so if you sort on a field containing last names, only one “Smith” will appear even if there are several).
- -k m[,n] Specify the sort key: start at the mth field and end at the nth field (if ,n is omitted, the key runs to the end of the line). The older +m -n form, which counted fields from zero (+1 -2 is the same as -k 2,2), is obsolete.
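Here is a short sketch that combines a key with the -n and -r options, using the same three donor lines as above. It sorts once by donation amount (largest first) and once by last name.

```shell
# The donors sample from above, held in a variable instead of a file.
donors='Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia'

# Key is field 3 only; n = numeric comparison, r = reverse order
by_amount_desc=$(printf '%s\n' "$donors" | sort -k 3,3nr)

# Key is field 2 only (last name), default text comparison
by_last_name=$(printf '%s\n' "$donors" | sort -k 2,2)

printf '%s\n' "$by_amount_desc"
# Prints:
# Cruella Lumper 725000 Malaysia
# Bay Ching 500000 China
# Jack Arta 250000 Indonesia
```

Without the n modifier the amounts would be compared as text, which only happens to work here because all three numbers have the same number of digits.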
uniq
uniq works at the line level. It prints the unique lines in a sorted file, retaining only one of a run of matching lines. Optionally, it can show only lines that appear exactly once, or only lines that appear more than once. uniq requires sorted input because it compares only consecutive lines.
option:
- -u (print only the unique lines, i.e. lines that appear exactly once)
- -d (print only the duplicated lines, i.e. lines that appear more than once)
- -c (prefix each line with its number of occurrences)
bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.
bash$ uniq -c testfile
1 This line occurs only once.
2 This line occurs twice.
3 This line occurs three times.
bash$ sort testfile | uniq -c | sort -nr
3 This line occurs three times.
2 This line occurs twice.
1 This line occurs only once.
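The -u and -d options can be tried the same way. This sketch deliberately shuffles the same six lines to show why the input goes through sort first; the shuffled order is made up for illustration.

```shell
# Same six lines as testfile, deliberately out of order.
lines='This line occurs twice.
This line occurs three times.
This line occurs only once.
This line occurs three times.
This line occurs twice.
This line occurs three times.'

# Lines that appear exactly once
only_once=$(printf '%s\n' "$lines" | sort | uniq -u)

# One copy of each line that appears more than once
repeated=$(printf '%s\n' "$lines" | sort | uniq -d)

printf '%s\n' "$only_once" "$repeated"
# Prints:
# This line occurs only once.
# This line occurs three times.
# This line occurs twice.
```

If you drop the sort, no two equal lines are adjacent in this input, so uniq -d prints nothing and uniq -u prints all six lines.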
wc
wc stands for word count. Besides counting words, it can report the following:
- wc -w gives only the word count.
- wc -l gives only the line count.
- wc -c gives only the byte count.
- wc -m gives only the character count.
- wc -L gives only the length of the longest line.
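A quick sketch of the three most common counts on a made-up two-line file (sample.txt is a hypothetical name). Reading from standard input with < keeps wc from printing the file name next to the number.

```shell
# Hypothetical scratch file, for illustration only.
workdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$workdir/sample.txt"

line_count=$(wc -l < "$workdir/sample.txt")   # 2 lines
word_count=$(wc -w < "$workdir/sample.txt")   # 5 words
byte_count=$(wc -c < "$workdir/sample.txt")   # 24 bytes, newlines included

printf 'lines=%s words=%s bytes=%s\n' "$line_count" "$word_count" "$byte_count"
rm -rf "$workdir"
```

Some wc implementations pad the number with leading spaces, so treat the result as a number rather than comparing it as a string.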
tr
"tr" translates or deletes characters. It is often used for data cleaning jobs. It works on individual characters, not on patterns.
tr '[:lower:]' '[:upper:]'
The above command converts all lowercase letters to uppercase.
tr '.' '/'
The above converts every . character to /. The -d option deletes characters instead of translating them, so you leave it off for a plain translation. You may be asking when you would want to delete. Here is a common use case: converting a Windows-format file to a Unix-format file.
tr -d '\r' < input_dos_file.txt > output_unix_file.txt
option:
- -s (squeeze repeated characters into one character, e.g. tr -s '\n')
- -d (delete characters, e.g. tr -d '\000')
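The sketch below runs each of these tr forms on a short made-up string, so you can see the effect without needing any input files.

```shell
# Translate: lowercase to uppercase
upper=$(printf 'hello world\n' | tr '[:lower:]' '[:upper:]')

# Translate: every dot becomes a slash
path=$(printf 'org.apache.lucene\n' | tr '.' '/')

# Squeeze: runs of spaces collapse to a single space
squeezed=$(printf 'a    b   c\n' | tr -s ' ')

# Delete: strip carriage returns from Windows-style line endings
unix_text=$(printf 'line1\r\nline2\r\n' | tr -d '\r')

printf '%s\n' "$upper" "$path" "$squeezed" "$unix_text"
# Prints:
# HELLO WORLD
# org/apache/lucene
# a b c
# line1
# line2
```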
sed
"tr" can do character replacement. But if you want to do pattern replacement, you need to use sed. Usage: sed -e 's/pattern/replacement/flags'
sed -e 's/one/another/'
sed -e 's/[aeiou]/_/g'
Note the use of the g flag, which applies the replacement to every match on a line instead of just the first one.
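To make the effect of the g flag concrete, here is a small sketch that runs both substitutions on made-up input strings.

```shell
# Without g: only the first match on the line is replaced
first_only=$(printf 'one one one\n' | sed -e 's/one/another/')

# With g: every vowel on the line is replaced
every_vowel=$(printf 'hello world\n' | sed -e 's/[aeiou]/_/g')

printf '%s\n' "$first_only" "$every_vowel"
# Prints:
# another one one
# h_ll_ w_rld
```

Quoting the sed expression keeps the shell from interpreting characters such as [ and ] before sed sees them.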
awk
Put them all together
cat * | grep lucene-core | cut -f2 -d' ' | uniq | tr '.' '/' | awk '{printf "%s.class\n", $1}'
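To follow the pipeline step by step, here is a sketch that runs it on made-up input. The input format (an artifact name, a space, then a fully qualified class name) and the file names deps1.txt and deps2.txt are assumptions for illustration; the original does not show its data. The final awk step uses printf to append .class to the first field of each line.

```shell
# Hypothetical input files, for illustration only.
workdir=$(mktemp -d)
printf 'lucene-core org.apache.lucene.index.IndexWriter\nlucene-core org.apache.lucene.index.IndexWriter\n' > "$workdir/deps1.txt"
printf 'commons-io org.apache.commons.io.IOUtils\nlucene-core org.apache.lucene.search.Query\n' > "$workdir/deps2.txt"

class_files=$(cat "$workdir"/*.txt |
  grep lucene-core |                # keep only the lucene-core lines
  cut -f2 -d' ' |                   # take the class name (2nd field)
  uniq |                            # drop adjacent duplicates
  tr '.' '/' |                      # package dots become path slashes
  awk '{printf "%s.class\n", $1}')  # append the .class suffix

printf '%s\n' "$class_files"
# Prints:
# org/apache/lucene/index/IndexWriter.class
# org/apache/lucene/search/Query.class
rm -rf "$workdir"
```

Because uniq only compares consecutive lines, adding a sort before it would be needed if duplicate class names could appear far apart in the input.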





































