sort, uniq, wc, cut: The Small Text Utilities

Once you know grep, sed, and awk, the next layer of shell text-processing is four small utilities that you compose with everything else: sort, uniq, wc, and cut. They are simple, predictable, and indispensable.

sort

sort file.txt              # alphabetical
sort -r file.txt           # reverse
sort -n numbers.txt        # numeric (10 comes after 9, not before 2)
sort -h sizes.txt          # human-readable sizes (10K, 5M, 1G)
sort -u file.txt           # sort AND remove duplicates
sort -k 2 file.txt         # sort by second column
sort -k 2,2n file.txt      # sort by 2nd column, numerically
sort -t',' -k 3 data.csv   # for CSV: -t sets the separator
sort -R file.txt           # random order (identical lines stay together; shuf gives a true shuffle)
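
The difference between -k 2 and -k 2,2 is easy to miss: -k 2 starts the key at field 2 but runs it to the end of the line, and compares as text unless you add n. A quick demonstration with throwaway data:

printf 'bob 10\nann 9\ncal 100\n' | sort -k 2     # ann 9 sorts LAST ("10" < "100" < "9" as text)
printf 'bob 10\nann 9\ncal 100\n' | sort -k 2,2n  # ann 9 sorts first (field 2 compared as numbers)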

Real examples

# Sort directories by size, biggest first
du -sh */ | sort -hr

# Top 10 largest files in current tree
find . -type f -exec du -h {} + | sort -hr | head -10

# Sort by file extension
ls | sort -t. -k2

uniq

uniq only removes ADJACENT duplicates. Almost always pair with sort first.

sort file | uniq                   # remove duplicates
sort file | uniq -c                # count occurrences
sort file | uniq -c | sort -rn     # sort by count, biggest first
sort file | uniq -d                # show only duplicated lines
sort file | uniq -u                # show only unique (non-duplicated) lines
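
To see the adjacency rule bite, here's a throwaway demonstration:

printf 'a\nb\na\n' | uniq              # prints a, b, a: the two a's aren't adjacent, so both survive
printf 'a\nb\na\n' | sort | uniq       # prints a, b: sort brings the duplicates together first
printf 'a\nb\na\n' | sort | uniq -c    # prints "2 a", "1 b": counts prefixed to each line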

The all-time most useful pattern

sort | uniq -c | sort -rn | head

That four-step pipeline gives you “top N by frequency” — top IPs hitting your server, top words in a file, top error codes in logs, top anything.

# Top 5 IPs in nginx log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5

# Top extensions in a directory tree
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn | head

wc (word count)

wc file.txt          # lines, words, bytes
wc -l file.txt       # just line count
wc -w file.txt       # word count
wc -c file.txt       # byte count
wc -m file.txt       # character count (matters for multi-byte UTF-8)
wc -l *.py           # lines per file plus total
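
The -c/-m distinction only shows up with non-ASCII text. For example, in a UTF-8 locale:

echo "héllo" | wc -c    # 7: é is two bytes in UTF-8, plus the trailing newline
echo "héllo" | wc -m    # 6: five characters plus the trailing newline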

ls | wc -l           # how many files in current dir
ps aux | wc -l       # how many processes (minus 1 for header)
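
If the header off-by-one bothers you, tail -n +2 (tail is covered below) starts output at line 2 and drops it exactly:

ps aux | tail -n +2 | wc -l    # process count without the header line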

cut

cut slices columns out of structured text. Faster and simpler than awk for column extraction (though less flexible).

# By character position
cut -c 1-10 file.txt          # first 10 chars per line
cut -c 5- file.txt            # from char 5 to end

# By field (default delimiter is TAB)
cut -f 1,3 data.tsv           # 1st and 3rd field

# Custom delimiter
cut -d',' -f 2 data.csv       # 2nd field of CSV
cut -d':' -f 1 /etc/passwd    # usernames
cut -d':' -f 1,7 /etc/passwd  # username + shell
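
cut plugs into the earlier top-N pattern just like awk does. For example, counting which login shell each account uses:

cut -d':' -f 7 /etc/passwd | sort | uniq -c | sort -rn   # accounts per shell, most common first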

Other useful small utilities

tr — translate characters

echo "hello" | tr 'a-z' 'A-Z'         # uppercase
echo "hello" | tr -d 'l'              # delete characters
echo "a,b,c" | tr ',' 'n'            # CSV to lines
cat file | tr -s ' '                   # squeeze repeated spaces

head and tail

head file.txt                  # first 10 lines
head -20 file.txt              # first 20
head -n -5 file.txt            # everything EXCEPT last 5
tail file.txt                  # last 10 lines
tail -20 file.txt              # last 20
tail -f /var/log/syslog        # follow as new lines arrive
tail -F /var/log/app.log       # follow + handle log rotation
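
Chained together, head and tail pull out an arbitrary line range, e.g. lines 20 through 30:

head -n 30 file.txt | tail -n 11    # lines 20-30: the last 11 of the first 30
tail -n +20 file.txt | head -n 11   # same range, starting from line 20 instead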

tee — write to file AND screen

ls | tee out.txt | wc -l           # save AND continue piping
echo "line" | sudo tee -a /etc/hosts   # append with sudo

Real one-liner challenges

# Most common word in a text file (length > 3)
cat file.txt | tr -s ' \n\t' '\n' | tr -d '[:punct:]' |
  awk 'length>3' | sort | uniq -c | sort -rn | head -1

# Count unique users currently logged in
who | cut -d' ' -f1 | sort -u | wc -l

# How much disk does each user use in /home
du -sh /home/* | sort -hr

# Find all running Python processes and their memory
# (the [p]ython trick stops grep from matching its own process)
ps aux | grep '[p]ython' | awk '{print $4, $11}' | sort -rn | head

Common mistakes

  • uniq without sort first — only removes adjacent duplicates.
  • cut with default tab delimiter on space-separated data — use -d' ' (which still splits on every single space; see the sketch below) or use awk instead.
  • wc -l off-by-one — wc -l counts newline characters, so a file without a trailing newline reports one line fewer than you expect.
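
A quick sketch of the second pitfall with throwaway input (two spaces between a and b):

echo "a  b" | cut -d' ' -f2               # prints an empty field: every single space starts a new field
echo "a  b" | awk '{print $2}'            # prints b: awk collapses runs of whitespace
echo "a  b" | tr -s ' ' | cut -d' ' -f2   # prints b: squeezing the spaces first also works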

What to learn next

That covers the text processing pipeline. Next big section: managing processes — finding what’s running, sending it signals, killing the runaway ones.
