sort, uniq, wc, cut: The Small Text Utilities
Once you know grep, sed, and awk, the next layer of shell text processing is four small utilities you compose with everything else: sort, uniq, wc, and cut. They are simple, predictable, and indispensable.
sort
sort file.txt # alphabetical
sort -r file.txt # reverse
sort -n numbers.txt # numeric (10 comes after 9, not before 2)
sort -h sizes.txt # human-readable sizes (10K, 5M, 1G)
sort -u file.txt # sort AND remove duplicates
sort -k 2 file.txt # sort by second column
sort -k 2,2n file.txt # sort by 2nd column, numerically
sort -t',' -k 3 data.csv # for CSV: -t sets the separator
sort -R file.txt # random shuffle (GNU; groups identical lines together, use shuf for a true shuffle)
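A quick way to internalize the lexical-vs-numeric and -k 2 vs -k 2,2 distinctions is a throwaway demo (the printf data below is made-up sample input):
# Lexical sort treats '10' as text, so it lands before '2'
printf '2\n10\n9\n' | sort       # 10, 2, 9
printf '2\n10\n9\n' | sort -n    # 2, 9, 10
# -k 2 compares from field 2 to the END of the line, as text;
# -k 2,2n limits the key to field 2 and compares numerically
printf 'a 2 x\nb 10 y\n' | sort -k 2      # b first ('1' < '2' as text)
printf 'a 2 x\nb 10 y\n' | sort -k 2,2n   # a first (2 < 10)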
Real examples
# Sort directories by size, biggest first
du -sh */ | sort -hr
# Top 10 largest files in current tree
find . -type f -exec du -h {} + | sort -hr | head -10
# Sort by file extension
ls | sort -t. -k2 # rough: misbehaves on names with several dots or none
uniq
uniq only removes ADJACENT duplicates. Almost always pair with sort first.
sort file | uniq # remove duplicates
sort file | uniq -c # count occurrences
sort file | uniq -c | sort -rn # sort by count, biggest first
sort file | uniq -d # show only duplicated lines
sort file | uniq -u # show only unique (non-duplicated) lines
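If the adjacency rule feels abstract, a throwaway demo makes it concrete:
# uniq only collapses neighbours
printf 'a\nb\na\n' | uniq          # a, b, a -- the second 'a' survives
printf 'a\nb\na\n' | sort | uniq   # a, b    -- sort made the duplicates adjacent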
The all-time most useful pattern
sort | uniq -c | sort -rn | head
That four-step pipeline gives you “top N by frequency” — top IPs hitting your server, top words in a file, top error codes in logs, top anything.
# Top 5 IPs in nginx log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5
# Top extensions in a directory tree
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn | head
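It helps to see the intermediate shape: uniq -c prefixes each line with its count, which is exactly what the second sort -rn keys on (the sample data here is made up):
printf 'err\nok\nerr\nerr\n' | sort | uniq -c | sort -rn
#   3 err
#   1 ok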
wc (word count)
wc file.txt # lines, words, characters
wc -l file.txt # just line count
wc -w file.txt # word count
wc -c file.txt # byte count
wc -m file.txt # character count (matters for multi-byte UTF-8)
wc -l *.py # lines per file plus total
ls | wc -l # how many files in current dir
ps aux | wc -l # how many processes (minus 1 for header)
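The -c vs -m difference only shows up with multi-byte input. A quick check, assuming a UTF-8 locale:
printf 'héllo' | wc -c   # 6 -- 'é' is two bytes in UTF-8
printf 'héllo' | wc -m   # 5 -- five characters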
cut
cut slices columns out of structured text. Faster and simpler than awk for column extraction (though less flexible).
# By character position
cut -c 1-10 file.txt # first 10 chars per line
cut -c 5- file.txt # from char 5 to end
# By field (default delimiter is TAB)
cut -f 1,3 data.tsv # 1st and 3rd field
# Custom delimiter
cut -d',' -f 2 data.csv # 2nd field of CSV
cut -d':' -f 1 /etc/passwd # usernames
cut -d':' -f 1,7 /etc/passwd # username + shell
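One gotcha to see once before the mistakes list at the end: cut treats every single delimiter as a field boundary, so runs of spaces produce empty fields. Squeezing with tr first is the usual workaround:
echo 'a   b   c' | cut -d' ' -f2               # prints nothing: field 2 is empty
echo 'a   b   c' | tr -s ' ' | cut -d' ' -f2   # b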
Other useful small utilities
tr — translate characters
echo "hello" | tr 'a-z' 'A-Z' # uppercase
echo "hello" | tr -d 'l' # delete characters
echo "a,b,c" | tr ',' 'n' # CSV to lines
cat file | tr -s ' ' # squeeze repeated spaces
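tr can translate and squeeze in one pass, which is the trick behind the word-frequency one-liner further down:
# Turn runs of spaces into single newlines: one word per line
echo 'one   two  three' | tr -s ' ' '\n'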
head and tail
head file.txt # first 10 lines
head -20 file.txt # first 20
head -n -5 file.txt # everything EXCEPT last 5
tail file.txt # last 10 lines
tail -20 file.txt # last 20
tail -f /var/log/syslog # follow as new lines arrive
tail -F /var/log/app.log # follow + handle log rotation
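head and tail compose, which is handy for grabbing an arbitrary slice of a file (file.txt is a stand-in here):
# Lines 20-30: take the first 30, then the last 11 of those
head -30 file.txt | tail -11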
tee — write to file AND screen
ls | tee out.txt | wc -l # save AND continue piping
echo "line" | sudo tee -a /etc/hosts # append with sudo
Real one-liner challenges
# Most common word in a text file (length > 3)
tr -s ' \t\n' '\n' < file.txt | tr -d '[:punct:]' |
awk 'length>3' | sort | uniq -c | sort -rn | head -1
# Count unique users currently logged in
who | cut -d' ' -f1 | sort -u | wc -l
# How much disk does each user use in /home
du -sh /home/* | sort -hr
# Find all running Python processes and their memory ('[p]ython' keeps grep itself out of the match)
ps aux | grep '[p]ython' | awk '{print $4, $11}' | sort -rn | head
Common mistakes
- uniq without sort first: it only removes adjacent duplicates.
- cut with the default tab delimiter on space-separated data: use -d' ' or reach for awk instead.
- wc -l off-by-one: files without a trailing newline can show one less line than you expect.
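That last one is easy to demonstrate once you know that wc -l counts newline characters, not lines:
printf 'one\ntwo' | wc -l   # prints 1 -- 'two' has no trailing newline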
What to learn next
That covers the text processing pipeline. Next big section: managing processes — finding what’s running, sending it signals, killing the runaway ones.