10x faster method to remove dup lines

Written on 2017 Apr 27 in notes awk sort text processing duplicate

Whenever I need to remove duplicate lines, I always use a combination of sort and uniq like this:

wc -l source.csv
10941092 source.csv

time sort -u source.csv | uniq > source_standarize.csv

However, today I learned a new trick using awk that is 10x faster:

time awk '!a[$0]++' source.csv


So it is 10x faster. Let’s see how this trick works:

awk '1' file

The pattern 1 evaluates to true, so awk prints every line.

awk '0' file

The pattern 0 evaluates to false, so awk prints nothing.
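To make the two cases concrete, here is a small sketch that feeds two made-up lines through both patterns. The sample input and the variable names `all` and `none` are only for illustration.

```shell
# A pattern with no action makes awk print the line when the pattern is true.
all=$(printf 'a\nb\n' | awk '1')    # pattern 1 is always true: both lines come out
none=$(printf 'a\nb\n' | awk '0')   # pattern 0 is always false: nothing comes out

echo "pattern 1 printed: [$all]"
echo "pattern 0 printed: [$none]"
```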

awk goes through the input line by line and evaluates the pattern for each line; if the result is true, it prints the line. So look at !a[$0]++:

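  • a is just a variable that we declare and use.
  • a[$0]: $0 is the whole line, so a is a hash with the line text as the key.
  • a[$0]++ is a post-increment: it returns the current value and then increases it by 1.
  • The first time a line appears, a[$0] is unset, which counts as 0, so a[$0]++ returns 0 (and stores 1). Hence !0 -> true.
  • The second time a line appears, a[$0]++ returns 1 (and stores 2), and the count keeps increasing on later occurrences. !1 -> false.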

Hence awk does not print the repeated line. So basically this is a giant hash map, keyed by line text, that remembers which lines have already been seen.
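The steps above can be spelled out in long form. The sketch below is an illustration, not a benchmark: the sample input and the names `seen`, `count`, `deduped`, `oneliner`, and `sorted` are made up. It also runs sort -u on the same input to show that both approaches keep the same set of lines, although awk keeps them in first-seen order while sort reorders them.

```shell
# Made-up sample input with repeated lines
input=$(printf 'b\na\nb\nc\na\n')

# Long-hand version of !a[$0]++: read the old count, bump it, print only on 0
deduped=$(printf '%s\n' "$input" | awk '{ count = seen[$0]++; if (count == 0) print }')

# The one-liner from this post
oneliner=$(printf '%s\n' "$input" | awk '!a[$0]++')

# sort -u for comparison (LC_ALL=C keeps the sort order predictable)
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

echo "awk long form:"; echo "$deduped"
echo "awk one-liner:"; echo "$oneliner"
echo "sort -u:";       echo "$sorted"
```

Because the hash holds every distinct line, memory use grows with the number of unique lines in the file.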

