How to find duplicates in a column

Duplicate values in a column can be found using csvcols, sort and csvfind. The basic approach works the same from the command line or in a Bash script.

Here’s an example Bash script that looks for duplicates in the second column of dups.csv (columns are counted from 1 rather than zero):

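    CSV_FILE="dups.csv"
    CSV_COL_NO="2"

    # Extract the column, reduce it to one line per distinct value,
    # then look up the rows matching each non-empty value.
    csvcols -i "$CSV_FILE" -col "$CSV_COL_NO" | sort -u | while read -r CELL; do
        if [ "$CELL" != "" ]; then
            # Print every row whose column matches the current value
            csvfind -i "$CSV_FILE" -trim-spaces -col "$CSV_COL_NO" "${CELL}"
        fi
    done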

This produces CSV output with duplicates grouped together; redirect it to a file to save the result as a new CSV file.
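
Because `sort -u` emits every distinct value once, the loop above also visits values that appear in only one row. To list just the values that occur more than once, a variation using only standard tools might look like the sketch below. It is not part of the csvcols/csvfind tool set: the helper name `dup_values` and the sample data are made up for illustration, and it assumes a simple CSV with no quoted commas or embedded newlines.

```shell
#!/bin/bash
# Sketch only: assumes plain comma-separated data with no quoted commas
# or embedded newlines. For real CSV files, csvcols is the safer way to
# extract the column.

# dup_values FILE COL (hypothetical helper)
# Prints each value that occurs more than once in column COL of FILE.
# Columns are counted from 1.
dup_values() {
    cut -d, -f"$2" "$1" |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |  # trim surrounding spaces
        grep -v '^$' |                                # skip empty cells
        sort |
        uniq -d                                       # keep repeated values only
}

# Sample data, invented for this example
CSV_FILE="$(mktemp)"
printf '%s\n' 'id,name' '1,apple' '2,pear' '3,apple' '4,' '5,pear' '6,plum' > "$CSV_FILE"

dup_values "$CSV_FILE" 2

rm -f "$CSV_FILE"
```

With the sample data this prints `apple` and `pear`, one per line. Each of those values could then be passed to csvfind, as in the loop above, to pull out the full rows.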