Removing Duplicate Rows in Excel
Today a buddy working at ICBC called me. He had an Excel file with some duplicate rows that needed deleting. Though I’m not a big fan of OA work, I had to help a friend. Accepted!
He sent me a CSV file with 45,000+ records, looking roughly like this:
BILLNO BUNDLE ACCEPTBANK HOLDER BUYTYPE AMOUNT
AA0100003443 2.00E+11 87647 87649 2 10000000
AA0100003444 2.00E+11 87647 87649 2 10000000
Some records had duplicate BILLNOs — a few hundred supposedly. Only one record per BILLNO should remain; the rest need deleting.
Seeing how neatly formatted the records were, my first thought was a relational database (what can I say — J2EE, UML, VC, DB are our bread-and-butter courses). I was already mentally reciting SQL. “Fortunately” I still had Access 2003 installed on my machine — not great, but good enough. Opened Access, imported the CSV. Thanks to Access’s powerful features, I got table a without any issues. Next step was easy — I thought of creating a new table with BILLNO as primary key to guarantee uniqueness. But then I realized doing this in Access isn’t straightforward: Access creates its own Integer index, ignoring mine. Inserting data and handling failures would be麻烦 (troublesome). So I abandoned that idea. Then I thought: just use SQL to select from one table into another. Created an empty table b with the same structure as a, and wrote:
INSERT INTO b
SELECT * FROM a
WHERE a.BILLNO in
(
SELECT max(a.BILLNO) FROM a GROUP BY a.BILLNO
)
The result shocked me. My P3 1GHz laptop hit 100% CPU immediately, and after 5 minutes there was no sign of stopping. Thinking about it, Access isn’t good at data processing. Without optimization, this SQL is essentially a double loop. Even optimized, the outer SQL still has to search through tens of thousands of records. Inefficient. My database attempt failed.
Then I thought: Office has VBA! Why go far when the solution’s right there? Opened the Visual Basic editor, dragged a Button onto a form (I taught my dad how to drag buttons on forms over winter break), double-clicked, and started coding. Thanks again to Office’s powerful help system — I quickly found how to reference cell data. First, sorted Excel by BILLNO. Then wrote some Visual Basic code I hadn’t touched in ages:
Private Sub CommandButton1_Click()
Dim i
For i = 2 To 47522
If Sheet1.Cells(i, 1).Value = Sheet1.Cells(i + 1, 1).Value Then
Sheet1.Cells(i, 1).Value = "AA"
End If
Next i
End Sub
Pressed F5, executed — all duplicate BILLNOs were replaced with “AA”, keeping only one per group. Then sorted by BILLNO again, manually selected all rows with “AA”, deleted them. Done!
Like all programmers (and prostitutes), I was intensely focused during work. After finishing, I felt incomparable pleasure… After the pleasure, I reflected on the lessons:
- Clients’ requirements are endlessly varied, but we must remember: for programmers, “nothing is impossible if you can think of it.” 2.惯性思维 (Habitual thinking) can be harmful, especially for中年 (middle-aged — haha, meaning experienced) programmers who fall into it unconsciously.
- Though I haven’t coded in a while, I haven’t gotten old yet…