Using Python to Cleanup Excel


Often you get a spreadsheet with some data that needs heavily filtered. You could try Applescript but let’s be honest. For text processing Python is so much easier. So here’s a Python script I use a lot with Excel. It’s pretty flexible following my scripting rules. It’s designed to be more a template that I then modify when the need arises.

First the code and then the comments.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
#!/usr/bin/python
 
## cleanexcel.py
## A quick and dirty template to modify for doing filtering of 
## Excel data.  This could esaily be modified for iWork.
 
import string, sys, time, shutil
from appscript import *
from osax import OSAX
 
tr = {  "FEDEX":"FedEx", \
        "USPS":"USPS", \
        "LOWE'S":"Lowe's", \
        "COSTCO":"Costco", \
    }
 
 
def leftword(data):
    # return the leftmost word and the location of the
    # word boundary
 
    endword = data.find(" ")
    return data[0:endword], endword
 
def rightword(data):
    # return the rightmost word and the location of the
    # word boundary
 
    startword = data.rfind(" ")
    return data[startword+1:], startword
 
def cleandata(data):
    d = data
    firstword, loc = leftword(d)
 
    if firstword[0] in string.digits:
        d = d[loc+1:]
 
        firstword, loc = leftword(d)
 
        if firstword[0] in string.digits:
            d = d[loc+1:]
 
    lastword, loc = rightword(d)
 
    if lastword[0] in string.digits: 
        d = d[:loc]
 
    return d
 
def filtercell(data):
    # iterate through the dictionary of conversions
    global tr
    for word, rep in tr.iteritems():
        if word in data:
            return rep
 
    # clear numbers on left & right
    d = cleandata( data )
    if d[0:5] == "P.O.S":
        d = d[16:]
 
    return d.title()
 
def handle_spreadsheet():
    MS = app('/Applications/Microsoft Office 2008/Microsoft Excel')
    col = "C"
    ncol = "D"
 
    for row in range(2,3283):
        # get the  field and return filtered data
        data = MS.cells[ col + str(row) ].value.get()
        fd = filtercell( data )
 
        # replace the data in excel
        MS.cells[ ncol + str(row) ].value.set(fd)
        #MS.cells[ ncol + str(row) ].value.set(data)
 
if __name__ == '__main__':
    if len(sys.argv) == 1:
        handle_spreadsheet()
 
 
 
    # change to 0 for success, 1 for (partial) failure
    sys.exit(0)

I run this from within TextMate so I can quickly modify it. The biggest suggestion, which may or may not work for your data, is to save the filtered text into a column beside the one you are filtering. That way you can do a sanity check by quickly scanning down and comparing the two. The column you are filtering is col while the column for the filtered text is ncol.

I found that for large selections that the Applescript for getting the selection bounds with the current version of Excel is a bit buggy. (It basically hangs – maybe it’s just taking a long time but I didn’t want to wait) So I explicitly set the rows to work on with the range() function.

I use Appscript in Python instead of Applescript the language. That part of the code is simple and hopefully self evident.

The key part of the code is the filtercell() function. There I do a variety of filtering. In my case it was simplifying some names by seeing if a particular name occurred in the text. To make the script easy to modify I have a global dictionary whose key is the word to find and whose value is what to change the data to. So, for example, if I find the word “FEDEX” in the cell then I will replace it with “FedEx”.

I then had a more robust function called cleandata() which just removed some numbers from the beginning and end of my strings. (This was some bank data I was filtering) I left that in just so you could see how I did it. But realistically you can just ignore it.

Yeah this is all pretty simple. But I suspect there are some folks out there who’ve not thought about doing this. I can’t imagine having to write this in Applescript proper. (I get nauseous just thinking about it)

Related posts:

  1. Excel and Python
  2. Sending Data to Windows
  3. Fixing Paragraphs
  4. Footnotes and MarsEdit
  5. Address Book Selection as Text on Clipboard
  6. Clean Up Text in Mars Edit
  7. Python & Appscript Tools
  8. SalesForce to FedEx
  1. #1 by Alex on 2010/01/21 - 5:50 pm

    Don’t forget there’s also the PyXL library for directly modifying Excel files using Python, so you don’t have to have the file open in Excel.

  2. #2 by clark on 2010/01/21 - 6:30 pm

    Yeah. I find I like using Applescript over modifying the file for various reasons. (For one I can modify it in place) If you were doing something quite regularly there would be advantages to using PyXL as I’m sure it’s faster than Applescripts. Excel in particular doesn’t handle some Applescripts in a terribly efficient fashion.

    The other advantage to using Applescript is that it’s fairly trivial to modify to work with say Numbers.

(will not be published)