Combining text files

In many real world situations, data is not only dirty, it's not even put together in a single file.
    Keywords

The use case

Using the relatively clean and straightforward baby names data from the Social Security Administration, this example shows several variations of how to combine text files in Bash and in Python, including how to add data on a per-file, per-line basis. This is a good time to not only get acquainted with the conveniences of Bash programming, but to practice using higher-level CSV libraries (even though they are overkill in this situation).

The routine

  1. Open a blank, new text file.

  2. Open an existing text file.

  3. Copy existing text file.

  4. Paste existing text file into new text file.

  5. Open another existing text file.

  6. Repeat.

Just cat

The Unix operating system has long had the simple cat program, which has no other ambition than to take a list of files given to it and then to combine, or concatenate their streams:

cat file1.txt file2.txt > all.txt 

This is actually good enough for many situations, and even the most skilled practitioner will first think of cat as a first resort.

Baby names by state

The U.S. Social Security Administration releases bulk data on baby names in two different zip files: by state and by year (nationwide). Each zip file contains dozens of text files totaling in the tens of megabytes.

The states file is pretty straightforward. The list of text files, one for each state, looks like this:

AK.TXT    HI.TXT    MI.TXT    NV.TXT    TX.TXT
AL.TXT    IA.TXT    MN.TXT    NY.TXT    UT.TXT
AR.TXT    ID.TXT    MO.TXT    OH.TXT    VA.TXT
AZ.TXT    IL.TXT    MS.TXT    OK.TXT    VT.TXT
CA.TXT    IN.TXT    MT.TXT    OR.TXT    WA.TXT
CO.TXT    KS.TXT    NC.TXT    PA.TXT    WI.TXT
CT.TXT    KY.TXT    ND.TXT    RI.TXT    WS_FTP.LOG
DC.TXT    LA.TXT    NE.TXT    SC.TXT    WV.TXT
DE.TXT    MA.TXT    NH.TXT    SD.TXT    WY.TXT
FL.TXT    MD.TXT    NJ.TXT    StateReadMe.pdf
GA.TXT    ME.TXT    NM.TXT    TN.TXT

Each text file is headerless. Each row has this comma-separated value format:

state abbreviation, sex, year, name, count

e.g.

AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10
AK,F,1910,Margaret,8
AK,F,1910,Helen,7

Bash cat

Because the files are all the same format, we just have to paste them one-after-another into a combined file, not having to worry about deleting the redundant headers.

Assuming you’re currently in the directory containing the unzipped text files, here’s how to combine all the state files (i.e. files that end with .TXT) into a file named allstates.csv

cat *.TXT > allstates.csv

Adding a header line

That was just too easy. Of course, we want column headers in the combined allstates.csv, so that other programs (such as Excel, or a SQL database) or, generally, anyone who wants to use allstates.csv, knows what the columns are.

We can do this manually with echo:

echo "state,sex,year,name,count" > allstates.csv
cat *.TXT >> allstates.csv

Bash for-loop

Using cat seems simple enough. Let’s practice the for construct:

echo "state,sex,year,name,count" > allstates.csv
for fname in *.TXT; do
  echo Adding $fname
  cat $fname >> allstates.csv
done

Note: It’s not an issue here – but because of the naive way that bash handles spaces-within-filenames, the above for-loop can be quite erratic, if not dangerous. However, I’ll leave the lengthy discussion on the ins-and-outs of that for another occasion; in this (very) controlled scenario, you shouldn’t run into any filename/operating-system-specific issues.

Python for-loop

Hopefully you can see the elegance of the bash solution. But pretend you’re not on a Unix-like system, or that you want to practice Python:

from glob import glob
output = open("allstates.csv", "w")
output.write("state,sex,year,name,count\n")
for fname in glob("*.TXT"):
    print("Adding", fname)
    f = open(fname)
    output.write(f.read())
    f.close()
output.close()

Practicing the with statement

This is a good time to get into the habit of using Python’s with statement, which isn’t important in this situation, but standard practice when working with bigger datasets and real-world operating scenarios:

from glob import glob
with open("allstates.csv", "w") as output:
    output.write("state,sex,year,name,count\n")
    for fname in glob("*.TXT"):
        print("Adding", fname)
        with open(fname) as f:
           output.write(f.read())

Baby names by year

How the SSA packages the nationwide by year baby names text files is more complicated.

Each year has its own headerless file:

NationalReadMe.pdf  yob1924.txt   yob1969.txt
yob1880.txt   yob1925.txt   yob1970.txt
yob1881.txt   yob1926.txt   yob1971.txt
yob1882.txt   yob1927.txt   yob1972.txt
yob1883.txt   yob1928.txt   yob1973.txt

However, unlike the state text files, the year text files _do not include the year on each row. For example, here are the first five lines from yob1984.txt:

  Jennifer,F,50556
  Jessica,F,45849
  Ashley,F,38759
  Amanda,F,33903
  Sarah,F,25873

But how do you know that they originated from the 1984 file, without having been told beforehand? This means that simply cat-ing all the files together would make it impossible to distinguish which row comes from which year.

Grep and iterate line-by-line, per-file, in Bash

Luckily, we practiced using the for-loop in both Bash and Python, which gives us the flexibility to alter the files on a per-file, per-line basis. With bash and the use of grep to capture the year per filename:

echo "year,name,sex,count" > allyears.csv
for fname in *.txt; do
  year=$(echo $fname | grep -oE [0-9]{4})
  echo Year $year
  cat $fname | while read -r line; do
    echo "$year,$line" >> allyears.csv
  done
done

Python for-loops and regex module

Same concept here. If you already have some basic familiarity with regular expressions, try practicing a positive-lookbehind and Python’s os.path module for more exactness (even though it’s not needed here):

from os.path import basename
from glob import glob
import re
with open("allyears.csv", "w") as output:
    output.write("year,name,sex,count\n")
    for fname in glob("*.txt"):
        year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
        print("Adding", year)
        with open(fname, 'r') as f:
           # readlines is OK here, bad in practice for massive files
            for line in f.readlines():
                output.write("%s,%s" % (year, line)) 

Python csv module

Again, the SSA babynames scenario is simplified, including in how we can safely expect that each line is a “simple” CSV rather than a complicated one, in which commas-within-fields and special characters have to be dealt with. In such real-world scenarios, the following is a very bad way to add a comma-delimited field:

# ...    
      output.write("%s,%s" % (year, line))

So it’s worth practicing Python’s csv module; to be honest, it’s taken me a lot of time to figure out the conventions.

csv.Reader and csv.Writer

from os.path import basename
from glob import glob
import re
import csv
with open("allyears.csv", "w") as output:
    cw = csv.writer(output)
    # notice different data layout, with year at the end of each row
    cw.writerow(['name', 'sex', 'count', 'year'])
    for fname in glob("*.txt"):
        year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
        print("Adding", year)
        with open(fname, 'r') as f:
            for row in csv.reader(f):
                row.append(year)
                cw.writerow(row)

csv.DictReader and csv.DictWriter

If you prefer using dictionaries to track the structure of your data:

from os.path import basename
from glob import glob
import re
from csv import DictReader, DictWriter
with open("allyears.csv", "w") as output:
    cw = DictWriter(output, fieldnames = ['name', 'sex', 'count', 'year'])
    # DictWriter will lay out columns in an unspecified order
    cw.writeheader()
    for fname in glob("*.txt"):
        year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
        print("Adding", year)
        with open(fname, 'r') as f:
            for row in DictReader(f, fieldnames = ["name", "sex", "count"]):
                row['year'] = year
                cw.writerow(row)