The use case
Using the relatively clean and straightforward baby names data from the Social Security Administration, this example shows several variations of how to combine text files in Bash and in Python, including how to add data on a per-file, per-line basis. This is a good time to not only get acquainted with the conveniences of Bash programming, but to practice using higher-level CSV libraries (even though they are overkill in this situation).
The routine
Open a blank, new text file.
Open an existing text file.
Copy existing text file.
Paste existing text file into new text file.
Open another existing text file.
Repeat.
Just cat
The Unix operating system has long had the simple cat program, which has no other ambition than to take the list of files given to it and combine, or concatenate, their streams:
cat file1.txt file2.txt > all.txt
This is actually good enough for many situations, and even the most skilled practitioner will reach for cat as a first resort.
Baby names by state
The U.S. Social Security Administration releases bulk data on baby names in two different zip files: by state and by year (nationwide). Each zip file contains dozens of text files totaling in the tens of megabytes.
The states file is pretty straightforward. The list of text files, one for each state, looks like this:
AK.TXT HI.TXT MI.TXT NV.TXT TX.TXT
AL.TXT IA.TXT MN.TXT NY.TXT UT.TXT
AR.TXT ID.TXT MO.TXT OH.TXT VA.TXT
AZ.TXT IL.TXT MS.TXT OK.TXT VT.TXT
CA.TXT IN.TXT MT.TXT OR.TXT WA.TXT
CO.TXT KS.TXT NC.TXT PA.TXT WI.TXT
CT.TXT KY.TXT ND.TXT RI.TXT WS_FTP.LOG
DC.TXT LA.TXT NE.TXT SC.TXT WV.TXT
DE.TXT MA.TXT NH.TXT SD.TXT WY.TXT
FL.TXT MD.TXT NJ.TXT StateReadMe.pdf
GA.TXT ME.TXT NM.TXT TN.TXT
Each text file is headerless. Each row has this comma-separated value format:
state abbreviation, sex, year, name, count
e.g.
AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10
AK,F,1910,Margaret,8
AK,F,1910,Helen,7
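To make the five-field layout concrete, here’s a minimal sketch of splitting the first sample row into its named parts (values taken from the listing above):

```python
row = "AK,F,1910,Mary,14"
state, sex, year, name, count = row.split(",")
# every field comes back as a string; convert count if you need arithmetic
count = int(count)
```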
Bash cat
Because the files all share the same headerless format, we just have to paste them one after another into a combined file, with no redundant header lines to worry about deleting.
Assuming you’re currently in the directory containing the unzipped text files, here’s how to combine all the state files (i.e. files that end with .TXT) into a file named allstates.csv:
cat *.TXT > allstates.csv
Adding a header line
That was just too easy. Of course, we want column headers in the combined allstates.csv, so that other programs (such as Excel or a SQL database), or anyone else who uses allstates.csv, knows what the columns are.
We can do this manually with echo:
echo "state,sex,year,name,count" > allstates.csv
cat *.TXT >> allstates.csv
Bash for-loop
Using cat seems simple enough. Let’s practice the for construct:
echo "state,sex,year,name,count" > allstates.csv
for fname in *.TXT; do
  echo Adding $fname
  cat $fname >> allstates.csv
done
Note: it’s not an issue here, but because of the naive way that Bash handles spaces within filenames, the above for-loop can be quite erratic, if not dangerous. I’ll leave the lengthy discussion of the ins and outs of that for another occasion; in this (very) controlled scenario, you shouldn’t run into any filename- or operating-system-specific issues.
Python for-loop
Hopefully you can see the elegance of the bash solution. But pretend you’re not on a Unix-like system, or that you want to practice Python:
from glob import glob

output = open("allstates.csv", "w")
output.write("state,sex,year,name,count\n")
for fname in glob("*.TXT"):
    print("Adding", fname)
    f = open(fname)
    output.write(f.read())
    f.close()
output.close()
Practicing the with statement
This is a good time to get into the habit of using Python’s with statement. It isn’t important in this situation, but it is standard practice when working with bigger datasets and real-world operating scenarios:
from glob import glob

with open("allstates.csv", "w") as output:
    output.write("state,sex,year,name,count\n")
    for fname in glob("*.TXT"):
        print("Adding", fname)
        with open(fname) as f:
            output.write(f.read())
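If you want to see concretely what with buys you, here’s a self-contained sketch (the filename is made up for the demo) showing that the file gets closed even when the body of the block raises an exception:

```python
import os
import tempfile

# a throwaway path for the demo (hypothetical filename)
path = os.path.join(tempfile.gettempdir(), "with_demo.txt")
try:
    with open(path, "w") as f:
        f.write("hello\n")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
# the with block closed (and flushed) the file despite the exception
```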
Baby names by year
How the SSA packages the nationwide, by-year baby names text files is more complicated.
Each year has its own headerless file:
NationalReadMe.pdf yob1924.txt yob1969.txt
yob1880.txt yob1925.txt yob1970.txt
yob1881.txt yob1926.txt yob1971.txt
yob1882.txt yob1927.txt yob1972.txt
yob1883.txt yob1928.txt yob1973.txt
However, unlike the state text files, the year text files do not include the year on each row. For example, here are the first five lines from yob1984.txt:
Jennifer,F,50556
Jessica,F,45849
Ashley,F,38759
Amanda,F,33903
Sarah,F,25873
But how do you know that these rows originated from the 1984 file, without having been told beforehand? Simply cat-ing all the files together would make it impossible to tell which row comes from which year.
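The fix is to recover the year from each filename before concatenating. As a minimal sketch, using the yob1984.txt name from above, a four-digit-number pattern is enough:

```python
import re

fname = "yob1984.txt"
# pull the first run of exactly four digits out of the filename
year = re.search(r'\d{4}', fname).group()
```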
Grep and iterate line-by-line, per-file, in Bash
Luckily, we practiced the for-loop in both Bash and Python, which gives us the flexibility to alter the files on a per-file, per-line basis. Here’s the Bash version, using grep to capture the year from each filename:
echo "year,name,sex,count" > allyears.csv
for fname in *.txt; do
year=$(echo $fname | grep -oE [0-9]{4})
echo Year $year
cat $fname | while read -r line; do
echo "$year,$line" >> allyears.csv
done
done
Python for-loops and regex module
Same concept here. If you already have some basic familiarity with regular expressions, try practicing a positive lookbehind and Python’s os.path module for more exactness (even though it’s not needed here):
from os.path import basename
from glob import glob
import re
with open("allyears.csv", "w") as output:
output.write("year,name,sex,count\n")
for fname in glob("*.txt"):
year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
print("Adding", year)
with open(fname, 'r') as f:
# readlines is OK here, bad in practice for massive files
for line in f.readlines():
output.write("%s,%s" % (year, line))
Python csv module
Again, the SSA babynames scenario is simplified: we can safely assume that each line is a “simple” CSV, with no commas-within-fields or special characters to deal with. In real-world scenarios where those do occur, the following is a very bad way to add a comma-delimited field:
# ...
output.write("%s,%s" % (year, line))
So it’s worth practicing Python’s csv module; to be honest, it took me a while to learn its conventions.
csv.reader and csv.writer
from os.path import basename
from glob import glob
import re
import csv
with open("allyears.csv", "w") as output:
cw = csv.writer(output)
# notice different data layout, with year at the end of each row
cw.writerow(['name', 'sex', 'count', 'year'])
for fname in glob("*.txt"):
year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
print("Adding", year)
with open(fname, 'r') as f:
for row in csv.reader(f):
row.append(year)
cw.writerow(row)
csv.DictReader and csv.DictWriter
If you prefer using dictionaries to track the structure of your data:
from os.path import basename
from glob import glob
import re
from csv import DictReader, DictWriter
with open("allyears.csv", "w") as output:
cw = DictWriter(output, fieldnames = ['name', 'sex', 'count', 'year'])
# DictWriter will lay out columns in an unspecified order
cw.writeheader()
for fname in glob("*.txt"):
year = re.search(r'(?<=^yob)\d{4}', basename(fname)).group()
print("Adding", year)
with open(fname, 'r') as f:
for row in DictReader(f, fieldnames = ["name", "sex", "count"]):
row['year'] = year
cw.writerow(row)
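As a quick check of the DictReader conventions on their own, here’s a self-contained sketch; the sample rows mimic the headerless yob1984.txt lines shown earlier, fed in via an in-memory stream instead of a file:

```python
import io
from csv import DictReader

# a headerless snippet in the same shape as a yob file
sample = io.StringIO("Jennifer,F,50556\nJessica,F,45849\n")
rows = list(DictReader(sample, fieldnames=["name", "sex", "count"]))
# each row is a dict keyed by the supplied fieldnames;
# values stay as strings unless you convert them
```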