An Awk CSV Tutorial

Recently I read a blog post where the author claimed that it was always necessary to use a library written by someone else, along with a complex scripting language, in order to process CSV files, or perform a variety of other tasks. In the Reddit thread associated with that post, I demonstrated that that was not the case; although given the apparent lack of familiarity with the tools that the POSIX userland offers for performing CSV processing in particular, that many people seem to have, I decided to write a tutorial about it.

What I am going to be sharing with you, is not new information. I am still only in the very early stages of learning the AWK language myself, and I know that there are probably many older UNIX users around who have a far more intimate grasp of these concepts. Nevertheless, the point is that at one point in time, people tended to organise files with the explicit understanding that taken together, file and directory names constituted a database of sorts. This does not seem to be a practice that has been in use for many years, but I have found it extremely beneficial myself, and hopefully some of the rest of you will as well.

I'm going to start by admitting that I do not consider the comma (",") to be a good character to use as a field seperator, personally. My own practice is to rename all of my files as sets of fields which can be manipulated either with awk(1) or cut(1), but in order to do so, as a field seperator I use the plus sign. "(+)"

I use a plus sign, because the comma occurs far too commonly as a form of punctuation, within many sentences or names, whereas the plus sign occurs sufficiently rarely that if it ever does, I can simply escape it. With the comma, I would need to be escaping all the time.

As an example, here is the file name of an episode of "Star Trek: Enterprise."

mp4.+star-trek+enterprise+s03+e18+azati-prime

As you can perhaps see, the above file name is divided up into six fields, with a plus sign as the field seperator between each of them. I will now go through them. Some of you may have noticed that the file extension is listed at the beginning of the filename, which may seem odd; but my overall rule with filenames, is to sort fields going from the most general, (which is therefore usually the file type or extension) through to the most specific or individual, which in this case is the episode name.

So here we have first the file type or extension, which is "mp4." The second field here is the overall franchise name, which is Star Trek. The third field is the individual series name, which is Enterprise. The fourth and fifth fields indicate first the season number, (Enterprise had four seasons) and secondly the episode number within that season. The last field indicates the individual episode name, as mentioned.

I always remove all capital letters, spaces, commas, and non-alphanumeric characters from filenames as well, so that they can be parsed easily. If you need to preserve strange characters, you can create a variable with those characters intact, and then rename the file with them in it, once you have done all other processing.

Once we have this, we can write the following AWK script which will allow us to extract any of those fields, for whatever purpose we might have. This can, of course, make the automated renaming or moving of large numbers of files much easier, as well.

#!/usr/bin/env bash
set -x

#----------+comments---------
# Field processing script.
#----------.comments---------

#----------+code---------
rawfilename="mp4.+star-trek+enterprise+s03+e18+azati-prime"

filetype=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $1 }')
franchisename=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $2 }')
seriesname=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $3 }')
seasonno=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $4 }')
episodeno=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $5 }')
episodename=$(echo "${rawfilename}" | awk 'BEGIN { FS="+" } { print $6 }')

echo "File type is ${filetype}."
echo "File franchise name is ${franchisename}."
echo "File series name is ${seriesname}."
echo "File season number is ${seasonno}."
echo "File episode number is ${episodeno}."
echo "File episode name is ${episodename}."
#----------.code---------

Once you have your fields stored in variables in this manner, it is very easy to do whatever else with them that you might need. Because most Windows files need to have the extension at the end, I can re-arrange the fields to allow that, as simply as changing the echo order, before issuing the mv command. Because I mostly work with the command line within FreeBSD, however, I prefer to keep the filetypes at the beginning of the file, which also means that when I use the command "ls -l" in a given directory, file types are grouped first, and then file names, and both are listed alphabetically.

Other advantages of this filenaming convention, are the ability to very rapidly create media playlists with grep, for example. With the above filename, I could issue the command:-

grep +enterprise+s03+ > playlist

This would have the result of putting the name of every episode from season 3 of Enterprise into a text file, which I could then very easily play with mplayer. You can of course store data in this very same format within text files, that you can then use this sort of setup to produce documents or reports.