{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Data Cleaning in python\n",
    "This practical was based on Section 2 of \"An Introduction ot data cleaning with R\" from Statistics Netherlands, and adapted for python.\n",
    "* Available at https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf\n",
    "\n",
    "## Logging in and getting started\n",
    "\n",
    "### Data files to be set up\n",
    "#### unnamed.txt\n",
    "```\n",
    "21,6.0 \n",
    "42,5.9 \n",
    "18,5.7* \n",
    "21,NA\n",
    "```\n",
    "\n",
    "#### daltons.txt\n",
    "```\n",
    "%% Data on the Dalton Brothers\n",
    "Gratt,1861,1892\n",
    "Bob,1892\n",
    "1871,Emmet,1937\n",
    "% Names, birth and death dates\n",
    "```\n",
    "\n",
    "### On Your Own Laptop\n",
    "\n",
    "##### Prerequisites\n",
    "\n",
    "Python 2.7 and Conda.\n",
    "\n",
    "#### Command line install:\n",
    "```\n",
    "condo create --name pythonData\n",
    "conda install -n pythonData Jupyter pandas\n",
    "source activate pythonData\n",
    "jupyter notebook\n",
    "```\n",
    "Open http://localhost:8888 in browser.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### On EPCC's RDF (Research Data Facility)\n",
    "\n",
    "* Open a terminal window and run the following commands:\n",
    "*NB Replace port number 8889 with another number between 8000-9000\n",
    "```\n",
    "# Login\n",
    "ssh username@login.rdf.ac.uk\n",
    "# Load python modules\n",
    "module load python\n",
    "module load anaconda\n",
    "# Create working directory\n",
    "mkdir dataCleaning\n",
    "cd dataCleaning\n",
    "# create the input data files\n",
    "nano unnamed.txt\n",
    "nano daltons.txt\n",
    "ipython notebook --no-browser --port=8889\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mac or Unix connection to RDF\n",
    "* Open another terminal window (mac or Unix) and run this command:\n",
    "```\n",
    "ssh -N -f -L localhost:8888:localhost:8889 username@login.rdf.ac.uk\n",
    "```\n",
    "* Go to http://localhost:8888 in your browser. Open the dataCleaning directory and create a new notebook to work in."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Windows connection to RDF\n",
    "Open putty\n",
    "set ssh connection:\n",
    "```Host Name: user@IP\n",
    "port: 22\n",
    "set putty/connections/SSH/tunnels\n",
    "source: local port:8888\n",
    "Destination: remote server: localhost:8889```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1: Read csv data into a data frame\n",
    "Start with a file called unnamed.txt containing the following text:\n",
    "```\n",
    "21,6.0 \n",
    "42,5.9 \n",
    "18,5.7* \n",
    "21,NA\n",
    "```\n",
    "### Start Python Notebook\n",
    "As described above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### read_csv\n",
    "* Import the pandas module as pd\n",
    "* Read this with pd.read_csv()\n",
    " - What has happened to the first row? - it's a header.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>21</th>\n",
       "      <th>6.0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>42</td>\n",
       "      <td>5.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18</td>\n",
       "      <td>5.7*</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   21   6.0\n",
       "0  42   5.9\n",
       "1  18  5.7*\n",
       "2  21   NaN"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.read_csv(\"unnamed.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Header row\n",
    "* Read this again with header=None as an argument\n",
    " - What has happened now? \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>21</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>5.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>5.7*</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0     1\n",
       "0  21   6.0\n",
       "1  42   5.9\n",
       "2  18  5.7*\n",
       "3  21   NaN"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"unnamed.txt\", header=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting the column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>height</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>21</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>5.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>5.7*</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age height\n",
       "0   21    6.0\n",
       "1   42    5.9\n",
       "2   18   5.7*\n",
       "3   21    NaN"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>height</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>21</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>5.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>5.7*</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age height\n",
       "0   21    6.0\n",
       "1   42    5.9\n",
       "2   18   5.7*\n",
       "3   21    NaN"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "person = pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))\n",
    "person"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 4 entries, 0 to 3\n",
      "Data columns (total 2 columns):\n",
      "age       4 non-null int64\n",
      "height    3 non-null object\n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 136.0+ bytes\n"
     ]
    }
   ],
   "source": [
    "person.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Let’s convert the height column into numeric values\n",
    "What happened to 5.7*?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Unable to parse string \"5.7*\" at position 2",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-70-ffe88830763a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_numeric\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/Users/eilidhtroup/anaconda/envs/pythonData/lib/python2.7/site-packages/pandas/core/tools/numeric.pyc\u001b[0m in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast)\u001b[0m\n\u001b[1;32m    124\u001b[0m             \u001b[0mcoerce_numeric\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFalse\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'ignore'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'raise'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    125\u001b[0m             values = lib.maybe_convert_numeric(values, set(),\n\u001b[0;32m--> 126\u001b[0;31m                                                coerce_numeric=coerce_numeric)\n\u001b[0m\u001b[1;32m    127\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    128\u001b[0m     \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32mpandas/_libs/src/inference.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric (pandas/_libs/lib.c:56638)\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: Unable to parse string \"5.7*\" at position 2"
     ]
    }
   ],
   "source": [
    "person.height = pd.to_numeric(person.height) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* What happens to 5.7\\* when you coerce to a numeric value?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>height</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>21</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>42</td>\n",
       "      <td>5.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>21</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  height\n",
       "0   21     6.0\n",
       "1   42     5.9\n",
       "2   18     NaN\n",
       "3   21     NaN"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "person.height = pd.to_numeric(person.height, errors='coerce') \n",
    "person"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s check the structure. It’s a data frame containing:\n",
    "* an age column of ints\n",
    "* a height columns of floats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 4 entries, 0 to 3\n",
      "Data columns (total 2 columns):\n",
      "age       4 non-null int64\n",
      "height    2 non-null float64\n",
      "dtypes: float64(1), int64(1)\n",
      "memory usage: 136.0 bytes\n"
     ]
    }
   ],
   "source": [
    "person.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2 Dealing with unstructured data\n",
    "## Step 1 Readlines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(\"daltons.txt\") as f:\n",
    "    txt = f.readlines()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['%% Data on the Dalton Brothers\\r\\n',\n",
       " 'Gratt,1861,1892\\r\\n',\n",
       " 'Bob,1892\\r\\n',\n",
       " '1871,Emmet,1937\\r\\n',\n",
       " '% Names, birth and death dates\\r\\n']"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pattern matching\n",
    "\n",
    "Several example datasets come as part of Python's scikit-learn package. \n",
    "http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset\n",
    "\n",
    "However, here we are going to download a csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 150 entries, 0 to 149\n",
      "Data columns (total 5 columns):\n",
      "SepalLength    150 non-null float64\n",
      "SepalWidth     150 non-null float64\n",
      "PetalLength    150 non-null float64\n",
      "PetalWidth     150 non-null float64\n",
      "Name           150 non-null object\n",
      "dtypes: float64(4), object(1)\n",
      "memory usage: 5.9+ KB\n"
     ]
    }
   ],
   "source": [
    "iris.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "names = iris.columns.tolist()  # Alternatively list(iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using list comprehension"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Python's list comprehension applys a function to each element in a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[8, 10, 12]"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numbers = [4,5,6]\n",
    "[x*2 for x in numbers]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'Petal' in 'PetalLength'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[False, False, True, True, False]"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[\"Petal\" in name for name in names]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if 'Petal' in name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Regular expressions\n",
    "* As above, using regular expressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "[name for name in names if re.search(\"Petal\", name)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "• ^ matches pattern at start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if re.search(\"^P\", name)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "• $ matches pattern at end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if re.search(\"th$\", name)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "• [] character class, match characters enclosed in []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'PetalLength']"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if re.search(\"[g][t][h]\", name)]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more see help(re) for full explanation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An aside on Simple string matching alternatives (regular expression are more powerful, and return position of match)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if name.startswith(\"P\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[False, False, True, True, False]"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name.startswith(\"P\") for name in names]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if name.endswith(\"th\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'PetalLength']"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in names if \"gth\" in name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subsetting and Logicals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Logical AND &"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SepalLength</th>\n",
       "      <th>SepalWidth</th>\n",
       "      <th>PetalLength</th>\n",
       "      <th>PetalWidth</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>5.9</td>\n",
       "      <td>3.2</td>\n",
       "      <td>4.8</td>\n",
       "      <td>1.8</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>77</th>\n",
       "      <td>6.7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>1.7</td>\n",
       "      <td>Iris-versicolor</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    SepalLength  SepalWidth  PetalLength  PetalWidth             Name\n",
       "70          5.9         3.2          4.8         1.8  Iris-versicolor\n",
       "77          6.7         3.0          5.0         1.7  Iris-versicolor"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris[(iris.Name == \"Iris-versicolor\") & (iris.PetalWidth >= 1.7)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Logical OR |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SepalLength</th>\n",
       "      <th>SepalWidth</th>\n",
       "      <th>PetalLength</th>\n",
       "      <th>PetalWidth</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>4.3</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.1</td>\n",
       "      <td>0.1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>131</th>\n",
       "      <td>7.9</td>\n",
       "      <td>3.8</td>\n",
       "      <td>6.4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>Iris-virginica</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     SepalLength  SepalWidth  PetalLength  PetalWidth            Name\n",
       "13           4.3         3.0          1.1         0.1     Iris-setosa\n",
       "131          7.9         3.8          6.4         2.0  Iris-virginica"
      ]
     },
     "execution_count": 94,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris[(iris.SepalLength == 4.3) | (iris.SepalLength == 7.9)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Note == for comparison not ="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Logical NOT ~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SepalLength</th>\n",
       "      <th>SepalWidth</th>\n",
       "      <th>PetalLength</th>\n",
       "      <th>PetalWidth</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>4.3</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.1</td>\n",
       "      <td>0.1</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    SepalLength  SepalWidth  PetalLength  PetalWidth         Name\n",
       "13          4.3         3.0          1.1         0.1  Iris-setosa"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris[~(iris.SepalLength > 4.3)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting rows and columns\n",
    "\n",
    "* Pandas filter() command selects columns\n",
    "* Can filter by regular expression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([u'PetalLength', u'PetalWidth'], dtype='object')"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.filter(regex='^P').columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Select columns and rows at the same time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PetalLength</th>\n",
       "      <th>PetalWidth</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1.1</td>\n",
       "      <td>0.1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    PetalLength  PetalWidth\n",
       "13          1.1         0.1"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.filter(regex='^P')[~(iris.SepalLength > 4.3)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 (cont) Selecting lines only with data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find lines starting with a % sign"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['%% Data on the Dalton Brothers\\r\\n', '% Names, birth and death dates\\r\\n']"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[name for name in txt if re.search(\"^%\", name)]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Gratt,1861,1892\\r\\n', 'Bob,1892\\r\\n', '1871,Emmet,1937\\r\\n']"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dat = [name for name in txt if not re.search(\"^%\", name)]\n",
    "dat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - split lines into fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* For each line, we now want to extract the content for each field.\n",
    "* We now need to know about splitting lines and learn about lists in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "L = [1,2, \"three\", [3,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[] retrieves an object from the list.\n",
    "Indexing starts at zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can select a range of values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 'three']"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L[0:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use - to count from end\n",
    "e.g. last 2 items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'three'"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L[-2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From second last to end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['three', [3, 3]]"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L[-2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### split\n",
    "split() splits a string into a list of substrings at the point indicated by the split pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Split', 'the', 'words', 'in', 'a', 'sentence\\n']"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = \"Split the words in a sentence\\n\"\n",
    "x.split(\" \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Use split() to split each line into data chunks.\n",
    "* Use strip() to remove whitespace characters such as \\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Split', 'the', 'words', 'in', 'a', 'sentence']"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x.strip().split(\" \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "field_list = [ln.strip().split(\",\") for ln in dat]\n",
    "field_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Standardise Rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Now we want to make sure each row has the same number of fields and in the same order.\n",
    "* Let's write a function to process each row."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## User-defined function in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`def my_function(arg1, arg2, ...):\n",
    "    statements\n",
    "    return object\n",
    "code not in my_function`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Objects in the function are local to the function.\n",
    "* The object returned can be any data type\n",
    "* Functions are stored as objects\n",
    "* An explicit return statement is required.\n",
    "* : marks the start of the body of the function. The body must be indented, the end of the indentation marks the end of the function.\n",
    "****\n",
    "* So let's write a function that takes the list representing each line, extracts the person's name, their birth and death dates and re-orders them accordingly.\n",
    "* Let's call this function assign_fields and store it in a file called assign_fields.py\n",
    "\n",
    "* Open a text file with:\n",
    "    `nano assign_fields.py`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "def assign_fields(x):\n",
    "    # x is a list of words from a line.\n",
    "    \n",
    "    # create a list to hold the extracted fields, initialised to 'NA' by default.\n",
    "    out = ['NA'] * 3\n",
    "    \n",
    "    for word in x:\n",
    "        # extract the name value (alphabetical) and insert in the first position.\n",
    "        if word.isalpha():\n",
    "            out[0] = word\n",
    "        else:    \n",
    "            # extract birth date (if any)\n",
    "            # based on knowledge that all Dalton brothers were born before 1890\n",
    "            # and died after 1890\n",
    "            if (int(word) < 1890): \n",
    "                out[1] = word\n",
    "            elif (int(word) > 1890): \n",
    "                out[2] = word\n",
    "    # Returns a list format: [name, born, died]            \n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Save the assign_fields.py file and restart ipython."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Read the file in again after re-starting ipython"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "with open(\"daltons.txt\") as f:\n",
    "    txt = f.readlines()\n",
    "dat = [name for name in txt if not re.search(\"^%\", name)]\n",
    "field_list = [ln.strip().split(\",\") for ln in dat]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* From the assign_fields module file, import the assign_fields function.\n",
    "* Run the assign_fields function over every row in the list of fields"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from assign_fields import assign_fields\n",
    "standard_fields = [assign_fields(ln) for ln in field_list]\n",
    "standard_fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 - Transform to a data frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Let's convert the list of standardised rows into a data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "daltons = pd.DataFrame(standard_fields)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gratt</td>\n",
       "      <td>1861</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bob</td>\n",
       "      <td>NA</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Emmet</td>\n",
       "      <td>1871</td>\n",
       "      <td>1937</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       0     1     2\n",
       "0  Gratt  1861  1892\n",
       "1    Bob    NA  1892\n",
       "2  Emmet  1871  1937"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "daltons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "daltons = pd.DataFrame(standard_fields, columns=['name','birth','death'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>birth</th>\n",
       "      <th>death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gratt</td>\n",
       "      <td>1861</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bob</td>\n",
       "      <td>NA</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Emmet</td>\n",
       "      <td>1871</td>\n",
       "      <td>1937</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name birth death\n",
       "0  Gratt  1861  1892\n",
       "1    Bob    NA  1892\n",
       "2  Emmet  1871  1937"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "daltons"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 - Normalise and coerce to correct type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Now need to coerce our columns to the correct types e.g. numerics, characters, categories, ... In this case birth and death, need to be numerics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3 entries, 0 to 2\n",
      "Data columns (total 3 columns):\n",
      "name     3 non-null object\n",
      "birth    2 non-null float64\n",
      "death    3 non-null int64\n",
      "dtypes: float64(1), int64(1), object(1)\n",
      "memory usage: 144.0+ bytes\n"
     ]
    }
   ],
   "source": [
    "daltons.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [],
   "source": [
    "daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "daltons.death =  pd.to_numeric(daltons.death, errors='coerce') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>birth</th>\n",
       "      <th>death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gratt</td>\n",
       "      <td>1861.0</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bob</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Emmet</td>\n",
       "      <td>1871.0</td>\n",
       "      <td>1937</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    name   birth  death\n",
       "0  Gratt  1861.0   1892\n",
       "1    Bob     NaN   1892\n",
       "2  Emmet  1871.0   1937"
      ]
     },
     "execution_count": 125,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "daltons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3 entries, 0 to 2\n",
      "Data columns (total 3 columns):\n",
      "name     3 non-null object\n",
      "birth    2 non-null float64\n",
      "death    3 non-null int64\n",
      "dtypes: float64(1), int64(1), object(1)\n",
      "memory usage: 144.0+ bytes\n"
     ]
    }
   ],
   "source": [
    "daltons.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "* The birth column contains floats instead of integers because you can't mix int and NaN data types in pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Repeatability"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Storing the instructions in a file along **with comments** enables repetablility. \n",
    "* Remember to import the required modules.\n",
    "* Ipython notebooks allow nicely formatted comments, code, and output to be mixed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Daltons\n",
      "    name   birth  death\n",
      "0  Gratt  1861.0   1892\n",
      "1    Bob     NaN   1892\n",
      "2  Emmet  1871.0   1937\n",
      "\n",
      "Info\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3 entries, 0 to 2\n",
      "Data columns (total 3 columns):\n",
      "name     3 non-null object\n",
      "birth    2 non-null float64\n",
      "death    3 non-null int64\n",
      "dtypes: float64(1), int64(1), object(1)\n",
      "memory usage: 144.0+ bytes\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "with open(\"daltons.txt\") as f:\n",
    "    txt = f.readlines()\n",
    "dat = [name for name in txt if not re.search(\"^%\", name)]\n",
    "field_list = [ln.strip().split(\",\") for ln in dat]\n",
    "from assign_fields import assign_fields\n",
    "standard_fields = [assign_fields(ln) for ln in field_list]\n",
    "colnames = ['name', 'birth', 'death']\n",
    "daltons = pd.DataFrame(standard_fields, columns=colnames)\n",
    "daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') \n",
    "daltons.death = pd.to_numeric(daltons.death, errors='coerce') \n",
    "print(\"Daltons\")\n",
    "print(daltons)\n",
    "print('\\nInfo')\n",
    "print(daltons.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fixing character vectors - re.sub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Replace-the-spaces-in-this-text'"
      ]
     },
     "execution_count": 129,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "import pandas as pd\n",
    "string = \"Replace the spaces in this text\"\n",
    "re.sub(\" \", \"-\", string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Replace-first space in this text'"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = \"Replace first space in this text\"\n",
    "re.sub(\" \", \"-\", string, count=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "names = iris.columns.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['S-palL-ngth', 'S-palWidth', 'P-talL-ngth', 'P-talWidth', 'Nam-']"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[re.sub(\"e\", '-', name) for name in names]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parallel processing in Python\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Serial code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]"
      ]
     },
     "execution_count": 135,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "field_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
      ]
     },
     "execution_count": 136,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standard_fields = [assign_fields(ln) for ln in field_list]\n",
    "standard_fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parallel code using 'multiprocessing' module - each line processed on a different processor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
      ]
     },
     "execution_count": 137,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import multiprocessing\n",
    "from multiprocessing import Pool\n",
    "\n",
    "try:\n",
    "    cpus = multiprocessing.cpu_count()\n",
    "except NotImplementedError:\n",
    "    cpus = 2   # arbitrary default\n",
    "\n",
    "pool = Pool(processes=cpus)\n",
    "pool.map(assign_fields, field_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 138,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cpus"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}