{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Data Cleaning in python\n", "This practical was based on Section 2 of \"An Introduction ot data cleaning with R\" from Statistics Netherlands, and adapted for python.\n", "* Available at https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf\n", "\n", "## Logging in and getting started\n", "\n", "### Data files to be set up\n", "#### unnamed.txt\n", "```\n", "21,6.0 \n", "42,5.9 \n", "18,5.7* \n", "21,NA\n", "```\n", "\n", "#### daltons.txt\n", "```\n", "%% Data on the Dalton Brothers\n", "Gratt,1861,1892\n", "Bob,1892\n", "1871,Emmet,1937\n", "% Names, birth and death dates\n", "```\n", "\n", "### On Your Own Laptop\n", "\n", "##### Prerequisites\n", "\n", "Python 2.7 and Conda.\n", "\n", "#### Command line install:\n", "```\n", "condo create --name pythonData\n", "conda install -n pythonData Jupyter pandas\n", "source activate pythonData\n", "jupyter notebook\n", "```\n", "Open http://localhost:8888 in browser.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### On EPCC's RDF (Research Data Facility)\n", "\n", "* Open a terminal window and run the following commands:\n", "*NB Replace port number 8889 with another number between 8000-9000\n", "```\n", "# Login\n", "ssh username@login.rdf.ac.uk\n", "# Load python modules\n", "module load python\n", "module load anaconda\n", "# Create working directory\n", "mkdir dataCleaning\n", "cd dataCleaning\n", "# create the input data files\n", "nano unnamed.txt\n", "nano daltons.txt\n", "ipython notebook --no-browser --port=8889\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mac or Unix connection to RDF\n", "* Open another terminal window (mac or Unix) and run this command:\n", "```\n", "ssh -N -f -L localhost:8888:localhost:8889 username@login.rdf.ac.uk\n", "```\n", "* Go to http://localhost:8888 in your browser. Open the dataCleaning directory and create a new notebook to work in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Windows connection to RDF\n", "Open putty\n", "set ssh connection:\n", "```Host Name: user@IP\n", "port: 22\n", "set putty/connections/SSH/tunnels\n", "source: local port:8888\n", "Destination: remote server: localhost:8889```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Read csv data into a data frame\n", "Start with a file called unnamed.txt containing the following text:\n", "```\n", "21,6.0 \n", "42,5.9 \n", "18,5.7* \n", "21,NA\n", "```\n", "### Start Python Notebook\n", "As described above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### read_csv\n", "* Import the pandas module as pd\n", "* Read this with pd.read_csv()\n", " - What has happened to the first row? - it's a header.\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
216.0
0425.9
1185.7*
221NaN
\n", "
" ], "text/plain": [ " 21 6.0\n", "0 42 5.9\n", "1 18 5.7*\n", "2 21 NaN" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.read_csv(\"unnamed.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Header row\n", "* Read this again with header=None as an argument\n", " - What has happened now? \n" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
0216.0
1425.9
2185.7*
321NaN
\n", "
" ], "text/plain": [ " 0 1\n", "0 21 6.0\n", "1 42 5.9\n", "2 18 5.7*\n", "3 21 NaN" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"unnamed.txt\", header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the column names" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageheight
0216.0
1425.9
2185.7*
321NaN
\n", "
" ], "text/plain": [ " age height\n", "0 21 6.0\n", "1 42 5.9\n", "2 18 5.7*\n", "3 21 NaN" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageheight
0216.0
1425.9
2185.7*
321NaN
\n", "
" ], "text/plain": [ " age height\n", "0 21 6.0\n", "1 42 5.9\n", "2 18 5.7*\n", "3 21 NaN" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "person = pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))\n", "person" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 4 entries, 0 to 3\n", "Data columns (total 2 columns):\n", "age 4 non-null int64\n", "height 3 non-null object\n", "dtypes: int64(1), object(1)\n", "memory usage: 136.0+ bytes\n" ] } ], "source": [ "person.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Let’s convert the height column into numeric values\n", "What happened to 5.7*?" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "Unable to parse string \"5.7*\" at position 2", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_numeric\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/eilidhtroup/anaconda/envs/pythonData/lib/python2.7/site-packages/pandas/core/tools/numeric.pyc\u001b[0m in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast)\u001b[0m\n\u001b[1;32m 124\u001b[0m \u001b[0mcoerce_numeric\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFalse\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'ignore'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'raise'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 125\u001b[0m values = lib.maybe_convert_numeric(values, set(),\n\u001b[0;32m--> 126\u001b[0;31m coerce_numeric=coerce_numeric)\n\u001b[0m\u001b[1;32m 127\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 128\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32mpandas/_libs/src/inference.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric (pandas/_libs/lib.c:56638)\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Unable to parse string \"5.7*\" at position 2" ] } ], "source": [ "person.height = pd.to_numeric(person.height) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* What happens to 5.7\\* when you coerce to a numeric value?" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageheight
0216.0
1425.9
218NaN
321NaN
\n", "
" ], "text/plain": [ " age height\n", "0 21 6.0\n", "1 42 5.9\n", "2 18 NaN\n", "3 21 NaN" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "person.height = pd.to_numeric(person.height, errors='coerce') \n", "person" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s check the structure. It’s a data frame containing:\n", "* an age column of ints\n", "* a height columns of floats." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 4 entries, 0 to 3\n", "Data columns (total 2 columns):\n", "age 4 non-null int64\n", "height 2 non-null float64\n", "dtypes: float64(1), int64(1)\n", "memory usage: 136.0 bytes\n" ] } ], "source": [ "person.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2 Dealing with unstructured data\n", "## Step 1 Readlines" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open(\"daltons.txt\") as f:\n", " txt = f.readlines()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['%% Data on the Dalton Brothers\\r\\n',\n", " 'Gratt,1861,1892\\r\\n',\n", " 'Bob,1892\\r\\n',\n", " '1871,Emmet,1937\\r\\n',\n", " '% Names, birth and death dates\\r\\n']" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pattern matching\n", "\n", "Several example datasets come as part of Python's scikit-learn package. \n", "http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset\n", "\n", "However, here we are going to download a csv file." ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 150 entries, 0 to 149\n", "Data columns (total 5 columns):\n", "SepalLength 150 non-null float64\n", "SepalWidth 150 non-null float64\n", "PetalLength 150 non-null float64\n", "PetalWidth 150 non-null float64\n", "Name 150 non-null object\n", "dtypes: float64(4), object(1)\n", "memory usage: 5.9+ KB\n" ] } ], "source": [ "iris.info()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "names = iris.columns.tolist() # Alternatively list(iris)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using list comprehension" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Python's list comprehension applys a function to each element in a list." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[8, 10, 12]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers = [4,5,6]\n", "[x*2 for x in numbers]" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'Petal' in 'PetalLength'" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[False, False, True, True, False]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[\"Petal\" in name for name in names]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PetalLength', 'PetalWidth']" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if 'Petal' in name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Regular expressions\n", "* As above, using regular expressions" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PetalLength', 'PetalWidth']" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "[name for name in names if re.search(\"Petal\", name)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "• ^ matches pattern at start" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PetalLength', 'PetalWidth']" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if re.search(\"^P\", name)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "• $ matches pattern at end" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if re.search(\"th$\", name)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "• [] character class, match characters enclosed in []" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'PetalLength']" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if re.search(\"[g][t][h]\", name)]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more see help(re) for full explanation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An aside on Simple string matching alternatives (regular expression are more powerful, and return position of match)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PetalLength', 'PetalWidth']" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if name.startswith(\"P\")]" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[False, False, True, True, False]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name.startswith(\"P\") for name in names]" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if name.endswith(\"th\")]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'PetalLength']" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in names if \"gth\" in name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subsetting and Logicals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Logical AND &" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPetalLengthPetalWidthName
705.93.24.81.8Iris-versicolor
776.73.05.01.7Iris-versicolor
\n", "
" ], "text/plain": [ " SepalLength SepalWidth PetalLength PetalWidth Name\n", "70 5.9 3.2 4.8 1.8 Iris-versicolor\n", "77 6.7 3.0 5.0 1.7 Iris-versicolor" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[(iris.Name == \"Iris-versicolor\") & (iris.PetalWidth >= 1.7)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Logical OR |" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPetalLengthPetalWidthName
134.33.01.10.1Iris-setosa
1317.93.86.42.0Iris-virginica
\n", "
" ], "text/plain": [ " SepalLength SepalWidth PetalLength PetalWidth Name\n", "13 4.3 3.0 1.1 0.1 Iris-setosa\n", "131 7.9 3.8 6.4 2.0 Iris-virginica" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[(iris.SepalLength == 4.3) | (iris.SepalLength == 7.9)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Note == for comparison not =" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Logical NOT ~" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPetalLengthPetalWidthName
134.33.01.10.1Iris-setosa
\n", "
" ], "text/plain": [ " SepalLength SepalWidth PetalLength PetalWidth Name\n", "13 4.3 3.0 1.1 0.1 Iris-setosa" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[~(iris.SepalLength > 4.3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting rows and columns\n", "\n", "* Pandas filter() command selects columns\n", "* Can filter by regular expression" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([u'PetalLength', u'PetalWidth'], dtype='object')" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(regex='^P').columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Select columns and rows at the same time" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PetalLengthPetalWidth
131.10.1
\n", "
" ], "text/plain": [ " PetalLength PetalWidth\n", "13 1.1 0.1" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(regex='^P')[~(iris.SepalLength > 4.3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 (cont) Selecting lines only with data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find lines starting with a % sign" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['%% Data on the Dalton Brothers\\r\\n', '% Names, birth and death dates\\r\\n']" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[name for name in txt if re.search(\"^%\", name)]\n" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Gratt,1861,1892\\r\\n', 'Bob,1892\\r\\n', '1871,Emmet,1937\\r\\n']" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat = [name for name in txt if not re.search(\"^%\", name)]\n", "dat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - split lines into fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For each line, we now want to extract the content for each field.\n", "* We now need to know about splitting lines and learn about lists in Python." ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": true }, "outputs": [], "source": [ "L = [1,2, \"three\", [3,3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[] retrieves an object from the list.\n", "Indexing starts at zero." ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can select a range of values" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 'three']" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use - to count from end\n", "e.g. last 2 items" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'three'" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L[-2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From second last to end" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['three', [3, 3]]" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L[-2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### split\n", "split() splits a string into a list of substrings at the point indicated by the split pattern." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Split', 'the', 'words', 'in', 'a', 'sentence\\n']" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = \"Split the words in a sentence\\n\"\n", "x.split(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Use split() to split each line into data chunks.\n", "* Use strip() to remove whitespace characters such as \\n" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Split', 'the', 'words', 'in', 'a', 'sentence']" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.strip().split(\" \")" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "field_list = [ln.strip().split(\",\") for ln in dat]\n", "field_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Standardise Rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Now we want to make sure each row has the same number of fields and in the same order.\n", "* Let's write a function to process each row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User-defined function in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`def my_function(arg1, arg2, ...):\n", " statements\n", " return object\n", "code not in my_function`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Objects in the function are local to the function.\n", "* The object returned can be any data type\n", "* Functions are stored as objects\n", "* An explicit return statement is required.\n", "* : marks the start of the body of the function. The body must be indented, the end of the indentation marks the end of the function.\n", "****\n", "* So let's write a function that takes the list representing each line, extracts the person's name, their birth and death dates and re-orders them accordingly.\n", "* Let's call this function assign_fields and store it in a file called assign_fields.py\n", "\n", "* Open a text file with:\n", " `nano assign_fields.py`" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "def assign_fields(x):\n", " # x is a list of words from a line.\n", " \n", " # create a list to hold the extracted fields, initialised to 'NA' by default.\n", " out = ['NA'] * 3\n", " \n", " for word in x:\n", " # extract the name value (alphabetical) and insert in the first position.\n", " if word.isalpha():\n", " out[0] = word\n", " else: \n", " # extract birth date (if any)\n", " # based on knowledge that all Dalton brothers were born before 1890\n", " # and died after 1890\n", " if (int(word) < 1890): \n", " out[1] = word\n", " elif (int(word) > 1890): \n", " out[2] = word\n", " # Returns a list format: [name, born, died] \n", " return out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Save the assign_fields.py file and restart ipython." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Read the file in again after re-starting ipython" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "with open(\"daltons.txt\") as f:\n", " txt = f.readlines()\n", "dat = [name for name in txt if not re.search(\"^%\", name)]\n", "field_list = [ln.strip().split(\",\") for ln in dat]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* From the assign_fields module file, import the assign_fields function.\n", "* Run the assign_fields function over every row in the list of fields" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from assign_fields import assign_fields\n", "standard_fields = [assign_fields(ln) for ln in field_list]\n", "standard_fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 - Transform to a data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Let's convert the list of standardised rows into a data frame." ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": true }, "outputs": [], "source": [ "daltons = pd.DataFrame(standard_fields)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0Gratt18611892
1BobNA1892
2Emmet18711937
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 Gratt 1861 1892\n", "1 Bob NA 1892\n", "2 Emmet 1871 1937" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daltons" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": true }, "outputs": [], "source": [ "daltons = pd.DataFrame(standard_fields, columns=['name','birth','death'])" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebirthdeath
0Gratt18611892
1BobNA1892
2Emmet18711937
\n", "
" ], "text/plain": [ " name birth death\n", "0 Gratt 1861 1892\n", "1 Bob NA 1892\n", "2 Emmet 1871 1937" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daltons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6 - Normalise and coerce to correct type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Now need to coerce our columns to the correct types e.g. numerics, characters, categories, ... In this case birth and death, need to be numerics." ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3 entries, 0 to 2\n", "Data columns (total 3 columns):\n", "name 3 non-null object\n", "birth 2 non-null float64\n", "death 3 non-null int64\n", "dtypes: float64(1), int64(1), object(1)\n", "memory usage: 144.0+ bytes\n" ] } ], "source": [ "daltons.info()" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') " ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": true }, "outputs": [], "source": [ "daltons.death = pd.to_numeric(daltons.death, errors='coerce') " ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebirthdeath
0Gratt1861.01892
1BobNaN1892
2Emmet1871.01937
\n", "
" ], "text/plain": [ " name birth death\n", "0 Gratt 1861.0 1892\n", "1 Bob NaN 1892\n", "2 Emmet 1871.0 1937" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daltons" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3 entries, 0 to 2\n", "Data columns (total 3 columns):\n", "name 3 non-null object\n", "birth 2 non-null float64\n", "death 3 non-null int64\n", "dtypes: float64(1), int64(1), object(1)\n", "memory usage: 144.0+ bytes\n" ] } ], "source": [ "daltons.info()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* The birth column contains floats instead of integers because you can't mix int and NaN data types in pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Repeatability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Storing the instructions in a file along **with comments** enables repetablility. \n", "* Remember to import the required modules.\n", "* Ipython notebooks allow nicely formatted comments, code, and output to be mixed." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Daltons\n", " name birth death\n", "0 Gratt 1861.0 1892\n", "1 Bob NaN 1892\n", "2 Emmet 1871.0 1937\n", "\n", "Info\n", "\n", "RangeIndex: 3 entries, 0 to 2\n", "Data columns (total 3 columns):\n", "name 3 non-null object\n", "birth 2 non-null float64\n", "death 3 non-null int64\n", "dtypes: float64(1), int64(1), object(1)\n", "memory usage: 144.0+ bytes\n", "None\n" ] } ], "source": [ "import pandas as pd\n", "import re\n", "with open(\"daltons.txt\") as f:\n", " txt = f.readlines()\n", "dat = [name for name in txt if not re.search(\"^%\", name)]\n", "field_list = [ln.strip().split(\",\") for ln in dat]\n", "from assign_fields import assign_fields\n", "standard_fields = [assign_fields(ln) for ln in field_list]\n", "colnames = ['name', 'birth', 'death']\n", "daltons = pd.DataFrame(standard_fields, columns=colnames)\n", "daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') \n", "daltons.death = pd.to_numeric(daltons.death, errors='coerce') \n", "print(\"Daltons\")\n", "print(daltons)\n", "print('\\nInfo')\n", "print(daltons.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fixing character vectors - re.sub" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Replace-the-spaces-in-this-text'" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "import pandas as pd\n", "string = \"Replace the spaces in this text\"\n", "re.sub(\" \", \"-\", string)" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Replace-first space in this text'" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = \"Replace first space in this text\"\n", "re.sub(\" \", \"-\", string, count=1)" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": true }, "outputs": [], "source": [ "names = iris.columns.tolist()" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['S-palL-ngth', 'S-palWidth', 'P-talL-ngth', 'P-talWidth', 'Nam-']" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[re.sub(\"e\", '-', name) for name in names]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parallel processing in Python\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Serial code:" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "field_list" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "standard_fields = [assign_fields(ln) for ln in field_list]\n", "standard_fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parallel code using 'multiprocessing' module - each line processed on a different processor." ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import multiprocessing\n", "from multiprocessing import Pool\n", "\n", "try:\n", " cpus = multiprocessing.cpu_count()\n", "except NotImplementedError:\n", " cpus = 2 # arbitrary default\n", "\n", "pool = Pool(processes=cpus)\n", "pool.map(assign_fields, field_list)" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cpus" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }