{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Data Cleaning in python\n",
"This practical was based on Section 2 of \"An Introduction ot data cleaning with R\" from Statistics Netherlands, and adapted for python.\n",
"* Available at https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf\n",
"\n",
"## Logging in and getting started\n",
"\n",
"### Data files to be set up\n",
"#### unnamed.txt\n",
"```\n",
"21,6.0 \n",
"42,5.9 \n",
"18,5.7* \n",
"21,NA\n",
"```\n",
"\n",
"#### daltons.txt\n",
"```\n",
"%% Data on the Dalton Brothers\n",
"Gratt,1861,1892\n",
"Bob,1892\n",
"1871,Emmet,1937\n",
"% Names, birth and death dates\n",
"```\n",
"\n",
"### On Your Own Laptop\n",
"\n",
"##### Prerequisites\n",
"\n",
"Python 2.7 and Conda.\n",
"\n",
"#### Command line install:\n",
"```\n",
"condo create --name pythonData\n",
"conda install -n pythonData Jupyter pandas\n",
"source activate pythonData\n",
"jupyter notebook\n",
"```\n",
"Open http://localhost:8888 in browser.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### On EPCC's RDF (Research Data Facility)\n",
"\n",
"* Open a terminal window and run the following commands:\n",
"*NB Replace port number 8889 with another number between 8000-9000\n",
"```\n",
"# Login\n",
"ssh username@login.rdf.ac.uk\n",
"# Load python modules\n",
"module load python\n",
"module load anaconda\n",
"# Create working directory\n",
"mkdir dataCleaning\n",
"cd dataCleaning\n",
"# create the input data files\n",
"nano unnamed.txt\n",
"nano daltons.txt\n",
"ipython notebook --no-browser --port=8889\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Mac or Unix connection to RDF\n",
"* Open another terminal window (mac or Unix) and run this command:\n",
"```\n",
"ssh -N -f -L localhost:8888:localhost:8889 username@login.rdf.ac.uk\n",
"```\n",
"* Go to http://localhost:8888 in your browser. Open the dataCleaning directory and create a new notebook to work in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Windows connection to RDF\n",
"Open putty\n",
"set ssh connection:\n",
"```Host Name: user@IP\n",
"port: 22\n",
"set putty/connections/SSH/tunnels\n",
"source: local port:8888\n",
"Destination: remote server: localhost:8889```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 1: Read csv data into a data frame\n",
"Start with a file called unnamed.txt containing the following text:\n",
"```\n",
"21,6.0 \n",
"42,5.9 \n",
"18,5.7* \n",
"21,NA\n",
"```\n",
"### Start Python Notebook\n",
"As described above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### read_csv\n",
"* Import the pandas module as pd\n",
"* Read this with pd.read_csv()\n",
" - What has happened to the first row? - it's a header.\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 21 | \n",
" 6.0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 42 | \n",
" 5.9 | \n",
"
\n",
" \n",
" 1 | \n",
" 18 | \n",
" 5.7* | \n",
"
\n",
" \n",
" 2 | \n",
" 21 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 21 6.0\n",
"0 42 5.9\n",
"1 18 5.7*\n",
"2 21 NaN"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.read_csv(\"unnamed.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Header row\n",
"* Read this again with header=None as an argument\n",
" - What has happened now? \n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 21 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 42 | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 18 | \n",
" 5.7* | \n",
"
\n",
" \n",
" 3 | \n",
" 21 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1\n",
"0 21 6.0\n",
"1 42 5.9\n",
"2 18 5.7*\n",
"3 21 NaN"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"unnamed.txt\", header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting the column names"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" height | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 21 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 42 | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 18 | \n",
" 5.7* | \n",
"
\n",
" \n",
" 3 | \n",
" 21 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age height\n",
"0 21 6.0\n",
"1 42 5.9\n",
"2 18 5.7*\n",
"3 21 NaN"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" height | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 21 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 42 | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 18 | \n",
" 5.7* | \n",
"
\n",
" \n",
" 3 | \n",
" 21 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age height\n",
"0 21 6.0\n",
"1 42 5.9\n",
"2 18 5.7*\n",
"3 21 NaN"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"person = pd.read_csv(\"unnamed.txt\", header=None, names=('age','height'))\n",
"person"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 4 entries, 0 to 3\n",
"Data columns (total 2 columns):\n",
"age 4 non-null int64\n",
"height 3 non-null object\n",
"dtypes: int64(1), object(1)\n",
"memory usage: 136.0+ bytes\n"
]
}
],
"source": [
"person.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Let’s convert the height column into numeric values\n",
"What happened to 5.7*?"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Unable to parse string \"5.7*\" at position 2",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_numeric\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mperson\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheight\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/eilidhtroup/anaconda/envs/pythonData/lib/python2.7/site-packages/pandas/core/tools/numeric.pyc\u001b[0m in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast)\u001b[0m\n\u001b[1;32m 124\u001b[0m \u001b[0mcoerce_numeric\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFalse\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'ignore'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'raise'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 125\u001b[0m values = lib.maybe_convert_numeric(values, set(),\n\u001b[0;32m--> 126\u001b[0;31m coerce_numeric=coerce_numeric)\n\u001b[0m\u001b[1;32m 127\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 128\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/src/inference.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric (pandas/_libs/lib.c:56638)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unable to parse string \"5.7*\" at position 2"
]
}
],
"source": [
"person.height = pd.to_numeric(person.height) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* What happens to 5.7\\* when you coerce to a numeric value?"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" height | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 21 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 42 | \n",
" 5.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 18 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 21 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age height\n",
"0 21 6.0\n",
"1 42 5.9\n",
"2 18 NaN\n",
"3 21 NaN"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"person.height = pd.to_numeric(person.height, errors='coerce') \n",
"person"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s check the structure. It’s a data frame containing:\n",
"* an age column of ints\n",
"* a height columns of floats."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 4 entries, 0 to 3\n",
"Data columns (total 2 columns):\n",
"age 4 non-null int64\n",
"height 2 non-null float64\n",
"dtypes: float64(1), int64(1)\n",
"memory usage: 136.0 bytes\n"
]
}
],
"source": [
"person.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2 Dealing with unstructured data\n",
"## Step 1 Readlines"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with open(\"daltons.txt\") as f:\n",
" txt = f.readlines()"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['%% Data on the Dalton Brothers\\r\\n',\n",
" 'Gratt,1861,1892\\r\\n',\n",
" 'Bob,1892\\r\\n',\n",
" '1871,Emmet,1937\\r\\n',\n",
" '% Names, birth and death dates\\r\\n']"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pattern matching\n",
"\n",
"Several example datasets come as part of Python's scikit-learn package. \n",
"http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset\n",
"\n",
"However, here we are going to download a csv file."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 150 entries, 0 to 149\n",
"Data columns (total 5 columns):\n",
"SepalLength 150 non-null float64\n",
"SepalWidth 150 non-null float64\n",
"PetalLength 150 non-null float64\n",
"PetalWidth 150 non-null float64\n",
"Name 150 non-null object\n",
"dtypes: float64(4), object(1)\n",
"memory usage: 5.9+ KB\n"
]
}
],
"source": [
"iris.info()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"names = iris.columns.tolist() # Alternatively list(iris)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using list comprehension"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Python's list comprehension applies an expression to each element in a list."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[8, 10, 12]"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers = [4,5,6]\n",
"[x*2 for x in numbers]"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'Petal' in 'PetalLength'"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[False, False, True, True, False]"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[\"Petal\" in name for name in names]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['PetalLength', 'PetalWidth']"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if 'Petal' in name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Regular expressions\n",
"* As above, using regular expressions"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['PetalLength', 'PetalWidth']"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"[name for name in names if re.search(\"Petal\", name)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"• ^ matches pattern at start"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['PetalLength', 'PetalWidth']"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if re.search(\"^P\", name)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"• $ matches pattern at end"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if re.search(\"th$\", name)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"• [] character class, match characters enclosed in []"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'PetalLength']"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if re.search(\"[g][t][h]\", name)]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more see help(re) for full explanation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### An aside on Simple string matching alternatives (regular expression are more powerful, and return position of match)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['PetalLength', 'PetalWidth']"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if name.startswith(\"P\")]"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[False, False, True, True, False]"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name.startswith(\"P\") for name in names]"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if name.endswith(\"th\")]"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'PetalLength']"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in names if \"gth\" in name]"
]
},
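{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of that extra power (a minimal sketch using the standard `re` module), `re.search` returns a match object recording *where* the pattern occurred, which the simple string methods above do not:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"# re.search returns a match object; start() and end() give the\n",
"# position of the matched substring ('Width' starts at index 5).\n",
"m = re.search(\"Width\", \"PetalWidth\")\n",
"(m.start(), m.end())"
]
},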
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subsetting and Logicals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Logical AND &"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SepalLength | \n",
" SepalWidth | \n",
" PetalLength | \n",
" PetalWidth | \n",
" Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 70 | \n",
" 5.9 | \n",
" 3.2 | \n",
" 4.8 | \n",
" 1.8 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
" 77 | \n",
" 6.7 | \n",
" 3.0 | \n",
" 5.0 | \n",
" 1.7 | \n",
" Iris-versicolor | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SepalLength SepalWidth PetalLength PetalWidth Name\n",
"70 5.9 3.2 4.8 1.8 Iris-versicolor\n",
"77 6.7 3.0 5.0 1.7 Iris-versicolor"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris[(iris.Name == \"Iris-versicolor\") & (iris.PetalWidth >= 1.7)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Logical OR |"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SepalLength | \n",
" SepalWidth | \n",
" PetalLength | \n",
" PetalWidth | \n",
" Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 13 | \n",
" 4.3 | \n",
" 3.0 | \n",
" 1.1 | \n",
" 0.1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
" 131 | \n",
" 7.9 | \n",
" 3.8 | \n",
" 6.4 | \n",
" 2.0 | \n",
" Iris-virginica | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SepalLength SepalWidth PetalLength PetalWidth Name\n",
"13 4.3 3.0 1.1 0.1 Iris-setosa\n",
"131 7.9 3.8 6.4 2.0 Iris-virginica"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris[(iris.SepalLength == 4.3) | (iris.SepalLength == 7.9)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Note == for comparison not ="
]
},
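{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Note also that Python's `and`/`or` keywords do not work element-wise on pandas columns. A small sketch (with a made-up Series) of why `&` and parentheses are needed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"s = pd.Series([4.3, 5.0, 7.9])\n",
"# (s > 4.0) and (s < 6.0) would raise a ValueError, because a boolean\n",
"# Series has no single truth value; & instead combines the two boolean\n",
"# Series element by element.\n",
"mask = (s > 4.0) & (s < 6.0)\n",
"mask.tolist()"
]
},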
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Logical NOT ~"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SepalLength | \n",
" SepalWidth | \n",
" PetalLength | \n",
" PetalWidth | \n",
" Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 13 | \n",
" 4.3 | \n",
" 3.0 | \n",
" 1.1 | \n",
" 0.1 | \n",
" Iris-setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SepalLength SepalWidth PetalLength PetalWidth Name\n",
"13 4.3 3.0 1.1 0.1 Iris-setosa"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris[~(iris.SepalLength > 4.3)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Selecting rows and columns\n",
"\n",
"* Pandas filter() command selects columns\n",
"* Can filter by regular expression"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'PetalLength', u'PetalWidth'], dtype='object')"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.filter(regex='^P').columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Select columns and rows at the same time"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PetalLength | \n",
" PetalWidth | \n",
"
\n",
" \n",
" \n",
" \n",
" 13 | \n",
" 1.1 | \n",
" 0.1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PetalLength PetalWidth\n",
"13 1.1 0.1"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.filter(regex='^P')[~(iris.SepalLength > 4.3)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 (cont) Selecting lines only with data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find lines starting with a % sign"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['%% Data on the Dalton Brothers\\r\\n', '% Names, birth and death dates\\r\\n']"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[name for name in txt if re.search(\"^%\", name)]\n"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Gratt,1861,1892\\r\\n', 'Bob,1892\\r\\n', '1871,Emmet,1937\\r\\n']"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dat = [name for name in txt if not re.search(\"^%\", name)]\n",
"dat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - split lines into fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* For each line, we now want to extract the content for each field.\n",
"* We now need to know about splitting lines and learn about lists in Python."
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"L = [1,2, \"three\", [3,3]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[] retrieves an object from the list.\n",
"Indexing starts at zero."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can select a range of values"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1, 2, 'three']"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L[0:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use - to count from end\n",
"e.g. last 2 items"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'three'"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L[-2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From second last to end"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['three', [3, 3]]"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L[-2:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### split\n",
"split() splits a string into a list of substrings at the point indicated by the split pattern."
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Split', 'the', 'words', 'in', 'a', 'sentence\\n']"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = \"Split the words in a sentence\\n\"\n",
"x.split(\" \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Use split() to split each line into data chunks.\n",
"* Use strip() to remove whitespace characters such as \\n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Split', 'the', 'words', 'in', 'a', 'sentence']"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.strip().split(\" \")"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"field_list = [ln.strip().split(\",\") for ln in dat]\n",
"field_list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Standardise Rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Now we want to make sure each row has the same number of fields and in the same order.\n",
"* Let's write a function to process each row."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## User-defined function in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`def my_function(arg1, arg2, ...):\n",
" statements\n",
" return object\n",
"code not in my_function`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Objects in the function are local to the function.\n",
"* The object returned can be any data type\n",
"* Functions are stored as objects\n",
"* An explicit return statement is required.\n",
"* : marks the start of the body of the function. The body must be indented, the end of the indentation marks the end of the function.\n",
"****\n",
"* So let's write a function that takes the list representing each line, extracts the person's name, their birth and death dates and re-orders them accordingly.\n",
"* Let's call this function assign_fields and store it in a file called assign_fields.py\n",
"\n",
"* Open a text file with:\n",
" `nano assign_fields.py`"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"def assign_fields(x):\n",
" # x is a list of words from a line.\n",
" \n",
" # create a list to hold the extracted fields, initialised to 'NA' by default.\n",
" out = ['NA'] * 3\n",
" \n",
" for word in x:\n",
" # extract the name value (alphabetical) and insert in the first position.\n",
" if word.isalpha():\n",
" out[0] = word\n",
" else: \n",
" # extract birth date (if any)\n",
" # based on knowledge that all Dalton brothers were born before 1890\n",
" # and died after 1890\n",
" if (int(word) < 1890): \n",
" out[1] = word\n",
" elif (int(word) > 1890): \n",
" out[2] = word\n",
" # Returns a list format: [name, born, died] \n",
" return out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Save the assign_fields.py file and restart ipython."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Read the file in again after re-starting ipython"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import re\n",
"with open(\"daltons.txt\") as f:\n",
" txt = f.readlines()\n",
"dat = [name for name in txt if not re.search(\"^%\", name)]\n",
"field_list = [ln.strip().split(\",\") for ln in dat]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* From the assign_fields module file, import the assign_fields function.\n",
"* Run the assign_fields function over every row in the list of fields"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from assign_fields import assign_fields\n",
"standard_fields = [assign_fields(ln) for ln in field_list]\n",
"standard_fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5 - Transform to a data frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Let's convert the list of standardised rows into a data frame."
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"daltons = pd.DataFrame(standard_fields)"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Gratt | \n",
" 1861 | \n",
" 1892 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" NA | \n",
" 1892 | \n",
"
\n",
" \n",
" 2 | \n",
" Emmet | \n",
" 1871 | \n",
" 1937 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 Gratt 1861 1892\n",
"1 Bob NA 1892\n",
"2 Emmet 1871 1937"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daltons"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"daltons = pd.DataFrame(standard_fields, columns=['name','birth','death'])"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" birth | \n",
" death | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Gratt | \n",
" 1861 | \n",
" 1892 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" NA | \n",
" 1892 | \n",
"
\n",
" \n",
" 2 | \n",
" Emmet | \n",
" 1871 | \n",
" 1937 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name birth death\n",
"0 Gratt 1861 1892\n",
"1 Bob NA 1892\n",
"2 Emmet 1871 1937"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daltons"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6 - Normalise and coerce to correct type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Now need to coerce our columns to the correct types e.g. numerics, characters, categories, ... In this case birth and death, need to be numerics."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 3 entries, 0 to 2\n",
"Data columns (total 3 columns):\n",
"name 3 non-null object\n",
"birth 2 non-null float64\n",
"death 3 non-null int64\n",
"dtypes: float64(1), int64(1), object(1)\n",
"memory usage: 144.0+ bytes\n"
]
}
],
"source": [
"daltons.info()"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') "
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"daltons.death = pd.to_numeric(daltons.death, errors='coerce') "
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" birth | \n",
" death | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Gratt | \n",
" 1861.0 | \n",
" 1892 | \n",
"
\n",
" \n",
" 1 | \n",
" Bob | \n",
" NaN | \n",
" 1892 | \n",
"
\n",
" \n",
" 2 | \n",
" Emmet | \n",
" 1871.0 | \n",
" 1937 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name birth death\n",
"0 Gratt 1861.0 1892\n",
"1 Bob NaN 1892\n",
"2 Emmet 1871.0 1937"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"daltons"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 3 entries, 0 to 2\n",
"Data columns (total 3 columns):\n",
"name 3 non-null object\n",
"birth 2 non-null float64\n",
"death 3 non-null int64\n",
"dtypes: float64(1), int64(1), object(1)\n",
"memory usage: 144.0+ bytes\n"
]
}
],
"source": [
"daltons.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"* The birth column contains floats instead of integers because you can't mix int and NaN data types in pandas."
]
},
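{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A quick demonstration of that coercion, using small made-up Series: an all-integer column keeps an integer dtype, but a single NaN (which is itself a float) forces the whole column to float."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"# Without missing values pandas can store the years as int64...\n",
"print(pd.Series([1861, 1871]).dtype)\n",
"# ...but NaN has no integer representation, so the dtype becomes float64.\n",
"print(pd.Series([1861, np.nan]).dtype)"
]
},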
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Repeatability"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Storing the instructions in a file along **with comments** enables repetablility. \n",
"* Remember to import the required modules.\n",
"* Ipython notebooks allow nicely formatted comments, code, and output to be mixed."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Daltons\n",
" name birth death\n",
"0 Gratt 1861.0 1892\n",
"1 Bob NaN 1892\n",
"2 Emmet 1871.0 1937\n",
"\n",
"Info\n",
"\n",
"RangeIndex: 3 entries, 0 to 2\n",
"Data columns (total 3 columns):\n",
"name 3 non-null object\n",
"birth 2 non-null float64\n",
"death 3 non-null int64\n",
"dtypes: float64(1), int64(1), object(1)\n",
"memory usage: 144.0+ bytes\n",
"None\n"
]
}
],
"source": [
"import pandas as pd\n",
"import re\n",
"with open(\"daltons.txt\") as f:\n",
" txt = f.readlines()\n",
"dat = [name for name in txt if not re.search(\"^%\", name)]\n",
"field_list = [ln.strip().split(\",\") for ln in dat]\n",
"from assign_fields import assign_fields\n",
"standard_fields = [assign_fields(ln) for ln in field_list]\n",
"colnames = ['name', 'birth', 'death']\n",
"daltons = pd.DataFrame(standard_fields, columns=colnames)\n",
"daltons.birth = pd.to_numeric(daltons.birth, errors='coerce') \n",
"daltons.death = pd.to_numeric(daltons.death, errors='coerce') \n",
"print(\"Daltons\")\n",
"print(daltons)\n",
"print('\\nInfo')\n",
"print(daltons.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fixing character vectors - re.sub"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Replace-the-spaces-in-this-text'"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"import pandas as pd\n",
"string = \"Replace the spaces in this text\"\n",
"re.sub(\" \", \"-\", string)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Replace-first space in this text'"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string = \"Replace first space in this text\"\n",
"re.sub(\" \", \"-\", string, count=1)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris = pd.read_csv('https://github.com/pandas-dev/pandas/raw/master/pandas/tests/data/iris.csv')"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"names = iris.columns.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name']"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"names"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['S-palL-ngth', 'S-palWidth', 'P-talL-ngth', 'P-talWidth', 'Nam-']"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[re.sub(\"e\", '-', name) for name in names]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parallel processing in Python\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Serial code:"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['Gratt', '1861', '1892'], ['Bob', '1892'], ['1871', 'Emmet', '1937']]"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"field_list"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"standard_fields = [assign_fields(ln) for ln in field_list]\n",
"standard_fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parallel code using 'multiprocessing' module - each line processed on a different processor."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['Gratt', '1861', '1892'], ['Bob', 'NA', '1892'], ['Emmet', '1871', '1937']]"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import multiprocessing\n",
"from multiprocessing import Pool\n",
"\n",
"try:\n",
" cpus = multiprocessing.cpu_count()\n",
"except NotImplementedError:\n",
" cpus = 2 # arbitrary default\n",
"\n",
"pool = Pool(processes=cpus)\n",
"pool.map(assign_fields, field_list)"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cpus"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}