Intro to Python, Pandas and a bit of Matplotlib

Printing

To print a string or any other object, call print(obj):

In [24]:
print("Hello World test")
Hello World test

Note that there are no semicolons at the end.

Variables and Types

  • Python is object oriented and dynamically typed, not "statically typed"
  • Variables are created by assignment, without declaring their type
  • Every variable refers to an object
  • Variables and functions are named 'lowercased_with_underscores'
  • There are different types of objects
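
A small sketch of the points above (the variable names are made up for illustration): the same name can be bound to objects of different types, and type() reports the type of the object it currently refers to.

```python
# The same variable can refer to objects of different types over time.
value = 7
type_before = type(value).__name__   # 'int'

value = "seven"
type_after = type(value).__name__    # 'str'

# Everything is an object, so even a plain integer has methods.
bits = (7).bit_length()              # 7 is 0b111 in binary, so 3 bits

print(type_before, type_after, bits)
```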

Numbers

In [25]:
a = 7
print(a)
print(7 + 8)

print(a + 7)
7
15
14

Floating point numbers

In [60]:
b = 7.0
print(b)
7.0

Strings

In [27]:
c = 'hello'
print(c)
c = "hello"
print(c)
hello
hello

Boolean values / None

In [28]:
x = True
y = False

z = None # similar to null in Java

Lists (mutable)

In [29]:
d = []
d.append(5)
print(d)

e = ['a','b','c','d','e']
print(e[3])
[5]
d

Tuple (immutable)

In [30]:
f = (1,2,3)
print(f)
(1, 2, 3)
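
To see what "mutable" and "immutable" mean in practice, here is a minimal sketch (the names numbers_list and numbers_tuple are only for illustration). Assigning to an element works for a list, but raises a TypeError for a tuple.

```python
numbers_list = [1, 2, 3]
numbers_list[0] = 10            # lists can be changed in place

numbers_tuple = (1, 2, 3)
try:
    numbers_tuple[0] = 10       # tuples cannot be changed
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(numbers_list)
print(numbers_tuple)
print(tuple_changed)
```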

Conditions

Simple Conditions

In [31]:
x = 2
print(x == 2) # prints out True
print(x == 3) # prints out False
print(x < 3) # prints out True
True
False
True

If Statement

In [32]:
if x == 2:
    print("x is two")
elif x==3:
    print("x is three")
else:
    print("x is neither two nor three")
x is two
  • Python uses indentation to define code blocks, instead of brackets
    • no '{', '}' any more :-)

Boolean operators

In [33]:
name = "John"
age = 23
if name == "John" and age == 23:
    print("Your name is John, and you are also 23 years old.")

if name == "John" or name == "Rick":
    print("Your name is either John or Rick.")
Your name is John, and you are also 23 years old.
Your name is either John or Rick.

"In" Operator

The "in" operator can be used to check whether a specified object exists within an iterable container, such as a list:

In [34]:
name = "John"
if name in ["John", "Rick"]:
    print("Your name is either John or Rick.")
Your name is either John or Rick.
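
The "in" operator is not limited to lists. As a short sketch (the values are made up for illustration), it also does substring checks on strings, checks the keys of a dictionary, and can be negated with "not in":

```python
has_substring = "oh" in "John"                       # substring check on a string
has_key = "age" in {"name": "John", "age": 23}       # dictionaries are checked by key
is_missing = "Paul" not in ["John", "Rick"]          # negated form

print(has_substring, has_key, is_missing)
```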

"not" operator

Using "not" before a boolean expression inverts it:

In [35]:
print(not False)
True

Loops

For loop

In [36]:
primes = [2, 3, 5, 7]
for prime in primes:
    print(prime)
2
3
5
7

To get behaviour similar to an index-based for loop in Java, you can use range:

In [37]:
for i in range(5):
    print(i)
0
1
2
3
4
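
range also accepts a start, a stop and a step, and the built-in enumerate gives you the position together with the element. A short sketch (the variable names are only for illustration):

```python
evens = list(range(0, 10, 2))        # start at 0, stop before 10, step 2
countdown = list(range(5, 0, -1))    # a negative step counts down

primes = [2, 3, 5, 7]
pairs = []
for position, prime in enumerate(primes):
    pairs.append((position, prime))  # position starts at 0

print(evens)
print(countdown)
print(pairs)
```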

While loops

In [38]:
count = 0
while count < 5:
    print(count)
    count += 1  # This is the same as count = count + 1
0
1
2
3
4

Break and continue

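  • break and continue work as in other programming languages
  • There is no do-while loop in Python, but you can emulate one with:
In [39]:
count = 0
while True:
    print(count)  # the body runs at least once
    count += 1
    if count >= 5:  # the exit condition is checked at the end
        break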

That's it for the moment - continue with pandas

Load data

As a first example we will load a dataset (csv) from a URL.

But as a first step we have to import Pandas:

In [40]:
import pandas as pd

Now you can access pandas with 'pd'.

Loading datasets is done with one of the pd.read_* functions, such as pd.read_csv (used here), pd.read_excel or pd.read_json.

Loading a dataset from the filesystem is as easy as placing the corresponding file next to the notebook and giving the filename as parameter, like

pd.read_csv('myfile.csv')
In [41]:
tips_dataset = pd.read_csv('http://tinyurl.com/tips-csv')
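
If you have no file or URL at hand, you can try read_csv on CSV text held in memory. This is only a sketch: the CSV content below is made up for illustration and is not the tips dataset, and io.StringIO simply makes a string behave like a file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file; the rows are illustrative only.
csv_text = """total_bill,tip,day
10.0,1.5,Sat
20.0,3.0,Sun
"""

toy_dataset = pd.read_csv(io.StringIO(csv_text))

print(toy_dataset.shape)            # (rows, columns)
print(list(toy_dataset.columns))    # the header line becomes the column names
```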

Viewing Data

If you want to see only a few datapoints, you can use the .head() function (shows the top rows) or the .tail() function (shows the bottom rows).

In [42]:
tips_dataset.head()
Out[42]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
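
Both functions take an optional number of rows and default to five. A minimal sketch on a made-up one-column table (the name toy and its values are only for illustration):

```python
import pandas as pd

# Made-up data, only to show how many rows come back.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]})

first_two = toy.head(2)      # the first 2 rows
last_three = toy.tail(3)     # the last 3 rows
default_head = toy.head()    # without an argument: 5 rows

print(first_two)
print(last_three)
print(len(default_head))
```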

On the left of each row you see increasing numbers. This is the index. It is not contained in the file; pandas generates it to label the rows and uses it to align and join datasets.

In [43]:
tips_dataset.index
Out[43]:
RangeIndex(start=0, stop=244, step=1)

To get an overview of your data (at least numeric data) you can also use the .describe() function.

In [44]:
tips_dataset.describe()
Out[44]:
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
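
By default .describe() skips non-numeric columns, which is why sex, smoker, day and time are missing above. As a sketch on made-up data (toy and its values are only for illustration), passing include="all" also summarises text columns with count, unique, top and freq:

```python
import pandas as pd

# Made-up data with one numeric and one text column.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "day": ["Sat", "Sun", "Sun"],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # text columns as well

print(numeric_summary)
print(full_summary)
```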

Sorting

Sorting is done by .sort_values(by='column')

In [45]:
tips_dataset_sorted = tips_dataset.sort_values(by='total_bill')
tips_dataset_sorted
Out[45]:
total_bill tip sex smoker day time size
67 3.07 1.00 Female Yes Sat Dinner 1
92 5.75 1.00 Female Yes Fri Dinner 2
111 7.25 1.00 Female No Sat Dinner 1
172 7.25 5.15 Male Yes Sun Dinner 2
149 7.51 2.00 Male No Thur Lunch 2
195 7.56 1.44 Male No Thur Lunch 2
218 7.74 1.44 Male Yes Sat Dinner 2
145 8.35 1.50 Female No Thur Lunch 2
135 8.51 1.25 Female No Thur Lunch 2
126 8.52 1.48 Male No Thur Lunch 2
222 8.58 1.92 Male Yes Fri Lunch 1
6 8.77 2.00 Male No Sun Dinner 2
30 9.55 1.45 Male No Sat Dinner 2
178 9.60 4.00 Female Yes Sun Dinner 2
43 9.68 1.32 Male No Sun Dinner 2
148 9.78 1.73 Male No Thur Lunch 2
53 9.94 1.56 Male No Sun Dinner 2
235 10.07 1.25 Male No Sat Dinner 2
82 10.07 1.83 Female No Thur Lunch 1
226 10.09 2.00 Female Yes Fri Lunch 2
10 10.27 1.71 Male No Sun Dinner 2
51 10.29 2.60 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
136 10.33 2.00 Female No Thur Lunch 2
1 10.34 1.66 Male No Sun Dinner 3
196 10.34 2.00 Male Yes Thur Lunch 2
75 10.51 1.25 Male No Sat Dinner 2
168 10.59 1.61 Female Yes Sat Dinner 2
169 10.63 2.00 Female Yes Sat Dinner 2
117 10.65 1.50 Female No Thur Lunch 2
... ... ... ... ... ... ... ...
44 30.40 5.60 Male No Sun Dinner 4
187 30.46 2.00 Male Yes Sun Dinner 5
39 31.27 5.00 Male No Sat Dinner 3
167 31.71 4.50 Male No Sun Dinner 4
173 31.85 3.18 Male Yes Sun Dinner 2
47 32.40 6.00 Male No Sun Dinner 4
83 32.68 5.00 Male Yes Thur Lunch 2
237 32.83 1.17 Male Yes Sat Dinner 2
175 32.90 3.11 Male Yes Sun Dinner 2
141 34.30 6.70 Male No Thur Lunch 6
179 34.63 3.55 Male Yes Sun Dinner 2
180 34.65 3.68 Male Yes Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
11 35.26 5.00 Female No Sun Dinner 4
238 35.83 4.67 Female No Sat Dinner 3
56 38.01 3.00 Male Yes Sat Dinner 4
112 38.07 4.00 Male No Sun Dinner 3
207 38.73 3.00 Male Yes Sat Dinner 4
23 39.42 7.58 Male No Sat Dinner 4
95 40.17 4.73 Male Yes Fri Dinner 4
184 40.55 3.00 Male Yes Sun Dinner 2
142 41.19 5.00 Male No Thur Lunch 5
197 43.11 5.00 Female Yes Thur Lunch 4
102 44.30 2.50 Female Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
156 48.17 5.00 Male No Sun Dinner 6
59 48.27 6.73 Male No Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
170 50.81 10.00 Male Yes Sat Dinner 3

244 rows × 7 columns
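
sort_values returns a new, sorted table and leaves the original untouched. It can also sort in descending order or by several columns. A minimal sketch on made-up data (toy and its values are only for illustration):

```python
import pandas as pd

# Made-up data, only to show the ordering.
toy = pd.DataFrame({
    "day": ["Sun", "Sat", "Sun", "Sat"],
    "total_bill": [20.0, 15.0, 10.0, 30.0],
})

# Largest bill first.
by_bill_desc = toy.sort_values(by="total_bill", ascending=False)

# Sort by day first, then by bill within each day.
by_day_then_bill = toy.sort_values(by=["day", "total_bill"])

print(by_bill_desc)
print(by_day_then_bill)
```

Note that the index values travel with their rows, just like in the output above.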

Selection

Selecting a single column:

In [46]:
tips_dataset_sorted['day']
Out[46]:
67      Sat
92      Fri
111     Sat
172     Sun
149    Thur
195    Thur
218     Sat
145    Thur
135    Thur
126    Thur
222     Fri
6       Sun
30      Sat
178     Sun
43      Sun
148    Thur
53      Sun
235     Sat
82     Thur
226     Fri
10      Sun
51      Sun
16      Sun
136    Thur
1       Sun
196    Thur
75      Sat
168     Sat
169     Sat
117    Thur
       ... 
44      Sun
187     Sun
39      Sat
167     Sun
173     Sun
47      Sun
83     Thur
237     Sat
175     Sun
141    Thur
179     Sun
180     Sun
52      Sun
85     Thur
11      Sun
238     Sat
56      Sat
112     Sun
207     Sat
23      Sat
95      Fri
184     Sun
142    Thur
197    Thur
102     Sat
182     Sun
156     Sun
59      Sat
212     Sat
170     Sat
Name: day, Length: 244, dtype: object

Selecting multiple columns:

In [47]:
tips_dataset_sorted[['day','total_bill']]
Out[47]:
day total_bill
67 Sat 3.07
92 Fri 5.75
111 Sat 7.25
172 Sun 7.25
149 Thur 7.51
195 Thur 7.56
218 Sat 7.74
145 Thur 8.35
135 Thur 8.51
126 Thur 8.52
222 Fri 8.58
6 Sun 8.77
30 Sat 9.55
178 Sun 9.60
43 Sun 9.68
148 Thur 9.78
53 Sun 9.94
235 Sat 10.07
82 Thur 10.07
226 Fri 10.09
10 Sun 10.27
51 Sun 10.29
16 Sun 10.33
136 Thur 10.33
1 Sun 10.34
196 Thur 10.34
75 Sat 10.51
168 Sat 10.59
169 Sat 10.63
117 Thur 10.65
... ... ...
44 Sun 30.40
187 Sun 30.46
39 Sat 31.27
167 Sun 31.71
173 Sun 31.85
47 Sun 32.40
83 Thur 32.68
237 Sat 32.83
175 Sun 32.90
141 Thur 34.30
179 Sun 34.63
180 Sun 34.65
52 Sun 34.81
85 Thur 34.83
11 Sun 35.26
238 Sat 35.83
56 Sat 38.01
112 Sun 38.07
207 Sat 38.73
23 Sat 39.42
95 Fri 40.17
184 Sun 40.55
142 Thur 41.19
197 Thur 43.11
102 Sat 44.30
182 Sun 45.35
156 Sun 48.17
59 Sat 48.27
212 Sat 48.33
170 Sat 50.81

244 rows × 2 columns
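
The two forms return different kinds of objects: a single column name gives a Series, a list of names gives a DataFrame (even if the list has only one entry). A small sketch on made-up data (toy and its values are only for illustration):

```python
import pandas as pd

# Made-up data, only to show the returned types.
toy = pd.DataFrame({
    "day": ["Sat", "Sun"],
    "total_bill": [15.0, 20.0],
    "tip": [2.0, 3.0],
})

one_column = toy["day"]                    # a Series
one_column_frame = toy[["day"]]            # a DataFrame with one column
two_columns = toy[["day", "total_bill"]]   # a DataFrame with two columns

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(list(two_columns.columns))
```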

Boolean Indexing / Filtering

Using a single column’s values to select data.

In [48]:
tips_dataset_sorted[  tips_dataset_sorted['total_bill'] > 10.0   ]
Out[48]:
total_bill tip sex smoker day time size
235 10.07 1.25 Male No Sat Dinner 2
82 10.07 1.83 Female No Thur Lunch 1
226 10.09 2.00 Female Yes Fri Lunch 2
10 10.27 1.71 Male No Sun Dinner 2
51 10.29 2.60 Female No Sun Dinner 2
16 10.33 1.67 Female No Sun Dinner 3
136 10.33 2.00 Female No Thur Lunch 2
1 10.34 1.66 Male No Sun Dinner 3
196 10.34 2.00 Male Yes Thur Lunch 2
75 10.51 1.25 Male No Sat Dinner 2
168 10.59 1.61 Female Yes Sat Dinner 2
169 10.63 2.00 Female Yes Sat Dinner 2
117 10.65 1.50 Female No Thur Lunch 2
233 10.77 1.47 Male No Sat Dinner 2
62 11.02 1.98 Male Yes Sat Dinner 2
132 11.17 1.50 Female No Thur Lunch 2
58 11.24 1.76 Male Yes Sat Dinner 2
100 11.35 2.50 Female Yes Fri Dinner 2
128 11.38 2.00 Female No Thur Lunch 2
217 11.59 1.50 Male Yes Sat Dinner 2
232 11.61 3.39 Male No Sat Dinner 2
120 11.69 2.31 Male No Thur Lunch 2
147 11.87 1.63 Female No Thur Lunch 2
70 12.02 1.97 Male No Sat Dinner 2
97 12.03 1.50 Male Yes Fri Dinner 2
220 12.16 2.20 Male Yes Fri Lunch 2
133 12.26 2.00 Female No Thur Lunch 2
118 12.43 1.80 Female No Thur Lunch 2
99 12.46 1.50 Male No Fri Dinner 2
124 12.48 2.52 Female No Thur Lunch 2
... ... ... ... ... ... ... ...
44 30.40 5.60 Male No Sun Dinner 4
187 30.46 2.00 Male Yes Sun Dinner 5
39 31.27 5.00 Male No Sat Dinner 3
167 31.71 4.50 Male No Sun Dinner 4
173 31.85 3.18 Male Yes Sun Dinner 2
47 32.40 6.00 Male No Sun Dinner 4
83 32.68 5.00 Male Yes Thur Lunch 2
237 32.83 1.17 Male Yes Sat Dinner 2
175 32.90 3.11 Male Yes Sun Dinner 2
141 34.30 6.70 Male No Thur Lunch 6
179 34.63 3.55 Male Yes Sun Dinner 2
180 34.65 3.68 Male Yes Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
11 35.26 5.00 Female No Sun Dinner 4
238 35.83 4.67 Female No Sat Dinner 3
56 38.01 3.00 Male Yes Sat Dinner 4
112 38.07 4.00 Male No Sun Dinner 3
207 38.73 3.00 Male Yes Sat Dinner 4
23 39.42 7.58 Male No Sat Dinner 4
95 40.17 4.73 Male Yes Fri Dinner 4
184 40.55 3.00 Male Yes Sun Dinner 2
142 41.19 5.00 Male No Thur Lunch 5
197 43.11 5.00 Female Yes Thur Lunch 4
102 44.30 2.50 Female Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
156 48.17 5.00 Male No Sun Dinner 6
59 48.27 6.73 Male No Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
170 50.81 10.00 Male Yes Sat Dinner 3

227 rows × 7 columns
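
The expression inside the brackets is itself a column of True/False values, and several such conditions can be combined with & (and) and | (or). Each condition needs its own parentheses. A minimal sketch on made-up data (toy and its values are only for illustration):

```python
import pandas as pd

# Made-up data, only to show how the masks combine.
toy = pd.DataFrame({
    "total_bill": [8.0, 12.0, 25.0, 40.0],
    "day": ["Sat", "Sun", "Sun", "Sat"],
})

mask = toy["total_bill"] > 10.0     # a boolean Series, one value per row

sunday_over_ten = toy[(toy["total_bill"] > 10.0) & (toy["day"] == "Sun")]
cheap_or_saturday = toy[(toy["total_bill"] < 10.0) | (toy["day"] == "Sat")]

print(list(mask))
print(sunday_over_ten)
print(cheap_or_saturday)
```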

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
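
The three steps can be sketched by hand. This is only an illustration on made-up data (toy, results and combined are hypothetical names); in practice pandas does all of this for you in one call.

```python
import pandas as pd

# Made-up data, only to illustrate the three steps.
toy = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat", "Sun"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

results = {}
for day, group in toy.groupby("day"):      # split: one sub-table per day
    results[day] = group["tip"].mean()     # apply: a function on each group

combined = pd.Series(results, name="tip")  # combine: collect the results

print(combined)
```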
In [49]:
tips_dataset_sorted.groupby('day')
Out[49]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000002492C4AC7F0>

The result of the groupby function is just an intermediate result; you have to decide how to "combine" the values within each group, for example with:

  • sum()
  • mean()
  • std()
  • min()
  • max()
In [50]:
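# numeric_only=True restricts the result to the numeric columns
# (required in pandas 2.x, where text columns are no longer dropped silently)
tips_dataset_sorted.groupby('day').mean(numeric_only=True)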
Out[50]:
      total_bill       tip      size
day
Fri    17.151579  2.734737  2.105263
Sat    20.441379  2.993103  2.517241
Sun    21.410000  3.255132  2.842105
Thur   17.682742  2.771452  2.451613
In [51]:
tips_dataset_sorted.groupby('day').sum(numeric_only=True)
Out[51]:
      total_bill     tip  size
day
Fri       325.88   51.96    40
Sat      1778.40  260.40   219
Sun      1627.16  247.39   216
Thur     1096.33  171.83   152

Note that in the resulting dataset the day column becomes the index.
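
If you would rather have the group key as a normal column again, .reset_index() moves it back. A short sketch on made-up data (toy and its values are only for illustration), which also shows how to aggregate just one column:

```python
import pandas as pd

# Made-up data, only to show where the group key ends up.
toy = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat"],
    "tip": [1.0, 2.0, 3.0],
})

grouped = toy.groupby("day")["tip"].sum()   # 'day' is the index here
flat = grouped.reset_index()                # 'day' is a regular column again

print(grouped)
print(flat)
```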

Plotting

For plotting we need an additional import statement:

In [52]:
import matplotlib.pyplot as plt

Afterwards we can call the function .plot()

In [53]:
tips_dataset['total_bill'].plot()
plt.show()

Plotting methods allow for a handful of plot styles other than the default line plot. The style is selected with the kind keyword argument to plot(). The options include:

  • ‘bar’ or ‘barh’ for bar plots
  • ‘hist’ for histogram
  • ‘box’ for boxplot
  • ‘kde’ or ‘density’ for density plots
  • ‘area’ for area plots
  • ‘scatter’ for scatter plots
  • ‘hexbin’ for hexagonal bin plots
  • ‘pie’ for pie plots
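
As a quick sketch of the kind argument on made-up data (toy and its values are only for illustration): the call below draws a bar plot with one bar per row. The "Agg" backend is an assumption so that the example also runs without a display; in a notebook you would leave that line out and call plt.show() as in the other cells.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, only to show the 'kind' argument.
toy = pd.DataFrame({"tip": [1.0, 2.0, 3.0, 5.0]})

ax_bar = toy["tip"].plot(kind="bar")   # returns the matplotlib Axes it drew on
bar_count = len(ax_bar.patches)        # one rectangle per row

print(bar_count)
plt.close("all")
```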
In [54]:
tips_dataset['total_bill'].plot(kind='hist')
plt.show()
In [55]:
tips_dataset.plot(kind='scatter', x='total_bill', y='tip')
plt.show()
In [56]:
huge_tips = tips_dataset[tips_dataset['tip'] > 5]
huge_tips.head()
Out[56]:
    total_bill   tip     sex smoker  day    time  size
23       39.42  7.58    Male     No  Sat  Dinner     4
44       30.40  5.60    Male     No  Sun  Dinner     4
47       32.40  6.00    Male     No  Sun  Dinner     4
52       34.81  5.20  Female     No  Sun  Dinner     4
59       48.27  6.73    Male     No  Sat  Dinner     4

There is also a nice annotation function which allows you to add a label (for example, the weekday) to each point:

In [57]:
huge_tips.plot(kind='scatter', x='total_bill', y='tip')

for index, total_bill, tip, sex, smoker, day, time, size in huge_tips.itertuples():
    plt.annotate(
        day, # text to print
        (total_bill, tip) # position in (x, y)
    )
    
plt.show()

Plot categorical data:

In [58]:
plt.figure(figsize=(10, 8))  # make the image a bit bigger

for name, group in tips_dataset.groupby('day'):
    plt.scatter(group['total_bill'], group['tip'], label=name)

# set the axis labels
plt.xlabel("total_bill")
plt.ylabel("tip")
plt.legend()

plt.show()