Introduction To Pandas
Pandas is one of the widely used Python libraries for working with data. it is built on libraries like Matplotlib and NumPy. Pandas is great for data manipulation, data analysis, and data visualization.
In this tutorial we will see how pandas makes life really easy for a data analysis. Pandas can read and write data from and to CSV files or even databases easily.
Data Structures in Pandas
Series
A series is a one-dimensional object, like an array, list or could be understood as a column in table. similar to the array or list index each element in a series is assigned with a labeled index. By default, each item is given an numerical index label from 0 to N, where N is the length of the Series minus one.
How to create Series
The basic method to create a Series is to call .Series()
# import pandas
import pandas as pd
# create a Series with an arbitrary list
X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'])
X1
0 7
1 develbyte
2 3.14
3 Happy Learnning
dtype: object
Note:- when the Series contains elements of multiple different datatypes the dtype of the series will be the higher datatype
int32 > int64 > flot64 >.....>object
X2 = pd.Series([7, 5, 4, 3])
print(X2)
X3 = pd.Series([7, 5, 4., 3.])
print(X3)
0 7
1 5
2 4
3 3
dtype: int64
0 7.0
1 5.0
2 4.0
3 3.0
dtype: float64
creating a series with index index of the series elements can also be changes by simply passing a list of indexes, the list of elements and the list of indexes should be of same length or you will end up with error
X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'],
index=['A', 'B', 'C', 'D'])
print(X1)
A 7
B develbyte
C 3.14
D Happy Learnning
dtype: object
DataFrame
DataFrame is a two-dimensional labeled data structures with columns of same or different data types. Similar to tables in a database the DataFrame can hold multiple columns with multiple data types. You can also think of a DataFrame as a group of Series objects that share an index.
How to import Data in Dataframe
# Importing the dataset
dataset = pd.read_csv('../data/data.csv')
dataset
Country | Age | Salary | Purchased | |
---|---|---|---|---|
0 | France | 44.0 | 72000.0 | No |
1 | Spain | 27.0 | 48000.0 | Yes |
2 | Germany | 30.0 | 54000.0 | No |
3 | Spain | 38.0 | 61000.0 | No |
4 | Germany | 40.0 | NaN | Yes |
5 | France | 35.0 | 58000.0 | Yes |
6 | Spain | NaN | 52000.0 | No |
7 | France | 48.0 | 79000.0 | Yes |
8 | Germany | 50.0 | 83000.0 | No |
9 | France | 37.0 | 67000.0 | Yes |
How to inspect Data in Dataframe
- Very first information what we would like to know in a dataframe are
- number of columns
- number of records
- attribute names
- datatype of each attribute
we can get all these information by calling just one functioninfo()
it will give Concise summary of a DataFrame
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country 10 non-null object
Age 9 non-null float64
Salary 9 non-null float64
Purchased 10 non-null object
dtypes: float64(2), object(2)
memory usage: 392.0+ bytes
we can also use dtypes
to get the datatypes of each attribute
dataset.dtypes
Country object
Age float64
Salary float64
Purchased object
dtype: object
for just geting the column names in a dataframe use dataset.columns
# columns.values gives the column names in the DataFrame
dataset.columns.values
array(['Country', 'Age', 'Salary', 'Purchased'], dtype=object)
similarly for index values
# index.values gives the list of row indices
dataset.index.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- Probably the most useful function for inspecting a data set in a DataFrame is
describe()
it will return basic statistics about the dataset’s numeric columns
dataset.describe()
Age | Salary | |
---|---|---|
count | 9.000000 | 9.000000 |
mean | 38.777778 | 63777.777778 |
std | 7.693793 | 12265.579662 |
min | 27.000000 | 48000.000000 |
25% | 35.000000 | 54000.000000 |
50% | 38.000000 | 61000.000000 |
75% | 44.000000 | 72000.000000 |
max | 50.000000 | 83000.000000 |
- For take a look at the actual data
head()
andtail()
are the most useful function - head method shows first n rows from the DataFrame, default value of n is 5 - tail method shows last n rows from the DataFrame, default value of n is 5
dataset.head()
Country | Age | Salary | Purchased | |
---|---|---|---|---|
0 | France | 44.0 | 72000.0 | No |
1 | Spain | 27.0 | 48000.0 | Yes |
2 | Germany | 30.0 | 54000.0 | No |
3 | Spain | 38.0 | 61000.0 | No |
4 | Germany | 40.0 | NaN | Yes |
dataset.tail()
Country | Age | Salary | Purchased | |
---|---|---|---|---|
5 | France | 35.0 | 58000.0 | Yes |
6 | Spain | NaN | 52000.0 | No |
7 | France | 48.0 | 79000.0 | Yes |
8 | Germany | 50.0 | 83000.0 | No |
9 | France | 37.0 | 67000.0 | Yes |
Select and Index in DataFrames
There are three main options in pandas, which allows us to access the data in DataFrame, these are based on index and location of the rows and column, these options could be confusing for beginners but its quit simple once understood.
The selection methods are:
-
Selecting data by row numbers called integer-location based indexing/selection (
.iloc
)dataset.iloc[<row selection>, <column selection>]
-
Selecting data by label or by a conditional statement (
.loc
)dataset.loc[<row selection>, <column selection>]
-
Selecting in a hybrid approach(
.ix
)dataset.ix[<row selection>, <column selection>]
Single selections using iloc and DataFrame
Rows:
R1 = dataset.iloc[0] # first row of data frame - Note a Series data type output.
R2 = dataset.iloc[1] # second row of data frame
R3 = dataset.iloc[-1] # last row of data frame
print ("---------------------------------------")
print (R1)
print ("---------------------------------------")
print (R2)
print ("---------------------------------------")
print (R3)
---------------------------------------
Country France
Age 44
Salary 72000
Purchased No
Name: 0, dtype: object
---------------------------------------
Country Spain
Age 27
Salary 48000
Purchased Yes
Name: 1, dtype: object
---------------------------------------
Country France
Age 37
Salary 67000
Purchased Yes
Name: 9, dtype: object
Columns:
C1 = dataset.iloc[:,0] # first column of data frame (first_name)
C2 = dataset.iloc[:,1] # second column of data frame (last_name)
C3 = dataset.iloc[:,-1] # last column of data frame (id)
print ("---------------------------------------")
print (C1)
print ("---------------------------------------")
print (C2)
print ("---------------------------------------")
print (C3)
---------------------------------------
0 France
1 Spain
2 Germany
3 Spain
4 Germany
5 France
6 Spain
7 France
8 Germany
9 France
Name: Country, dtype: object
---------------------------------------
0 44.0
1 27.0
2 30.0
3 38.0
4 40.0
5 35.0
6 NaN
7 48.0
8 50.0
9 37.0
Name: Age, dtype: float64
---------------------------------------
0 No
1 Yes
2 No
3 No
4 Yes
5 Yes
6 No
7 Yes
8 No
9 Yes
Name: Purchased, dtype: object
Multiple row and column selections using iloc
and DataFrame
MR1 = dataset.iloc[0:5] # first five rows of dataframe
MR2 = dataset.iloc[:, 0:2] # first two columns of data frame with all rows
MR3 = dataset.iloc[[0, 3, 6, 9], [0, 3]] # 1st, 4th, 7th, 9th row + 1st 3th 4th columns.
MR4 = dataset.iloc[0:5, 1:3] # first 5 rows and 4th, 5th columns of data frame.
print ("---------------------------------------")
print (MR1)
print ("---------------------------------------")
print (MR2)
print ("---------------------------------------")
print (MR3)
print ("---------------------------------------")
print (MR4)
---------------------------------------
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
---------------------------------------
Country Age
0 France 44.0
1 Spain 27.0
2 Germany 30.0
3 Spain 38.0
4 Germany 40.0
5 France 35.0
6 Spain NaN
7 France 48.0
8 Germany 50.0
9 France 37.0
---------------------------------------
Country Purchased
0 France No
3 Spain No
6 Spain No
9 France Yes
---------------------------------------
Age Salary
0 44.0 72000.0
1 27.0 48000.0
2 30.0 54000.0
3 38.0 61000.0
4 40.0 NaN
.iloc
returns a Pandas Series when one only row or Column is selected
print(".iloc returns a Pandas Series when one only row or Column is selected")
print(type(dataset.iloc[0]))
print(type(dataset.iloc[:,1]))
.iloc returns a Pandas Series when one only row or Column is selected
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
.iloc
returns a Pandas DataFrame when multiple rows or Columns are selected
print("\n .iloc returns a Pandas DataFrame when multiple rows or Columns are selected")
print(type(dataset.iloc[0:2]))
print(type(dataset.iloc[:,1:3]))
print(type(dataset.iloc[1:2, 3:6]))
.iloc returns a Pandas DataFrame when multiple rows or Columns are selected
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
.iloc
returns a Pandas Series when multiple rows are selected with only one column
print("\n .iloc returns a Pandas Series when multiple rows are selected with only one column")
print(type(dataset.iloc[1:2, 3]))
.iloc returns a Pandas Series when multiple rows are selected with only one column
<class 'pandas.core.series.Series'>
above output could easily converted to Pandas DataFrame
by passing a single-valued list as column index
print("\n Above output could easly converted to Pandas DataFrame by passing a single-valued list as column index")
print(type(dataset.iloc[1:2, [3]]))
Above output could easily converted to Pandas DataFrame by passing a single-valued list as column index
<class 'pandas.core.frame.DataFrame'>
some more fun with iloc
X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, 3]
print ("---------------------------------------")
print(type(X))
print(X)
print ("---------------------------------------")
print(type(Y))
print(Y)
---------------------------------------
<class 'pandas.core.frame.DataFrame'>
Country Age Salary
0 France 44.0 72000.0
1 Spain 27.0 48000.0
2 Germany 30.0 54000.0
3 Spain 38.0 61000.0
4 Germany 40.0 NaN
5 France 35.0 58000.0
6 Spain NaN 52000.0
7 France 48.0 79000.0
8 Germany 50.0 83000.0
9 France 37.0 67000.0
---------------------------------------
<class 'pandas.core.series.Series'>
0 No
1 Yes
2 No
3 No
4 Yes
5 Yes
6 No
7 Yes
8 No
9 Yes
Name: Purchased, dtype: object
Selecting pandas data using loc
:-
The Pandas loc
indexer can be used with DataFrames in two different scenarios:
a.) Selecting rows by label/index
b.) Selecting rows with a Boolean/conditional lookup
The loc
indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>]
Index can be in a DataFrame by using set_index()
method
dataset.set_index("Country", inplace=True)
dataset
Age | Salary | Purchased | |
---|---|---|---|
Country | |||
France | 44.0 | 72000.0 | No |
Spain | 27.0 | 48000.0 | Yes |
Germany | 30.0 | 54000.0 | No |
Spain | 38.0 | 61000.0 | No |
Germany | 40.0 | NaN | Yes |
France | 35.0 | 58000.0 | Yes |
Spain | NaN | 52000.0 | No |
France | 48.0 | 79000.0 | Yes |
Germany | 50.0 | 83000.0 | No |
France | 37.0 | 67000.0 | Yes |
Selecting rows by index
dataset.loc['France']
Age | Salary | Purchased | |
---|---|---|---|
Country | |||
France | 44.0 | 72000.0 | No |
France | 35.0 | 58000.0 | Yes |
France | 48.0 | 79000.0 | Yes |
France | 37.0 | 67000.0 | Yes |
Selecting rows by label/index
dataset.loc[['France', 'Spain'], ['Age', 'Salary']]
Age | Salary | |
---|---|---|
Country | ||
France | 44.0 | 72000.0 |
France | 35.0 | 58000.0 |
France | 48.0 | 79000.0 |
France | 37.0 | 67000.0 |
Spain | 27.0 | 48000.0 |
Spain | 38.0 | 61000.0 |
Spain | NaN | 52000.0 |
dataset.loc[['France', 'Spain'], 'Age':'Purchased']
Age | Salary | Purchased | |
---|---|---|---|
Country | |||
France | 44.0 | 72000.0 | No |
France | 35.0 | 58000.0 | Yes |
France | 48.0 | 79000.0 | Yes |
France | 37.0 | 67000.0 | Yes |
Spain | 27.0 | 48000.0 | Yes |
Spain | 38.0 | 61000.0 | No |
Spain | NaN | 52000.0 | No |
resetting the index and setting a new index
dataset.reset_index(inplace=True)
dataset.set_index('Age', inplace=True)
dataset
Country | Salary | Purchased | |
---|---|---|---|
Age | |||
44.0 | France | 72000.0 | No |
27.0 | Spain | 48000.0 | Yes |
30.0 | Germany | 54000.0 | No |
38.0 | Spain | 61000.0 | No |
40.0 | Germany | NaN | Yes |
35.0 | France | 58000.0 | Yes |
NaN | Spain | 52000.0 | No |
48.0 | France | 79000.0 | Yes |
50.0 | Germany | 83000.0 | No |
37.0 | France | 67000.0 | Yes |
Selecting rows by label/index
dataset.loc[[44.0, 27.0], ['Country', 'Salary']]
Country | Salary | |
---|---|---|
Age | ||
44.0 | France | 72000.0 |
27.0 | Spain | 48000.0 |
Selecting rows with a Boolean/conditional lookup
dataset.loc[dataset['Country'] == 'France', ['Country', 'Salary']]
Country | Salary | |
---|---|---|
Age | ||
44.0 | France | 72000.0 |
35.0 | France | 58000.0 |
48.0 | France | 79000.0 |
37.0 | France | 67000.0 |
Selections can be achieved outside of the main .loc
for clarity
Form a separate variable with your selections:
like in the example below Select only the True values in idx
and only the 3 columns specified:
idx = dataset['Country'].apply(lambda x: x.lower() == 'france')
dataset.loc[idx, ['Country', 'Salary']]
Country | Salary | |
---|---|---|
Age | ||
44.0 | France | 72000.0 |
35.0 | France | 58000.0 |
48.0 | France | 79000.0 |
37.0 | France | 67000.0 |
Selecting pandas data using ix
:-
ix[]
indexer is a hybrid of.loc
and.iloc
,ix
is label based indexer, it behave just like.loc
, it also supports integer based indexing like.iloc
ix
indexing works just the same as.loc
when passed strings
dataset.ix[['Country']] == dataset.loc[['Country']]
Country | Salary | Purchased | |
---|---|---|---|
Age | |||
Country | False | False | False |
ix
indexing works the same as .iloc
when passed integers
dataset.reset_index(inplace=True)
dataset.ix[[2]] == dataset.iloc[[2]]
Age | Country | Salary | Purchased | |
---|---|---|---|---|
2 | True | True | True | True |
Add and Delete in DataFrame
- Adding row in DataFrame
Note:- General recommendation for adding a row is to use .loc
to insert rows in DataFrame
If you would use .ix
, you might try to reference a numerically valued index with the index value and accidentally overwrite an existing row of your DataFrame.
dataset.loc[10] = ['India', 27, 65000, 'Yes']
dataset.ix[11] = ['India', 26, 60000, 'Yes']
dataset
Age | Country | Salary | Purchased | |
---|---|---|---|---|
0 | 44 | France | 72000.0 | No |
1 | 27 | Spain | 48000.0 | Yes |
2 | 30 | Germany | 54000.0 | No |
3 | 38 | Spain | 61000.0 | No |
4 | 40 | Germany | NaN | Yes |
5 | 35 | France | 58000.0 | Yes |
6 | NaN | Spain | 52000.0 | No |
7 | 48 | France | 79000.0 | Yes |
8 | 50 | Germany | 83000.0 | No |
9 | 37 | France | 67000.0 | Yes |
10 | India | 27 | 65000.0 | Yes |
11 | India | 26 | 60000.0 | Yes |
- Adding column in DataFrame
Columns in DataFrame is basically a series,so adding a column in a DataFrame is as simple as assigning a new column to a DataFrame
column = pd.Series(range(1,13), dtype=float)
column
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
dtype: float64
#adding column in DataFrame
dataset['id'] = column
dataset
Age | Country | Salary | Purchased | id | |
---|---|---|---|---|---|
0 | 44 | France | 72000.0 | No | 1.0 |
1 | 27 | Spain | 48000.0 | Yes | 2.0 |
2 | 30 | Germany | 54000.0 | No | 3.0 |
3 | 38 | Spain | 61000.0 | No | 4.0 |
4 | 40 | Germany | NaN | Yes | 5.0 |
5 | 35 | France | 58000.0 | Yes | 6.0 |
6 | NaN | Spain | 52000.0 | No | 7.0 |
7 | 48 | France | 79000.0 | Yes | 8.0 |
8 | 50 | Germany | 83000.0 | No | 9.0 |
9 | 37 | France | 67000.0 | Yes | 10.0 |
10 | India | 27 | 65000.0 | Yes | 11.0 |
11 | India | 26 | 60000.0 | Yes | 12.0 |
- Delete a column from DataFrame by column name
df = dataset.drop('Purchased', axis=1)
df
Age | Country | Salary | id | |
---|---|---|---|---|
0 | 44 | France | 72000.0 | 1.0 |
1 | 27 | Spain | 48000.0 | 2.0 |
2 | 30 | Germany | 54000.0 | 3.0 |
3 | 38 | Spain | 61000.0 | 4.0 |
4 | 40 | Germany | NaN | 5.0 |
5 | 35 | France | 58000.0 | 6.0 |
6 | NaN | Spain | 52000.0 | 7.0 |
7 | 48 | France | 79000.0 | 8.0 |
8 | 50 | Germany | 83000.0 | 9.0 |
9 | 37 | France | 67000.0 | 10.0 |
10 | India | 27 | 65000.0 | 11.0 |
11 | India | 26 | 60000.0 | 12.0 |
- Delete a column from DataFrame by row index
one thing to be noted here is if the in-place is set to be True the deleting happens the the existing dataframe by default in-place is False, which creates new DataFrame with deleted rows
dataset.drop(dataset.index[2:7], inplace=True)
dataset
Age | Country | Salary | Purchased | id | |
---|---|---|---|---|---|
0 | 44 | France | 72000.0 | No | 1.0 |
1 | 27 | Spain | 48000.0 | Yes | 2.0 |
7 | 48 | France | 79000.0 | Yes | 8.0 |
8 | 50 | Germany | 83000.0 | No | 9.0 |
9 | 37 | France | 67000.0 | Yes | 10.0 |
10 | India | 27 | 65000.0 | Yes | 11.0 |
11 | India | 26 | 60000.0 | Yes | 12.0 |