Introduction To Pandas

Pandas is one of the widely used Python libraries for working with data. it is built on libraries like Matplotlib and NumPy. Pandas is great for data manipulation, data analysis, and data visualization.

In this tutorial we will see how pandas makes life really easy for a data analysis. Pandas can read and write data from and to CSV files or even databases easily.

Data Structures in Pandas

Series

A series is a one-dimensional object, like an array, list or could be understood as a column in table. similar to the array or list index each element in a series is assigned with a labeled index. By default, each item is given an numerical index label from 0 to N, where N is the length of the Series minus one.

How to create Series The basic method to create a Series is to call .Series()

# import pandas
import pandas as pd

# create a Series with an arbitrary list
X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'])
X1

                7
        develbyte
             3.14
  Happy Learnning
dtype: object

Note:- when the Series contains elements of multiple different datatypes the dtype of the series will be the higher datatype

int32 > int64 > flot64 >.....>object

X2 = pd.Series([7, 5, 4, 3])
print(X2)

X3 = pd.Series([7, 5, 4., 3.])
print(X3)

  7
  5
  4
  3
dtype: int64
  7.0
  5.0
  4.0
  3.0
dtype: float64

creating a series with index index of the series elements can also be changes by simply passing a list of indexes, the list of elements and the list of indexes should be of same length or you will end up with error

X1 = pd.Series([7, 'develbyte', 3.14, 'Happy Learnning'],
              index=['A', 'B', 'C', 'D'])

print(X1)

A                  7
B          develbyte
C               3.14
D    Happy Learnning
dtype: object

DataFrame

DataFrame is a two-dimensional labeled data structures with columns of same or different data types. Similar to tables in a database the DataFrame can hold multiple columns with multiple data types. You can also think of a DataFrame as a group of Series objects that share an index.

How to import Data in Dataframe

# Importing the dataset
dataset = pd.read_csv('../data/data.csv')
dataset

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes

How to inspect Data in Dataframe

Very first information what we would like to know in a dataframe are - number of columns - number of records - attribute names - datatype of each attribute
we can get all these information by calling just one function info() it will give Concise summary of a DataFrame

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 392.0+ bytes

we can also use dtypes to get the datatypes of each attribute

dataset.dtypes

Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

for just geting the column names in a dataframe use dataset.columns

# columns.values gives the column names in the DataFrame
dataset.columns.values

array(['Country', 'Age', 'Salary', 'Purchased'], dtype=object)

similarly for index values

# index.values gives the list of row indices
dataset.index.values

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Probably the most useful function for inspecting a data set in a DataFrame is describe() it will return basic statistics about the dataset’s numeric columns

dataset.describe()

	Age	Salary
count	9.000000	9.000000
mean	38.777778	63777.777778
std	7.693793	12265.579662
min	27.000000	48000.000000
25%	35.000000	54000.000000
50%	38.000000	61000.000000
75%	44.000000	72000.000000
max	50.000000	83000.000000

For take a look at the actual data head() and tail() are the most useful function - head method shows first n rows from the DataFrame, default value of n is 5 - tail method shows last n rows from the DataFrame, default value of n is 5

dataset.head()

	Country	Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes

dataset.tail()

	Country	Age	Salary	Purchased
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes

Select and Index in DataFrames

There are three main options in pandas, which allows us to access the data in DataFrame, these are based on index and location of the rows and column, these options could be confusing for beginners but its quit simple once understood.

The selection methods are:

Selecting data by row numbers called integer-location based indexing/selection (.iloc)

dataset.iloc[<row selection>, <column selection>]
Selecting data by label or by a conditional statement (.loc)

dataset.loc[<row selection>, <column selection>]
Selecting in a hybrid approach(.ix)

dataset.ix[<row selection>, <column selection>]

Single selections using iloc and DataFrame

Rows:

R1 = dataset.iloc[0] # first row of data frame - Note a Series data type output.
R2 = dataset.iloc[1] # second row of data frame
R3 = dataset.iloc[-1] # last row of data frame
print ("---------------------------------------")
print (R1)
print ("---------------------------------------")
print (R2)
print ("---------------------------------------")
print (R3)

---------------------------------------
Country      France
Age              44
Salary        72000
Purchased        No
Name: 0, dtype: object
---------------------------------------
Country      Spain
Age             27
Salary       48000
Purchased      Yes
Name: 1, dtype: object
---------------------------------------
Country      France
Age              37
Salary        67000
Purchased       Yes
Name: 9, dtype: object

Columns:

C1 = dataset.iloc[:,0] # first column of data frame (first_name)
C2 = dataset.iloc[:,1] # second column of data frame (last_name)
C3 = dataset.iloc[:,-1] # last column of data frame (id)
print ("---------------------------------------")
print (C1)
print ("---------------------------------------")
print (C2)
print ("---------------------------------------")
print (C3)

---------------------------------------
   France
    Spain
  Germany
    Spain
  Germany
   France
    Spain
   France
  Germany
   France
Name: Country, dtype: object
---------------------------------------
  44.0
  27.0
  30.0
  38.0
  40.0
  35.0
   NaN
  48.0
  50.0
  37.0
Name: Age, dtype: float64
---------------------------------------
   No
  Yes
   No
   No
  Yes
  Yes
   No
  Yes
   No
  Yes
Name: Purchased, dtype: object

Multiple row and column selections using iloc and DataFrame

MR1 = dataset.iloc[0:5] # first five rows of dataframe
MR2 = dataset.iloc[:, 0:2] # first two columns of data frame with all rows
MR3 = dataset.iloc[[0, 3, 6, 9], [0, 3]] # 1st, 4th, 7th, 9th row + 1st 3th 4th columns.
MR4 = dataset.iloc[0:5, 1:3] # first 5 rows and 4th, 5th columns of data frame.

print ("---------------------------------------")
print (MR1)
print ("---------------------------------------")
print (MR2)
print ("---------------------------------------")
print (MR3)
print ("---------------------------------------")
print (MR4)

---------------------------------------
   Country   Age   Salary Purchased
 France  44.0  72000.0        No
  Spain  27.0  48000.0       Yes
Germany  30.0  54000.0        No
  Spain  38.0  61000.0        No
Germany  40.0      NaN       Yes
---------------------------------------
   Country   Age
 France  44.0
  Spain  27.0
Germany  30.0
  Spain  38.0
Germany  40.0
 France  35.0
  Spain   NaN
 France  48.0
Germany  50.0
 France  37.0
---------------------------------------
  Country Purchased
France        No
 Spain        No
 Spain        No
France       Yes
---------------------------------------
    Age   Salary
44.0  72000.0
27.0  48000.0
30.0  54000.0
38.0  61000.0
40.0      NaN

.iloc returns a Pandas Series when one only row or Column is selected

print(".iloc returns a Pandas Series when one only row or Column is selected")
print(type(dataset.iloc[0]))
print(type(dataset.iloc[:,1]))

.iloc returns a Pandas Series when one only row or Column is selected
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

.iloc returns a Pandas DataFrame when multiple rows or Columns are selected

print("\n .iloc returns a Pandas DataFrame when multiple rows or Columns are selected")
print(type(dataset.iloc[0:2]))
print(type(dataset.iloc[:,1:3]))
print(type(dataset.iloc[1:2, 3:6]))

 .iloc returns a Pandas DataFrame when multiple rows or Columns are selected
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

.iloc returns a Pandas Series when multiple rows are selected with only one column

print("\n .iloc returns a Pandas Series when multiple rows are selected with only one column")
print(type(dataset.iloc[1:2, 3]))

 .iloc returns a Pandas Series when multiple rows are selected with only one column
<class 'pandas.core.series.Series'>

above output could easily converted to Pandas DataFrame by passing a single-valued list as column index

print("\n Above output could easly converted to Pandas DataFrame by passing a single-valued list as column index")
print(type(dataset.iloc[1:2, [3]]))

 Above output could easily converted to Pandas DataFrame by passing a single-valued list as column index
<class 'pandas.core.frame.DataFrame'>

some more fun with iloc

X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, 3]

print ("---------------------------------------")
print(type(X))
print(X)
print ("---------------------------------------")
print(type(Y))
print(Y)

---------------------------------------
<class 'pandas.core.frame.DataFrame'>
   Country   Age   Salary
 France  44.0  72000.0
  Spain  27.0  48000.0
Germany  30.0  54000.0
  Spain  38.0  61000.0
Germany  40.0      NaN
 France  35.0  58000.0
  Spain   NaN  52000.0
 France  48.0  79000.0
Germany  50.0  83000.0
 France  37.0  67000.0
---------------------------------------
<class 'pandas.core.series.Series'>
   No
  Yes
   No
   No
  Yes
  Yes
   No
  Yes
   No
  Yes
Name: Purchased, dtype: object

Selecting pandas data using loc:- The Pandas loc indexer can be used with DataFrames in two different scenarios:

 a.) Selecting rows by label/index
 b.) Selecting rows with a Boolean/conditional lookup

The loc indexer is used with the same syntax as iloc: data.loc[<row selection>, <column selection>]

Index can be in a DataFrame by using set_index() method

dataset.set_index("Country", inplace=True)
dataset

	Age	Salary	Purchased
Country
France	44.0	72000.0	No
Spain	27.0	48000.0	Yes
Germany	30.0	54000.0	No
Spain	38.0	61000.0	No
Germany	40.0	NaN	Yes
France	35.0	58000.0	Yes
Spain	NaN	52000.0	No
France	48.0	79000.0	Yes
Germany	50.0	83000.0	No
France	37.0	67000.0	Yes

Selecting rows by index

dataset.loc['France']

	Age	Salary	Purchased
Country
France	44.0	72000.0	No
France	35.0	58000.0	Yes
France	48.0	79000.0	Yes
France	37.0	67000.0	Yes

Selecting rows by label/index

dataset.loc[['France', 'Spain'], ['Age', 'Salary']]

	Age	Salary
Country
France	44.0	72000.0
France	35.0	58000.0
France	48.0	79000.0
France	37.0	67000.0
Spain	27.0	48000.0
Spain	38.0	61000.0
Spain	NaN	52000.0

dataset.loc[['France', 'Spain'], 'Age':'Purchased']

	Age	Salary	Purchased
Country
France	44.0	72000.0	No
France	35.0	58000.0	Yes
France	48.0	79000.0	Yes
France	37.0	67000.0	Yes
Spain	27.0	48000.0	Yes
Spain	38.0	61000.0	No
Spain	NaN	52000.0	No

resetting the index and setting a new index

dataset.reset_index(inplace=True)

dataset.set_index('Age', inplace=True)
dataset

	Country	Salary	Purchased
Age
44.0	France	72000.0	No
27.0	Spain	48000.0	Yes
30.0	Germany	54000.0	No
38.0	Spain	61000.0	No
40.0	Germany	NaN	Yes
35.0	France	58000.0	Yes
NaN	Spain	52000.0	No
48.0	France	79000.0	Yes
50.0	Germany	83000.0	No
37.0	France	67000.0	Yes

Selecting rows by label/index

dataset.loc[[44.0, 27.0], ['Country', 'Salary']]

	Country	Salary
Age
44.0	France	72000.0
27.0	Spain	48000.0

Selecting rows with a Boolean/conditional lookup

dataset.loc[dataset['Country'] == 'France', ['Country', 'Salary']]

	Country	Salary
Age
44.0	France	72000.0
35.0	France	58000.0
48.0	France	79000.0
37.0	France	67000.0

Selections can be achieved outside of the main .loc for clarity Form a separate variable with your selections: like in the example below Select only the True values in idx and only the 3 columns specified:

idx = dataset['Country'].apply(lambda x: x.lower() == 'france')
dataset.loc[idx, ['Country', 'Salary']]

	Country	Salary
Age
44.0	France	72000.0
35.0	France	58000.0
48.0	France	79000.0
37.0	France	67000.0

Selecting pandas data using ix:-

ix[] indexer is a hybrid of .loc and .iloc,
ix is label based indexer, it behave just like .loc, it also supports integer based indexing like .iloc
ix indexing works just the same as .loc when passed strings

dataset.ix[['Country']] == dataset.loc[['Country']]

	Country	Salary	Purchased
Age
Country	False	False	False

ix indexing works the same as .iloc when passed integers

dataset.reset_index(inplace=True)
dataset.ix[[2]] == dataset.iloc[[2]]

	Age	Country	Salary	Purchased
2	True	True	True	True

Add and Delete in DataFrame

Adding row in DataFrame

Note:- General recommendation for adding a row is to use .loc to insert rows in DataFrame If you would use .ix, you might try to reference a numerically valued index with the index value and accidentally overwrite an existing row of your DataFrame.

dataset.loc[10] = ['India', 27, 65000, 'Yes']
dataset.ix[11] = ['India', 26, 60000, 'Yes']
dataset

	Age	Country	Salary	Purchased
0	44	France	72000.0	No
1	27	Spain	48000.0	Yes
2	30	Germany	54000.0	No
3	38	Spain	61000.0	No
4	40	Germany	NaN	Yes
5	35	France	58000.0	Yes
6	NaN	Spain	52000.0	No
7	48	France	79000.0	Yes
8	50	Germany	83000.0	No
9	37	France	67000.0	Yes
10	India	27	65000.0	Yes
11	India	26	60000.0	Yes

Adding column in DataFrame

Columns in DataFrame is basically a series,so adding a column in a DataFrame is as simple as assigning a new column to a DataFrame

column = pd.Series(range(1,13), dtype=float)
column

    1.0
    2.0
    3.0
    4.0
    5.0
    6.0
    7.0
    8.0
    9.0
   10.0
  11.0
  12.0
dtype: float64

#adding column in DataFrame
dataset['id'] = column
dataset

	Age	Country	Salary	Purchased	id
0	44	France	72000.0	No	1.0
1	27	Spain	48000.0	Yes	2.0
2	30	Germany	54000.0	No	3.0
3	38	Spain	61000.0	No	4.0
4	40	Germany	NaN	Yes	5.0
5	35	France	58000.0	Yes	6.0
6	NaN	Spain	52000.0	No	7.0
7	48	France	79000.0	Yes	8.0
8	50	Germany	83000.0	No	9.0
9	37	France	67000.0	Yes	10.0
10	India	27	65000.0	Yes	11.0
11	India	26	60000.0	Yes	12.0

Delete a column from DataFrame by column name

df = dataset.drop('Purchased', axis=1)
df

	Age	Country	Salary	id
0	44	France	72000.0	1.0
1	27	Spain	48000.0	2.0
2	30	Germany	54000.0	3.0
3	38	Spain	61000.0	4.0
4	40	Germany	NaN	5.0
5	35	France	58000.0	6.0
6	NaN	Spain	52000.0	7.0
7	48	France	79000.0	8.0
8	50	Germany	83000.0	9.0
9	37	France	67000.0	10.0
10	India	27	65000.0	11.0
11	India	26	60000.0	12.0

Delete a column from DataFrame by row index

one thing to be noted here is if the in-place is set to be True the deleting happens the the existing dataframe by default in-place is False, which creates new DataFrame with deleted rows

dataset.drop(dataset.index[2:7], inplace=True)
dataset

	Age	Country	Salary	Purchased	id
0	44	France	72000.0	No	1.0
1	27	Spain	48000.0	Yes	2.0
7	48	France	79000.0	Yes	8.0
8	50	Germany	83000.0	No	9.0
9	37	France	67000.0	Yes	10.0
10	India	27	65000.0	Yes	11.0
11	India	26	60000.0	Yes	12.0