Data types in data processing
- When processing data with pandas, we often encounter the problem of data type. When we get the data, we first need to make sure that we get the correct type of data. Generally, through the conversion of data type, this article introduces the data type inside pandas (data types are commonly used dtyps), and the data correspondence between pandas and numpy.

- The main types are object, int64, float64, datetime64, bool, etc. category and timedelta are introduced separately in other articles.Of course, this article will also cover a simple introduction.
Problems with data types are usually discovered after a problem occurs, so once you have some experience, you will get the data, and then directly look at the data type, whether it is consistent with the data format you want to process, which can avoid some embarrassing problems from the beginning.So let's take a simple example and introduce a data type using jupyter notebook.
####Import two commonly used data processing packages, numpy and pandas, by convention
import numpy as np
import pandas as pd
# Reads the data from the csv file, the data table only has 5 rows, contains float, string, int three data python types, namely float64, object, int64 for the corresponding pandas
# There are six columns in the csv file, the first is the header and the rest is the data.
df = pd.read_csv("sales_data_types.csv")
print(df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
df.dtypes
Customer Number int64
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
# If you want to get the sum of 2016 and 2017 data, you can try, but this is not the answer we need, because the data type in these two columns is object, and after doing so, you get a longer string.
# Of course, we can get more detailed information about the data frame through df.info().
df['2016']+df['2017']
0 $125,000.00 $162,500.00
1 $920,000.00 $1,012,000.00
2 $50,000.00 $62,500.00
3 $350,000.00 $490,000.00
4 $15,000.00 $12,750.00
dtype: object
df.info()
# Customer Number column is float64, but should be int64
# The data for the 2016 2017 columns are object s, not float64 or int64 formats
# Percent and Jan Units are also objects, not numbers
# Month, Day, and Year should be converted to datetime64[ns] format
# Active column should be Boolean
# Without data cleaning, it is difficult to do the next step in data analysis. There are three more common methods in pandas for data format conversion.
# 1. astype() Forces the conversion of data types
# 2. Data transformation by creating custom functions
# 3. to_nueric() and to_datetime() provided by pandas
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number 5 non-null int64
Customer Name 5 non-null object
2016 5 non-null object
2017 5 non-null object
Percent Growth 5 non-null object
Jan Units 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
dtypes: int64(4), object(6)
memory usage: 480.0+ bytes
First, the most commonly used astype()
For example, astype() can be used to convert the data in the first column to an integer int type
df['Customer Number'].astype("int")
# This does not change the original data frame, it just returns a copy
0 10002
1 552278
2 23477
3 24900
4 651029
Name: Customer Number, dtype: int32
# To really change a data frame, you usually need to assign values, such as
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print("--------"*10)
print(df.dtypes)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
--------------------------------------------------------------------------------
Customer Number int32
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
# By assigning values to transform data from the original data frame, you can take a fresh look at our newly generated data frame
print(df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
#Then, columns like Percent Growth, Jan Units, with special symbols, cannot be converted directly by the astype ("flaot") method.
#This, in conjunction with the conversion of a string in python to a floating point number, requires that the original character only contain the number itself, not other special characters
#We can try to convert the Active column to a Boolean value and see what happens. All five results are True, indicating that they didn't work.
#df["Active"].astype("bool")
df['2016'].astype('float')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-47cc9d68cd65> in <module>()
----> 1 df['2016'].astype('float')
C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
3052 # else, only a single dtype is given
3053 new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054 raise_on_error=raise_on_error, **kwargs)
3055 return self._constructor(new_data).__finalize__(self)
3056
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
3187
3188 def astype(self, dtype, **kwargs):
-> 3189 return self.apply('astype', dtype=dtype, **kwargs)
3190
3191 def convert(self, **kwargs):
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3054
3055 kwargs['mgr'] = self
-> 3056 applied = getattr(b, f)(**kwargs)
3057 result_blocks = _extend_blocks(applied, result_blocks)
3058
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
459 **kwargs):
460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461 values=values, **kwargs)
462
463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
502
503 # _astype_nansafe works fine with 1-d only
--> 504 values = _astype_nansafe(values.ravel(), dtype, copy=True)
505 values = values.reshape(self.shape)
506
C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy)
535
536 if copy:
--> 537 return arr.astype(dtype)
538 return arr.view(dtype)
539
ValueError: could not convert string to float: '$15,000.00 '
The above issues illustrate some issues
- If the data is pure data, it can be converted to numbers
- Astype is basically two kinds of use, the number is converted to a simple string, the string of a simple number is converted to a number, and strings containing other non-numbers cannot be converted through astype.
- You need to introduce other methods to transform, so you have the following custom function methods
Clean up data through custom functions
- Currency can be converted using the following functions
def convert_currency(var):
"""
convert the string number to a float
_ Remove $
- Remove commas,
- Convert to floating point type
"""
new_value = var.replace(",","").replace("$","")
return float(new_value)
# Strings are converted to floating-point numbers by removing the $and commas with the replace function, allowing pandas to select a specific type, float or int, that pandas deems appropriate, in which case the data is converted to float64
# Convert all data in column 2016 through the apply function in pandas
df["2016"].apply(convert_currency)
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
# Of course, you can pass this simpler function one line at a time through the lambda function
df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
#You can also use lambda expressions to clean up PercentGrowth data
df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
0 0.30
1 0.10
2 0.25
3 0.04
4 -0.15
Name: Percent Growth, dtype: float64
# It can also be solved by a custom function, with the same result
# The last custom function is to use the np.where() function to convert the Active column to a Boolean value.
df["Active"] = np.where(df["Active"] == "Y", True, False)
df["Active"]
0 True
1 True
2 True
3 True
4 False
Name: Active, dtype: bool
# Here's a look at the data format
df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units object
Month int64
Day int64
Year int64
Active bool
dtype: object
# View the DataFrame again
# At this point only the formats in Jan Units need to be converted and the year, month and day merged, which can be handled using several functions that come with the pandas
print(df)
Customer Number Customer Name 2016 2017 Percent Growth \
0 10002 Quest Industries 125000.0 162500.0 0.30
1 552278 Smith Plumbing 920000.0 1012000.0 0.10
2 23477 ACME Industrial 50000.0 62500.0 0.25
3 24900 Brekke LTD 350000.0 490000.0 0.04
4 651029 Harbor Co 15000.0 12750.0 -0.15
Jan Units Month Day Year Active
0 500 1 10 2015 True
1 700 6 15 2014 True
2 125 3 29 2016 True
3 75 10 27 2015 True
4 Closed 2 2 2014 False
Processing with functions in pandas
# pd.to_numeric() in pandas handles data in Jan Units
pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
0 500.0
1 700.0
2 125.0
3 75.0
4 0.0
Name: Jan Units, dtype: float64
# Finally, the year, month and day are merged using pd.to_datatime()
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-10
1 2014-06-15
2 2016-03-29
3 2015-10-27
4 2014-02-02
dtype: datetime64[ns]
# Do this without forgetting to reassign, otherwise the original data will not change
df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
|
Customer Number |
Customer Name |
2016 |
2017 |
Percent Growth |
Jan Units |
Month |
Day |
Year |
Active |
Start_date |
0 |
10002 |
Quest Industries |
125000.0 |
162500.0 |
0.30 |
500.0 |
1 |
10 |
2015 |
True |
2015-01-10 |
1 |
552278 |
Smith Plumbing |
920000.0 |
1012000.0 |
0.10 |
700.0 |
6 |
15 |
2014 |
True |
2014-06-15 |
2 |
23477 |
ACME Industrial |
50000.0 |
62500.0 |
0.25 |
125.0 |
3 |
29 |
2016 |
True |
2016-03-29 |
3 |
24900 |
Brekke LTD |
350000.0 |
490000.0 |
0.04 |
75.0 |
10 |
27 |
2015 |
True |
2015-10-27 |
4 |
651029 |
Harbor Co |
15000.0 |
12750.0 |
-0.15 |
NaN |
2 |
2 |
2014 |
False |
2014-02-02 |
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
Start_date datetime64[ns]
dtype: object
# Integrate these transformations
def convert_percent(val):
"""
Convert the percentage string to an actual floating point percent
- Remove %
- Divide by 100 to make decimal
"""
new_val = val.replace('%', '')
return float(new_val) / 100
df_2 = pd.read_csv("sales_data_types.csv",dtype={"Customer_Number":"int"},converters={
"2016":convert_currency,
"2017":convert_currency,
"Percent Growth":convert_percent,
"Jan Units":lambda x:pd.to_numeric(x,errors="coerce"),
"Active":lambda x: np.where(x=="Y",True,False)
})
df_2.dtypes
Customer Number int64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
dtype: object
df_2
|
Customer Number |
Customer Name |
2016 |
2017 |
Percent Growth |
Jan Units |
Month |
Day |
Year |
Active |
0 |
10002 |
Quest Industries |
125000.0 |
162500.0 |
0.30 |
500.0 |
1 |
10 |
2015 |
True |
1 |
552278 |
Smith Plumbing |
920000.0 |
1012000.0 |
0.10 |
700.0 |
6 |
15 |
2014 |
True |
2 |
23477 |
ACME Industrial |
50000.0 |
62500.0 |
0.25 |
125.0 |
3 |
29 |
2016 |
True |
3 |
24900 |
Brekke LTD |
350000.0 |
490000.0 |
0.04 |
75.0 |
10 |
27 |
2015 |
True |
4 |
651029 |
Harbor Co |
15000.0 |
12750.0 |
-0.15 |
NaN |
2 |
2 |
2014 |
False |
So far, there are two types of data in pandas: timedelta and category. Later, we will focus on category, which is designed with reference to category in R, added after pandas 0.16, and then tidy up the common methods of pandas as needed.