Read csv and Excel data with pandas

This article uses real stock data as an example to teach you how to read common data files in Python.

Content:

  • Read csv data
  • Read Excel data
  • Merge multiple tables

Data file download address:

CSV data file nasdaq-listings.csv

https://raw.githubusercontent.com/fishstar/Data-Analysis-Practice/master/import%26manage_data_in_Python/nasdaq-listings.csv

Excel data file listings.xlsx

https://github.com/fishstar/Data-Analysis-Practice/blob/master/import%26manage_data_in_Python/listings.xlsx

Read csv data

csv file uses comma to separate values, which is one of the commonly used data formats. Its specific form can refer to the data file given above. Next, we will use the pandas data analysis package in Python to read and view the data.

  • pandas.read_csv(): read the data in csv format and store it in DataFrame format.
  • df.head(): displays the first 5 rows of DF.
  • df.info(): display data summary.
# Import pandas package
import pandas as pd

# Read csv file
nasdaq = pd.read_csv('nasdaq-listings.csv')

# Display the first 10 rows of data
print(nasdaq.head(10))
  Stock Symbol           Company Name  Last Sale  Market Capitalization  \
0         AAPL             Apple Inc.     141.05           7.400000e+11   
1        GOOGL          Alphabet Inc.     840.18           5.810000e+11   
2         GOOG          Alphabet Inc.     823.56           5.690000e+11   
3         MSFT  Microsoft Corporation      64.95           5.020000e+11   
4         AMZN       Amazon.com, Inc.     884.67           4.220000e+11   
5           FB         Facebook, Inc.     139.39           4.030000e+11   
6        CMCSA    Comcast Corporation      37.14           1.760000e+11   
7         INTC      Intel Corporation      35.25           1.660000e+11   
8         CSCO    Cisco Systems, Inc.      32.42           1.620000e+11   
9         AMGN             Amgen Inc.     161.61           1.190000e+11   

  IPO Year             Sector  \
0     1980         Technology   
1      NAN         Technology   
2     2004         Technology   
3     1986         Technology   
4     1997  Consumer Services   
5     2012         Technology   
6      NAN  Consumer Services   
7      NAN         Technology   
8     1990         Technology   
9     1983        Health Care   

                                            Industry Last Update  
0                             Computer Manufacturing     4/26/17  
1    Computer Software: Programming, Data Processing     4/24/17  
2    Computer Software: Programming, Data Processing     4/23/17  
3            Computer Software: Prepackaged Software     4/26/17  
4                     Catalog/Specialty Distribution     4/24/17  
5    Computer Software: Programming, Data Processing     4/26/17  
6                                Television Services     4/26/17  
7                                     Semiconductors     4/23/17  
8                  Computer Communications Equipment     4/23/17  
9  Biotechnology: Biological Products (No Diagnos...     4/24/17  

Use the. info() method to view the summary information for the data box.

# View data summary
nasdaq.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 8 columns):
Stock Symbol             1115 non-null object
Company Name             1115 non-null object
Last Sale                1115 non-null float64
Market Capitalization    1115 non-null float64
IPO Year                 1115 non-null object
Sector                   1115 non-null object
Industry                 1115 non-null object
Last Update              1115 non-null object
dtypes: float64(2), object(6)
memory usage: 69.8+ KB

From the information listed above, the Nasdaq stock data includes eight columns: Stock Symbol, Company Name, Last Sale, Market Capitalization, IPO Year, Sector, Industry and Last Update.

The read data needs to be able to restore the information in the original data. For example, Last Update should be data in time format, but there are missing values such as NAN in IPO Year, which are not reflected at present. So we need to set parameters to improve the reading mode of csv file.

pandas.read_csv() parameter

  • na_vlaues: sets the missing value form.
  • parse_dates: resolves the specified column to the time date format.
nasdaq = pd.read_csv('nasdaq-listings.csv', na_values='NAN', parse_dates=['Last Update'])
print(nasdaq.head())
  Stock Symbol           Company Name  Last Sale  Market Capitalization  \
0         AAPL             Apple Inc.     141.05           7.400000e+11   
1        GOOGL          Alphabet Inc.     840.18           5.810000e+11   
2         GOOG          Alphabet Inc.     823.56           5.690000e+11   
3         MSFT  Microsoft Corporation      64.95           5.020000e+11   
4         AMZN       Amazon.com, Inc.     884.67           4.220000e+11   

   IPO Year             Sector  \
0    1980.0         Technology   
1       NaN         Technology   
2    2004.0         Technology   
3    1986.0         Technology   
4    1997.0  Consumer Services   

                                          Industry Last Update  
0                           Computer Manufacturing  2017-04-26  
1  Computer Software: Programming, Data Processing  2017-04-24  
2  Computer Software: Programming, Data Processing  2017-04-23  
3          Computer Software: Prepackaged Software  2017-04-26  
4                   Catalog/Specialty Distribution  2017-04-24  
nasdaq.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 8 columns):
Stock Symbol             1115 non-null object
Company Name             1115 non-null object
Last Sale                1115 non-null float64
Market Capitalization    1115 non-null float64
IPO Year                 593 non-null float64
Sector                   1036 non-null object
Industry                 1036 non-null object
Last Update              1115 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(4)
memory usage: 69.8+ KB

Read Excel data

Excel file is a traditional data format, but in the face of massive data, it is more advantageous to use programming method to process data. The data file used in this example is shown in the figure below. Note that it has three sheet tables.

Similar to csv files, you can use pandas.read_excel() function to read the excel file and store it as a data box format.

pandas.read_excel() reads the excel file with the following parameters:

  • sheet_name: set the name of the sheet read.
  • na_values: sets the form of the missing value.
# Read Excel data and select nyse
nyse = pd.read_excel('listings.xlsx', sheet_name='nyse', na_values='n/a')

# Show previous lines
print(nyse.head())
  Stock Symbol            Company Name  Last Sale  Market Capitalization  \
0          DDD  3D Systems Corporation      14.48           1.647165e+09   
1          MMM              3M Company     188.65           1.127366e+11   
2         WBAI         500.com Limited      13.96           5.793129e+08   
3         WUBA             58.com Inc.      36.11           5.225238e+09   
4          AHC   A.H. Belo Corporation       6.20           1.347351e+08   

   IPO Year             Sector  \
0       NaN         Technology   
1       NaN        Health Care   
2    2013.0  Consumer Services   
3    2013.0         Technology   
4       NaN  Consumer Services   

                                          Industry  
0          Computer Software: Prepackaged Software  
1                       Medical/Dental Instruments  
2            Services-Misc. Amusement & Recreation  
3  Computer Software: Programming, Data Processing  
4                             Newspapers/Magazines  
# Show summary information
nyse.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3147 entries, 0 to 3146
Data columns (total 7 columns):
Stock Symbol             3147 non-null object
Company Name             3147 non-null object
Last Sale                3079 non-null float64
Market Capitalization    3147 non-null float64
IPO Year                 1361 non-null float64
Sector                   2177 non-null object
Industry                 2177 non-null object
dtypes: float64(3), object(4)
memory usage: 172.2+ KB

In fact, only one sheet is read here, but there are three sheets in the Excel file. The following will show how to read all sheets.

  • pd.ExcelFile(): store Excel file as excelfile object.
  • ExcelFile.sheet_names: get the names of all the sheet tables of excelfile, and store them in the python list.
# Read Excel file to ExcelFile format
xls = pd.ExcelFile('listings.xlsx')

# Get the name of the sheet table
exchanges = xls.sheet_names
print(exchanges)
['amex', 'nasdaq', 'nyse']

And still use pd.read_excel() reads the ExcelFile data, but passes it to the parameter sheet_ The value of name is the above exchange list. The return value lists is in dictionary format, and each element corresponds to the DataFrame of a table.

# Read the data of all sheet s and store it in the dictionary
listings = pd.read_excel(xls, sheet_name=exchanges, na_values='n/a')

# View summary information for nasdaq tables
listings['nasdaq'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3167 entries, 0 to 3166
Data columns (total 7 columns):
Stock Symbol             3167 non-null object
Company Name             3167 non-null object
Last Sale                3165 non-null float64
Market Capitalization    3167 non-null float64
IPO Year                 1386 non-null float64
Sector                   2767 non-null object
Industry                 2767 non-null object
dtypes: float64(3), object(4)
memory usage: 173.3+ KB

Merge multiple tables

Now we want to merge multiple tables in Excel. Because they have the same column structure, they can be simply stacked. Next, join the two tables nyse and nasdaq, and use the pd.concat() function.

pd.concat The () function is used to connect multiple DataFrame data frames. Note that the merged dataframes need to be placed in the list.

# Read the data of nyse and nasdaq tables
nyse = pd.read_excel('listings.xlsx', sheet_name='nyse', na_values='n/a')
nasdaq = pd.read_excel('listings.xlsx', sheet_name='nasdaq', na_values='n/a')

# Add a column of Exchange to mark the data from which table
nyse['Exchange'] = 'NYSE'
nasdaq['Exchange'] = 'NASDAQ'

# Splicing two dataframes
combined_listings = pd.concat([nyse, nasdaq])  # Notice here []
combined_listings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6314 entries, 0 to 3166
Data columns (total 8 columns):
Stock Symbol             6314 non-null object
Company Name             6314 non-null object
Last Sale                6244 non-null float64
Market Capitalization    6314 non-null float64
IPO Year                 2747 non-null float64
Sector                   4944 non-null object
Industry                 4944 non-null object
Exchange                 6314 non-null object
dtypes: float64(3), object(5)
memory usage: 444.0+ KB

If we want to merge all tables, we can also ExcelFile.sheet_ Name to get the names of all the sheet tables, and then read the data of each sheet table through the loop, and finally use pd.concat() function to merge all sheet tables.

# Create ExcelFile variable
xls = pd.ExcelFile('listings.xlsx')

# Get sheet name
exchanges = xls.sheet_names

# Create an empty list
listings = []

# Use the loop to import the data of each page one by one and store it in the list
for exchange in exchanges:
    listing = pd.read_excel(xls, sheet_name=exchange, na_values='n/a')
    listing['Exchange'] = exchange
    listings.append(listing)

# Merge data
listing_data = pd.concat(listings)

# View a summary of the merged data
listing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6674 entries, 0 to 3146
Data columns (total 8 columns):
Stock Symbol             6674 non-null object
Company Name             6674 non-null object
Last Sale                6590 non-null float64
Market Capitalization    6674 non-null float64
IPO Year                 2852 non-null float64
Sector                   5182 non-null object
Industry                 5182 non-null object
Exchange                 6674 non-null object
dtypes: float64(3), object(5)
memory usage: 469.3+ KB

Author: DrFish Link: https://www.jianshu.com/p/19a40868b418

Source network, only for learning, if there is infringement, please contact delete.

Don't panic. I have a set of learning materials, including 40 + E-books, 800 + teaching videos, involving Python foundation, reptile, framework, data analysis, machine learning, etc. I'm not afraid you won't learn! https://shimo.im/docs/JWCghr8prjCVCxxK/ Python learning materials

Pay attention to the official account [Python circle].

file

Tags: Programming Excel Python github

Posted on Sat, 13 Jun 2020 06:55:15 -0400 by markepic