This article uses real stock data as an example to teach you how to read common data files in Python.
Content:
- Read csv data
- Read Excel data
- Merge multiple tables
Data file download address:
CSV data file nasdaq-listings.csv
Excel data file listings.xlsx
Read csv data
csv file uses comma to separate values, which is one of the commonly used data formats. Its specific form can refer to the data file given above. Next, we will use the pandas data analysis package in Python to read and view the data.
- pandas.read_csv(): read the data in csv format and store it in DataFrame format.
- df.head(): displays the first 5 rows of DF.
- df.info(): display data summary.
# Import pandas package import pandas as pd # Read csv file nasdaq = pd.read_csv('nasdaq-listings.csv') # Display the first 10 rows of data print(nasdaq.head(10))
Stock Symbol Company Name Last Sale Market Capitalization \ 0 AAPL Apple Inc. 141.05 7.400000e+11 1 GOOGL Alphabet Inc. 840.18 5.810000e+11 2 GOOG Alphabet Inc. 823.56 5.690000e+11 3 MSFT Microsoft Corporation 64.95 5.020000e+11 4 AMZN Amazon.com, Inc. 884.67 4.220000e+11 5 FB Facebook, Inc. 139.39 4.030000e+11 6 CMCSA Comcast Corporation 37.14 1.760000e+11 7 INTC Intel Corporation 35.25 1.660000e+11 8 CSCO Cisco Systems, Inc. 32.42 1.620000e+11 9 AMGN Amgen Inc. 161.61 1.190000e+11 IPO Year Sector \ 0 1980 Technology 1 NAN Technology 2 2004 Technology 3 1986 Technology 4 1997 Consumer Services 5 2012 Technology 6 NAN Consumer Services 7 NAN Technology 8 1990 Technology 9 1983 Health Care Industry Last Update 0 Computer Manufacturing 4/26/17 1 Computer Software: Programming, Data Processing 4/24/17 2 Computer Software: Programming, Data Processing 4/23/17 3 Computer Software: Prepackaged Software 4/26/17 4 Catalog/Specialty Distribution 4/24/17 5 Computer Software: Programming, Data Processing 4/26/17 6 Television Services 4/26/17 7 Semiconductors 4/23/17 8 Computer Communications Equipment 4/23/17 9 Biotechnology: Biological Products (No Diagnos... 4/24/17
Use the. info() method to view the summary information for the data box.
# View data summary nasdaq.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1115 entries, 0 to 1114 Data columns (total 8 columns): Stock Symbol 1115 non-null object Company Name 1115 non-null object Last Sale 1115 non-null float64 Market Capitalization 1115 non-null float64 IPO Year 1115 non-null object Sector 1115 non-null object Industry 1115 non-null object Last Update 1115 non-null object dtypes: float64(2), object(6) memory usage: 69.8+ KB
From the information listed above, the Nasdaq stock data includes eight columns: Stock Symbol, Company Name, Last Sale, Market Capitalization, IPO Year, Sector, Industry and Last Update.
The read data needs to be able to restore the information in the original data. For example, Last Update should be data in time format, but there are missing values such as NAN in IPO Year, which are not reflected at present. So we need to set parameters to improve the reading mode of csv file.
pandas.read_csv() parameter
- na_vlaues: sets the missing value form.
- parse_dates: resolves the specified column to the time date format.
nasdaq = pd.read_csv('nasdaq-listings.csv', na_values='NAN', parse_dates=['Last Update']) print(nasdaq.head())
Stock Symbol Company Name Last Sale Market Capitalization \ 0 AAPL Apple Inc. 141.05 7.400000e+11 1 GOOGL Alphabet Inc. 840.18 5.810000e+11 2 GOOG Alphabet Inc. 823.56 5.690000e+11 3 MSFT Microsoft Corporation 64.95 5.020000e+11 4 AMZN Amazon.com, Inc. 884.67 4.220000e+11 IPO Year Sector \ 0 1980.0 Technology 1 NaN Technology 2 2004.0 Technology 3 1986.0 Technology 4 1997.0 Consumer Services Industry Last Update 0 Computer Manufacturing 2017-04-26 1 Computer Software: Programming, Data Processing 2017-04-24 2 Computer Software: Programming, Data Processing 2017-04-23 3 Computer Software: Prepackaged Software 2017-04-26 4 Catalog/Specialty Distribution 2017-04-24
nasdaq.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1115 entries, 0 to 1114 Data columns (total 8 columns): Stock Symbol 1115 non-null object Company Name 1115 non-null object Last Sale 1115 non-null float64 Market Capitalization 1115 non-null float64 IPO Year 593 non-null float64 Sector 1036 non-null object Industry 1036 non-null object Last Update 1115 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(3), object(4) memory usage: 69.8+ KB
Read Excel data
Excel file is a traditional data format, but in the face of massive data, it is more advantageous to use programming method to process data. The data file used in this example is shown in the figure below. Note that it has three sheet tables.
Similar to csv files, you can use pandas.read_excel() function to read the excel file and store it as a data box format.
pandas.read_excel() reads the excel file with the following parameters:
- sheet_name: set the name of the sheet read.
- na_values: sets the form of the missing value.
# Read Excel data and select nyse nyse = pd.read_excel('listings.xlsx', sheet_name='nyse', na_values='n/a') # Show previous lines print(nyse.head())
Stock Symbol Company Name Last Sale Market Capitalization \ 0 DDD 3D Systems Corporation 14.48 1.647165e+09 1 MMM 3M Company 188.65 1.127366e+11 2 WBAI 500.com Limited 13.96 5.793129e+08 3 WUBA 58.com Inc. 36.11 5.225238e+09 4 AHC A.H. Belo Corporation 6.20 1.347351e+08 IPO Year Sector \ 0 NaN Technology 1 NaN Health Care 2 2013.0 Consumer Services 3 2013.0 Technology 4 NaN Consumer Services Industry 0 Computer Software: Prepackaged Software 1 Medical/Dental Instruments 2 Services-Misc. Amusement & Recreation 3 Computer Software: Programming, Data Processing 4 Newspapers/Magazines
# Show summary information nyse.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 3147 entries, 0 to 3146 Data columns (total 7 columns): Stock Symbol 3147 non-null object Company Name 3147 non-null object Last Sale 3079 non-null float64 Market Capitalization 3147 non-null float64 IPO Year 1361 non-null float64 Sector 2177 non-null object Industry 2177 non-null object dtypes: float64(3), object(4) memory usage: 172.2+ KB
In fact, only one sheet is read here, but there are three sheets in the Excel file. The following will show how to read all sheets.
- pd.ExcelFile(): store Excel file as excelfile object.
- ExcelFile.sheet_names: get the names of all the sheet tables of excelfile, and store them in the python list.
# Read Excel file to ExcelFile format xls = pd.ExcelFile('listings.xlsx') # Get the name of the sheet table exchanges = xls.sheet_names print(exchanges)
['amex', 'nasdaq', 'nyse']
And still use pd.read_excel() reads the ExcelFile data, but passes it to the parameter sheet_ The value of name is the above exchange list. The return value lists is in dictionary format, and each element corresponds to the DataFrame of a table.
# Read the data of all sheet s and store it in the dictionary listings = pd.read_excel(xls, sheet_name=exchanges, na_values='n/a') # View summary information for nasdaq tables listings['nasdaq'].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3167 entries, 0 to 3166 Data columns (total 7 columns): Stock Symbol 3167 non-null object Company Name 3167 non-null object Last Sale 3165 non-null float64 Market Capitalization 3167 non-null float64 IPO Year 1386 non-null float64 Sector 2767 non-null object Industry 2767 non-null object dtypes: float64(3), object(4) memory usage: 173.3+ KB
Merge multiple tables
Now we want to merge multiple tables in Excel. Because they have the same column structure, they can be simply stacked. Next, join the two tables nyse and nasdaq, and use the pd.concat() function.
pd.concat The () function is used to connect multiple DataFrame data frames. Note that the merged dataframes need to be placed in the list.
# Read the data of nyse and nasdaq tables nyse = pd.read_excel('listings.xlsx', sheet_name='nyse', na_values='n/a') nasdaq = pd.read_excel('listings.xlsx', sheet_name='nasdaq', na_values='n/a') # Add a column of Exchange to mark the data from which table nyse['Exchange'] = 'NYSE' nasdaq['Exchange'] = 'NASDAQ' # Splicing two dataframes combined_listings = pd.concat([nyse, nasdaq]) # Notice here [] combined_listings.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 6314 entries, 0 to 3166 Data columns (total 8 columns): Stock Symbol 6314 non-null object Company Name 6314 non-null object Last Sale 6244 non-null float64 Market Capitalization 6314 non-null float64 IPO Year 2747 non-null float64 Sector 4944 non-null object Industry 4944 non-null object Exchange 6314 non-null object dtypes: float64(3), object(5) memory usage: 444.0+ KB
If we want to merge all tables, we can also ExcelFile.sheet_ Name to get the names of all the sheet tables, and then read the data of each sheet table through the loop, and finally use pd.concat() function to merge all sheet tables.
# Create ExcelFile variable xls = pd.ExcelFile('listings.xlsx') # Get sheet name exchanges = xls.sheet_names # Create an empty list listings = [] # Use the loop to import the data of each page one by one and store it in the list for exchange in exchanges: listing = pd.read_excel(xls, sheet_name=exchange, na_values='n/a') listing['Exchange'] = exchange listings.append(listing) # Merge data listing_data = pd.concat(listings) # View a summary of the merged data listing_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 6674 entries, 0 to 3146 Data columns (total 8 columns): Stock Symbol 6674 non-null object Company Name 6674 non-null object Last Sale 6590 non-null float64 Market Capitalization 6674 non-null float64 IPO Year 2852 non-null float64 Sector 5182 non-null object Industry 5182 non-null object Exchange 6674 non-null object dtypes: float64(3), object(5) memory usage: 469.3+ KB
Author: DrFish Link: https://www.jianshu.com/p/19a40868b418
Source network, only for learning, if there is infringement, please contact delete.
Don't panic. I have a set of learning materials, including 40 + E-books, 800 + teaching videos, involving Python foundation, reptile, framework, data analysis, machine learning, etc. I'm not afraid you won't learn! https://shimo.im/docs/JWCghr8prjCVCxxK/ Python learning materials
Pay attention to the official account [Python circle].