pandas (software)






   

 






From Wikipedia, the free encyclopedia
 


Pandas
Original author(s): Wes McKinney
Developer(s): Community
Initial release: 11 January 2008 [citation needed]
Stable release: 2.2.2[1] / 10 April 2024
Preview release: 2.0rc1 / 15 March 2023
Written in: Python, Cython, C
Operating system: Cross-platform
Type: Technical computing
License: New BSD License
Website: pandas.pydata.org

Pandas (styled as pandas) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.[2] The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals,[3] as well as a play on the phrase "Python data analysis".[4]: 5  Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.[5]

The development of Pandas brought to Python many of the features for working with DataFrames that were already established in the R programming language. The library is built on top of another library, NumPy.

History

Developer Wes McKinney started working on Pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he convinced management to allow him to open-source the library.

Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library.

In 2015, Pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.[6]

Data Model

Pandas is built around data structures called Series and DataFrames. Data for these collections can be imported from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.[7]
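As a minimal sketch of importing data, the snippet below reads the same small table from CSV text and from JSON text. The column names and values are hypothetical, and in-memory buffers stand in for files on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same hypothetical records expressed as JSON.
json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))

# Both readers produce a DataFrame with two rows and two columns.
print(df_csv.shape, df_json.shape)
```

In real use, a file path would typically be passed in place of the buffer.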

A Series is a 1-dimensional data structure built on top of NumPy's array.[8]: 97  Unlike in NumPy, each data point has an associated label. The collection of these labels is called an index.[4]: 112  Series can be used arithmetically, as in the statement series_3 = series_1 + series_2: this will align data points with corresponding index values in series_1 and series_2, then add them together to produce new values in series_3.[4]: 114 

A DataFrame is a 2-dimensional data structure of rows and columns, similar to a spreadsheet, and analogous to a Python dictionary mapping column names (keys) to Series (values), with each Series sharing an index.[4]: 115  DataFrames can be concatenated together or "merged" on columns or indices in a manner similar to joins in SQL.[4]: 177–182  Pandas implements a subset of relational algebra, and supports one-to-one, many-to-one, and many-to-many joins.[8]: 147–148  Pandas also supports the less common Panel and Panel4D, which are 3-dimensional and 4-dimensional data structures respectively.[8]: 141 
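The following sketch illustrates index alignment during Series arithmetic, the dictionary-of-Series view of a DataFrame, and a SQL-style merge. All labels, column names, and values are hypothetical.

```python
import pandas as pd

series_1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
series_2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are paired by label, not by position; labels present in only
# one operand ("a" and "d") produce NaN in the result.
series_3 = series_1 + series_2

# A DataFrame built from a dict of Series shares one index (the union of labels).
df = pd.DataFrame({"x": series_1, "y": series_2})

# A many-to-one style join on a shared column, similar to an SQL inner join.
left = pd.DataFrame({"key": ["k1", "k2"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["k1", "k1", "k3"], "right_val": [10, 11, 12]})
merged = pd.merge(left, right, on="key", how="inner")

print(series_3)
print(merged)
```

Only the key "k1" appears in both tables, so the inner join keeps the two rows of the right-hand table that carry it.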

Users can transform or summarize data by applying arbitrary functions.[4]: 132  Since Pandas is built on top of NumPy, all NumPy functions work on Series and DataFrames as well.[8]: 115  Pandas also includes built-in operations for arithmetic, string manipulation, and summary statistics such as mean, median, and standard deviation.[4]: 139, 211  These built-in functions are designed to handle missing data, usually represented by the floating-point value NaN.[4]: 142–143 
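A short sketch of these operations, using hypothetical values: a NumPy function applied to a Series, an arbitrary function applied element by element, a built-in string method, and summary statistics that skip missing data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, np.nan, 9.0])

# A NumPy function applied to a Series returns a Series; NaN stays NaN.
roots = np.sqrt(s)

# An arbitrary Python function applied to each element.
doubled = s.apply(lambda v: v * 2)

# Built-in summary statistics ignore NaN by default.
mean = s.mean()      # average of 1, 4 and 9
median = s.median()  # middle of 1, 4 and 9

# Built-in string manipulation; the missing entry stays missing.
names = pd.Series(["ada", None]).str.upper()

print(roots, doubled, mean, median, names, sep="\n")
```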

Subsets of data can be selected by column name, index, or Boolean expressions. For example, df[df['col1'] > 5] will return all rows in the DataFrame df for which the value of the column col1 exceeds 5.[4]: 126–128  Data can be grouped together by a column value, as in df['col1'].groupby(df['col2']), or by a function which is applied to the index. For example, df.groupby(lambda i: i % 2) groups rows by whether the index value is even or odd.[4]: 253–259 
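The selection and grouping expressions above can be sketched on a small DataFrame. The column names col1 and col2 come from the text; the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col1": [3, 6, 9, 12], "col2": ["x", "y", "x", "y"]})

# Boolean selection: keep only rows where col1 exceeds 5.
selected = df[df["col1"] > 5]

# Group col1 by the values of col2, then sum each group.
by_col2 = df["col1"].groupby(df["col2"]).sum()

# Group by a function of the index: even labels map to 0, odd labels to 1.
by_parity = df.groupby(lambda i: i % 2)["col1"].sum()

print(selected)
print(by_col2)
print(by_parity)
```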

Pandas includes support for time series, such as the ability to interpolate values[4]: 316–317  and filter using a range of timestamps (e.g. data['1/1/2023':'2/2/2023'] will return all data points dated between January 1 and February 2, 2023).[4]: 295  Pandas represents missing time series data using a special NaT (Not a Timestamp) object, instead of the NaN value it uses elsewhere.[4]: 292 
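A minimal sketch of these time series features, assuming a hypothetical daily series named data: slicing by a range of date strings, filling a gap by interpolation, and the NaT marker for a missing timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering 60 days from 1 January 2023.
dates = pd.date_range("2023-01-01", periods=60, freq="D")
data = pd.Series(np.arange(60, dtype=float), index=dates)

# Slicing with date strings includes both endpoints.
window = data["1/1/2023":"2/2/2023"]

# Linear interpolation (the default method) fills the missing middle value.
gappy = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)
filled = gappy.interpolate()

# A missing timestamp is stored as NaT rather than NaN.
stamps = pd.Series(pd.to_datetime(["2023-01-01", None]))

print(len(window), filled.iloc[1], stamps.iloc[1])
```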

Indices

By default, a Pandas index is a series of integers ascending from 0, similar to the indices of Python arrays. However, indices can use any NumPy data type, including floating point, timestamps, or strings.[4]: 112 
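The sketch below builds the same hypothetical values with a default integer index and with string, floating-point, and timestamp indices.

```python
import pandas as pd

values = [10, 20, 30]

# With no index given, labels are the integers 0, 1, 2.
default = pd.Series(values)

# The same values labelled with strings, floats, and timestamps.
by_label = pd.Series(values, index=["a", "b", "c"])
by_float = pd.Series(values, index=[0.5, 1.5, 2.5])
by_time = pd.Series(
    values,
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
)

print(list(default.index))
print(by_label["c"], by_float[1.5], by_time[pd.Timestamp("2023-01-02")])
```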

Pandas' syntax for mapping index values to relevant data is the same syntax Python uses to map dictionary keys to values. For example, if s is a Series, s['a'] will return the data point at index a. Unlike dictionary keys, index values are not guaranteed to be unique. If a Series uses the index value a for multiple data points, then s['a'] will instead return a new Series containing all matching values.[4]: 136  A DataFrame's column names are stored and implemented identically to an index. As such, a DataFrame can be thought of as having two indices: one column-based and one row-based. Because column names are stored as an index, they are also not required to be unique.[8]: 103–105 
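A brief sketch of lookups with non-unique labels, using hypothetical data: a unique label yields a single value, while a repeated label yields a Series (or, for repeated column names, a DataFrame).

```python
import pandas as pd

# The label "a" is used twice, "b" once.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

single = s["b"]    # unique label: a single value
matches = s["a"]   # repeated label: a Series with every match

# Column names behave the same way: "x" appears twice.
df = pd.DataFrame([[1, 2, 3]], columns=["x", "x", "y"])
both_x = df["x"]   # a DataFrame holding both "x" columns

print(single)
print(matches)
print(both_x)
```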

If data is a Series, then data['a'] returns all values with the index value of a. However, if data is a DataFrame, then data['a'] returns all values in the column(s) named a. To avoid this ambiguity, Pandas supports the syntax data.loc['a'] as an alternative way to filter using the index. Pandas also supports the syntax data.iloc[n], which always takes an integer n and returns the nth value, counting from 0. This allows a user to act as though the index is an array-like sequence of integers, regardless of how it's actually defined.[8]: 110–113 
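The difference between these three lookups can be sketched as follows. The data are hypothetical, and the DataFrame deliberately uses "a" both as a row label and as a column name to show the ambiguity.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["c", "a", "b"])
by_label = s.loc["a"]      # label-based: the value labelled "a"
by_position = s.iloc[0]    # position-based: the first value, labelled "c"

# "a" is both a row label and a column name here.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["a", "z"])
column_a = df["a"]         # plain brackets select the column named "a"
row_a = df.loc["a"]        # .loc selects the row labelled "a"
second_row = df.iloc[1]    # .iloc selects the second row by position

print(by_label, by_position)
print(column_a.tolist(), row_a.tolist(), second_row.tolist())
```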

Pandas supports hierarchical indices with multiple values per data point. An index with this structure, called a "MultiIndex", allows a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel.[4]: 147–148  Each level of a MultiIndex can be given a unique name.[8]: 133  In practice, data with more than 2 dimensions is often represented using DataFrames with hierarchical indices, instead of the higher-dimension Panel and Panel4D data structures.[8]: 128 
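As an illustration, the sketch below stores hypothetical revenue figures under a two-level index with the assumed level names year and quarter, then reshapes them into a pivot-table-like layout.

```python
import pandas as pd

# Two named index levels: every combination of year and quarter.
index = pd.MultiIndex.from_product(
    [[2022, 2023], ["Q1", "Q2"]], names=["year", "quarter"]
)
sales = pd.DataFrame({"revenue": [10, 20, 30, 40]}, index=index)

# Selecting on the outer level returns the rows for that year.
rows_2023 = sales.loc[2023]

# A tuple addresses one combination of both levels.
q2_2023 = sales.loc[(2023, "Q2"), "revenue"]

# Moving the inner level into the columns gives a year-by-quarter table.
wide = sales["revenue"].unstack("quarter")

print(rows_2023)
print(q2_2023)
print(wide)
```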

Criticisms

Pandas has been criticized for its inefficiency. Pandas can require 5 to 10 times as much memory as the size of the underlying data, and the entire dataset must be loaded in RAM. The library does not optimize query plans or support parallel computing across multiple cores. Wes McKinney, the creator of Pandas, has recommended Apache Arrow as an alternative to address these performance concerns and other limitations.[9]

References

  1. "Pandas 2.2.2". 10 April 2024.
  2. "License – Package overview – pandas 1.0.0 documentation". pandas. 28 January 2020. Retrieved 30 January 2020.
  3. Wes McKinney (2011). "pandas: a Foundational Python Library for Data Analysis and Statistics" (PDF). Retrieved 2 August 2018.
  4. McKinney, Wes (2014). Python for Data Analysis (First ed.). O'Reilly. ISBN 978-1-449-31979-3.
  5. Kopf, Dan. "Meet the man behind the most important tool in data science". Quartz. Retrieved 17 November 2020.
  6. "NumFOCUS – pandas: a fiscally sponsored project". NumFOCUS. Retrieved 3 April 2018.
  7. "IO tools (Text, CSV, HDF5, …) — pandas 1.4.1 documentation".
  8. VanderPlas, Jake (2016). Python Data Science Handbook: Essential Tools for Working with Data (First ed.). O'Reilly. ISBN 978-1-491-91205-8.
  9. McKinney, Wes (21 September 2017). "Apache Arrow and the "10 Things I Hate About pandas"". wesmckinney.com. Retrieved 21 December 2023.


    Retrieved from "https://en.wikipedia.org/w/index.php?title=Pandas_(software)&oldid=1231052164"

     



    This page was last edited on 26 June 2024, at 05:42 (UTC).
