StatLib---Datasets Archive

If you have an interesting dataset, or collection of data from a book, please consider submitting the data.

To submit a dataset, please see the submissions guidelines, via

send submissions from general

Some of the entries are shar archives. If you don't know how to deal with a shar archive, send the message "send shar from general" for instructions.

The datasets archive currently contains:

NIST Statistical Reference Datasets (StRD)
A pointer to a NIST site that contains reference datasets for the objective evaluation of the computational accuracy of statistical software. Both users and developers of statistical software can use these datasets to ensure and improve software accuracy.
agresti
Contains data from "An Introduction to Categorical Data Analysis," by Alan Agresti, John Wiley, 1996, plus SAS code for various analyses. (aa@stat.ufl.edu) [28/Feb/96] (12k)
Aldrich_Nelson.zip
This data is used in the following book: Aldrich, J. and Nelson, F. (1984) "Linear Probability, Logit and Probit Models". Series: Quantitative Applications in the Social Sciences. A Sage University Paper. Submitted by hector.romero@iesa.edu.ve. [08/Sep/06] (3.5kbytes)
alr
This file contains data from Applied Linear Regression, 2nd Edition, by Sanford Weisberg, John Wiley, 1985 (sandy@umnstat.stat.umn.edu) (36808 bytes)
analcatdata
A collection of the data sets used in the book "Analyzing Categorical Data," by Jeffrey S. Simonoff, Springer-Verlag, New York, 2003. Submitted by Jeff Simonoff (jsimonof@stern.nyu.edu). [6/Jul/03] (1.2M)
Andrews
This is the data for the book DATA by Andrews and Herzberg. Available by FTP, gopher, and WWW, but not e-mail.
Arsenic
This datafile contains measurements of drinking water and toenail levels of arsenic, as well as related covariates, for 21 individuals with private wells in N.H. Source: Karagas MR, Morris JS, Weiss JE, Spate V, Baskett C, Greenberg ER. Toenail Samples as an Indicator of Drinking Water Arsenic Exposure. Cancer Epidemiology, Biomarkers and Prevention 1996;5:849-852. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98] (5 kbytes)
arsenic.zip
This is a zip file containing the arsenic data from Southwestern Taiwan that was used in the analysis reported by the National Academy of Sciences reports on arsenic (NRC 1999 and 2001), analyzed by Morales et al. (2000) and also discussed by Ryan (2004). See the following for a good summary, plus links to the NAS reports: http://www4.nas.edu/news.nsf/isbn/0309076293?OpenDocument Submitted by Louise Ryan (lryan@hsph.harvard.edu). [19/Dec/03](35 kbytes)
backache
This file contains the `backache in pregnancy' data analysed in Exercise D.2 of Problem-Solving: A Statistician's Guide, 2nd edn., by C. Chatfield, Chapman and Hall, 1995. (cc@maths.bath.ac.uk) [2/Oct/95] (16 kbytes)
balloon
A data set consisting of 2001 observations of radiation, taken from a balloon. The data contain a trend and outliers. Source: Laurie Davies (mata00@de0hrz1a.BITNET) (43k) [5/Feb/93]
bankresearch
A pre-classified dataset containing 11,000 web pages from 11 different categories. Although this dataset was designed for unsupervised clustering experiments, it can be used for any type of web page machine-learning technique. For more information see the BankSearch Dataset Page. Submitted by m.p.sinka@rdg.ac.uk. [11/Feb/03] (64M)
baseball
Data on the salaries of North American Major League Baseball players. The dataset has performance and salary information on players during the 1986 season. This was the 1988 ASA Graphics Section Poster Session dataset, organised by Lorraine Denby. There are two files to retrieve:
baseball.data
consists of a shar archive of the data and helpful information including a description of the data, pitcher, hitter, and team statistics (54448 bytes)
baseball.corr
A set of differences from the published data set (in Unix diff format)
baseball.hoaglin-velleman
Another set of differences from the published data set (in Unix diff format). See Hoaglin and Velleman, The American Statistician, August, 1994, pages 227--285.
BCSgames.txt
This file contains the games and results from college football's 2004, 2005, and 2006 seasons in which two teams that were ranked in the top 25 of the Bowl Championship Series (BCS) for a given week played each other. Submitted by sbuchman AT stat.cmu.edu. [4/Apr/2008](6.7 Kbytes)
biomed
I was able to find the old 1982 "biomedical dataset" generated by Larry Cox. It consists of two groups. These give observation number, blood id number, age, date, and four blood measurements. I don't really remember the instructions for analysis, although I seem to recall that the idea was to figure out if some of the blood measurements that were less difficult to obtain were as good at distinguishing carriers from normals as the more difficult measurements. Unfortunately, I don't remember which measurement is which. There are two files to retrieve:
biomed.desc
a short description of the data and a reference (1457 bytes)
biomed.data
A shar archive containing the data for carriers and normals. (7843 bytes)
bodyfat
Lists estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 men. Submitted by Roger Johnson (rwjohnso@silver.sdsmt.edu) [2/Oct/95](35 kbytes)
bolts
Data from an experiment on the effects of machine adjustments on the time to count bolts. Data appear as the STATS (Issue 10) Challenge. Submitted by W. Robert Stephenson (wrstephe@iastate.edu). [8/Nov/93] (5k)
boston
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. (51256 bytes)
boston_corrected
This consists of the Boston house price data of Harrison and Rubinfeld (1978) JEEM with corrections and augmentation of the data with the latitude and longitude of each observation. Submitted by Kelley Pace (kpace@unix1.sncc.lsu.edu). [11/Oct/99] (62136 bytes)
business
Link to data from two case study books, Basic Business Statistics and Business Analysis Using Regression, by Foster, Stine and Waterman. Published by Springer-Verlag (1998).
cars
This was the 1983 ASA Data Exposition dataset. The dataset was collected by Ernesto Ramos and David Donoho and dealt with automobiles. I don't remember the instructions for analysis. Data on mpg, cylinders, displacement, etc. (8 variables) for 406 different cars. The dataset includes the names of the cars. The data are in one file:
cars.data
A shar archive containing files with a description of the cars data, the names of the cars, and the cars data itself. (33438 bytes)
cars.desc
The original instructions for this exposition. (6206 bytes)
chatfield
Data details and listings for 'The Analysis of Time Series' by Chris Chatfield. Submitted by C Chatfield (cc@maths.bath.ac.uk). [5/Jun/03] (19kbytes)
cloud
These data are those collected in a cloud-seeding experiment in Tasmania. The rainfalls are period rainfalls in inches. TE and TW are the east and west target areas respectively, while NC, SC and NWC are the corresponding rainfalls in the north, south and north-west control areas respectively. S = seeded, U = unseeded. Submitted by Alan Miller (alan@dmsmelb.mel.dms.CSIRO.AU) [4/May/94] (7 kbytes)
chscase
A collection of the data sets used in the book "A Casebook for a First Course in Statistics and Data Analysis," by Samprit Chatterjee, Mark S. Handcock and Jeffrey S. Simonoff, John Wiley and Sons, New York, 1995. Submitted by Samprit Chatterjee (schatterjee@stern.nyu.edu), Mark Handcock (mhandcock@stern.nyu.edu) and Jeff Simonoff (jsimonoff@stern.nyu.edu). (325 kbytes) Updated, [1/Dec/95]
christensen
Contains the data from "Analysis of Variance, Design, and Regression: Applied Statistical Methods" by Ronald Christensen (1996, Chapman and Hall). Ronald Christensen (fletcher@math.unm.edu), [22/Oct/96] (57k)
christensen-llm
Contains data from "Log-Linear Models and Logistic Regression, Second Edition" by Ronald Christensen (1997, Springer Verlag). Ronald Christensen (fletcher@stat.unm.edu), [24/Mar/97] (33k)
cjs.sept95.case
Data on tree growth used in the Case Study published in the September, 1995 issue of the Canadian Journal of Statistics. Nancy Reid (reid@utstat.utstat.toronto.edu) [4/Oct/95] (141k)
colleges
1995 Data Analysis Exposition sponsored by the Statistical Graphics Section of the American Statistical Association. The U.S. News data contains information on tuition, etc., for over 1300 schools, while the AAUP data includes average salary, etc. Robin Lock, (rlock@vm.stlawu.edu).
collins
Data derived from an analysis of the Brown and Frown corpora and used for my doctoral dissertation titled "Variations in Written English: Characterizing Authors' Rhetorical Language Choices Across Corpora of Published Texts" (Jeff Collins, Carnegie Mellon Univ, May 2003). Submitted by Jeff Collins (jeff.collins@acm.org). [14/Jul/03] (112k)
confidence
This file contains the monthly frequencies for six consumer confidence items collected by the Conference Board and the University of Michigan in 1992. Reference in Sociological Methodology. Submitted by Gordon Bechtel (BECHTEL@NERVM.NERDC.UFL.EDU) [22/Oct/96] (6k)
Congdon.zip
Data and WINBUGS programs for the Wiley Publication "Bayesian Statistical Modelling" (2001), ISBN: 0-471-49600-6, submitted by Peter Congdon (p.congdon@qmul.ac.uk). [14/Sep/01] (433k)
CongdonABM
Data and WINBUGS programs for the Wiley Publication "Applied Bayesian Modelling" (2003), ISBN: 0-471-48695-7, submitted by Peter Congdon (p.congdon@qmul.ac.uk). [14/Dec/04] (574k)
CongdonBMCD
Data and WINBUGS programs for the Wiley Publication "Bayesian Models for Categorical Data" (2005), ISBN: 0-470-09237-8, submitted by Peter Congdon (p.congdon@qmul.ac.uk). [3/Jun/05] (510k)
CPS_85_Wages
These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. Source: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98] (23 kbytes)
csb
See the separate csb collection for data from the book "Case Studies in Biometry".
detroit
Data on annual homicides in Detroit, 1961-73, from Gunst & Mason's book `Regression Analysis and its Application', Marcel Dekker. Contains data on 14 relevant variables collected by J.C. Fisher. (alan@dmsmelb.mel.dms.csiro.au) [10/Feb/92] (3357 bytes)
diggle
Data-sets from Diggle, P.J. (1990). Time Series : A Biostatistical Introduction. Oxford University Press. Submitted by Peter Diggle, (maa026@central1.lancaster.ac.uk) (35800 bytes)
dirtbike
This data set collects the specifications of off-road motorcycles sold in the USA. Submitted by lu.jjane@gmail.com. [19/Jul/05] (16kbytes)
disclosure
Data-sets from Fienberg, S.E., Makov, U.E. and Sanil, A.P (1994). A Bayesian Approach to Data Disclosure: Optimal Intruder Behavior for Continuous Data. Submitted by S.E. Fienberg (fienberg@stat.cmu.edu) [4/Jun/98] (111 kbytes)
djdc0093
Dow-Jones Industrial Average (DJIA) closing values from 1900 to 1993. See also spdc2693. Submitted by Eduardo Ley (edley@eco.uc3m.es) [13/Mar/96] (383 kbytes)
DJ 1985-2003
For each stock of the Dow Jones 30, daily quotations from 1985 up to and including Oct. 30th, 2003, with open/close and adjusted close values, min./max. values, and volume. The data are the result of merging different CSV files downloaded from www.yahoo.com, which I did to write my dissertation at the Dept. of Computer Science of the University of Turin, Italy, with Prof. Rosa Meo. The data format is as follows: ID (Unique Identifier); date; open; high; low; close; volume; adjclose; ticker. Submitted by Danilo Careggio (careggio.danilo@tiscali.it) [6/Feb/04] (2.3M)
EGViolators
Data on cars speeding at the southern end of the NJ Turnpike, and whether black drivers speed much more frequently than do white drivers. It is the basis for a paper to appear in Law, Probability and Risk, by Kadane and Lamberth. [28/Apr/09] (760Kbytes)
employment
Data from two cases described in the paper "Hierarchical Models for Employment Decisions," by Kadane and Woodworth. A constant number of days has been subtracted from each date to preserve confidentiality. CaseK and CaseW. Submitted by George Woodworth (george-woodworth@uiowa.edu). [4/Dec/01] (25 kbytes)
federalistpapers.txt
Data set used in an analysis of The Federalist papers, including the disputed texts. (publication info to follow). Submitted by Jeff Collins (jeff.collins@acm.org). [31/Oct/02] (8.6k)
fienberg
The data from Fienberg's "The Analysis of Cross-Classified Data", in a form that can easily be read into Glim (or easily read by a human). [25/Sept/91] (mikem@stat.cmu.edu) (14398 bytes).
fl2000.txt
County data from the 2000 Presidential Election in Florida. For each of the 67 Florida counties, the data include the type of voting machine used, the number of columns in the presidential ballot, the undervote, the overvote, and the official certified votes for each of the twelve presidential candidates. Of particular interest are the Buchanan vote in Palm Beach county, and the overvote as a function of voting machine type and number of columns (see Agresti and Presnell, "Misvotes, Undervotes, and Overvotes: The 2000 Presidential Election in Florida," Statistical Science, Vol. 17, No. 4, 1-5, 2002). Submitted by presnell@stat.ufl.edu. [28/Jan/03] (8.0kbytes)
fraser-river
Time series of monthly flows for the Fraser River at Hope, B.C. A. Ian McLeod (aim@julian.uwo.ca) [26/April/93] (10 kbytes)
hba1c_bloodGlucose
These data represent glycosylated hemoglobin (hba1c) readings reported in DCCT percentages and random blood glucose (rbg) readings reported in mmol/l. The readings are derived from 349 diabetic patients attending a hospital outpatient department at the Karl Bremer District Hospital in Western Cape, South Africa. The original data were published as a scatter plot in a Masters thesis (p.12): Daramola O.F. (2012). Assessing the validity of random blood glucose testing for monitoring glycemic control and predicting HbA1c values in type 2 diabetics at Karl Bremer hospital. Masters Thesis [Family Medicine and Primary Care]. Stellenbosch University: Stellenbosch, South Africa. http://scholar.sun.ac.za/handle/10019.1/80458 Submitted by Daniel D Reidpath (daniel.reidpath@monash.edu) [10/13/14] (11kbytes)
hip
This is the hip measurement data from Table B.13 in Chatfield's Problem Solving (1995, 2nd edn, Chapman and Hall). It is given in 8 columns. First 4 columns are for Control Group. Last 4 columns are for Treatment group (Note there is no pairing. Patient 1 in Control Group is NOT patient 1 in Treatment Group). (cc@maths.bath.ac.uk) [28/Feb/96] (2k)
houses.zip
These spatial data contain 20,640 observations on housing prices with 9 economic covariates. It appeared in Pace and Barry (1997), "Sparse Spatial Autoregressions", Statistics and Probability Letters. Submitted by Kelley Pace (kpace@unix1.sncc.lsu.edu). [9/Nov/99] (536 kbytes)
humandevel
United Nations Development Program, Human Development Index. A nation's HDI is composed of life expectancy, adult literacy and Gross National Product per capita. Information on 130 countries plus documentation. (arnold@stat.ncsu.edu (Tim Arnold)) [31/Oct/91] (10031 bytes).
hutsof99
Data from "The Multivariate Social Scientist --- Introductory Statistics Using Generalized Linear Models" by Graeme D. Hutcheson and Nick Sofroniou, SAGE Publications, 1999, plus GLIM 4 code for various analyses. Submitted by Nick Sofroniou (nso@gcal.ac.uk) [12/Jul/99] (56k)
IQ_Brain_Size
This datafile contains 20 observations (10 pairs of twins) on 9 variables. This data set can be used to demonstrate simple linear regression and correlation. Source: Tramo MJ, Loftus WC, Green RL, Stukel TA, Weaver JB, Gazzaniga MS. Brain Size, Head Size, and intelligence quotient in Monozygotic Twins. Neurology 1998;50:1246-1252. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98][28/Nov/01] (5 kbytes)
irish.ed
Longitudinal educational transition data set for a sample of 500 Irish students, with 4 independent variables (sex, verbal reasoning score, father's occupation, type of school). Submitted by Adrian E. Raftery (raftery@stat.washington.edu), [20/Dec/93] (13 kbytes)
kidney
Data from McGilchrist and Aisbett, Biometrics 47, 461-66, 1991. Times to infection, at the point of insertion of the catheter, for kidney patients using portable dialysis equipment. There are 2 observations on each of 38 patients. The data has been used to illustrate random effects (frailty) models for survival data. Submitted by therneau@Mayo.EDU (Terry Therneau), [10/Jun/99] (4kbytes)
lmpavw
Time series used in "Long-Memory Processes, the Allan Variance and Wavelets" by D. B. Percival and P. Guttorp, a chapter in "Wavelets in Geophysics", edited by E. Foufoula-Georgiou and P. Kumar, Academic Press, 1994. This "time" series was collected by Mike Gregg, Applied Physics Laboratory, University of Washington, and is a measurement of vertical shear (in units of 1/seconds) versus depth (in units of meters) in the ocean. The role of "time" in this series is thus played by depth. Permission has been obtained to redistribute this data. Questions concerning this series should be sent to Don Percival (dbp@apl.washington.edu). [6/Feb/94] (62 kbytes)
longley
The infamous Longley data, "An appraisal of least-squares programs from the point of view of the user", JASA, 62(1967) p819-841. (therneau@mayo.edu) (1301 bytes)
LPR
"In Geressy, Rotolo and Jackson v. Digital Equipment (980 F. Supp. 640 (E.D.N.Y. 1997)), Judge Weinstein presented an innovative method for determining remittiturs. A remittitur is a ruling from a judge, in response to the motion of a defendant found liable, such that the plaintiff can choose between reducing the amount granted by the jury for indemnification or a new trial. This ruling is valid when the amount granted by the jury is unreasonable.
In order to define what is unreasonable compensation, Judge Weinstein collected a pool of 86 cases in which there were awards for pain and suffering. Next, he decided if the damage dealt to each of the 3 plaintiffs was comparable to that in each case in the pool. This database is composed of Judge Weinstein's decision on comparability and information about the 86 cases in the pool." The zip archive contains the following files:
LPR-database.csv - Is the database.
LPR-database-variables.txt - Is a glossary explaining the variables in the database.
LPR-database-description.txt - Is the description of the database.
(3.2kb) Submitted by Rafael Stern (rafaelst@stat.cmu.edu) [03/05/12]
lupus
87 persons with lupus nephritis. Followed up 15+ years. 35 deaths. Var = duration of disease. Over 40 baseline variables available from authors. Submitted by Todd Mackenzie (tmacke@po-box.mcgill.ca) (4k)
hipel-mcleod
McLeod Hipel Time Series Datasets Collection. The shar file, mhsets.shar, contains over 300 time series datasets taken from various case studies. These data sets are suitable for model building exercises such as are discussed in our textbook, "Time Series Modelling of Water Resources and Environmental Systems" by K.W. Hipel and A.I. McLeod (1994), published by Elsevier, Amsterdam, ISBN 0-444-89270-2 (1013 pages). For PC users there is also a zip file, mhsets.zip. The shar file and the zip file are about 1.7 Mb and 0.5 Mb respectively. [1/Mar/95] Ian McLeod (aim@fisher.stats.uwo.ca)
mu284
This file contains the data in "The MU284 Population" from Appendix B of the book "Model Assisted Survey Sampling" by Sarndal, Swensson and Wretman, published by Springer-Verlag, New York, 1992. The data set contains 284 observations on 11 variables, plus a line with variable names. Please consult the mentioned appendix for more information about this data set. [24/Mar/97][16/Mar/06] Esbjorn Ohlsson (esbj@matematik.su.se) (16k)
newton_hema
Data on fluctuating proportions of marked cells in marrow from heterozygous Safari cats--from a study of early hematopoiesis. Michael Newton (newton@stat.wisc.edu) [8/Nov/93] (5k)
nflpass
Lists all-time NFL passers through 1994 by the NFL passing efficiency rating. Associated passing statistics from which this rating is computed are included. Roger W. Johnson, rjohnso@silver.sdsmt.edu [28/Feb/96] (8k)
NLTCS
This data set is an extract from the National Long Term Care Survey (NLTCS). 16 binary variables in the extract are functional disability measures: 6 activities of daily living and 10 instrumental activities of daily living, pooled over 1982, 1984, 1989, and 1994 waves of the survey. The Center for Demographic Studies, Duke University, gave its permission to redistribute the 2^16 extract via placement on StatLib under the NLTCS Data Use Agreement. If you download the data, please provide the Center for Demographic Studies, Duke University, with your name and contact information (e-mail NLTCS@cds.duke.edu).(149k)
NO2
The data are a subsample of 500 observations from a data set that originates in a study where air pollution at a road is related to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration. The response variable (column 1) consists of hourly values of the logarithm of the concentration of NO2 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables (columns 2 to 8) are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001. Submitted by Magne Aldrin (magne.aldrin@nr.no). [28/Jul/04] (19kbytes)
nonlin
The data sets from Bates and Watts (1988) "Nonlinear Regression Analysis and Its Applications", Wiley. They are in S dump format as data frames. (If you don't know what a data frame is, don't worry. Just consider them to be lists. Data frames are described in a book on "Statistical Modelling in S".) (bates@stat.wisc.edu) [7/Feb/90] (19851 bytes)
papir
This file contains two multivariate regression data sets from paper industry, described in Aldrin, M. (1996), "Moderate projection pursuit regression for multivariate response data", Computational Statistics and Data Analysis, 21, 501-531. Submitted by Magne Aldrin (magne.aldrin@nr.no) [14/Apr/99] (17916 bytes)
pbc
The data set found in appendix D of Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991. Submitted by therneau@Mayo.EDU (Terry Therneau), [25/Jul/94] (36 kbytes)
pbcseq
A follow-up to the PBC data set, this contains the data for both the baseline and subsequent visits at 6 months, 1 year, and annually thereafter. There are 1945 observations on 312 subjects. Submitted by therneau@Mayo.EDU (Terry Therneau), [10/Jun/99] (160kbytes)
places
Data taken from the Places Rated Almanac, giving the ratings on 9 composite variables of 329 locations. (From an ASA data exposition, 1986) The data are in one file:
places.data
A shar archive of three files which document the data, present the data itself, and provide a key to the actual places used. (27720 bytes)
Plasma_Retinol
This datafile (N=315) investigates the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids. Analysis unpublished. Related paper: Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER. Determinants of plasma levels of beta-carotene and retinol. American Journal of Epidemiology 1989;130:511-521.(Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98][28/Nov/01] (26 kbytes)
PM10
The data are a subsample of 500 observations from a data set that originates in a study where air pollution at a road is related to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration. The response variable (column 1) consists of hourly values of the logarithm of the concentration of PM10 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables (columns 2 to 8) are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001. Submitted by Magne Aldrin (magne.aldrin@nr.no). [28/Jul/04] (19kbytes)
pollen
Synthetic dataset about the geometric features of pollen grains. There are 3848 observations on 5 variables. From the 1986 ASA Data Exposition dataset, made up by David Coleman of RCA Labs. The data are in one file:
pollen.data
A shar archive of 9 files. The first file gives a short description of the data, then there are 8 data files, each with 481 observations. (205954 bytes)
pollen.extra
Some extra comments about the data. Look here for hints.
pollution
This is the pollution data so loved by writers of papers on ridge regression. Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol.15, 463-482. (8540 bytes)
profb
Scores and point spreads for all NFL games in the 1989-91 seasons. Contributed by Robin Lock (rlock@stlawu.bitnet) [15/Sept/92] (27733 bytes)
prnn
This shar archive contains the datasets used in `Pattern Recognition and Neural Networks' by B.D. Ripley, Cambridge University Press (1996), ISBN 0 521 46086 7 (ripley@stats.ox.ac.uk) [1/Dec/95] (101 kbytes)
rabe
This file contains data from Regression Analysis By Example, 2nd Edition, by Samprit Chatterjee and Bertram Price, John Wiley, 1991. (schatter@stern.nyu.edu) [6/Feb/92] (40309 bytes)
rir
This file contains data from Residuals and Influence in Regression, R. Dennis Cook and Sanford Weisberg, Chapman and Hall, 1982. (sandy@umnstat.stat.umn.edu) (5206 bytes). [Updated 25/May/93]
riverflow
Datasets mentioned in "Parsimony, Model Adequacy and Periodic Correlation in Time Series Forecasting", ISI Review, A.I. McLeod (1992, to appear). Submitted by A.Ian McLeod (aim@stats.uwo.ca). Time series data. A shar archive. [22/Jan/92] (294052 bytes).
rmftsa
Data Sets for "Regression Models for Time Series Analysis" by B. Kedem and K. Fokianos, Wiley 2002. Submitted by Kostas Fokianos (fokianos@ucy.ac.cy) [8/Nov/02] (176k)
sapa
Time series used in "Spectral Analysis for Physical Applications" by D. B. Percival and A. T. Walden, Cambridge University Press, 1993. (dbp@apl.washington.edu) [4/Nov/92] (50788 bytes)
saubts
Two ocean wave time series used in "Spectral Analysis of Univariate and Bivariate Time Series" by D. B. Percival, Chapter 11 of "Statistical Methods for Physical Science," edited by J. L. Stanford and S. B. Vardeman, Academic Press, 1993. (dbp@apl.washington.edu) [14/Apr/93] (47 kbytes)
schizo
Schizophrenic Eye-Tracking Data in Rubin and Wu (1997) Biometrics. Yingnian Wu (wu@hustat.harvard.edu) [14/Oct/97] (21k)
sensory
Data for the sensory evaluation experiment in Brien, C.J. and Payne, R.W. (1996) Tiers, structure formulae and the analysis of complicated experiments. Submitted for publication. Chris Brien (matcjb@ntx.city.unisa.edu.au) [22/Oct/96] (19k)
ships
Ship damage data, from "Generalized Linear Models" by McCullagh and Nelder, section 6.3.2, page 137. (therneau@mayo.edu) (1709 bytes)
sleuth
Contains 110 data sets from the book "The Statistical Sleuth" by Fred Ramsey and Dan Schafer; Duxbury Press, 1997. (schafer@stat.orst.edu) [14/Oct/97] (172k)
sleep
Data from which conclusions were drawn in the article "Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976), _Science_, November 12, vol. 194, pp. 732-734. Includes brain and body weight, life span, gestation time, time sleeping, and predation and danger indices for 62 mammals. Submitted by Roger Johnson (rwjohnso@silver.sdsmt.edu) [27/Jul/94] (8k)
smoothmeth
A collection of the data sets used in the book "Smoothing Methods in Statistics," by Jeffrey S. Simonoff, Springer-Verlag, New York, 1996. Submitted by Jeff Simonoff (jsimonoff@stern.nyu.edu). [13/Mar/96] (242kbytes)
socmob
Social Mobility (US, 1973). Two four-way 17x17x2x2 contingency tables: Father's occupation, Son's occupation (first and current), family structure, race. Submitted by Timothy J. Biblarz (biblarz@uscvm.bitnet). [corrected 25/Jan/93]
space_ga
Election data including spatial coordinates on 3,107 US counties. Used in Pace and Barry (1997), Geographical Analysis, Volume 29, 1997, p. 232-247. Submitted by Kelley Pace (kpace@unix1.sncc.lsu.edu). [3/Nov/99] (548 kbytes)
spdc2693
Standard and Poor's 500 Index closing values from 1926 to 1993. See also djdc0093. Submitted by Eduardo Ley (edley@eco.uc3m.es) [13/Mar/96] (333 kbytes)
stanford
Two versions of the Stanford Heart Transplant Data, one from "The Statistical Analysis of Failure Time Data" by Kalbfleisch and Prentice, Appendix I, pages 230-232, the other from the original paper by Crowley and Hu. (therneau@mayo.edu) (15003 bytes) [Corrected, 8/Mar/93]
stanford.diff
The differences between the two Stanford data sets.
strikes
Data on industrial disputes and their covariates in 18 OECD countries, 1951-1985. Prepared by Bruce Western (western@datacomm.iue.it) [2/Oct/95] (44k)
tecator
The task is to predict the fat content of a meat sample on the basis of its near infrared absorbance spectrum. Regression. Submitted by thodberg@nn.meatre.dk (Hans Henrik Thodberg) [23/Jan/95] (302 kbytes)
transplant
Data on deaths within 30 days of heart transplant surgery at 131 U.S. hospitals. See Bayesian Biostatistics, D. Berry & D. Stangl, eds, 1996, Marcel Dekker. Cindy L. Christiansen and Carl N. Morris. (Cindy Christiansen) [22/Oct/96] (3k)
tsa
Software and Data Sets for "Time Series Analysis and Its Applications" by R.H. Shumway & D.S. Stoffer, Springer, 2000. Submitted by David Stoffer (stoffer@stat.pitt.edu)[10/Mar/00]
tumor
Tumor recurrence data for patients with bladder cancer. Taken from Wei, Lin and Weissfeld, JASA 1989, p 1067. From: therneau@mayo.edu (Terry Therneau) [23/Mar/93] [5/Jun/96] (3k)
veteran
Veteran's Administration Lung Cancer Trial. Taken from Kalbfleisch and Prentice, pages 223-224 (therneau@mayo.edu) (8249 bytes)
visualizing.data
This zip file contains 25 data sets from the book Visualizing Data published by Hobart Press (books@hobart.com) and written by William S. Cleveland (wsc@research.att.com). There is also a README file, so there are 26 files in all. Each of the 25 files has the data in an ASCII comma-separated format. The name of each data file is the name of the data set used in the book. To find the description of the data set in the book, look under the entry "data, name" in the index. For example, one data set is barley. To find the description of barley, look in the index under the entry "data, barley". The S archive of StatLib has a file created by S that contains the data sets in a format that makes it easy to read them into S. [12/Nov/93][17/Oct/94][23/Oct/97]
wind
Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland (Haslett and Raftery, Applied Statistics 1989). There is a LARGE amount of data. Please be sure you want it before you ask for it!! There are two entries to obtain.
wind.desc
A short description of the data (815 bytes)
wind.data
The data (532494 bytes).
wind.correlations
Estimated correlations between daily 3 p.m. wind measurements during September and October 1997 for a network of 45 stations in the Sydney region. From Nott and Dunsmuir, "Analysis of Spatial Covariance Structure from Monitoring Data," Technical Report, Department of Statistics, University of New South Wales. Submitted by David Nott (djn@maths.unsw.edu.au). [8/Mar/00] (13 kbytes)
witmer
A shar archive of data from the book Data Analysis: An Introduction (1992), Prentice Hall, by Jeff Witmer. Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [28/Jun/94] (29 kbytes)
wseries
These data tell whether or not the home team won for each game played in all World Series prior to 1994. The data appear as the STATS Challenge for Issue 11. Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu) [20/Mar/94] (3 kbytes)
Vinnie.Johnson
Data on the shooting of Vinnie Johnson of the Detroit Pistons during the 1985-1986 through 1988-1989 seasons. Source was the New York Times. Submitted by Rob Kass (kass@stat.cmu.edu) [18/Aug/95] (26 kbytes)
submissions
Information on how to submit data to this archive.
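
The NO2 and PM10 entries above describe the same eight-column layout: a log-concentration response followed by seven predictors. As a rough sketch of how one row might be read, the Python below assumes whitespace-separated numeric fields with no header line and a natural logarithm. The entries state none of these, and the column names are hypothetical labels chosen here for illustration.

```python
import math

# Hypothetical labels for the eight columns described in the NO2 and PM10
# entries; the files themselves may not use these names.
COLUMNS = [
    "log_concentration",    # column 1: log of the NO2 or PM10 concentration
    "log_cars_per_hour",    # column 2: log of the number of cars per hour
    "temp_2m_c",            # column 3: temperature 2 meters above ground
    "wind_speed_ms",        # column 4: wind speed in meters/second
    "temp_diff_25m_2m_c",   # column 5: temperature difference, 25 m vs 2 m
    "wind_direction_deg",   # column 6: wind direction, 0 to 360 degrees
    "hour_of_day",          # column 7
    "day_number",           # column 8: days counted from October 1, 2001
]


def parse_row(line):
    """Turn one whitespace-separated line into a dict keyed by COLUMNS."""
    values = [float(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(COLUMNS), len(values))
        )
    row = dict(zip(COLUMNS, values))
    # Assumption: the logarithm is natural. The entries only say "logarithm",
    # so check the data documentation before relying on this back-transform.
    row["concentration"] = math.exp(row["log_concentration"])
    return row


if __name__ == "__main__":
    # A made-up line for illustration only, not taken from the data.
    sample_line = "3.7 7.6 9.2 4.8 -0.1 74.4 20 600"
    print(parse_row(sample_line))
```

If the files turn out to be comma-separated or to carry a header line, only the `line.split()` step and the handling of the first line would need to change.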

Other Sources

For WWW (Mosaic, Netscape, etc.) users, here is a set of links to other sources of data. These sources are not kept on StatLib, and we have no control over them. If you find a link is consistently not working, let us know.
Time Series Data Library
Rob Hyndman's collection of over 500 time series organized by subject.
EconData
Several hundred thousand economic time series, produced by the U.S. Government and distributed by the government in a variety of formats and media, have been put into a standard, highly efficient, easy-to-use form for personal computers.
Oceanographic & Earth Science Data
From Scripps Institution of Oceanography Library
The Data Zoo
California coastal data collection
Journal of Statistics Education Information Service
Also has some data

Credit where credit is due

If you use an algorithm, dataset, or other information from StatLib, please acknowledge both StatLib and the original contributor of the material.
Last modified: Tue Jul 19 09:03:38 EDT 2005 By Pantelis Vlachos