StatLib---Datasets Archive
If you have an interesting dataset, or collection of data from a book,
please consider submitting the data.
To submit a dataset, please see the submissions guidelines, via
send submissions from general
Some of the entries are shar archives. If you don't know how to deal
with a shar archive, send the message
send shar from general
for instructions.
The datasets archive currently contains:
- NIST Statistical
Reference Datasets (StRD)
- A pointer to a NIST site that
contains reference datasets for the objective evaluation of the
computational accuracy of statistical software. Both users and
developers of statistical software can use these datasets to ensure
and improve software accuracy.
- agresti
- Contains data from "An Introduction to Categorical
Data Analysis," by Alan Agresti, John Wiley, 1996, plus SAS
code for various analyses. (aa@stat.ufl.edu)
[28/Feb/96] (12k)
- Aldrich_Nelson.zip
- This data is used in the following book:
Aldrich, J. and Nelson, F. (1984) "Linear Probability, Logit and Probit
Models".
Series: Quantitative Applications in The Social Sciences. A Sage
University
Paper. Submitted by hector.romero@iesa.edu.ve. [08/Sep/06] (3.5kbytes)
- alr
- This file contains data from Applied Linear
Regression, 2nd Edition, by Sanford Weisberg, John
Wiley, 1985 (sandy@umnstat.stat.umn.edu) (36808 bytes)
- analcatdata
- A collection of the data sets used in the book
"Analyzing Categorical Data," by Jeffrey S. Simonoff,
Springer-Verlag, New York, 2003. Submitted by Jeff
Simonoff (jsimonof@stern.nyu.edu). [6/Jul/03] (1.2M)
- Andrews
- This is the data for the book DATA by Andrews and Herzberg.
Available by FTP, gopher, WWW, but not e-mail.
- Arsenic
- This datafile contains measurements of drinking water and toenail
levels of arsenic, as well as related covariates, for 21
individuals with private wells in N.H. Source: Karagas MR, Morris
JS, Weiss JE, Spate V, Baskett C, Greenberg ER. Toenail Samples as
an Indicator of Drinking Water Arsenic Exposure. Cancer
Epidemiology, Biomarkers and Prevention
1996;5:849-852. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format)
[21/Jul/98] (5 kbytes)
- arsenic.zip
- This is a zip file containing the arsenic data from Southwestern
Taiwan that was used in the analysis reported by the National Academy of
Sciences reports on arsenic (NRC 1999 and 2001), analyzed by Morales et al.
(2000) and also discussed by Ryan (2004). See the following for a good
summary, plus links to the NAS reports:
http://www4.nas.edu/news.nsf/isbn/0309076293?OpenDocument
Submitted by Louise Ryan (lryan@hsph.harvard.edu). [19/Dec/03](35 kbytes)
- backache
- This file contains the `backache in pregnancy' data
analysed in Exercise D.2 of Problem-Solving: A Statistician's
Guide, 2nd edn., by C. Chatfield, Chapman and Hall, 1995.
(cc@maths.bath.ac.uk) [2/Oct/95] (16 kbytes)
- balloon
- A data set consisting of 2001 observations of
radiation, taken from a balloon. The data contain a
trend and outliers. Source: Laurie Davies
(mata00@de0hrz1a.BITNET) (43k) [5/Feb/93]
- bankresearch
- A pre-classified dataset containing 11,000 web pages from
11 different categories. Although this dataset was designed for
unsupervised clustering experiments, it can be used for any type
of web page machine-learning technique. For more information see BankSearch Dataset Page.
Submitted by m.p.sinka@rdg.ac.uk. [11/Feb/03] (64M)
- baseball
- Data on the salaries of North American Major League
Baseball players. The dataset has performance and
salary information on players during the 1986 season.
This was the 1988 ASA Graphics Section Poster Session
dataset, organised by Lorraine Denby. There are two
files to retrieve:
- baseball.data
- consists of a shar archive of the data and
helpful information including a description of
the data,
pitcher, hitter, and team statistics (54448 bytes)
- baseball.corr
- A set of differences from the
published data set (in Unix diff format)
- baseball.hoaglin-velleman
- Another set of differences from the
published data set (in Unix diff format)
See Hoaglin and Velleman, The American Statistician,
August, 1994, pages 227--285
- BCSgames.txt
- This file contains the games and results from college
football's 2004, 2005, and 2006 seasons in which two teams that were
ranked in the top 25 of the Bowl Championship Series (BCS) for a given
week played each other. Submitted by sbuchman AT
stat.cmu.edu. [4/Apr/2008](6.7 Kbytes)
- biomed
- I was able to find the old 1982 "biomedical dataset"
generated by Larry Cox. It consists of two groups.
These give observation number, blood id number, age,
date, and four blood measurements. I don't really
remember the instructions for analysis, although I
seem to recall that the idea was to figure out if some
of the blood measurements that were less difficult to
obtain were as good at distinguishing carriers from
normals as the more difficult measurements.
Unfortunately, I don't remember which
measurement is which.
There are two files to retrieve:
- biomed.desc
- a short description of the data and a
reference (1457 bytes)
- biomed.data
- A shar archive containing the data
for carriers and normals. (7843 bytes)
- bodyfat
- Lists estimates of the percentage of body fat determined
by underwater weighing and various body circumference
measurements for 252 men. Submitted by Roger Johnson
(rwjohnso@silver.sdsmt.edu) [2/Oct/95](35 kbytes)
- bolts
- Data from an experiment on the effects of machine
adjustments on the time to count bolts. Data appear
as the STATS (Issue 10) Challenge. Submitted by W.
Robert Stephenson (wrstephe@iastate.edu). [8/Nov/93] (5k)
- boston
- The Boston house-price data of Harrison, D. and
Rubinfeld, D.L. 'Hedonic prices and the demand for
clean air', J. Environ. Economics & Management, vol.5,
81-102, 1978. Used in Belsley, Kuh & Welsch,
'Regression diagnostics ...', Wiley, 1980. (51256
bytes)
- boston_corrected
- This consists of the Boston house price data of Harrison and Rubinfeld
(1978) JEEM with corrections and augmentation of the data with the latitude
and longitude of each observation. Submitted by Kelley Pace
(kpace@unix1.sncc.lsu.edu). [11/Oct/99] (62136 bytes)
- business
- Link to data from two case study books;
Basic Business Statistics; and
Business Analysis Using Regression
by Foster, Stine and Waterman. Published by Springer-Verlag (1998)
- cars
- This was the 1983 ASA Data Exposition dataset. The
dataset was collected by Ernesto Ramos and David
Donoho and dealt with automobiles. I don't remember
the instructions for analysis. Data on mpg, cylinders,
displacement, etc. (8 variables) for 406 different
cars. The dataset includes the names of the cars.
The data are in one file:
- cars.data
- A shar archive containing files with
a description of the cars data, the names of
the cars, and the cars data itself. (33438 bytes)
- cars.desc
- The original instructions for this
exposition. (6206 bytes)
- chatfield
- Data details and listings for 'The Analysis of Time
Series' by Chris Chatfield. Submitted by C Chatfield
(cc@maths.bath.ac.uk). [5/Jun/03] (19kbytes)
- cloud
- These data are those collected in a cloud-seeding experiment in Tasmania.
The rainfalls are period rainfalls in inches. TE and TW are the east and
west target areas respectively, while NC, SC and NWC are the corresponding
rainfalls in the north, south and north-west control areas respectively.
S = seeded, U = unseeded.
Submitted by Alan Miller (alan@dmsmelb.mel.dms.CSIRO.AU)
[4/May/94] (7 kbytes)
- chscase
- A collection of the data sets used in the book
"A Casebook for a First Course in Statistics and Data
Analysis," by Samprit Chatterjee, Mark S. Handcock and
Jeffrey S. Simonoff, John Wiley and Sons, New York, 1995.
Submitted by Samprit Chatterjee (schatterjee@stern.nyu.edu),
Mark Handcock (mhandcock@stern.nyu.edu) and
Jeff Simonoff (jsimonoff@stern.nyu.edu). (325 kbytes)
Updated, [1/Dec/95]
- christensen
- Contains the data from "Analysis of Variance, Design,
and Regression: Applied Statistical Methods" by Ronald Christensen
(1996, Chapman and Hall).
Ronald Christensen (fletcher@math.unm.edu), [22/Oct/96] (57k)
- christensen-llm
- Contains data from "Log-Linear Models and Logistic
Regression, Second Edition" by Ronald Christensen (1997, Springer
Verlag). Ronald Christensen (fletcher@stat.unm.edu), [24/Mar/97]
(33k)
- cjs.sept95.case
- Data on tree growth used in the Case Study published in the
September, 1995 issue of the Canadian Journal of Statistics.
Nancy Reid (reid@utstat.utstat.toronto.edu)
[4/Oct/95] (141k)
- colleges
- 1995 Data Analysis Exposition
sponsored by the Statistical Graphics Section of the American
Statistical Association. The U.S. News data contains information
on tuition, etc., for over 1300 schools, while the AAUP data
includes average salary, etc.
Robin Lock, (rlock@vm.stlawu.edu).
- collins
- Data derived from an analysis of the Brown and Frown corpora and
used for my doctoral dissertation titled ``Variations in Written
English: Characterizing Authors' Rhetorical Language Choices
Across Corpora of Published Texts" (Jeff Collins, Carnegie Mellon
Univ, May 2003). Submitted by Jeff Collins
(jeff.collins@acm.org). [14/Jul/03] (112k)
- confidence
- This file contains the monthly frequencies for six consumer
confidence items collected by the Conference Board and the
University of Michigan in 1992. Reference in Sociological
Methodology. Submitted by Gordon Bechtel (BECHTEL@NERVM.NERDC.UFL.EDU)
[22/Oct/96] (6k)
- Congdon.zip
- Data and WINBUGS programs for the Wiley Publication "Bayesian
Statistical Modelling" (2001), ISBN: 0-471-49600-6, submitted by
Peter Congdon (p.congdon@qmul.ac.uk). [14/Sep/01] (433k)
- CongdonABM
- Data and WINBUGS programs for the Wiley Publication "Applied
Bayesian Modelling" (2003), ISBN: 0-471-48695-7, submitted by
Peter Congdon (p.congdon@qmul.ac.uk). [14/Dec/04] (574k)
- CongdonBMCD
- Data and WINBUGS programs for the Wiley Publication "Bayesian Models for Categorical Data" (2005), ISBN: 0-470-09237-8, submitted by
Peter Congdon (p.congdon@qmul.ac.uk). [3/Jun/05] (510k)
- CPS_85_Wages
- These data consist of a random sample of 534 persons from the
CPS, with information on wages and other characteristics of the
workers, including sex, number of years of education, years of
work experience, occupational status, region of residence and
union membership. Source: Berndt, ER. The Practice of
Econometrics. 1991. NY:
Addison-Wesley. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98] (23 kbytes)
- csb
- See the separate csb collection for
Data from the book "Case Studies in Biometry".
- detroit
- Data on annual homicides in Detroit, 1961-73, from
Gunst & Mason's book `Regression Analysis and its
Application', Marcel Dekker. Contains data on 14
relevant variables collected by J.C. Fisher.
(alan@dmsmelb.mel.dms.csiro.au) [10/Feb/92]
(3357 bytes)
- diggle
- Data-sets from Diggle, P.J. (1990). Time Series : A
Biostatistical Introduction. Oxford University Press.
Submitted by Peter Diggle, (maa026@central1.lancaster.ac.uk)
(35800 bytes)
- dirtbike
- This data set collects the specifications of off-road motorcycles
sold in USA. Submitted by lu.jjane@gmail.com. [19/Jul/05] (16kbytes)
- disclosure
- Data-sets from Fienberg, S.E., Makov, U.E. and Sanil, A.P
(1994). A
Bayesian Approach to Data Disclosure: Optimal Intruder
Behavior for Continuous Data. Submitted by S.E. Fienberg
(fienberg@stat.cmu.edu) [4/Jun/98] (111 kbytes)
- djdc0093
- Dow-Jones Industrial Average (DJIA) closing values from
1900 to 1993. See also spdc2693.
Submitted by Eduardo Ley (edley@eco.uc3m.es)
[13/Mar/96](383 kbytes)
- DJ 1985-2003
- For each stock of the Dow Jones 30, daily quotations from 1985 up to and including Oct. 30th, 2003, with
open/close and adjusted close values, min./max. values, and volume. The data are the result of merging
several CSV files downloaded from www.yahoo.com, which I assembled for my dissertation at the Dept. of Computer
Science of the University of Turin, Italy, with Prof. Rosa Meo. The data format is as follows: ID (Unique Identifier); date; open;
high; low; close; volume; adjclose; ticker. Submitted by Danilo Careggio (careggio.danilo@tiscali.it)
[6/Feb/04] (2.3M)
- EGViolators
- Data on cars speeding at the
southern end of the NJ turnpike, and whether black drivers speed much more
frequently than do white drivers. It is the basis for a paper to appear in Law,
Probability and Risk, by Kadane and Lamberth. [28/Apr/09] (760Kbytes)
- employment
- Data from two cases described in the paper "Hierarchical Models for
Employment Decisions," by Kadane and
Woodworth. A constant number of
days has been subtracted from each date
to preserve confidentiality. CaseK
and CaseW. Submitted by George Woodworth
(george-woodworth@uiowa.edu). [4/Dec/01] (25 kbytes)
- federalistpapers.txt
- Data set used in an analysis of The Federalist papers,
including the disputed texts. (publication info to follow). Submitted
by Jeff Collins (jeff.collins@acm.org). [31/Oct/02] (8.6k)
- fienberg
- The data from Fienberg's "The Analysis of
Cross-Classified Data", in a form that can easily be
read into Glim (or easily read by a human). [25/Sept/91]
(mikem@stat.cmu.edu) (14398 bytes).
- fl2000.txt
- County data from the 2000 Presidential Election in Florida.
For each of the 67 Florida counties, the data include the type of
voting machine used, the number of columns in the presidential
ballot, the undervote, the overvote, and the official certified
votes for each of the twelve presidential candidates. Of
particular interest are the Buchanan vote in Palm Beach county,
and the overvote as a function of voting machine type and number
of columns (see Agresti and Presnell, "Misvotes, Undervotes, and
Overvotes: The 2000 Presidential Election in Florida,"
Statistical Science, Vol. 17, No. 4, 1-5, 2002). Submitted by
(presnell@stat.ufl.edu). [28/Jan/03] (8.0kbytes)
- fraser-river
- Time series of monthly flows for the Fraser River at
Hope, B.C. A. Ian McLeod (aim@julian.uwo.ca)
[26/April/93] (10 kbytes)
- hba1c_bloodGlucose
- These data represent glycosylated hemoglobin (hba1c) readings reported in DCCT
percentages and random blood glucose (rbg) readings reported in mmol/l. The
readings are derived from 349 diabetic patients attending a hospital out-
patient department at the Karl Bremer District Hospital in Western Cape, South
Africa.
The original data were published as a scatter plot in a Masters thesis (p.12):
Daramola O.F. (2012). Assessing the validity of random blood glucose testing
for monitoring glycemic control and predicting HbA1c values in type 2
diabetics at Karl Bremer hospital. Masters Thesis [Family Medicine and
Primary Care]. Stellenbosch University: Stellenbosch, South Africa.
http://scholar.sun.ac.za/handle/10019.1/80458
Submitted by Daniel D Reidpath (daniel.reidpath@monash.edu) [10/13/14] (11kbytes)
- hip
- This is the hip measurement data from Table B.13 in
Chatfield's Problem Solving (1995, 2nd edn, Chapman and Hall).
It is given in 8 columns. First 4 columns are for Control Group.
Last 4 columns are for Treatment group (Note there is no pairing.
Patient 1 in Control Group is NOT patient 1 in Treatment Group).
(cc@maths.bath.ac.uk) [28/Feb/96] (2k)
- houses.zip
- These spatial data contain 20,640 observations on housing prices
with 9 economic covariates. It appeared
in Pace and Barry (1997), "Sparse Spatial Autoregressions",
Statistics and Probability Letters. Submitted by Kelley Pace
(kpace@unix1.sncc.lsu.edu). [9/Nov/99] (536 kbytes)
- humandevel
- United Nations Development Program, Human Development
Index. A nation's HDI is composed of life expectancy,
adult literacy and Gross National Product per capita.
Information on 130 countries plus documentation.
(arnold@stat.ncsu.edu (Tim Arnold)) [31/Oct/91]
(10031 bytes).
- hutsof99
- Data from "The Multivariate Social Scientist --- Introductory
Statistics Using Generalized Linear Models" by Graeme D. Hutcheson and Nick
Sofroniou, SAGE Publications, 1999, plus GLIM 4 code for various
analyses. Submitted by Nick Sofroniou
(nso@gcal.ac.uk) [12/Jul/99] (56k)
- IQ_Brain_Size
- This datafile contains 20 observations (10 pairs of twins) on 9
variables. This data set can be used to demonstrate simple linear
regression and correlation. Source: Tramo MJ, Loftus WC, Green
RL, Stukel TA, Weaver JB, Gazzaniga MS. Brain Size, Head Size,
and intelligence quotient in Monozygotic Twins. Neurology
1998;50:1246-1252. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98][28/Nov/01] (5 kbytes)
- irish.ed
- Longitudinal educational transition data set for a sample
of 500 Irish students, with 4 independent variables
(sex, verbal reasoning score, father's occupation,
type of school). Submitted by Adrian E. Raftery
(raftery@stat.washington.edu), [20/Dec/93] (13 kbytes)
- kidney
- Data from McGilchrist and Aisbett, Biometrics 47, 461-66, 1991.
Times to infection, at the point of insertion of the catheter, for kidney
patients using portable dialysis equipment. There are 2 observations on
each of 38 patients. The data has been used to illustrate random effects
(frailty) models for survival data. Submitted by therneau@Mayo.EDU
(Terry Therneau), [10/Jun/99] (4kbytes)
- lmpavw
- time series used in "Long-Memory Processes, the Allan Variance and
Wavelets" by D. B. Percival and P. Guttorp, a chapter in "Wavelets
in Geophysics", edited by E. Foufoula-Georgiou and P. Kumar,
Academic Press, 1994
This "time" series was collected by Mike Gregg, Applied Physics
Laboratory, University of Washington, and is a measurement of
vertical shear (in units of 1/seconds) versus depth (in units of meters)
in the ocean. The role of "time" in this series is thus played by depth.
Permission has been obtained to redistribute this data. Questions
concerning this series should be sent to Don Percival
(dbp@apl.washington.edu). [6/Feb/94] (62 kbytes)
- longley
- The infamous Longley data, "An appraisal of
least-squares programs from the point of view of the
user", JASA, 62(1967) p819-841.
(therneau@mayo.edu) (1301 bytes)
- LPR
- "In Geressy, Rotolo and Jackson v. Digital Equipment (980 F. Supp. 640 (E.D.N.Y. 1997)), Judge Weinstein presented an innovative method for determining remittiturs. A remittitur is a ruling from a judge, in response to the motion of a defendant found liable, such that the plaintiff can choose between reducing the amount granted by the jury for indemnification or a new trial. This ruling is valid when the amount granted by the jury is unreasonable.
In order to define what is unreasonable compensation, Judge Weinstein collected a pool of 86 cases in which there were awards for pain and suffering. Next, he decided if the damage dealt to each of the 3 plaintiffs was comparable to that in each case in the pool. This database is composed of Judge Weinstein's decisions on comparability and information about the 86 cases in the pool."
The zip archive contains the following files:
LPR-database.csv - Is the database.
LPR-database-variables.txt - Is a glossary explaining the variables in the database.
LPR-database-description.txt - Is the description of the database.
(3.2kb) Submitted by Rafael Stern(rafaelst@stat.cmu.edu) [03/05/12]
- lupus
- 87 persons with lupus nephritis. Followed up 15+
years. 35 deaths. Var = duration of disease. Over 40 baseline
variables available from authors.
Submitted by Todd Mackenzie (tmacke@po-box.mcgill.ca) (4k)
- hipel-mcleod
- McLeod Hipel Time Series Datasets Collection.
The shar file, mhsets.shar,
contains over 300 time series datasets taken
from various case studies. These data sets are suitable for model building
exercises such as are discussed in our textbook,
"Time Series Modelling of Water
Resources and Environmental Systems" by K.W. Hipel and A.I. McLeod (1994),
published by Elsevier, Amsterdam. 1994. ISBN 0-444-89270-2. (1013 pages).
For PC users there is also a zip file, mhsets.zip.
The shar file and the zip files are about 1.7 Mb and 0.5 Mb respectively.
[1/Mar/95] Ian McLeod (aim@fisher.stats.uwo.ca)
- mu284
- This file contains the data in
"The MU284 Population" from Appendix B of the book "Model Assisted
Survey Sampling" by Sarndal, Swensson and Wretman, published by
Springer-Verlag, New York, 1992. The data set contains 284
observations on 11 variables, plus a line with variable names. Please
consult the mentioned appendix for more information about this data
set.
[24/Mar/97][16/Mar/06] Esbjorn Ohlsson (esbj@matematik.su.se) (16k)
- newton_hema
- Data on fluctuating proportions of marked cells in
marrow from heterozygous Safari cats--from a study of
early hematopoiesis. Michael Newton
(newton@stat.wisc.edu) [8/Nov/93] (5k)
- nflpass
- Lists all-time NFL passers through 1994 by the NFL passing
efficiency rating. Associated passing statistics from which
this rating is computed are included.
Roger W. Johnson, rjohnso@silver.sdsmt.edu
[28/Feb/96] (8k)
- NLTCS
- This data set is an extract from the National Long Term Care Survey
(NLTCS).
16 binary variables in the extract are functional disability measures: 6
activities of daily living and 10 instrumental activities of daily
living,
pooled over 1982, 1984, 1989, and 1994 waves of the survey. The Center
for Demographic Studies, Duke University, gave its
permission to redistribute the 2^16 extract via placement on StatLib
under
the NLTCS Data Use Agreement. If you download the data, please provide
the
Center for Demographic Studies, Duke University, with your name and
contact
information (e-mail NLTCS@cds.duke.edu).(149k)
- NO2
- The data are a subsample of 500 observations from a data set
that originates in a study where air pollution at a road is related
to traffic volume and meteorological variables, collected by the
Norwegian Public Roads Administration. The response variable
(column 1) consists of hourly values of the logarithm of the
concentration of NO2 (particles), measured at Alnabru in Oslo,
Norway, between October 2001 and August 2003. The predictor
variables (columns 2 to 8) are the logarithm of the number of cars
per hour, temperature 2 meters above ground (degree C), wind
speed (meters/second), the temperature difference between 25 and
2 meters above ground (degree C), wind direction (degrees
between 0 and 360), hour of day and day number from October
1, 2001. Submitted by Magne Aldrin (magne.aldrin@nr.no). [28/Jul/04] (19kbytes)
- nonlin
- The data sets from Bates and Watts (1988) "Nonlinear
Regression Analysis and Its Applications", Wiley.
They are in S dump format as data frames. (If you
don't know what a data frame is, don't worry. Just
consider them to be lists. Data frames are described
in a book on "Statistical Modelling in S".)
(bates@stat.wisc.edu)
[7/Feb/90] (19851 bytes)
- papir
- This file contains two multivariate regression data sets from paper
industry, described in Aldrin, M. (1996), "Moderate projection pursuit
regression for multivariate response data", Computational Statistics and
Data Analysis, 21, 501-531. Submitted by Magne Aldrin
(magne.aldrin@nr.no) [14/Apr/99] (17916 bytes)
- pbc
- The data set found in appendix D of Fleming and
Harrington, Counting Processes and Survival Analysis, Wiley, 1991.
Submitted by therneau@Mayo.EDU (Terry Therneau),
[25/Jul/94] (36 kbytes)
- pbcseq
- A follow-up to the PBC data set, this contains the
data for both the baseline and subsequent visits at 6 months, 1 year, and
annually thereafter. There are 1945 observations on 312
subjects. Submitted by therneau@Mayo.EDU (Terry Therneau),
[10/Jun/99] (160kbytes)
- places
- Data taken from the Places Rated Almanac, giving the
ratings on 9 composite variables of 329 locations.
(From an ASA data exposition, 1986)
The data are in one file:
- places.data
- A shar archive of three files
which document the data, present the data
itself, and provide a key to the actual places
used. (27720 bytes)
- Plasma_Retinol
- This datafile (N=315) investigates the relationship between
personal characteristics and dietary factors, and plasma
concentrations of retinol, beta-carotene and other carotenoids.
Analysis unpublished. Related paper: Nierenberg DW, Stukel TA,
Baron JA, Dain BJ, Greenberg ER. Determinants of plasma levels of
beta-carotene and retinol. American Journal of Epidemiology
1989;130:511-521.(Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format)
[21/Jul/98][28/Nov/01] (26 kbytes)
- PM10
- The data are a subsample of 500 observations from a data set that
originates in a study where air pollution at a road is related to
traffic volume and meteorological variables, collected by the
Norwegian Public Roads Administration. The response variable
(column 1) consists of hourly values of the logarithm of the
concentration of PM10 (particles), measured at Alnabru in Oslo,
Norway, between October 2001 and August 2003. The predictor
variables (columns 2 to 8) are the logarithm of the number of cars
per hour, temperature 2 meters above ground (degree C), wind
speed (meters/second), the temperature difference between 25 and
2 meters above ground (degree C), wind direction (degrees
between 0 and 360), hour of day and day number from October
1, 2001. Submitted by Magne Aldrin (magne.aldrin@nr.no). [28/Jul/04] (19kbytes)
- pollen
- Synthetic dataset about the geometric features of
pollen grains. There are 3848 observations on 5
variables. From the 1986 ASA Data Exposition dataset, made
up by David Coleman of RCA Labs.
The data are in one file:
- pollen.data
- A shar archive of 9 files. The first
file gives a short description of the data,
then there are 8 data files, each with 481
observations. (205954 bytes)
- pollen.extra
- Some extra comments about the data. Look here for hints.
- pollution
- This is the pollution data so loved by writers of
papers on ridge regression. Source: McDonald, G.C. and
Schwing, R.C. (1973) 'Instabilities of regression
estimates relating air pollution to mortality',
Technometrics, vol.15, 463-482. (8540 bytes)
- profb
- Scores and point spreads for all NFL games in the
1989-91 seasons. Contributed by Robin Lock
(rlock@stlawu.bitnet) [15/Sept/92] (27733 bytes)
- prnn
- This shar archive contains the datasets used in `Pattern
Recognition and
Neural Networks' by B.D. Ripley, Cambridge University Press (1996),
ISBN 0 521 46086 7 (ripley@stats.ox.ac.uk) [1/Dec/95] (101 kbytes)
- rabe
- This file contains data from Regression Analysis By Example,
2nd Edition, by Samprit Chatterjee and Bertram Price,
John Wiley, 1991. (schatter@stern.nyu.edu)
[6/Feb/92] (40309 bytes)
- rir
- This file contains data from Residuals and Influence
in Regression, R. Dennis Cook and Sanford Weisberg,
Chapman and Hall, 1982. (sandy@umnstat.stat.umn.edu)
(5206 bytes). [Updated 25/May/93]
- riverflow
- Datasets mentioned in "Parsimony, Model Adequacy and
Periodic Correlation in Time Series Forecasting", ISI
Review, A.I. McLeod (1992, to appear). Submitted by
A.Ian McLeod (aim@stats.uwo.ca). Time series data. A
shar archive. [22/Jan/92] (294052 bytes).
- rmftsa
- Data Sets for
"Regression Models for Time Series Analysis" by B. Kedem and
K. Fokianos, Wiley 2002. Submitted by Kostas Fokianos
(fokianos@ucy.ac.cy) [8/Nov/02] (176k)
- sapa
- time series used in "Spectral Analysis for Physical
Applications" by D. B. Percival and A. T. Walden,
Cambridge University Press, 1993.
(dbp@apl.washington.edu) [4/Nov/92](50788 bytes)
- saubts
- Two ocean wave time series used in "Spectral Analysis
of Univariate and Bivariate Time Series" by D. B.
Percival, Chapter 11 of "Statistical Methods for
Physical Science," edited by J. L. Stanford and S. B.
Vardeman, Academic Press, 1993. (dbp@apl.washington.edu)
[14/Apr/93] (47 kbytes)
- schizo
- Schizophrenic Eye-Tracking Data in Rubin and Wu (1997)
Biometrics.
Yingnian Wu (wu@hustat.harvard.edu) [14/Oct/97] (21k)
- sensory
- Data for the sensory evaluation experiment in Brien,
C.J. and Payne, R.W. (1996) Tiers, structure formulae and the
analysis of complicated experiments. Submitted for publication.
Chris Brien (matcjb@ntx.city.unisa.edu.au) [22/Oct/96] (19k)
- ships
- Ship damage data, from "Generalized Linear Models" by
McCullagh and Nelder, section 6.3.2, page 137.
(therneau@mayo.edu) (1709 bytes)
- sleuth
- Contains 110 data sets from the book "The Statistical
Sleuth" by Fred Ramsey and Dan Schafer; Duxbury Press,
1997. (schafer@stat.orst.edu) [14/Oct/97] (172k)
- sleep
- Data from which conclusions were drawn in the
article "Sleep in Mammals: Ecological and Constitutional Correlates"
by Allison, T. and Cicchetti, D. (1976), _Science_, November 12,
vol. 194, pp. 732-734. Includes brain and body weight, life
span, gestation time, time sleeping, and predation and danger
indices for 62 mammals. Submitted by Roger Johnson
(rwjohnso@silver.sdsmt.edu) [27/Jul/94] (8k)
- smoothmeth
- A collection of the data sets used in the book
"Smoothing Methods in Statistics," by Jeffrey S. Simonoff,
Springer-Verlag, New York, 1996. Submitted by Jeff Simonoff
(jsimonoff@stern.nyu.edu).
[13/Mar/96] (242kbytes)
- socmob
- Social Mobility (US, 1973). Two four-way 17x17x2x2
contingency tables: Father's occupation, Son's occupation
(first and current), family structure, race.
Submitted by Timothy J. Biblarz (biblarz@uscvm.bitnet).
[corrected 25/Jan/93]
- space_ga
- Election data including spatial coordinates on 3,107 US
counties. Used in Pace and Barry (1997), Geographical Analysis,
Volume 29, 1997, p. 232-247. Submitted by Kelley Pace
(kpace@unix1.sncc.lsu.edu). [3/Nov/99] (548 kbytes)
- spdc2693
- Standard and Poor's 500 Index closing values from 1926 to 1993.
See also djdc0093.
Submitted by Eduardo Ley (edley@eco.uc3m.es)
[13/Mar/96] (333 kbytes)
- stanford
- Two versions of the Stanford Heart Transplant Data, one from "The
Statistical Analysis of Failure Time Data" by
Kalbfleisch and Prentice, Appendix I, pages 230-232,
the other from the original paper by Crowley and Hu.
(therneau@mayo.edu) (15003 bytes) [Corrected, 8/Mar/93]
- stanford.diff
- The differences between the two Stanford data sets.
- strikes
- Data on industrial disputes and their covariates in 18 OECD
countries, 1951-1985. Prepared by Bruce Western
(western@datacomm.iue.it) [2/Oct/95] (44k)
- tecator
- The task is to predict the fat content of a meat sample
on the basis of its near infrared absorbance spectrum.
Regression.
Submitted by thodberg@nn.meatre.dk (Hans Henrik Thodberg)
[23/Jan/95] (302 kbytes)
- transplant
- Data on deaths within 30 days of heart transplant
surgery at 131 U.S. hospitals. See Bayesian Biostatistics, D.
Berry & D. Stangl, eds, 1996, Marcel Dekker.
Cindy L. Christiansen and Carl N. Morris
Cindy Christiansen
[22/Oct/96] (3k)
- tsa
- Software and Data Sets for
"Time Series Analysis and Its Applications" by
R.H. Shumway & D.S. Stoffer, Springer, 2000. Submitted by David
Stoffer (stoffer@stat.pitt.edu)[10/Mar/00]
- tumor
- Tumor Recurrence data for patients with Bladder cancer
Taken from Wei, Lin and Weissfeld, JASA 1989, p 1067.
From: therneau@mayo.edu (Terry Therneau) [23/Mar/93]
[5/Jun/96] (3k)
- veteran
- Veteran's Administration Lung Cancer Trial, taken from
Kalbfleisch and Prentice, pages 223-224
(therneau@mayo.edu) (8249 bytes)
- visualizing.data
- This zip file contains 25 data sets from the book
Visualizing Data published by Hobart Press
(books@hobart.com) and written by William S. Cleveland
(wsc@research.att.com). There is also a README file so
there are 26 files in all. Each of the 25 files has
the data in an ascii comma separated format. The name of each
data file is the name of the data set used in the
book. To find the description of the data set in the
book look under the entry "data, name" in the index.
For example, one data set is barley. To find the
description of barley, look in the index under the
entry "data, barley". The S archive of Statlib has a
file created by S that contains the data sets in a
format that makes it easy to read them into S.
[12/Nov/93][17/Oct/94][23/Oct/97]
- wind
- daily average wind speeds for 1961-1978 at 12 synoptic
meteorological stations in the Republic of Ireland
(Haslett and Raftery, Applied Statistics 1989).
There is a LARGE amount of data. Please be sure you
want it before you ask for it!!
There are two entries to obtain.
- wind.desc
- A short description of the data (815 bytes)
- wind.data
- The data (532494 bytes).
- wind.correlations
- Estimated correlations between daily 3 p.m.
wind measurements during September and October 1997 for
a network of 45 stations in the Sydney region. From
Nott and Dunsmuir, ``Analysis of Spatial Covariance Structure
from Monitoring Data,'' Technical Report, Department of
Statistics, University of New South Wales. Submitted
by David Nott (djn@maths.unsw.edu.au). [8/Mar/00] (13 kbytes)
- witmer
- A shar archive of data from the book
Data Analysis: An Introduction (1992), Prentice Hall, by Jeff Witmer.
Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu)
[28/Jun/94] (29 kbytes)
- wseries
- These data tell whether or not the home team won
for each game played in all World Series prior to 1994.
The data appear as the STATS Challenge for Issue 11.
Submitted by Jeff Witmer (fwitmer@ocvaxa.cc.oberlin.edu)
[20/Mar/94] (3 kbytes)
- Vinnie.Johnson
- Data on the shooting of Vinnie Johnson of the Detroit
Pistons during the 1985-1986 through 1988-1989 seasons. Source was the
New York Times.
Submitted by Rob Kass (kass@stat.cmu.edu)
[18/Aug/95] (26 kbytes)
- submissions
- Information on how to submit data to this archive.
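The DJ 1985-2003 entry above lists its record layout as ID; date; open; high; low; close; volume; adjclose; ticker. As a minimal sketch of how such records might be read, the Python below maps that field order onto a small record type. The delimiter, the date format, the presence of a header row, and the sample rows are all assumptions made for illustration; none of them is documented in the entry, and the names `Quote` and `parse_quotes` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass

# Field order as listed in the DJ 1985-2003 entry.
FIELDS = ["id", "date", "open", "high", "low", "close",
          "volume", "adjclose", "ticker"]


@dataclass
class Quote:
    id: str
    date: str       # kept as text: the entry does not document the date format
    open: float
    high: float
    low: float
    close: float
    volume: int
    adjclose: float
    ticker: str


def parse_quotes(text, delimiter=","):
    """Parse delimited rows into Quote records.

    The delimiter is an assumption; pass ";" if the file uses semicolons.
    """
    quotes = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        rec = dict(zip(FIELDS, (cell.strip() for cell in row)))
        try:
            quotes.append(Quote(
                id=rec["id"], date=rec["date"],
                open=float(rec["open"]), high=float(rec["high"]),
                low=float(rec["low"]), close=float(rec["close"]),
                volume=int(rec["volume"]), adjclose=float(rec["adjclose"]),
                ticker=rec["ticker"],
            ))
        except ValueError:
            continue  # e.g. a header row whose cells are not numeric
    return quotes


# Made-up sample rows for illustration only; not taken from the archive.
SAMPLE = """1,2003-10-29,10.0,10.5,9.8,10.2,1000,10.1,AAA
2,2003-10-30,10.2,10.9,10.1,10.8,1500,10.7,AAA
"""

quotes = parse_quotes(SAMPLE)
```

Rows that do not have exactly nine fields, or whose numeric fields fail to convert, are skipped rather than raising an error, so a header line (if the real file has one) would be dropped silently.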
Other Sources
For WWW (Mosaic, Netscape, etc.) users, here is a set of links to other
sources of Data. These sources are not kept on StatLib, and we have no
control over them. If you find a link is consistently not working, let us
know.
- Time
Series Data Library
- Rob Hyndman's collection of over 500 time series organized by subject.
- EconData
- Several hundred thousand economic time series, produced by
the U.S. Government and distributed by the government in a variety of
formats and media, have been put into a standard, highly efficient, easy-to-
use form for personal computers.
- Oceanographic & Earth Science Data
- From Scripps Institution of Oceanography Library
- The Data Zoo
- California coastal data collection
- Journal of Statistics Education Information Service
- Also has some data
Credit where credit is due
If you use an algorithm, dataset, or other information from StatLib,
please acknowledge both StatLib and the original contributor of the
material.
Last modified: Tue Jul 19 09:03:38 EDT 2005
By
Pantelis Vlachos