Using R packages "ff" and "filehash" to Handle Big Data
Link to the last RSS article here: Un-modeled confounders: Don’t get burned by Simpson’s Paradox. -- Ed.
By Dr. Rich Herrington, Research and Statistical Support Consultant
This month we take a look at the R packages "ff", "filehash" and "biglm" as fairly straightforward approaches to handling large data sets (>1 GB) in R. The CRAN URLs for these packages can be found at
The default behaviour of R is to create and store all objects (data, R scripts, functions, etc.) within RAM (working memory). On Windows, the function "memory.size()" reports the amount of RAM in use by R at the time it is called. The function "gc()" (garbage collection) frees memory held by objects that are no longer referenced and reports additional information about R memory usage.
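As a small illustration of these memory tools, the sketch below uses only base R. It is not part of the original program code; the object name "x" and its size are arbitrary choices made for the example, and "object.size()" is used in place of the Windows-only "memory.size()" so that the snippet runs on any platform.

```r
# Allocate a numeric vector of one million doubles (8 bytes each).
x <- numeric(1e6)

# object.size() reports the memory used by a single object, in bytes.
size_mb <- as.numeric(object.size(x)) / 1024^2
cat("x occupies roughly", round(size_mb, 1), "MB\n")

# gc() returns a matrix; the "Vcells" row tracks vector storage.
before <- gc()

# Remove the object, then collect garbage so its memory is released.
rm(x)
after <- gc()

# Vector cells in use should drop once x is gone.
freed_cells <- before["Vcells", "used"] - after["Vcells", "used"]
cat("Vector cells released:", freed_cells, "\n")
```

Note that "rm()" alone only removes the name; the memory is returned when the garbage collector next runs, which is why "gc()" is called explicitly here.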
A critical difference between "ff" and "filehash" is how objects are handled when they are referenced: "filehash" loads a referenced object into RAM only when it is actually needed for a calculation, whereas "ff" loads only "pieces of an object" as those pieces are needed (that is, the pieces are paged in and out of memory on demand). Potentially, package "ff" can handle much larger data sets more efficiently without depleting RAM. The goal of package "filehash" is to work with complete individual objects (i.e. variables) one at a time, rather than loading an entire collection of objects (i.e. a dataframe or list) at once. In practice I have found both packages to work fairly well with moderately large data files (approx. 1 GB), with a slight edge favoring package "ff".
The data set that is used to illustrate these two packages is a simulated dataset created by my colleague Jon Starkweather - found at:
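To make the "filehash" behaviour concrete, here is a minimal sketch, assuming the "filehash" package is installed. The database location, the key names "x" and "y", and the simulated values are hypothetical and are not taken from the program code for this article.

```r
library(filehash)

# Hypothetical on-disk database in a temporary directory.
db_path <- file.path(tempdir(), "demo_filehash_db")
dbCreate(db_path)
db <- dbInit(db_path)

# Store two variables on disk under separate keys.
set.seed(1)
dbInsert(db, "x", rnorm(1e5))
dbInsert(db, "y", runif(1e5))

# Only the keys are listed here; the values stay on disk.
print(dbList(db))

# dbFetch() reads one complete object into RAM when it is needed.
x_mean <- mean(dbFetch(db, "x"))

# db2env() exposes the keys as variables that are loaded on access.
env <- db2env(db)
y_max <- with(env, max(y))

# Remove the demonstration database from disk.
dbUnlink(db)
```

Each fetch brings in a whole variable, so memory use is bounded by the largest single object referenced rather than by the size of the full collection.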
The dataset is approximately 890 MB with 62 columns (variables) and 1,023,027 rows. I purposely ran this demonstration on a fairly modest computer setup - an older computer with the following specs: Intel Pentium 4 (1 x CPU) 3.20 GHz; 2.9 GB available RAM; running MS Windows XP Professional SP3.
Additionally, we use a modified "ls()" function script posted at "Stack Overflow":
We have also posted this script at:
This script will need to be placed in the working R directory so that it can be read in at the beginning of the session. This function "lsos()" will provide more readable output concerning object sizes and their memory resources than the stock "ls()" function.
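For readers without access to that script, the idea can be sketched in a few lines of base R. The function below is a hypothetical stand-in written for this illustration, not the "lsos()" script itself; the name "list_object_sizes" and its arguments are assumptions.

```r
# Hypothetical helper: list objects in an environment, largest first.
list_object_sizes <- function(env = globalenv(), n = 10) {
  obj_names <- ls(envir = env)
  if (length(obj_names) == 0) {
    return(data.frame(object = character(0), class = character(0),
                      bytes = numeric(0)))
  }
  # Size in bytes and first class of each object.
  bytes <- vapply(obj_names, function(nm) {
    as.numeric(object.size(get(nm, envir = env)))
  }, numeric(1))
  classes <- vapply(obj_names, function(nm) {
    class(get(nm, envir = env))[1]
  }, character(1))
  out <- data.frame(object = obj_names, class = classes, bytes = bytes,
                    row.names = NULL, stringsAsFactors = FALSE)
  # Sort by size, largest first, and keep the top n rows.
  out <- out[order(out$bytes, decreasing = TRUE), ]
  head(out, n)
}

# Usage on a small demonstration environment.
demo_env <- new.env()
assign("small", 1:10, envir = demo_env)
assign("large", numeric(1e5), envir = demo_env)
sizes <- list_object_sizes(demo_env)
print(sizes)
```

Sorting by size puts the objects most worth removing, or moving to disk, at the top of the listing.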
We first demonstrate the "ff" package:
Click here to see the program code. Note: Comments are preceded by "#"
Next month, we'll look at how "ff" and "filehash" perform with analysis methods that are known to use large amounts of memory, e.g. clustering, tree partitioning, bootstrap model selection, Bayesian model selection, etc.
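The following is a minimal sketch of the kind of workflow "ff" supports, assuming the "ff" package is installed. It is not the program code referred to above. A small simulated CSV file stands in for the large data set, and the column names "id" and "score" and the row counts are hypothetical.

```r
library(ff)

# Hypothetical small CSV standing in for a file too large for RAM.
csv_path <- tempfile(fileext = ".csv")
set.seed(42)
sim <- data.frame(id = 1:5000, score = rnorm(5000))
write.csv(sim, csv_path, row.names = FALSE)

# Read the file in blocks of rows into a disk-backed ffdf object.
big <- read.csv.ffdf(file = csv_path, header = TRUE,
                     first.rows = 1000, next.rows = 2000)

# Process the data chunk by chunk; only one chunk is in RAM at a time.
total <- 0
for (idx in chunk(big)) {
  total <- total + sum(big$score[idx])
}
chunked_mean <- total / nrow(big)
cat("Mean of score computed in chunks:", chunked_mean, "\n")
```

With a file this small "chunk()" may return a single range, but the same loop applies unchanged when the data are far larger than available memory.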