When you are dealing with large amounts of big data sets it is much more efficient to organizes those in database tables instead of CVSs or other files. Just yesterday I set up R for fetching data from a MySQL DBMS loading a table of stock quotes consisting of more than 300’000 rows into a data frame within seconds. That is pretty cool – and if necessary you can join huge tables in no time benefting of the indexing infrastructure of the DBMS of your choice.
First of all I am using Windows 7 64bit, MySQL 5 32bit and R 32bit for this example – but even if your environment differs from mine the approach will be basically the same. What we need is an ODBC Connector for MySQL. Now here comes a pitfall – the bits have to match those of your R edition not your windows version’s. So let’s download and install the 32bit ODBC connector for MySQL.
I am not sure if it is also of importance whether your MySQL DBMS is 32 or 64bit – anyway, you find out this way:
show variables like 'version_compile_machine';
If it shows you x86 like in my case then you are using 32bit.
The 32bit data source manager in Windows 7 64bit
The next step is to register your MySQL database as a data source in windows. In Windows 7 64bit you have to explicitely start the data source manager for 32bit connectors because the one you will find in your system preferences will only manage 64bit connectors. The 32bit data source manager you are going to find here:
The rest is a cinch – you click “Add..” on the User-DSN sheet, choose the “MySQL ODBC 5.x driver”, fill in the credentials, choose a database and a data source name and you’re done.
The R package RODBC
Now you can go back to R install the RODBC package, load it and connect to your new data source:
# install the RODBC package
# load the package
# connect to you new data source
db <- odbcConnect("data_source_name", uid="root", pwd="password")
# whoe the names of the available tables
# load a_table into a data fram
df <- sqlQuery(db, "select * from a_table")
# close the connection
Information about the package’s functionality you will find here.
Note though that some advanced (My)SQL features won’t be available when connecting through ODBC. In PowerPivot I encountered a situation where I wasn’t able to fire a query including some tricks with variables. But joining, grouping, ordering and all that stuff is of course possible.