Writing on AI, education, and running a business
This is the initial post about the algorithm. See updates [1](http://viksalgorithms.blogspot.com/2012/06/predicting-nba-playoff-games- results.html), [2](http://viksalgorithms.blogspot.com/2012/06/nba-playoff- predictions-update-2-and.html), and [3](http://viksalgorithms.blogspot.com/2012/06/nba-playoff-predictions- update-3-4-2.html) for more. The algorithm is currently 4-2 in the playoffs!
In R, the traditional way to load packages can sometimes lead to situations where several lines of code need to be written just to load packages. These lines can cause errors if the packages are not installed, and can also be hard to maintain, particularly during deployment.
I have posted previously about the open data available on Socrata (https://opendata.socrata.com/), and I was looking at the site again today when I stumbled upon a listing of levels of various radioactive isotopes by US city and state. The data is available at https://opendata.socrata.com/Government/Sorted-RadNet-Laboratory-Analysis /w9fb-tgv6 . You will need to click export, and then download it as a csv.
The foreach package for R is excellent, and allows for code to easily be run in parallel. One problem with foreach is that it creates new RScript instances for each iteration of the loop, which prevents status messages from being logged to the console output. This is particularly frustrating during long- running tasks, when we are often unsure how much longer we need to wait, or even if the code is doing what it is intended to. The solution to this can be found in the sink() function. This function redirects output to a file. I will show you a simple example of this using the iris data set. The code below will execute without printing any status messages, even though do.trace is enabled, which periodically displays the status of the randomForest. The random forest code is slightly adapted from one of the foreach package examples.
LaTeX is a typesetting system that can easily be used to create reports and scientific articles, and has excellent formatting options for displaying code and mathematical formulas. Sweave is a package in base R that can execute R code embedded in LaTeX files and display the output. This can be used to generate reports and quickly fix errors when needed.
Modifying R code to run in parallel can lead to huge performance gains. Although a significant amount of code can easily be run in parallel, there are some learning techniques, such as the Support Vector Machine, that cannot be easily parallelized. However, there is an often overlooked way to speed up these and other models. It involves executing the code that generates predictions and other analytics in parallel, instead of executing the model building phase in parallel, which is sometimes impossible. I will show you how this can be done in this post.
As I was exploring open data sources, I came across USA spending. This site contains information on US government contract awards and other disbursements, such as grants and loans. In this post, we will look at data on contracts awarded in the state of Maryland in the fiscal year 2011, which is available by selecting “Maryland” as the state where the contract was received and awarded here. I will use Maryland as a proxy for the nation, as the data set for the whole nation will be a bit more unwieldy to analyze, and the USA spending site appears to need a significant amount of time to generate the data file for it. We may take a look at the data for the whole nation later on.
Linear regression can be a fast and powerful tool to model complex phenomena. However, it makes several assumptions about your data, and quickly breaks down when these assumptions, such as the assumption that a linear relationship exists between the predictors and the dependent variable, break down. In this post, I will introduce some diagnostics that you can perform to ensure that your regression does not violate these basic assumptions. To begin with, I highly suggest reading this articleon the major assumptions that linear regression is predicated on.
I was searching for open data recently, and stumbled on Socrata. Socrata has a lot of interesting data sets, and while I was browsing around, I found a data set on federal bailout recipients. Here is the data set. However, data sets on Socrata are not always the most recent versions, so I followed a link to the data source at Propublica, where I was able to find a data set that was last updated on January 17, 2012. I downloaded the data in csv format. In the rest of this post, I will perform basic analysis on this data, and show that R can be used to do the same analysis as Excel in a much simpler and more powerful way.