Don’t Alienate Open Source Contributors

A few years ago, I wrote an R package that wrapped some of the text mining and supervised learning packages on CRAN with APIs to simplify the process of document classification. I have applied a few small patches since then, but the package has held up surprisingly well over the years with minimal maintenance.

On New Year’s Day 2014, I began receiving emails from users reporting that the package was no longer installable via CRAN. According to CRAN, the package had been “archived on 2014-01-01 as import conflicts were not corrected.” I downloaded the latest version of R and ran R CMD check; sure enough, R 3.0 had added more stringent checks and the package was no longer passing. I applied the necessary patches and submitted the package to CRAN for approval. Within a few hours, I received the following response from one of the CRAN maintainers:

“You were asked to correct this in 2013, failed to even reply, and it has been archived. Please explain your inaction and lack of the most basic manners.”

I checked my university email, which I had not looked at since taking leave more than a year earlier, and found an automated email sent to 22 package maintainers asking them to fix import issues that were newly enforced in R 3.0. Perhaps I overlooked the email, but I would not have replied to an automated message anyway, especially one that clearly states:

“Do NOT reply to this email to submit an update!”

I eventually got in touch with a CRAN maintainer who approved my patch without hesitation, but the experience brought to mind a few lessons:

  1. Don’t alienate volunteers — everyone in the R community is a volunteer, and it doesn’t benefit the community when you’re unnecessarily rude.
  2. Understand volunteers have other commitments — while the core R team is doing an excellent job building a statistical computing platform, not everyone can make the same commitment to an open-source project.
  3. Open source has limited resources — every contribution helps.
  4. Be patient — not everyone can operate on the same level, and new members will need to be brought up to speed on best practices.

With that in mind, I recently submitted a tiny syntax correction to GraphLab and promptly got this message from the repository owner:

“Just merged your request. We are always excited getting fresh reinforcement to our open source project! Especially nowadays that amount of contribution is significantly rising.”

That’s open source done right.

Photo: Sergey Galyonkin/Flickr

This Is How the New Hampshire Primary Unfolded on Twitter

Bernie Sanders Wins New Hampshire Primary

Bernie Sanders and Donald Trump finally had their day.

After maintaining sizeable advantages in the New Hampshire polls since January, both candidates are showing that they are forces to be reckoned with in the 2016 presidential election. Sanders took 60% of the Democratic vote, a 21-point lead over Hillary Clinton, while Trump blew the Republican field away with 35.1% of the Republican vote. John Kasich was the runner-up, with 15.9% of the vote.

As an experiment, I collected tweets mentioning the candidates in real time starting at 9am PST, and used machine learning models to identify the sentiment (i.e., positive or negative) expressed in each tweet. I then plotted the average hourly sentiment for each candidate over a 12-hour period.

Sentiment of tweets mentioning each candidate. Higher values indicate more positive sentiment, and lower values indicate more negative sentiment.
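For readers who want to reproduce the plot, the aggregation step looks roughly like the sketch below. This is a minimal sketch, not the original analysis code: it assumes a hypothetical tweets data frame into which tweets have already been collected and scored, and the collection and classification steps are not shown.

# A minimal sketch, assuming a hypothetical `tweets` data frame with
# columns: timestamp (POSIXct), candidate (character), and a
# model-assigned sentiment score (numeric).
library(ggplot2)

# Bucket each tweet into the hour it was posted (PST).
tweets$hour <- as.POSIXct(format(tweets$timestamp, "%Y-%m-%d %H:00:00",
                                 tz = "America/Los_Angeles"),
                          tz = "America/Los_Angeles")

# Average sentiment per candidate per hour.
hourly <- aggregate(sentiment ~ candidate + hour, data = tweets, FUN = mean)

# One line per candidate; higher values indicate more positive sentiment.
ggplot(hourly, aes(x = hour, y = sentiment, color = candidate)) +
  geom_line() +
  labs(x = "Hour (PST)", y = "Average sentiment")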

What are the results?

  • Clinton and Rubio garnered support early in the primary, but that support petered out as rumors of Sanders and Trump victories surfaced in the mainstream media.
  • Bernie Sanders is enjoying a groundswell of support on Twitter, especially after the victory. Not only did Sanders maintain a large lead over Clinton in the primary — he also maintained a large lead in sentiment on Twitter.
  • Donald Trump remains a polarizing figure in this election, yet sentiment toward him balances out at the higher end of the spectrum.
  • Nobody likes Jeb Bush.

Interestingly, despite Donald Trump winning in the Republican field, folks prefer John Kasich on Twitter. In fact, they love him — almost as much as Sanders.

Why does sentiment matter?

Candidates like Sanders have been painted as black sheep in the mainstream media despite a massive following on social hubs like Twitter and Reddit. This is due in large part to the echo chamber effect — the perception that a small online minority is controlling the narrative for Sanders, and that these supporters are not representative of the electorate. Therefore, the fundamental question being asked by pundits has been: Can Bernie win in 2016?

Establishing a correlation between support in online communities and success in the primaries is key to developing predictive models of the 2016 election. This preliminary analysis is a first step toward establishing that correlation.

Stay tuned for coverage of the Nevada caucuses and South Carolina primary on February 20th, with a more in-depth analysis of the underlying content of the tweets.

Photo: Gage Skidmore/Flickr

Tech’s Untapped Talent Pool

When people hear that I left a Ph.D. program in Political Science for a gig as a machine learning engineer, the most common reaction is confused silence, followed by: “What?”

It’s no secret that we’re facing a shortage of tech talent in Silicon Valley. As the industry makes efforts to reform antiquated immigration policies and invest in campaigns that encourage everyone to learn how to code, we should also look toward less traditional sources of talent.

The social sciences are undergoing a renaissance of sorts. Funding for traditional social science research is drying up, and researchers are turning to interdisciplinary projects for grants. Among top Ph.D. programs, computational statistics is a required part of the curriculum, managing databases and constructing SQL queries are standard practice, and writing data processing tools in high-level languages is common. It is not unusual to find researchers with backgrounds in social network analysis, complexity science, and machine learning. Taken together, these skills make social scientists prime candidates for roles as data scientists and analysts.

And yet, they’re not getting jobs. In fact, 30% of graduates from Ph.D. programs in the social sciences are unable to find a job upon graduating. In my experience, the lack of job prospects boils down to two problems: social science graduates don’t know how to market themselves in industry circles, and recruiters readily exclude applicants with social science degrees from technical positions. Changing the status quo will require attitudinal changes on both sides.

Traditionally, social science graduate students have transitioned into academic positions after graduation. With their growing skill set, they should consider markets outside of academia and tailor their CVs to emphasize the quantitative skills they acquired during graduate school. Their value goes beyond quantitative skills, however: social scientists are trained to design research experiments and interpret statistical analyses in environments with potentially hundreds or thousands of causal factors. This puts them in a prime position to distill complex systems into parsimonious models, a skill that is highly valuable in industry.

On the other hand, recruiters should start considering social scientists for technical roles. Granted, not all graduates will have a background in engineering, but many have sufficient technical proficiency to fill data analyst and data scientist roles. Furthermore, we need to dispel the notion that a student who did research in a subfield of political science or communication could not have employed sophisticated statistical methodology. In fact, social science research often yields a unique and invaluable blend of quantitative and qualitative skills that the tech industry should embrace.

One thing is certain – Silicon Valley has a knack for identifying industries ripe for transformation. Perhaps it’s time to start applying the same principles to talent.

Photo: Tim Lucas/Flickr

A Simple Solution to Searching FISA Court Public Filings

Since Edward Snowden’s leaks about U.S. government surveillance programs, public attention has focused on the courts that grant the authority to monitor the communications of American citizens. Eric Mill’s recent post on Hacker News highlighted the obstacles involved in obtaining any sort of information from the Foreign Intelligence Surveillance Court. Specifically, the official public docket of the court consists of scanned, unsearchable documents thrown together on one page. As Mill describes it:

“The Court invented their public docket on the fly back in June, and I’m very glad it exists, but from a public records standpoint, it’s a mess. The Court publishes scanned image PDFs that are impossible to search through electronically. Some PDFs show up in multiple dockets, their publication times are only discernible from a clerk’s physical stamp, and the links are unpredictable. It’s not even clear whether the links or the site itself are permanent.”

I began thinking about the simplest way to at least make these documents searchable, and considered creating a custom web application to OCR and index the docket. However, a much faster, cleaner solution presented itself — Google Custom Search. Below is a custom search engine that will search all PDFs in the docket that are indexed by Google.

Search FISA Court Public Docket:

Feel free to embed the custom search engine wherever you see fit:

<script>
  // Load Google's Custom Search Engine script for this engine ID (cx),
  // matching the page's protocol (http or https).
  (function() {
    var cx = '015317449821808293020:d07qbdm2huu';
    var gcse = document.createElement('script');
    gcse.type = 'text/javascript';
    gcse.async = true;
    gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
      '//www.google.com/cse/cse.js?cx=' + cx;
    // Insert the loader before the first existing script tag on the page.
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(gcse, s);
  })();
</script>
<gcse:search></gcse:search>

Classifying Breast Cancer as Benign or Malignant Using RTextTools

Two years ago, Dr. Amber Boydstun approached me to develop an R package for semi-automated topic classification. The goal of the RTextTools project, a collaborative effort of researchers from the University of California, Davis; the University of Washington; Sciences Po Paris; and Vrije Universiteit Amsterdam, was to simplify the supervised learning process and make machine learning more accessible to political scientists. The package has gained considerable traction among social scientists, and we’ve seen applications in the natural sciences as well, including the classification of protein sequences, stars, and tumors.

One such application is the classification of breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics including clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When trained on the data, the classifiers were able to achieve up to 96% recall accuracy on a randomly sampled training set of 200 patients and test set of 400 patients, nearly matching the best predictive accuracy documented by Wolberg et al. When filtering for 6-algorithm agreement, the ensemble outperforms the best predictive accuracy at 98% while still providing 95% coverage of the test sample.

The source code is available below, and the dataset is automatically downloaded from UC Irvine’s servers.
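In case the embed does not render, the sketch below reconstructs the general shape of the workflow. It is a minimal reconstruction, not the original script: the exact UCI file, column names, and random seed are assumptions, as is passing a plain numeric feature matrix to create_container().

# A minimal sketch of the workflow, not the original script.
library(RTextTools)

# Download the dataset from UC Irvine ("?" marks missing values);
# the file name is an assumption based on the characteristics above.
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
data <- read.csv(url, header = FALSE, na.strings = "?")
colnames(data) <- c("id", "clump_thickness", "cell_size", "cell_shape",
                    "adhesion", "epithelial_size", "bare_nuclei",
                    "chromatin", "nucleoli", "mitoses", "class")
data <- na.omit(data)

# Shuffle so the 200-patient training set and 400-patient test set
# are random samples (seed is illustrative).
set.seed(95616)
data <- data[sample(nrow(data)), ]

# Build a container from the nine characteristics;
# class 2 = benign, 4 = malignant.
container <- create_container(as.matrix(data[, 2:10]), data$class,
                              trainSize = 1:200, testSize = 201:600,
                              virgin = FALSE)

# Train the eight classifiers and label the held-out patients.
algorithms <- c("SVM", "SLDA", "BOOSTING", "BAGGING",
                "RF", "GLMNET", "TREE", "MAXENT")
models <- train_models(container, algorithms = algorithms)
results <- classify_models(container, models)

# Per-algorithm recall plus ensemble agreement: coverage and recall
# at each level of n-algorithm agreement (e.g. 6 of 8).
analytics <- create_analytics(container, results)
summary(analytics)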
