Tech’s Untapped Talent Pool

When people hear that I left a Ph.D. program in Political Science for a gig as a machine learning engineer, the most common reaction is confused silence followed by: “what?”

It’s no secret that we’re facing a shortage of tech talent in Silicon Valley. As the industry makes efforts to reform antiquated immigration policies and invest in campaigns that encourage everyone to learn how to code, we should also look toward less traditional sources of talent.

The social sciences are undergoing a renaissance of sorts. Funding for traditional social science research is drying up, and researchers are turning to interdisciplinary projects for grants. Among top Ph.D. programs, computational statistics is required, managing databases and constructing SQL queries is standard practice, and writing data processing tools in high-level languages is common. It is not unusual to find researchers with backgrounds in social network analysis, complexity sciences, and machine learning. Taken together, these skills make social scientists prime candidates for roles as data scientists and analysts.

And yet, they’re not getting jobs. In fact, 30% of graduates from Ph.D. programs in the social sciences are unable to find a job upon graduating. From my experience, the lack of job prospects boils down to two problems: social science graduates don’t know how to market themselves in industry circles, and recruiters readily exclude applicants with a degree in the social sciences from technical positions. Changing the status quo will require attitudinal changes from both sides.

Traditionally, social science graduate students have transitioned into positions in academia post-graduation. With their growing skill set, they should consider markets outside of academia, and tailor their CVs to emphasize the quantitative skills they acquired during their graduate education. However, their value goes beyond just quantitative skills; social scientists are trained to design research experiments and interpret statistical analyses in environments where there are potentially hundreds or thousands of causal factors. This puts them in a prime position to distill complex systems into parsimonious models, a skill that is highly valuable in the industry.

On the other hand, recruiters should start giving social scientists a chance at technical roles. Granted, not all graduates will have a background in engineering, but they often have sufficient technical proficiency to fill data analyst and data scientist roles. Furthermore, we need to dispel the notion that a student who did research in a subfield of political science or communication did not employ sophisticated statistical methodology. In fact, social science research often yields a unique and invaluable blend of quantitative and qualitative skills that the tech industry should embrace.

One thing is certain – Silicon Valley has a knack for identifying industries ripe for transformation. Perhaps it’s time to start applying the same principles to talent.



A Simple Solution to Searching FISA Court Public Filings

Since Edward Snowden’s leaks about U.S. government surveillance programs, public attention has focused on the courts that grant the authority to monitor the communications of American citizens. Eric Mill’s recent post on Hacker News highlighted the obstacles involved in obtaining any sort of information from the Foreign Intelligence Surveillance Court. Specifically, the official public docket of the court consists of scanned, unsearchable documents thrown together on one page.

The Court invented their public docket on the fly back in June, and I’m very glad it exists, but from a public records standpoint, it’s a mess. The Court publishes scanned image PDFs that are impossible to search through electronically. Some PDFs show up in multiple dockets, their publication times are only discernible from a clerk’s physical stamp, and the links are unpredictable. It’s not even clear whether the links or the site itself are permanent.

I began thinking about the simplest way to at least make these documents searchable, and considered creating a custom web application to OCR and index the docket. However, a much faster, cleaner solution presented itself — Google Custom Search. Below is a custom search engine that will search all PDFs in the docket that are indexed by Google.

Search FISA Court Public Docket:

Feel free to embed the custom search engine wherever you see fit:

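(function() {
  // ID of the custom search engine covering the FISA Court docket
  var cx = '015317449821808293020:d07qbdm2huu';
  // Load the Google Custom Search script asynchronously
  var gcse = document.createElement('script');
  gcse.type = 'text/javascript';
  gcse.async = true;
  gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
      '//www.google.com/cse/cse.js?cx=' + cx;
  // Insert it ahead of the first script tag on the page
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(gcse, s);
})();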


Classifying Breast Cancer as Benign or Malignant Using RTextTools

Two years ago, Dr. Amber Boydstun approached me to develop an R package for semi-automated topic classification. The goal of the RTextTools project, a collaborative effort of researchers from University of California, Davis, University of Washington, Sciences Po Paris, and Vrije Universiteit Amsterdam, was to simplify the supervised learning process and make machine learning more accessible to political scientists. The package has gained considerable traction among social scientists, and we’ve seen applications in the natural sciences as well, including the classification of protein sequences, stars, and tumors.

One such application is the classification of breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics including clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When trained on the data, the classifiers were able to achieve up to 96% recall accuracy on a randomly sampled training set of 200 patients and test set of 400 patients, nearly matching the best predictive accuracy documented by Wolberg et al. When filtering for 6-algorithm agreement, the ensemble outperforms the best predictive accuracy at 98% while still providing 95% coverage of the test sample.
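To make the idea of filtering for n-algorithm agreement concrete, here is a minimal sketch in JavaScript. It is not the RTextTools implementation or the R script used for the results above; the function names (consensus, agreementSummary) and the toy predictions are hypothetical, made up purely for illustration. Each case gets one vote per classifier, and we keep only the cases where at least a given number of classifiers agree on the winning label, then report how much of the sample survives (coverage) and how often the consensus label is right (accuracy).

```javascript
// Hypothetical helper: the most common label among the votes and how many
// classifiers voted for it. Ties go to the label seen first.
function consensus(votes) {
  const counts = new Map();
  for (const v of votes) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  let label = null;
  let agree = 0;
  for (const [v, n] of counts) {
    if (n > agree) {
      label = v;
      agree = n;
    }
  }
  return { label, agree };
}

// Hypothetical helper: keep only cases where at least `minAgree` classifiers
// agree, then compute coverage (share of cases kept) and accuracy on those.
function agreementSummary(predictions, truth, minAgree) {
  let kept = 0;
  let correct = 0;
  predictions.forEach((votes, i) => {
    const { label, agree } = consensus(votes);
    if (agree >= minAgree) {
      kept += 1;
      if (label === truth[i]) correct += 1;
    }
  });
  return {
    coverage: kept / predictions.length,
    accuracy: kept === 0 ? NaN : correct / kept,
  };
}

// Toy data (invented): 4 cases, 8 classifier votes each. B = benign, M = malignant.
const predictions = [
  ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'], // all 8 agree
  ['M', 'M', 'M', 'M', 'M', 'M', 'B', 'B'], // 6 agree
  ['B', 'B', 'B', 'B', 'M', 'M', 'M', 'M'], // 4-4 split, filtered out at 6
  ['M', 'M', 'M', 'M', 'M', 'M', 'M', 'B'], // 7 agree
];
const truth = ['B', 'M', 'M', 'B'];

const summary = agreementSummary(predictions, truth, 6);
console.log(summary);
```

Raising the agreement threshold trades coverage for accuracy: the stricter the filter, the fewer cases the ensemble labels, but the more reliable those labels tend to be. The cases that fall below the threshold are the ones worth sending to a human for review.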

The source code is available below, and the dataset is automatically downloaded from UC Irvine’s servers.
