Tyler Field edited this page Aug 12, 2017 · 24 revisions

Data Science Tools

NOTE: If you would like to suggest a tool to be added to this page, please send a DM to Rocio on Slack with a link to the tool and a short description.

  • Please limit suggestions to tools you have had some experience using

R

IDEs

  • RStudio Just don't use R without this. Just don't.

Packages

  • Hadley Wickham Anything that this guy has made. Over the past 10 years, he's made a bunch of tools that have made R a much less clunky language.
    • ggplot2 The best plotting.
    • ggvis An upcoming alternative to ggplot2; offers some nice features at the moment for web displays (including interactivity).
    • dplyr An incredibly useful data.frame manipulation package. Supports aggregation, grouping, and joins, and even lets you lazily evaluate manipulations on connections to SQL databases (or BigQuery!).
    • tidyr For making your data tidy. The successor to reshape2.
    • httr Simple manipulation of HTTP.
    • rvest Simple web scraping.
  • bigrquery A decent interface to BigQuery.
  • magrittr Understand this as soon as possible. It will make your life much easier.
  • pipeR A competing version of the magrittr package. Do the tutorial.
  • rlist Like dplyr but for lists.
  • data.table Offers an alternative to data.frames, is very fast and incorporates some of the features of dplyr in its DF manipulation syntax. Do the tutorial.
  • purrr Functional programming additions for R. Lets you do a lot of useful function composition/application easily.
  • sparkTable Makes Tufte-style spark-* charts or tables. Compatible with shiny.
  • shinyjs Great for incorporating interactive JavaScript into Shiny apps and R Markdown documents via R code.
  • caret Functional and easy package for prototyping and comparing different machine learning models. Streamlines pre-processing, cross-validation, hyper-parameter tuning, etc. with minimal code.

Python

IDEs

  • Jupyter notebooks Interactive workbooks for data analysis in Python
  • Rodeo Promising IDE similar to RStudio for data analysis in Python. Still a fairly new project so may be buggy
  • PyCharm Python IDE with integrated terminal and neat features such as smart autocomplete and SQL database interfaces

Packages

  • pandas Data wrangling/manipulation
  • numpy and scipy Data analysis and statistics tools
  • matplotlib Most commonly used library for data visualization and plotting

  • seaborn For creating 'prettier' data visualizations

  • scikit-learn Commonly used machine learning library

  • psycopg PostgreSQL adapter for Python. Easy to use and reliable

  • nltk Extensive library for natural language processing (NLP). Note that corpora are not bundled with the library; download the ones you need separately using nltk.download()

  • itertools Extremely useful standard-library module for fast, memory-efficient looping in Python. Not the easiest for beginners, but read this and give it a shot
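
To give a feel for what itertools offers, here is a minimal sketch using three of its functions (groupby, chain, islice). The records and neighborhood names are made up for illustration.

```python
from itertools import chain, groupby, islice

# Hypothetical sample data: (neighborhood, request_count) pairs.
records = [
    ("Mission", 3),
    ("Sunset", 2),
    ("Mission", 5),
    ("Richmond", 1),
    ("Sunset", 7),
]

# groupby only groups *consecutive* items that share a key,
# so the input is sorted by that key first.
totals = {
    name: sum(count for _, count in group)
    for name, group in groupby(sorted(records), key=lambda r: r[0])
}

# chain walks several iterables in sequence without building a combined list.
combined = chain(range(3), range(10, 13))

# islice takes the first n items from any iterator, lazily.
first_four = list(islice(combined, 4))

print(totals)
print(first_four)
```

The same results could be produced with plain loops; the appeal of itertools is that each step stays lazy and avoids intermediate lists.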

Spark

APIs and Misc Resources

  • GeoNames REST API that returns ZIP Codes, and other properties, from Lat/Lon points
  • Data Science Toolkit Variety of tools including Street Address to Coordinates, Coordinates to Political Areas, Coordinates to Statistics, and several Text parsing/sentiment tools.
  • FCC Census Block Conversions API for getting the census tract from a Lat/Lon
  • SoQL Socrata, which hosts SF Open Data, has API access for every data set and a variety of SQL-like functions that make queries powerful
  • CitySDK The Census Bureau has a SDK package and API. You'll have to sign up for a key, but it's free.
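
As a rough sketch of what a SoQL query looks like, the snippet below builds a request URL using the `$select`, `$where`, `$group`, and `$limit` query parameters. The dataset ID `abcd-1234` and the column names `category` and `date` are placeholders, not a real data set; `build_soql_url` is a hypothetical helper written for this example. No network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_soql_url(domain, dataset_id, select, where=None, group=None, limit=None):
    """Build a Socrata resource URL with SoQL query parameters."""
    params = {"$select": select}
    if where is not None:
        params["$where"] = where
    if group is not None:
        params["$group"] = group
    if limit is not None:
        params["$limit"] = str(limit)
    # safe="$" keeps the leading "$" of each parameter name readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


# Placeholder dataset ID and column names; substitute real ones from the portal.
url = build_soql_url(
    "data.sfgov.org",
    "abcd-1234",
    select="category, count(*) AS n",
    where="date > '2017-01-01'",
    group="category",
    limit=10,
)

# Round-trip the query string to confirm the parameters survive encoding.
parsed = parse_qs(urlsplit(url).query)
print(url)
```

The resulting URL can be passed to any HTTP client (or to pandas) to fetch the JSON response.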

R vs. Python - Which to use?

  • You may have come across the so-called war between the two languages and the endless forum threads, blog posts, and articles debating their merits. Most people who argue that one is "better" than the other usually have far more experience in the language they are rooting for. Read this blog post by Wes McKinney (creator of pandas and avid Python developer) on the topic.

    • This KDnuggets article provides a nice overview of the pros and cons of the two languages.
    • In short there is no clear answer and it depends on your level and career goals.
    • Personally, I code in both and use the language I think is best/fastest for the job at hand. This has worked well for me, since I can draw on the strengths of each language.
    • If you are new to coding, I recommend learning Python first: its syntax is generally more readable than R's and it is a general-purpose scripting language, which makes it easier to learn.