
Data Mining, Web Scraping, and Using APIs

A collection of tools for web scraping, interacting with APIs, and data mining.

Tutorials to get you started
Road tested
Python There are many websites for learning Python, including Mode Analytics and Codecademy.
SQL Creating, accessing, and manipulating relational databases through SQL is standard practice in industry. There are many websites for learning SQL, including Mode Analytics, Codecademy, and Khan Academy.
Mining the Social Web A fantastic resource for data mining the social web. Includes chapters on mining Twitter, Facebook, LinkedIn, Google+, web pages, GitHub, and mailboxes.
Arcas Arcas is a Python tool designed to help with collecting academic articles from various APIs.
Tabula Tabula is a tool for scraping data tables locked inside PDF files.
pandas pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
pyNASA pyNASA provides a simple interface for obtaining NASA datasets and returns them as ready-to-use pandas DataFrames.
Apache Tika The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
PDFtables Accurately convert PDF tables to Excel.
morph.io Over 5,500 public scrapers, with lots of data, available for you to reuse, for free. Download data as a CSV or use the super-simple API. Scrapers can be written in Ruby, PHP, Python, Perl, or Node.js. See Getting Started.
Scrapy An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Kimono Web text scraper - lets you turn websites into APIs in seconds
OpenRefine A powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Paperweight A Python package for hacking LaTeX documents
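
To give a feel for what the scraping tools above automate, here is a minimal sketch that pulls the rows out of an HTML table and writes them as CSV. It uses only the Python standard library rather than any of the listed packages. The `TableScraper` class, the `table_to_csv` helper, and the sample HTML are all hypothetical and exist only for illustration. A real page would normally be fetched over the network and would need more careful handling, for example of nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment, inlined so the example runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>count</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""


def table_to_csv(html):
    """Return (rows, csv_text) for the table cells found in html."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(scraper.rows)
    return scraper.rows, buffer.getvalue()


rows, csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Tools such as Scrapy or pandas handle fetching, malformed markup, and type conversion on top of this basic idea, so a hand-rolled parser like this is mainly useful for small, well-formed inputs.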

Selected Tutorials


News & Resources