Data mining, web scraping, and using APIs
A collection of tools for web scraping, interacting with APIs, and data mining.
|Tutorials to get you
|Python||There are many websites for learning Python.||Mode Analytics
|SQL||Creating, accessing, and manipulating relational databases through SQL is standard practice in industry. There are many websites for learning SQL.||Mode Analytics
|Mining the Social Web||A fantastic resource for data mining the social web. Includes chapters on mining Twitter, Facebook, LinkedIn, Google+, web pages, GitHub, and mailboxes.|
|Arcas||Arcas is a Python tool designed to help with collecting academic articles from various APIs.|
|Tabula||Tabula is a tool for scraping data tables locked inside PDF files.|
|pandas||pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.|
|pyNASA||pyNASA provides a simple interface to obtain NASA datasets and returns them as a pandas dataframe ready to use.|
|Apache Tika||The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.|
|PDFtables||Accurately convert PDF tables to Excel.|
|morph.io||Over 5500 public scrapers, with lots of data, available for you to reuse, for free. Download data as a CSV or use the super-simple API. Scrapers can be written in Ruby, PHP, Python, Perl or Node.js.||Getting Started|
|Scrapy||An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.|
|Kimono||Web text scraper - lets you turn websites into APIs in seconds|
|OpenRefine||A powerful tool for working with messy data, cleaning it; transforming it from one format into another; and extending it with web services and external data.|
|Paperweight||A Python package for hacking LaTeX documents|
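The SQL entry in the table above mentions creating, accessing, and manipulating relational databases. A minimal sketch of those three steps follows, using Python's built-in `sqlite3` module with an in-memory database. The `tools` table and its columns are hypothetical and exist only for illustration; the rows reuse names and descriptions from the table above.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create: a hypothetical table holding a few of the tools listed above.
conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, purpose TEXT)")
conn.executemany(
    "INSERT INTO tools (name, purpose) VALUES (?, ?)",
    [
        ("Scrapy", "web scraping"),
        ("Arcas", "collecting academic articles"),
        ("Tabula", "PDF tables"),
    ],
)

# Manipulate: update a single row, using placeholders rather than
# string formatting so values are escaped safely.
conn.execute(
    "UPDATE tools SET purpose = ? WHERE name = ?",
    ("scraping tables from PDFs", "Tabula"),
)

# Access: select the rows whose purpose mentions scraping.
rows = conn.execute(
    "SELECT name FROM tools WHERE purpose LIKE ? ORDER BY name",
    ("%scraping%",),
).fetchall()
names = [row[0] for row in rows]
print(names)

conn.close()
```

The same statements carry over to other relational databases with little change. Only the connection setup is specific to SQLite.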
Explore and download data from MAST using the MAST API or astroquery library. Tutorial created by Ivelina Momcheva (@iva_momcheva). MAST API and astroquery modules developed by Clara Brasseur (@cebrasseur).
Tutorials from the IPS/NAOJ Data to Dome workshop, held March 2-3, 2017 on the NAOJ campus.
News & Resources
The University of Washington recently launched a new Data Intensive Research in Astrophysics & Cosmology (DIRAC) Institute. The DIRAC Institute is a world-leading, interdisciplinary research centre that addresses fundamental questions about the origins and evolution of our universe.
A great resource for mining text in R. It contains several case studies to work through. One focusses specifically on NASA metadata.
In this tutorial, Jean-Nicholas Hould shares how he scraped the craft beer dataset he published on Kaggle for anyone to enjoy and analyze. The tutorial uses urlopen, BeautifulSoup4, pandas, and re for regular expressions.
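The scraping workflow that tutorial describes (fetch a page, parse the HTML, clean values with regular expressions) can be sketched roughly as below. This is not the tutorial's code. To stay runnable offline and without third-party packages, it swaps in the standard library's `html.parser` for BeautifulSoup4, skips pandas, and parses a hypothetical inline HTML fragment instead of calling `urlopen`. The beer names, the ABV values, and the `CellParser` class are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded listing page.
# In a real scraper this string would come from something like
# urllib.request.urlopen(url).read().decode("utf-8").
HTML = """
<table>
  <tr><td class="name">Example Pale Ale</td><td class="abv">5.2%</td></tr>
  <tr><td class="name">Sample Stout</td><td class="abv">7.0%</td></tr>
</table>
"""


class CellParser(HTMLParser):
    """Collect the text of every <td>, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


parser = CellParser()
parser.feed(HTML)

# Regular expression that pulls the numeric part out of strings like "5.2%".
abv_pattern = re.compile(r"(\d+(?:\.\d+)?)\s*%")

beers = []
for name, abv_text in parser.rows:
    match = abv_pattern.search(abv_text)
    beers.append({
        "name": name.strip(),
        "abv": float(match.group(1)) if match else None,
    })

print(beers)
```

A list of dictionaries like `beers` can be passed directly to `pandas.DataFrame` when pandas is installed, which is the natural next step for analysis.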
A six-week course developed by astronomers at the University of Sydney. The course covers big-data algorithms, querying data with SQL, managing data, and regression and classification techniques.
ThinkToStart is a blog focusing on data science and R.
Jake VanderPlas, Senior Data Scientist and Director of Research, recently published the Python Data Science Handbook. This is a detailed guide to the most important Python tools for data science, covering IPython, Jupyter, NumPy, Pandas, Matplotlib, Scikit-Learn, and other tools.
Around this time last year, BIDS fellow Katy Huff launched her new book Effective Computation in Physics: Field Guide to Research with Python, written with her co-author Anthony Scopatz and published through O’Reilly. Worth checking out if you haven't already.