Web Scrape with Python
What is Web Scraping?
a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.
Important notes about web scraping
- check the legal stuff about using the data, read the Terms and Conditions of the website.
- do not frequently and rapidly download data, as to avoid breaking the web or even getting blocked from website.
Inspecting the Website
It important to scan and check the website for the pupose of finding relevant pieces of code that contains our data. then simply do the “inspect”.
Python Code
- librarie to use:
import requests import urllib.request import time from bs4 import BeautifulSoup
- set the url to the website and access the site with our requests library
url = ‘http://web.mta.info/developers/turnstile.html’ response = requests.get(url)
- parse the html with BeautifulSoup
soup = BeautifulSoup(response.text, “html.parser”)
- **locate all of our tags**
soup.findAll(‘a’)
-extract the actual link
one_a_tag = soup.findAll(‘a’)[38] link = one_a_tag[‘href’]
download_url = ‘http://web.mta.info/developers/’+ link urllib.request.urlretrieve(download_url,’./’+link[link.find(‘/turnstile_’)+1:])
time.sleep(1)