Web Scraping A Telegram Channel [with Requests and BeautifulSoup]
Hello! It has been a while since our last tutorial, so I've finally put together this quick one.
Web Scraping
According to Wikipedia, Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.
So we will basically use Python to make requests to web servers and manipulate the data it returns. But why would we need to scrape websites?
Some websites don’t offer an API, or their API pricing is too expensive. So, if a website doesn’t forbid web scraping (for example, in the terms of service you accept during signup), you can scrape its data. Let’s get into it.
[Note: The video tutorial is available at the bottom of this tutorial.]
Steps:
Step 1:
Firstly, we need to install two Python packages (assuming that you already have Python installed). We will use requests to make HTTP requests to web servers and get the responses. Later on, we will use BeautifulSoup (bs4) to parse the HTML content of those responses. So, go ahead and run:
pip install requests beautifulsoup4
Step 2:
We will be scraping the JayBeeBots Telegram channel. Now, we need to decide what data we actually need. We are interested in the channel name, description, and number of subscribers. Let’s code it up. Open your code editor and create a Python file.
1. Import requests and bs4
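A minimal import block might look like this (BeautifulSoup is imported under the alias bs, which is how it is referred to later in this tutorial):

import requests
from bs4 import BeautifulSoup as bs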
2. Choosing a URL to make requests
We will be making a GET request to https://t.me/jaybeebots using the requests package and checking the status code. The status code should be 2xx; HTTP status codes in the 2xx range indicate a successful response.
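Something along these lines (the variable names are just placeholders):

url = 'https://t.me/jaybeebots'
response = requests.get(url)
print(response.status_code)  # should be 200 (or another 2xx code)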
3. Parsing HTML Content using BeautifulSoup (bs4)
We created bs as an alias of BeautifulSoup. We create a BeautifulSoup object by passing it the response we got using requests along with ‘html.parser’. We can now query this object to get HTML contents (like the contents inside div, p, and h1 tags) using the find() method like this:
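For example, a call like the one below pulls a single tag out of the page by its class name. The tgme_page_title class name is what the t.me channel preview page used at the time of writing; inspect the page source to confirm it before relying on it.

soup = bs(response.text, 'html.parser')
title_div = soup.find('div', class_='tgme_page_title')
print(title_div.get_text(strip=True))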
Code Snippet:
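Here is a minimal sketch of the full script, assuming the t.me preview page exposes the channel name, description, and subscriber count in divs with the tgme_page_title, tgme_page_description, and tgme_page_extra classes (check the page source yourself, since these class names can change):

import requests
from bs4 import BeautifulSoup as bs

url = 'https://t.me/jaybeebots'
response = requests.get(url)

if response.status_code == 200:
    soup = bs(response.text, 'html.parser')

    # Class names taken from the t.me channel preview page;
    # confirm them by inspecting the page source.
    name = soup.find('div', class_='tgme_page_title')
    description = soup.find('div', class_='tgme_page_description')
    subscribers = soup.find('div', class_='tgme_page_extra')

    print('Channel name:', name.get_text(strip=True) if name else 'not found')
    print('Description:', description.get_text(strip=True) if description else 'not found')
    print('Subscribers:', subscribers.get_text(strip=True) if subscribers else 'not found')
else:
    print('Request failed with status code', response.status_code)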
Here is the GIST in case the snippet doesn’t work.