Full tutorial with Python code samples: https://www.jcchouinard.com/python-requests/
Support my work: https://www.buymeacoffee.com/jcchouinard
Subscribe to the Python course waitlist:
https://ba995fe7.sibforms.com/serve/MUIEAPv50TFkNuoknCNvzRUKLLhvvZd5jzCwEZvi9BNjtkhVuEtLpEG58-Khdquyg9v5V-1qeGDwIXy739I4eFIVcCqeqqV3UUisW_-hAd5ljC1dGMrQvXHC7JvORh9TbnLA1CHqWro4N7YVZ4730-D5dXGxqd3CbaVHSJpS5fyylPMVzCe1_y9xOTl2-CsvEuhO01E0Ytv59HEJ
Subscribe to this channel:
https://www.youtube.com/channel/UCthJJqyUGdI5EA0YjnjXthw?sub_confirmation=1
Follow me: https://twitter.com/ChouinardJC
https://www.linkedin.com/in/jeanchristophechouinard/
The Python requests library is one of the most-used libraries to make HTTP requests using Python.
In this tutorial, you will learn how to make GET and POST requests, process responses, and handle errors with the Python requests library.
Video Transcript
0:00
Hello, my name is Jean-Christophe Chouinard and today we're going to talk about Python requests
0:04
We're going to look at the Python requests library and how we can use it to make GET and POST requests to a web server
0:11
You can find all the information from this video on my blog, under the python-requests path
0:19
What we're going to have to do first is to install Python and the required packages
0:25
So you can see how to install Python on your machine here, and you can also
0:29
install requests in your terminal by using pip.
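In a terminal, that install step looks roughly like this (a sketch; it assumes Python and pip are already set up):

```python
# In your terminal (not in Python):
#   pip install requests

# Then, in your Python script or notebook:
import requests
```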
0:36
Then we're going to cover the basic methods that you can use with requests. So we're going to
0:42
make GET requests and POST requests. So the way the web works is by calling a
0:47
web server: you make a GET request to the web server in order to receive the
0:51
content from that page. And you can use POST requests in order to add stuff to a
0:58
web server. For example, when you post a status on social media or whenever you fill in a form. But in this situation, in
1:07
this case, what we're going to do is we're going to do these things by using Python instead
1:12
of the browser, for example. So let's make the first simple GET request. The first step is to
1:19
import the requests library. And then what we do is we initialize a response variable, and we are
1:29
going to use the requests library and the get method from the requests library in order to fetch a
1:36
URL. So the URL we're going to define here is crawler-test, which I'm using because it
1:44
was built for experimenting with crawlers. So here, what we want to do is we want to print the URL
1:52
We want to print the response status code and we want to print the
1:59
response headers. Here you can see that when you run this, you've successfully
2:10
fetched the page: it returned a 200 status code and you have the
2:15
HTTP headers here.
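Here is a minimal sketch of that first GET request. The crawler-test.com URL is assumed from the site described in the video; any URL works the same way.

```python
import requests

url = "https://crawler-test.com/"  # a site built for testing crawlers (assumed URL)

response = requests.get(url)

print(response.url)          # the URL that was fetched
print(response.status_code)  # 200 when the page was fetched successfully
print(response.headers)      # the HTTP response headers
```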
2:21
There are other things you can do, like making a POST request. So in order to make a POST request, again
2:25
we're going to use another website that was built to help you
2:29
test POST requests, and we're going to define what we want. So for example, let's say we want to fill in a form
2:36
with my own information in the form. How I'm going to do that is I'm going to use the requests
2:44
library, but I'm going to use the post method instead this time, and I'm going to fetch the URL
2:50
But this time I'm going to add the data parameter, pass the payload to that parameter, and then I'm going to look at the response
2:59
JSON coming from it. And now what you can see is that my last name, first name
3:07
and website were added to my POST request. So this is how you use a POST request
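A minimal sketch of that POST request. The endpoint (httpbin.org, which echoes back whatever you send) and the payload values are assumptions, since the video does not name them here.

```python
import requests

url = "https://httpbin.org/post"  # assumed form-testing endpoint

payload = {
    "first_name": "Jean-Christophe",
    "last_name": "Chouinard",
    "website": "https://www.jcchouinard.com",
}

response = requests.post(url, data=payload)
print(response.json())  # the submitted fields come back under the "form" key
```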
3:14
Now let's look at the methods and attributes. So let's come back to GET requests. And for the rest of this tutorial I'm going to
3:24
focus on GET requests. So we import the requests module and we define what
3:29
URL we want to crawl. So in this case, I'm getting the same URL again
3:35
and I'm making the GET request to the URL. This time we want to know, okay, what can I do with this response
3:44
So if I look at the response, I run this, I can see that, okay, it's 200, but what does it tell us
3:52
In order to investigate what is inside this, you can use the help function and look at the response
3:59
In that case, you get all the information about the methods that you can use, the attributes and stuff like this, like the descriptors and what they mean
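A quick sketch of how to inspect the Response object that way:

```python
import requests

response = requests.get("https://crawler-test.com/")  # assumed URL

help(response)        # full documentation of the Response methods and attributes
print(dir(response))  # or just the attribute and method names, more compactly
```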
4:12
So now we know what's there; we're going to mostly focus on the ones that I think are important for beginners
4:19
So we're going to look at some data descriptors and attributes, and we're going to look at the JSON method
4:24
So in order to access attributes, the way we access an attribute is usually by using the dot notation and the name of the attribute
4:36
So for example, when we want to look at the status code again, we can run this and it's a 200 status code
4:44
We can look at the text which will return the content of the page in a string format
4:51
format. Or we can look at the content, which returns the response in bytes format; the b'' prefix tells you that it's in bytes
5:01
Or you can again look at the response headers.
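A short sketch of those attributes side by side (URL assumed as before):

```python
import requests

response = requests.get("https://crawler-test.com/")

print(response.status_code)    # 200
print(response.text[:200])     # page content as a string (first 200 characters)
print(response.content[:200])  # page content as bytes, shown with the b'' prefix
print(response.headers)        # the response headers
```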
5:15
But we can also do other very cool stuff, like looking out for redirections. So if you build a crawler, you may want to follow the redirections, and you may want to see
5:21
from where it started to where it went. So let's look at a URL that is creating some redirections
5:29
and see what's happening. So we make the request. And then what we want to do is look at the response history
5:43
attribute. That gives you a list with each of the responses. So what you want to do here is
5:51
create a for loop that would say: for each redirect in the
6:03
response history, print the redirect URL and the redirect status code
6:13
And then you print the end URL on which you ended up
6:19
So then you get that redirect chain: you see that we started from this one
6:24
there was a 301 redirect, then this one, and then we ended up on the target, which returned a 200 status code
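A sketch of that redirect-chain loop; the redirecting path is illustrative, and any URL that redirects behaves the same way:

```python
import requests

# Illustrative path that triggers a redirect chain
response = requests.get("https://crawler-test.com/redirects/redirect_chain_allowed")

# response.history is the list of intermediate responses, in order
for redirect in response.history:
    print(redirect.url, redirect.status_code)  # e.g. 301 redirects

print(response.url, response.status_code)      # the final URL, usually 200
```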
6:31
The other thing that you can do is you may want to disable this
6:36
So if you build a crawler and you don't want to follow redirects, for example, you can just use the allow_redirects parameter and set it to False
6:46
In this case, what we're going to do is just fetch the first URL and don't bother with the rest
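Roughly like this:

```python
import requests

# Fetch only the first URL and do not follow the redirect chain
response = requests.get(
    "https://crawler-test.com/redirects/redirect_chain_allowed",  # illustrative URL
    allow_redirects=False,
)
print(response.status_code)  # the redirect status code itself, e.g. 301
```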
6:51
If you want to access a method, you access it with the dot notation again, but this time you add parentheses
7:01
at the end to call it. What's happening here is that we're not fetching JSON
7:09
we're fetching HTML with the previous page. But whenever we call an API, usually it returns something like JSON
7:19
So if we fetch this and this time, since we're fetching an API
7:24
if we use the JSON method, we can actually show what's on the page
7:30
page. If you go to that URL in your browser, you're going to see this on the page
7:35
And that's how we fetched it with requests.
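A minimal sketch of calling the json() method; the API endpoint (httpbin.org/json) is an assumption, any API that returns JSON works the same way:

```python
import requests

response = requests.get("https://httpbin.org/json")  # assumed JSON API endpoint

data = response.json()  # parse the JSON body into a Python dict or list
print(data)
```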
7:44
So now we've learned the methods and attributes. We're going to start trying to process the response. So in order to process that response, what we're going to do is recrawl the URL
7:56
which is crawler-test, and we're going to get that URL
8:02
So if we look at the response text, we can see that we have the content of the page here
8:08
But this is a string, so you cannot do much with that string
8:13
So what you want to do is import BeautifulSoup from bs4
8:20
And Beautiful Soup, what it does is it actually parses the page
8:25
And parsing means it will try to understand each of the HTML
8:29
tags and where they sit. In order to parse the response from that page, we're going to create a soup object: we call BeautifulSoup, pass in the response text that we had, and define it as an HTML parser to be able to parse that response. And then when we look at that soup, instead of getting
8:59
I made an error, joy of live coding. So when you look at that soup
9:09
you can see that each of these tags is now well understood by Beautiful Soup
9:19
And now you can start to do things like getting the soup title, for example
9:27
So what you can do is you can use the soup.find method and then you try to get the title from the page
9:38
And by doing this you can see, okay, cool, I get the title
9:43
So we can see that the soup object can help you understand the content of the page
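Here is a sketch of that parsing step and the find call, assuming the same crawler-test.com URL:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://crawler-test.com/")

# Parse the raw HTML string into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

print(soup.find("title"))  # the <title> tag of the page
```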
9:49
You can also use the find_all method to look at the meta tags
9:55
and then you can see you have a lot of meta tags. And if you want to go deeper in this
10:00
and just select, for example, the meta description here, you can select some of the attributes from this
10:09
by using the attributes dictionary, where you select name equal to description
10:21
So you use a dictionary to say which attributes you want
10:24
And in this case, you're just selecting the description and not all of them
10:29
And you can potentially just select the first meta description if that's what you want
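A sketch of narrowing down to the meta description (the setup lines are repeated so the snippet runs on its own):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://crawler-test.com/")
soup = BeautifulSoup(response.text, "html.parser")

print(soup.find_all("meta"))  # every meta tag on the page

# Use a dictionary of attributes to keep only the meta description
description = soup.find_all("meta", attrs={"name": "description"})
print(description)     # all matching tags
print(description[0])  # or just the first one
```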
10:36
And there it is. So now that we know this, what we're going to do is
10:49
try to go a little bit deeper by finding all the SEO tags on the page
10:58
So we're going to start again with that same kind of setup and parse the page
11:04
And then what we're going to do is select the SEO tags that we've just discussed
11:10
We're going to try to get the title. We're going to try to get the H1
11:16
We're going to get the description, the meta description, which is what you see on Google when you search for a web page. We're gonna get the meta robots
11:26
which defines if a page is indexable, and the canonical tag, which defines what is the important page
11:34
what is the good version of a page. So what you can do is, for example, you can soup.find the title
11:42
you can soup.find the H1, and you can select some of the meta attributes that we just discussed, like the name description, or, instead of the name description, we can use the name robots, and, for example, we can find the link which has rel canonical
12:12
So cool, now we have all of this. Let's look at what the description looks like
12:23
So we have a list, so we have our meta description, and what we want to select is not the actual entire tag, we want to select the text from the description here
12:36
So what we want to do is we want to
12:42
use the description's content attribute. So then we can easily select this, and this is
12:49
possible because of Beautiful Soup. We may want to select something else, like
12:55
the title, using the get_text method, because the title doesn't have the same
13:01
HTML tag. If we look at the title it's in there. So what we want to use is
13:07
instead we want to use the get_text method in order to get the text that is
13:12
in it, compared to the content, where we wanted to get the attribute from the page
13:18
But what happens if we want to get a canonical href and it doesn't exist?
13:27
So what will happen here is we're going to get a NoneType error saying that it's not subscriptable
13:32
So what we want to do here, instead of just calling the text directly, is this:
13:41
we want to put an if/else. So if the title exists, get the text, else do something else
13:49
If the description exists, get the content, else do nothing. And then what we want to do is repeat the process for the canonical and the description
14:04
And we want to repeat the process for the meta robots tag and for the H1
14:11
that we just defined. So we have all of these. We want to set up these into variables
14:17
So bear with me here, I'm setting up my variables: description equals, canonical
14:27
equals, and meta robots equals. So in this case, whenever I'm trying to print all my tags here
14:37
what will happen is, instead of throwing an error, it will just return the tags with an empty string. So this is the kind of thing that you want to do. Building a crawler is not simple: you have to think of all these use cases where it can break
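Putting that together, here is a sketch of the SEO-tag extraction with the guards against missing tags (URL assumed as before):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://crawler-test.com/")
soup = BeautifulSoup(response.text, "html.parser")

title_tag = soup.find("title")
h1_tag = soup.find("h1")
description_tag = soup.find("meta", attrs={"name": "description"})
robots_tag = soup.find("meta", attrs={"name": "robots"})
canonical_tag = soup.find("link", attrs={"rel": "canonical"})

# If a tag doesn't exist, fall back to an empty string
# instead of raising an error on NoneType
title = title_tag.get_text() if title_tag else ""
h1 = h1_tag.get_text() if h1_tag else ""
description = description_tag["content"] if description_tag else ""
meta_robots = robots_tag["content"] if robots_tag else ""
canonical = canonical_tag["href"] if canonical_tag else ""

print(title, h1, description, meta_robots, canonical)
```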
14:54
Now let's try to get all the links on the page. So we are still going to import the BeautifulSoup and requests packages, but this time we're also going to import a new
15:04
module. No, not right now. Let's start and make the request to the URL and
15:11
parse the page again like we just did. What we want to do is find all
15:21
the links on the page. What we're going to use is the find_all method that we just discussed
15:27
about. And we're going to try to get the A tags. What will happen here is we're probably
15:37
going to get a lot of A tags that are not related to
15:41
links, but images. So if you just want to have the actual links and not the
15:48
image links, then you set href equal to True, and then what you can do
15:54
is extract the link from the href attribute. So what we're gonna
16:02
have to do is loop through each of the tags: for
16:06
link in soup.find_all. And what we're
16:11
going to initialize a links list that we're going to append the links to
16:18
So what we're going to do is we're going to say, okay, print the link href
16:28
So here, let's run this and say, okay, print the link href
16:37
Okay, now I have those links, but these are relative URLs. They don't contain the domain name
16:44
So what we want to do here is import urljoin from urllib.parse
16:53
And what urljoin will do is, if the domain URL is not in the link, it will prepend it
17:08
If it is already there, it will not. So it will help you combine these links
17:17
And then, instead of printing, we're going to append it to the URLs list
17:26
And then what we're going to do, we're going to print the first 10 links
17:32
So now when we run this, you see they're not relative URLs, but absolute URLs
17:36
now. So this is what you want to do if you want to keep crawling the web, so this is very useful
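Here is a sketch of the whole link-extraction loop with urljoin (URL assumed as before):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://crawler-test.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

links = []
# href=True keeps only the <a> tags that actually have an href attribute
for link in soup.find_all("a", href=True):
    # urljoin turns relative paths into absolute URLs; absolute hrefs stay as-is
    absolute_url = urljoin(url, link["href"])
    links.append(absolute_url)

print(links[:10])  # the first 10 links
```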
17:44
So we've covered a few of the things that we can do using the request library. Now what we want to do is to make sure that we handle errors
17:55
So for example, if I get a URL that is badly formatted, so this doesn't exist, we cannot fetch that page. So if we make a request to that
18:06
page, we're gonna get an error thrown at us saying, okay, that URL doesn't exist
18:13
So sometimes you may want to keep crawling the web, but you don't want
18:18
your crawler to break just because of one URL that is badly formatted. So what you're gonna do is wrap this in a try/except, and that's always good practice. So as you start building with this library
18:33
you will always want to use that try/except solution
18:43
in order to prevent those errors, and then you say, okay, print the error
18:48
So in this case, what you're trying to do is say, okay, there was an error, but you can keep doing things
18:53
So if you just want to say, print this, in this case
18:59
you realize that your code can keep running, even if you hit something that breaks it
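A sketch of that try/except pattern with a badly formatted URL:

```python
import requests

url = "not-a-valid-url"  # a badly formatted URL

try:
    response = requests.get(url)
    print(response.status_code)
except Exception as e:
    # Print the error instead of letting the crawler crash
    print(e)

print("The crawler keeps running")
```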
19:05
So this is a very good practice to have. Then the next thing is you may want to change your user agent
19:12
So you can look at what is your user agent and find out what it is
19:17
This is mine. So what you want to do is if you want
19:21
to add a user agent to your request, because some applications will require a user agent to be available
19:28
then you set up your headers and you go in your headers and you use a user agent
19:39
and you define it as your user agent and then when you make your requests
19:48
URL, then you pass the headers as the headers parameter. And that's it
19:56
So you define your response and you can show the JSON from the response and you're going
20:04
to see that it works. And whenever Reddit looks at their logs, they're going to see that
20:10
I came and tried to scrape their page. So it is good practice to
20:18
always define who your user agent is whenever you use a web crawler.
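A sketch of adding a user agent; the Reddit endpoint and the user-agent string are assumptions for illustration:

```python
import requests

url = "https://www.reddit.com/r/python/top.json"  # assumed example endpoint

# Identify your crawler; replace with your own user-agent string
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"}

response = requests.get(url, headers=headers)
print(response.json())
```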
20:26
You may want to add a timeout. So if a page takes 10 hours to load, you may want to skip that page. And the
20:33
way you're going to skip that page is again, let's follow the good practice and put the
20:38
exception in there and print the exception
20:48
if there is an error and now what we're gonna say is we're gonna try to get that
20:53
request make that request but this time what we're gonna do is we're gonna get
20:58
the URL and we're gonna add a timeout in case the page takes too long to
21:04
load so whenever we make that request we can say that okay cool there was an error
21:10
and the error was a connect timeout, because of the timeout I put in there. So this
21:15
is good. This can be useful to build a crawler that is efficient and doesn't get stopped in a loop because a page doesn't want to load
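A sketch of the timeout combined with the same try/except good practice; the tiny timeout value is just there to force the error:

```python
import requests

url = "https://crawler-test.com/"  # assumed URL

try:
    # timeout is in seconds; an unrealistically small value forces a timeout error
    response = requests.get(url, timeout=0.001)
    print(response.status_code)
except Exception as e:
    print("There was an error:", e)
```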
21:26
Now next you may want to use proxies. So use a different IP than yours to crawl a website
21:34
This can be useful for black hat people, or it can be useful to scrape a website that doesn't want you to scrape it
21:42
So what you can do is actually you set up the proxy HTTP to
21:48
the IP that you want it to be and then you make that request
21:54
get the URL and you pass the proxies parameter: proxies equals proxies
22:05
and then you can make that request.
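A sketch of the proxies parameter; the proxy addresses are placeholders, you would use a proxy you actually have access to:

```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",   # placeholder proxy
    "https": "http://10.10.1.10:1080",  # placeholder proxy
}

response = requests.get("https://crawler-test.com/", proxies=proxies)
print(response.status_code)
```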
22:13
Okay, we've kind of covered this already: adding headers to the request. But this can be very useful whenever you fetch an API
22:18
that requires you to have an access token that is authorized. So here we can set up our access token, for example, a bearer token
22:29
And if you don't understand, don't worry, I'm just saying if you want to add a header to your request
22:36
that's how you do it: you use the access token header and you use the headers parameter
22:44
and you can add any headers that you want. You can see here that there was an Authorization key added to my headers
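A sketch of adding an authorization header; the token is a placeholder, and httpbin.org/headers (which echoes back the headers it received) is an assumed endpoint:

```python
import requests

access_token = "YOUR_ACCESS_TOKEN"  # placeholder token
headers = {"Authorization": f"Bearer {access_token}"}

response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json())  # the Authorization key shows up in the echoed headers
```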
22:54
Where it gets interesting is that sometimes you will try to fetch that API many times
23:03
and you don't necessarily want to add the headers each time. So in this case you can use the session object
23:11
So instead of making a plain request, you can make a session: you import requests
23:16
and then you initialize a Session object. And that session object can then be used to crawl and to add headers.
23:25
So you add only your access token once to your session. So what you're doing is you're going to the session
23:33
and you define the headers and you say update the header by adding the access token
23:42
And now what's interesting here is that you can make as many requests as you want: instead of using
23:53
requests here, you use the session.get method. And then, if you
24:00
look at your response we can essentially fetch that response and you're going to see
24:05
that the headers were added to any kind of request we made using that session, so this
24:11
can be useful to fetch an API many times without adding the headers each time
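And here is a sketch of the same thing with a Session, so the header is attached once and reused on every request (same placeholder token and assumed endpoint):

```python
import requests

access_token = "YOUR_ACCESS_TOKEN"  # placeholder token

# Create the session and attach the header once
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {access_token}"})

# Every request made through the session now carries the Authorization header
response = session.get("https://httpbin.org/headers")  # assumed endpoint
print(response.json())
```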
24:16
So if you like that video, subscribe to my channel and share a lot and just go to my blog
24:22
subscribe to my newsletter, and just subscribe to everything I do. You're going to be very satisfied by that
24:28
So enjoy your day or night, wherever you are, and have a good one