Crawling after login in Python

Question

I am studying crawling using Python.

My goal is to download files.

I am studying login now and it is very difficult.

For example, I need to log in to download files from this site.

I looked up various resources, including the question "Login to website using python".

But the site I want seems a bit different.

I was able to crawl most sites that do not require login.

However, I cannot crawl sites that require login.

So I really want to study that part.

My goal is to log in and then retrieve the page's HTML so that I can crawl it.

Below is my code. Is this the right thing to do?

from requests import session

# ex) ID = abcd  / PW = 1234

payload = {
    'ctl00$ContentPlaceHolder1$tbxLoginID': 'abcd',
    'ctl00$ContentPlaceHolder1$tbxLoginPW': '1234'
}

with session() as c:
    c.post('LOGIN_URL_HERE', data=payload)
    response = c.get('PROTECTED_PAGE_URL_HERE')
    print(response.headers)
    print(response.text)

Priyaj · Answer 1 · Sep 14, 2018

Your payload is missing several of the login form's fields. Here is what the payload should look like:

payload = {
    '__LASTFOCUS': '',  # empty
    '__VIEWSTATE': 'get this value from the login page source',
    '__VIEWSTATEGENERATOR': 'get this value from the login page source',
    '__EVENTTARGET': '',  # empty
    '__EVENTARGUMENT': '',  # empty
    '__EVENTVALIDATION': 'get this value from the login page source',
    'ctl00$agentPlatform': '1',
    'ctl00$menu_nav1$tbxSearchWord': '',  # empty
    'ctl00$ContentPlaceHolder1$radiobutton': '0',
    'ctl00$ContentPlaceHolder1$tbxLoginID': 'abcd',
    'ctl00$ContentPlaceHolder1$tbxLoginPW': '1234',
    # I think the .x/.y values are the mouse cursor position when the login
    # button was clicked; I am not sure whether they are necessary.
    'ctl00$ContentPlaceHolder1$ibtnLogin.x': '36',
    'ctl00$ContentPlaceHolder1$ibtnLogin.y': '25'
}
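
The `__VIEWSTATE`, `__VIEWSTATEGENERATOR`, and `__EVENTVALIDATION` values are generated by the server, so rather than copying them by hand you could fetch the login page first and read them from its hidden inputs. Below is a minimal sketch of that idea. It assumes those fields appear as `<input type="hidden">` elements on the login page and that the form posts back to the same URL; I have not verified either for this site. The names `HiddenFieldParser`, `extract_hidden_fields`, and `login` are made up for illustration, and `LOGIN_URL_HERE` is a placeholder, as in the question.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def login(session, login_url, user_id, password):
    """Fetch the login page, copy its hidden fields, then post the form.

    `session` is expected to behave like requests.Session (get/post).
    Assumption: the form posts back to `login_url` itself.
    """
    page = session.get(login_url)
    page.raise_for_status()

    # Start from the server-generated hidden fields (__VIEWSTATE, etc.).
    payload = extract_hidden_fields(page.text)

    # Add the visible fields, using the names from the payload above.
    payload.setdefault('ctl00$agentPlatform', '1')
    payload.update({
        'ctl00$menu_nav1$tbxSearchWord': '',
        'ctl00$ContentPlaceHolder1$radiobutton': '0',
        'ctl00$ContentPlaceHolder1$tbxLoginID': user_id,
        'ctl00$ContentPlaceHolder1$tbxLoginPW': password,
        'ctl00$ContentPlaceHolder1$ibtnLogin.x': '36',
        'ctl00$ContentPlaceHolder1$ibtnLogin.y': '25',
    })

    response = session.post(login_url, data=payload)
    response.raise_for_status()
    return response


# Possible usage (placeholders, not real URLs):
#
#   import requests
#   with requests.Session() as s:
#       login(s, 'LOGIN_URL_HERE', 'abcd', '1234')
#       print(s.get('PROTECTED_PAGE_URL_HERE').text)
```

Using the same session for the GET and the POST keeps any cookies the login page sets, which is usually needed for the protected page request that follows. A successful HTTP status does not prove the login worked, so it is still worth checking the returned HTML for something only a logged-in user would see.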