
What Is The Best Way To Get The Html For Html Agility Pack To Process?

I can't seem to get the HTML from a few sites, but can from many others. Here are 2 sites I am having issues with: https://www.rei.com https://www.homedepot.com I am building an

Solution 1:

This is your helper class, refactored to support most of the web responses that an HttpWebResponse can handle.

A note: never attempt this kind of setup unless Option Explicit and Option Strict are both set to On: you'll never get it right. Automatic inference is not your friend here (actually, it never is; you really need to know what objects you're dealing with).

What has been modified, and what is important to handle:

  • TLS handling: support is extended to TLS 1.1 and TLS 1.2 (the code enables only these two versions).

  • WebRequest.ServicePoint.Expect100Continue = False: you never want a 100-Continue response unless you're ready to comply with it, and it's never necessary.

  • AutomaticDecompression is required, unless you want to handle the GZip or Deflate streams manually. Manual handling is almost never needed (only if you want to analyze the original stream before decompressing it).

  • The CookieContainer is rebuilt every time. This has not been modified, but you could store a static object and reuse the cookies with each request: some sites may set cookies when the TLS handshake is performed and then redirect to a login page. A WebRequest can be used to POST authentication parameters (except captchas), but you need to preserve the cookies, otherwise any further request won't be authenticated.

  • The response stream's ReadToEnd() call is also left as is, but you should modify it to read into a buffer. That would let you show the download progress, for example, and also cancel the operation if required.

  • Important: the UserAgent cannot be set to a recent version of any existing browser. Some websites, when they detect that a User Agent supports the HSTS protocol, will activate it and wait for interaction. WebRequest knows nothing about HSTS and will time out. I set the UserAgent to Internet Explorer 11. It works fine with all sites.

  • HTTP redirection is set to automatic, but sometimes it's necessary to follow it manually. This could improve the reliability of this procedure: you could, for example, forbid redirections to out-of-scope destinations, or to an HTTP protocol change that you don't support.
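The buffered read suggested in the list above could look roughly like the following. This is a minimal sketch, not part of the original class: the module name `BufferedReadSketch`, the function `ReadInChunks`, its parameters, and the 8 KB chunk size are all assumptions made for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Threading

Public Module BufferedReadSketch
    ' Hypothetical helper (not in the original answer): reads a stream in
    ' fixed-size chunks, reports the running byte count after each chunk,
    ' and throws OperationCanceledException if cancellation is requested.
    Public Function ReadInChunks(source As Stream,
                                 textEncoding As Encoding,
                                 progress As Action(Of Long),
                                 token As CancellationToken) As String
        Dim chunk(8191) As Byte ' 8192 bytes; the size is an arbitrary choice
        Dim total As Long = 0

        Using collected As New MemoryStream()
            Dim bytesRead As Integer = source.Read(chunk, 0, chunk.Length)
            While bytesRead > 0
                token.ThrowIfCancellationRequested()
                collected.Write(chunk, 0, bytesRead)
                total += bytesRead
                If progress IsNot Nothing Then progress.Invoke(total)
                bytesRead = source.Read(chunk, 0, chunk.Length)
            End While

            ' Decode once at the end, so a multi-byte character is never
            ' split across two chunks.
            Return textEncoding.GetString(collected.ToArray())
        End Using
    End Function
End Module
```

Inside GetSiteResponse, the `reader.ReadToEnd()` call could then be replaced with something like `ReadInChunks(responseStream, Encoding.Default, Sub(n) Console.WriteLine(n), CancellationToken.None)`. Unlike a StreamReader, `GetString` does not strip a byte order mark, so you may need to handle that separately.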

A suggestion: this class would benefit from the async versions of the HttpWebRequest methods: you'd be able to issue a number of concurrent requests instead of waiting for each of them to complete synchronously. Only a few modifications are required to turn this class into an async version.
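As a rough illustration of that suggestion, an async variant could be sketched as below. The names `AsyncRequestSketch`, `GetSiteResponseAsync` and `GetManyAsync` are hypothetical, and the request setup is reduced to the essentials (the headers, cookies and status handling of the full class are left out).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Text
Imports System.Threading.Tasks

Public Module AsyncRequestSketch
    ' Hypothetical async counterpart of GetSiteResponse (assumed name).
    Public Async Function GetSiteResponseAsync(siteUri As Uri) As Task(Of String)
        Dim request As HttpWebRequest = WebRequest.CreateHttp(siteUri)
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate

        Using siteResponse As WebResponse = Await request.GetResponseAsync()
            Using reader As New StreamReader(siteResponse.GetResponseStream(), Encoding.Default)
                Return Await reader.ReadToEndAsync()
            End Using
        End Using
    End Function

    ' Starts all the requests, then waits for every one of them to finish.
    ' If any request fails, awaiting the combined task throws.
    Public Async Function GetManyAsync(uris As IEnumerable(Of Uri)) As Task(Of String())
        Dim pending As IEnumerable(Of Task(Of String)) =
            uris.Select(Function(u As Uri) GetSiteResponseAsync(u))
        Return Await Task.WhenAll(pending)
    End Function
End Module
```

One thing to keep in mind: as far as I know, the HttpWebRequest.Timeout property only applies to the synchronous GetResponse() call, so an async version would need its own timeout handling.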

This class should now support most HTML pages that don't use scripts to build their content asynchronously. As already described in the comments, a Lazy HttpClient can handle some (not all) of these pages, but it requires a completely different setup.


Imports System
Imports System.IO
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates
Imports System.Text

Public Class WebRequestHelper
    Private m_ResponseUri As Uri
    Private m_StatusCode As HttpStatusCode
    Private m_StatusDescription As String
    Private m_ContentSize As Long
    Private m_WebException As WebExceptionStatus

    Public Property SiteCookies As CookieContainer
    Public Property UserAgent As String = "Mozilla / 5.0(Windows NT 6.1; WOW32; Trident / 7.0; rv: 11.0) like Gecko"
    Public Property Timeout As Integer = 30000

    Public ReadOnly Property ContentSize As Long
        Get
            Return m_ContentSize
        End Get
    End Property

    Public ReadOnly Property ResponseUri As Uri
        Get
            Return m_ResponseUri
        End Get
    End Property

    Public ReadOnly Property StatusCode As Integer
        Get
            Return m_StatusCode
        End Get
    End Property

    Public ReadOnly Property StatusDescription As String
        Get
            Return m_StatusDescription
        End Get
    End Property

    Public ReadOnly Property WebException As Integer
        Get
            Return m_WebException
        End Get
    End Property

    Sub New()
        SiteCookies = New CookieContainer()
    End Sub

    Public Function GetSiteResponse(ByVal siteUri As Uri) As String
        Dim response As String = String.Empty

        ServicePointManager.DefaultConnectionLimit = 50
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12
        ServicePointManager.ServerCertificateValidationCallback = AddressOf TlsValidationCallback

        Dim Http As HttpWebRequest = WebRequest.CreateHttp(siteUri.ToString)
        With Http
            .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .AllowAutoRedirect = True
            .AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
            .CookieContainer = Me.SiteCookies
            .Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
            .Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.7")
            .Headers.Add(HttpRequestHeader.CacheControl, "no-cache")
            ' Default
            .KeepAlive = True
            .MaximumAutomaticRedirections = 50
            .ServicePoint.Expect100Continue = False
            .ServicePoint.MaxIdleTime = Me.Timeout
            .Timeout = Me.Timeout
            .UserAgent = Me.UserAgent
        End With

        Try
            Using webResponse As HttpWebResponse = DirectCast(Http.GetResponse, HttpWebResponse)
                Me.m_ResponseUri = webResponse.ResponseUri
                Me.m_StatusCode = webResponse.StatusCode
                Me.m_StatusDescription = webResponse.StatusDescription
                Dim contentLength As String = webResponse.Headers.Get("Content-Length")
                Me.m_ContentSize = If(String.IsNullOrEmpty(contentLength), 0, Convert.ToInt64(contentLength))

                Using responseStream As Stream = webResponse.GetResponseStream()
                    If webResponse.StatusCode = HttpStatusCode.OK Then
                        Dim reader As StreamReader = New StreamReader(responseStream, Encoding.Default)
                        Me.m_ContentSize = webResponse.ContentLength
                        response = reader.ReadToEnd()
                        ' ContentLength is -1 when the server doesn't report it
                        Me.m_ContentSize = If(Me.m_ContentSize = -1, response.Length, Me.m_ContentSize)
                    End If
                End Using
            End Using
        Catch exW As WebException
            If exW.Response IsNot Nothing Then
                Me.m_StatusCode = CType(exW.Response, HttpWebResponse).StatusCode
            End If
            Me.m_StatusDescription = "WebException: " & exW.Message
            Me.m_WebException = exW.Status
        End Try
        Return response
    End Function

    Private Function TlsValidationCallback(sender As Object, CACert As X509Certificate, CAChain As X509Chain, SslPolicyErrors As SslPolicyErrors) As Boolean
        ' If you trust the (known) Server, you could just return True
        If SslPolicyErrors = SslPolicyErrors.None Then Return True

        Dim Certificate As New X509Certificate2(CACert)

        CAChain.Build(Certificate)
        ' Reject the certificate on any chain error other than an untrusted root
        For Each CACStatus As X509ChainStatus In CAChain.ChainStatus
            If (CACStatus.Status <> X509ChainStatusFlags.NoError) And
                (CACStatus.Status <> X509ChainStatusFlags.UntrustedRoot) Then
                Return False
            End If
        Next
        Return True
    End Function
End Class
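
For completeness, here is one way the class might be used together with Html Agility Pack. It is only a sketch: it assumes the HtmlAgilityPack package is referenced and that the WebRequestHelper class above is in the same project, and the `UsageSketch` module and the `//title` query are just illustrative choices.

```vbnet
Imports System
Imports HtmlAgilityPack

Module UsageSketch
    Sub Main()
        Dim helper As New WebRequestHelper()
        Dim html As String = helper.GetSiteResponse(New Uri("https://www.rei.com"))

        ' StatusCode is exposed as an Integer; 200 corresponds to HttpStatusCode.OK
        If helper.StatusCode = 200 AndAlso html.Length > 0 Then
            Dim document As New HtmlDocument()
            document.LoadHtml(html)

            ' SelectSingleNode returns Nothing when the XPath query matches no node
            Dim title As HtmlNode = document.DocumentNode.SelectSingleNode("//title")
            Console.WriteLine(If(title Is Nothing, "(no title)", title.InnerText))
        Else
            Console.WriteLine($"{helper.StatusCode}: {helper.StatusDescription}")
        End If
    End Sub
End Module
```

If the request fails, GetSiteResponse returns an empty string, so checking StatusCode and StatusDescription before parsing tells you what went wrong.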
