What Is The Best Way To Get The Html For Html Agility Pack To Process?
Solution 1:
This is your helper class, refactored to support most of the web responses that an HttpWebResponse can handle.
A note: never attempt this kind of setup unless Option Explicit and Option Strict are set to On: you'll never get it right otherwise. Automatic inference is not your friend here (well, it actually never is; you really need to know what objects you're dealing with).
What has been modified and what is important to handle:
- TLS handling: only TLS 1.1 and TLS 1.2 are enabled.
- WebRequest.ServicePoint.Expect100Continue = False: you never want this kind of response unless you're ready to comply with it, and it's never necessary.
- AutomaticDecompression is required, unless you want to handle the GZip or Deflate streams manually. Manual handling is almost never needed (only if you want to analyze the original stream before decompressing it).
- The CookieContainer is rebuilt every time. This has not been modified, but you could store a static object and reuse the cookies with each request: some sites may set cookies when the TLS handshake is performed and then redirect to a login page. A WebRequest can be used to POST authentication parameters (except captchas), but you need to preserve the cookies, otherwise any further request won't be authenticated.
- The response stream's ReadToEnd() call has also been left as is, but you should modify it to read through a buffer. That would let you show the download progress, for example, and cancel the operation if required.
- Important: the UserAgent cannot be set to a recent version of any existing browser. Some websites, when they detect that a User Agent supports the HSTS protocol, will activate it and wait for interaction. WebRequest knows nothing about HSTS and will time out. I set the UserAgent to Internet Explorer 11. It works fine with all sites.
- HTTP redirection is set to automatic, but sometimes it's necessary to follow it manually. This could improve the reliability of the procedure: you could, for example, forbid redirections to out-of-scope destinations, or to an HTTP protocol change that you don't support.
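The buffered read suggested in the list above could look roughly like the following. This is only a sketch: ReadAllBuffered, the module name, the 8 KB chunk size, and the IProgress(Of Long) and CancellationToken parameters are assumptions made for illustration and are not part of the original class.

```vbnet
Option Strict On
Option Explicit On

Imports System
Imports System.IO
Imports System.Text
Imports System.Threading

' Hypothetical helper module, not part of the original answer.
Public Module BufferedReadSketch

    ' Copies the source stream in fixed-size chunks, reporting the running
    ' byte count after each chunk and checking for cancellation between reads.
    ' A read that is already blocked is not interrupted by the token.
    Public Function ReadAllBuffered(source As Stream,
                                    textEncoding As Encoding,
                                    progress As IProgress(Of Long),
                                    token As CancellationToken) As String
        Dim chunk(8191) As Byte   ' 8192 bytes; the size is an arbitrary choice
        Dim total As Long = 0

        Using collected As New MemoryStream()
            Dim bytesRead As Integer = source.Read(chunk, 0, chunk.Length)
            While bytesRead > 0
                token.ThrowIfCancellationRequested()
                collected.Write(chunk, 0, bytesRead)
                total += bytesRead
                If progress IsNot Nothing Then progress.Report(total)
                bytesRead = source.Read(chunk, 0, chunk.Length)
            End While

            ' Decode through a StreamReader so the text is handled the same
            ' way as in the original ReadToEnd() call.
            collected.Position = 0
            Using reader As New StreamReader(collected, textEncoding)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

End Module
```

Inside GetSiteResponse, the reader.ReadToEnd() call could then be swapped for something like ReadAllBuffered(responseStream, Encoding.Default, Nothing, CancellationToken.None), passing a real progress reporter and token when they are available.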
A suggestion: this class would benefit from the async versions of the HttpWebRequest methods: you'd be able to issue a number of concurrent requests instead of waiting for each of them to complete synchronously.
Only a few modifications are required to turn this class into an async version.
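As a rough idea of what an async variant might involve, here is a minimal sketch built on HttpWebRequest.GetResponseAsync() and StreamReader.ReadToEndAsync(). The module and the names GetSiteResponseAsync and GetAllAsync are hypothetical, and the status, cookie, and error handling of the full class is left out for brevity.

```vbnet
Option Strict On
Option Explicit On

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Threading.Tasks

' Hypothetical module, not part of the original answer.
Public Module AsyncRequestSketch

    ' Downloads one page without blocking the calling thread.
    Public Async Function GetSiteResponseAsync(siteUri As Uri) As Task(Of String)
        Dim request As HttpWebRequest = WebRequest.CreateHttp(siteUri)
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

        Using response As WebResponse = Await request.GetResponseAsync()
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.Default)
                Return Await reader.ReadToEndAsync()
            End Using
        End Using
    End Function

    ' Starts all requests first, then awaits them together,
    ' so the downloads overlap instead of running one after another.
    Public Async Function GetAllAsync(siteUris As IEnumerable(Of Uri)) As Task(Of String())
        Dim pending As New List(Of Task(Of String))()
        For Each siteUri As Uri In siteUris
            pending.Add(GetSiteResponseAsync(siteUri))
        Next
        Return Await Task.WhenAll(Of String)(pending)
    End Function

End Module
```

The results come back in the same order as the input URIs. The HttpWebRequest documentation states that Timeout has no effect on asynchronous requests started with BeginGetResponse, so it is worth checking how timeouts behave with GetResponseAsync on your target framework before relying on them.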
This class should now support most HTML pages that don't use scripts to build their content asynchronously. As already described in the comments, a Lazy HttpClient can handle some (not all) of these pages, but it requires a completely different setup.
Imports System
Imports System.IO
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates
Imports System.Text

Public Class WebRequestHelper
    Private m_ResponseUri As Uri
    Private m_StatusCode As HttpStatusCode
    Private m_StatusDescription As String
    Private m_ContentSize As Long
    Private m_WebException As WebExceptionStatus

    Public Property SiteCookies As CookieContainer
    ' Internet Explorer 11 User Agent
    Public Property UserAgent As String = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
    ' Timeout in milliseconds
    Public Property Timeout As Integer = 30000

    Public ReadOnly Property ContentSize As Long
        Get
            Return m_ContentSize
        End Get
    End Property

    Public ReadOnly Property ResponseUri As Uri
        Get
            Return m_ResponseUri
        End Get
    End Property

    Public ReadOnly Property StatusCode As Integer
        Get
            Return m_StatusCode
        End Get
    End Property

    Public ReadOnly Property StatusDescription As String
        Get
            Return m_StatusDescription
        End Get
    End Property

    Public ReadOnly Property WebException As Integer
        Get
            Return m_WebException
        End Get
    End Property

    Sub New()
        SiteCookies = New CookieContainer()
    End Sub

    Public Function GetSiteResponse(ByVal siteUri As Uri) As String
        Dim response As String = String.Empty
        ServicePointManager.DefaultConnectionLimit = 50
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12
        ServicePointManager.ServerCertificateValidationCallback = AddressOf TlsValidationCallback

        Dim Http As HttpWebRequest = WebRequest.CreateHttp(siteUri.ToString)
        With Http
            .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .AllowAutoRedirect = True
            .AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
            .CookieContainer = Me.SiteCookies
            .Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
            .Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.7")
            .Headers.Add(HttpRequestHeader.CacheControl, "no-cache")
            ' Default
            .KeepAlive = True
            .MaximumAutomaticRedirections = 50
            .ServicePoint.Expect100Continue = False
            .ServicePoint.MaxIdleTime = Me.Timeout
            .Timeout = Me.Timeout
            .UserAgent = Me.UserAgent
        End With

        Try
            Using webResponse As HttpWebResponse = DirectCast(Http.GetResponse, HttpWebResponse)
                Me.m_ResponseUri = webResponse.ResponseUri
                Me.m_StatusCode = webResponse.StatusCode
                Me.m_StatusDescription = webResponse.StatusDescription
                Dim contentLength As String = webResponse.Headers.Get("Content-Length")
                Me.m_ContentSize = If(String.IsNullOrEmpty(contentLength), 0, Convert.ToInt64(contentLength))

                Using responseStream As Stream = webResponse.GetResponseStream()
                    If webResponse.StatusCode = HttpStatusCode.OK Then
                        Dim reader As StreamReader = New StreamReader(responseStream, Encoding.Default)
                        Me.m_ContentSize = webResponse.ContentLength
                        response = reader.ReadToEnd()
                        ' ContentLength is -1 when the server does not report it
                        Me.m_ContentSize = If(Me.m_ContentSize = -1, response.Length, Me.m_ContentSize)
                    End If
                End Using
            End Using
        Catch exW As WebException
            If exW.Response IsNot Nothing Then
                Me.m_StatusCode = CType(exW.Response, HttpWebResponse).StatusCode
            End If
            Me.m_StatusDescription = "WebException: " & exW.Message
            Me.m_WebException = exW.Status
        End Try
        Return response
    End Function

    Private Function TlsValidationCallback(sender As Object, CACert As X509Certificate, CAChain As X509Chain, policyErrors As SslPolicyErrors) As Boolean
        ' If you trust the (known) Server, you could just return True
        If policyErrors = SslPolicyErrors.None Then Return True

        ' Tolerates an untrusted root; any other chain error rejects the certificate
        Dim Certificate As New X509Certificate2(CACert)
        CAChain.Build(Certificate)
        For Each CACStatus As X509ChainStatus In CAChain.ChainStatus
            If (CACStatus.Status <> X509ChainStatusFlags.NoError) And
               (CACStatus.Status <> X509ChainStatusFlags.UntrustedRoot) Then
                Return False
            End If
        Next
        Return True
    End Function
End Class
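
To connect this back to the question, here is one way the returned string might be handed to Html Agility Pack. It is a sketch under a few assumptions: the HtmlAgilityPack NuGet package is referenced, the WebRequestHelper class above is in the same project, and the URL and the XPath query are placeholders chosen for illustration.

```vbnet
Option Strict On
Option Explicit On

Imports System
Imports System.Net
Imports HtmlAgilityPack

Module Program

    Sub Main()
        Dim helper As New WebRequestHelper()
        ' Placeholder address; replace it with the page you need.
        Dim html As String = helper.GetSiteResponse(New Uri("https://example.com/"))

        If helper.StatusCode = CInt(HttpStatusCode.OK) Then
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)

            ' SelectNodes returns Nothing when the query matches no node.
            Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
            If links IsNot Nothing Then
                For Each link As HtmlNode In links
                    Console.WriteLine(link.GetAttributeValue("href", String.Empty))
                Next
            End If
        Else
            Console.WriteLine("Request failed: " & helper.StatusDescription)
        End If
    End Sub

End Module
```

Checking StatusCode before parsing avoids feeding an empty string to the parser when the request fails, since GetSiteResponse returns String.Empty in that case.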