PHP - get_headers returns "400 Bad Request" and "403 Forbidden" for valid URLs?


Working solution at the bottom of the question!

I'm running PHP 5.4 and trying to get the headers of a list of URLs.

For the most part, it is working fine, but there are three URLs causing issues (and likely more, given more extensive testing):

'http://www.alealimay.com'
'http://www.thelovelist.net'
'http://www.bleedingcool.com'

All three sites work fine in the browser and produce the following header responses:

(From Safari; screenshot of the successful headers omitted. Note that all three header responses are code = 200.)

But retrieving the headers via PHP, using get_headers...

stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
stream_context_set_default(array('http' => array('method' => "GET")));

... returns the following:

url  ......  "http://www.alealimay.com"  headers |    0  ............................  "http/1.0 400 bad request" |    content-length  ...............  "378" |    x-synthetic  ..................  "true" |    expires  ......................  "thu, 01 jan 1970 00:00:00 utc" |    pragma  .......................  "no-cache" |    cache-control  ................  "no-cache, must-revalidate" |    content-type  .................  "text/html; charset=utf-8" |    connection  ...................  "close" |    date  .........................  "wed, 24 aug 2016 01:26:21 utc" |    x-contextid  ..................  "qifb0i8v/xstfmreg" |    x-via  ........................  "1.0 echo109"    url  ......  "http://www.thelovelist.net"  headers |    0  ............................  "http/1.0 400 bad request" |    content-length  ...............  "378" |    x-synthetic  ..................  "true" |    expires  ......................  "thu, 01 jan 1970 00:00:00 utc" |    pragma  .......................  "no-cache" |    cache-control  ................  "no-cache, must-revalidate" |    content-type  .................  "text/html; charset=utf-8" |    connection  ...................  "close" |    date  .........................  "wed, 24 aug 2016 01:26:22 utc" |    x-contextid  ..................  "ankvf2rb/bimjwyjw" |    x-via  ........................  "1.0 echo103"    url  ......  "http://www.bleedingcool.com"  headers |    0  ............................  "http/1.1 403 forbidden" |    server  .......................  "sucuri/cloudproxy" |    date  .........................  "wed, 24 aug 2016 01:26:22 gmt" |    content-type  .................  "text/html" |    content-length  ...............  "5311" |    connection  ...................  "close" |    vary  .........................  "accept-encoding" |    etag  .........................  "\"57b7f28e-14bf\"" |    x-xss-protection  .............  "1; mode=block" |    x-frame-options  ..............  "sameorigin" |    x-content-type-options  .......  "nosniff" |    x-sucuri-id  ..................  "11005" 

This is the case regardless of changing the stream context:

//stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
//stream_context_set_default(array('http' => array('method' => "GET")));

produces the same result.

No warnings or errors are thrown for any of these (I normally have errors suppressed with @get_headers, but there is no difference either way).

I have checked php.ini, and I have allow_url_fopen set to On.
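For what it's worth, the relevant settings can also be checked at runtime; a quick sketch (the expected values are my assumption based on a default php.ini):

// Sanity-check the ini settings involved here.
var_dump(ini_get('allow_url_fopen'));  // expect "1" (On)
var_dump(ini_get('user_agent'));       // often "" by default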

I am eventually headed towards stream_get_meta_data, and am not interested in cURL solutions. stream_get_meta_data (and its accompanying fopen) fails in the same spot as get_headers, so fixing one will fix both in this case.
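For reference, a minimal sketch of the fopen / stream_get_meta_data route (my assumption of the standard pattern; for HTTP streams, the wrapper_data entry holds the raw response header lines):

$context = stream_context_create(array('http' => array('method' => 'HEAD')));
$fp = @fopen($url, 'r', false, $context);
if ($fp !== false) {
    $meta = stream_get_meta_data($fp);
    var_dump($meta['wrapper_data']);  // raw response headers, one line each
    fclose($fp);
}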

Usually, if there are redirects, the output looks like:

url  ......  "http://www.startingurl.com/"  headers |    0  ............................  "http/1.1 301 moved permanently" |    1  ............................  "http/1.1 200 ok" |    date |    |    "wed, 24 aug 2016 02:02:29 gmt" |    |    "wed, 24 aug 2016 02:02:32 gmt" |     |    server |    |    "apache" |    |    "apache" |     |    location  .....................  "http://finishingurl.com/" |    connection |    |    "close" |    |    "close" |     |    content-type |    |    "text/html; charset=utf-8" |    |    "text/html; charset=utf-8" |     |    link  .........................  "; rel=\"https://api.w.org/\", ; rel=shortlink" 

How come the sites work in browsers, but fail when using get_headers?

There are various posts discussing the same thing, but the solutions in them don't pertain to this case:

POST requires a Content-Length header (I'm sending a HEAD request, so no content is returned).

The URL contains UTF-8 data (the characters in these URLs are all from the Latin alphabet).

You cannot send a URL with spaces in it (these URLs are space-free, and ordinary in every way).

Solution!

(Thanks to Max in the answers below for pointing me on the right track.)

The issue is that there is no pre-defined user_agent; you have to either set one in php.ini or declare one in code.

So, change the user_agent to mimic a browser, do the deed, and then revert it to the starting value (likely blank).

$originalUserAgent = ini_get('user_agent');
ini_set('user_agent', 'Mozilla/5.0');

$headers = @get_headers($url, 1);

ini_set('user_agent', $originalUserAgent);
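An equivalent approach, if you prefer not to touch ini settings: the http stream context also accepts a user_agent option, so the fix can ride along with the default context already being set above. A sketch (the UA string is an arbitrary browser-like value):

stream_context_set_default(array(
    'http' => array(
        'method'     => 'HEAD',
        'user_agent' => 'Mozilla/5.0',  // any non-empty, browser-like UA
    ),
));
$headers = get_headers($url, 1);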

The user agent change was found here.

Answer (from Max): It happens because all three of these sites check the User-Agent header of the request and respond with an error if it doesn't match what they expect. The get_headers function does not send that header at all. You may try cURL and this code snippet for getting the content of the sites:

$url = 'http://www.alealimay.com';
$c = curl_init($url);
curl_setopt($c, CURLOPT_USERAGENT, 'curl/7.48.0');
curl_exec($c);
var_dump(curl_getinfo($c));
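If you only want the headers and no body, the snippet can be adapted like this; a sketch under the assumption that the server accepts a HEAD request:

$url = 'http://www.alealimay.com';
$c = curl_init($url);
curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($c, CURLOPT_NOBODY, true);          // issue a HEAD request
curl_setopt($c, CURLOPT_HEADER, true);          // include headers in the result
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);  // return instead of echoing
$raw = curl_exec($c);
curl_close($c);
echo $raw;  // the raw response header block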

UPD: It's not necessary to use cURL for setting the user agent header. It can be done with ini_set('user_agent', 'Mozilla/5.0'); and then the get_headers function will use the configured value.
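To tie it together: with the second argument set to 1, get_headers keeps the status line of each hop under the numeric keys 0, 1, ... (as in the redirect example above), so a small hypothetical helper can extract the final status code once the user agent is set:

// Hypothetical helper: return the status code of the last hop.
function final_status_code(array $headers)
{
    $code = 0;
    foreach ($headers as $key => $value) {
        if (is_int($key) && preg_match('#^HTTP/\S+\s+(\d{3})#i', $value, $m)) {
            $code = (int) $m[1];  // later hops overwrite earlier ones
        }
    }
    return $code;
}

// Usage:
// ini_set('user_agent', 'Mozilla/5.0');
// $headers = @get_headers('http://www.alealimay.com', 1);
// var_dump(final_status_code($headers));  // 200 once the UA is set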

