I have read the question "Why do you need to encode URLs?" but I am still confused:
This seems like it should be simple, but I'm having trouble figuring it out. I want to send a wget request similar to the following,
but I don't want it to URL-encode the quotes (or anything else, for that matter). Is this possible?
And just to make it clear, I am a) dealing with a server that doesn't URL-decode -- nothing I can do about that -- and b) aware that I can do this with a Python script, or Burp, or many other things. I just need to know if wget can do it.
I am having an issue with BS4/Python 2.7.12 reading links and files that were already URL-encoded when I downloaded them with wget to archive my Drupal website.
For example, a link that exists on the live website would be:
https://mywebsite.org/content/prime's-and-"doubleprimes"-in-it (I know this is incorrect grammar because the 's example is possessive not plural)
The downloaded file would be:
(This is helpful in identifying different typography: http://www.w3schools.com/TAGS/ref_urlencode.asp)
My script loops through each file and flattens the site by adding ".html" to all links. However, in using BS4 to do this, it is actually changing the link path because it seems to try to re-interpret the already URL-encoded links. So as a result it would change the above link to:
And thus it wouldn't work. You can see the %25 it uses to encode the % sign at the start of %E2, for example.
There have been many questions about encoding with BS4, but most of them deal specifically with utf-8. I understand that BS4 will automatically read the "soup" into utf-8, but I'm unsure why it is trying to re-URL-encode links that are already encoded. I have tried soup = BeautifulSoup(text.read().decode('utf-8','ignore')) as suggested here, which fixed an issue where BS4 was trying to interpret %E2 as a unicode character. However, I haven't seen anything about re-encoding of already-URL-encoded characters. I have also tried adding formatter="html" to my soup.prettify command, but this did not work either, as the files had already been read and interpreted at that point.
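The double-encoding symptom described above can be reproduced without BS4. This is only a minimal sketch of the effect, not a diagnosis of what BS4 does internally: it assumes Python 3's urllib.parse (in Python 2.7 the equivalent functions are urllib.quote and urllib.unquote), and the path segment is a made-up example rather than a link from the site.

```python
from urllib.parse import quote, unquote

# Hypothetical already-encoded path segment: a right single quote (U+2019)
# encoded as its UTF-8 bytes E2 80 99.
encoded = "prime%E2%80%99s"

# Encoding a second time escapes every '%' as '%25', which corrupts the link.
double_encoded = quote(encoded)

# Decoding first and then encoding exactly once returns the original link.
round_tripped = quote(unquote(encoded))

print(double_encoded)
print(round_tripped)
```

If the links pass through any step that percent-encodes them again, decoding them once beforehand (or skipping that step for hrefs) is one way to keep them stable.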
I'm using the Wikimedia Commons downloader tool to get some pictures from Wikimedia (https://pypi.python.org/pypi/CommonsDownloader). It's working fine, except that some of the pictures are called "pictureName.x-www-form-urlencoded". Any idea how I can turn them back into .png or .jpg files? Thanks!
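One possible approach, assuming the downloaded files really are intact images that only received the wrong extension, is to look at the first few bytes of each file and rename it accordingly. The helper names guess_extension and fix_extension below are hypothetical and are not part of CommonsDownloader; only PNG and JPEG signatures are checked.

```python
import os

# Leading "magic" bytes for the two formats mentioned in the question.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}


def guess_extension(header):
    """Return '.png' or '.jpg' if the header bytes match, else None."""
    for magic, ext in SIGNATURES.items():
        if header.startswith(magic):
            return ext
    return None


def fix_extension(path):
    """Rename path to use the detected extension; return the final path."""
    with open(path, "rb") as f:
        header = f.read(8)
    ext = guess_extension(header)
    if ext is None:
        return path  # unknown format: leave the file untouched
    new_path = os.path.splitext(path)[0] + ext
    os.rename(path, new_path)
    return new_path
```

Calling fix_extension("pictureName.x-www-form-urlencoded") would rename the file to pictureName.png or pictureName.jpg when a signature matches, and leave it alone otherwise.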
If I encode a string like this:
var escapedString = originalString.stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding)
it doesn't escape the slashes.
I've searched and found this Objective-C code:
NSString *encodedString = (NSString *)CFURLCreateStringByAddingPercentEscapes( NULL, (CFStringRef)unencodedString, NULL, (CFStringRef)@"!*'();:@&=+$,/?%#", kCFStringEncodingUTF8 );
Is there an easier way to encode a URL, and if not, how do I write this in Swift?
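For comparison only, the same distinction (slashes left alone versus every reserved character escaped) can be seen in Python's urllib.parse.quote. This is not the Swift API and does not answer how to write it in Swift; it is a small sketch of the behaviour being asked for, using a made-up input string.

```python
from urllib.parse import quote

# Hypothetical input containing several reserved characters.
original = "a/b c?d=e&f"

# By default '/' is treated as safe and is left unescaped.
default_escaped = quote(original)

# Passing safe="" escapes every reserved character, including '/'.
fully_escaped = quote(original, safe="")

print(default_escaped)
print(fully_escaped)
```

The second call corresponds to supplying an explicit list of characters to escape, as the CFURLCreateStringByAddingPercentEscapes call above does.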