Get the filename from a path (URL) sorted without repeated result

  • wnvhbop
    New Member
    • Jul 2020
    • 1

    I have a large text file containing many links, and I need a Python script that extracts the names of all files ending in .pdf, sorted and without duplicates.

    sample example from the file:

    http://www.123.com/file.pdf http://www.123.com/pdfhello
    http://www.456.com/hello/one.file.pdf http://www.123.com http://www.123.com



    I need the final result to look like this:

    file.pdf
    one.file.pdf
  • dev7060
    Recognized Expert Contributor
    • Mar 2017
    • 656

    #2
    What have you done so far? There are many ways I can think of:

    - Store the part of the link after the final '/' in a string and check whether its extension is .pdf.
    - Store the whole link in a string, read it from the back, and check whether it starts with "fdp." (that is, ".pdf" reversed).
    - Search for ".pdf" in the entire file. At each occurrence, keep copying characters backward into a string until a '/' is found (since filenames can't contain a slash).
    etc...
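
    The first approach in the list above could be sketched as follows. This is only an illustration: the helper name pdf_name_from_url is hypothetical, and the case-insensitive extension check is an assumption (the sample in the question only shows lowercase .pdf).

```python
def pdf_name_from_url(url):
    """Return the file name if the URL ends in .pdf, otherwise None."""
    # Everything after the final '/' (the whole string if there is no '/').
    name = url[url.rfind('/') + 1:]
    # Assumption: treat .PDF and .pdf alike.
    if name.lower().endswith('.pdf'):
        return name
    return None


# Sample links taken from the question.
for link in ["http://www.123.com/file.pdf",
             "http://www.123.com/pdfhello",
             "http://www.456.com/hello/one.file.pdf",
             "http://www.123.com"]:
    print(link, '->', pdf_name_from_url(link))
```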

    sorted without repeated result
    Keep storing the required names in a string array. Once they are all collected, delete the duplicate elements (or check whether a value is already present in the array before inserting it), and then sort the array.
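
    A minimal sketch of the "check before insertion, then sort" idea, assuming the names have already been extracted into a list. The function name unique_sorted is made up for illustration. The membership test on a list is linear, so for very large inputs a set would probably be the better choice.

```python
def unique_sorted(names):
    """Return the names without duplicates, in ascending order."""
    result = []
    for name in names:
        # Only insert a name that is not already present.
        if name not in result:
            result.append(name)
    # Sort once, after all names have been collected.
    result.sort()
    return result


print(unique_sorted(["one.file.pdf", "file.pdf", "file.pdf"]))
```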


    • SioSio
      Contributor
      • Dec 2019
      • 272

      #3
      The only annoyance with this process is that there can be multiple URLs on one line.
      After reading the entire text, split it on whitespace (both spaces and line feeds) so that each URL becomes one element of an array.
      After that, write the code as described by dev7060.
      Code:
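      # Read the whole file; the with-block closes it automatically
      with open(r'url.txt', 'r') as f:
          text = f.read()
      # split() with no arguments splits on any run of whitespace (spaces and
      # newlines) and returns no empty strings, so nothing has to be popped
      url_list = text.split()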
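      There are three ways to extract the file name from a URL (here url is one element of url_list).
      Code:
          parts = url.split('/')   # "list" would shadow the built-in type
          fname = parts[-1]
      Code:
          fname = url.rsplit('/', 1)[-1]   # [-1] also works when there is no '/'
      Code:
          fname = url[url.rfind('/')+1:]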
      To sort, use sorted() or list.sort(); to delete duplicate elements, use set().
      Code:
      new_file_list = sorted(set(file_list))
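
      Putting the pieces of this thread together, a complete script might look roughly like the sketch below. It is only one possible arrangement: the function name extract_pdf_names is hypothetical, and the text is passed in as a string here so that the example runs without a url.txt file.

```python
def extract_pdf_names(text):
    """Return the sorted, de-duplicated .pdf file names found in text."""
    names = set()
    # Split on any whitespace so several URLs per line are handled.
    for url in text.split():
        # Part after the final '/'.
        fname = url.rsplit('/', 1)[-1]
        if fname.endswith('.pdf'):
            names.add(fname)
    # The set removes duplicates; sorted() returns an ordered list.
    return sorted(names)


# Sample data from the question.
sample = """http://www.123.com/file.pdf http://www.123.com/pdfhello
http://www.456.com/hello/one.file.pdf http://www.123.com http://www.123.com
"""

for name in extract_pdf_names(sample):
    print(name)
```

      For the sample in the question this prints file.pdf and one.file.pdf, one per line. To work on the real file, read it first and pass its contents to the function.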
