is_ascii() or is_binary() for files?

  • Brad

    is_ascii() or is_binary() for files?

    Is there a way to determine whether a file is plain ascii text or not
    using standard C++?
  • osmium

    #2
    Re: is_ascii() or is_binary() for files?

    "Brad" wrote:
    > Is there a way to determine whether a file is plain ascii text or not
    > using standard C++?

    No. It's in the eye of the beholder. You can make a very good guess by
    counting control characters that wouldn't likely be in text. But
    the possibility exists that a binary file might not have any of them either.
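
A minimal sketch of that counting heuristic, assuming the file's bytes have already been read into a std::string. The helper names and the 1% threshold are illustrative assumptions, not something from the thread; tab, newline, and carriage return are counted as ordinary text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: fraction of bytes in `data` that are control
// characters unlikely to appear in text (everything below 0x20 except
// tab, LF and CR, plus DEL).
double control_char_ratio(const std::string& data)
{
    if (data.empty()) {
        return 0.0;
    }
    std::size_t suspicious = 0;
    for (unsigned char c : data) {
        const bool common_whitespace = (c == '\t' || c == '\n' || c == '\r');
        if ((c < 0x20 && !common_whitespace) || c == 0x7F) {
            ++suspicious;
        }
    }
    return static_cast<double>(suspicious) / static_cast<double>(data.size());
}

// Hypothetical threshold: call the buffer "probably text" when at most
// max_ratio of its bytes are suspicious. The default of 1% is a guess.
bool probably_text(const std::string& data, double max_ratio = 0.01)
{
    return control_char_ratio(data) <= max_ratio;
}
```

As the post says, this is only a guess: a binary file with no control characters in it would still pass.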



    • Sherman Pendley

      #3
      Re: is_ascii() or is_binary() for files?

      Brad <brad@16systems.com> writes:
      > Is there a way to determine whether a file is plain ascii text or not
      > using standard C++?

      Sure, just read its contents and look for any byte that's > 127. If
      you find one, the file's contents are not plain ASCII.
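
A minimal sketch of that scan using only standard streams. The function names are illustrative, and treating a file that cannot be opened as "not ASCII" is an assumption made here for simplicity.

```cpp
#include <fstream>
#include <istream>
#include <string>

// Returns true if every byte read from `in` lies in the 7-bit range 0..127.
bool stream_is_7bit(std::istream& in)
{
    char ch;
    while (in.get(ch)) {
        // Convert to unsigned char before comparing: plain char may be
        // signed, in which case it can never compare greater than 127.
        if (static_cast<unsigned char>(ch) > 127) {
            return false;
        }
    }
    return true;
}

// Convenience wrapper. Assumption: an unreadable file counts as "not ASCII".
bool file_is_7bit(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return in && stream_is_7bit(in);
}
```

Opening the file in binary mode keeps the stream from translating line endings, so every byte is inspected as stored.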

      sherm--

      --
      My blog: http://shermspace.blogspot.com
      Cocoa programming in Perl: http://camelbones.sourceforge.net


      • Medvedev

        #4
        Re: is_ascii() or is_binary() for files?

        On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
        > Brad <b...@16systems.com> writes:
        >> Is there a way to determine whether a file is plain ascii text or not
        >> using standard C++?
        >
        > Sure, just read its contents and look for any byte that's > 127. If
        > you find one, the file's contents are not plain ASCII.

        If he tries this test on a text file that contains non-English text, it
        will fail, because non-English characters are > 127.


        • red floyd

          #5
          Re: is_ascii() or is_binary() for files?

          Medvedev wrote:
          > On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
          >> Brad <b...@16systems.com> writes:
          >>> Is there a way to determine whether a file is plain ascii text or not
          >>> using standard C++?
          >> Sure, just read its contents and look for any byte that's > 127. If
          >> you find one, the file's contents are not plain ASCII.
          >
          > If he tries this test on a text file that contains non-English text, it
          > will fail, because non-English characters are > 127.

          OP specified ASCII, not non-English text.


          • Medvedev

            #6
            Re: is_ascii() or is_binary() for files?

            On Jul 5, 11:45 am, Medvedev <3D.v.Wo...@gmail.com> wrote:
            > On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
            >> Brad <b...@16systems.com> writes:
            >>> Is there a way to determine whether a file is plain ascii text or not
            >>> using standard C++?
            >>
            >> Sure, just read its contents and look for any byte that's > 127. If
            >> you find one, the file's contents are not plain ASCII.
            >
            > If he tries this test on a text file that contains non-English text, it
            > will fail, because non-English characters are > 127.

            Sorry man, you are right.
            I found that non-English characters are represented by negative values,
            and a binary file is one whose bytes MAY BE > 127,
            since a byte can hold any of 256 bit patterns.

            source:


            • Sherman Pendley

              #7
              Re: is_ascii() or is_binary() for files?

              Medvedev <3D.v.World@gmail.com> writes:
              > On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
              >> Brad <b...@16systems.com> writes:
              >>> Is there a way to determine whether a file is plain ascii text or not
              >>> using standard C++?
              >>
              >> Sure, just read its contents and look for any byte that's > 127. If
              >> you find one, the file's contents are not plain ASCII.
              >
              > If he tries this test on a text file that contains non-English text, it
              > will fail,

              Exactly as it should.

              > because non-English characters are > 127.

              In other words, they're not plain ASCII. :-)

              sherm--


              • James Kanze

                #8
                Re: is_ascii() or is_binary() for files?

                On Jul 5, 9:45 pm, Medvedev <3D.v.Wo...@gmail.com> wrote:
                > On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
                >> Brad <b...@16systems.com> writes:
                >>> Is there a way to determine whether a file is plain ascii text or not
                >>> using standard C++?
                >> Sure, just read its contents and look for any byte that's > 127. If
                >> you find one, the file's contents are not plain ASCII.
                > If he tries this test on a text file that contains non-English
                > text, it will fail, because non-English characters are > 127.

                ASCII is a seven bit code, so no characters are greater than
                127 in it.

                Of course, just because you don't find any characters greater
                than 127 doesn't mean that it is ASCII. It could still be ISO
                8859-1, or UTF-8, in which, by chance, none of the characters
                happen to be greater than 127. (Or it could be that plain char
                is signed on your machine, in which case, it can't contain a
                value greater than 127, regardless of the encoding :-).)
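
The signed-char pitfall can be sidestepped by never comparing a plain char against 127 directly. A small sketch, with hypothetical names, of a test that behaves the same whether plain char is signed or unsigned:

```cpp
#include <limits>

// True on implementations where plain char is signed. On those, an
// expression like `c > 127` with `char c` is always false.
constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;

// Portable check: convert to unsigned char first, then look at the top bit.
constexpr bool has_high_bit(char c)
{
    return (static_cast<unsigned char>(c) & 0x80u) != 0;
}
```

With this, a byte such as 0xE9 (é in ISO 8859-1) is reported as outside the 7-bit range regardless of the signedness of char.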

                --
                James Kanze (GABI Software) email:james.kanze@gmail.com
                Conseils en informatique orientée objet/
                Beratung in objektorientierter Datenverarbeitung
                9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


                • Brad

                  #9
                  Re: is_ascii() or is_binary() for files?

                  Stefan Ram wrote:
                  > Brad <brad@16systems.com> writes:
                  >> Is there a way to determine whether a file is plain ascii text
                  >> or not using standard C++?
                  >
                  > If someone can define in words when a file is deemed to be
                  > »a plain ascii text« without ambiguity and for each possible
                  > file, I am sure that then this newsgroup will be able to
                  > help to implement a test for it in C++.
                  ...

                  Thanks for all the responses. The program recurses through a directory
                  processing files. I do not know beforehand what type of files the
                  program may encounter. The processing is simply reading the file and
                  passing its content to a regular expression to search for certain strings.

                  Binary files cause problems, so I thought if I could just skip them and
                  only read ASCII and perhaps UTF-8 encoded files, things would be better.
                  That led to my initial question. Later I could learn how to deal with
                  binary files that I may want to search like PDF and MS Office documents.
                  Just curious if standard C++ had some built-in function that made this easy.
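
For readers on a newer toolchain: C++17 added std::filesystem, which did not exist when this thread was written. Assuming a C++17 compiler, the directory walk could be sketched roughly as below. The function names are hypothetical, the regular-expression search is left out, and the default predicate only samples the first 512 bytes, so it would also skip UTF-8 files that contain non-ASCII characters near the start.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical predicate: accepts a file when its first 512 bytes contain
// no NUL byte and no byte above 127. An unreadable file is rejected.
bool sample_is_7bit_without_nul(const fs::path& file)
{
    std::ifstream in(file, std::ios::binary);
    if (!in) {
        return false;
    }
    char buffer[512];
    in.read(buffer, static_cast<std::streamsize>(sizeof buffer));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    for (std::size_t i = 0; i < got; ++i) {
        const unsigned char b = static_cast<unsigned char>(buffer[i]);
        if (b == 0 || b > 127) {
            return false;
        }
    }
    return true;
}

// Walks `root` recursively and returns the regular files that `accept`
// lets through. Directories that cannot be entered are skipped.
std::vector<fs::path> collect_candidate_files(
    const fs::path& root,
    const std::function<bool(const fs::path&)>& accept = sample_is_7bit_without_nul)
{
    std::vector<fs::path> result;
    const auto options = fs::directory_options::skip_permission_denied;
    for (const fs::directory_entry& entry :
         fs::recursive_directory_iterator(root, options)) {
        if (entry.is_regular_file() && accept(entry.path())) {
            result.push_back(entry.path());
        }
    }
    return result;
}
```

Passing the predicate as a parameter keeps the walk separate from whichever "is this text?" guess turns out to work best for the files at hand.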

                  Thanks again,

                  Brad


                  • Sam

                    #10
                    Re: is_ascii() or is_binary() for files?

                    Brad writes:
                    > That led to my initial question. Later I could learn how to deal with
                    > binary files that I may want to search like PDF and MS Office documents.
                    > Just curious if standard C++ had some built-in function that made this easy.

                    No. The only 'built-in' function of any kind is one to test if a single
                    character belongs in a given character class: isascii() and its equivalents.
                    It's up to you to scan the entire contents of the file, to classify it.

                    In POSIX, you might be able to get away with opening a file, stat()ing it
                    to get the file's size, mmap-ing the file into memory, then using
                    std::find_if() to search for non-ascii bytes. Of course, if you hit a 4 GB
                    file, that might cause ...problems.
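
A rough sketch of that POSIX approach, assuming a POSIX system. The function name and the three-way int result are illustrative choices, fstat() on the open descriptor is used in place of stat(), and nothing here addresses the large-file concern on platforms where the whole file cannot be mapped at once.

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns 1 if every byte of the file is <= 127, 0 if some byte is > 127,
// and -1 if the file cannot be opened, inspected, or mapped.
int mapped_file_is_7bit(const char* path)
{
    const int fd = ::open(path, O_RDONLY);
    if (fd == -1) {
        return -1;
    }

    struct stat st;
    if (::fstat(fd, &st) == -1 || !S_ISREG(st.st_mode)) {
        ::close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        ::close(fd);
        return 1;  // mmap rejects a zero length; an empty file has no bad bytes
    }

    // Note: on a 32-bit system this cast can truncate for very large files.
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) {
        return -1;
    }

    const unsigned char* first = static_cast<const unsigned char*>(addr);
    const unsigned char* last = first + size;
    const bool clean =
        std::find_if(first, last, [](unsigned char b) { return b > 127; }) == last;

    ::munmap(addr, size);
    return clean ? 1 : 0;
}
```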



                    • Erik Wikström

                      #11
                      Re: is_ascii() or is_binary() for files?

                      On 2008-07-06 02:48, Brad wrote:
                      > Stefan Ram wrote:
                      >> Brad <brad@16systems.com> writes:
                      >>> Is there a way to determine whether a file is plain ascii text
                      >>> or not using standard C++?
                      >>
                      >> If someone can define in words when a file is deemed to be
                      >> »a plain ascii text« without ambiguity and for each possible
                      >> file, I am sure that then this newsgroup will be able to
                      >> help to implement a test for it in C++.
                      > ...
                      >
                      > Thanks for all the responses. The program recurses through a directory
                      > processing files. I do not know beforehand what type of files the
                      > program may encounter. The processing is simply reading the file and
                      > passing its content to a regular expression to search for certain strings.
                      >
                      > Binary files cause problems, so I thought if I could just skip them and
                      > only read ASCII and perhaps UTF-8 encoded files, things would be better.
                      > That led to my initial question. Later I could learn how to deal with
                      > binary files that I may want to search like PDF and MS Office documents.
                      > Just curious if standard C++ had some built-in function that made this easy.

                      The simplest way to solve your problem is probably to impose some
                      additional constraints, such as requiring that text files have a name
                      ending with ".txt" or that you only guarantee correct operation if there
                      are no non-ASCII files in the directory.

                      If you are running on a POSIX system you can also use the 'file' program
                      which tries to figure out what kind of contents a file has.

                      --
                      Erik Wikström


                      • James Kanze

                        #12
                        Re: is_ascii() or is_binary() for files?

                        On Jul 6, 3:52 am, Sam <s...@email-scan.com> wrote:
                        > Brad writes:
                        >> That led to my initial question. Later I could learn how to
                        >> deal with binary files that I may want to search like PDF
                        >> and MS Office documents. Just curious if standard C++ had
                        >> some built-in function that made this easy.
                        >
                        > No. The only 'built-in' function of any kind is one to test if
                        > a single character belongs in a given character class:
                        > isascii() and its equivalents. It's up to you to scan the
                        > entire contents of the file, to classify it.

                        There is no isascii function, and the other isxxx functions are
                        locale dependent (and don't really work for narrow characters
                        anyway). There are heuristics for "guessing" the type of
                        contents of a file, but they're just that, heuristics, and none
                        are 100% certain.

                        Most systems have various conventions which may reveal the type,
                        but those are also just conventions, and individual files may
                        actually violate them: you can give a text file a name ending
                        with .exe under Windows, and there's nothing to prevent a binary
                        file from starting with something that looks like
                        "<!DOCTYPE..." on any system.

                        > In POSIX, you might be able to get away with opening a file,
                        > stat()ing its contents, to get the file's size, mmap-ing the
                        > file into memory, then using std::find_if() to search for
                        > non-ascii bytes. Of course, if you hit a 4gb file, that might
                        > cause ...problems.

                        Under most Unix systems, you'd probably read the first N bytes
                        (maybe 512, although that's a lot more than would typically be
                        necessary), and then exploit magic. For that matter,
                        *generally*, reading the first 512 bytes, then looking for
                        characters outside the set 0x07-0x0D and 0x20-0x7E, is probably
                        a pretty good heuristic; the probability of your guessing wrong
                        is pretty slim (but of course, it will treat non-ascii text
                        files as binary).
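
A minimal sketch of that 512-byte heuristic, using the byte ranges given above. The function names are illustrative, and two choices are assumptions made here: an empty file is reported as text, and a file that cannot be opened is reported as not text.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// True for bytes in the ranges 0x07-0x0D and 0x20-0x7E.
bool is_text_byte(unsigned char b)
{
    return (b >= 0x07 && b <= 0x0D) || (b >= 0x20 && b <= 0x7E);
}

// Reads up to `sample_size` bytes from the start of the file and reports
// whether all of them are text bytes. Only the prefix is examined.
bool prefix_looks_like_ascii_text(const std::string& path,
                                  std::size_t sample_size = 512)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // assumption: unreadable files are treated as non-text
    }

    std::string buffer(sample_size, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
    const std::size_t got = static_cast<std::size_t>(in.gcount());

    for (std::size_t i = 0; i < got; ++i) {
        if (!is_text_byte(static_cast<unsigned char>(buffer[i]))) {
            return false;
        }
    }
    return true;
}
```

Because only the prefix is sampled, a file whose first 512 bytes are clean but which contains binary data later on is still classified as text; raising sample_size trades speed for fewer such misses.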


                        • James Kanze

                          #13
                          Re: is_ascii() or is_binary() for files?

                          On Jul 6, 11:18 am, Erik Wikström <Erik-wikst...@telia.com> wrote:
                          > On 2008-07-06 02:48, Brad wrote:
                          >
                          > If you are running on a POSIX system you can also use the
                          > 'file' program which tries to figure out what kind of contents
                          > a file has.

                          Note that the information output by file is not guaranteed to be
                          correct (except in specific cases: the file doesn't exist, isn't
                          a regular file, or is empty). (On the other hand, it also works
                          under Windows, if you've installed it correctly.)


                          • Juha Nieminen

                            #14
                            Re: is_ascii() or is_binary() for files?

                            Sherman Pendley wrote:
                            > Sure, just read its contents and look for any byte that's > 127. If
                            > you find one, the file's contents are not plain ASCII.

                            Actually there are certain characters with values < 32 which can be a
                            sign of a non-ASCII file if present, 0 being the most prominent one.


                            • James Kanze

                              #15
                              Re: is_ascii() or is_binary() for files?

                              On Jul 6, 4:58 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
                              > Sherman Pendley wrote:
                              >> Sure, just read its contents and look for any byte that's
                              >> > 127. If you find one, the file's contents are not plain
                              >> ASCII.
                              > Actually there are certain characters with values < 32 which
                              > can be a sign of non-ascii file if present, 0 being the most
                              > prominent one.

                              Technically, 0 is the encoding of the character nul in ASCII.
                              ASCII defines "characters" for all encodings in the range 0-127.

                              Practically, I don't think he really means ASCII per se, but
                              rather text encoded using ASCII. Or rather files that can be
                              interpreted as such---it's been years since I've seen a file
                              encoded as "ASCII" (but a lot of files created as ISO 8859-1 or
                              UTF-8 can probably be read as ASCII, if the file only contains
                              characters from the basic character set).
