
Charset Detect Stream Reader

The CharsetDetectStreamReader class takes a byte stream (InputStream), automatically detects its character encoding, and turns it into a character stream (Reader) using that encoding. If the detected encoding is UTF-8, UTF-16, or UTF-32 and the byte stream begins with a byte-order mark (BOM), the BOM is removed from the byte stream and does not appear in the character stream. CharsetDetectStreamReader can be used in place of the InputStreamReader class provided in the Java API.
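To illustrate why BOM removal matters, here is a minimal sketch of what the standard InputStreamReader does with a UTF-8 stream that starts with a BOM: the mark is decoded as the character U+FEFF and handed to the caller. The class name BomLeakDemo and its firstChar helper are made up for this illustration and are not part of the package. The sketch uses StandardCharsets and try-with-resources, so it assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the charset detect stream reader package.
class BomLeakDemo {

    /** Decodes the bytes as UTF-8 with a plain InputStreamReader and returns the first character read. */
    static int firstChar(byte[] data) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by the text "hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The first character is the BOM itself, not 'h'.
        System.out.println(Integer.toHexString(firstChar(withBom))); // prints "feff"
    }
}
```

A reader that strips the BOM would return 'h' as the first character for the same input.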

To detect the character encoding of the byte stream, CharsetDetectStreamReader applies the following methods in the order shown until one of them succeeds.

  1. Read the first few bytes of the byte stream and attempt to detect a BOM that uniquely identifies the UTF-8, UTF-16, or UTF-32 character encoding.
  2. Read a small amount of data from the byte stream and attempt to detect an XML declaration <?xml ... ?> together with its encoding information, if any. If an XML declaration is detected and it names an encoding, that encoding is used, provided that it can be verified to be reasonably correct.
  3. Read a small amount of data from the byte stream and use the CharsetDetector class of the International Components for Unicode for Java (ICU4J) package to detect the most likely character encoding.
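The first two steps can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the package's actual source: the class EncodingSniffer, its methods, and the regular expression are all hypothetical. It checks the four-byte UTF-32 marks before the two-byte UTF-16 marks because FF FE is a prefix of FF FE 00 00, and it only recognises XML declarations written in an ASCII-compatible encoding. The ICU4J step is left out so that the example has no external dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of steps 1 and 2; all names here are illustrative.
class EncodingSniffer {
    // Longest marks first, so UTF-32LE (FF FE 00 00) is tested before UTF-16LE (FF FE).
    private static final String[] NAMES = {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final int[][] MARKS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    // Assumed pattern: an XML declaration at the very start that carries an encoding pseudo-attribute.
    private static final Pattern XML_DECL = Pattern.compile(
        "^<\\?xml(?:\\s[^>]*?)?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Step 1: returns the charset named by a leading BOM and skips it, or returns null and consumes nothing. */
    static String detectAndSkipBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n < 0) break; // stream shorter than four bytes
            len += n;
        }
        int found = -1;
        for (int k = 0; k < MARKS.length && found < 0; k++) {
            boolean same = len >= MARKS[k].length;
            for (int i = 0; same && i < MARKS[k].length; i++) {
                same = (head[i] & 0xFF) == MARKS[k][i];
            }
            if (same) found = k;
        }
        int skip = found < 0 ? 0 : MARKS[found].length;
        in.unread(head, skip, len - skip); // push back everything after the BOM
        return found < 0 ? null : NAMES[found];
    }

    /** Step 2: returns the encoding named in an XML declaration if this JVM supports it, else null. */
    static String xmlDeclaredEncoding(byte[] head, int len) {
        // ISO-8859-1 maps every byte to one char, so decoding the prefix cannot fail.
        String text = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(text);
        return m.find() && Charset.isSupported(m.group(1)) ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The pushback buffer must hold at least the four bytes read ahead.
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 4);
        System.out.println(detectAndSkipBom(in)); // prints "UTF-8"
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(xmlDeclaredEncoding(xml, xml.length)); // prints "ISO-8859-1"
    }
}
```

Checking Charset.isSupported is only a weak stand-in for the verification mentioned in step 2; a fuller check might also confirm that the sampled bytes decode cleanly in the declared encoding.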

The above algorithm is by necessity a heuristic and is not guaranteed to always detect the correct encoding.

This class was written to support the search engine wrapper package, and partly in response to the Sun Java developers closing bug report 4508058 as "Will Not Fix".

The charset detect stream reader is open source and licensed under the GNU General Public License, either version 3 or (at your option) any later version. Sun Java 1.6 is required to use the package.

Status

This package is under general release.

Downloads

Release: 2009 September 30

Added a new character encoding detection method, which attempts to find the XML declaration <?xml ... ?> as well as its encoding information where available.

Release: 2009 July 17

Fixed a bug where an UnsupportedEncodingException was thrown for particular input streams, especially those from very small input files, when ICU4J returned an encoding that is unsupported in Sun Java.

Release: 2009 June 25

This is the initial release of the charset detect stream reader.