Charset Detect Stream Reader

The charset detect stream reader is a Java Reader class that takes in a byte stream (InputStream), automatically detects the most likely character encoding of the byte stream, and turns it into a character stream (Reader) using that encoding. The CharsetDetectStreamReader class can be used in place of the InputStreamReader class provided in the Java API.

The default character encoding detector is a sequential combination of several other detectors. They are applied to the input byte stream in the following order until one of them is able to detect a character encoding.

  1. The BOM detector attempts to detect the BOM that uniquely identifies the UTF-8, UTF-16, or UTF-32 character encodings.
  2. The XML detector attempts to detect the XML declaration <?xml ... ?>, which may contain a character encoding.
  3. The HTML detector attempts to detect the HTML <meta ... > tag, which may contain a character encoding.
  4. The ASCII detector reads in a small amount of data from the input byte stream, and checks that the bytes read are all 7-bit US-ASCII characters.
  5. The ICU detector uses the International Components for Unicode for Java (ICU4J) package to detect the most likely character encoding.
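
The sequential combination described above can be sketched as follows. This is a minimal illustration rather than the package's actual code: the Detector interface, the class names, and the detectFirst and defaultChain methods are all hypothetical, and only simplified stand-ins for the XML and ASCII detectors are shown. Each detector returns null when it cannot decide, and the chain returns the first non-null answer.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequentialDetectorSketch {

    /** Hypothetical contract: return a charset, or null if this detector cannot decide. */
    public interface Detector {
        Charset detect(byte[] prefix);
    }

    /** Simplified XML detector: reads encoding="..." from a leading XML declaration. */
    public static class XmlDeclarationDetector implements Detector {
        private static final Pattern DECL = Pattern.compile(
                "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9][A-Za-z0-9._-]*)[\"']");

        public Charset detect(byte[] prefix) {
            // Assumes an ASCII-compatible encoding; ISO-8859-1 maps each byte to one char.
            String head = new String(prefix, Charset.forName("ISO-8859-1"));
            Matcher m = DECL.matcher(head);
            if (m.find() && Charset.isSupported(m.group(1))) {
                return Charset.forName(m.group(1));
            }
            return null;
        }
    }

    /** Simplified ASCII detector: succeeds only if every byte is 7-bit. */
    public static class AsciiDetector implements Detector {
        public Charset detect(byte[] prefix) {
            for (byte b : prefix) {
                if (b < 0) { // high bit set, so not 7-bit US-ASCII
                    return null;
                }
            }
            return Charset.forName("US-ASCII");
        }
    }

    /** Hypothetical two-detector chain, in the same relative order as the list above. */
    public static List<Detector> defaultChain() {
        return Arrays.<Detector>asList(new XmlDeclarationDetector(), new AsciiDetector());
    }

    /** Applies the detectors in order and returns the first result, or null if none decides. */
    public static Charset detectFirst(List<Detector> detectors, byte[] prefix) {
        for (Detector d : detectors) {
            Charset cs = d.detect(prefix);
            if (cs != null) {
                return cs;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>".getBytes(ascii);
        byte[] plain = "hello".getBytes(ascii);
        byte[] other = "h\u00e9llo".getBytes(Charset.forName("UTF-8"));

        System.out.println(detectFirst(defaultChain(), xml));   // ISO-8859-1
        System.out.println(detectFirst(defaultChain(), plain)); // US-ASCII
        System.out.println(detectFirst(defaultChain(), other)); // null
    }
}
```

In this sketch a null result means that a later detector in the chain would have to decide; in the real package that role falls to the ICU detector.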

If the BOM detector is applied and a Unicode byte-order mark (BOM) is found, the BOM is removed from the byte stream and does not appear in the character stream. Removing the BOM from the byte stream also works around a bug in Java that was closed as "Will Not Fix".
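
BOM removal can be illustrated with standard Java classes only. The sketch below is an assumption about how such a step might look, not the package's implementation: the class BomStripSketch and its methods are hypothetical, and only the UTF-8 and UTF-16 BOMs are handled (UTF-32 is omitted for brevity, so a UTF-32LE stream would be misread as UTF-16LE here). It reads up to three bytes, picks a charset if a BOM matches, and pushes back only the bytes that follow the BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;

public class BomStripSketch {

    /** Opens a reader, consuming a UTF-8 or UTF-16 BOM if present; otherwise uses the fallback. */
    public static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pin, head);

        Charset cs = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            cs = Charset.forName("UTF-8");
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = Charset.forName("UTF-16BE");
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = Charset.forName("UTF-16LE");
            bomLength = 2;
        }

        // Return everything after the BOM to the stream; the BOM itself is dropped.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, cs);
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    /** Drains a reader into a string. */
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Convenience wrapper for in-memory data. */
    public static String decode(byte[] data, Charset fallback) {
        try {
            return readAll(open(new ByteArrayInputStream(data), fallback));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        Charset utf8 = Charset.forName("UTF-8");

        // A plain InputStreamReader keeps the BOM as U+FEFF at the start of the text.
        String raw = readAll(new InputStreamReader(new ByteArrayInputStream(data), utf8));
        System.out.println(raw.length());                // 3
        System.out.println(decode(data, utf8).length()); // 2
    }
}
```

The main method contrasts the two behaviours: with a plain InputStreamReader and UTF-8, the BOM survives as a leading U+FEFF character, whereas the sketch drops it before decoding.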

Please note that character encoding detection is by necessity a heuristic and is not guaranteed to always detect the correct encoding.

This class was written to support the search engine wrapper package.

Oracle Java 1.6 is required to use this package.

License

The charset detect stream reader is open source. Releases made on or after 2013 April 23 are licensed under the Apache License, Version 2.0. Releases made before 2013 April 23 are licensed under the GNU General Public License, either Version 3 or (at your option) any later version.

Status

This package is under general release.

Downloads

Release: 2013 May 27

This release incorporates the following changes:

  1. Added a new HTML detector.
  2. Added a new combined detector that combines one or more other detectors in a sequential manner. The default list of detectors to be combined is given in the introduction above.
  3. Users can now specify which detector to use for the charset detect stream reader.
  4. This release includes an updated component: ICU4J 51.2.

Release: 2013 April 23

This release incorporates the following changes:

  1. Changed the license to the Apache License, Version 2.0.
  2. Fixed a bug in one of the CharsetDetectStreamReader constructors.
  3. This release includes an updated component: ICU4J 51.1.

Development Snapshot

A development snapshot can be obtained from GitHub.

Development snapshots are unstable, may contain bugs, and might not even compile. It is highly recommended to download the latest release instead.