Charset Detect Stream Reader
The CharsetDetectStreamReader class takes in a byte stream
(InputStream), automatically detects the character encoding of
the byte stream, and turns it into a character stream (Reader)
using that encoding. If the detected character encoding is UTF-8, UTF-16, or
UTF-32 and the byte stream begins with a byte-order mark (BOM), the BOM is
removed from the byte stream and does not appear in the character stream. The
CharsetDetectStreamReader class can be used in place of the
InputStreamReader class provided in the Java API.
To detect the character encoding of the byte stream,
CharsetDetectStreamReader applies the following methods in the order
shown until one of them succeeds.
- Read the first few bytes of the byte stream and attempt to detect a BOM that uniquely identifies the UTF-8, UTF-16, or UTF-32 character encoding.
- Read a small amount of data from the byte stream and attempt to detect the XML declaration <?xml ... ?> together with a possible character encoding. If an XML declaration is detected and it contains encoding information, that encoding is returned, provided that it can be verified to be reasonably correct.
- Use the CharsetDetector class of the International Components for Unicode for Java (ICU4J) package to detect the most likely character encoding of the byte stream, by reading a small amount of data from the byte stream.
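The first two steps can be illustrated with a minimal sketch that uses only the standard Java API. It is not the package's actual code: the class name EncodingSniffer, its method names, and the regular expression are all assumptions made for illustration. The sketch also leaves out the verification of the declared encoding and the ICU4J fallback, and its XML check only works when the declaration is stored in an ASCII-compatible encoding.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the package's actual code. */
final class EncodingSniffer {
    // Encoding names and their BOM signatures, in match order.
    // UTF-32LE (FF FE 00 00) must be tried before UTF-16LE (FF FE),
    // because the shorter signature is a prefix of the longer one.
    private static final String[] NAMES =
        {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    // An XML declaration at the very start of the data, capturing the
    // quoted value of its encoding attribute in group 2.
    private static final Pattern XML_DECL = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1[^>]*\\?>");

    /** Step 1: returns the encoding named by a leading BOM, or null. */
    static String fromBom(byte[] head) {
        for (int i = 0; i < BOMS.length; i++) {
            if (startsWith(head, BOMS[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    /** Step 2: returns the encoding named in an XML declaration, or null. */
    static String fromXmlDeclaration(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this
        // decoding never fails and leaves ASCII text readable.
        String text = new String(head, Charset.forName("ISO-8859-1"));
        Matcher m = XML_DECL.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    /** Applies step 1, then step 2; returns null if neither succeeds. */
    static String detect(byte[] head) {
        String name = fromBom(head);
        return name != null ? name : fromXmlDeclaration(head);
    }

    private static boolean startsWith(byte[] data, int[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Here head stands for the first bytes read from the stream. A caller that gets null from detect would move on to the third method. Note that the bytes FF FE 00 00 could also be a UTF-16 little-endian BOM followed by a NUL character, which is one reason the approach remains a heuristic.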
The above algorithm is by necessity a heuristic and is not guaranteed to always detect the correct encoding.
This class was written to support the search engine wrapper package, and partly in response to Sun Java developers closing bug report 4508058 with a "Will Not Fix".
The charset detect stream reader is open source and licensed under the GNU General Public License, either version 3 or (at your option) any later version. Sun Java 1.6 is required to use the package.
Status
This package is under general release.
Downloads
Release: 2009 September 30
Added a new character encoding detection method, which attempts to find the
XML declaration <?xml ... ?> as well as its encoding
information where available.
Release: 2009 July 17
Fixed a bug where an UnsupportedEncodingException is thrown for
particular input streams, especially those from very small input files, when
ICU4J returned an encoding that is unsupported in Sun Java.
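One way to guard against this kind of failure is to check a detected encoding name against the charsets the running JVM supports before using it. The sketch below shows that idea with the standard java.nio.charset.Charset API. It is an assumption about how such a guard could look, not a description of the fix in this release, and the class name CharsetFallback and its resolve method are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

/** Hypothetical helper for illustration; not the package's actual code. */
final class CharsetFallback {
    /**
     * Returns the charset for detectedName if this JVM supports it,
     * otherwise the given fallback charset.
     */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            if (Charset.isSupported(detectedName)) {
                return Charset.forName(detectedName);
            }
        } catch (IllegalCharsetNameException e) {
            // The name is not even syntactically legal; use the fallback.
        }
        return fallback;
    }
}
```

A caller might pass the name reported by a detector together with a default such as Charset.forName("ISO-8859-1"), so that an unsupported name yields a usable charset instead of an exception.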
Release: 2009 June 25
This is the initial release of the charset detect stream reader.