mirror of
https://github.com/qpdf/qpdf.git
synced 2024-06-01 01:40:51 +00:00
Add page position information to json
This commit is contained in:
parent
52a0b767c8
commit
76bf863aaa
|
@ -1940,178 +1940,235 @@ outfile.pdf</option>
|
|||
</chapter>
|
||||
<chapter id="ref.json">
|
||||
<title>QPDF JSON</title>
|
||||
<para>
|
||||
Beginning with qpdf version 8.3.0, the <command>qpdf</command>
|
||||
command-line program can produce a json representation of the
|
||||
non-content data in a PDF file. It includes a dump in json format
|
||||
of all objects in the PDF file excluding the content of streams.
|
||||
This json representation makes it very easy to look in detail at
|
||||
the structure of a given PDF file, and it also provides a great way
|
||||
to work with PDF files programmatically from the command-line in
|
||||
languages that can't call or link with the qpdf library directly.
|
||||
Note that stream data can be extracted from PDF files using other
|
||||
qpdf command-line options.
|
||||
</para>
|
||||
<para>
|
||||
The qpdf json representation includes a json serialization of the
|
||||
raw objects in the PDF file as well as some computed information in
|
||||
a more easily extracted format. QPDF provides some guarantees about
|
||||
its json format. These guarantees are designed to simplify the
|
||||
experience of a developer working with the JSON format.
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term>Compatibility</term>
|
||||
<sect1 id="ref.json-overview">
|
||||
<title>Overview</title>
|
||||
<para>
|
||||
Beginning with qpdf version 8.3.0, the <command>qpdf</command>
|
||||
command-line program can produce a json representation of the
|
||||
non-content data in a PDF file. It includes a dump in json format
|
||||
of all objects in the PDF file excluding the content of streams.
|
||||
This json representation makes it very easy to look in detail at
|
||||
the structure of a given PDF file, and it also provides a great way
|
||||
to work with PDF files programmatically from the command-line in
|
||||
languages that can't call or link with the qpdf library directly.
|
||||
Note that stream data can be extracted from PDF files using other
|
||||
qpdf command-line options.
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1 id="ref.json-guarantees">
|
||||
<title>JSON Guarantees</title>
|
||||
<para>
|
||||
The qpdf json representation includes a json serialization of the
|
||||
raw objects in the PDF file as well as some computed information in
|
||||
a more easily extracted format. QPDF provides some guarantees about
|
||||
its json format. These guarantees are designed to simplify the
|
||||
experience of a developer working with the JSON format.
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term>Compatibility</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The top-level json object output is a dictionary. The json
|
||||
output contains various nested dictionaries and arrays. With
|
||||
the exception of dictionaries that are populated by the fields
|
||||
of objects from the file, all instances of a dictionary are
|
||||
guaranteed to have exactly the same keys. Future versions of
|
||||
qpdf are free to add additional keys but not to remove keys or
|
||||
change the type of object that a key points to. The qpdf
|
||||
program validates this guarantee, and in the unlikely event
|
||||
that a bug in qpdf should cause it to generate data that
|
||||
doesn't conform to this rule, it will ask you to file a bug
|
||||
report.
|
||||
</para>
|
||||
<para>
|
||||
The top-level json structure contains a
|
||||
“<literal>version</literal>” key whose value is
|
||||
simple integer. The value of the <literal>version</literal> key
|
||||
will be incremented if a non-compatible change is made. A
|
||||
non-compatible change would be any change that involves removal
|
||||
of a key, a change to the format of data pointed to by a key,
|
||||
or a semantic change that requires a different interpretation
|
||||
of a previously existing key. A strong effort will be made to
|
||||
avoid breaking compatibility.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>Documentation</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The <command>qpdf</command> command can be invoked with the
|
||||
<option>--json-help</option> option. This will output a json
|
||||
structure that has the same structure as the json output that
|
||||
qpdf generates, except that each field in the help output is a
|
||||
description of the corresponding field in the json output. The
|
||||
specific guarantees are as follows:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
A dictionary in the help output means that the corresponding
|
||||
location in the actual json output is also a dictionary with
|
||||
exactly the same keys; that is, no keys present in help are
|
||||
absent in the real output, and no keys will be present in
|
||||
the real output that are not in help.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
A string in the help output is a description of the item
|
||||
that appears in the corresponding location of the actual
|
||||
output. The corresponding output can have any format.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
An array in the help output always contains a single
|
||||
element. It indicates that the corresponding location in the
|
||||
actual output is also an array, and that each element of the
|
||||
array has whatever format is implied by the single element
|
||||
of the help output's array.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
For example, the help output indicates includes a
|
||||
“<literal>pagelabels</literal>” key whose value is
|
||||
an array of one element. That element is a dictionary with keys
|
||||
“<literal>index</literal>” and
|
||||
“<literal>label</literal>”. In addition to
|
||||
describing the meaning of those keys, this tells you that the
|
||||
actual json output will contain a <literal>pagelabels</literal>
|
||||
array, each of whose elements is a dictionary that contains an
|
||||
<literal>index</literal> key, a <literal>label</literal> key,
|
||||
and no other keys.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>Directness and Simplicity</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The json output contains the value of every object in the file,
|
||||
but it also contains some processed data. This is analogous to
|
||||
how qpdf's library interface works. The processed data is
|
||||
similar to the helper functions in that it allows you to look
|
||||
at certain aspects of the PDF file without having to understand
|
||||
all the nuances of the PDF specification, while the raw objects
|
||||
allow you to mine the PDF for anything that the higher-level
|
||||
interfaces are lacking.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1 id="json.limitations">
|
||||
<title>Limitations of JSON Representation</title>
|
||||
<para>
|
||||
There are a few limitations to be aware of with the json structure:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
The top-level json object output is a dictionary. The json
|
||||
output contains various nested dictionaries and arrays. With
|
||||
the exception of dictionaries that are populated by the fields
|
||||
of objects from the file, all instances of a dictionary are
|
||||
guaranteed to have exactly the same keys. Future versions of
|
||||
qpdf are free to add additional keys but not to remove keys or
|
||||
change the type of object that a key points to. The qpdf
|
||||
program validates this guarantee, and in the unlikely event
|
||||
that a bug in qpdf should cause it to generate data that
|
||||
doesn't conform to this rule, it will ask you to file a bug
|
||||
report.
|
||||
</para>
|
||||
<para>
|
||||
The top-level json structure contains a
|
||||
“<literal>version</literal>” key whose value is
|
||||
simple integer. The value of the <literal>version</literal> key
|
||||
will be incremented if a non-compatible change is made. A
|
||||
non-compatible change would be any change that involves removal
|
||||
of a key, a change to the format of data pointed to by a key,
|
||||
or a semantic change that requires a different interpretation
|
||||
of a previously existing key. A strong effort will be made to
|
||||
avoid breaking compatibility.
|
||||
Strings, names, and indirect object references in the original
|
||||
PDF file are all converted to strings in the json
|
||||
representation. In the case of a “normal” PDF file,
|
||||
you can tell the difference because a name starts with a slash
|
||||
(<literal>/</literal>), and an indirect object reference looks
|
||||
like <literal>n n R</literal>, but if there were to be a string
|
||||
that looked like a name or indirect object reference, there
|
||||
would be no way to tell this from the json output. Note that
|
||||
there are certain cases where you know for sure what something
|
||||
is, such as knowing that dictionary keys in objects are always
|
||||
names and that certain things in the higher-level computed data
|
||||
are known to contain indirect object references.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>Documentation</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The <command>qpdf</command> command can be invoked with the
|
||||
<option>--json-help</option> option. This will output a json
|
||||
structure that has the same structure as the json output that
|
||||
qpdf generates, except that each field in the help output is a
|
||||
description of the corresponding field in the json output. The
|
||||
specific guarantees are as follows:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
A dictionary in the help output means that the corresponding
|
||||
location in the actual json output is also a dictionary with
|
||||
exactly the same keys; that is, no keys present in help are
|
||||
absent in the real output, and no keys will be present in
|
||||
the real output that are not in help.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
A string in the help output is a description of the item
|
||||
that appears in the corresponding location of the actual
|
||||
output. The corresponding output can have any format.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
An array in the help output always contains a single
|
||||
element. It indicates that the corresponding location in the
|
||||
actual output is also an array, and that each element of the
|
||||
array has whatever format is implied by the single element
|
||||
of the help output's array.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
For example, the help output indicates includes a
|
||||
“<literal>pagelabels</literal>” key whose value is
|
||||
an array of one element. That element is a dictionary with keys
|
||||
“<literal>index</literal>” and
|
||||
“<literal>label</literal>”. In addition to
|
||||
describing the meaning of those keys, this tells you that the
|
||||
actual json output will contain a <literal>pagelabels</literal>
|
||||
array, each of whose elements is a dictionary that contains an
|
||||
<literal>index</literal> key, a <literal>label</literal> key,
|
||||
and no other keys.
|
||||
The json format doesn't support binary data very well. Mostly
|
||||
the details are not important, but they are presented here for
|
||||
information. When qpdf outputs a string in the json
|
||||
representation, it converts the string to UTF-8, assuming usual
|
||||
PDF string semantics. Specifically, if the original string is
|
||||
UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
|
||||
have PDF doc encoding, and is converted to UTF-8 with that
|
||||
assumption. This causes strange things to happen to binary
|
||||
strings. For example, if you had the binary string
|
||||
<literal><038051></literal>, this would be output to the
|
||||
json as <literal>\u0003•Q</literal> because
|
||||
<literal>03</literal> is not a printable character and
|
||||
<literal>80</literal> is the bullet character in PDF doc
|
||||
encoding and is mapped to the Unicode value
|
||||
<literal>2022</literal>. Since <literal>51</literal> is
|
||||
<literal>Q</literal>, it is output as is. If you wanted to
|
||||
convert back from here to a binary string, would have to
|
||||
recognize Unicode values whose code points are higher than
|
||||
<literal>0xFF</literal> and map those back to their
|
||||
corresponding PDF doc encoding characters. There is no way to
|
||||
tell the difference between a Unicode string that was originally
|
||||
encoded as UTF-16 or one that was converted from PDF doc
|
||||
encoding. In other words, it's best if you don't try to use the
|
||||
json format to extract binary strings from the PDF file, but if
|
||||
you really had to, it could be done. Note that qpdf's
|
||||
<option>--show-object</option> option does not have this
|
||||
limitation and will reveal the string as encoded in the original
|
||||
file.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>Directness and Simplicity</term>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1 id="json.considerations">
|
||||
<title>JSON: Special Considerations</title>
|
||||
<para>
|
||||
For the most part, the built-in JSON help tells you everything you
|
||||
need to know about the JSON format, but there are a few
|
||||
non-obvious things to be aware of:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
The json output contains the value of every object in the file,
|
||||
but it also contains some processed data. This is analogous to
|
||||
how qpdf's library interface works. The processed data is
|
||||
similar to the helper functions in that it allows you to look
|
||||
at certain aspects of the PDF file without having to understand
|
||||
all the nuances of the PDF specification, while the raw objects
|
||||
allow you to mine the PDF for anything that the higher-level
|
||||
interfaces are lacking.
|
||||
While qpdf guarantees that keys present in the help will be
|
||||
present in the output, those fields may be null or empty if the
|
||||
information is not known or absent in the file. Also, if you
|
||||
specify <option>--json-keys</option>, the keys that are not
|
||||
listed will be excluded entirely except for those that
|
||||
<option>--json-help</option> says are always present.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</para>
|
||||
<para>
|
||||
There are a few limitations to be aware of with the json structure:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
Strings, names, and indirect object references in the original
|
||||
PDF file are all converted to strings in the json
|
||||
representation. In the case of a “normal” PDF file,
|
||||
you can tell the difference because a name starts with a slash
|
||||
(<literal>/</literal>), and an indirect object reference looks
|
||||
like <literal>n n R</literal>, but if there were to be a string
|
||||
that looked like a name or indirect object reference, there
|
||||
would be no way to tell this from the json output. Note that
|
||||
there are certain cases where you know for sure what something
|
||||
is, such as knowing that dictionary keys in objects are always
|
||||
names and that certain things in the higher-level computed data
|
||||
are known to contain indirect object references.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
The json format doesn't support binary data very well. Mostly
|
||||
the details are not important, but they are presented here for
|
||||
information. When qpdf outputs a string in the json
|
||||
representation, it converts the string to UTF-8, assuming usual
|
||||
PDF string semantics. Specifically, if the original string is
|
||||
UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
|
||||
have PDF doc encoding, and is converted to UTF-8 with that
|
||||
assumption. This causes strange things to happen to binary
|
||||
strings. For example, if you had the binary string
|
||||
<literal><038051></literal>, this would be output to the
|
||||
json as <literal>\u0003•Q</literal> because
|
||||
<literal>03</literal> is not a printable character and
|
||||
<literal>80</literal> is the bullet character in PDF doc
|
||||
encoding and is mapped to the Unicode value
|
||||
<literal>2022</literal>. Since <literal>51</literal> is
|
||||
<literal>Q</literal>, it is output as is. If you wanted to
|
||||
convert back from here to a binary string, would have to
|
||||
recognize Unicode values whose code points are higher than
|
||||
<literal>0xFF</literal> and map those back to their
|
||||
corresponding PDF doc encoding characters. There is no way to
|
||||
tell the difference between a Unicode string that was originally
|
||||
encoded as UTF-16 or one that was converted from PDF doc
|
||||
encoding. In other words, it's best if you don't try to use the
|
||||
json format to extract binary strings from the PDF file, but if
|
||||
you really had to, it could be done. Note that qpdf's
|
||||
<option>--show-object</option> option does not have this
|
||||
limitation and will reveal the string as encoded in the original
|
||||
file.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
<para>
|
||||
For specific details on the information provided in the json
|
||||
output, please run <command>qpdf --json-help</command>.
|
||||
</para>
|
||||
<listitem>
|
||||
<para>
|
||||
In a few places, there are keys with names containing
|
||||
<literal>pageposfrom1</literal>. The values of these keys are
|
||||
null or an integer. If an integer, they point to a page index
|
||||
within the file numbering from 1. Note that json indexes from
|
||||
0, and you would also use 0-based indexing using the API.
|
||||
However, 1-based indexing is easier in this case because the
|
||||
command-line syntax for specifying page ranges is 1-based. If
|
||||
you were going to write a program that looked through the json
|
||||
for information about specific pages and then use the
|
||||
command-line to extract those pages, 1-based indexing is
|
||||
easier. Besides, it's more convenient to subtract 1 from a
|
||||
program in a real programming language than it is to add 1 from
|
||||
shell code.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
The image information included in the <literal>page</literal>
|
||||
section of the json output includes the key
|
||||
“<literal>filterable</literal>”. Note that the
|
||||
value of this field may depend on the
|
||||
<option>--decode-level</option> that you invoke qpdf with. The
|
||||
json output includes a top-level key
|
||||
“<literal>parameters</literal>” that indicates the
|
||||
decode level used for computing whether a stream was
|
||||
filterable. For example, jpeg images will be shown as not
|
||||
filterable by default, but they will be shown as filterable if
|
||||
you run <command>qpdf --json --decode-level=all</command>.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="ref.design">
|
||||
<title>Design and Library Notes</title>
|
||||
|
|
37
qpdf/qpdf.cc
37
qpdf/qpdf.cc
|
@ -338,6 +338,9 @@ static JSON json_schema(std::set<std::string>* keys = 0)
|
|||
outline.addDictionaryMember(
|
||||
"dest",
|
||||
JSON::makeString("outline destination dictionary"));
|
||||
page.addDictionaryMember(
|
||||
"pageposfrom1",
|
||||
JSON::makeString("position of page in document numbering from 1"));
|
||||
}
|
||||
if (all_keys || keys->count("pagelabels"))
|
||||
{
|
||||
|
@ -371,6 +374,10 @@ static JSON json_schema(std::set<std::string>* keys = 0)
|
|||
outlines.addDictionaryMember(
|
||||
"open",
|
||||
JSON::makeString("whether the outline is displayed expanded"));
|
||||
outlines.addDictionaryMember(
|
||||
"destpageposfrom1",
|
||||
JSON::makeString("position of destination page in document"
|
||||
" numbered from 1; null if not known"));
|
||||
}
|
||||
return schema;
|
||||
}
|
||||
|
@ -2813,6 +2820,7 @@ static void do_json_pages(QPDF& pdf, Options& o, JSON& j)
|
|||
j_outline.addDictionaryMember(
|
||||
"dest", (*oiter).getDest().getJSON(true));
|
||||
}
|
||||
j_page.addDictionaryMember("pageposfrom1", JSON::makeInt(1 + pageno));
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -2847,7 +2855,8 @@ static void do_json_page_labels(QPDF& pdf, Options& o, JSON& j)
|
|||
}
|
||||
|
||||
static void add_outlines_to_json(
|
||||
std::list<QPDFOutlineObjectHelper> outlines, JSON& j)
|
||||
std::list<QPDFOutlineObjectHelper> outlines, JSON& j,
|
||||
std::map<QPDFObjGen, int>& page_numbers)
|
||||
{
|
||||
for (std::list<QPDFOutlineObjectHelper>::iterator iter = outlines.begin();
|
||||
iter != outlines.end(); ++iter)
|
||||
|
@ -2858,17 +2867,39 @@ static void add_outlines_to_json(
|
|||
jo.addDictionaryMember("title", JSON::makeString(ol.getTitle()));
|
||||
jo.addDictionaryMember("dest", ol.getDest().getJSON(true));
|
||||
jo.addDictionaryMember("open", JSON::makeBool(ol.getCount() >= 0));
|
||||
QPDFObjectHandle page = ol.getDestPage();
|
||||
JSON j_destpage = JSON::makeNull();
|
||||
if (page.isIndirect())
|
||||
{
|
||||
QPDFObjGen og = page.getObjGen();
|
||||
if (page_numbers.count(og))
|
||||
{
|
||||
j_destpage = JSON::makeInt(page_numbers[og]);
|
||||
}
|
||||
}
|
||||
jo.addDictionaryMember("destpageposfrom1", j_destpage);
|
||||
JSON j_kids = jo.addDictionaryMember("kids", JSON::makeArray());
|
||||
add_outlines_to_json(ol.getKids(), j_kids);
|
||||
add_outlines_to_json(ol.getKids(), j_kids, page_numbers);
|
||||
}
|
||||
}
|
||||
|
||||
static void do_json_outlines(QPDF& pdf, Options& o, JSON& j)
|
||||
{
|
||||
std::map<QPDFObjGen, int> page_numbers;
|
||||
QPDFPageDocumentHelper dh(pdf);
|
||||
std::vector<QPDFPageObjectHelper> pages = dh.getAllPages();
|
||||
int n = 0;
|
||||
for (std::vector<QPDFPageObjectHelper>::iterator iter = pages.begin();
|
||||
iter != pages.end(); ++iter)
|
||||
{
|
||||
QPDFObjectHandle oh = (*iter).getObjectHandle();
|
||||
page_numbers[oh.getObjGen()] = ++n;
|
||||
}
|
||||
|
||||
JSON j_outlines = j.addDictionaryMember(
|
||||
"outlines", JSON::makeArray());
|
||||
QPDFOutlineDocumentHelper odh(pdf);
|
||||
add_outlines_to_json(odh.getTopLevelOutlines(), j_outlines);
|
||||
add_outlines_to_json(odh.getTopLevelOutlines(), j_outlines, page_numbers);
|
||||
}
|
||||
|
||||
static void do_json(QPDF& pdf, Options& o)
|
||||
|
|
Loading…
Reference in New Issue
Block a user