Vectorizers
A vectorizer transforms a list of tokens into a vector on which
mathematical computations can be performed. The vectorizers are generic
packages that must be instantiated with a sparse array package, which
describes the final vector. The first step is to instantiate a sparse
array with a Row_Type and a Column_Type, which represent the indices of
the vector, and a Value_Type, which represents the value of a cell.
For example, a simple counter vector can be declared as follows:
package Counter_Arrays is
   new SCI.Sparse.COO_Arrays (Row_Type    => Positive,
                              Column_Type => Positive,
                              Value_Type  => Natural);
The sparse array only records cells that have a value. The Row_Type can
be used to represent a document and the Column_Type to identify a token
occurring in that document; the cell value then holds the number of
occurrences of that token.
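To illustrate what "only records cells which have values" means, here is a minimal, self-contained sketch of coordinate-keyed storage in the spirit of a COO array. It does not use the SCI.Sparse.COO_Arrays package and says nothing about how that package is implemented; Cell_Index, Cell_Maps, Get and Default are hypothetical names chosen for the illustration.

```ada
with Ada.Text_IO;
with Ada.Containers.Ordered_Maps;

procedure Coo_Sketch is
   --  Hypothetical cell coordinate: (document row, token column).
   type Cell_Index is record
      Row    : Positive;
      Column : Positive;
   end record;

   --  Order cells by row first, then by column.
   function "<" (Left, Right : Cell_Index) return Boolean is
     (Left.Row < Right.Row
      or else (Left.Row = Right.Row and then Left.Column < Right.Column));

   package Cell_Maps is
     new Ada.Containers.Ordered_Maps (Key_Type     => Cell_Index,
                                      Element_Type => Natural);

   Cells   : Cell_Maps.Map;
   Default : constant Natural := 0;  --  value reported for absent cells

   --  Return the stored value, or Default when the cell was never set.
   function Get (Row, Column : Positive) return Natural is
      Pos : constant Cell_Maps.Cursor := Cells.Find ((Row, Column));
   begin
      return (if Cell_Maps.Has_Element (Pos)
              then Cell_Maps.Element (Pos) else Default);
   end Get;
begin
   --  Only two cells are stored, whatever the logical size of the array.
   Cells.Insert ((Row => 1, Column => 3), 2);
   Cells.Insert ((Row => 2, Column => 1), 5);

   Ada.Text_IO.Put_Line ("(1, 3) =" & Natural'Image (Get (1, 3)));
   Ada.Text_IO.Put_Line ("(1, 1) =" & Natural'Image (Get (1, 1)));
   Ada.Text_IO.Put_Line
     ("stored cells:" & Ada.Containers.Count_Type'Image (Cells.Length));
end Coo_Sketch;
```

Reading cell (1, 1) returns the default because nothing was stored there, and the container holds two entries only.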
Counters
The SCI.Vectorizers.Counters package is a vectorizer that counts the
occurrences of tokens and builds a vector of these counters. The
Vectorizer_Type contains an ordered map of tokens, represented by
Token_Type, and for each token it records a column index within the
final vector. When a token is recorded for a document, the vectorizer
looks it up in the map, finds the associated column, and increments the
counter for that document and token. If the token is not found, it is
inserted in the map and associated with a new column in the vector.
For example, the package can be instantiated as follows:
package Token_Counters is
   new SCI.Vectorizers.Counters (Token_Type => Unbounded_String,
                                 Arrays     => Counter_Arrays,
                                 "<"        => "<");
An instance of the vectorizer is declared and configured:
V : Token_Counters.Vectorizer_Type;
...
V.Counters.Default := 0;
The vectorizer is filled with documents and tokens by using the
Add_Token procedure, with the Row representing the document.
Because the vector cell can be any Ada private type, it is necessary
to provide an Increment function that takes the current value and
returns the incremented one.
function Increment (Value : in Natural) return Natural is (Value + 1);
Token_Counters.Add_Token (Into      => V,
                          Token     => Item,
                          Increment => Increment'Access);
After the vectorizer instance has been filled, V.Counters contains
the cells that count the occurrences of each token per scanned document.
The result can be used to compute similarities between different
documents (the rows).
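The lookup-or-insert step described in this section can be sketched with the standard containers alone. This is an illustration under simplifying assumptions, not the SCI.Vectorizers.Counters implementation: it handles a single document, uses a small dense array instead of a sparse one, and Token_Maps, Count_Row, Max_Columns and this local Add_Token are hypothetical names.

```ada
with Ada.Text_IO;
with Ada.Containers.Indefinite_Ordered_Maps;

procedure Count_Sketch is
   --  Hypothetical stand-in for the vectorizer's token map:
   --  each distinct token is assigned the next free column index.
   package Token_Maps is
     new Ada.Containers.Indefinite_Ordered_Maps (Key_Type     => String,
                                                 Element_Type => Positive);

   --  Simplification: one document only, stored as a dense row.
   Max_Columns : constant := 16;
   type Count_Row is array (1 .. Max_Columns) of Natural;

   Columns : Token_Maps.Map;
   Counts  : Count_Row := (others => 0);

   procedure Add_Token (Token : in String) is
      Pos : constant Token_Maps.Cursor := Columns.Find (Token);
      Col : Positive;
   begin
      if Token_Maps.Has_Element (Pos) then
         --  Known token: reuse its column.
         Col := Token_Maps.Element (Pos);
      else
         --  New token: associate it with a new column.
         Col := Positive (Columns.Length) + 1;
         Columns.Insert (Token, Col);
      end if;
      Counts (Col) := Counts (Col) + 1;
   end Add_Token;
begin
   Add_Token ("ada");
   Add_Token ("sparse");
   Add_Token ("ada");

   --  The ordered map iterates in key order.
   for Pos in Columns.Iterate loop
      Ada.Text_IO.Put_Line
        (Token_Maps.Key (Pos)
         & " column" & Positive'Image (Token_Maps.Element (Pos))
         & " count" & Natural'Image (Counts (Token_Maps.Element (Pos))));
   end loop;
end Count_Sketch;
```

After the three calls, "ada" occupies column 1 with a count of 2 and "sparse" occupies column 2 with a count of 1.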
Indefinite counters
The SCI.Vectorizers.Indefinite_Counters package is similar to the
SCI.Vectorizers.Counters package but allows indefinite types to be used
for the Token_Type. For example, a String can be used as the Token_Type
as follows:
package Token_Counters is
   new SCI.Vectorizers.Indefinite_Counters (Token_Type => String,
                                            Arrays     => Counter_Arrays,
                                            "<"        => "<");
Transformers
The SCI.Vectorizers.Transformers package transforms a count matrix into
a normalized tf or tf-idf representation. Tf means term frequency, while
tf-idf means term frequency times inverse document frequency. This is a
common term-weighting scheme in information retrieval that has also
found good use in document classification.
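For reference, one common textbook formulation of these weights is given below. Several variants exist (smoothed idf, different logarithm bases, additional vector normalization), and this document does not state which one the package applies, so the formula should be read as an illustration of the scheme rather than as the package's exact computation. Here n_{t,d} is the count of token t in document d, N is the number of documents, and df(t) is the number of documents that contain t.

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

As a worked example under this formulation with the natural logarithm: if a token occurs 2 times in a document of 4 tokens and appears in 1 of 2 documents, then tf = 2/4 = 0.5, idf = ln(2/1) ≈ 0.693, and tf-idf ≈ 0.347. A token that appears in every document gets idf = ln(1) = 0 and therefore a weight of 0.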
The Frequency_Type defines the floating-point type used to represent the
frequency. A Convert function must be provided to convert the counter
value used by the sparse array into a Frequency_Type. The transformer is
then instantiated:
function To_Float (Value : Natural) return Float is (Float (Value));
package Counter_Transformers is
   new SCI.Vectorizers.Transformers (Frequency_Type => Float,
                                     Arrays         => Counter_Arrays,
                                     Convert        => To_Float);
Given the counters computed by the SCI.Vectorizers.Counters package,
the tf-idf values are computed as follows:
F : Counter_Transformers.Frequency_Arrays.Array_Type;
...
Counter_Transformers.TIDF (From => V.Counters, Into => F);