Vectorizers
A vectorizer transforms a list of tokens into a vector on which
mathematical computations can be performed. The vectorizers are generic
packages that must be instantiated with a sparse array package which
describes the final vector. The first step is to instantiate a sparse array
with a Row_Type and a Column_Type, which represent the indices of the
vector, and a Value_Type, which represents the value of a cell. For example,
a simple counter vector can be declared as follows:
    package Counter_Arrays is
       new SCI.Sparse.COO_Arrays (Row_Type    => Positive,
                                  Column_Type => Positive,
                                  Value_Type  => Natural);
The sparse array only records the cells which have a value. The Row_Type can
be used to represent a document and the Column_Type to represent a token,
so that the cell value holds the number of occurrences of that token in
that document.
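The idea of recording only the cells that have a value can be pictured with a standard ordered map keyed by a (row, column) pair. This is a minimal, self-contained sketch and not the SCI.Sparse.COO_Arrays implementation: the names Cell_Index, Cell_Maps and Get are hypothetical and exist only for this illustration.

```ada
with Ada.Containers.Ordered_Maps;
with Ada.Text_IO;

procedure Sparse_Sketch is

   --  Hypothetical cell coordinate: (document row, token column).
   type Cell_Index is record
      Row    : Positive;
      Column : Positive;
   end record;

   --  Order the cells by row first, then by column.
   function "<" (Left, Right : Cell_Index) return Boolean is
     (Left.Row < Right.Row
      or else (Left.Row = Right.Row and then Left.Column < Right.Column));

   package Cell_Maps is
     new Ada.Containers.Ordered_Maps (Key_Type     => Cell_Index,
                                      Element_Type => Natural);

   Cells : Cell_Maps.Map;

   --  Return the stored value, or 0 for a cell that was never recorded.
   function Get (Row, Column : Positive) return Natural is
      Pos : constant Cell_Maps.Cursor := Cells.Find ((Row, Column));
   begin
      if Cell_Maps.Has_Element (Pos) then
         return Cell_Maps.Element (Pos);
      end if;
      return 0;
   end Get;

begin
   --  Only two cells are stored, whatever the logical size of the matrix.
   Cells.Include ((Row => 1, Column => 3), 2);
   Cells.Include ((Row => 2, Column => 1), 5);

   Ada.Text_IO.Put_Line
     ("stored cells:" & Ada.Containers.Count_Type'Image (Cells.Length));
   Ada.Text_IO.Put_Line ("(1, 3) =" & Natural'Image (Get (1, 3)));
   Ada.Text_IO.Put_Line ("(1, 1) =" & Natural'Image (Get (1, 1)));
end Sparse_Sketch;
```

Reading a cell that was never written falls back to a default value, which is the role played by the Default setting of the sparse array in this sketch's terms.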
Counters
The SCI.Vectorizers.Counters package is a vectorizer that counts the
occurrences of tokens and builds a vector of these counters. The
Vectorizer_Type contains an ordered map of tokens, represented by
Token_Type, and for each token it gives a column index within the final
vector. When a token is recorded for a document, the vectorizer looks up
the token in the map, finds the associated column and increments the counter
for that document and token. If the token is not found, it is inserted in
the map and associated with a new column in the vector. For example, the
package can be instantiated as follows:
    package Token_Counters is
       new SCI.Vectorizers.Counters (Token_Type => Unbounded_String,
                                     Arrays     => Counter_Arrays,
                                     "<"        => "<");
An instance of the vectorizer is declared and configured:
    V : Token_Counters.Vectorizer_Type;
    ...
    V.Counters.Default := 0;
The vectorizer is filled with documents and tokens by using the
Add_Token procedure with the Row representing the document.
Because the vector cell can be any Ada private type, it is necessary
to provide an Increment function that takes the current value and
returns the incremented value.
    function Increment (Value : in Natural) return Natural is (Value + 1);

    Token_Counters.Add_Token (Into      => V,
                              Token     => Item,
                              Increment => Increment'Access);
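The lookup-then-increment behavior described in this section can be sketched with standard containers. This is an illustration under stated assumptions, not the SCI.Vectorizers.Counters implementation: Token_Maps, Count_Maps, Cell_Index and the local Add_Token procedure are hypothetical names, and the row is passed explicitly as a parameter.

```ada
with Ada.Containers.Indefinite_Ordered_Maps;
with Ada.Containers.Ordered_Maps;
with Ada.Text_IO;

procedure Counter_Sketch is

   use type Ada.Containers.Count_Type;

   --  Token -> column index (stand-in for the vectorizer's ordered map).
   package Token_Maps is
     new Ada.Containers.Indefinite_Ordered_Maps (Key_Type     => String,
                                                 Element_Type => Positive);

   --  (document row, token column) -> counter.
   type Cell_Index is record
      Row    : Positive;
      Column : Positive;
   end record;

   function "<" (Left, Right : Cell_Index) return Boolean is
     (Left.Row < Right.Row
      or else (Left.Row = Right.Row and then Left.Column < Right.Column));

   package Count_Maps is
     new Ada.Containers.Ordered_Maps (Key_Type     => Cell_Index,
                                      Element_Type => Natural);

   Tokens : Token_Maps.Map;
   Counts : Count_Maps.Map;

   function Increment (Value : in Natural) return Natural is (Value + 1);

   procedure Add_Token (Row : in Positive; Token : in String) is
      Pos    : constant Token_Maps.Cursor := Tokens.Find (Token);
      Column : Positive;
      Cell   : Count_Maps.Cursor;
   begin
      if Token_Maps.Has_Element (Pos) then
         Column := Token_Maps.Element (Pos);
      else
         --  Unknown token: give it the next free column.
         Column := Positive (Tokens.Length + 1);
         Tokens.Insert (Token, Column);
      end if;

      Cell := Counts.Find ((Row, Column));
      if Count_Maps.Has_Element (Cell) then
         Counts.Replace_Element (Cell, Increment (Count_Maps.Element (Cell)));
      else
         Counts.Insert ((Row, Column), Increment (0));
      end if;
   end Add_Token;

begin
   Add_Token (1, "ada");
   Add_Token (1, "spark");
   Add_Token (1, "ada");
   Add_Token (2, "spark");

   for C in Counts.Iterate loop
      Ada.Text_IO.Put_Line
        ("row" & Positive'Image (Count_Maps.Key (C).Row)
         & " column" & Positive'Image (Count_Maps.Key (C).Column)
         & " count" & Natural'Image (Count_Maps.Element (C)));
   end loop;
end Counter_Sketch;
```

In this sketch "ada" receives column 1 and "spark" column 2, so row 1 ends with a count of 2 in column 1 and 1 in column 2, and row 2 with a count of 1 in column 2.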
After the vectorizer instance has been filled, V.Counters contains the
cells that count the occurrences of each token per scanned document.
The result can be used to compute similarities between different
documents (the rows).
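The source does not say which similarity measure is intended. Cosine similarity is one common choice for count vectors, and the sketch below assumes it. It uses small dense arrays for brevity and does not use the SCI packages; Count_Vector and Cosine are hypothetical names.

```ada
with Ada.Numerics.Elementary_Functions;
with Ada.Text_IO;

procedure Similarity_Sketch is

   --  Dense rows for brevity; two token columns are assumed.
   type Count_Vector is array (1 .. 2) of Natural;

   function Cosine (A, B : in Count_Vector) return Float is
      use Ada.Numerics.Elementary_Functions;
      Dot    : Float := 0.0;
      Norm_A : Float := 0.0;
      Norm_B : Float := 0.0;
   begin
      for I in Count_Vector'Range loop
         Dot    := Dot + Float (A (I)) * Float (B (I));
         Norm_A := Norm_A + Float (A (I)) * Float (A (I));
         Norm_B := Norm_B + Float (B (I)) * Float (B (I));
      end loop;

      --  An empty document has no direction; report no similarity.
      if Norm_A = 0.0 or else Norm_B = 0.0 then
         return 0.0;
      end if;
      return Dot / (Sqrt (Norm_A) * Sqrt (Norm_B));
   end Cosine;

   --  Token counts of two documents (rows) over the same two columns.
   Doc_1 : constant Count_Vector := (2, 1);
   Doc_2 : constant Count_Vector := (0, 1);

begin
   Ada.Text_IO.Put_Line
     ("cosine (Doc_1, Doc_2) =" & Float'Image (Cosine (Doc_1, Doc_2)));
end Similarity_Sketch;
```

For these two rows the dot product is 1 and the norms are the square roots of 5 and 1, so the similarity is 1 divided by the square root of 5, roughly 0.447.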
Indefinite counters
The SCI.Vectorizers.Indefinite_Counters package is similar to the
SCI.Vectorizers.Counters package but allows indefinite types to be used
for the Token_Type. For example, a String can be used as the
Token_Type as follows:
    package Token_Counters is
       new SCI.Vectorizers.Indefinite_Counters (Token_Type => String,
                                                Arrays     => Counter_Arrays,
                                                "<"        => "<");
Transformers
The SCI.Vectorizers.Transformers package transforms a count matrix into a
normalized tf or tf-idf representation. Tf means term frequency, while
tf-idf means term frequency times inverse document frequency. This is a
common term weighting scheme in information retrieval that has also found
good use in document classification.
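For reference, the textbook form of this weighting is given below. It is an assumption for illustration: the source does not state which variant (smoothed, normalized, or otherwise) the package computes. Here $N$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of token $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```

With this form, a token that appears in every document has $\mathrm{idf}(t) = 0$ and therefore carries no weight.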
The Frequency_Type defines the floating-point type used to represent the
frequency. A Convert function must be provided to convert the counter
value used by the sparse array into a Frequency_Type. The transformer is
then instantiated:
    function To_Float (Value : Natural) return Float is (Float (Value));

    package Counter_Transformers is
       new SCI.Vectorizers.Transformers (Frequency_Type => Float,
                                         Arrays         => Counter_Arrays,
                                         Convert        => To_Float);
Given the counters computed by the SCI.Vectorizers.Counters package,
the tf-idf values are computed as follows:
    F : Counter_Transformers.Frequency_Arrays.Array_Type;
    ...
    Counter_Transformers.TIDF (From => V.Counters, Into => F);
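To make the transformation concrete, the following self-contained sketch turns a small dense count matrix into tf-idf weights. It assumes the unsmoothed textbook definition (raw count times the natural logarithm of the document count divided by the document frequency), which may differ from what the TIDF procedure computes. Count_Matrix, Frequency_Matrix and the sample values are hypothetical.

```ada
with Ada.Numerics.Elementary_Functions;
with Ada.Text_IO;

procedure Tfidf_Sketch is

   Doc_Count   : constant := 2;
   Token_Count : constant := 2;

   type Count_Matrix is
     array (1 .. Doc_Count, 1 .. Token_Count) of Natural;
   type Frequency_Matrix is
     array (1 .. Doc_Count, 1 .. Token_Count) of Float;

   function To_Float (Value : Natural) return Float is (Float (Value));

   --  Rows are documents, columns are tokens.
   Counts : constant Count_Matrix := ((2, 1), (0, 1));
   Result : Frequency_Matrix := (others => (others => 0.0));

begin
   for Col in Counts'Range (2) loop
      declare
         DF  : Natural := 0;  --  number of documents containing the token
         IDF : Float;
      begin
         for Row in Counts'Range (1) loop
            if Counts (Row, Col) > 0 then
               DF := DF + 1;
            end if;
         end loop;

         --  A token seen in no document keeps a weight of 0.0.
         if DF > 0 then
            IDF := Ada.Numerics.Elementary_Functions.Log
              (Float (Doc_Count) / Float (DF));
            for Row in Counts'Range (1) loop
               Result (Row, Col) := To_Float (Counts (Row, Col)) * IDF;
            end loop;
         end if;
      end;
   end loop;

   for Row in Result'Range (1) loop
      for Col in Result'Range (2) loop
         Ada.Text_IO.Put (Float'Image (Result (Row, Col)));
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end Tfidf_Sketch;
```

In this sample the second token occurs in both documents, so its weight is zero everywhere, while the first token keeps a positive weight only in the first document.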