Indexing System: Difference between revisions
imported>Nicolas mNo edit summary |
imported>Nicolas |
||
(15 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
{{ | {{TOCright}} | ||
This document aims at helping developers getting into the code of the ''index'' plug-in. | This document aims at helping developers getting into the code of the '''index''' plug-in. | ||
Please refer to [[Rodin Index Design]] for more general considerations about the purpose and design. | Please refer to [[Rodin Index Design]] for more general considerations about the purpose and design. | ||
Line 7: | Line 7: | ||
The indexer plug-in is made of the following packages: | The indexer plug-in is made of the following packages: | ||
* ''org.rodinp.core. | * '''org.rodinp.core.indexer''': public interfaces and Plugin class | ||
* ''org.rodinp.internal.core. | * '''org.rodinp.internal.core.indexer''': implementation of public interfaces and main classes (including '''IndexManager''') | ||
* ''org.rodinp.internal.core. | * '''org.rodinp.internal.core.indexer.tables''': index tables (memory storage) | ||
* ''org.rodinp.internal.core. | * '''org.rodinp.internal.core.indexer.sort''': generical sorting support (used to manage file dependencies) | ||
* ''org.rodinp.internal.core. | * '''org.rodinp.internal.core.indexer.persistence''': index saving main classes (disk storage) | ||
* ''org.rodinp.internal.core. | * '''org.rodinp.internal.core.indexer.persistence.xml''': index saving in xml (Strategy pattern) | ||
==The Index Manager== | ==The Index Manager== | ||
''IndexManager'' is the master class of the plug-in. There resides the | '''IndexManager''' is the master class of the plug-in. There resides the ''start()'' method called in a background job at plug-in startup. Firstly, it restores saved index data from previous session (see [[#Persistence|Persistence]] hereafter). Then it loops, waiting for database deltas to be enqueued and launching the indexing of changed files via a user-cancellable job (see [[#Concurrency|Concurrency]] hereafter). | ||
Indexing is done in two steps, project per project: | Indexing is done in two steps, project per project: | ||
* updating file [[#Dependencies]] to compute indexing order | * updating file [[#Dependencies|dependencies]] to compute indexing order | ||
* indexing files in computed order to update [[#Index Tables]] | * indexing files in computed order to update [[#Index Tables|index tables]] | ||
Finally, when the session closes, index data is saved (see [[#Persistence]]) and the background job is cancelled. | Finally, when the session closes, index data is saved (see [[#Persistence|Persistence]] hereafter) and the background job is cancelled. | ||
The IndexManager is also the entry point for index | The IndexManager is also the entry point for index querying. | ||
===Project Index Manager=== | ===Project Index Manager=== | ||
A ''ProjectIndexManager'' (PIM) is responsible for maintaining [[#Index Tables]] of one project. The IndexManager has a ''PerProjectPIM'', that is simply an association between projects and their respective PIMs. | A '''ProjectIndexManager''' (PIM) is responsible for maintaining [[#Index Tables|index tables]] of one project. The IndexManager has a '''PerProjectPIM''', that is simply an association between projects and their respective PIMs. | ||
A PIM is called by the IndexManager at two moments: | A PIM is called by the IndexManager at two moments: | ||
* when a file has just changed (''fileChanged()''), to update file [[#Dependencies]] | * when a file has just changed (''fileChanged()''), to update file [[#Dependencies|dependencies]] | ||
* when the indexing job is scheduled (''doIndexing()''), to index files and update its [[#Index Tables]] | * when the indexing job is scheduled (''doIndexing()''), to index files and update its [[#Index Tables|index tables]] | ||
It delegates indexer calling to the [[#File Indexing Manager]]. | It delegates indexer calling to the [[#File Indexing Manager|file indexing manager]]. | ||
===Indexer Registry=== | ===Indexer Registry=== | ||
The ''IndexerRegistry'' (Singleton) records registered indexers. It maps file types on lists of indexers, so that when a given file has to be indexed, it can be requested to give the corresponding registered indexers, in the order they have to be called. | The '''IndexerRegistry''' (Singleton) records registered indexers. It maps file types on lists of indexers, so that when a given file has to be indexed, it can be requested to give the corresponding registered indexers, in the order they have to be called. | ||
===File Indexing Manager=== | ===File Indexing Manager=== | ||
Actual indexer calls are achieved by the ''FileIndexingManager'' (FIM), which implements Singleton pattern. Being given a file, it gets appropriate indexers from the [[#Indexer Registry]], asks them either to get [[#Dependencies]] of the file or to index it, then returns the result to the calling [[#Project Index Manager|PIM]]. | Actual indexer calls are achieved by the '''FileIndexingManager''' (FIM), which implements Singleton pattern. Being given a file, it gets appropriate indexers from the [[#Indexer Registry|indexer registry]], asks them either to get [[#Dependencies|dependencies]] of the file or to index it, then returns the result to the calling [[#Project Index Manager|PIM]]. | ||
==Index Tables== | ==Index Tables== | ||
Line 52: | Line 51: | ||
===Rodin Index=== | ===Rodin Index=== | ||
''RodinIndex'' is the main index table. It stores the entities together with their occurrences. More precisely, it maps each indexed IInternalElement to a ''Descriptor''. | '''RodinIndex''' is the main index table. It stores the entities together with their occurrences. More precisely, it maps each indexed IInternalElement to a '''Descriptor'''. | ||
A Descriptor is a ''Declaration'' (handle + name) of an element plus a set of ''Occurrence''s of this element. | A Descriptor is a '''Declaration''' (handle + name) of an element plus a set of '''Occurrence'''s of this element. | ||
An Occurrence has an ''OccurrenceKind'' (defined by indexer plug-in implementers) and a ''RodinLocation'' (a location of the element within another in some file). | An Occurrence has an '''OccurrenceKind''' (defined by indexer plug-in implementers) and a '''RodinLocation''' (a location of the element within another in some file). | ||
===Export Table=== | ===Export Table=== | ||
The ''ExportTable'' records declarations of elements exported by each indexed file. | The '''ExportTable''' records declarations of elements exported by each indexed file. | ||
===File Table=== | ===File Table=== | ||
The ''FileTable'' gives, for each file, the elements indexed in it. It contains no more information than the RodinIndex (can be constructed entirely from it), but is intended to give faster access to indexed elements of a given file, rather than iterating through the whole index. | The '''FileTable''' gives, for each file, the elements indexed in it. It contains no more information than the RodinIndex (can be constructed entirely from it), but is intended to give faster access to indexed elements of a given file, rather than iterating through the whole index. | ||
===Name Table=== | ===Name Table=== | ||
The ''NameTable'' follows the same optimization idea: for each indexed user-defined name, it gives a set of elements declared with this name. | The '''NameTable''' follows the same optimization idea: for each indexed user-defined name, it gives a set of elements declared with this name. | ||
==Dependencies== | ==Dependencies== | ||
File dependencies are treated generically in the ''org.rodinp.internal.core. | File dependencies are treated generically in the '''org.rodinp.internal.core.indexer.sort''' package. The final goal is to maintain an iterable [[#Total Order|total order]] on the files of each project. PIMs will then follow this order when indexing their project, using type ''IRodinFile'' for generic parameter T. | ||
===Node=== | ===Node=== | ||
The ''Node'' is to be understood as a node in the dependence graph. It has a label of type T and lists of predecessors and successors. | The '''Node''' is to be understood as a node in the dependence graph. It has a label of type T and lists of predecessors and successors. | ||
Telling that | Telling that ''N1'' is a predecessor of ''N2'' means that ''N2'' depends on ''N1'', so ''N1'' must be indexed before ''N2''. | ||
Nodes also have an order position ( | Nodes also have an order position (''orderPos'' field) that is the position of this node in the total order. Its value can only be known after nodes have been [[#Sorted Nodes|sorted]]. | ||
Not all nodes need to be processed each time. This is the reason why nodes have a | Not all nodes need to be processed each time. This is the reason why nodes have a ''mark'': when iterating over nodes, marked ones are processed (indexed) whereas others are simply ignored. | ||
===Total Order=== | ===Total Order=== | ||
''TotalOrder'' is the main class. It implements | '''TotalOrder''' is the main class. It implements ''Iterator'', providing ''next()'', ''hasNext()'' and ''remove()'' methods, along with other specific facilities. | ||
It delegates graph maintenance to ''Graph'' and node sorting and iterating to ''SortedNodes''. | It delegates graph maintenance to '''Graph''' and node sorting and iterating to '''SortedNodes'''. | ||
When the graph changes, nodes are sorted again, but only if iterating is requested. This avoids sorting at each change. | When the graph changes, nodes are sorted again, but only if iterating is requested. This avoids sorting at each change. | ||
Graph change events are received through a listener. This might seem strange at first: as the TotalOrder class calls for graph modification itself, it should know when it changes. The problem is that, for instance, predecessors can be set to exactly the same list of nodes as previously. In this case, the graph remains unchanged but TotalOrder cannot guess it and will unnecessarily sort the whole graph again (well, actually it could guess, but we prefer delegating change check to ''Graph''). Same problem when creating a node that already exists. | Graph change events are received through a listener. This might seem strange at first: as the TotalOrder class calls for graph modification itself, it should know when it changes. The problem is that, for instance, predecessors can be set to exactly the same list of nodes as previously. In this case, the graph remains unchanged but TotalOrder cannot guess it and will unnecessarily sort the whole graph again (well, actually it could guess, but we prefer delegating change check to '''Graph'''). Same problem when creating a node that already exists. | ||
===Graph=== | ===Graph=== | ||
The ''Graph'' stores nodes, applies changes to their links and maintains their coherence: if N1 is a predecessor of N2 then N2 is a successor of N1. | The '''Graph''' stores nodes, applies changes to their links and maintains their coherence: if N1 is a predecessor of N2 then N2 is a successor of N1. | ||
As explained above, it fires change events when a method call actually modifies the graph. | As explained above, it fires change events when a method call actually modifies the graph. | ||
Line 102: | Line 101: | ||
===Sorted Nodes=== | ===Sorted Nodes=== | ||
''SortedNodes'' maintains a list of iterable sorted nodes. It delegates sorting to a ''Sorter''. | '''SortedNodes''' maintains a list of iterable sorted nodes. It delegates sorting to a '''Sorter'''. | ||
An important feature is that, if the graph is modified during iteration, it restarts from the first node that differs between previous order and new order. To that end, iterated nodes are recorded. | |||
==Indexing Bridge== | |||
The '''IndexingBridge''' acts as a kind of a go-between during the indexing of a single file. It is built by the FIM, then sent to indexers. Indexers get the file to index from it, then they fill it with declarations and occurrences. From these, the IndexingBridge builds an '''IndexingResult''' that is sent back to the FIM, then processed by the PIM. | |||
Thus, IndexingBridge is a path from and to the indexing system, from indexers point of view. | |||
This role gives it the responsibility to enforce constraints like: | |||
* an element cannot have an occurrence if it has not been declared before | |||
* an element cannot be declared more than once | |||
* declared elements must be local to the file | |||
* occurring elements must be either local or imported | |||
* declared elements should have one or more occurrences when indexing completes | |||
This last constraint aims at avoiding to store declarations that are never referred to. An element with no occurrence will therefore be removed from the indexing result before returning it to the FIM. | |||
==Persistence== | |||
Index data persistence is implemented following the 'Strategy' pattern. Current implemention provides an xml strategy, however any other means could be used instead. | |||
Saved and restored data consist in current deltas plus, for each PIM: | |||
* index table | |||
* export table | |||
* file sort order | |||
* iterating state | |||
The '''PersistenceManager''' controls restoring and saving, being called by the IndexManager at startup and listening to session and project closure as a '''ISaveParticipant'''. | |||
Actual xml persistence is achieved by the '''XMLPersistor''' that implements the required '''IPersistor''' interface. Then each type of persistent object (index table, ...) is decomposed into specific "persistors" that provide both ''save()'' and ''restore()'' methods for their assigned object. | |||
Persistent object are passed to persistors as '''PersistentIndexManager''', '''PersistentSortedNodes''', '''PersistentTotalOrder'''. These objects contain the state and/or data of IndexManager, SortedNodes, TotalOrder, that they have to save and from which they can be restored. | |||
==Concurrency== | ==Concurrency== | ||
== | ===Delta Queue=== | ||
The '''DeltaQueue''', used by the IndexManager, is the place where deltas sent by the database are temporarily stored, waiting to be processed. '''DeltaQueuer''' is the listener. Actually, deltas are not enqueued "as is", but as '''IndexDelta''', that is a IRodinElement (file or project) plus a '''Kind''' (file changed, project closed …). | |||
To properly handle index queries, we have to be in a stable state, waiting until the queue is empty. This is achieved by a '''CountUpDownLatch''' that counts the number of elements in the queue. It is incremented when an element is added and decremented when an element has been removed '''and processed'''. | |||
===Jobs and Locks=== | |||
The IndexManager starts as a background job with low priority, then launches a user-cancelable indexing job at each delta processing. | |||
During indexing, resources must be left unchanged. However, it would be too costly to lock the whole RodinDB. Therefore, we only use the scheduling rule of the file to index, forcing the indexing job to wait until it is released. If another file is changed meanwhile, a delta will be generated and the modification will eventually be taken into account by the index manager. | |||
Restoring and saving index data (see [[#Persistence|Persistence]] here before) may lead to concurrency issues if deltas or queries were to be processed at the same time. Protection is achieved by synchronized methods in IndexManager (persistence and queries) and in PIMs (read/write access to index tables). | |||
[[Category:Design]] |
Latest revision as of 17:41, 9 March 2009
This document aims at helping developers getting into the code of the index plug-in. Please refer to Rodin Index Design for more general considerations about the purpose and design.
General structure of the package
The indexer plug-in is made of the following packages:
- org.rodinp.core.indexer: public interfaces and Plugin class
- org.rodinp.internal.core.indexer: implementation of public interfaces and main classes (including IndexManager)
- org.rodinp.internal.core.indexer.tables: index tables (memory storage)
- org.rodinp.internal.core.indexer.sort: generical sorting support (used to manage file dependencies)
- org.rodinp.internal.core.indexer.persistence: index saving main classes (disk storage)
- org.rodinp.internal.core.indexer.persistence.xml: index saving in xml (Strategy pattern)
The Index Manager
IndexManager is the master class of the plug-in. There resides the start() method called in a background job at plug-in startup. Firstly, it restores saved index data from previous session (see Persistence hereafter). Then it loops, waiting for database deltas to be enqueued and launching the indexing of changed files via a user-cancellable job (see Concurrency hereafter).
Indexing is done in two steps, project per project:
- updating file dependencies to compute indexing order
- indexing files in computed order to update index tables
Finally, when the session closes, index data is saved (see Persistence hereafter) and the background job is cancelled.
The IndexManager is also the entry point for index querying.
Project Index Manager
A ProjectIndexManager (PIM) is responsible for maintaining index tables of one project. The IndexManager has a PerProjectPIM, that is simply an association between projects and their respective PIMs.
A PIM is called by the IndexManager at two moments:
- when a file has just changed (fileChanged()), to update file dependencies
- when the indexing job is scheduled (doIndexing()), to index files and update its index tables
It delegates indexer calling to the file indexing manager.
Indexer Registry
The IndexerRegistry (Singleton) records registered indexers. It maps file types on lists of indexers, so that when a given file has to be indexed, it can be requested to give the corresponding registered indexers, in the order they have to be called.
File Indexing Manager
Actual indexer calls are achieved by the FileIndexingManager (FIM), which implements Singleton pattern. Being given a file, it gets appropriate indexers from the indexer registry, asks them either to get dependencies of the file or to index it, then returns the result to the calling PIM.
Index Tables
Those tables store indexing result data.
Rodin Index
RodinIndex is the main index table. It stores the entities together with their occurrences. More precisely, it maps each indexed IInternalElement to a Descriptor.
A Descriptor is a Declaration (handle + name) of an element plus a set of Occurrences of this element.
An Occurrence has an OccurrenceKind (defined by indexer plug-in implementers) and a RodinLocation (a location of the element within another in some file).
Export Table
The ExportTable records declarations of elements exported by each indexed file.
File Table
The FileTable gives, for each file, the elements indexed in it. It contains no more information than the RodinIndex (can be constructed entirely from it), but is intended to give faster access to indexed elements of a given file, rather than iterating through the whole index.
Name Table
The NameTable follows the same optimization idea: for each indexed user-defined name, it gives a set of elements declared with this name.
Dependencies
File dependencies are treated generically in the org.rodinp.internal.core.indexer.sort package. The final goal is to maintain an iterable total order on the files of each project. PIMs will then follow this order when indexing their project, using type IRodinFile for generic parameter T.
Node
The Node is to be understood as a node in the dependence graph. It has a label of type T and lists of predecessors and successors.
Telling that N1 is a predecessor of N2 means that N2 depends on N1, so N1 must be indexed before N2.
Nodes also have an order position (orderPos field) that is the position of this node in the total order. Its value can only be known after nodes have been sorted.
Not all nodes need to be processed each time. This is the reason why nodes have a mark: when iterating over nodes, marked ones are processed (indexed) whereas others are simply ignored.
Total Order
TotalOrder is the main class. It implements Iterator, providing next(), hasNext() and remove() methods, along with other specific facilities.
It delegates graph maintenance to Graph and node sorting and iterating to SortedNodes.
When the graph changes, nodes are sorted again, but only if iterating is requested. This avoids sorting at each change.
Graph change events are received through a listener. This might seem strange at first: as the TotalOrder class calls for graph modification itself, it should know when it changes. The problem is that, for instance, predecessors can be set to exactly the same list of nodes as previously. In this case, the graph remains unchanged but TotalOrder cannot guess it and will unnecessarily sort the whole graph again (well, actually it could guess, but we prefer delegating change check to Graph). Same problem when creating a node that already exists.
Graph
The Graph stores nodes, applies changes to their links and maintains their coherence: if N1 is a predecessor of N2 then N2 is a successor of N1.
As explained above, it fires change events when a method call actually modifies the graph.
Sorted Nodes
SortedNodes maintains a list of iterable sorted nodes. It delegates sorting to a Sorter.
An important feature is that, if the graph is modified during iteration, it restarts from the first node that differs between previous order and new order. To that end, iterated nodes are recorded.
Indexing Bridge
The IndexingBridge acts as a kind of a go-between during the indexing of a single file. It is built by the FIM, then sent to indexers. Indexers get the file to index from it, then they fill it with declarations and occurrences. From these, the IndexingBridge builds an IndexingResult that is sent back to the FIM, then processed by the PIM.
Thus, IndexingBridge is a path from and to the indexing system, from indexers point of view.
This role gives it the responsibility to enforce constraints like:
- an element cannot have an occurrence if it has not been declared before
- an element cannot be declared more than once
- declared elements must be local to the file
- occurring elements must be either local or imported
- declared elements should have one or more occurrences when indexing completes
This last constraint aims at avoiding to store declarations that are never referred to. An element with no occurrence will therefore be removed from the indexing result before returning it to the FIM.
Persistence
Index data persistence is implemented following the 'Strategy' pattern. Current implemention provides an xml strategy, however any other means could be used instead.
Saved and restored data consist in current deltas plus, for each PIM:
- index table
- export table
- file sort order
- iterating state
The PersistenceManager controls restoring and saving, being called by the IndexManager at startup and listening to session and project closure as a ISaveParticipant.
Actual xml persistence is achieved by the XMLPersistor that implements the required IPersistor interface. Then each type of persistent object (index table, ...) is decomposed into specific "persistors" that provide both save() and restore() methods for their assigned object.
Persistent object are passed to persistors as PersistentIndexManager, PersistentSortedNodes, PersistentTotalOrder. These objects contain the state and/or data of IndexManager, SortedNodes, TotalOrder, that they have to save and from which they can be restored.
Concurrency
Delta Queue
The DeltaQueue, used by the IndexManager, is the place where deltas sent by the database are temporarily stored, waiting to be processed. DeltaQueuer is the listener. Actually, deltas are not enqueued "as is", but as IndexDelta, that is a IRodinElement (file or project) plus a Kind (file changed, project closed …).
To properly handle index queries, we have to be in a stable state, waiting until the queue is empty. This is achieved by a CountUpDownLatch that counts the number of elements in the queue. It is incremented when an element is added and decremented when an element has been removed and processed.
Jobs and Locks
The IndexManager starts as a background job with low priority, then launches a user-cancelable indexing job at each delta processing.
During indexing, resources must be left unchanged. However, it would be too costly to lock the whole RodinDB. Therefore, we only use the scheduling rule of the file to index, forcing the indexing job to wait until it is released. If another file is changed meanwhile, a delta will be generated and the modification will eventually be taken into account by the index manager.
Restoring and saving index data (see Persistence here before) may lead to concurrency issues if deltas or queries were to be processed at the same time. Protection is achieved by synchronized methods in IndexManager (persistence and queries) and in PIMs (read/write access to index tables).