A simple crawler that searches a downloaded text for occurrences of nodes from a graph.
- crawl?url={url}
  downloads a web page and searches for words that appear both in the page and in the graph. Found words are sorted by classes and instances.
- addNode?label={label}&type={type}
  adds a node to the graph. (In principle the type can be anything, but at the moment it is practically limited to 'Class' and 'Instance'.)
- addEdge?fromNode={fromNode}&toNode={toNode}&predicate={predicate}
  adds a new edge to the graph.
- findNodes?labelStartsWith={labelStartsWith}
  finds nodes by matching the first characters of a node's label.
- findEdges?labelStartsWith={labelStartsWith}
  finds edges by matching the first characters of an edge's label.
- findEdgesFromNode?fromNode={fromNode}
  finds all edges that lead from the given node to another.
- findEdgesToNode?toNode={toNode}
  finds all edges that lead to the given node.
- loadNodes?nodes={nodes}
  loads the node data for all given node IDs.
- loadEdges?edges={edges}
  loads the edge data for all given edge IDs.
- loadClientCode/apps/{app}
  loads static files from the configured "client_code" directory.
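The endpoints above take their arguments as query parameters, so a client mostly needs to build correctly encoded URLs. The following Python sketch shows one way to do that. The base address `http://localhost:8080` and the helper name `endpoint_url` are assumptions made for illustration; this README does not state the host, the port, or the HTTP method the service expects.

```python
from urllib.parse import urlencode

# Assumption: the address the service listens on is not documented here.
BASE_URL = "http://localhost:8080"


def endpoint_url(endpoint, **params):
    """Build the URL for an endpoint such as 'crawl' or 'addNode'.

    Hypothetical helper: parameter values are percent-encoded, which
    matters for 'crawl', whose argument is itself a URL.
    """
    query = urlencode(params)
    if query:
        return f"{BASE_URL}/{endpoint}?{query}"
    return f"{BASE_URL}/{endpoint}"


if __name__ == "__main__":
    print(endpoint_url("addNode", label="Canada", type="Instance"))
    print(endpoint_url("crawl", url="http://example.org/page"))
    print(endpoint_url("findNodes", labelStartsWith="Can"))
```

The resulting strings can be passed to any HTTP client. How several IDs are packed into the `nodes` and `edges` parameters is not specified here, so the sketch leaves those two endpoints out.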
Prerequisites
- Mono/.NET 4.5
Configuration
All configuration is done in the Liv.io.GraphCrawler.ControlService's App.config file.
Parameters are:
- DataDirectory: The path of the directory in which the graph files and downloaded resources should be stored
- EdgesFilename: The name of the file in which to store the edges
- NodesFilename: The name of the file in which to store the nodes
- ResourcesTableFilename: The name of the file in which to store the resource entries (metadata for downloaded sites)
- ResourcesFolder: The name of the folder in which the downloaded resources should be stored (must lie within the DataDirectory)
- ClientCodeFolder: A folder in which static files can be stored. They can be delivered directly by the application.
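To show how the parameters relate to each other, here is a small Python sketch that reads them from an App.config-style document and derives the resources directory. It assumes the parameters are stored as `<add key="..." value="..." />` entries under `<appSettings>`, which is the usual .NET convention but is not confirmed by this README. All values in the sample are hypothetical, chosen to match the example rows in the file format section.

```python
import posixpath
import xml.etree.ElementTree as ET

# Hypothetical App.config content; the <appSettings> layout is an assumption.
SAMPLE_APP_CONFIG = """
<configuration>
  <appSettings>
    <add key="DataDirectory" value="/var/crawler" />
    <add key="EdgesFilename" value="edges.csv" />
    <add key="NodesFilename" value="nodes.csv" />
    <add key="ResourcesTableFilename" value="resources.csv" />
    <add key="ResourcesFolder" value="resources" />
    <add key="ClientCodeFolder" value="client_code" />
  </appSettings>
</configuration>
"""


def read_settings(xml_text):
    """Return the appSettings entries as a plain dict of key -> value."""
    root = ET.fromstring(xml_text.strip())
    return {
        entry.get("key"): entry.get("value")
        for entry in root.findall("./appSettings/add")
    }


def resources_directory(settings):
    """ResourcesFolder must lie within DataDirectory, so join the two."""
    return posixpath.join(settings["DataDirectory"], settings["ResourcesFolder"])


if __name__ == "__main__":
    settings = read_settings(SAMPLE_APP_CONFIG)
    print(resources_directory(settings))
```

With these sample values, downloaded resources would end up below `/var/crawler/resources`.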
File format
The tool uses CSV files for persistence. All columns are separated by pipes '|'.
nodes.csv Id|Label|Type
Example row:
694|Canada|Instance
edges.csv Id|Source|Target|Type|Label|Weight
Example row:
196|85|99|Directed|lies-in|1
resources.csv Uri|Title|FilesystemLocation
Example row:
http://de.wikipedia.org/wiki/Markdow|Markdown|/var/crawler/resources/5b7c8499-6c32-478c-abd6-cf33153d0967
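Because the files are plain pipe-separated text, they are easy to read outside the tool. The Python sketch below parses the example rows for nodes and edges. It assumes that the files contain no header row and no quoted fields, and the helper name `parse_rows` is made up for illustration; neither assumption is confirmed by this README.

```python
import csv
import io

# Column names as listed in the file format description.
NODE_COLUMNS = ["Id", "Label", "Type"]
EDGE_COLUMNS = ["Id", "Source", "Target", "Type", "Label", "Weight"]


def parse_rows(text, columns):
    """Parse pipe-separated text into a list of dicts keyed by column name.

    Assumptions: no header row, no quoting (fields are split on '|' only).
    """
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


if __name__ == "__main__":
    nodes = parse_rows("694|Canada|Instance\n", NODE_COLUMNS)
    edges = parse_rows("196|85|99|Directed|lies-in|1\n", EDGE_COLUMNS)
    print(nodes[0]["Label"], nodes[0]["Type"])
    print(edges[0]["Source"], edges[0]["Label"], edges[0]["Target"])
```

All values come back as strings. The Source and Target columns presumably hold node Ids, so a caller would convert them (and Weight) to numbers as needed.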