Troubleshooting / FAQ

Source System: Microsoft SharePoint Online

Locked Site Collections

In Microsoft SharePoint Online, site collections can be locked. Sites locked with NoAccess will be excluded from any traversal.

Content Traversal

Content Type Hierarchy

The Microsoft SharePoint Online connector iterates over the following hierarchy for every discovered site collection:

Site Collection
- Root Site
  - Lists
    
    Items & Folders
    
    Attachments
  - Document Libraries
    
    Files & Folders
  - Sub Sites
    
    …

Each content traversal considers some advanced source system settings implicitly.

Traversal Flags

If set, the following flags result in the exclusion of documents:

Flag Context Description

Flag	Context	Description
`NoCrawl`	Sites & Lists	The advanced site and list settings provide an option to prevent content from being indexed into search.
`Hidden`	Lists	Microsoft SharePoint Online lists can be configured as hidden.
`IsPrivate`	Lists	Microsoft SharePoint Online lists can be configured as private.
`IsCatalog`	Lists	Internal lists which are used for the content management are marked as catalogs.

NoCrawl

Sites & Lists

The advanced site and list settings provide an option to prevent content from being indexed into search.

Hidden

Lists

Microsoft SharePoint Online lists can be configured as hidden.

IsPrivate

Lists

Microsoft SharePoint Online lists can be configured as private.

IsCatalog

Lists

Internal lists which are used for the content management are marked as catalogs.

List Advanced Settings

Setting: Item-level Permissions

In order to restrict the access to list items Item-level Permissions can be defined. If the property Read items that were created by the user is checked, the traversed list documents can only be read by the author and the site collection admins. Cancel Checkout permissions are not supported by the connector.

List Settings → Advanced Settings → Item-level Permissions

List Versioning Settings

Setting: Item Versioning

Lists can be configured to generate versions in case items are altered. The connector does only traverse the latest version of a given item. If, however, content approval is enabled, only the latest published / approved version of an item will be traversed.

List Settings → Versioning Settings → Content Approval

Principal Traversal

The ACL for documents extracted from Microsoft SharePoint Online contain the following types of principals:

Type Example Description

Type	Example	Description
Site collection group	https://domain.sharepoint.com/sites/sc:<siteGroupId>;	Connector defined reference to a SharePoint Online site collection specific group.
Entra ID group	1b6943f1-5983-45a5-a259-3a553de4c79f	The object ID of an Entra ID group.
Entra ID user	user@domain.onmicrosoft.com	The user principal name of an Entra ID user.
Everyone, except external	spo-grid-all-users	Microsoft SharePoint Online specific role, which includes all active and `Member` typed Entra ID users.
Everyone	all	Role which includes all Entra ID users.

Site collection group

https://domain.sharepoint.com/sites/sc:<siteGroupId>;

Connector defined reference to a SharePoint Online site collection specific group.

Entra ID group

1b6943f1-5983-45a5-a259-3a553de4c79f

The object ID of an Entra ID group.

Entra ID user

user@domain.onmicrosoft.com

The user principal name of an Entra ID user.

Everyone, except external

spo-grid-all-users

Microsoft SharePoint Online specific role, which includes all active and Member typed Entra ID users.

Everyone

all

Role which includes all Entra ID users.

The principal synchronization performs following tasks:

Resolution: Site collection groups (memberships)
Resolution: Entra ID groups (member- & ownerships)
Resolution: Roles (memberships)
Streamlining: Adds additional principals, so the Entra ID synchronization can be shared between connectors of different source systems.

Optimization: Principal Traversal

In case multiple Entra ID based connectors should be run in parallel, the Entra ID synchronization tasks would be run multiple times. This can be prevented by setting up one connector to perform the complete Entra ID synchronization and all other connectors to focus only on the source system specific principals.

If the Microsoft SharePoint Online connector should perform the full Entra ID synchronization, the principal synchronization algorithm Full SharePoint Online and all Azure AD groups and users can be configured. Otherwise, only Microsoft SharePoint Online specific principals can be resolved via the algorithm Only SharePoint Online groups.

Target System: Apache Solr

Schema Handling

Before each content traversal the Apache Solr connector will fetch the schema for the collection to index documents into. Only metadata defined in the schema, will be indexed for each document. The connector considers explicit and dynamic schema fields.

The following field type classes will result in a special preprocessing:

Class Preprocessing

Class	Preprocessing
`solr.DatePointField`	Metadata field values which match a schema field of this type are formatted to ISO-8601 standard with second precision.

solr.DatePointField

Metadata field values which match a schema field of this type are formatted to ISO-8601 standard with second precision.

Schemaless mode is not supported!

In case all extracted metadata should be indexed, the following dynamic field definition can be added to the schema:

<dynamicField name="*"  type="text_general"  indexed="true"  stored="true"/>

This way all not explicitly defined fields will be interpreted as text fields.

Special Metadata

Between connectors a defined set of metadata is streamlined. These standard metadata will take precedence before metadata extracted from the source system. In case a field name extracted from the source system clashes with the name of a standard metadata field, only the standard metadata field is indexed.

Standard Metadata Field	Description
title	The (interpreted) title of the document.
sourceName	The name of the source system.
sourceUrl	The URL pointing to the source system.
sourceType	The source system type.
itemType	The type of the item / document.
clickUrl	The clickable URL referencing the document.
lastModifiedDate	The last modification date.
createdDate	The creation date.
previewUrl	A URL which points to a preview of the document.
mimeType	The mime type of the document.
authorId	The ID of the author.
authorName	The display name of the author.
fileExtension	The file extension.
fileType	The pretty name of the file type.
contributorIds	A list of contributor IDs.
contributorNames	A list of contributor names.
breadcrumbNames	A list of names referencing items in the hierarchy above the current document.
breadcrumbUrls	A list of URLs referencing items in the hierarchy above the current document.
languages	A list of languages.
keywords	A list of keywords.
content	The document’s content as part of the metadata.
docAllowAcl	The allow document ACL.
docDenyAcl	The deny document ACL.

Standard Metadata Field

Description

title

The (interpreted) title of the document.

sourceName

The name of the source system.

sourceUrl

The URL pointing to the source system.

sourceType

The source system type.

itemType

The type of the item / document.

clickUrl

The clickable URL referencing the document.

lastModifiedDate

The last modification date.

createdDate

The creation date.

previewUrl

A URL which points to a preview of the document.

mimeType

The mime type of the document.

authorId

The ID of the author.

authorName

The display name of the author.

fileExtension

The file extension.

fileType

The pretty name of the file type.

contributorIds

A list of contributor IDs.

contributorNames

A list of contributor names.

breadcrumbNames

A list of names referencing items in the hierarchy above the current document.

breadcrumbUrls

A list of URLs referencing items in the hierarchy above the current document.

languages

A list of languages.

keywords

A list of keywords.

content

The document’s content as part of the metadata.

docAllowAcl

The allow document ACL.

docDenyAcl

The deny document ACL.

Scheduling Commits

Commits to the index will not be triggered by the connector. In order to make data searchable, either commit manually or schedule automatic commits. For further information see the Official Apache Solr Documentation.

Unable to access jarfile error (installation on Windows)

This error occurs when the installation path exceeds the maxium Windows path length of 260 characters. Ensure that the full path to bin\connector.bat does not exceed 260 characters.